A Grid Based System for Closure Computation and Online Service

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6082)

Abstract

Record linkage is the task of finding records that identify the same real-world entity, such as an individual or a business, in a given file or set of files; it has many applications. The problem is also referred to as entity resolution or record recognition. To locate the records that identify the same real-world entity, pairwise record analyses must in principle be performed among all records. Each analysis is complex and slow, and the number of analyses is quadratic in the number of records, so the overall process is very time consuming. To reduce the number of pairwise comparisons, blocking techniques partition the records into blocks, and the records in each block are analyzed only against one another. One effective blocking method is the closure approach. In this paper, we describe the design and implementation of a parallel and distributed closure prototype system running in an enterprise grid. The system can either produce all closures of a file in batch fashion or run as a service that, upon receiving a record, returns the closure of that record. Preliminary experiments indicate that the approach is efficient and scalable.
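The abstract does not spell out how a closure is computed, so the following is only a minimal single-machine sketch of the idea, not the paper's parallel grid implementation. It assumes (this is an assumption of the sketch) that two records are related when they share a key value, and that a closure is a group of records connected through such shared keys. The names `closures`, `closure_of`, `pair_count`, and the sample records are all hypothetical.

```python
from collections import defaultdict


def closures(records):
    """Group record ids into closures using union-find.

    `records` maps a record id to a set of key values. Two records are
    treated as related when they share a key value (an assumption of this
    sketch); a closure is a connected component of that relation.
    """
    parent = {rid: rid for rid in records}

    def find(x):
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # key value -> first record id that carried it
    for rid, keys in records.items():
        for key in keys:
            if key in first_seen:
                union(first_seen[key], rid)
            else:
                first_seen[key] = rid

    groups = defaultdict(set)
    for rid in records:
        groups[find(rid)].add(rid)
    return list(groups.values())


def closure_of(record_id, records):
    """Service-style lookup: return the closure containing one record."""
    for group in closures(records):
        if record_id in group:
            return group
    raise KeyError(record_id)


def pair_count(n):
    """Number of unordered pairs among n records: n(n-1)/2."""
    return n * (n - 1) // 2


# Hypothetical sample data: each record carries a few key values.
records = {
    "r1": {"phone:555-0100", "name:j smith"},
    "r2": {"phone:555-0100", "addr:1 main st"},
    "r3": {"addr:1 main st"},
    "r4": {"name:a jones"},
    "r5": {"name:a jones", "phone:555-0199"},
}

blocks = closures(records)
all_pairs = pair_count(len(records))
blocked_pairs = sum(pair_count(len(b)) for b in blocks)
```

With these five sample records, comparing every record against every other takes 10 pairwise analyses, while comparing only within the two closures takes 4. The batch mode described in the abstract corresponds loosely to `closures`, and the service mode to `closure_of`; a real service would presumably index the closures once rather than recompute them per request.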

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Li, WN., Hayes, D., Baran, J., Porter, C., Schweiger, T. (2010). A Grid Based System for Closure Computation and Online Service. In: Hsu, CH., Yang, L.T., Park, J.H., Yeo, SS. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2010. Lecture Notes in Computer Science, vol 6082. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13136-3_8

  • DOI: https://doi.org/10.1007/978-3-642-13136-3_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-13135-6

  • Online ISBN: 978-3-642-13136-3

  • eBook Packages: Computer Science, Computer Science (R0)