Abstract
Record linkage is the task of finding records that refer to the same real-world entity, such as an individual or a business, within a given file or set of files, and it has many applications. The problem is also known as entity resolution or record recognition. In principle, locating the records that identify the same real-world entity requires a pairwise analysis of all records. Each such analysis is complex and slow, and the number of analyses is quadratic in the number of records, so the overall process is very time consuming. To reduce the number of pairwise comparisons, blocking techniques partition the records into blocks, and only records within the same block are compared against one another. One effective blocking method is the closure approach. In this paper, we describe the design and implementation of a parallel and distributed closure prototype system running in an enterprise grid. The system can either produce all closures of a file in batch mode or run as a service that, upon receiving a record, returns the closure of that record. Preliminary experiments indicate that the approach is efficient and scalable.
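To make the closure idea concrete, here is a minimal single-machine sketch in Python. It is not the paper's parallel, grid-based system, and the abstract does not say how two records are judged to be related. The sketch assumes that records are related when they share an identical value in some key field, and it uses a union-find structure to group records into closures (connected components of that relation). The names `closures`, `closure_of`, `pair_count`, and the sample fields `phone` and `email` are hypothetical.

```python
from collections import defaultdict


def closures(records, key_fields):
    """Group record ids into closures.

    Assumption: two records are directly related when they share the same
    non-empty value in any of `key_fields`; a closure is a connected
    component of that relation.
    """
    parent = {rid: rid for rid in records}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    first_seen = {}  # (field, value) -> first record id carrying that value
    for rid, rec in records.items():
        for field in key_fields:
            value = rec.get(field)
            if value is None:
                continue
            key = (field, value)
            if key in first_seen:
                union(first_seen[key], rid)
            else:
                first_seen[key] = rid

    groups = defaultdict(set)
    for rid in records:
        groups[find(rid)].add(rid)
    return list(groups.values())


def closure_of(record_id, groups):
    """Service-style lookup: return the closure containing `record_id`."""
    for group in groups:
        if record_id in group:
            return group
    return set()


def pair_count(n):
    """Number of pairwise comparisons among n records: n(n-1)/2."""
    return n * (n - 1) // 2


# Hypothetical sample data, invented for illustration only.
records = {
    1: {"phone": "555-0101", "email": None},
    2: {"phone": "555-0101", "email": "as@example.com"},
    3: {"phone": None, "email": "as@example.com"},
    4: {"phone": "555-0199", "email": None},
    5: {"phone": "555-0199", "email": None},
}

groups = closures(records, ["phone", "email"])
all_pairs = pair_count(len(records))
blocked_pairs = sum(pair_count(len(g)) for g in groups)
```

On this sample, records 1, 2, and 3 fall into one closure and records 4 and 5 into another. Comparing only within closures needs 4 pairwise analyses instead of the 10 needed to compare all five records, which illustrates why blocking reduces the quadratic cost.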
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, WN., Hayes, D., Baran, J., Porter, C., Schweiger, T. (2010). A Grid Based System for Closure Computation and Online Service. In: Hsu, CH., Yang, L.T., Park, J.H., Yeo, SS. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2010. Lecture Notes in Computer Science, vol 6082. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-13136-3_8
DOI: https://doi.org/10.1007/978-3-642-13136-3_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-13135-6
Online ISBN: 978-3-642-13136-3
eBook Packages: Computer Science, Computer Science (R0)