Learning Top-k Transformation Rules

Patro, Sunanda; Wang, Wei

doi:10.1007/978-3-642-23088-2_12

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6860)

Abstract

Record linkage identifies multiple records that refer to the same entity even when they are not bit-wise identical. It is thus an essential technology for data integration and data cleansing. Existing record linkage approaches rely mainly on similarity functions over the surface forms of the records, and hence cannot identify complex coreferent records. This seriously limits their effectiveness.

In this work, we propose an automatic method to extract the top-k high-quality transformation rules from a given set of possibly coreferent record pairs. We propose an effective algorithm that performs careful local analyses of each record pair and generates candidate rules; the algorithm then chooses the top-k rules based on a scoring function. We have conducted extensive experiments on real datasets, and our proposed algorithm has a substantial advantage over the previous algorithm in both effectiveness and efficiency.
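The two-stage shape described above (generate candidate rules per record pair, then keep the k best under a scoring function) can be illustrated with a minimal sketch. This is not the authors' algorithm: the whitespace tokenization, the "unmatched tokens" heuristic for candidate generation, the frequency-based score, and the names `candidate_rule`, `top_k_rules`, and `example_pairs` are all assumptions made for illustration only.

```python
from collections import Counter


def candidate_rule(left, right):
    """Derive one candidate rule (lhs -> rhs) from a record pair.

    Assumption: tokens shared by both records are treated as already
    matched, and the leftover tokens on each side form the rule.
    Returns None when either side has no leftover tokens.
    """
    left_tokens = left.lower().split()
    right_tokens = right.lower().split()
    left_set, right_set = set(left_tokens), set(right_tokens)
    lhs = tuple(t for t in left_tokens if t not in right_set)
    rhs = tuple(t for t in right_tokens if t not in left_set)
    if not lhs or not rhs:
        return None
    return lhs, rhs


def top_k_rules(pairs, k):
    """Score candidate rules by how many pairs produce them; keep the k best.

    Assumption: plain frequency stands in for the scoring function,
    which the abstract does not specify.
    """
    counts = Counter()
    for left, right in pairs:
        rule = candidate_rule(left, right)
        if rule is not None:
            counts[rule] += 1
    return counts.most_common(k)


# Hypothetical record pairs, invented for this example.
example_pairs = [
    ("proc of vldb 2009", "proceedings of vldb 2009"),
    ("proc of sigmod 2008", "proceedings of sigmod 2008"),
    ("intl conf on data", "international conf on data"),
    ("j smith", "john smith"),
]

for (lhs, rhs), score in top_k_rules(example_pairs, 2):
    print(" ".join(lhs), "->", " ".join(rhs), "| score", score)
```

In this toy input the rule `proc -> proceedings` is produced by two pairs and therefore ranks first. A real scoring function would likely also need to penalize rules that arise from pairs that are not truly coreferent, since the input pairs are only possibly coreferent.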

This work was partially supported by ARC Discovery Projects DP0987273 and DP0881779.

Author information

Authors and Affiliations

University of New South Wales, Australia
Sunanda Patro & Wei Wang

Editor information

Editors and Affiliations

Institut de Recherche en Informatique de Toulouse (IRIT), Paul Sabatier University, 118, route de Narbonne, 31062, Toulouse Cedex, France
Abdelkader Hameurlain
Brigham Young University, 784 TNRB, 84602, Provo, UT, USA
Stephen W. Liddle
Software Competence Center Hagenberg and Johannes Kepler University Linz, Softwarepark 21, 4232, Hagenberg, Austria
Klaus-Dieter Schewe
School of Information Technology and Electrical Engineering, University of Queensland, 4072, Brisbane, QLD, Australia
Xiaofang Zhou

About this paper

Cite this paper

Patro, S., Wang, W. (2011). Learning Top-k Transformation Rules. In: Hameurlain, A., Liddle, S.W., Schewe, K.-D., Zhou, X. (eds) Database and Expert Systems Applications. DEXA 2011. Lecture Notes in Computer Science, vol 6860. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23088-2_12

DOI: https://doi.org/10.1007/978-3-642-23088-2_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23087-5
Online ISBN: 978-3-642-23088-2
eBook Packages: Computer Science, Computer Science (R0)