research-article

BLAST2: An Efficient Technique for Loose Schema Information Extraction from Heterogeneous Big Data Sources

Published: 10 November 2020

Abstract

We present BLAST2, a novel technique for efficiently extracting loose schema information, i.e., metadata that can serve as a surrogate for the schema alignment task within the Entity Resolution (ER) process. ER identifies records that refer to the same real-world entity when integrating multiple, heterogeneous, and voluminous data sources. The loose schema information, which can be extracted from both structured and unstructured data sources, is exploited to reduce the overall complexity of ER, whose naïve solution would require O(n^2) comparisons, where n is the number of entity representations involved in the process. BLAST2 is completely unsupervised, yet it achieves almost the same precision and recall as supervised state-of-the-art schema alignment techniques when employed for ER tasks, as shown in our experimental evaluation on two real-world datasets (composed of 7 and 10 data sources, respectively).
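To make the quadratic cost concrete, the following minimal Python sketch contrasts the naïve all-pairs comparison count with the candidate pairs produced by a generic token-blocking step. It is an illustration only and not the BLAST2 algorithm, which the abstract does not detail; the record identifiers, the sample values, and the function names `naive_pairs` and `token_blocking_pairs` are assumptions introduced for this example.

```python
from collections import defaultdict
from itertools import combinations


def naive_pairs(records):
    """Return every unordered pair of record ids: n * (n - 1) / 2 comparisons."""
    return set(combinations(sorted(records), 2))


def token_blocking_pairs(records):
    """Return only pairs of records that share at least one token.

    Generic schema-agnostic token blocking, used here purely for illustration.
    """
    blocks = defaultdict(set)
    for rid, text in records.items():
        for token in text.lower().split():
            blocks[token].add(rid)

    pairs = set()
    for ids in blocks.values():
        # A block with a single record yields no pairs.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy records; values are made up for the example.
records = {
    "a1": "Apple iPhone 11 64GB",
    "a2": "iphone 11 apple black",
    "b1": "Samsung Galaxy S10",
    "b2": "galaxy s10 samsung 128GB",
}

print(len(naive_pairs(records)))           # all pairs among 4 records
print(sorted(token_blocking_pairs(records)))  # pairs sharing a token
```

On this toy input, blocking keeps only the pairs that share a token, so far fewer comparisons are needed than with the all-pairs approach; on real data the reduction depends on how discriminative the blocking keys are.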

      • Published in

        Journal of Data and Information Quality  Volume 12, Issue 4
        Special Issue on Metadata Discovery for Assessing Data Quality
        December 2020
        118 pages
        ISSN:1936-1955
        EISSN:1936-1963
        DOI:10.1145/3430382
        Copyright © 2020 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 10 November 2020
        • Online AM: 7 May 2020
        • Accepted: 1 April 2020
        • Revised: 1 March 2020
        • Received: 1 October 2019
        Qualifiers

        • research-article
        • Research
        • Refereed
