Abstract
We present BLAST2, a novel technique to efficiently extract loose schema information, i.e., metadata that can serve as a surrogate for the schema alignment task within the Entity Resolution (ER) process. ER identifies records that refer to the same real-world entity when integrating multiple, heterogeneous, and voluminous data sources. The loose schema information can be extracted from both structured and unstructured data sources, and it is exploited to reduce the overall complexity of ER, whose naïve solution would require O(n²) comparisons, where n is the number of entity representations involved in the process. BLAST2 is completely unsupervised, yet it achieves almost the same precision and recall as supervised state-of-the-art schema alignment techniques when employed for ER tasks, as shown in our experimental evaluation on two real-world datasets (composed of 7 and 10 data sources, respectively).
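To make the complexity argument concrete, the following is a minimal sketch of how loose schema information might cut down the number of comparisons. It is not the BLAST2 algorithm: the function names, the toy records, and the hand-written `attribute_cluster` mapping are all assumptions made for illustration, and a real technique would extract the attribute clusters automatically. The sketch contrasts the naïve all-pairs strategy with token blocking restricted to clusters of similar attributes.

```python
from collections import defaultdict
from itertools import combinations


def naive_pairs(records):
    """All unordered pairs of record ids: n * (n - 1) / 2 comparisons."""
    return set(combinations(sorted(records), 2))


def blocked_pairs(records, attribute_cluster):
    """Candidate pairs limited to records sharing a (cluster, token) key.

    attribute_cluster maps an attribute name to a cluster id and stands in
    for the loose schema information (hypothetical, written by hand here).
    Attributes missing from the mapping fall into a single "other" cluster.
    """
    blocks = defaultdict(set)
    for rid, attrs in records.items():
        for attr, value in attrs.items():
            cluster = attribute_cluster.get(attr, "other")
            for token in value.lower().split():
                blocks[(cluster, token)].add(rid)
    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Toy records from two sources with different attribute names (invented data).
records = {
    "a1": {"name": "John Smith", "city": "Paris"},
    "a2": {"full_name": "J. Smith", "location": "Paris"},
    "b1": {"name": "Paris Hilton", "city": "London"},
    "b2": {"full_name": "Mary Jones", "location": "Rome"},
}
attribute_cluster = {
    "name": "person", "full_name": "person",
    "city": "place", "location": "place",
}

print(len(naive_pairs(records)))                   # all pairs
print(len(blocked_pairs(records, {})))             # schema-agnostic blocking
print(blocked_pairs(records, attribute_cluster))   # cluster-aware blocking
```

In this toy setting, the cluster-aware variant avoids comparing a person named "Paris" with records located in Paris, which the schema-agnostic variant cannot distinguish.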
Index Terms
- BLAST2: An Efficient Technique for Loose Schema Information Extraction from Heterogeneous Big Data Sources