Research article, DOI: 10.1145/3416028.3416033

Entity Matching from Unstructured and Dissimilar Data Collections: Semantic and Content Distribution Approach

Published: 21 September 2020

Abstract

This paper addresses the problem of extracting data features from a collection of dissimilar, unstructured data sets gathered from multiple data sources on the web or in databases. We present a method of feature extraction and normalization that aims to close the gap between a large collection of unstructured, un-normalized, unworkable data and a workable data set of uniform content. The feature extraction we model produces focused, structured data sets as output, designed from a Big Data and analytics perspective. The solution automates data ingestion from public data sources and applies machine learning methodology to build data relationships across unstructured data sets. Our research aims to extract key features using a semi-supervised process, semantic relations, and statistical analysis of the distribution of content. The mapping across dissimilar data sets is cast as a matching problem over these metrics, which constructs a scoring value that maps different entities to one another. We propose a three-layer matching process, using pattern recognition, for homogeneous covariates from different sources whose semantics and measures are nonstandard. This work presents a novel way to tackle the entity resolution problem. The results show that the method works well on real industrial data and provides immediate ROI value for the data management system.
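
As a rough illustration of how a scoring value might combine a name-level signal with a content-distribution signal, the Python sketch below matches columns across two sources. It is not the paper's method: the string similarity from `difflib` (a stand-in for a real semantic measure), the two-sample Kolmogorov-Smirnov statistic, the equal weighting, and the column names and values are all assumptions chosen for illustration. The abstract does not specify the actual metrics or the three-layer process.

```python
from bisect import bisect_right
from difflib import SequenceMatcher
from typing import Dict, Sequence


def name_similarity(a: str, b: str) -> float:
    """String similarity of two column names in [0, 1] (assumed stand-in for a semantic measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def ks_distance(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    gap = 0.0
    for v in xs + ys:
        cdf_x = bisect_right(xs, v) / len(xs)
        cdf_y = bisect_right(ys, v) / len(ys)
        gap = max(gap, abs(cdf_x - cdf_y))
    return gap


def match_score(name_a, values_a, name_b, values_b, w_name=0.5) -> float:
    """Weighted blend of name similarity and distribution similarity (weight is an assumption)."""
    dist_sim = 1.0 - ks_distance(values_a, values_b)
    return w_name * name_similarity(name_a, name_b) + (1.0 - w_name) * dist_sim


def best_matches(source_a: Dict[str, Sequence[float]],
                 source_b: Dict[str, Sequence[float]]) -> Dict[str, str]:
    """For each column in source_a, pick the highest-scoring column in source_b."""
    return {
        name_a: max(
            source_b,
            key=lambda name_b: match_score(name_a, vals_a, name_b, source_b[name_b]),
        )
        for name_a, vals_a in source_a.items()
    }


# Hypothetical columns from two differently named sources (values are made up).
source_a = {"well_depth_ft": [5000, 7200, 8100, 9500], "oil_rate": [120, 80, 300, 45]}
source_b = {"TotalDepth": [5100, 7000, 8300, 9400], "OilProdRate": [110, 90, 280, 50]}

matches = best_matches(source_a, source_b)
print(matches)
```

Blending the two signals lets a column with an unfamiliar name still match when its values are distributed like those of its counterpart, which is the intuition the abstract points to with "statistical analysis of the distribution of content".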

Cited By

  • (2023) Cross Modal Data Discovery over Structured and Unstructured Data Lakes. Proceedings of the VLDB Endowment 16:11 (3377-3390). DOI: 10.14778/3611479.3611533. Online publication date: 1-Jul-2023

      Published In

      IMMS '20: Proceedings of the 3rd International Conference on Information Management and Management Science
      August 2020
      120 pages
      ISBN:9781450375467
      DOI:10.1145/3416028

      In-Cooperation

      • Southwest Jiaotong University

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. Feature engineering
      2. entity matching
      3. ontology semantic relation

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      IMMS 2020
