DOI: 10.1145/2452376.2452440
research-article

HIL: a high-level scripting language for entity integration

Published: 18 March 2013

Abstract

We introduce HIL, a high-level scripting language for entity resolution and integration. HIL aims at providing the core logic for complex data processing flows that aggregate facts from large collections of structured or unstructured data into clean, unified entities. Such flows typically include many stages of processing that start from the outcome of information extraction and continue with entity resolution, mapping and fusion. A HIL program captures the overall integration flow through a combination of SQL-like rules that link, map, fuse and aggregate entities. A salient feature of HIL is the use of logical indexes in its data model to facilitate the modular construction and aggregation of complex entities. Another feature is the presence of a flexible, open type system that allows HIL to handle input data that is irregular, sparse or partially known.
As a result, HIL can accurately express complex integration tasks, while still being high-level and focused on the logical entities (rather than the physical operations). Compilation algorithms translate the HIL specification into efficient run-time queries that can execute in parallel on Hadoop. We show how our framework is applied to real-world integration of entities in the financial domain, based on public filings archived by the U.S. Securities and Exchange Commission (SEC). Furthermore, we apply HIL on a larger-scale scenario that performs fusion of data from hundreds of millions of Twitter messages into tens of millions of structured entities.
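
To make the notion of a logical index more concrete, the following is a minimal Python sketch, not HIL syntax. It treats an index as a map from an entity key to a set of values that several independent rules can populate, and then fuses the results into one entity per key. The record layout, the field names (`name`, `title`, `employer`), and the helper names (`LogicalIndex`, `fuse_people`) are illustrative assumptions and do not come from the paper.

```python
from collections import defaultdict


class LogicalIndex:
    """Hypothetical stand-in for a logical index: key -> set of values."""

    def __init__(self):
        self._entries = defaultdict(set)

    def add(self, key, value):
        # Several rules may add values under the same key; duplicates collapse.
        self._entries[key].add(value)

    def lookup(self, key):
        # Sorted for deterministic output; an unknown key yields an empty list.
        return sorted(self._entries.get(key, ()))


def fuse_people(filings):
    """Aggregate sparse, irregular records into one entity per person name."""
    titles = LogicalIndex()
    employers = LogicalIndex()
    names = set()
    for rec in filings:
        name = rec.get("name")
        if name is None:
            continue  # a record without a key cannot be linked to an entity
        names.add(name)
        # Each "rule" contributes to its index only when the field is present,
        # which is how this sketch tolerates partially known input.
        if "title" in rec:
            titles.add(name, rec["title"])
        if "employer" in rec:
            employers.add(name, rec["employer"])
    return [
        {"name": n, "titles": titles.lookup(n), "employers": employers.lookup(n)}
        for n in sorted(names)
    ]


# Made-up sample records, including one with no key and two with missing fields.
filings = [
    {"name": "A. Smith", "title": "CEO", "employer": "Acme"},
    {"name": "A. Smith", "title": "Director"},
    {"name": "B. Jones", "employer": "Globex"},
    {"title": "CFO"},
]
entities = fuse_people(filings)
```

The point of the sketch is the separation of concerns: each index is filled independently, and the final entity is assembled by looking up the same key in every index. How HIL itself declares, populates, and compiles such indexes is described in the paper and is not reproduced here.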

Cited By

  • (2021) The Four Generations of Entity Resolution. Synthesis Lectures on Data Management 16:2 (1-170). DOI: 10.2200/S01067ED1V01Y202012DTM064. Online publication date: 15-Mar-2021
  • (2020) An Overview of End-to-End Entity Resolution for Big Data. ACM Computing Surveys 53:6 (1-42). DOI: 10.1145/3418896. Online publication date: 6-Dec-2020
  • (2019) Structural summaries for visual provenance analysis. Proceedings of the 11th USENIX Conference on Theory and Practice of Provenance (2-2). DOI: 10.5555/3359032.3359035. Online publication date: 3-Jun-2019

      Published In

      EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology
      March 2013
      793 pages
      ISBN:9781450315975
      DOI:10.1145/2452376
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Qualifiers

      • Research-article

      Conference

      EDBT/ICDT '13

      Acceptance Rates

      Overall Acceptance Rate 7 of 10 submissions, 70%
