DOI: 10.1145/3183713.3193569
research-article

Deeper: A Data Enrichment System Powered by Deep Web

Published: 27 May 2018

Abstract

Data scientists often spend more than 80% of their time on data preparation. Data enrichment, the act of extending a local database with new attributes from external data sources, is among the most time-consuming tasks. Existing data enrichment approaches are resource-intensive: data-intensive when they rely on web tables or knowledge bases, monetarily intensive when they purchase entire datasets, or time-intensive when they fully crawl a web-based data source. In this work, we explore a more targeted alternative that uses resources (in terms of web API calls) proportional to the size of the local database of interest. We build Deeper, a data enrichment system powered by the deep web. The goal of Deeper is to help data scientists link a local database to a hidden database so that they can easily enrich the local database with attributes from the hidden database. We find that the challenging problem is how to crawl the hidden database. This differs from the typical deep web crawling problem, whose goal is to crawl the entire hidden database rather than only the content relevant to the data enrichment task. We demonstrate the limitations of straightforward solutions and propose an effective new crawling strategy. We also present the Deeper system architecture and discuss how to implement each component. During the demo, we will use Deeper to enrich a publication database and aim to show that (1) Deeper is an end-to-end data enrichment solution, and (2) the proposed crawling strategy is superior to the straightforward ones.
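
To make the task setup concrete, the following is a minimal sketch of a naive enrichment loop: one keyword query per local record, followed by a string-similarity match. It is an illustration only, not the crawling strategy proposed in the paper, which the abstract does not specify. The in-memory HIDDEN_DB, the conjunctive search function, the enrich function, and the 0.9 threshold are all hypothetical stand-ins for a real keyword-search web API and a real entity-resolution step.

```python
from difflib import SequenceMatcher

# Hypothetical stand-in for a hidden database reachable only via keyword search.
HIDDEN_DB = [
    {"title": "Data integration for the relational web", "venue": "PVLDB", "year": 2009},
    {"title": "Query-based data pricing", "venue": "JACM", "year": 2015},
    {"title": "Mining frequent patterns without candidate generation", "venue": "SIGMOD", "year": 2000},
]


def search(keywords, top_k=2):
    """Simulated top-k search: return records whose title contains every keyword."""
    terms = keywords.lower().split()
    hits = [r for r in HIDDEN_DB if all(t in r["title"].lower().split() for t in terms)]
    return hits[:top_k]


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def enrich(local_records, threshold=0.9):
    """Issue one query per local record and copy attributes from the best match.

    Returns the enriched records and the number of API calls made.
    """
    calls = 0
    enriched = []
    for rec in local_records:
        results = search(rec["title"])
        calls += 1  # one simulated API call per local record
        best = max(results, key=lambda r: similarity(rec["title"], r["title"]), default=None)
        row = dict(rec)
        if best is not None and similarity(rec["title"], best["title"]) >= threshold:
            row["venue"] = best["venue"]
            row["year"] = best["year"]
        enriched.append(row)
    return enriched, calls


local = [
    {"title": "Query-based Data Pricing"},
    {"title": "data integration for the relational web"},
    {"title": "An unrelated paper"},
]
enriched, calls = enrich(local)
for row in enriched:
    print(row)
print("API calls:", calls)
```

In this sketch the number of calls equals the number of local records, which matches the abstract's goal of cost proportional to the local database rather than to the hidden one. A record with no sufficiently similar match is left without the new attributes.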



Published In

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
May 2018
1874 pages
ISBN:9781450347037
DOI:10.1145/3183713


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. data enrichment
  2. deep web
  3. entity resolution

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '18

Acceptance Rates

SIGMOD '18 paper acceptance rate: 90 of 461 submissions (20%)
Overall acceptance rate: 785 of 4,003 submissions (20%)


Cited By

  • (2024) FeatAug: Automatic Feature Augmentation From One-to-Many Relationship Tables. 2024 IEEE 40th International Conference on Data Engineering (ICDE), pp. 1805-1818. DOI: 10.1109/ICDE60146.2024.00146. Online publication date: 13-May-2024.
  • (2023) Spatio-historical data enrichment for toponomastics in Bali, The Island of Gods. GeoJournal 88:5, pp. 5489-5510. DOI: 10.1007/s10708-023-10932-4. Online publication date: 18-Aug-2023.
  • (2020) ActiveDeeper. Proceedings of the VLDB Endowment 13:12, pp. 2885-2888. DOI: 10.14778/3415478.3415500. Online publication date: 14-Sep-2020.
  • (2020) Data Preparation. ACM SIGMOD Record 49:3, pp. 18-29. DOI: 10.1145/3444831.3444835. Online publication date: 17-Dec-2020.
  • (2019) Progressive Deep Web Crawling Through Keyword Queries For Data Enrichment. Proceedings of the 2019 International Conference on Management of Data, pp. 229-246. DOI: 10.1145/3299869.3319899. Online publication date: 25-Jun-2019.
  • (2018) Business Data Enrichment: Issues and Challenges. 2018 5th Asia-Pacific World Congress on Computer Science and Engineering (APWC on CSE), pp. 98-102. DOI: 10.1109/APWConCSE.2018.00024. Online publication date: Dec-2018.
