DOI:10.1145/3132847.3132882
research-article
Public Access

Spreadsheet Property Detection With Rule-assisted Active Learning

Published: 06 November 2017

Abstract

Spreadsheets are a critical and widely used data management tool. Converting spreadsheet data into relational tables would bring benefits to a number of fields, including public policy, public health, and economics. Research to date has focused on designing domain-specific languages to describe transformation processes or automatically converting a specific type of spreadsheet. To handle a larger variety of spreadsheets, we have to identify various spreadsheet properties, which correspond to a series of transformation programs that contribute towards a general framework that converts spreadsheets to relational tables.
In this paper, we focus on the problem of spreadsheet property detection. We propose a hybrid approach of building a variety of spreadsheet property detectors to reduce the amount of required human labeling effort. Our approach integrates an active learning framework with crude, easy-to-write, user-provided rules to save human labeling effort by generating additional high-quality labeled data, especially in the initial training stage. Using a bagging-like technique, our approach can also tolerate lower-quality user-provided rules. Our experiments show that when compared to a standard active learning approach, we reduced the training data needed to reach the performance plateau by 34-44% when a human provides relatively high-quality rules, and by a comparable amount with low-quality rules. A study on a large-scale web-crawled spreadsheet dataset demonstrates that it is crucial to detect a variety of spreadsheet properties in order to transform a large portion of the spreadsheets into a relational form.
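
The general idea of combining crude rules with active learning can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not give the algorithm's details, so everything below is an assumption made for illustration. The rule `crude_rule`, the 1-D threshold classifier, the use of vote disagreement as the uncertainty measure, and all function names are hypothetical stand-ins. The sketch only shows the overall shape: rules pre-label the easy items, an ensemble trained on bootstrap samples of those rule labels dampens the effect of bad rule labels, and the human oracle is asked only about the items the ensemble disagrees on most.

```python
import random
from statistics import mean


def crude_rule(x):
    """Hypothetical user rule: label only the obvious cases, abstain otherwise."""
    if x > 0.8:
        return 1
    if x < 0.2:
        return 0
    return None  # rule abstains


def train_threshold(samples):
    """Toy 1-D classifier: threshold at the midpoint between the class means."""
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    if not pos or not neg:
        return 0.5  # fallback when a sample happens to miss one class
    return (mean(pos) + mean(neg)) / 2


def bagged_thresholds(human, ruled, n_models, rng):
    """Train each model on all human labels plus a bootstrap sample of rule labels."""
    models = []
    for _ in range(n_models):
        boot = [rng.choice(ruled) for _ in ruled]
        models.append(train_threshold(human + boot))
    return models


def vote_fraction(models, x):
    """Fraction of ensemble members that predict the positive class for x."""
    return sum(x > t for t in models) / len(models)


def rule_assisted_active_learning(pool, oracle, rounds=5, n_models=7, seed=0):
    """Return (predict function, number of human labels requested)."""
    rng = random.Random(seed)
    ruled = [(x, crude_rule(x)) for x in pool if crude_rule(x) is not None]
    unlabeled = [x for x in pool if crude_rule(x) is None]
    human = []
    for _ in range(rounds):
        if not unlabeled:
            break
        models = bagged_thresholds(human, ruled, n_models, rng)
        # Uncertainty sampling: query the item whose vote is closest to a tie.
        query = min(unlabeled, key=lambda x: abs(vote_fraction(models, x) - 0.5))
        unlabeled.remove(query)
        human.append((query, oracle(query)))
    models = bagged_thresholds(human, ruled, n_models, rng)
    return (lambda x: int(vote_fraction(models, x) >= 0.5)), len(human)


# Usage: items are numbers in [0, 1]; the "human" labels an item 1 when it exceeds 0.5.
pool = [i / 20 for i in range(21)]
predict, n_queries = rule_assisted_active_learning(pool, lambda x: int(x > 0.5))
print(n_queries, predict(0.95), predict(0.05))
```

In this toy setting the rule labels seed the ensemble before any human label exists, which is the stage where the abstract says rule-generated labels help most.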



Information

Published In

CIKM '17: Proceedings of the 2017 ACM on Conference on Information and Knowledge Management
November 2017
2604 pages
ISBN:9781450349185
DOI:10.1145/3132847
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. active learning
  2. data cleaning
  3. spreadsheets

Qualifiers

  • Research-article

Conference

CIKM '17

Acceptance Rates

CIKM '17 Paper Acceptance Rate 171 of 855 submissions, 20%;
Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)96
  • Downloads (Last 6 weeks)13
Reflects downloads up to 07 Mar 2025

Citations

Cited By

  • (2024) Classification of Table Cells Based on LLM Prompts. 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pages 2140-2145. DOI:10.1109/SMC54092.2024.10831285. Online publication date: 6-Oct-2024
  • (2023) HUSS: A Heuristic Method for Understanding the Semantic Structure of Spreadsheets. Data Intelligence 5:3, pages 537-559. DOI:10.1162/dint_a_00201. Online publication date: 1-Aug-2023
  • (2022) Detecting layout templates in complex multiregion files. Proceedings of the VLDB Endowment 15:3, pages 646-658. DOI:10.14778/3494124.3494145. Online publication date: 4-Feb-2022
  • (2022) HUSS: A Heuristic Method for Understanding the Semantic Structure of Spreadsheets. 2022 IEEE International Conference on Knowledge Graph (ICKG), pages 329-336. DOI:10.1109/ICKG55886.2022.00049. Online publication date: Nov-2022
  • (2022) CFCT: The cell function classification method for complex tables. 2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys), pages 2206-2213. DOI:10.1109/HPCC-DSS-SmartCity-DependSys57074.2022.00326. Online publication date: Dec-2022
  • (2021) NOAH. Proceedings of the VLDB Endowment 14:6, pages 970-983. DOI:10.14778/3447689.3447701. Online publication date: 12-Apr-2021
  • (2021) Scalable Tabular Metadata Location and Classification in Large-Scale Structured Datasets. Database and Expert Systems Applications, pages 35-50. DOI:10.1007/978-3-030-86472-9_4. Online publication date: 31-Aug-2021
  • (2021) Semi-automatic Column Type Inference for CSV Table Understanding. SOFSEM 2021: Theory and Practice of Computer Science, pages 535-549. DOI:10.1007/978-3-030-67731-2_39. Online publication date: 11-Jan-2021
  • (2021) Table understanding approaches for extracting knowledge from heterogeneous tables. WIREs Data Mining and Knowledge Discovery 11:4. DOI:10.1002/widm.1407. Online publication date: 28-Mar-2021
  • (2020) Learning cell embeddings for understanding table layouts. Knowledge and Information Systems. DOI:10.1007/s10115-020-01508-6. Online publication date: 7-Sep-2020
