A Hybrid Machine-Crowdsourcing Approach for Web Table Matching and Cleaning

Li, Chunhua; Zhao, Pengpeng; Sheng, Victor S.; Li, Zhixu; Liu, Guanfeng; Wu, Jian; Cui, Zhiming

doi:10.1007/978-3-319-39958-4_11

Chunhua Li¹⁸,
Pengpeng Zhao¹⁸,
Victor S. Sheng¹⁹,
Zhixu Li¹⁸,
Guanfeng Liu¹⁸,
Jian Wu¹⁸ &
…
Zhiming Cui¹⁸

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 9659))

Included in the following conference series:

International Conference on Web-Age Information Management

1166 Accesses

Abstract

Table matching and data cleaning are two crucial activities in integrating data from different web tables, which have traditionally been considered as separate activities. We show that data cleaning can effectively help us discover table matches, and vice versa. In this paper, we study a hybrid machine-crowdsourcing approach to handle the two activities together with a well-developed knowledge base. Understanding the semantics of tables is fundamental to both matching and cleaning. We select the most valuable columns to crowdsourcing validation and infer others by consolidating crowdsourcing results and machine-generated results. When resolving inconsistency between data and semantics, relative trust is taken into account to validate data or semantics via crowd. Our experimental results show the effectiveness of the proposed approach for matching and cleaning web tables using real-life datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

References

Bernstein, P.A., Madhavan, J., Rahm, E.: Generic schema matching, ten years later. Proc. VLDB Endowment 4(11), 695–701 (2011)
Google Scholar
Cafarella, M.J., Halevy, A., Khoussainova, N.: Data integration for the relational web. Proc. VLDB Endowment 2(1), 1090–1101 (2009)
Article Google Scholar
Cafarella, M.J., Halevy, A., Wang, D.Z., Wu, E., Zhang, Y.: Webtables: exploring the power of tables on the web. Proc. VLDB Endowment 1(1), 538–549 (2008)
Article Google Scholar
Chu, X., Morcos, J., Ilyas, I.F., Ouzzani, M., Papotti, P., Tang, N., Ye, Y.: Katara: A data cleaning system powered by knowledge bases and crowdsourcing. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1247–1261. ACM (2015)
Google Scholar
Cong, G., Fan, W., Geerts, F., Jia, X., Ma, S.: Improving data quality: Consistency and accuracy. In: Proceedings of the 33rd International Conference on Very Large Data Bases, pp. 315–326. VLDB Endowment (2007)
Google Scholar
Deng, D., Jiang, Y., Li, G., Li, J., Yu, C.: Scalable column concept determination for web tables using large knowledge bases. Proc. VLDB Endowment 6(13), 1606–1617 (2013)
Article Google Scholar
Fan, J., Lu, M., Ooi, B.C., Tan, W.C., Zhang, M.: A hybrid machine-crowdsourcing system for matching web tables. In: 2014 IEEE 30th International Conference on Data Engineering (ICDE), pp. 976–987. IEEE (2014)
Google Scholar
Fan, W., Ma, S., Tang, N., Yu, W.: Interaction between record matching and data repairing. J. Data Inf. Qual. (JDIQ) 4(4), 16 (2014)
Google Scholar
Geerts, F., Mecca, G., Papotti, P., Santoro, D.: Mapping and cleaning. In: 2014 IEEE 30th International Conference on Data Engineering (ICDE), pp. 232–243. IEEE (2014)
Google Scholar
Geerts, F., Mecca, G., Papotti, P., Santoro, D.: The llunatic data-cleaning framework. Proc. VLDB Endowment 6(9), 625–636 (2013)
Article Google Scholar
Hoffart, J., Suchanek, F.M., Berberich, K., Weikum, G.: Yago2: A spatially and temporally enhanced knowledge base from wikipedia. In: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pp. 3161–3165. AAAI Press (2013)
Google Scholar
Limaye, G., Sarawagi, S., Chakrabarti, S.: Annotating and searching web tables using entities, types and relationships. Proc. VLDB Endowment 3(1–2), 1338–1347 (2010)
Article Google Scholar
Rahm, E., Bernstein, P.A.: A survey of approaches to automatic schema matching. VLDB J. 10(4), 334–350 (2001)
Article MATH Google Scholar
Rahm, E., Do, H.H.: Data cleaning: Problems and current approaches. IEEE Data Eng. Bull. 23(4), 3–13 (2000)
Google Scholar
Venetis, P., Halevy, A., Madhavan, J., Paşca, M., Shen, W., Wu, F., Miao, G., Wu, C.: Recovering semantics of tables on the web. Proc. VLDB Endowment 4(9), 528–538 (2011)
Article Google Scholar
Wang, S., Xiao, X., Lee, C.H.: Crowd-based deduplication: An adaptive approach. In: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pp. 1263–1277. ACM (2015)
Google Scholar

Download references

Acknowledgments

This work was partially supported by Chinese NSFC (61170020, 61402311, 61440053), Jiangsu Province Colleges and Universities Natural Science Research project (13KJB520021), Jiangsu Province Postgraduate Cultivation and Innovation project (CXZZ13_0813), and the US National Science Foundation (IIS-1115417).

Author information

Authors and Affiliations

School of Computer Science and Technology, Soochow University, Suzhou, China
Chunhua Li, Pengpeng Zhao, Zhixu Li, Guanfeng Liu, Jian Wu & Zhiming Cui
Department of Computer Science, University of Central Arkansas, Conway, USA
Victor S. Sheng

Authors

Chunhua Li
View author publications
You can also search for this author in PubMed Google Scholar
Pengpeng Zhao
View author publications
You can also search for this author in PubMed Google Scholar
Victor S. Sheng
View author publications
You can also search for this author in PubMed Google Scholar
Zhixu Li
View author publications
You can also search for this author in PubMed Google Scholar
Guanfeng Liu
View author publications
You can also search for this author in PubMed Google Scholar
Jian Wu
View author publications
You can also search for this author in PubMed Google Scholar
Zhiming Cui
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Pengpeng Zhao .

Editor information

Editors and Affiliations

Peking University , Beijing, China
Bin Cui
The George Washington University , Washington, D.C., USA
Nan Zhang
Hong Kong Baptist University, Kowloon Tong, Hong Kong, China
Jianliang Xu
University of Texas Rio Grande Valley, Edinburg, Texas, USA
Xiang Lian
Jiangxi University of Finance and Economics, Nanchang, Jiangxi, China
Dexi Liu

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Li, C. et al. (2016). A Hybrid Machine-Crowdsourcing Approach for Web Table Matching and Cleaning. In: Cui, B., Zhang, N., Xu, J., Lian, X., Liu, D. (eds) Web-Age Information Management. WAIM 2016. Lecture Notes in Computer Science(), vol 9659. Springer, Cham. https://doi.org/10.1007/978-3-319-39958-4_11

Download citation

DOI: https://doi.org/10.1007/978-3-319-39958-4_11
Published: 02 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-39957-7
Online ISBN: 978-3-319-39958-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics