Pay-as-you-go Data Integration: Experiences and Recurring Themes

Paton, Norman W.; Belhajjame, Khalid; Embury, Suzanne M.; Fernandes, Alvaro A. A.; Maskat, Ruhaila

doi:10.1007/978-3-662-49192-8_7

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9587)

Included in the following conference series:

International Conference on Current Trends in Theory and Practice of Informatics

Abstract

Data integration typically seeks to provide the illusion that data from multiple distributed sources comes from a single, well-managed source. Providing this illusion in practice tends to involve the design of a global schema that captures the users' data requirements, followed by manual (with tool support) construction of mappings between sources and the global schema. This overall approach can provide high-quality integrations but at high cost, and tends to be unsuitable for areas with large numbers of rapidly changing sources, where users may be willing to cope with a less than perfect integration. Pay-as-you-go data integration has been proposed to overcome the need for costly manual data integration. Pay-as-you-go data integration tends to involve two steps. Initialisation: automatic creation of mappings (generally of poor quality) between sources. Improvement: obtaining feedback on some aspect of the integration, and applying this feedback to revise the integration. There has been considerable research in this area over a ten-year period. This paper reviews some experiences with pay-as-you-go data integration, providing a framework that can be used to compare or develop pay-as-you-go data integration techniques.
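The two steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the function names `initialise` and `improve`, the use of name similarity from Python's `difflib` as the automatic matcher, the threshold value, and the boolean form of the feedback are all assumptions chosen only to make the initialise-then-improve pattern concrete.

```python
from difflib import SequenceMatcher


def initialise(source_attrs, target_attrs, threshold=0.5):
    """Initialisation (illustrative): propose candidate attribute matches
    automatically, using nothing but name similarity. The results are
    expected to be of mixed quality. The threshold is an assumed value."""
    candidates = {}
    for s in source_attrs:
        for t in target_attrs:
            score = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if score >= threshold:
                candidates[(s, t)] = score
    return candidates


def improve(candidates, feedback):
    """Improvement (illustrative): revise the candidate matches using
    feedback, given here as a hypothetical dict from (source, target)
    pairs to True (confirmed) or False (rejected)."""
    revised = dict(candidates)  # leave the input untouched
    for pair, is_correct in feedback.items():
        if is_correct:
            revised[pair] = 1.0      # a confirmed match gets full confidence
        else:
            revised.pop(pair, None)  # a rejected match is dropped
    return revised


if __name__ == "__main__":
    matches = initialise(["name", "postcode"], ["full_name", "post_code", "price"])
    # Feedback can arrive a little at a time; each round revises the result.
    matches = improve(matches, {("postcode", "post_code"): True})
    for pair, score in sorted(matches.items()):
        print(pair, round(score, 2))
```

In this sketch the cost of integration is paid incrementally: `initialise` needs no human input, and each call to `improve` uses only as much feedback as is currently available.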

Notes

1. https://www.mturk.com
2. http://www.crowdflower.com

Acknowledgement

Research on data integration at Manchester is supported by the VADA Programme Grant of the UK Engineering and Physical Sciences Research Council, whose support we are pleased to acknowledge.

Author information

Authors and Affiliations

School of Computer Science, University of Manchester, Oxford Road, M13 9PL, Manchester, UK
Norman W. Paton, Suzanne M. Embury, Alvaro A. A. Fernandes & Ruhaila Maskat
Université Paris Dauphine, Place du Maréchal de Lattre de Tassigny, 75775, Paris Cedex 16, France
Khalid Belhajjame

Corresponding author

Correspondence to Norman W. Paton.

Editor information

Editors and Affiliations

University of Latvia, Riga, Latvia
Rūsiņš Mārtiņš Freivalds
University of Paderborn, Paderborn, Germany
Gregor Engels
University of Genoa, Genoa, Italy
Barbara Catania

About this paper

Cite this paper

Paton, N.W., Belhajjame, K., Embury, S.M., Fernandes, A.A.A., Maskat, R. (2016). Pay-as-you-go Data Integration: Experiences and Recurring Themes. In: Freivalds, R., Engels, G., Catania, B. (eds) SOFSEM 2016: Theory and Practice of Computer Science. SOFSEM 2016. Lecture Notes in Computer Science, vol 9587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49192-8_7

DOI: https://doi.org/10.1007/978-3-662-49192-8_7
Published: 08 January 2016
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-49191-1
Online ISBN: 978-3-662-49192-8
eBook Packages: Computer Science, Computer Science (R0)
