Abstract
Data integration typically seeks to provide the illusion that data from multiple distributed sources comes from a single, well-managed source. Providing this illusion in practice tends to involve the design of a global schema that captures the users' data requirements, followed by manual (with tool support) construction of mappings between sources and the global schema. This overall approach can provide high-quality integrations but at high cost, and tends to be unsuitable for areas with large numbers of rapidly changing sources, where users may be willing to cope with a less than perfect integration. Pay-as-you-go data integration has been proposed to overcome the need for costly manual data integration. It tends to involve two steps: initialisation, the automatic creation of mappings (generally of poor quality) between sources; and improvement, the obtaining of feedback on some aspect of the integration and the application of this feedback to revise the integration. There has been considerable research in this area over a ten-year period. This paper reviews some experiences with pay-as-you-go data integration, providing a framework that can be used to compare or develop pay-as-you-go data integration techniques.
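The two steps can be illustrated with a minimal sketch. It is not taken from the paper: the function names (`initialise`, `improve`, `best_mappings`), the attribute names, the threshold, and the use of string similarity as the automatic matcher are all assumptions made for illustration, and real systems use considerably richer matchers and feedback models.

```python
from difflib import SequenceMatcher


def initialise(source_attrs, target_attrs, threshold=0.5):
    """Initialisation: propose candidate mappings automatically.

    Name similarity is a stand-in matcher assumed for this sketch.
    """
    candidates = {}
    for s in source_attrs:
        for t in target_attrs:
            score = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if score >= threshold:
                candidates[(s, t)] = score
    return candidates


def improve(candidates, feedback):
    """Improvement: revise the candidate mappings using feedback.

    feedback maps (source_attr, target_attr) to True (correct)
    or False (incorrect). The input dictionary is not modified.
    """
    revised = dict(candidates)
    for pair, is_correct in feedback.items():
        if is_correct:
            revised[pair] = 1.0      # confirmed mappings get full confidence
        else:
            revised.pop(pair, None)  # rejected mappings are discarded
    return revised


def best_mappings(candidates):
    """Keep the highest-scoring target for each source attribute."""
    best = {}
    for (s, t), score in candidates.items():
        if s not in best or score > best[s][1]:
            best[s] = (t, score)
    return {s: t for s, (t, _) in best.items()}


# Hypothetical attribute names, used only to exercise the sketch.
source = ["cust_name", "addr", "phone_no"]
target = ["customer_name", "address", "phone", "name"]

candidates = initialise(source, target)
# A user reports that one proposed mapping is wrong and another is right.
feedback = {
    ("cust_name", "name"): False,
    ("cust_name", "customer_name"): True,
}
revised = improve(candidates, feedback)
mappings_after = best_mappings(revised)
```

The point of the sketch is the division of labour: `initialise` is cheap and imperfect, and each piece of feedback passed to `improve` incrementally raises the quality of the mappings, so the cost is paid only as the integration is used.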
Acknowledgement
Research on data integration at Manchester is supported by the VADA Programme Grant of the UK Engineering and Physical Sciences Research Council, whose support we are pleased to acknowledge.
Copyright information
© 2016 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Paton, N.W., Belhajjame, K., Embury, S.M., Fernandes, A.A.A., Maskat, R. (2016). Pay-as-you-go Data Integration: Experiences and Recurring Themes. In: Freivalds, R., Engels, G., Catania, B. (eds) SOFSEM 2016: Theory and Practice of Computer Science. SOFSEM 2016. Lecture Notes in Computer Science, vol 9587. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-49192-8_7
DOI: https://doi.org/10.1007/978-3-662-49192-8_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-49191-1
Online ISBN: 978-3-662-49192-8
eBook Packages: Computer Science (R0)