Non-binary evaluation measures for big data integration

  • Regular Paper
  • The VLDB Journal

Abstract

The evolution of data accumulation, management, analytics, and visualization has led to the coining of the term big data, which challenges the task of data integration. This task, common to any matching problem in computer science, involves generating alignments between structured data in an automated fashion. Historically, set-based measures, based upon binary similarity matrices (match/non-match), have dominated evaluation practices of matching tasks. However, in the presence of big data, such measures no longer suffice. In this work, we propose evaluation methods for non-binary matrices as well. Non-binary evaluation is formally defined together with several new, non-binary measures using a vector space representation of matching outcome. We provide empirical analyses of the usefulness of non-binary evaluation and show its superiority over its binary counterparts in several problem domains.
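
The contrast between binary and non-binary evaluation can be illustrated with a minimal sketch. It is not the set of measures defined in the article: the function names, the threshold, and the toy matrices are assumptions made for this example. The sketch computes set-based precision and recall on a thresholded (0/1) matrix, then applies the same formula directly to the graded similarity values.

```python
def precision_recall(match, reference):
    """Precision and recall of a match matrix against a reference matrix.

    Both arguments are equally shaped lists of rows with entries in [0, 1].
    With 0/1 entries this is the usual set-based precision and recall;
    applying it to graded entries is an illustrative assumption only.
    """
    overlap = selected = expected = 0.0
    for match_row, ref_row in zip(match, reference):
        for m, r in zip(match_row, ref_row):
            overlap += m * r   # element-wise product, summed over all cells
            selected += m      # total mass the matcher assigned
            expected += r      # total mass in the reference
    precision = overlap / selected if selected else 0.0
    recall = overlap / expected if expected else 0.0
    return precision, recall


def binarize(similarity, tau):
    """Turn a similarity matrix into a 0/1 matrix by thresholding at tau."""
    return [[1 if value >= tau else 0 for value in row] for row in similarity]


similarity = [[0.9, 0.2], [0.4, 0.7]]  # made-up matcher output
reference = [[1, 0], [0, 1]]           # made-up reference match

binary_scores = precision_recall(binarize(similarity, 0.3), reference)
graded_scores = precision_recall(similarity, reference)
print(binary_scores, graded_scores)
```

In this toy case the binary evaluation depends entirely on the chosen threshold, while the graded variant uses the similarity values as they are.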

Notes

  1. The proposed term similarity space should not be confused with the one proposed by Zobel and Moffat [49] in the context of document vector spaces.

  2. We use \(\circ\) to denote element-wise (also known as Hadamard) vector multiplication.

  3. https://github.com/tomersagi/ore.

  4. http://www.nisb-project.eu/.
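
As a brief illustration of the notation in note 2, the element-wise product multiplies entries in matching positions. The vectors below are made up for this example and do not come from the article:

\[
(0.9,\ 0.2,\ 0.7) \circ (1,\ 0,\ 1) = (0.9 \cdot 1,\ 0.2 \cdot 0,\ 0.7 \cdot 1) = (0.9,\ 0,\ 0.7)
\]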

References

  1. Algergawy, A., Nayak, R., Saake, G.: XML schema element similarity measures: a schema matching context. In: On the Move to Meaningful Internet Systems: OTM 2009, pp. 1246–1253 (2009)

  2. Ayat, N., Afsarmanesh, H., Akbarinia, R., Valduriez, P.: Pay-as-you-go data integration using functional dependencies. In: Multidisciplinary Research and Practice for Information Systems, LNCS, vol. 7465, pp. 375–389. Springer, Berlin (2012)

  3. Bellahsene, Z., Bonifati, A., Rahm, E. (eds.): Schema Matching and Mapping. Data-Centric Systems and Applications. Springer, Berlin (2011). https://doi.org/10.1007/978-3-642-16518-4

  4. Ben-Tal, A., Nemirovski, A.: Robust optimization-methodology and applications. Math. Program. 92(3), 453–480 (2002)

  5. Berenzweig, A., Logan, B., Ellis, D.P., Whitman, B.: A large-scale evaluation of acoustic and subjective music-similarity measures. Comput. Music J. 28(2), 63–76 (2004)

  6. Berlin, J., Motro, A.: Autoplex: automated discovery of content for virtual databases. In: CoopIS 2001, LNCS, vol. 2172, pp. 108–122. Springer, Berlin (2001)

  7. Bryant, V.: Metric Spaces: Iteration and Application. Cambridge University Press, Cambridge (1985)

  8. Cardoso, J., Sheth, A.P.: Semantic Web Services, Processes and Applications. Springer, Berlin (2006)

  9. Christen, P.: Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface. In: KDD ’08: Proceeding of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1065–1068. ACM, New York (2008). https://doi.org/10.1145/1401890.1402020

  10. Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. IEEE Trans. Knowl. Data Eng. (2011). https://doi.org/10.1109/TKDE.2011.127

  11. Das Sarma, A., Dong, X., Halevy, A.: Bootstrapping pay-as-you-go data integration systems. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 861–874. ACM, New York, SIGMOD ’08 (2008). https://doi.org/10.1145/1376616.1376702

  12. Do, H.H., Rahm, E.: COMA: a system for flexible combination of schema matching approaches. In: Proceedings of VLDB, VLDB Endowment, pp. 610–621 (2002)

  13. Doan, A.H., Domingos, P., Halevy, A.Y.: Reconciling schemas of disparate data sources: a machine-learning approach. ACM SIGMOD Rec. 30, 509–520 (2001)

  14. Dong, X., Halevy, A., Yu, C.: Data integration with uncertainty. VLDB J. 18, 469–500 (2009). https://doi.org/10.1007/s00778-008-0119-9

  15. Duchateau, F., Bellahsene, Z., Coletta, R.: Matching and alignment: What is the cost of user post-match effort? In: On the Move to Meaningful Internet Systems: OTM 2011, LNCS, vol. 7044, pp. 421–428. Springer, Berlin (2011). https://doi.org/10.1007/978-3-642-25109-2_28

  16. Engmann, D., Maßmann, S.: Instance matching with coma++. In: BTW Workshops, pp. 28–37 (2007)

  17. Euzenat, J.: Semantic precision and recall for ontology alignment evaluation. In: Proceedings of the IJCAI, pp. 348–353 (2007)

  18. Euzenat, J., Meilicke, C., Stuckenschmidt, H., Shvaiko, P., dos Santos, C.T.: Ontology alignment evaluation initiative: six years of experience. J. Data Semant. 15, 158–192 (2011). https://doi.org/10.1007/978-3-642-22630-4_6

  19. Fellegi, I.P., Sunter, A.B.: A theory for record linkage. J. Am. Stat. Assoc. 64(328), 1183–1210 (1969). https://doi.org/10.2307/2286061

  20. Friedman, E.J.: Active learning for smooth problems. In: Proceedings of the 22nd Annual Conference on Learning Theory (2009)

  21. Gal, A.: Uncertain Schema Matching. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, Los Altos (2011). https://doi.org/10.2200/S00337ED1V01Y201102DTM013

  22. Gal, A., Anaby-Tavor, A., Trombetta, A., Montesi, D.: A framework for modeling and evaluating automatic semantic reconciliation. VLDB J. 14(1), 50–67 (2005)

  23. Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, C.: Declarative data cleaning: language, model and algorithms. In: Proceedings of the International Conference on Very Large Databases (VLDB) (2001)

  24. Gawinecki, M.: Abbreviation Expansion in Lexical Annotation of Schema. Camogli (Genova), Italy June 25th, 2009 Co-located with SEBD, p. 61 (2009)

  25. Lee, Y., Sayyadian, M., Doan, A.H., Rosenthal, A.S.: eTuner: tuning schema matching software using synthetic scenarios. VLDB J. 16(1), 97–122 (2007)

  26. Li, W., Clifton, C.: SEMINT: a tool for identifying attribute correspondences in heterogeneous databases using neural networks. Data Knowl. Eng. 33(1), 49–84 (2000)

  27. Luenberger, D.: Optimization by Vector Space Methods. Wiley-Interscience, New York (1997)

  28. Madhavan, J., Bernstein, P., Doan, A., Halevy, A.: Corpus-based schema matching. In: Proceedings of the ICDE, pp. 57–68 (2005)

  29. Madhavan, J., Jeffery, S., Cohen, S., Dong, X., Ko, D., Yu, C., Halevy, A.: Web-scale data integration: you can only afford to pay as you go. In: Proceedings of the CIDR, pp. 342–350 (2007)

  30. Magnani, M., Rizopoulos, N., McBrien, P., Montesi, D.: Schema integration based on uncertain semantic mappings. In: Conceptual Modeling ER 2005, pp. 31–46 (2005)

  31. Marie, A., Gal, A.: Managing uncertainty in schema matcher ensembles. In: Prade, H., Subrahmanian, V. (eds.) Scalable Uncertainty Management, LNCS, vol. 4772, pp. 60–73. Springer, Berlin (2007). https://doi.org/10.1007/978-3-540-75410-7_5

  32. Marie, A., Gal, A.: On the stable marriage of maximum weight royal couples. In: Proceedings of AAAI Workshop on Information Integration on the Web (2007)

  33. Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity flooding: a versatile graph matching algorithm and its application to schema matching. In: ICDE, pp. 117–128. IEEE (2002)

  34. Mena, E., Kashyap, V., Illarramendi, A., Sheth, A.P.: Imprecise answers in distributed environments: estimation of information loss for multi-ontology based query processing. Int. J. Coop. Inf. Syst. 9(4), 403–425 (2000)

  35. Modica, G., Gal, A., Jamil, H.: The use of machine-generated ontologies in dynamic information seeking. In: CoopIS, pp. 433–447 (2001)

  36. Noy, N.F., Mortensen, J., Musen, M.A., Alexander, P.R.: Mechanical turk as an ontology engineer? Using microtasks as a component of an ontology-engineering workflow. In: Web Science 2013 (co-located with ECRC), WebSci ’13, Paris, pp. 262–271 (2013). https://doi.org/10.1145/2464464.2464482

  37. Peukert, E., Eberius, J., Rahm, E.: AMC—a framework for modelling and comparing matching systems as matching processes. In: ICDE, pp. 1304–1307. IEEE (2011)

  38. Powers, D.: Evaluation: from precision, recall and f-measure to ROC, informedness, markedness & correlation. J. Mach. Learn. Technol. 2(1), 37–63 (2011)

  39. Ratinov, L., Gudes, E.: Abbreviation expansion in schema matching and web integration. In: Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence. IEEE Computer Society, pp. 485–489 (2004)

  40. Rodriguez-Gianolli, P., Mylopoulos, J.: A semantic approach to XML-based data integration. In: Kunii, H.S., Jajodia, S., Sølvberg, A. (eds.) Conceptual Modeling–ER 2001. Lecture Notes in Computer Science, vol. 2224, pp. 117–132. Springer, Berlin (2001)

  41. Sagi, T., Gal, A.: Non-binary evaluation for schema matching. In: Atzeni, P., Cheung, D., Ram, S. (eds.) Conceptual Modeling, Lecture Notes in Computer Science, vol. 7532, pp. 477–486. Springer, Berlin (2012). https://doi.org/10.1007/978-3-642-34002-4_37

  42. Sagi, T., Gal, A.: Schema matching prediction with applications to data source discovery and dynamic ensembling. VLDB J. 22(5), 689–710 (2013). https://doi.org/10.1007/s00778-013-0325-y

  43. Sagi, T., Gal, A.: In schema matching, even experts are human. Towards expert sourcing in schema matching. In: 10th International Workshop on Information Integration on the Web (IIWeb ’14), co-located with ICDE 2014. IEEE, Chicago (2014)

  44. Sarawagi, S., Bhamidipaty, A.: Interactive deduplication using active learning. In: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’02, pp. 269–278. ACM, New York (2002). https://doi.org/10.1145/775047.775087

  45. Shepard, R.: Attention and the metric structure of the stimulus space. J. Math. Psychol. 1(1), 54–87 (1964)

  46. Steel, R.G.D., Torrie, J.H.: Principles and Procedures of Statistics. McGraw-Hill, New York (1960)

  47. Weidlich, M., Dijkman, R., Mendling, J.: The ICOP framework: identification of correspondences between process models. In: Advanced Information Systems Engineering, pp. 483–498. Springer, Berlin (2010)

  48. Winkler, W., Yancey, W., Porter, E.: Fast record linkage of very large files in support of decennial and administrative records projects. In: Proceedings of the Section on Survey Research Methods. American Statistical Association (2010)

  49. Zobel, J., Moffat, A.: Exploring the similarity space. SIGIR Forum 32, 18–34 (1998). https://doi.org/10.1145/281250.281256

Acknowledgements

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under the NisB (http://nisb-project.eu/) project, Grant Agreement No. 256955.

Author information

Corresponding author

Correspondence to Tomer Sagi.

About this article

Cite this article

Sagi, T., Gal, A. Non-binary evaluation measures for big data integration. The VLDB Journal 27, 105–126 (2018). https://doi.org/10.1007/s00778-017-0489-y

  • DOI: https://doi.org/10.1007/s00778-017-0489-y
