Integrating (Very) Heterogeneous Data Sources: A Structured and an Unstructured Perspective

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12245)

Abstract

Data integration is a broad area of data management research. It has led to the development of many useful tools and concepts, each appropriate in a certain class of application settings.

We consider the setting in which data sources have heterogeneous data models. This setting is of increasing relevance, as the (once predominant) relational databases are supplemented by data exchanged in formats such as JSON or XML, graphs such as Linked Open (RDF) data, matrix (numerical) data, etc. We describe two lines of work in this setting. The first targets a polystore setting, where data sources are queried through a structured, composite query language; the focus here is on dramatically improving performance through view-based rewriting techniques. The second assumes that sources are much too heterogeneous for structured querying and thus explores keyword-based search in an integrated graph built from all the available data.
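To make the second line of work concrete, the following is a minimal sketch of keyword-based search over an integrated graph. It is an illustration only, not the system described in the paper: the node labels, the example data (one relational row and one JSON document sharing a value), and the function names `build_graph`, `matching_nodes`, and `connect` are all assumptions, and a plain breadth-first search stands in for whatever search algorithm the actual work uses.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def matching_nodes(graph, keyword):
    """Return the nodes whose label contains the keyword (case-insensitive)."""
    kw = keyword.lower()
    return {n for n in graph if kw in n.lower()}


def connect(graph, kw1, kw2):
    """Breadth-first search for a shortest path from a node matching kw1
    to a node matching kw2; returns None if no such path exists."""
    targets = matching_nodes(graph, kw2)
    queue = deque([n] for n in sorted(matching_nodes(graph, kw1)))
    seen = {path[0] for path in queue}
    while queue:
        path = queue.popleft()
        if path[-1] in targets:
            return path
        for nxt in sorted(graph[path[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical data: a relational row and a JSON document that share the
# value "ACME Corp", which is what links the two sources in the graph.
graph = build_graph([
    ("row:employee#1", "Alice Martin"),
    ("row:employee#1", "ACME Corp"),
    ("json:contract#7", "ACME Corp"),
    ("json:contract#7", "City of Lyon"),
])

print(connect(graph, "alice", "lyon"))
```

In this sketch, the answer to a two-keyword query is a path that crosses source boundaries through a shared value, which is the kind of connection a structured query over either source alone could not return.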

Designing and setting up data integration architectures remains a rather complex task; data heterogeneity makes it all the more challenging. We believe much remains to be done to consolidate and advance this area.

Acknowledgment

This research has been supported by the ANR projects ContentCheck (Content Management Techniques for Fact-Checking: Models, Algorithms, and Tools) and CQFD (Complex Ontological Queries over Federated and Heterogeneous Data) and the ANR-DGA AI Chair SourcesSay (Intelligent Analysis and Interconnexion of Heterogeneous Data). We thank the journalists from Les Décodeurs, the fact-checking team of Le Monde, for sharing their insights into data journalism scenarios and needs.

Author information

Corresponding author

Correspondence to Ioana Manolescu.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Manolescu, I. (2020). Integrating (Very) Heterogeneous Data Sources: A Structured and an Unstructured Perspective. In: Darmont, J., Novikov, B., Wrembel, R. (eds) Advances in Databases and Information Systems. ADBIS 2020. Lecture Notes in Computer Science, vol 12245. Springer, Cham. https://doi.org/10.1007/978-3-030-54832-2_3

  • DOI: https://doi.org/10.1007/978-3-030-54832-2_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-54831-5

  • Online ISBN: 978-3-030-54832-2

  • eBook Packages: Computer Science, Computer Science (R0)
