DOI: 10.1145/1247480.1247487
Article

Indexing dataspaces

Published: 11 June 2007

Abstract

Dataspaces are collections of heterogeneous and partially unstructured data. Unlike data-integration systems, which also offer uniform access to heterogeneous data sources, dataspaces do not assume that all the semantic relationships between sources are known and specified. Much of the user interaction with dataspaces involves exploring the data, and users do not have a single schema against which they can pose queries. Consequently, it is important that queries be allowed to specify varying degrees of structure, ranging from keyword queries to more structure-aware queries.
This paper considers indexing support for queries that combine keywords and structure. We describe several extensions to inverted lists to capture structure when it is present. In particular, our extensions incorporate attribute labels, relationships between data items, hierarchies of schema elements, and synonyms among schema elements. We describe experiments showing that our indexing techniques improve query efficiency by an order of magnitude compared with alternative approaches, and scale well with the size of the data.
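To make the idea of an inverted list that captures attribute labels more concrete, here is a minimal sketch. It is not the paper's implementation: the `keyword//attribute//` term format, the function names `build_index` and `search`, and the toy data items are all assumptions made for illustration. Each keyword is indexed twice, once on its own and once combined with the label of the attribute it appears in, so that the same structure can answer both plain keyword queries and queries that name an attribute.

```python
from collections import defaultdict


def build_index(items):
    """Build an inverted list whose terms are plain keywords and
    keyword-plus-attribute combinations.

    `items` maps an item id to a dict of attribute label -> text value.
    The "keyword//attribute//" term format is an assumption of this sketch.
    """
    index = defaultdict(set)
    for item_id, attributes in items.items():
        for label, value in attributes.items():
            for keyword in value.lower().split():
                # Keyword-only term: supports unstructured keyword queries.
                index[keyword].add(item_id)
                # Keyword combined with its attribute label: supports
                # structure-aware queries.
                index[f"{keyword}//{label.lower()}//"].add(item_id)
    return index


def search(index, keyword, attribute=None):
    """Return sorted ids of items containing `keyword`,
    optionally restricted to a single attribute label."""
    term = keyword.lower()
    if attribute is not None:
        term = f"{term}//{attribute.lower()}//"
    return sorted(index.get(term, set()))


# Hypothetical toy data, invented for this example.
items = {
    1: {"title": "Indexing dataspaces", "author": "Dong Halevy"},
    2: {"title": "Halevy on data integration", "author": "Smith"},
}
index = build_index(items)

print(search(index, "halevy"))            # [1, 2]
print(search(index, "halevy", "author"))  # [1]
```

The keyword-only lookup returns both items, while the attribute-restricted lookup returns only the item in which the keyword occurs under the `author` label. Relationships, schema hierarchies, and synonyms, which the abstract also mentions, are outside the scope of this sketch.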

Published In

SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data
June 2007
1210 pages
ISBN:9781595936868
DOI:10.1145/1247480
General Chairs: Lizhu Zhou, Tok Wang Ling
Program Chair: Beng Chin Ooi
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. dataspace
  2. heterogeneity
  3. indexing

Qualifiers

  • Article

Conference

SIGMOD/PODS07
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 26
  • Downloads (Last 6 weeks): 6
Reflects downloads up to 16 Feb 2025

Citations

Cited By

  • (2023) BIR: Biomedical Information Retrieval System for Cancer Treatment in Electronic Health Record Using Transformers. Sensors 23:23 (9355). DOI: 10.3390/s23239355. Online publication date: 23-Nov-2023
  • (2023) CQFaRAD: Collaborative Query-Answering Framework for a Research Article Dataspace. International Journal of Information Technology. DOI: 10.1007/s41870-023-01518-x. Online publication date: 30-Sep-2023
  • (2022) Personal information management practices: how scientists find and organize information. Global Knowledge, Memory and Communication 73:6/7 (757-774). DOI: 10.1108/GKMC-04-2022-0082. Online publication date: 8-Nov-2022
  • (2021) CBench. Proceedings of the VLDB Endowment 14:8 (1325-1337). DOI: 10.14778/3457390.3457398. Online publication date: 21-Oct-2021
  • (2021) Towards an Architecture to Support Data Access in Research Data Spaces. 2021 IEEE 22nd International Conference on Information Reuse and Integration for Data Science (IRI) (310-317). DOI: 10.1109/IRI51335.2021.00049. Online publication date: Aug-2021
  • (2021) Industrial Dataspace for smart manufacturing: connotation, key technologies, and framework. International Journal of Production Research 61:12 (3868-3883). DOI: 10.1080/00207543.2021.1955996. Online publication date: 16-Aug-2021
  • (2020) Data Mining Visualization with the Impact of Nature Inspired Algorithms in Big Data. 2020 4th International Conference on Trends in Electronics and Informatics (ICOEI)(48184) (664-668). DOI: 10.1109/ICOEI48184.2020.9142979. Online publication date: Jun-2020
  • (2020) A Proposed Ranked Clustering Approach for Unstructured Data from Dataspace using VSM. 2020 20th International Conference on Computational Science and Its Applications (ICCSA) (80-86). DOI: 10.1109/ICCSA50381.2020.00024. Online publication date: Jul-2020
  • (2019) Concerns and Challenges of Cloud Platforms for Bioinformatics. Advanced Methodologies and Technologies in Medicine and Healthcare (45-55). DOI: 10.4018/978-1-5225-7489-7.ch004. Online publication date: 2019
  • (2019) Identifying Reference Relationship of Desktop Files Based on Access Logs. Database Systems for Advanced Applications (113-127). DOI: 10.1007/978-3-030-18590-9_8. Online publication date: 24-Apr-2019
