
Improving XML search by generating and utilizing informative result snippets

Published: 30 July 2010

Abstract

Almost every text search engine uses snippets to complement its ranking scheme, helping it handle user searches that are inherently ambiguous and whose relevance semantics are difficult to assess. Although XML is a standard representation format for Web data, research on generating result snippets for XML search remains limited.
To tackle this important yet open problem, we present eXtract, a system that generates snippets for XML search results. We identify that a good XML result snippet should be a small, meaningful information unit that effectively summarizes its query result and differentiates it from others, so that users can quickly assess the relevance of the result. We have designed and implemented a novel algorithm to satisfy these requirements. Furthermore, we propose clustering the query results based on their snippets. Since XML result clustering can only be done at query time, snippet-based clustering significantly improves efficiency while sacrificing little clustering accuracy. We verified the efficiency and effectiveness of our approach through experiments.

    Published In

    ACM Transactions on Database Systems, Volume 35, Issue 3
    July 2010
    311 pages
    ISSN: 0362-5915
    EISSN: 1557-4644
    DOI: 10.1145/1806907

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 30 July 2010
    Accepted: 01 February 2010
    Revised: 01 October 2009
    Received: 01 March 2009
    Published in TODS Volume 35, Issue 3

    Author Tags

    1. XML
    2. clustering
    3. keyword search
    4. snippets

    Qualifiers

    • Research-article
    • Research
    • Refereed

