Extraction of Partial XML Documents Using IR-Based Structure and Contents Analysis

Hatano, Kenji; Kinutani, Hiroko; Yoshikawa, Masatoshi; Uemura, Shunsuke

doi:10.1007/3-540-46140-X_26

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2465)

Included in the following conference series:

International Conference on Conceptual Modeling

Abstract

As Internet technologies develop, XML is becoming widely used as a standard data and document format. Although XML documents have attracted wide attention, the application of information retrieval (IR) technologies to XML document retrieval is still at an early stage. We expect that typical XML queries from end users will be very terse, like those submitted to current Web search engines. An XML search engine should therefore be able to return appropriate results from only a few keywords. In this paper, we introduce the notion of context nodes, which are used to automatically extract coherent partial documents without prior knowledge of the XML document structure. This method is useful because it does not require domain analysts to analyze DTDs and specify candidate partial documents beforehand. We use the term “context search” for search methods that employ the notion of context nodes. As an instantiation of context search, we have developed algorithms that identify result partial documents in the vector space model. We conducted a performance evaluation to verify the effectiveness of our method.
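
The general idea of ranking partial documents in the vector space model can be illustrated with a small sketch. It is not the algorithm of the paper, whose details are not given in the abstract. It assumes that every element subtree is a candidate partial document and that candidates are scored by TF-IDF cosine similarity to the keyword query. The function names (`tokenize`, `subtree_text`, `rank_subtrees`) and the smoothed IDF formula are choices made for this illustration only.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def subtree_text(elem):
    """Concatenate all text contained in an element's subtree."""
    return " ".join(elem.itertext())


def rank_subtrees(xml_string, query):
    """Rank every element subtree by TF-IDF cosine similarity to the query.

    Assumption: each element subtree is one candidate partial document.
    This illustrates the vector space model only, not the paper's method.
    Returns a list of (score, tag) pairs, best first.
    """
    root = ET.fromstring(xml_string)
    candidates = [(elem, Counter(tokenize(subtree_text(elem))))
                  for elem in root.iter()]
    n = len(candidates)

    # Document frequency: number of candidate subtrees containing each term.
    df = Counter()
    for _, tf in candidates:
        df.update(tf.keys())

    def weights(tf):
        # Smoothed IDF (an assumption) so that no term gets zero weight.
        return {t: f * (math.log((1 + n) / (1 + df[t])) + 1.0)
                for t, f in tf.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q = weights(Counter(tokenize(query)))
    scored = [(cosine(q, weights(tf)), elem.tag) for elem, tf in candidates]
    return sorted(scored, key=lambda s: s[0], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample document, used only to exercise the sketch.
    sample = (
        "<book>"
        "<chapter><title>XML retrieval</title>"
        "<para>Keyword search over XML documents.</para></chapter>"
        "<chapter><title>Cooking</title>"
        "<para>How to bake bread at home.</para></chapter>"
        "</book>"
    )
    for score, tag in rank_subtrees(sample, "xml keyword search"):
        print(f"{score:.3f}  {tag}")
```

Because cosine similarity normalizes by vector length, a small subtree that concentrates the query terms can outrank a larger enclosing one. That is one reason the choice of result granularity matters when queries consist of only a few keywords.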

Author information

Authors and Affiliations

Nara Institute of Science and Technology, 8916-5 Takayama, Ikoma, 630-0101, Japan
Kenji Hatano, Hiroko Kinutani, Masatoshi Yoshikawa & Shunsuke Uemura
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda, Tokyo, 101-8430, Japan
Masatoshi Yoshikawa

Editor information

Editors and Affiliations

Graduate School of Environment and Information Sciences, Yokohama National University, 79-7, Tokiwadai, Hodogaya-ku, Yokohama, 240-8501, Japan
Hiroshi Arisawa
Department of Social Informatics, Graduate School of Informatics, Kyoto University, Yoshida, Sakyo, Kyoto, 606-8501, Japan
Yahiko Kambayashi
SICE Computer Networking, University of Missouri-Kansas City, 5100 Rockhill Road, Kansas City, MO, 64110, USA
Vijay Kumar
University of Klagenfurt, Universitätsstraße 65-67, 9020, Klagenfurt, Austria
Heinrich C. Mayr
VP Industry Services DAMA International, PO Box 5786, Bellevue, WA, 98006-5786, USA
Ingrid Hunt

About this paper

Cite this paper

Hatano, K., Kinutani, H., Yoshikawa, M., Uemura, S. (2002). Extraction of Partial XML Documents Using IR-Based Structure and Contents Analysis. In: Arisawa, H., Kambayashi, Y., Kumar, V., Mayr, H.C., Hunt, I. (eds) Conceptual Modeling for New Information Systems Technologies. ER 2001. Lecture Notes in Computer Science, vol 2465. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46140-X_26

DOI: https://doi.org/10.1007/3-540-46140-X_26
Published: 13 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44122-9
Online ISBN: 978-3-540-46140-1
eBook Packages: Springer Book Archive
