Extraction of Partial XML Documents Using IR-Based Structure and Contents Analysis

  • Conference paper
Conceptual Modeling for New Information Systems Technologies (ER 2001)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2465)

Abstract

As Internet technologies develop, XML is becoming widely used as a standard data and document format. Although XML documents have attracted wide attention, the application of IR techniques to XML document retrieval is still at an early stage. We expect that typical XML queries from end users will be very terse, like those submitted to current Web search engines. An XML search engine should therefore be able to return appropriate results from only a few keywords. In this paper, we introduce the notion of context nodes, which are used to automatically extract coherent partial documents without knowledge of the XML document structure. This method is useful because it does not require domain analysts to analyze DTDs and specify candidate partial documents beforehand. We use the term “context search” for search methods that employ the notion of context nodes. As an instantiation of context search, we have developed algorithms that identify result partial documents in the vector space model, and we conducted a performance evaluation to verify the effectiveness of our method.
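
The abstract does not spell out how context nodes are computed, so the following is only a minimal sketch of the general idea of keyword retrieval over partial documents in the vector space model, not the authors' algorithm. It assumes every element subtree is a candidate partial document, weights terms with a simple TF-IDF variant, and ranks subtrees by cosine similarity to the query. The function names (`tokenize`, `rank_subtrees`), the weighting formula, and the sample document are all hypothetical choices made for illustration.

```python
import math
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample document, used only for illustration.
SAMPLE = """
<book>
  <chapter><title>XML retrieval</title><para>Keyword search over XML documents</para></chapter>
  <chapter><title>Cooking</title><para>Pasta recipes and sauces</para></chapter>
</book>
"""


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if dot == 0.0 or norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_subtrees(xml_string, query):
    """Rank every element subtree against a keyword query.

    Assumption: each element, together with all text below it, is one
    candidate partial document. Returns (score, tag, text) tuples sorted
    by descending score.
    """
    root = ET.fromstring(xml_string)
    nodes = list(root.iter())
    texts = [" ".join(node.itertext()) for node in nodes]
    counts = [Counter(tokenize(t)) for t in texts]

    # Document frequency: in how many subtrees each term occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    n_units = len(nodes)

    def weight(term_counts):
        # TF-IDF variant (an assumption of this sketch); terms unseen in
        # the collection are dropped so df[t] is never zero.
        return {
            t: tf * math.log(1 + n_units / df[t])
            for t, tf in term_counts.items()
            if t in df
        }

    q_vec = weight(Counter(tokenize(query)))
    ranked = [
        (cosine(q_vec, weight(c)), node.tag, " ".join(text.split()))
        for node, c, text in zip(nodes, counts, texts)
    ]
    ranked.sort(key=lambda r: r[0], reverse=True)
    return ranked


if __name__ == "__main__":
    for score, tag, text in rank_subtrees(SAMPLE, "xml retrieval")[:3]:
        print(f"{score:.3f}  <{tag}>  {text}")
```

In this sketch, plain cosine normalization tends to favor very short elements that contain little beyond the query terms. How the paper's context nodes select coherent units is not described in the abstract and is not modeled here.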

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Hatano, K., Kinutani, H., Yoshikawa, M., Uemura, S. (2002). Extraction of Partial XML Documents Using IR-Based Structure and Contents Analysis. In: Arisawa, H., Kambayashi, Y., Kumar, V., Mayr, H.C., Hunt, I. (eds) Conceptual Modeling for New Information Systems Technologies. ER 2001. Lecture Notes in Computer Science, vol 2465. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-46140-X_26

  • DOI: https://doi.org/10.1007/3-540-46140-X_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-44122-9

  • Online ISBN: 978-3-540-46140-1

  • eBook Packages: Springer Book Archive
