Mining Semi-structured Data by Path Expressions

  • Conference paper
Discovery Science (DS 2001)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2226)

Abstract

A new data model for filtering semi-structured texts is presented. Given positive and negative examples of HTML pages labelled by a labelling function, each HTML page is divided into a set of paths using an XML parser. A path is a sequence of element nodes and text nodes in which a text node appears only at the tail of the path. The label of an element node is called a tag, and the label of a text node is called a text. The goal of a mining algorithm is to find an interesting pattern, called an association path, which is a pair (t, w) of a tag-sequence t and a word-sequence w represented by a word-association pattern [1]. An association path (t, w) agrees with a labelling function on a path p if the following holds: t is a subsequence of the tag-sequence of p and w matches the text of p if and only if p belongs to a positive example. The importance of an association path α is measured by its agreement with the labelling function on the given data, i.e., the number of paths on which α agrees with the labelling function. We present a mining algorithm for this problem and show the efficiency of this model by experiments.
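
The definitions in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it only splits well-formed markup into paths and counts the agreement of one candidate association path. The function names, the sample pages, and the proximity parameter k are assumptions made for illustration, and word matching is simplified to "the words of w occur in order, with consecutive words at most k tokens apart", which only approximates the word-association patterns of [1].

```python
import xml.etree.ElementTree as ET

def extract_paths(markup):
    """Return (tag_sequence, text) pairs, one per non-empty text node.
    Assumes the page is well-formed XML (e.g. XHTML)."""
    paths = []

    def walk(elem, prefix):
        tags = prefix + (elem.tag,)
        if elem.text and elem.text.strip():
            paths.append((tags, elem.text.strip()))
        for child in elem:
            walk(child, tags)
            # Text that follows a child element is still a child of `elem`.
            if child.tail and child.tail.strip():
                paths.append((tags, child.tail.strip()))

    walk(ET.fromstring(markup), ())
    return paths

def is_subsequence(t, tags):
    """True if t can be obtained from tags by deleting elements."""
    it = iter(tags)
    return all(tag in it for tag in t)

def words_match(w, text, k):
    """Simplified matching (an assumption): the words of w occur in order,
    each at most k tokens after the previous one."""
    tokens = text.lower().split()

    def search(i, start, limit):
        if i == len(w):
            return True
        for pos in range(start, min(limit, len(tokens))):
            if tokens[pos] == w[i] and search(i + 1, pos + 1, pos + 1 + k):
                return True
        return False

    return search(0, 0, len(tokens))

def agreement(t, w, k, labelled_paths):
    """Count paths on which (t, w) agrees with the labels: the pattern
    hits the path exactly when the path comes from a positive page."""
    count = 0
    for tags, text, positive in labelled_paths:
        hit = is_subsequence(t, tags) and words_match(w, text, k)
        if hit == positive:
            count += 1
    return count

# Hypothetical sample pages, one positive and one negative.
POSITIVE_PAGE = ("<html><body><h1>Discovery Science 2001</h1>"
                 "<p>call for <b>papers</b> now open</p></body></html>")
NEGATIVE_PAGE = "<html><body><p>science fiction reviews</p></body></html>"

LABELLED = ([(tags, text, True) for tags, text in extract_paths(POSITIVE_PAGE)]
            + [(tags, text, False) for tags, text in extract_paths(NEGATIVE_PAGE)])

if __name__ == "__main__":
    score = agreement(("body", "h1"), ("discovery", "science"), 1, LABELLED)
    print(score, "of", len(LABELLED), "paths agree")
```

A mining algorithm in the sense of the abstract would search over candidate pairs (t, w) for one that maximizes this count; the sketch only scores a single candidate.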

References

  1. Shimozono, S., Arimura, H., and Arikawa, S. Efficient discovery of optimal word-association patterns in large text databases. New Generation Computing 18:49–60, 2000.

  2. Arora, S. Polynomial-time approximation schemes for Euclidean TSP and other geometric problems. Proc. 37th IEEE Symposium on Foundations of Computer Science, 2–12, 1996.

  3. Abiteboul, S., Buneman, P., and Suciu, D. Data on the Web: From relations to semistructured data and XML, Morgan Kaufmann, San Francisco, CA, 2000.

  4. Angluin, D. Queries and concept learning. Machine Learning 2:319–342, 1988.

  5. Buneman, P., Davidson, S., Hillebrand, G., and Suciu, D. A query language and optimization techniques for unstructured data. University of Pennsylvania, Computer and Information Science Department, Technical Report MS-CIS 96-09, 1996.

  6. Cohen, W. W. and Fan, W. Learning page-independent heuristics for extracting data from Web pages. Proc. WWW-99, 1999.

  7. Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. Learning to construct knowledge bases from the World Wide Web, Artificial Intelligence 118:69–113, 2000.

  8. Freitag, D. Information extraction from HTML: Application of a general machine learning approach. Proc. the 15th National Conference on Artificial Intelligence, 517–523, 1998.

  9. Grieser, G., Jantke, K. P., Lange, S., and Thomas, B. A unifying approach to HTML wrapper representation and learning, Proc. the 3rd International Conference, DS2000, Lecture Notes in Artificial Intelligence 1967:50–64, 2000.

  10. Hammer, J., Garcia-Molina, H., Cho, J., and Crespo, A. Extracting semistructured information from the Web. Proc. Workshop on Management of Semistructured Data, 18–25, 1997.

  11. Hsu, C.-N. Initial results on wrapping semistructured web pages with finite-state transducers and contextual rules. Proc. 1998 Workshop on AI and Information Integration, 66–73, 1998.

  12. Kamada, T. Compact HTML for small information appliances. W3C NOTE 09-Feb-1998. http://www.w3.org/TR/1998/NOTE-compactHTML-19980209, 1998.

  13. Kushmerick, N. Wrapper induction: efficiency and expressiveness. Artificial Intelligence 118:15–68, 2000.

  14. Lin, S., and Kernighan, B. W. An effective heuristic algorithm for the travelling salesman problem. Operations Research 21:498–516, 1973.

  15. Muslea, I., Minton, S., and Knoblock, C. A. Wrapper induction for semistructured, web-based information sources. Proc. Conference on Automated Learning and Discovery, 1998.

  16. Sakamoto, H., Arimura, H., and Arikawa, S. Identification of tree translation rules from examples. Proc. the 5th International Colloquium on Grammatical Inference, LNAI 1891:241–255, 2000.

  17. Thomas, B. Anti-unification based learning of T-Wrappers for information extraction. Proc. AAAI Workshop on Machine Learning for IE, 15–20, AAAI, 1999.

  18. Valiant, L. G. A theory of the learnable. Comm. ACM 27:1134–1142, 1984.

  19. Wang, J. T., Chirn, G. W., Marr, T. G., Shapiro, B., Shasha, D., and Zhang, K. Combinatorial pattern discovery for scientific data: Some preliminary results. Proc. SIGMOD'94, 115–125, 1994.

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Taniguchi, K., Sakamoto, H., Arimura, H., Shimozono, S., Arikawa, S. (2001). Mining Semi-structured Data by Path Expressions. In: Jantke, K.P., Shinohara, A. (eds) Discovery Science. DS 2001. Lecture Notes in Computer Science, vol 2226. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45650-3_32

  • DOI: https://doi.org/10.1007/3-540-45650-3_32

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42956-2

  • Online ISBN: 978-3-540-45650-6

  • eBook Packages: Springer Book Archive
