Mining Semi-structured Data by Path Expressions

Taniguchi, Katsuaki; Sakamoto, Hiroshi; Arimura, Hiroki; Shimozono, Shinichi; Arikawa, Setsuo

doi:10.1007/3-540-45650-3_32

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2226)

Included in the following conference series:

International Conference on Discovery Science

Abstract

A new data model for filtering semi-structured texts is presented. Given positive and negative examples of HTML pages labeled by a labelling function, each HTML page is divided into a set of paths using an XML parser. A path is a sequence of element nodes and text nodes in which a text node appears only at the tail of the path. The label of an element node is called a tag, and the label of a text node is called a text. The goal of a mining algorithm is to find an interesting pattern, called an association path, which is a pair of a tag-sequence t and a word-sequence w represented by a word-association pattern [1]. An association path (t,w) agrees with a labelling function on a path p if the following equivalence holds: t is a subsequence of the tag-sequence of p and w matches the text of p if and only if p belongs to a positive example. The importance of an association path α is measured by its agreement with the labelling function on the given data, i.e., the number of paths on which α agrees with the labelling function. We present a mining algorithm for this problem and show the efficiency of this model by experiments.
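
The definitions in the abstract can be illustrated with a minimal Python sketch. It is not the authors' mining algorithm; it only evaluates the agreement of one candidate association path. The function names (`extract_paths`, `is_subsequence`, `matches_words`, `agreement`) are hypothetical, the input is assumed to be well-formed markup that Python's `xml.etree.ElementTree` can parse, and the word-association pattern of [1] is simplified here to "the words of w occur in the text in the given order", which is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET


def extract_paths(markup):
    """Return a list of (tag_sequence, text), one entry per non-empty text node.

    Each path runs from the root element down to the element that
    contains the text node, so the text sits only at the tail.
    """
    paths = []

    def walk(elem, prefix):
        tags = prefix + (elem.tag,)
        if elem.text and elem.text.strip():
            paths.append((tags, elem.text.strip()))
        for child in elem:
            walk(child, tags)
            # Text following a child element belongs to the parent element.
            if child.tail and child.tail.strip():
                paths.append((tags, child.tail.strip()))

    walk(ET.fromstring(markup), ())
    return paths


def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in the same order."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def matches_words(w, text):
    """Simplified stand-in for a word-association pattern (assumption):
    the words of w must occur in text in order, with no proximity bound."""
    return is_subsequence(w, text.lower().split())


def agreement(t, w, labelled_pages):
    """Count the paths on which the association path (t, w) agrees with the labels.

    labelled_pages is a list of (markup, is_positive) pairs. The pattern
    agrees on a path when "t and w both match" holds exactly when the
    path comes from a positive example.
    """
    count = 0
    for markup, is_positive in labelled_pages:
        for tags, text in extract_paths(markup):
            matched = is_subsequence(t, tags) and matches_words(w, text)
            if matched == is_positive:
                count += 1
    return count


positive_page = "<html><body><h1>Discovery Science</h1><p>data mining results</p></body></html>"
negative_page = "<html><body><p>weather report</p></body></html>"
pages = [(positive_page, True), (negative_page, False)]

print(agreement(("body", "p"), ("data", "mining"), pages))  # 2
```

In this toy input the candidate agrees on two of the three paths: it matches the `p` path of the positive page and correctly fails to match the only path of the negative page, but it misses the `h1` path of the positive page. A mining algorithm in the sense of the abstract would search for the pair (t,w) that maximizes this count.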

References

Shimozono, S., Arimura, H., and Arikawa, S. Efficient discovery of optimal word-association patterns in large text databases. New Generation Computing 18:49–60, 2000.
Arora, S. Polynomial-time approximation schemes for Euclidean TSP and other geometric problems. Proc. 37th IEEE Symposium on Foundations of Computer Science, 2–12, 1996.
Abiteboul, S., Buneman, P., and Suciu, D. Data on the Web: From relations to semistructured data and XML, Morgan Kaufmann, San Francisco, CA, 2000.
Angluin, D. Queries and concept learning. Machine Learning 2:319–342, 1988.
Buneman, P., Davidson, S., Hillebrand, G., and Suciu, D. A query language and optimization techniques for unstructured data. University of Pennsylvania, Computer and Information Science Department, Technical Report MS-CIS 96-09, 1996.
Cohen, W. W. and Fan, W. Learning Page-Independent Heuristics for Extracting Data from Web Pages, Proc. WWW-99, 1999.
Craven, M., DiPasquo, D., Freitag, D., McCallum, A., Mitchell, T., Nigam, K., and Slattery, S. Learning to construct knowledge bases from the World Wide Web, Artificial Intelligence 118:69–113, 2000.
Freitag, D. Information extraction from HTML: Application of a general machine learning approach. Proc. the 15th National Conference on Artificial Intelligence, 517–523, 1998.
Grieser, G., Jantke, K. P., Lange, S., and Thomas, B. A unifying approach to HTML wrapper representation and learning, Proc. the 3rd International Conference, DS2000, Lecture Notes in Artificial Intelligence 1967:50–64, 2000.
Hammer, J., Garcia-Molina, H., Cho, J., and Crespo, A. Extracting semistructured information from the Web. Proc. Workshop on Management of Semistructured Data, 18–25, 1997.
Hsu, C.-N. Initial results on wrapping semistructured web pages with finite-state transducers and contextual rules. Proc. 1998 Workshop on AI and Information Integration, 66–73, 1998.
Kamada, T. Compact HTML for small information appliances. W3C NOTE 09-Feb-1998. http://www.w3.org/TR/1998/NOTE-compactHTML-19980209, 1998.
Kushmerick, N. Wrapper induction: efficiency and expressiveness. Artificial Intelligence 118:15–68, 2000.
Lin, S., and Kernighan, B. W. An effective heuristic algorithm for the travelling salesman problem. Operations Research 21:498–516, 1973.
Muslea, I., Minton, S., and Knoblock, C. A. Wrapper induction for semistructured, web-based information sources. Proc. Conference on Automated Learning and Discovery, 1998.
Sakamoto, H., Arimura, H., and Arikawa, S. Identification of tree translation rules from examples. Proc. the 5th International Colloquium on Grammatical Inference, LNAI 1891:241–255, 2000.
Thomas, B. Anti-unification based learning of T-Wrappers for information extraction, Proc. AAAI Workshop on Machine Learning for IE, 15–20, AAAI, 1999.
Valiant, L. G. A theory of the learnable. Comm. ACM 27:1134–1142, 1984.
Wang, J. T., Chirn, G. W., Marr, T. G., Shapiro, B., Shasha, D., and Zhang, K. Combinatorial pattern discovery for scientific data: Some preliminary results. Proc. SIGMOD'94, 115–125, 1994.

Author information

Authors and Affiliations

Department of Informatics, Kyushu University, 812-8581, Fukuoka, Japan
Katsuaki Taniguchi, Hiroshi Sakamoto, Hiroki Arimura & Setsuo Arikawa
PRESTO, Japan Science Technology Co., Japan
Hiroki Arimura
Department of Artificial Intelligence, Kyushu Institute of Technology, 820-8502, Iizuka, Japan
Shinichi Shimozono

Editor information

Editors and Affiliations

DFKI GmbH Saarbrücken, 66123, Saarbrücken, Germany
Klaus P. Jantke
Department of Informatics, Kyushu University, 6-10-1 Hakozaki, Higashi-ku, 812-8581, Fukuoka, Japan
Ayumi Shinohara

Cite this paper

Taniguchi, K., Sakamoto, H., Arimura, H., Shimozono, S., Arikawa, S. (2001). Mining Semi-structured Data by Path Expressions. In: Jantke, K.P., Shinohara, A. (eds) Discovery Science. DS 2001. Lecture Notes in Computer Science, vol 2226. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45650-3_32

DOI: https://doi.org/10.1007/3-540-45650-3_32
Published: 20 December 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42956-2
Online ISBN: 978-3-540-45650-6
eBook Packages: Springer Book Archive