Semi-Automatic Construction of Metadata from a Series of Web Documents

Hirokawa, Sachio; Itoh, Eisuke; Miyahara, Tetsuhiro

doi:10.1007/978-3-540-24581-0_81

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2903)

Included in the following conference series:

Australasian Joint Conference on Artificial Intelligence

Abstract

Metadata plays an important role in discovering, collecting, extracting and aggregating Web data. This paper proposes a method of constructing metadata for a specific topic. The method uses Web pages that are located in a single site and are linked from a listing page. Web pages of recipes, real estate, used cars, hotels and syllabi are typical examples. We call such pages a series of Web documents. The pages in a series look alike when viewed in a browser, because they are usually written with the same tag pattern. The method uses this tag pattern as the common structure of the Web pages.
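As an illustration of what a shared tag pattern might look like in practice, the following minimal sketch reduces a page to its sequence of tags and compares two pages. It is not the authors' implementation: the names `TagPatternParser` and `tag_pattern` are hypothetical, and treating "same tag pattern" as exact equality of tag sequences is an assumption made here for simplicity.

```python
from html.parser import HTMLParser


class TagPatternParser(HTMLParser):
    """Collects the sequence of tags in a page, ignoring text and attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.tags.append(f"</{tag}>")


def tag_pattern(html):
    """Return the page's tag sequence as a tuple (hypothetical helper)."""
    parser = TagPatternParser()
    parser.feed(html)
    parser.close()
    return tuple(parser.tags)


# Two hotel pages with different contents but the same markup.
page_a = "<html><body><h1>Hotel A</h1><p>Location</p><p>Tokyo</p></body></html>"
page_b = "<html><body><h1>Hotel B</h1><p>Location</p><p>Osaka</p></body></html>"
same_series = tag_pattern(page_a) == tag_pattern(page_b)
```

Real pages in a series often differ slightly in markup, so a practical version would probably need a tolerance or similarity measure rather than strict equality.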

The individual content of each page appears as plain text embedded between consecutive tags. If we remove the tags, a page becomes a sequence of plain texts. Texts at the same relative position can be interpreted as values of the same attribute, provided that the pages represent records of the same kind.
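The tag-removal step can be sketched as follows, using Python's standard `html.parser`. The names `TextSequenceParser` and `text_sequence` are hypothetical, and dropping whitespace-only text is an assumption of this sketch rather than something the abstract specifies.

```python
from html.parser import HTMLParser


class TextSequenceParser(HTMLParser):
    """Collects the plain texts found between tags, in document order."""

    def __init__(self):
        super().__init__()
        self.texts = []

    def handle_data(self, data):
        text = data.strip()
        if text:  # assumption: whitespace-only text carries no content
            self.texts.append(text)


def text_sequence(html):
    """Return the page as a list of plain texts (hypothetical helper)."""
    parser = TextSequenceParser()
    parser.feed(html)
    parser.close()
    return parser.texts


page = "<table><tr><td>Location</td><td>Tokyo</td></tr></table>"
texts = text_sequence(page)
```

Note that skipping empty texts shifts the positions of later texts when a value is missing on one page, so position-based comparison across pages assumes every page fills every slot.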

Most of these plain texts vary from page to page. However, the same text may appear at the same relative position in almost all pages. Such constant texts can be considered attribute names. “Location”, “Rating” and “Travel from Airport” are examples of such constant texts for pages of hotel information. If the frequency of a text is higher than a threshold, we accept it as a component of the metadata.
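One way to read the frequency test is sketched below: for each position, find the most common text across pages and label the position "N" (constant) if its share of pages exceeds the threshold, otherwise "V" (variable). The function name `label_positions`, the default threshold of 0.8, and the truncation to the shortest page are all assumptions; the abstract does not give a threshold value or say how pages of different length are aligned.

```python
from collections import Counter


def label_positions(sequences, threshold=0.8):
    """Label each text position as ("N", name) or ("V", None).

    sequences: one list of plain texts per page (at least one page).
    threshold: assumed fraction of pages; not a value from the paper.
    """
    n_pages = len(sequences)
    # Assumption: compare only positions that exist on every page.
    length = min(len(seq) for seq in sequences)
    labels = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        text, freq = counts.most_common(1)[0]
        if freq / n_pages > threshold:
            labels.append(("N", text))   # constant text: attribute name
        else:
            labels.append(("V", None))   # variable text: attribute value
    return labels


pages = [
    ["Location", "Tokyo", "Rating", "4"],
    ["Location", "Osaka", "Rating", "5"],
    ["Location", "Kyoto", "Rating", "3"],
]
labels = label_positions(pages)
```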

If we mark a constant text with “N” and a variable text with “V”, the sequence of plain texts forms a string of N’s and V’s. A page in a series contains two kinds of NV sequence patterns. The first pattern is (NV)ⁿ, which we call vertical, where each attribute value immediately follows its attribute name. The second pattern is NⁿVⁿ, which we call horizontal, where the names occur in the first row and the same number of values follow in the next row. By recognizing these patterns, we can determine the meaning of each value and construct records from a series of Web pages.
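Both patterns can be handled by one rule, sketched here under stated assumptions: split the N/V string into maximal runs, and whenever a run of k names is immediately followed by a run of exactly k values, pair them in order. With k = 1 this covers the vertical pattern (NV)ⁿ, and with larger k the horizontal pattern NⁿVⁿ. The function `build_record` is hypothetical, and silently skipping runs of unequal length is a simplification, not behavior described in the abstract.

```python
from itertools import groupby


def build_record(labels, texts):
    """Pair attribute names with values using the N/V label string.

    labels: string or list of "N"/"V", one per text position.
    texts:  the plain texts of a single page, in the same order.
    """
    # Maximal runs with their positions, e.g. "NNVV" -> [("N", [0, 1]), ("V", [2, 3])]
    runs = [
        (kind, [i for i, _ in group])
        for kind, group in groupby(enumerate(labels), key=lambda pair: pair[1])
    ]
    record = {}
    for (kind1, idx1), (kind2, idx2) in zip(runs, runs[1:]):
        # A run of k names followed by k values: pair them positionally.
        if kind1 == "N" and kind2 == "V" and len(idx1) == len(idx2):
            for n, v in zip(idx1, idx2):
                record[texts[n]] = texts[v]
    return record


vertical = build_record("NVNV", ["Location", "Tokyo", "Rating", "4"])
horizontal = build_record("NNVV", ["Location", "Rating", "Tokyo", "4"])
```

In this example both layouts yield the same record, which is the point of distinguishing the two patterns: the value positions differ, but the name-value pairing is recovered either way.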

Author information

Authors and Affiliations

Computing and Communications Center, Kyushu University, Hakozaki 6-10-1, Higashi-ku, Fukuoka, 812-8581, Japan
Sachio Hirokawa & Eisuke Itoh
Faculty of Information Sciences, Hiroshima City University, Otsuka-Higashi 3-4-1, Asaminami-ku, Hiroshima, 731-3194, Japan
Tetsuhiro Miyahara

Editor information

Editors and Affiliations

Department of Computer Science, Australian National University, ACT 0200, Acton, Australia
Tamás (Tom) Domonkos Gedeon
Murdoch University,
Lance Chun Che Fung

About this paper

Cite this paper

Hirokawa, S., Itoh, E., Miyahara, T. (2003). Semi-Automatic Construction of Metadata from a Series of Web Documents. In: Gedeon, T.D., Fung, L.C.C. (eds) AI 2003: Advances in Artificial Intelligence. AI 2003. Lecture Notes in Computer Science, vol 2903. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24581-0_81

DOI: https://doi.org/10.1007/978-3-540-24581-0_81
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-20646-0
Online ISBN: 978-3-540-24581-0
eBook Packages: Springer Book Archive
