Abstract
As the World Wide Web grows at an exponential rate, we face the problem of rating pages in terms of quality and trust. In this situation, with significant linkage among web pages, what other pages say about a web page can be as important as, and more objective than, what the page says about itself. The cumulative knowledge of such recommendations (or the lack of them) can help a system decide whether to pursue a page or not. This metadata can also be used by a web robot program, for example, to derive summary information about web documents written in a foreign language. In this paper, we describe how we exploit this type of metadata to drive a web information gathering system, which forms the backend of a topic-specific search engine. The system uses metadata from hyperlinks to guide its crawl of the web while staying focused on a target topic. The crawler follows links that point to information related to the topic and avoids following links to irrelevant pages. Moreover, the system uses the metadata to improve its definition of the target topic through association mining. Ultimately, the guided crawling system builds a rich repository of metadata, which is used to serve the search engine.
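The guided crawl described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy link graph `TOY_WEB`, the term-overlap `relevance` score, the `threshold` and `min_support` parameters, and all function names are assumptions made for illustration. In particular, a simple co-occurrence count stands in for the association mining the abstract mentions.

```python
from collections import Counter, deque

# Hypothetical toy web: page -> list of (anchor text, target page).
TOY_WEB = {
    "seed": [("XML parser tutorial", "a"), ("cheap flights", "b"),
             ("XML schema parser", "c")],
    "a": [("parser benchmarks", "d"), ("celebrity gossip", "e")],
    "b": [("hotel deals", "f")],
    "c": [("XML schema tools", "d")],
    "d": [], "e": [], "f": [],
}

def terms(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())

def relevance(anchor, topic):
    """Fraction of anchor-text terms that belong to the topic term set."""
    words = terms(anchor)
    return len(words & topic) / len(words) if words else 0.0

def expand_topic(topic, relevant_anchors, min_support=2):
    """Add non-topic terms seen in at least min_support relevant anchors.

    Assumption: a plain co-occurrence count, used here as a stand-in
    for association mining.
    """
    counts = Counter(w for a in relevant_anchors for w in terms(a) - topic)
    return topic | {w for w, n in counts.items() if n >= min_support}

def focused_crawl(web, seed, topic, threshold=0.3):
    """Breadth-first crawl that only follows links with relevant anchor text."""
    topic = set(topic)
    visited, relevant_anchors = [], []
    frontier, seen = deque([seed]), {seed}
    while frontier:
        page = frontier.popleft()
        visited.append(page)
        for anchor, target in web.get(page, []):
            if relevance(anchor, topic) >= threshold:
                relevant_anchors.append(anchor)
                if target not in seen:
                    seen.add(target)
                    frontier.append(target)
        # Refine the topic definition from the metadata gathered so far.
        topic = expand_topic(topic, relevant_anchors)
    return visited, topic

visited, learned_topic = focused_crawl(TOY_WEB, "seed", {"xml"})
```

In this toy run the crawl starts from the single topic term `xml`. The link labelled "parser benchmarks" is followed only because `parser` was added to the topic after the seed page was processed, while the pages behind "cheap flights", "celebrity gossip", and "hotel deals" are never fetched.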
© 2000 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yi, J., Sundaresan, N., Huang, A. (2000). Metadata Based Web Mining for Topic-Specific Information Gathering. In: Bauknecht, K., Madria, S.K., Pernul, G. (eds) Electronic Commerce and Web Technologies. EC-Web 2000. Lecture Notes in Computer Science, vol 1875. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44463-7_31
DOI: https://doi.org/10.1007/3-540-44463-7_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-67981-3
Online ISBN: 978-3-540-44463-3
eBook Packages: Springer Book Archive