Summary
The World-Wide Web provides every internet citizen with access to an abundance of information, but it becomes increasingly difficult to identify the relevant pieces of information. Research in web mining tries to address this problem by applying techniques from data mining and machine learning to Web data and documents. This chapter provides a brief overview of web mining techniques and research areas, most notably hypertext classification, wrapper induction, recommender systems and web usage mining.
Copyright information
© 2009 Springer Science+Business Media, LLC
About this chapter
Cite this chapter
Fürnkranz, J. (2009). Web Mining. In: Maimon, O., Rokach, L. (eds) Data Mining and Knowledge Discovery Handbook. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-09823-4_47
DOI: https://doi.org/10.1007/978-0-387-09823-4_47
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-09822-7
Online ISBN: 978-0-387-09823-4
eBook Packages: Computer Science (R0)