research-article

Granular modeling of web documents: impact on information retrieval systems

Authors:
Elisabetta Fersini, University of Milano-Bicocca, Milano, Italy
Enza Messina, University of Milano-Bicocca, Milano, Italy
Francesco Archetti, University of Milano-Bicocca, Milano, Italy

WIDM '08: Proceedings of the 10th ACM workshop on Web information and data management, October 2008, Pages 111–118. https://doi.org/10.1145/1458502.1458520

Published: 30 October 2008

ABSTRACT

A central task in Information Retrieval (IR) is extracting and processing information from web pages. A common approach treats a web page as an atomic unit and models its textual content as a "bag-of-words". However, this representation does not reflect how people perceive a web page. A granular document representation, expressed in terms of semantic objects, can help identify the semantic areas of a web page and exploit them for different IR goals. In this paper we use a granular representation to define a new metric for evaluating the importance of semantic objects and to enhance the performance of IR systems. In particular, we show that this metric can be used not only for classification, where instances are assumed to be independent and identically distributed, but also to gauge the strength of the relationships between hypertextual documents and to exploit this information to improve page ranking performance.
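
As a rough illustration of the contrast drawn in the abstract, the following Python sketch compares a flat "bag-of-words" over a whole page with a granular representation in which each block contributes terms in proportion to an importance weight. The block segmentation, the weights, and all function names are assumptions made for illustration only; they are not the metric or the implementation defined in the paper.

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count terms."""
    return Counter(text.lower().split())

def flat_representation(blocks):
    """Treat the page as one atomic unit by merging the text of all blocks."""
    return bag_of_words(" ".join(text for text, _ in blocks))

def granular_representation(blocks):
    """Scale each block's term counts by its (hypothetical) importance weight."""
    weighted = Counter()
    for text, importance in blocks:
        for term, count in bag_of_words(text).items():
            weighted[term] += importance * count
    return weighted

# Toy page: (block text, hypothetical importance weight in [0, 1]).
page = [
    ("home news sports contact", 0.1),                        # navigation bar
    ("granular document modeling improves retrieval", 1.0),   # main content
    ("buy cheap news now", 0.0),                              # advertisement
]

flat = flat_representation(page)
granular = granular_representation(page)
```

In the flat representation every occurrence of a term counts equally, so a word that appears only in navigation or advertising blocks weighs as much as a word from the main content. In the granular sketch those occurrences are discounted by the weight of the block they come from.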

Index Terms

1. Applied computing
  1. Document management and text processing
    1. Document preparation
      1. Hypertext / hypermedia creation
2. Information systems
  1. Information retrieval
    1. Retrieval models and ranking

Published in
WIDM '08: Proceedings of the 10th ACM workshop on Web information and data management
October 2008
164 pages
ISBN: 9781605582603
DOI: 10.1145/1458502
Program Chairs:
Chee-Yong Chan, National University of Singapore, Singapore
Neoklis Polyzotis, University of California-Santa Cruz, USA
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
document classification
relational granular document modeling
visual layout analysis
web page ranking