An Ontology-Based Binary-Categorization Approach for Recognizing Multiple-Record Web Documents Using a Probabilistic Retrieval Model

Wang, Quan; Ng, Yiu-Kai

doi:10.1023/A:1026024513043

Published: September 2003

Volume 6, pages 295–332, (2003)

Information Retrieval

Abstract

The Web contains a tremendous amount of information. It is challenging to determine which Web documents are relevant to a user query, and even more challenging to rank them according to their degrees of relevance. In this paper, we propose a probabilistic retrieval model using logistic regression for recognizing multiple-record Web documents against an application ontology, a simple conceptual-modeling approach. We observe that many Web documents contain a sequence of chunks of textual information, each of which constitutes a “record.” Documents of this type are referred to as multiple-record documents. In our categorization approach, a document is represented by a set of term frequencies of index terms, a density heuristic value, and a grouping heuristic value. We first apply logistic regression analysis to the relevance probabilities using the (i) index terms, (ii) density value, and (iii) grouping value of each training document. Thereafter, the relevance probability of each test document is interpolated from the fitted curves. Contrary to other probabilistic retrieval models, our model makes only a weak independence assumption and is capable of handling any important dependence relationships among index terms. In addition, we use logistic regression, instead of linear regression analysis, because the relevance probabilities of training documents are discrete. On one test set of car-ad Web documents and another of obituary Web documents, our probabilistic model achieves an average recall ratio of 100%, a precision ratio of 83.3%, and an accuracy ratio of 92.5%.
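As a rough illustration of the idea in the abstract, the sketch below fits a logistic regression model to documents described by three features and then scores unseen documents. It is a minimal sketch under stated assumptions, not the authors' procedure: the feature values are invented toy numbers, the index-term frequencies are collapsed into a single normalized score, the model is fitted by plain gradient descent with a small L2 penalty (added so that the weights stay finite on linearly separable toy data), and the 0.5 decision threshold is an assumption. The names `fit_logistic`, `predict_proba`, and `evaluate` are hypothetical helpers.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.5, epochs=2000, l2=0.01):
    """Fit weights (bias first) by batch gradient descent on the
    L2-regularized log loss. The bias term is not regularized."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [
            wj - lr * (gj / n + (l2 * wj if j > 0 else 0.0))
            for j, (wj, gj) in enumerate(zip(w, grad))
        ]
    return w


def predict_proba(w, x):
    # Estimated probability that document x is relevant.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def evaluate(y_true, y_pred):
    """Return (recall, precision, accuracy) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    return recall, precision, accuracy


# Toy training data (invented for illustration only).
# Each row: [term-frequency score, density value, grouping value], all in [0, 1].
X_train = [
    [0.90, 0.80, 0.90], [0.80, 0.70, 0.80], [0.70, 0.90, 0.70], [0.85, 0.75, 0.95],
    [0.10, 0.20, 0.10], [0.20, 0.10, 0.30], [0.30, 0.20, 0.20], [0.15, 0.30, 0.10],
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = relevant, 0 = not relevant

weights = fit_logistic(X_train, y_train)

# Score two unseen documents and apply the assumed 0.5 threshold.
X_test = [[0.8, 0.8, 0.8], [0.2, 0.2, 0.2]]
probs = [predict_proba(weights, x) for x in X_test]
labels = [1 if p >= 0.5 else 0 for p in probs]
print(probs, labels)
```

The `evaluate` helper computes the three measures reported in the abstract: recall is the fraction of relevant documents that are accepted, precision is the fraction of accepted documents that are relevant, and accuracy is the fraction of all decisions that are correct.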

Author information

Authors and Affiliations

Computer Science Department, Brigham Young University, Provo, Utah, 84602, USA
Quan Wang & Yiu-Kai Ng

About this article

Cite this article

Wang, Q., Ng, YK. An Ontology-Based Binary-Categorization Approach for Recognizing Multiple-Record Web Documents Using a Probabilistic Retrieval Model. Information Retrieval 6, 295–332 (2003). https://doi.org/10.1023/A:1026024513043

Issue Date: September 2003
DOI: https://doi.org/10.1023/A:1026024513043
