Abstract
The Web contains a tremendous amount of information. It is challenging to determine which Web documents are relevant to a user query, and even more challenging to rank them according to their degrees of relevance. In this paper, we propose a probabilistic retrieval model that uses logistic regression to recognize multiple-record Web documents with respect to an application ontology, a simple conceptual modeling approach. We observe that many Web documents contain a sequence of chunks of textual information, each of which constitutes a “record.” Documents of this type are referred to as multiple-record documents. In our categorization approach, a document is represented by a set of term frequencies of index terms, a density heuristic value, and a grouping heuristic value. We first apply logistic regression analysis to the relevance probabilities using the (i) index terms, (ii) density value, and (iii) grouping value of each training document. Thereafter, the relevance probability of each test document is interpolated from the fitted curves. Contrary to other probabilistic retrieval models, our model makes only a weak independence assumption and is capable of handling any important dependency relationships among index terms. In addition, we use logistic regression, instead of linear regression analysis, because the relevance probabilities of training documents are discrete. Using a test set of car-ads Web documents and another of obituary Web documents, our probabilistic model achieves an average recall ratio of 100%, precision ratio of 83.3%, and accuracy ratio of 92.5%.
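The pipeline summarized in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the feature layout (two index-term frequencies, a density value, a grouping value), the toy training data, the function names, the gradient-descent fit, and the small L2 penalty are all hypothetical choices made only to show how a logistic model maps such features to a relevance probability and how recall, precision, and accuracy are computed from binary decisions.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, epochs=2000, l2=0.01):
    # Batch gradient descent on the mean log-loss.
    # The L2 penalty is an assumption of this sketch; it keeps the weights
    # finite when the toy training data are linearly separable.
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def relevance_probability(w, b, x):
    # Estimated probability that a document with features x is relevant.
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def evaluate(y_true, y_pred):
    # Recall, precision, and accuracy for binary relevance decisions.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / len(y_true) if y_true else 0.0
    return recall, precision, accuracy

# Hypothetical features: [tf(term 1), tf(term 2), density, grouping].
X_train = [
    [5, 3, 0.40, 0.9], [4, 4, 0.35, 0.8], [6, 2, 0.50, 1.0],  # relevant
    [0, 1, 0.02, 0.1], [1, 0, 0.05, 0.2], [0, 0, 0.01, 0.0],  # not relevant
]
y_train = [1, 1, 1, 0, 0, 0]  # discrete (binary) relevance labels

w, b = fit_logistic(X_train, y_train)
p_relevant = relevance_probability(w, b, [5, 2, 0.45, 0.85])
p_irrelevant = relevance_probability(w, b, [0, 1, 0.03, 0.10])
print(p_relevant, p_irrelevant)
```

Because the training labels are binary, a logistic curve keeps every estimate inside the interval (0, 1), whereas a linear fit can produce values outside it. A document would then be accepted as relevant when its estimated probability exceeds a chosen threshold, for example 0.5 in this sketch.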
Cite this article
Wang, Q., Ng, YK. An Ontology-Based Binary-Categorization Approach for Recognizing Multiple-Record Web Documents Using a Probabilistic Retrieval Model. Information Retrieval 6, 295–332 (2003). https://doi.org/10.1023/A:1026024513043