Abstract
This paper discusses topic distillation, an information retrieval problem that is emerging as a critical task for the WWW. Algorithms for this problem must distill a small number of high-quality documents addressing a broad topic from a large set of candidates. We give a review of the literature, and compare the problem with related tasks such as classification, clustering, and indexing. We then describe a general approach to topic distillation with applications to searching and partitioning, based on the algebraic properties of matrices derived from particular documents within the corpus. Our method, which we call spectral filtering, combines the use of terms, hyperlinks and anchor-text to improve retrieval performance. We give results for broad-topic queries on the WWW, and also give some anecdotal results applying the same techniques to US Supreme Court law cases, US patents, and a set of Wall Street Journal newspaper articles.
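To make the phrase "algebraic properties of matrices" concrete, here is a minimal sketch of a generic hub/authority power iteration on a link graph, in the style of Kleinberg's link analysis. It is not the spectral filtering method itself, which by the abstract's description also uses terms and anchor-text. The function names, the toy graph, and the iteration count are all assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit Euclidean length (unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def hub_authority_scores(links, num_pages, iterations=50):
    """Approximate principal eigenvectors of the link matrix by power iteration.

    links: list of (source, target) pairs, pages numbered 0..num_pages-1.
    Returns (hubs, authorities), each with unit Euclidean length.
    """
    hubs = [1.0] * num_pages
    auths = [1.0] * num_pages
    for _ in range(iterations):
        # Authority score: sum of hub scores of the pages that link in.
        new_auths = [0.0] * num_pages
        for src, dst in links:
            new_auths[dst] += hubs[src]
        # Hub score: sum of authority scores of the pages linked to.
        new_hubs = [0.0] * num_pages
        for src, dst in links:
            new_hubs[src] += new_auths[dst]
        auths = _normalize(new_auths)
        hubs = _normalize(new_hubs)
    return hubs, auths


# Hypothetical toy graph: pages 0, 1, 2 link to page 3; pages 0, 1 also link to page 4.
toy_links = [(0, 3), (1, 3), (2, 3), (0, 4), (1, 4)]
hubs, auths = hub_authority_scores(toy_links, num_pages=5)
best_authority = max(range(5), key=lambda i: auths[i])
```

In this toy graph, page 3 receives the most in-links from good hubs and so ends up with the largest authority score, while pages 0 and 1 tie as the strongest hubs.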
Cite this article
Chakrabarti, S., Dom, B.E., Gibson, D. et al. Topic Distillation and Spectral Filtering. Artificial Intelligence Review 13, 409–435 (1999). https://doi.org/10.1023/A:1006596506229
DOI: https://doi.org/10.1023/A:1006596506229