Abstract.
This paper describes an intelligent information system that manages large collections of online text documents (such as Web documents) hierarchically. The system's organizational capabilities evolve semi-automatically with minimal human input. The system starts with an initial taxonomy into which documents are automatically categorized, and then evolves so that it continues to provide a good indexing service as the document collection grows or its usage changes. To this end, we propose a series of algorithms that use text-mining techniques such as document clustering, document categorization, and hierarchy reorganization. In particular, we study clustering and categorization algorithms intensively in order to provide facilities for evolving the hierarchical structure and the categorization criteria. Through experiments on the Reuters-21578 document collection, we evaluate the proposed clustering and categorization methods by comparing their performance with that of well-known conventional methods.
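To make the idea of categorizing documents into an initial taxonomy concrete, the following is a minimal sketch under stated assumptions. It is not the algorithm proposed in the paper: it uses a plain nearest-centroid rule with cosine similarity over bag-of-words vectors, and the category paths, seed texts, and function names (`vectorize`, `cosine`, `build_centroids`, `categorize`) are hypothetical, chosen only for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a lowercase bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_centroids(taxonomy):
    """Sum the vectors of each category's seed documents into one centroid."""
    centroids = {}
    for category, docs in taxonomy.items():
        centroid = Counter()
        for doc in docs:
            centroid.update(vectorize(doc))
        centroids[category] = centroid
    return centroids


def categorize(text, centroids):
    """Return the category whose centroid is most similar to the text."""
    vec = vectorize(text)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))


# Hypothetical taxonomy: category paths and seed texts are invented
# for illustration and do not come from the paper.
taxonomy = {
    "economy/grain": ["wheat corn harvest export", "grain shipment wheat tonnes"],
    "economy/energy": ["crude oil barrel opec", "oil price refinery output"],
}
centroids = build_centroids(taxonomy)
label = categorize("opec raises crude oil output", centroids)
print(label)
```

In a system of the kind the abstract describes, such a fixed rule would only be a starting point, since both the taxonomy and the categorization criteria are meant to change as the collection grows or its usage shifts.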
Additional information
2 May 2001
About this article
Cite this article
Kim, Hj., Lee, Sg. An Intelligent Information System for Organizing Online Text Documents. Knowledge and Information Systems 6, 125–149 (2004). https://doi.org/10.1007/s10115-003-0103-z
DOI: https://doi.org/10.1007/s10115-003-0103-z