An Intelligent Information System for Organizing Online Text Documents

Kim, Han-joon; Lee, Sang-goo

doi:10.1007/s10115-003-0103-z

Published: March 2004

Knowledge and Information Systems, Volume 6, pages 125–149 (2004)

Abstract.

This paper describes an intelligent information system for effectively managing huge amounts of online text documents (such as Web documents) in a hierarchical manner. The organizational capabilities of this system are able to evolve semi-automatically with minimal human input. The system starts with an initial taxonomy in which documents are automatically categorized, and then evolves so as to provide a good indexing service as the document collection grows or its usage changes. To this end, we propose a series of algorithms that utilize text-mining technologies such as document clustering, document categorization, and hierarchy reorganization. In particular, clustering and categorization algorithms have been intensively studied in order to provide evolving facilities for hierarchical structures and categorization criteria. Through experiments using the Reuters-21578 document collection, we evaluate the performance of the proposed clustering and categorization methods by comparing them to those of well-known conventional methods.
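
As a rough illustration of the categorization step mentioned in the abstract, the following minimal sketch assigns a new document to the closest category of an initial taxonomy. It is not the authors' algorithm: the bag-of-words representation, the cosine-similarity rule, and all names (`vectorize`, `build_centroids`, `categorize`, the toy taxonomy) are assumptions made only for this example.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a text into a lowercased bag-of-words term-frequency vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_centroids(taxonomy):
    """Sum the term vectors of each category's example documents.

    `taxonomy` is a hypothetical dict mapping a category name to a list
    of example texts. Cosine similarity ignores vector length, so the
    summed vector behaves like a mean vector here.
    """
    centroids = {}
    for category, docs in taxonomy.items():
        total = Counter()
        for doc in docs:
            total.update(vectorize(doc))
        centroids[category] = total
    return centroids


def categorize(text, centroids):
    """Return the category whose centroid is most similar to the text."""
    vec = vectorize(text)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))


# Toy initial taxonomy (illustrative data, not taken from Reuters-21578).
taxonomy = {
    "grain": ["wheat corn harvest export", "corn crop yield"],
    "money": ["bank interest rate", "currency exchange rate bank"],
}
centroids = build_centroids(taxonomy)
label = categorize("wheat harvest falls", centroids)
```

A system that evolves as described in the abstract would additionally re-cluster documents and reorganize the hierarchy over time; this sketch covers only a single flat categorization pass.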

Author information

Authors and Affiliations

Department of Electrical and Computer Engineering, The University of Seoul, 90 Jeonnong-dong, Dongdaemun-gu, Seoul, 130-743, Korea
Han-joon Kim
School of Computer Science and Engineering, Seoul National University, Seoul, Korea
Sang-goo Lee

Corresponding author

Correspondence to Han-joon Kim.

Received: 02 May 2001

About this article

Cite this article

Kim, Hj., Lee, Sg. An Intelligent Information System for Organizing Online Text Documents. Knowledge and Information Systems 6, 125–149 (2004). https://doi.org/10.1007/s10115-003-0103-z

Revised: 05 July 2002
Accepted: 25 August 2002
Issue Date: March 2004
DOI: https://doi.org/10.1007/s10115-003-0103-z
