An Intelligent Information System for Organizing Online Text Documents

Knowledge and Information Systems

Abstract.

This paper describes an intelligent information system that effectively manages large volumes of online text documents (such as Web documents) in a hierarchical manner. The system's organizational capabilities evolve semi-automatically with minimal human input. It starts with an initial taxonomy into which documents are automatically categorized, and then evolves so that it continues to provide a good indexing service as the document collection grows or its usage changes. To this end, we propose a series of algorithms that use text-mining technologies such as document clustering, document categorization, and hierarchy reorganization. In particular, the clustering and categorization algorithms have been studied intensively so that both the hierarchical structure and the categorization criteria can evolve. In experiments on the Reuters-21578 document collection, we evaluate the proposed clustering and categorization methods by comparing their performance with that of well-known conventional methods.
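To make the idea of categorizing documents into an initial taxonomy concrete, the following is a minimal illustrative sketch. It is not the algorithm proposed in the paper, whose details are not given in this abstract. It assumes a simple centroid-based classifier with cosine similarity over raw term frequencies, and the category names, seed documents, and function names are all hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn text into a term-frequency vector (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_centroids(taxonomy):
    """Sum the vectors of each category's seed documents into one centroid."""
    centroids = {}
    for category, docs in taxonomy.items():
        total = Counter()
        for doc in docs:
            total.update(vectorize(doc))
        centroids[category] = total
    return centroids


def categorize(text, centroids):
    """Return the category whose centroid is most similar to the text."""
    vec = vectorize(text)
    return max(centroids, key=lambda c: cosine(vec, centroids[c]))


# Hypothetical two-category taxonomy with made-up seed documents
# (assumption: these are not taken from Reuters-21578 or from the paper).
taxonomy = {
    "finance/grain": ["wheat corn harvest export", "grain shipment wheat tonnes"],
    "finance/energy": ["crude oil barrel price", "oil opec production output"],
}
centroids = build_centroids(taxonomy)
label = categorize("opec raises crude oil output", centroids)
```

In a system of the kind the abstract describes, the centroids would presumably be recomputed as documents arrive and as the hierarchy is reorganized. This sketch keeps them fixed for brevity.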

Author information

Correspondence to Han-joon Kim.

Additional information

2 May 2001

About this article

Cite this article

Kim, Hj., Lee, Sg. An Intelligent Information System for Organizing Online Text Documents. Knowledge and Information Systems 6, 125–149 (2004). https://doi.org/10.1007/s10115-003-0103-z

  • DOI: https://doi.org/10.1007/s10115-003-0103-z
