Abstract
Existing topic identification techniques must tackle an important problem: they depend on human intervention, thus incurring major preparation costs and lacking operational flexibility when facing novelty. To resolve this issue, we propose an adaptable and autonomous algorithm that discovers topics in unstructured text documents. The algorithm is based on principles that differ from existing natural language processing and artificial intelligence techniques. These principles involve the retrieval, activation and decay of general-purpose lexical knowledge, inspired by how the brain may process information when someone reads. The algorithm handles words sequentially in a single document, contrary to the usual corpus-based bag-of-words approach. Empirical results demonstrate the potential of the new algorithm.
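The abstract gives no implementation details, so the following is only a minimal sketch of the general idea of sequential retrieval, activation and decay, not the algorithm of the paper. The lexicon `TOY_LEXICON`, the function `identify_topics`, and the `decay`, `boost` and `top_k` parameters are all hypothetical names and values chosen for illustration.

```python
from collections import defaultdict

# Hypothetical toy lexicon (word -> related concepts). The entries are
# invented for illustration; a real system would draw on a general-purpose
# lexical resource instead.
TOY_LEXICON = {
    "bank": ["finance", "river"],
    "loan": ["finance"],
    "interest": ["finance"],
    "rate": ["finance"],
    "water": ["river"],
}


def identify_topics(words, lexicon, decay=0.8, boost=1.0, top_k=2):
    """Read words in order and return the most activated concepts.

    Assumed scheme (not taken from the paper): at each word, all current
    activations are multiplied by `decay`, then every concept retrieved
    for that word gains `boost`.
    """
    activation = defaultdict(float)
    for word in words:
        # Decay: concepts that are not reinforced fade over time.
        for concept in activation:
            activation[concept] *= decay
        # Retrieval and activation: look the word up and boost its concepts.
        for concept in lexicon.get(word.lower(), []):
            activation[concept] += boost
    # Rank by activation (highest first), breaking ties alphabetically.
    ranked = sorted(activation.items(), key=lambda kv: (-kv[1], kv[0]))
    return [concept for concept, _ in ranked[:top_k]]


if __name__ == "__main__":
    text = "the bank raised the loan interest rate"
    print(identify_topics(text.split(), TOY_LEXICON, top_k=1))
```

Because words are handled one at a time within a single document, the sketch needs no corpus statistics. Concepts that recur keep a high activation, while an early ambiguous reading (here, the "river" sense of "bank") fades unless later words reinforce it.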
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Massey, L. (2011). Autonomous and Adaptive Identification of Topics in Unstructured Text. In: König, A., Dengel, A., Hinkelmann, K., Kise, K., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems. KES 2011. Lecture Notes in Computer Science, vol 6882. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-23863-5_1
DOI: https://doi.org/10.1007/978-3-642-23863-5_1
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-23862-8
Online ISBN: 978-3-642-23863-5
eBook Packages: Computer Science, Computer Science (R0)