Abstract
Latent variable models have been widely used to extract topics from document collections, but because this unsupervised approach produces clusters without human-assigned labels, the resulting topics can be difficult to interpret. Given the influence of preprocessing on subsequent data mining, this study extracts news feature words using the CSCP word segmentation method and named entity recognition (NER). For NER, names of people, places, and organizations are extracted by syntactic rules and verified against Wikipedia. For CSCP, the continuation probability of adjacent characters, together with the distribution probabilities and the numbers of links before and after each character, is used to iteratively merge unit words and effectively extract the important narrative words, which are then processed into Unigram, Compound, and Mixture word forms. The corpora obtained after word processing by CSCP and NER serve as prior knowledge for LDA. Finally, the topic coherence of NER and CSCP is evaluated with the UMass topic coherence measure. The experimental results show that Compounds carry specific meanings, Mixtures represent a diverse scope, and Unigrams and NER terms are relatively short, while NER accurately captures the important features of news content and yields more cohesive topics. In terms of efficiency, NER-LDA takes the longest to run, yet it achieves the highest degree of topic coherence.
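As a point of reference for the evaluation, the UMass topic coherence measure (Mimno et al., 2011) scores a topic by summing, over every ordered pair of its top-M words, the log of the smoothed co-document frequency divided by the document frequency of the higher-ranked word. The following is a minimal Python sketch of that measure, not the authors' implementation; the function name umass_coherence and the plain list-of-tokens corpus representation are illustrative assumptions.

    import math

    def umass_coherence(top_words, documents):
        """UMass topic coherence for one topic (illustrative helper).

        top_words: the topic's top-M words, ordered by descending probability.
        documents: iterable of tokenized documents (lists of word strings).
        """
        doc_sets = [set(doc) for doc in documents]
        # Document frequency D(w) for each top word.
        df = {w: sum(w in d for d in doc_sets) for w in top_words}

        score = 0.0
        for m in range(1, len(top_words)):
            for l in range(m):
                w_m, w_l = top_words[m], top_words[l]
                # Co-document frequency D(w_m, w_l).
                co_df = sum(w_m in d and w_l in d for d in doc_sets)
                # Smoothed log ratio; assumes D(w_l) > 0, i.e. every
                # top word occurs in at least one document.
                score += math.log((co_df + 1) / df[w_l])
        return score

Higher (less negative) scores indicate that a topic's top words co-occur more often in the corpus, which is the sense in which the NER-LDA topics are reported as the most coherent.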
Acknowledgment
This research is supported by the Ministry of Science and Technology, Taiwan, R.O.C., under grant number MOST 105-2221-E-224-053.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
Cite this paper
Huang, CM. (2019). Incorporating Prior Knowledge by Selective Context Features to Enhance Topic Coherence. In: Chang, CY., Lin, CC., Lin, HH. (eds) New Trends in Computer Technologies and Applications. ICS 2018. Communications in Computer and Information Science, vol 1013. Springer, Singapore. https://doi.org/10.1007/978-981-13-9190-3_32
DOI: https://doi.org/10.1007/978-981-13-9190-3_32
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-9189-7
Online ISBN: 978-981-13-9190-3
eBook Packages: Computer Science (R0)