Abstract
Latent variable models have been widely used to extract topics from document collections, but because this unsupervised approach produces clusters without human-assigned labels, the resulting topics can be difficult to interpret. Given the influence of preprocessing on subsequent data mining, this study extracts news feature words using the CSCP word segmentation method and named entity recognition (NER). For NER, names of people, places, and organizations are extracted by syntactic rules and verified against Wikipedia. For CSCP, the continuation probability of adjacent characters, together with the distribution probabilities and the numbers of links before and after each character, is used to iteratively merge unit words and effectively extract the important narrative words, which are then processed into Unigram, Compound, and Mixture word forms. The corpora obtained after word processing by CSCP and NER serve as prior knowledge for LDA. Finally, the topic coherence of NER and CSCP is evaluated with the UMass topic coherence measure. The experimental results show that Compounds carry specific meanings, Mixtures represent a diverse scope, and Unigrams and NER terms are relatively short, while NER accurately captures the important features of news content and yields more cohesive topics. In terms of efficiency, NER-LDA takes the longest to run, yet it achieves the highest degree of topic coherence.
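As a point of reference for the evaluation, the UMass topic coherence measure (Mimno et al., 2011) scores a topic by summing, over every ordered pair of its top-M words, the log of the smoothed co-document frequency divided by the document frequency of the higher-ranked word. The following is a minimal Python sketch of that measure, not the authors' implementation; the function name umass_coherence and the plain list-of-tokens corpus representation are illustrative assumptions.

    import math

    def umass_coherence(top_words, documents):
        """UMass topic coherence for one topic (illustrative helper).

        top_words: the topic's top-M words, ordered by descending probability.
        documents: iterable of tokenized documents (lists of word strings).
        """
        doc_sets = [set(doc) for doc in documents]
        # Document frequency D(w) for each top word.
        df = {w: sum(w in d for d in doc_sets) for w in top_words}

        score = 0.0
        for m in range(1, len(top_words)):
            for l in range(m):
                w_m, w_l = top_words[m], top_words[l]
                # Co-document frequency D(w_m, w_l).
                co_df = sum(w_m in d and w_l in d for d in doc_sets)
                # Smoothed log ratio; assumes D(w_l) > 0, i.e. every
                # top word occurs in at least one document.
                score += math.log((co_df + 1) / df[w_l])
        return score

Higher (less negative) scores indicate that a topic's top words co-occur more often in the corpus, which is the sense in which the NER-LDA topics are reported as the most coherent.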
Acknowledgment
This research is supported by the Ministry of Science and Technology, Taiwan, R.O.C., under grant number MOST 105-2221-E-224-053.
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
Cite this paper
Huang, CM. (2019). Incorporating Prior Knowledge by Selective Context Features to Enhance Topic Coherence. In: Chang, CY., Lin, CC., Lin, HH. (eds) New Trends in Computer Technologies and Applications. ICS 2018. Communications in Computer and Information Science, vol 1013. Springer, Singapore. https://doi.org/10.1007/978-981-13-9190-3_32
DOI: https://doi.org/10.1007/978-981-13-9190-3_32
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-9189-7
Online ISBN: 978-981-13-9190-3
eBook Packages: Computer Science (R0)