Incorporating Prior Knowledge by Selective Context Features to Enhance Topic Coherence

  • Conference paper
New Trends in Computer Technologies and Applications (ICS 2018)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1013)

Abstract

Latent variable models have been widely used to extract topics from document collections, but because they are unsupervised and no manual labels are available, the topics covered by the resulting clusters are difficult to interpret. Considering the influence of preprocessing on subsequent data mining, this study extracts news feature words using the CSCP word segmentation method and named entity recognition (NER). For NER, the names of people, places, and organizations are extracted by syntax rules and verified against Wikipedia. For CSCP, the probability of character continuation, together with the distribution probabilities and the numbers of links before and after each character, is used to merge unit words iteratively and extract the important narrative words, which are then processed as Unigrams, Compounds, and Mixtures. The word corpora obtained from CSCP and NER processing are used as prior knowledge for LDA. Finally, the topic coherence of NER and CSCP is evaluated with the UMass topic coherence measure. The experimental results show that Compounds carry specific meanings and Mixtures cover a diverse scope. Unigrams and NER terms are relatively short, but NER terms accurately represent the important features of news content, so the resulting topics are more cohesive. In terms of efficiency, NER-LDA takes the longest to run, but it achieves the highest topic coherence.
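To make the evaluation step concrete, the following is a minimal sketch of the UMass coherence score as it is commonly defined: for a topic's top words ordered by descending probability, sum log((D(w_m, w_l) + 1) / D(w_l)) over every pair with l < m, where D counts the documents containing the given words. The function name `umass_coherence`, the token-list input format, and the choice to skip words that never occur are assumptions made for illustration; the abstract does not describe the paper's actual implementation.

```python
import math


def umass_coherence(top_words, documents):
    """Illustrative UMass coherence for one topic.

    top_words: topic words ordered by descending topic probability.
    documents: list of token lists (hypothetical input format).
    """
    # Document frequency only needs word presence, so use sets.
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain every word in `words`.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                # Assumption: ignore conditioning words absent from the corpus.
                continue
            d_ml = doc_freq(top_words[m], top_words[l])
            # The +1 smoothing avoids log(0) for pairs that never co-occur.
            score += math.log((d_ml + 1) / d_l)
    return score
```

Scores closer to zero indicate that a topic's top words co-occur in the same documents more often, which is the sense in which one feature set can be called more coherent than another.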

Acknowledgment

This research was supported by the Ministry of Science and Technology, Taiwan, R.O.C., under grant number MOST 105-2221-E-224-053.

Author information

Corresponding author

Correspondence to Chuen-Min Huang.

Copyright information

© 2019 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Huang, CM. (2019). Incorporating Prior Knowledge by Selective Context Features to Enhance Topic Coherence. In: Chang, CY., Lin, CC., Lin, HH. (eds) New Trends in Computer Technologies and Applications. ICS 2018. Communications in Computer and Information Science, vol 1013. Springer, Singapore. https://doi.org/10.1007/978-981-13-9190-3_32

  • DOI: https://doi.org/10.1007/978-981-13-9190-3_32

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-13-9189-7

  • Online ISBN: 978-981-13-9190-3

  • eBook Packages: Computer Science, Computer Science (R0)