Content Quality of Latent Dirichlet Allocation Summaries Constituted Using Unique Significant Words

Annamalai, Muthukkaruppan; Narawi, Amrah Mohd

doi:10.1007/978-981-10-7242-0_26

Muthukkaruppan Annamalai & Amrah Mohd Narawi

Part of the book series: Communications in Computer and Information Science (CCIS, volume 788)

Included in the following conference series:

International Conference on Soft Computing in Data Science

Abstract

Today's access to Big Data platforms has raised hopes of analysing large volumes of online documents and making sense of them quickly, which has given further impetus to automated unsupervised text summarisation. In this regard, Latent Dirichlet Allocation (LDA) is a popular topic-model-based text summarisation method. However, the generated LDA topic word summaries contain many redundant words, i.e., duplicates and morphological variants. We hypothesise that duplicate words do not improve the content quality of a summary but that, for good reasons, the morphological variants do. This work sets out to investigate that hypothesis. To do so, a unique LDA summary of significant topic words is constituted from the LDA summary by removing the duplicate words while retaining the distinctive morphological variants. The divergence probability of the unique LDA summary is compared against that of an LDA baseline summary of the same size. Short summaries of 0.67% and 2.0% of the full text size of the input documents are evaluated. Our findings show that the content quality of the unique LDA summary is no better than that of its corresponding LDA baseline summary. However, if the duplicate words are removed from the baseline summary, producing a compressed version of itself with unique words, i.e., a unique LDA baseline summary, and if the compression ratio is taken into consideration, then the content quality of an LDA summary constituted using unique significant words does appear to have improved.
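The two operations the abstract describes, removing duplicate words while keeping morphological variants and then comparing summaries by divergence, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the abstract only says "divergence probability", so Kullback-Leibler divergence with additive smoothing is assumed; the compression ratio is assumed to be the compressed length divided by the original length; and the function names, the smoothing constant, and the toy word lists are hypothetical.

```python
import math
from collections import Counter


def unique_words(summary):
    """Drop exact duplicate words, keeping the first occurrence in order.

    No stemming or lemmatisation is applied, so morphological variants
    such as "summary" and "summaries" are both retained.
    """
    seen = set()
    result = []
    for word in summary:
        if word not in seen:
            seen.add(word)
            result.append(word)
    return result


def kl_divergence(source_words, summary_words, smoothing=1e-3):
    """KL(source || summary) over the joint vocabulary.

    Additive smoothing (an assumed choice) keeps every probability
    non-zero so that the logarithm is always defined.
    """
    vocab = set(source_words) | set(summary_words)
    p_counts = Counter(source_words)
    q_counts = Counter(summary_words)
    p_total = len(source_words) + smoothing * len(vocab)
    q_total = len(summary_words) + smoothing * len(vocab)
    divergence = 0.0
    for word in vocab:
        p = (p_counts[word] + smoothing) / p_total
        q = (q_counts[word] + smoothing) / q_total
        divergence += p * math.log(p / q)
    return divergence


def compression_ratio(original, compressed):
    """Assumed definition: compressed length over original length."""
    return len(compressed) / len(original)


# Hypothetical toy data, not taken from the paper.
source = "topic word summary topic summaries summary word model".split()
baseline = ["summary", "topic", "summaries", "topic", "word", "summary"]

unique_baseline = unique_words(baseline)
print(unique_baseline)
print(compression_ratio(baseline, unique_baseline))
print(kl_divergence(source, baseline), kl_divergence(source, unique_baseline))
```

A lower divergence means the summary's word distribution is closer to that of the source text. Under these assumptions, one way to take the compression ratio into consideration is to read the divergence of the unique baseline summary alongside the fraction of the original length it occupies.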

Acknowledgement

The authors gratefully acknowledge Universiti Teknologi MARA (UiTM) for supporting this work, which was funded under the Malaysian Ministry of Higher Education’s Fundamental Research Grant Scheme (ref. no. FRGS/1/2016/ICT01/UITM/02/3).

Author information

Authors and Affiliations

Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA (UiTM), 40450, Shah Alam, Selangor, Malaysia
Muthukkaruppan Annamalai & Amrah Mohd Narawi
Knowledge and Software Engineering Research Group, Advanced Computing and Communication Communities of Research, Universiti Teknologi MARA (UiTM), 40450, Shah Alam, Selangor, Malaysia
Muthukkaruppan Annamalai

Corresponding author

Correspondence to Muthukkaruppan Annamalai.

Editor information

Editors and Affiliations

Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia
Azlinah Mohamed
University of Tennessee at Knoxville, Knoxville, Tennessee, USA
Michael W. Berry
Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia
Bee Wah Yap

About this paper

Cite this paper

Annamalai, M., Narawi, A.M. (2017). Content Quality of Latent Dirichlet Allocation Summaries Constituted Using Unique Significant Words. In: Mohamed, A., Berry, M., Yap, B. (eds) Soft Computing in Data Science. SCDS 2017. Communications in Computer and Information Science, vol 788. Springer, Singapore. https://doi.org/10.1007/978-981-10-7242-0_26

DOI: https://doi.org/10.1007/978-981-10-7242-0_26
Published: 24 November 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7241-3
Online ISBN: 978-981-10-7242-0
eBook Packages: Computer Science, Computer Science (R0)
