An Unsupervised Method for Concept Association Analysis in Text Collections

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11799)

Abstract

This paper addresses the challenge of content categorization to support document navigation and retrieval. The work is motivated by the need to categorize all legislation of a country, where the existing metadata for each document is insufficient for effective categorization: concepts vary considerably among documents, resulting in highly sparse vector-space models. To address this challenge, we survey recent related work and propose a new unsupervised knowledge discovery process that integrates currently dispersed principles from topic modeling and formal concept analysis. Because the process requires no prior domain knowledge, it can be applied to large document collections. The results confirm the potential of the proposed approach.

Notes

  1. https://dre.pt/

  2. http://trec.nist.gov/data.html

  3. https://dre.pt/web/guest/home/-/dre/3535010/details/maximized

  4. http://latviz.loria.fr/

Acknowledgement

This work was supported by Imprensa Nacional Casa da Moeda (INCM) and national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2019.

Corresponding author

Correspondence to Pavlo Kovalchuk.

Copyright information

© 2019 Springer Nature Switzerland AG

About this paper

Cite this paper

Kovalchuk, P., Proença, D., Borbinha, J., Henriques, R. (2019). An Unsupervised Method for Concept Association Analysis in Text Collections. In: Doucet, A., Isaac, A., Golub, K., Aalberg, T., Jatowt, A. (eds) Digital Libraries for Open Knowledge. TPDL 2019. Lecture Notes in Computer Science, vol 11799. Springer, Cham. https://doi.org/10.1007/978-3-030-30760-8_2

  • DOI: https://doi.org/10.1007/978-3-030-30760-8_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-30759-2

  • Online ISBN: 978-3-030-30760-8

  • eBook Packages: Computer Science, Computer Science (R0)
