Multilingual Documents Clustering Based on Closed Concepts Mining

  • Conference paper
Database and Expert Systems Applications (Globe 2015, DEXA 2015)

Abstract

The scarcity of bilingual and multilingual parallel corpora has prompted many researchers to emphasize the need for new methods to enhance the quality of comparable corpora. In this paper, we highlight the usefulness of Formal Concept Analysis in multilingual document clustering for improving corpus comparability. We propose a statistical approach to clustering multilingual documents, based on multilingual Closed Concepts Mining, that partitions documents belonging to one or more collections and written in more than one language into a set of classes. An experimental evaluation conducted on two collections showed a significant improvement in the comparability of the generated classes.

Notes

  1. In this paper, we denote by |X| the cardinality of the set X.

  2. http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/

  3. https://translate.google.com/

  4. http://www.lemurproject.org/

Acknowledgements

This work is partially funded by the DGRST-CNRS n° 14/R 1401 Franco-Tunisian project, entitled “Text mining for construction of bilingual lexicons and multilingual information retrieval”.

Author information

Correspondence to Mohamed Chebel.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Chebel, M., Latiri, C., Gaussier, E. (2015). Multilingual Documents Clustering Based on Closed Concepts Mining. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe 2015, DEXA 2015. Lecture Notes in Computer Science, vol 9261. Springer, Cham. https://doi.org/10.1007/978-3-319-22849-5_36

  • DOI: https://doi.org/10.1007/978-3-319-22849-5_36

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-22848-8

  • Online ISBN: 978-3-319-22849-5

  • eBook Packages: Computer Science, Computer Science (R0)
