Cluster Labeling for Multilingual Scatter/Gather Using Comparable Corpora

  • Conference paper
Advances in Information Retrieval (ECIR 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7224)

Abstract

Scatter/Gather systems are becoming increasingly useful for browsing document corpora. The usability of present-day systems is restricted to monolingual corpora, and their methods for clustering and labeling do not easily extend to the multilingual setting, especially in the absence of dictionaries or machine translation. In this paper, we study the cluster labeling problem for multilingual corpora in the absence of machine translation, but using comparable corpora. Using a variational approach, we show that multilingual topic models can effectively handle the cluster labeling problem, which in turn allows us to design a novel Scatter/Gather system, ShoBha. Experimental results on three datasets, namely the Canadian Hansards corpus, the entire overlapping Wikipedia of English, Hindi and Bengali articles, and a trilingual news corpus containing 41,000 articles, confirm the utility of the proposed system.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Tholpadi, G., Das, M.K., Bhattacharyya, C., Shevade, S. (2012). Cluster Labeling for Multilingual Scatter/Gather Using Comparable Corpora. In: Baeza-Yates, R., et al. Advances in Information Retrieval. ECIR 2012. Lecture Notes in Computer Science, vol 7224. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28997-2_33

  • DOI: https://doi.org/10.1007/978-3-642-28997-2_33

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-28996-5

  • Online ISBN: 978-3-642-28997-2

  • eBook Packages: Computer Science, Computer Science (R0)
