DOI: 10.1145/2009916.2010042
research-article

No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity

Published: 24 July 2011

ABSTRACT

This work explores the problem of cross-lingual pairwise similarity, where the task is to extract similar pairs of documents across two different languages. Solutions to this problem are of general interest for text mining in the multi-lingual context and have specific applications in statistical machine translation. Our approach takes advantage of cross-language information retrieval (CLIR) techniques to project feature vectors from one language into another, and then uses locality-sensitive hashing (LSH) to extract similar pairs. We show that effective cross-lingual pairwise similarity requires working with similarity thresholds that are much lower than in typical monolingual applications, making the problem quite challenging. We present a parallel, scalable MapReduce implementation of the sort-based sliding window algorithm, which is compared to a brute-force approach on German and English Wikipedia collections. Our central finding can be summarized as "no free lunch": there is no single optimal solution. Instead, we characterize effectiveness-efficiency tradeoffs in the solution space, which can guide the developer to locate a desirable operating point based on application- and resource-specific constraints.
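
The abstract does not spell out the hashing or windowing details, so the following is only a minimal single-machine sketch of one common way to combine LSH signatures with a sort-based sliding window, not the authors' MapReduce implementation. It assumes that document vectors have already been projected into a shared feature space, that signatures are random-hyperplane bits, and that candidates are found by sorting signatures under random bit permutations and comparing each one with its next few neighbors. All function names, parameters, and the toy data are hypothetical.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian hyperplanes of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    return tuple(
        1 if sum(h_i * v_i for h_i, v_i in zip(h, vec)) >= 0 else 0
        for h in hyperplanes
    )


def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def sliding_window_pairs(sigs, num_perms, window, max_dist, seed=0):
    """Find cross-lingual candidate pairs.

    sigs maps (language, doc_id) -> bit signature. For each random bit
    permutation, the signatures are sorted lexicographically and each one is
    compared with the next `window` entries in sorted order. A pair is kept
    if the two documents are in different languages and their signatures
    differ in at most `max_dist` bits.
    """
    rng = random.Random(seed)
    num_bits = len(next(iter(sigs.values())))
    pairs = set()
    for _ in range(num_perms):
        perm = list(range(num_bits))
        rng.shuffle(perm)
        ordered = sorted(sigs, key=lambda d: tuple(sigs[d][p] for p in perm))
        for i, a in enumerate(ordered):
            for b in ordered[i + 1 : i + 1 + window]:
                if a[0] != b[0] and hamming(sigs[a], sigs[b]) <= max_dist:
                    pairs.add((min(a, b), max(a, b)))
    return pairs


# Toy data: the German document is assumed to be already projected into the
# English feature space; en/B points in the opposite direction to en/A.
docs = {
    ("de", "A"): [1.0, 2.0, 0.5],
    ("en", "A"): [1.0, 2.0, 0.5],
    ("en", "B"): [-1.0, -2.0, -0.5],
}
planes = make_hyperplanes(num_bits=16, dim=3, seed=42)
sigs = {doc: signature(vec, planes) for doc, vec in docs.items()}
pairs = sliding_window_pairs(sigs, num_perms=4, window=2, max_dist=3)
print(pairs)
```

In a sketch like this, a looser `max_dist`, more permutations, or a wider window would each raise recall at the cost of more comparisons, which is the kind of effectiveness-efficiency tradeoff the abstract describes. A brute-force baseline would instead compare every cross-lingual pair directly.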

Published in: SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
July 2011, 1374 pages
ISBN: 9781450307574
DOI: 10.1145/2009916

Copyright © 2011 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher: Association for Computing Machinery, New York, NY, United States

Overall acceptance rate: 792 of 3,983 submissions (20%)
