research-article

No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity

Authors:
Ferhan Ture, University of Maryland, College Park, MD, USA
Tamer Elsayed, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
Jimmy Lin, University of Maryland, College Park, MD, USA
SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval, July 2011, Pages 943–952. https://doi.org/10.1145/2009916.2010042

Published: 24 July 2011

ABSTRACT

This work explores the problem of cross-lingual pairwise similarity, where the task is to extract similar pairs of documents across two different languages. Solutions to this problem are of general interest for text mining in the multi-lingual context and have specific applications in statistical machine translation. Our approach takes advantage of cross-language information retrieval (CLIR) techniques to project feature vectors from one language into another, and then uses locality-sensitive hashing (LSH) to extract similar pairs. We show that effective cross-lingual pairwise similarity requires working with similarity thresholds that are much lower than in typical monolingual applications, making the problem quite challenging. We present a parallel, scalable MapReduce implementation of the sort-based sliding window algorithm, which is compared to a brute-force approach on German and English Wikipedia collections. Our central finding can be summarized as "no free lunch": there is no single optimal solution. Instead, we characterize effectiveness-efficiency tradeoffs in the solution space, which can guide the developer to locate a desirable operating point based on application- and resource-specific constraints.
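
The contrast between the two strategies named in the abstract can be illustrated with a small single-machine sketch. It is not the authors' MapReduce implementation, and the abstract does not specify these details: the random-hyperplane bit signatures, the Hamming-distance threshold, the random bit permutations, and all function and parameter names (`make_planes`, `signature`, `sliding_window_pairs`, `num_perms`, `window`) are assumptions made for illustration. It also assumes the document vectors have already been projected into a common language. The brute-force routine compares every cross-collection pair; the sliding-window routine sorts the signatures under several bit permutations and compares only entries that fall within a fixed window of each other.

```python
import random


def make_planes(num_bits, dim, rng):
    """Draw one random Gaussian hyperplane per signature bit (assumed LSH family)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vec, planes):
    """Bit signature: the sign of the dot product with each hyperplane."""
    return tuple(int(sum(v * p for v, p in zip(vec, plane)) >= 0.0) for plane in planes)


def hamming(a, b):
    """Number of positions at which two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def brute_force_pairs(sigs_a, sigs_b, max_dist):
    """Compare every signature in A against every signature in B."""
    return {
        (i, j)
        for i, sa in enumerate(sigs_a)
        for j, sb in enumerate(sigs_b)
        if hamming(sa, sb) <= max_dist
    }


def sliding_window_pairs(sigs_a, sigs_b, max_dist, num_perms, window, rng):
    """Sort-based sliding window over permuted signatures (approximate)."""
    num_bits = len(sigs_a[0])
    tagged = [("a", i, s) for i, s in enumerate(sigs_a)]
    tagged += [("b", j, s) for j, s in enumerate(sigs_b)]
    found = set()
    for _ in range(num_perms):
        # Permute bit positions so a different prefix dominates each sort.
        perm = list(range(num_bits))
        rng.shuffle(perm)
        ordered = sorted(tagged, key=lambda t: tuple(t[2][k] for k in perm))
        for pos, (side, idx, sig) in enumerate(ordered):
            # Compare only against the next (window - 1) entries in sort order.
            for o_side, o_idx, o_sig in ordered[pos + 1 : pos + window]:
                if o_side != side and hamming(sig, o_sig) <= max_dist:
                    found.add((idx, o_idx) if side == "a" else (o_idx, idx))
    return found


# Toy usage with made-up 4-dimensional vectors for two collections.
rng = random.Random(42)
planes = make_planes(num_bits=32, dim=4, rng=rng)
docs_a = [[1.0, 0.0, 2.0, 0.5], [0.0, 1.0, 0.0, 3.0]]
docs_b = [[1.1, 0.0, 1.9, 0.4], [-1.0, 0.5, -2.0, 0.0]]
sigs_a = [signature(v, planes) for v in docs_a]
sigs_b = [signature(v, planes) for v in docs_b]
exact = brute_force_pairs(sigs_a, sigs_b, max_dist=6)
approx = sliding_window_pairs(sigs_a, sigs_b, 6, num_perms=4, window=3, rng=rng)
```

In this sketch the sliding-window output is always a subset of the brute-force output over the same signatures. Raising `num_perms` or `window` recovers more of the missing pairs at the cost of more comparisons, which is the kind of effectiveness-efficiency tradeoff the abstract describes.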

Index Terms

1. Information systems
  1. Information retrieval

Published in
SIGIR '11: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval
July 2011
1374 pages
ISBN: 9781450307574
DOI: 10.1145/2009916
General Chairs: Wei-Ying Ma (Microsoft Research Asia, China), Jian-Yun Nie (University of Montreal, Canada)
Program Chairs: Ricardo Baeza-Yates (Yahoo! Research, Spain), Tat-Seng Chua (National University of Singapore), W. Bruce Croft (University of Massachusetts, Amherst, USA)
Copyright © 2011 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
lsh
machine translation
wikipedia
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%