ABSTRACT
This work explores the problem of cross-lingual pairwise similarity, where the task is to extract similar pairs of documents across two different languages. Solutions to this problem are of general interest for text mining in the multi-lingual context and have specific applications in statistical machine translation. Our approach takes advantage of cross-language information retrieval (CLIR) techniques to project feature vectors from one language into another, and then uses locality-sensitive hashing (LSH) to extract similar pairs. We show that effective cross-lingual pairwise similarity requires working with similarity thresholds that are much lower than in typical monolingual applications, making the problem quite challenging. We present a parallel, scalable MapReduce implementation of the sort-based sliding window algorithm, which is compared to a brute-force approach on German and English Wikipedia collections. Our central finding can be summarized as "no free lunch": there is no single optimal solution. Instead, we characterize effectiveness-efficiency tradeoffs in the solution space, which can guide the developer to locate a desirable operating point based on application- and resource-specific constraints.
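The sort-based sliding window idea mentioned above can be illustrated with a minimal sketch. It is not the authors' MapReduce implementation: it runs on a single machine, assumes document vectors have already been projected into a shared feature space (the CLIR step is omitted), and uses random-projection bit signatures as one common LSH choice. The function names, the `(language, key)` document-id convention, and the parameters `n_perms`, `window`, and `max_dist` are all hypothetical.

```python
import random
from itertools import combinations


def make_planes(dim, n_bits, rng):
    """Draw n_bits random Gaussian hyperplanes of dimension dim."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, planes):
    """Bit signature: one bit per hyperplane, set when the dot product is >= 0."""
    return tuple(int(sum(v * w for v, w in zip(vec, plane)) >= 0.0) for plane in planes)


def hamming(a, b):
    """Number of positions at which two equal-length signatures differ."""
    return sum(x != y for x, y in zip(a, b))


def brute_force_pairs(sigs, max_dist):
    """Compare every cross-lingual pair of signatures (quadratic baseline).

    sigs maps a document id (language, key) to its bit signature.
    """
    return {
        (a, b)
        for a, b in combinations(sorted(sigs), 2)
        if a[0] != b[0] and hamming(sigs[a], sigs[b]) <= max_dist
    }


def sliding_window_pairs(sigs, n_perms, window, max_dist, rng):
    """Approximate search: permute bits, sort, compare only nearby entries.

    For each of n_perms random bit permutations, the permuted signatures are
    sorted lexicographically and each entry is compared with the next
    window - 1 entries. Pairs that never fall within a window are missed.
    """
    n_bits = len(next(iter(sigs.values())))
    found = set()
    for _ in range(n_perms):
        perm = list(range(n_bits))
        rng.shuffle(perm)
        ordered = sorted(
            (tuple(sig[i] for i in perm), doc_id) for doc_id, sig in sigs.items()
        )
        for i, (sig_a, id_a) in enumerate(ordered):
            for sig_b, id_b in ordered[i + 1 : i + window]:
                # Keep only pairs that cross languages and are close enough.
                if id_a[0] != id_b[0] and hamming(sig_a, sig_b) <= max_dist:
                    found.add((min(id_a, id_b), max(id_a, id_b)))
    return found
```

In this sketch the sliding window can only return a subset of what the brute-force comparison returns over the same signatures. Raising `n_perms` or `window` recovers more pairs at higher cost, which is one way to picture the effectiveness-efficiency tradeoff the abstract describes.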
Index Terms
- No free lunch: brute force vs. locality-sensitive hashing for cross-lingual pairwise similarity