A Fast Algorithm for Local Rank Distance: Application to Arabic Native Language Identification

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9490)

Abstract

A novel distance measure for strings, termed Local Rank Distance (LRD), was recently introduced. LRD is inspired by rank distance, but it is designed to conform to more general principles while being better suited to specific data types, such as DNA strings or text. More precisely, LRD measures the local displacement of character n-grams between two strings. Local Rank Distance has already demonstrated promising results in computational biology and native language identification, but the algorithm used to compute LRD is computationally expensive. In this paper, an efficient algorithm for LRD is proposed. The main efficiency improvement is to build a positional inverted index for the character n-grams in one of the compared strings. Then, for each n-gram in the other string, a binary search is used to find the position of the nearest matching n-gram in the positional inverted index. The proposed algorithm is more than two orders of magnitude faster than the original algorithm. As an application of the described algorithm, state-of-the-art results are presented for Arabic native language identification from text documents.
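
The index-and-search step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names, the `max_offset` cap applied to unmatched or distant n-grams, and the one-directional sum are all assumptions made for the example, since the abstract does not give the exact definition of LRD.

```python
from bisect import bisect_left
from collections import defaultdict


def build_positional_index(text, n):
    """Map each character n-gram of `text` to the sorted list of its start positions."""
    index = defaultdict(list)
    for pos in range(len(text) - n + 1):
        # Positions are visited in increasing order, so each list stays sorted.
        index[text[pos:pos + n]].append(pos)
    return index


def nearest_offset(positions, pos):
    """Smallest |p - pos| over the non-empty sorted list `positions` (binary search)."""
    i = bisect_left(positions, pos)
    candidates = []
    if i < len(positions):
        candidates.append(positions[i] - pos)      # closest match at or after pos
    if i > 0:
        candidates.append(pos - positions[i - 1])  # closest match before pos
    return min(candidates)


def directed_displacement(source, target, n, max_offset):
    """Sum, over the n-grams of `source`, of the offset to the nearest matching
    n-gram in `target`. Assumption: offsets are capped at `max_offset`, and an
    n-gram with no match at all contributes `max_offset`."""
    index = build_positional_index(target, n)
    total = 0
    for pos in range(len(source) - n + 1):
        positions = index.get(source[pos:pos + n])
        if positions:
            total += min(nearest_offset(positions, pos), max_offset)
        else:
            total += max_offset
    return total
```

Building the index takes one pass over the first string, and each lookup costs a binary search over the positions of a single n-gram rather than a scan of the whole string. A symmetric measure could be obtained by adding the two directed sums, although whether that matches the paper's definition is not stated in the abstract.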

Author information

Corresponding author

Correspondence to Radu Tudor Ionescu.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Ionescu, R.T. (2015). A Fast Algorithm for Local Rank Distance: Application to Arabic Native Language Identification. In: Arik, S., Huang, T., Lai, W., Liu, Q. (eds) Neural Information Processing. ICONIP 2015. Lecture Notes in Computer Science, vol 9490. Springer, Cham. https://doi.org/10.1007/978-3-319-26535-3_45

  • DOI: https://doi.org/10.1007/978-3-319-26535-3_45

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26534-6

  • Online ISBN: 978-3-319-26535-3

  • eBook Packages: Computer Science, Computer Science (R0)
