A Fast Algorithm for Local Rank Distance: Application to Arabic Native Language Identification

Ionescu, Radu Tudor

doi:10.1007/978-3-319-26535-3_45

Conference paper
First Online: 10 November 2015

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9490)

Abstract

A novel distance measure for strings, termed Local Rank Distance (LRD), was recently introduced. LRD is inspired by rank distance, but it is designed to conform to more general principles while being better adapted to specific data types, such as DNA strings or text. More precisely, LRD measures the local displacement of character n-grams between two strings. Local Rank Distance has already demonstrated promising results in computational biology and native language identification, but the algorithm used to compute LRD is computationally expensive. In this paper, an efficient algorithm for LRD is proposed. The main efficiency improvement is to build a positional inverted index for the character n-grams in one of the compared strings. Then, for each n-gram in the other string, a binary search is used to find the position of the nearest matching n-gram in the positional inverted index. The proposed algorithm is more than two orders of magnitude faster than the original algorithm. As an application of the described algorithm, state-of-the-art results are presented for Arabic native language identification from text documents.
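
The two steps named in the abstract (a positional inverted index over the n-grams of one string, then a binary search for the nearest matching position) can be illustrated with a minimal Python sketch. It is not the paper's implementation. The abstract does not give the exact LRD formula, so the scoring used here is an assumption: each n-gram contributes the offset to its nearest match in the other string, capped at a hypothetical `max_offset`, with `max_offset` also charged when no match exists, and the two directions summed. All function and parameter names are illustrative.

```python
from bisect import bisect_left
from collections import defaultdict


def build_positional_index(s, n):
    """Map each character n-gram of s to the sorted list of its start positions."""
    index = defaultdict(list)
    for i in range(len(s) - n + 1):
        index[s[i:i + n]].append(i)  # appended in increasing order, so lists stay sorted
    return index


def nearest_offset(positions, pos):
    """Binary-search a sorted position list; return the distance to the entry closest to pos."""
    k = bisect_left(positions, pos)
    best = None
    # The nearest entry is either just before or at the insertion point.
    for j in (k - 1, k):
        if 0 <= j < len(positions):
            d = abs(positions[j] - pos)
            if best is None or d < best:
                best = d
    return best


def one_sided_displacement(x, y, n, max_offset):
    """Sum, over the n-grams of x, the capped offset to the nearest match in y."""
    index = build_positional_index(y, n)
    total = 0
    for i in range(len(x) - n + 1):
        positions = index.get(x[i:i + n])
        if positions is None:
            total += max_offset  # assumed penalty for an n-gram with no match
        else:
            total += min(nearest_offset(positions, i), max_offset)
    return total


def local_rank_distance_sketch(x, y, n=2, max_offset=10):
    """Symmetric sum of both directions (an assumed form, see the note above)."""
    return (one_sided_displacement(x, y, n, max_offset)
            + one_sided_displacement(y, x, n, max_offset))
```

For example, `local_rank_distance_sketch("abcd", "cdab", n=2, max_offset=10)` evaluates to 28 under these assumptions: in each direction two bigrams are displaced by 2 positions and one bigram has no match. Because each lookup is a binary search over a sorted position list, no window of the other string has to be scanned for every n-gram.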

Author information

Authors and Affiliations

Faculty of Mathematics and Computer Science, University of Bucharest, 14 Academiei Street, Bucharest, Romania
Radu Tudor Ionescu

Corresponding author

Correspondence to Radu Tudor Ionescu.

Editor information

Editors and Affiliations

University of Istanbul, Istanbul, Turkey
Sabri Arik
University at Qatar, Doha, Qatar
Tingwen Huang
Tunku Abdul Rahman University College, Kuala Lumpur, Malaysia
Weng Kin Lai
University of Science Technology, Wuhan, China
Qingshan Liu

About this paper

Cite this paper

Ionescu, R.T. (2015). A Fast Algorithm for Local Rank Distance: Application to Arabic Native Language Identification. In: Arik, S., Huang, T., Lai, W., Liu, Q. (eds) Neural Information Processing. ICONIP 2015. Lecture Notes in Computer Science, vol 9490. Springer, Cham. https://doi.org/10.1007/978-3-319-26535-3_45

DOI: https://doi.org/10.1007/978-3-319-26535-3_45
Published: 10 November 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26534-6
Online ISBN: 978-3-319-26535-3
eBook Packages: Computer Science, Computer Science (R0)
