Abstract
A novel distance measure for strings, termed Local Rank Distance (LRD), was recently introduced. LRD is inspired by rank distance, but it is designed to conform to more general principles while being better adapted to specific data types, such as DNA strings or text. More precisely, LRD measures the local displacement of character n-grams between two strings. Local Rank Distance has already demonstrated promising results in computational biology and native language identification, but the algorithm originally used to compute LRD is computationally expensive. In this paper, an efficient algorithm for LRD is proposed. The main efficiency improvement is to build a positional inverted index for the character n-grams in one of the compared strings. Then, for each n-gram in the other string, a binary search is used to find the position of the nearest matching n-gram in the positional inverted index. The proposed algorithm is more than two orders of magnitude faster than the original algorithm. An application of the described algorithm is also presented: state-of-the-art results are reported for Arabic native language identification from text documents.
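The indexing and lookup steps described in the abstract can be illustrated with a minimal Python sketch. It is not the implementation from the paper. The function names, the cap `max_offset` applied when an n-gram has no match, and the symmetric sum over both directions are assumptions made for illustration, since the abstract does not state the exact definition of LRD.

```python
from bisect import bisect_left
from collections import defaultdict


def build_positional_index(s, n):
    """Map each character n-gram of s to the sorted list of its start positions."""
    index = defaultdict(list)
    for i in range(len(s) - n + 1):
        # Positions are appended in increasing order, so each list stays sorted.
        index[s[i:i + n]].append(i)
    return index


def nearest_offset(positions, i, max_offset):
    """Distance from i to the closest entry of a sorted list, capped at max_offset."""
    k = bisect_left(positions, i)  # binary search for the insertion point of i
    best = max_offset
    if k < len(positions):         # closest candidate at or to the right of i
        best = min(best, positions[k] - i)
    if k > 0:                      # closest candidate to the left of i
        best = min(best, i - positions[k - 1])
    return best


def one_sided_displacement(x, y, n, max_offset):
    """Sum, over the n-grams of x, of the offset to the nearest match in y."""
    index = build_positional_index(y, n)
    total = 0
    for i in range(len(x) - n + 1):
        positions = index.get(x[i:i + n])
        # Assumption: an n-gram with no match in y contributes max_offset.
        total += nearest_offset(positions, i, max_offset) if positions else max_offset
    return total


def local_rank_distance_sketch(x, y, n=2, max_offset=10):
    """Assumed symmetric form: displacement of x against y plus y against x."""
    return (one_sided_displacement(x, y, n, max_offset)
            + one_sided_displacement(y, x, n, max_offset))
```

Building the index takes one pass over the indexed string, and each lookup costs a binary search over the occurrence list of a single n-gram rather than a scan of neighbouring positions. This is the source of the speed-up the abstract attributes to the positional inverted index.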
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Ionescu, R.T. (2015). A Fast Algorithm for Local Rank Distance: Application to Arabic Native Language Identification. In: Arik, S., Huang, T., Lai, W., Liu, Q. (eds) Neural Information Processing. ICONIP 2015. Lecture Notes in Computer Science, vol 9490. Springer, Cham. https://doi.org/10.1007/978-3-319-26535-3_45
DOI: https://doi.org/10.1007/978-3-319-26535-3_45
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-26534-6
Online ISBN: 978-3-319-26535-3
eBook Packages: Computer Science, Computer Science (R0)