Abstract
Script identification is a widely accepted technique for selecting the script-specific OCR (optical character recognition) engine for multilingual document images. Extensive research has been done in this field, yet identification accuracy remains low. This is due to faded document images and to variations in illumination and position during scanning. Noise is another major obstacle in the script identification process: it can be reduced to some extent but cannot be removed completely. This paper analyzes and classifies various script identification schemes for document images. It also compares these schemes and discusses their merits and demerits on a common platform. This should help researchers understand the complexity of the problem and identify possible directions for research in this field.
Cite this article
Sahare, P., Dhok, S.B. Script identification algorithms: a survey. Int J Multimed Info Retr 6, 211–232 (2017). https://doi.org/10.1007/s13735-017-0130-2
DOI: https://doi.org/10.1007/s13735-017-0130-2