World Digital Libraries: An International Journal (WDL)
Vol.9(1)  June 2016
Print ISSN : 0974-567X
Online ISSN : 0975-7597

Digital Library of India: An Initiative for the Preservation and Dissemination of the National Heritage and Rare Books and Manuscripts Collection

Debal C Kar: University Librarian, Ambedkar University Delhi, Delhi, India. (E): debal@aud.ac.in
DOI: 10.18329/09757597/2016/9104

Abstract

Digital Library of India (DLI) is an initiative taken by the Government of India to digitally preserve and disseminate all the significant literary, artistic, and scientific works of humankind available in India, and to make them freely available in every corner of the world for education, study, appreciation, and future generations. The project started with the primary long-term objective of capturing all copyright-free books and manuscripts available in India in digital format. Planning began with the aim of digitizing one million books (less than one per cent of all books in all languages ever published) by 2005 in the first phase. At present, it has succeeded in disseminating 537,350 books with 187,656,339 pages in 46 languages (Indian and foreign) available in India. The books and manuscripts are freely accessible on the websites <www.dli.gov.in> and <www.dli.ernet.in>. The Government of India has also taken initiatives to digitize cultural heritage, such as artefacts, monuments, heritage buildings, temples, and thousand-year-old manuscripts, along with walk-throughs.


The article also discusses the initiation, planning, and successful execution of the project. It covers the expenditure incurred to date, the sources of funds, and the coverage in respect of subjects, languages, types of collections (e.g. books, manuscripts, handwritten manuscripts on leaves, journals, newspapers, etc.), libraries, cities, centres, and more. It also describes how the different libraries and digitization centres share their resources and network with each other for better usage of the documents available since 1985 or earlier.


The article also describes the philosophy behind content selection, the handling of duplicated work, and the copyright policy of the project. Furthermore, the process workflow for the digitization of books is summarized in terms of three major process elements: the pre-scanning process, the scanning process, and the post-scanning process, together with the tools used. The criteria for deciding which manuscripts are to be included in the collection for digitization are also described.


The article also provides the language-wise and centre-wise status of the digital collection. The digitization process, the steps taken before digitization, and the process followed after digitization are described in detail. In addition, the article discusses the precautions taken to preserve the digitized data and the steps taken to disseminate it across the world. Other projects and actions initiated so far to educate and empower human resources, and outreach activities to popularize DLI and increase its usage, are also elaborated upon. Statistics showing the usage and the number of pages downloaded in a particular month are also provided.


In conclusion, the article describes the benefits derived from the DLI project and from its initiatives for the preservation and dissemination of national heritage, and suggests future plans to popularize DLI's collection of rare books and manuscripts.


1. Introduction

Digital Library of India (DLI) is a digital collection of freely accessible rare books collected from various libraries in India. It is an initiative of the Government of India for the preservation and dissemination of the national heritage and of the rare books and manuscripts available in India. DLI aims to digitally preserve and disseminate all the significant literary, artistic, and scientific works available in India, and to make them freely available all over the world for education, study, appreciation, and future generations.


As a first step in realizing this vision, it was proposed to create the Digital Library with a free-to-read, searchable collection of one million books, predominantly in Indian languages. The project was initiated by the Office of the Principal Scientific Advisor to the Government of India and subsequently taken over by the Department of Information Technology (DIT), now known as the Department of Electronics and Information Technology (DeitY), in the Ministry of Communications and Information Technology (MCIT), Government of India. The idea was also to create a test bed for researchers to improve scanning techniques, optical character recognition, and intelligent indexing, and in general to promote research in Indian language technology.


The project primarily began with the long-term objective to capture all copyright free books, available in India, in digital format. The planning was started with an aim to digitize one million books (less than 1 per cent of all books in all languages ever published) by 2005 in the first phase.


The basic idea behind this project is to explore the possibility of storing, in digital form, all the knowledge ever produced by mankind and making this content available free of charge to be browsed and searched by anyone, anywhere, and anytime. This vision is the goal of the Universal Digital Library Project (UDL). The trend would be such that any information that is not online and accessible to search engines may become unusable. In a thousand years, only a few of the paper documents we have today will survive the ravages of deterioration, loss, and outright destruction. Hence, there is an urgent need to preserve our knowledge and heritage in the digital form (Balakrishnan et al. 2006).


As a part of Raj Reddy’s grand vision, a mission to digitize one million books, known as the Million Books to the Web Project (MBP), was embarked upon as a collaborative project involving many countries, especially India, the USA, and China.


To support UDL in India, DeitY, Ministry of Communication and Information Technology, Government of India, has sponsored a project for the digitization of copyright-free books available in India. Since its inception in November 2002, initially operating at three centers, the project has been successfully digitizing books, which are a dominant store of knowledge and culture.


DLI now hosts more than 537,350 books comprising 187,656,339 pages in more than 46 languages (Indian and foreign), which have been scanned at more than 42 centers across the country. Some of the scanning centers are given in Table 1.


All the scanning centres send the data (scanned images in Tagged Image File Format (TIFF) along with the metadata of a book) to the Indian Institute of Science. After checking for quality errors, the Institute hosts these documents on the Digital Library website in Portable Document Format (PDF).


While scanning, we faced many problems, which open up many research opportunities in language technologies, particularly for Indian languages. It was thought that language, especially the Indian languages, should not be a barrier to information access where knowledge exists free of cost. Individual mother tongues in India number several hundred. According to the Census of India 2001, India has 122 major languages and 1,599 other languages. The 2001 Census recorded 30 languages spoken by more than a million native speakers and 122 spoken by more than 10,000 people. In the process, it was realized that digital representation and storage mechanisms for Indian languages were major problems. With DeitY’s financial support through projects, digital representation and storage mechanisms have been developed for Indian languages, and a large number of applications are being built to store, process, retrieve, and present Indian language content. The DLI fosters a large number of research activities pertaining to language technologies for Indian languages, with development in areas such as information retrieval, optical character recognition, text summarization, machine translation and transliteration (the OM Transliteration Scheme for Indian languages), handwriting recognition, the Universal Dictionary, cross-lingual information retrieval and search, speech recognition in Indian languages, and natural language parsing and morphological analysis.


The project is a highly collaborative effort carried out in a distributed environment, and maintaining a uniform standard has become an important priority in such a setting. An isolated set-up does not promote collaboration across geographically distributed operation centres for server management and administration, or for the resolution of process-oriented issues, so a distributed environment is a requisite. Scanning, image processing, cleaning, and web enablement have therefore taken place concurrently at different locations. In doing so, we have faced a few issues relating to the selection of books for digitization, duplication of effort, the establishment of operating protocols, the quality of the digital output, the preservation of digitized books, and user-friendly and reliable access.


2. Reason for Digital Archives

Existing archives of books have many shortcomings. Many works in existence today are rare and accessible only to a small population of scholars and collectors at specific geographic locations. A single wanton act of destruction can destroy an entire line of heritage. Furthermore, contrary to popular belief, libraries, museums, and publishers do not routinely maintain broadly comprehensive archives of the considered works of man. No one can afford to do this unless the archive is digital.

3. Vision

The vision for the DLI project was as below (http://www.dli.gov.in):


All the significant literary, artistic, and scientific works of mankind can be digitally preserved and made freely available, in every corner of the world, for education, study, appreciation, and for all our future generations.


4. Mission

The mission is to create a portal for the Digital Library of India which will foster creativity and free access to all human knowledge. As a first step in realizing this mission, it is proposed to create the Digital Library with a free-to-read, searchable collection of one million books, predominantly in Indian languages, available to everyone over the Internet. This portal will also become an aggregator of all the knowledge and digital content created by other digital library initiatives in India. Very soon we expect that this portal would provide a gateway to Indian digital libraries in science, arts, culture, music, movies, traditional medicine, palm leaves, and many more. The result will be a unique resource accessible to anyone in the world 24×7, without regard to socioeconomic background or nationality.

5. Goals

The primary long-term objective is to capture all books in digital format. As a first step we are planning to demonstrate the feasibility by undertaking to digitize one million books (less than one per cent of all books in all languages ever published) by 2005.


A secondary objective of this project will be to provide a test bed that will support other researchers who are working on improved scanning techniques, optical character recognition, and indexing.


6. Content Selection

DLI envisages developing a collection of books by adopting an approach as described below. The DLI has adhered to the copyright law.

  Coordination of Selection

Creating one digital copy and mirroring it in different locations will suffice, and will support multiple users at any time. Books on the ancient history of India, as well as cultural and social books in different languages, are digitized. These materials are obtained from authorized universities, institutes, libraries of religious organizations, and public libraries in India. Palm leaves, journals, and manuscripts are also digitized.

  Non-copyright materials

Materials which are free of copyright as per the Indian Copyright Act, 1957, have been scanned for DLI. The first selected materials were government textbooks published in 11 of the 18 official languages of India.

  Future Activities for Selection of Books

DLI will seek publisher permission to scan books that are not copyright free. However, there are numerous difficulties, in particular the lack of publisher records, the return of copyright to authors, and other circumstances.


Publishers increasingly see that the digital presentation of their works can attract buyers. They are interested in exploring ways in which their out-of-print titles may be returned to profitability. Continued work with publishers through the course of this project may attract many of them to it. That would be most beneficial in enriching the content to be made available in digital format to everyone.


7. Workflow

The procurement team identifies the books to be digitized (Ambati et al. 2006). The books are then sent to the various scanning locations operated under the regional mega scanning center (RMSC). Prior to digitization, the expert librarian enters the regular metadata for the books. Thereafter, the metadata is uploaded into the DLI system to check for duplicates against the existing DLI records. Books are then digitized and sent back to the library; the digitized product is tested against quality standards and approved for uploading to the DLI servers.


The process workflow for the digitization of books is summarized in terms of the following three major process elements (http://www.dli.gov.in):


  • Pre-scanning process

  • Scanning process

  • Post-scanning process

The pre-scanning process involves the following stages:


  • Identification of the books

  • The books procured are enlisted using the Regular Meta Data information

  • Books are then submitted to the ‘Digital Library of India’ system for duplication verification

  • The system checks for the duplicates and generates the barcode only for non-duplicate books

  • The barcode assigned books are then issued to the contractors for the scanning operation.
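The duplicate check and barcode step listed above can be illustrated with a minimal sketch. The source does not describe how the DLI system matches records or formats barcodes, so everything here is an assumption made for illustration: the `BookRecord` fields, the matching rule (normalized title, author, year, and language), and the `DLI-0000001` barcode format are all hypothetical.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class BookRecord:
    # Hypothetical subset of the regular metadata entered by the librarian.
    title: str
    author: str
    year: str
    language: str


def normalize(text: str) -> str:
    """Case-fold, drop punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text).casefold()
    # Punctuation is removed by Unicode category, which leaves letters and
    # combining vowel signs of Indic scripts intact.
    kept = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return " ".join(kept.split())


def metadata_key(book: BookRecord) -> tuple:
    """Build the comparison key used to detect duplicates (assumed rule)."""
    return tuple(normalize(f) for f in (book.title, book.author, book.year, book.language))


class DuplicateRegistry:
    """Issues a barcode only for books whose metadata key is not yet known."""

    def __init__(self):
        self._barcodes = {}    # metadata key -> barcode already issued
        self._next_serial = 1  # hypothetical running serial number

    def submit(self, book: BookRecord):
        """Return a new barcode, or None if the book is a duplicate."""
        key = metadata_key(book)
        if key in self._barcodes:
            return None
        barcode = f"DLI-{self._next_serial:07d}"
        self._next_serial += 1
        self._barcodes[key] = barcode
        return barcode


registry = DuplicateRegistry()
print(registry.submit(BookRecord("Gitanjali", "Tagore, Rabindranath", "1913", "English")))
print(registry.submit(BookRecord("  GITANJALI ", "Tagore Rabindranath", "1913", "english")))
```

In this sketch the second submission differs only in case, spacing, and punctuation, so it is treated as a duplicate and receives no barcode. An exact-key match of this kind would miss variant spellings and transliterations, which is one way incomplete or inconsistent metadata could let duplicates through.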

 The scanning process involves the following stages:


  • The books are scanned at a particular location or centre with an allotted scanning machine and an operator

  • The operator creates the structural meta information for the book he/she scans

  • The operator/ meta data entry operator enters the admin meta information for the books scanned on a particular day

  • The scanned books undergo processing and OCR along with quality control by the contractors/supervisors

  • The scanned and processed books are copied onto the hard disks and DVDs

  • The DVDs are submitted to the source / location/scanning centre

  • The hard disks are brought to the central server location for web enablement

 The post-scanning process comprises the following steps:


  • The contractors/supervisors upload the meta information obtained in two formats (Admin and Structural) to the central server

  • The system admin coordinates the meta information and the actual content obtained and allocates a server for the same

  • The quality assurance team works on the content before they are uploaded onto the server

Once the quality assurance team certifies the quality of the content and meta information, the content is uploaded; otherwise, it remains offline until the defects found have been corrected.
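The quality gate described above can be sketched as a simple rule: content goes online only when no defects are recorded. The specific checks below (a short list of required metadata fields and a page-count comparison) are assumptions chosen for illustration; the source does not specify which checks the DLI quality assurance tools actually perform.

```python
# Hypothetical required fields; not the actual DLI metadata schema.
REQUIRED_FIELDS = ("title", "author", "language")


def find_defects(metadata: dict, expected_pages: int, scanned_pages: int) -> list:
    """Collect human-readable defects for one digitized book."""
    defects = []
    for name in REQUIRED_FIELDS:
        if not str(metadata.get(name, "")).strip():
            defects.append(f"missing metadata field: {name}")
    if scanned_pages != expected_pages:
        defects.append(
            f"page count mismatch: expected {expected_pages}, scanned {scanned_pages}"
        )
    return defects


def publication_status(defects: list) -> str:
    """Content stays offline until every recorded defect is corrected."""
    return "offline" if defects else "online"


complete = {"title": "Gitanjali", "author": "Rabindranath Tagore", "language": "English"}
print(publication_status(find_defects(complete, 120, 120)))
print(publication_status(find_defects({"title": "Gitanjali"}, 120, 118)))
```

Returning the list of defects, rather than a bare pass or fail, keeps a record of what the scanning centre has to correct before the book is resubmitted.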


8. Tools Used

The specialty tools developed, customized, and used for the DLI are as follows:



  • Meta Data Software

  • Pre-scanning Meta Form

  • Post-scanning Meta Form

  • OM transliteration Software
        • Indian Fonts and iTrans manual and mapping table for Indian languages

  • Quality Assurance Software Package
        • Duplication Checking Tool
        • Image Quality Checking Tool
        • Meta Data Quality Checking Tool

  • Server Management Tool
        • Server Software and its services

  • Workflow Management Software

9. Cooperation and Collaboration

The Indian Institute of Science (IISc), Carnegie Mellon University (CMU), the International Institute of Information Technology, Hyderabad (IITH), and many other academic, religious, and government organizations acting as content creation centres, as listed below, have become partners in the DLI initiative for the digitization and preservation of Indian heritage present in the form of books, manuscripts, art, and music. The scanning operations and the preservation of digital data take place at different RMSCs across India. Each RMSC functions as an individual entity with several scanning units at different locations in its region, and the RMSCs operate in parallel and independently in distributed regions across the country. The functions of an RMSC include collecting books from different libraries of the region, distributing them among scanning locations within the region, returning the collected books after digitization, gathering the digitized content from the scanning locations, and hosting it. Every scanning location is staffed by trained personnel who carry out the scanning and image processing operations. Each centre brings its own unique collection of literature, as well as those of libraries in the surrounding areas, into the digital library. Many other academic, religious, and other institutions, as well as many individual authors, have cooperated by contributing their collections and books to the DLI free of cost.


The following institutions have collaborated in implementing the DLI:


10. Coordination and Research Centres in India

  • Indian Institute of Science, Bangalore, Karnataka

  • International Institute of Information Technology, Hyderabad, Andhra Pradesh 

11. Academic Institutions

  • Anna University, Chennai, Tamil Nadu

  • Arulmigu Kalasalingam College of Engineering (AKCE), Srivilliputur, Madurai, Tamil Nadu

  • Goa University, Goa 

  • Indian Institute of Astrophysics, Bengaluru, Karnataka 

  • Indian Institute of Information Technology, Allahabad, Uttar Pradesh 

  • International Institute of Information Technology, Hyderabad, Andhra Pradesh 

  • Osmania University, Hyderabad, Andhra Pradesh 

  • Punjab Technical University, Punjab 

  • Shanmugha Art, Science, Technology & Research Academy, Thanjavur, Tamil Nadu 

  • University of Hyderabad, Hyderabad, Andhra Pradesh 

  • University of Pune, Pune, Maharashtra

12. Religious and Cultural Institutions

  • Kanchi University, Kanchi, Tamil Nadu 

  • Poornapragna Vidyapeetha, Bengaluru, Karnataka 

  • Salarjung Museum, Hyderabad, Andhra Pradesh 

  • Sringeri Mutt, Sringeri, Karnataka

  • Tirumala Tirupati Devasthanams, Tirupati, Andhra Pradesh

  • Tibetan Monasteries and Literature on Jainism 

13. Government and Research Agencies

  • Academy of Sanskrit Research, Melkote, Karnataka 

  • CDAC – Noida

  • Maharashtra Industrial Development Corporation (MIDC), Mumbai, Maharashtra 

  • Rashtrapathi Bhavan, New Delhi 

  • CDAC – Kolkata, West Bengal

14. Industrial Partners

  • Thirinaina Informatics Ltd, Hyderabad, Andhra Pradesh

  • Par Informatics Ltd, Hyderabad, Andhra Pradesh

  • Graphix Imaging Systems, Noida, Uttar Pradesh

  • Softdot Technologies, New Delhi

  • SV Infosys, Tirumala Tirupathi Devastanams, Tirupati, Andhra Pradesh

  • Microsoft

15. Funding Resources

The funding for the Million Book Project comes from multiple sources. The Office of the Principal Scientific Advisor to the Government of India funded the project at the Indian Institute of Science, Bangalore. Subsequently, the Department of Electronics and Information Technology (DeitY), Ministry of Communication and Information Technology (MCIT), Government of India, funded the project at the various partner centres of the DLI. Various centres have also pledged their local resources to make DLI a reality. The National Science Foundation has provided funding for scanners and for software research and development. A few book scanners and the software necessary for digital library processing have been provided by IISc Bangalore.

16. Copyright Policy

The copyright policy adopted for DLI follows the Indian Copyright Act, 1957. Materials which are free of copyright as per the Act have been scanned for DLI. However, in case of a possible error in copyright checking, if the author or publisher sends a written request for removal, the request will be validated and complied with.


The following works are included in the DLI: 


  • Free of copyright restriction, available for use by anyone

  • Expired copyright

  • Dedication to the public

  • Works ‘born public’, for instance, the works of the Government of India

  • Works in the public domain

17. Present Status

Table 2 provides the scanning centre-wise report as on February 10, 2016, as given on the DLI website, including the number of books and the number of pages available for access free of cost. The number of books scanned by these centers may be higher; the list covers only those books which are available on the website. The largest contributions available on the website are from Banasthali University, Rajasthan (101,774 books with 40,345,710 pages) and C-DAC, Noida (106,897 books with 33,107,306 pages). Table 3 provides the language-wise number of books and pages available on the website as on February 10, 2016. More than 50 per cent of the books are in English: out of the total number of books available in the DLI (537,350), 288,576 are English language books (115,197,499 pages). More than 10 per cent of the books (54,220 books with 16,766,290 pages) are in Hindi. A substantial number of books are also available in Gujarati (39,605), Sanskrit (35,431), Urdu (32,360), Bengali (25,176), and Telugu (23,370). Table 4 provides the subject-wise number of books available on the DLI website as on February 10, 2016. The largest number of books are on Literature (65,631); a substantial number are also on Science (36,994), History (31,574), Geography (21,597), and Religion (21,030). It should be noted that 309,247 books have not yet been assigned any subject.
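The percentages quoted above can be checked with a short calculation. The figures are taken from the text; the variable and function names are illustrative only.

```python
TOTAL_BOOKS = 537_350
TOTAL_PAGES = 187_656_339

# Language-wise book counts quoted in the text (as on February 10, 2016).
books_by_language = {
    "English": 288_576,
    "Hindi": 54_220,
    "Gujarati": 39_605,
    "Sanskrit": 35_431,
    "Urdu": 32_360,
    "Bengali": 25_176,
    "Telugu": 23_370,
}
UNASSIGNED_SUBJECT = 309_247


def share(count: int, total: int = TOTAL_BOOKS) -> float:
    """Percentage of the whole collection represented by `count` books."""
    return 100.0 * count / total


for language, count in books_by_language.items():
    print(f"{language:<9} {count:>8,} books  {share(count):5.2f}%")

print(f"No subject assigned: {share(UNASSIGNED_SUBJECT):.2f}% of all books")
print(f"Average pages per book: {TOTAL_PAGES / TOTAL_BOOKS:.1f}")
```

On these figures, English accounts for roughly 53.7 per cent of the collection and Hindi for about 10.1 per cent, which is consistent with the statements above. Books without an assigned subject make up about 57.6 per cent of the total.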





18. Usage

Table 5 depicts the high usage of the DLI website and books for the month of December 2010. Table 6 provides the usage for October 2012, and a graphical representation of DLI usage may be seen in Figure 1. Usage for the month of September 2015 is analysed in Table 7. These indicate that usage is very high, as people prefer using DLI for their research, for personal reading, and to supplement their education. The tables also show the unique visitors, returning visitors, and first-time visitors to the DLI website, as well as the number of pages downloaded on a particular day of the month.







19. Challenges

The process of digitizing and web uploading poses certain challenges; these are discussed as follows:

  Book selection

DLI has been able to preserve the rich culture and heritage of India only through book and paper media. Usage was found to be lower than expected at the planning stage, which could be a result of the selection process and policy. As a policy, DLI selects only copyright-free books, because it is very difficult to provide free access to books under copyright. Not all copyright-free works are equally useful to readers, so this selection policy affects usage. In addition, many books were damaged during digitization, probably due to mishandling by the scanning staff, and many libraries are therefore reluctant to provide books to DLI for scanning.

  Duplication

Even though the necessary precautions were taken, we found that many duplicate books were provided to DLI by the different RMSCs, for reasons unknown. DLI did not put them on the portal; however, the cost of digitizing those books had already been incurred.

  Incorrect and Incomplete Metadata

Because of initial negligence on the part of the staff, it was discovered at DLI that many metadata records are either incorrect or incomplete, which creates unnecessary problems and duplication of records.

  Data management

Data synchronization and management across the centres, in order to reduce duplication, is another problem that needs to be tackled. There is no concrete solution for long-term digital preservation (Barroso et al. 2003).

  OCR of Indian Language Books

The next big challenge in this project would be related to full text indexing and searching of contents of a book. The major challenge in full text search is for Indian language books because there is no suitable Optical Character Recognition (OCR) software available that provides high accuracy. Significant work needs to be done for full text searching of Indian language books.

20. Outreach

Several outreach activities have been initiated to popularize DLI and increase its usage. Several workshops have been organized in different parts of the country, and DLI has been presented at different conferences. DeitY, Ministry of Communication and Information Technology, Government of India, has also initiated two projects to organize workshops. Most of these workshops have been organized within libraries by library professionals, and the papers have been presented at library science conferences where the participants are library professionals only. This was not enough to popularize DLI among the general public. Workshops should be held through community centres or public libraries, and more training should be imparted to the public, researchers, scholars, faculty, and student communities to popularize the DLI among them.

21. Benefits

The principal benefit of the DLI has been to supplement the formal education system by making knowledge available to anyone who can read and has access. DLI has enhanced the learning process by making a huge number of the works of mankind available free to everyone around the world, and it plays a vital role in the advancement of human society. This large knowledge repository is revolutionizing research at all levels of education and providing a much-needed boost, at minimal cost, to the national education infrastructure. This impact is further enhanced by the convenience of online access and the benefit of word- and phrase-level full-text searching.


A secondary benefit of DLI is that it makes the process of locating relevant information inside books far more reliable and much easier. Students' success in finding exactly what they seek has increased, and this has enhanced their willingness to perform research using this resource. The digital library is open all 168 hours of the week, on a 24×7×365 basis. More than one individual is able to use the same book at the same time from anywhere. Thus, every book is available to a greater number of people all the time.


The DLI has produced an extensive and rich test bed for use in further textual language processing research. Many books are available in more than one language, providing a unique resource for example-based machine translation.


Many believe that information is now doubling every two years. Machine summarization, intelligent indexing, and information mining are tools that individuals will need to keep up in their disciplines, their businesses, and their personal interests. This large digitization project may enable extensive research in these areas.


DLI also works to stimulate research in Indian language technologies. Some of these are listed as follows:


  • Digital representation and storage mechanisms have been developed for Indian languages

  • A large number of applications are being built to store, process, retrieve, and present the Indian language content

  • Developments have taken place in areas such as information retrieval, optical character recognition, automatic text summarization, machine translation and transliteration (OM Transliteration Scheme for Indian languages), and handwriting recognition

  • Work has also been carried out on the Universal Dictionary, Cross Lingual Information Retrieval and Search, Speech Recognition in Indian Languages, and natural language parsing and morphological analysis

22. Future Activities

Some of the future activities required to popularize DLI and improve its usage are enumerated as follows:


  • Compensating for Creating Contents (3Cs)

  • Create content for small screen, mobile, and short period

  • Move towards larger bandwidth, even preservation of heritage, and tourism

  • Bring DLI closer to the people by arranging several workshops in different parts of the country.

References

Ambati V et al. 2006. The digital library of India project: Process, policies and architecture. ICDL Proceedings, New Delhi.


Balakrishnan et al. 2006. Digital library of India: A testbed for Indian language research. TCDL Bulletin 3(1).


Barroso et al. 2003. Web search for a planet: The Google cluster architecture. IEEE Micro 23(2): 22–28.


Kar Debal C. 2013. Digital Library of India. In Book of Abstract, 5th Quantitative and Qualitative Methods in Libraries International Conference (QQML2013), University of Piraeus Library, Rome, Italy, 4–7 June 2013, pp. 68–69. Available at <http://www.isast.org/images/Book_of_ABSTRACTS_2013.pdf>, last accessed on February 10, 2016.


Digital Library of India. Available at <http://www.dli.gov.in/> (between 25–31 March 2012, December 6, 2012 and February 10, 2016).


Digital Library of India. Available at <http://www.dli.ernet.in/> (between 25–31 March 2012, December 6, 2012 and February 10, 2016).


Census Data 2001. General Note. Census of India.