World Digital Libraries: An International Journal (WDL) Vol.9(1) June 2016 Print ISSN : 0974-567X Online ISSN : 0975-7597 |
Digital Library of India: An Initiative for the Preservation and Dissemination of the National Heritage and Rare Books and Manuscripts Collection |
Debal C Kar: University Librarian, Ambedkar University Delhi, Delhi, India. (E): debal@aud.ac.in |
DOI: 10.18329/09757597/2016/9104 |
Abstract |
Digital Library of India (DLI) is an initiative of the Government of India to digitally preserve and disseminate all the significant literary, artistic, and scientific works of humankind available in India, and to make them freely available in every corner of the world for education, study, appreciation, and future generations. The project started with the primary long-term objective of capturing all copyright-free books and manuscripts available in India in digital format. Planning began with the aim of digitizing one million books (less than one per cent of all books in all languages ever published) by 2005 in the first phase. At present, DLI provides access to 537,350 books comprising 187,656,339 pages in 46 languages (Indian and foreign). The books and manuscripts are freely accessible on the websites <www.dli.gov.in> and <www.dli.ernet.in>. The Government of India has also taken initiatives to digitize cultural heritage, such as artefacts, monuments, heritage buildings, temples, and thousand-year-old manuscripts, and to provide walk-throughs. The article discusses the initiation, planning, and execution of the project, the expenditure incurred to date, the sources of funds, and the coverage in terms of subjects, languages, types of collection (e.g., books, manuscripts, handwritten manuscripts on leaves, journals, and newspapers), libraries, cities, and centres. It also describes how the different libraries and digitization centres share their resources and network with each other for better use of the documents available since 1985 or earlier. The article further describes the philosophy behind content selection, the handling of duplicated work, and the copyright policy of the project. 
Furthermore, the process workflow for the digitization of books is summarized in terms of three major process elements: the pre-scanning process, the scanning process, and the post-scanning process, together with the tools used. The criteria for deciding which manuscripts are included in the collection for digitization are also described. The article provides the language-wise and centre-wise status of the digital collection. It also discusses the precautions taken to preserve the digitized data and the steps taken to disseminate it across the world. Other projects and actions initiated so far to educate and empower human resources, and outreach activities to popularize DLI and increase its usage, are also elaborated upon. Statistics on usage and on the number of pages downloaded in particular months are provided. In conclusion, the article describes the benefits derived from DLI as an initiative for the preservation and dissemination of national heritage, and suggests future plans to popularize DLI for rare books and manuscripts. |
1. Introduction |
Digital Library of India (DLI) is a digital collection of freely accessible rare books collected from various libraries in India. It is an initiative of the Government of India for the preservation and dissemination of the national heritage and of the rare books and manuscripts available in India. DLI aims to digitally preserve and disseminate all the significant literary, artistic, and scientific works of people available in India, and to make them freely available all over the world for education, study, appreciation, and future generations. As a first step in realizing this vision, it was proposed to create the Digital Library with a free-to-read, searchable collection of one million books, predominantly in Indian languages. The project was initiated by the Office of the Principal Scientific Advisor to the Government of India and subsequently taken over by the Department of Information Technology (DIT) (now known as the Department of Electronics and Information Technology, DeitY), Ministry of Communications and Information Technology (MCIT), Government of India. The idea was also to create a test bed for researchers to improve scanning techniques, optical character recognition, and intelligent indexing, and in general to promote research in Indian language technology. The project began with the long-term objective of capturing all copyright-free books available in India in digital format. Planning started with the aim of digitizing one million books (less than 1 per cent of all books in all languages ever published) by 2005 in the first phase. The basic idea behind the project is to explore the possibility of storing, in digital form, all the knowledge ever produced by mankind and making this content available free of charge to be browsed and searched by anyone, anywhere, at any time. This vision is the goal of the Universal Digital Library Project (UDL). 
The trend is such that any information that is not online and accessible to search engines may become unusable. In a thousand years, only a few of the paper documents we have today will survive the ravages of deterioration, loss, and outright destruction. Hence, there is an urgent need to preserve our knowledge and heritage in digital form (Balakrishnan et al. 2006). As a part of Raj Reddy’s grand vision, a mission known as the Million Books to the Web Project (MBP), to digitize one million books, was embarked upon as a collaborative project involving many countries, notably India, the USA, and China. To support UDL in India, DeitY, Ministry of Communication and Information Technology, Government of India, has sponsored a project for the digitization of copyright-free books available in India. Since its inception in November 2002, initially operating at three centres, the project has been successfully digitizing books, which are a dominant store of knowledge and culture. DLI now hosts more than 537,350 books comprising 187,656,339 pages in more than 46 languages (Indian and foreign), which have been scanned at more than 42 centres across the country. Some of the scanning centres are given in Table 1. All the scanning centres send their data (scanned images in Tagged Image File Format (TIFF) along with the metadata of each book) to the Indian Institute of Science. After checking for quality errors, the Institute hosts these documents on the Digital Library website in Portable Document Format (PDF). While scanning, we have faced many problems, which open up many research opportunities in language technologies, particularly for Indian languages. It was thought that language, especially in the case of Indian languages, should not be a barrier to information access where knowledge exists free of cost. Individual mother tongues in India number several hundred. According to the Census of India of 2001, India has 122 major languages and 1,599 other languages. 
The 2001 Census recorded 30 languages which were spoken by more than a million native speakers and 122 which were spoken by more than 10,000 people. During the project, it was realized that digital representation and storage of Indian-language text are major problems. With DeitY’s financial support through projects, digital representation and storage mechanisms have been developed for Indian languages, and a large number of applications are being built to store, process, retrieve, and present Indian language content. The DLI fosters a large number of research activities pertaining to language technologies for Indian languages and development in areas such as information retrieval, optical character recognition, text summarization, machine translation and transliteration (OM Transliteration Scheme for Indian languages), handwriting recognition, Universal Dictionary, Cross Lingual Information Retrieval and Search, Speech Recognition in Indian Languages, Automatic Summarization, and natural language parsing and morphological analysis. The project is a highly collaborative effort carried out in a distributed environment, and maintaining a uniform standard has become an important priority in such a setting. An isolated set-up does not promote collaboration across geographically distributed operation centres for server management, administration, and the resolution of process-oriented issues, so a distributed environment becomes a requisite. Therefore, scanning of books, image processing, cleaning, and web enabling have taken place concurrently at different locations. In doing so, we have faced a few issues relating to the selection of books for digitization, duplication of effort, the establishment of operating protocols, the quality of digital output, the preservation of digitized books, and user-friendly and reliable access. |
2. Reason for Digital Archives |
Existing archives of books have many shortcomings. Many works in existence today are rare and accessible only to a small population of scholars and collectors at specific geographic locations. A single wanton act of destruction can destroy an entire line of heritage. Furthermore, contrary to popular belief, libraries, museums, and publishers do not routinely maintain broadly comprehensive archives of the considered works of man. No one can afford to do this unless the archive is digital. |
3. Vision |
The vision for the DLI project was as below (http://www.dli.gov.in): All the significant literary, artistic, and scientific works of mankind can be digitally preserved and made freely available, in every corner of the world, for education, study, appreciation, and for all our future generations. |
4. Mission |
The mission is to create a portal for the Digital Library of India which will foster creativity and free access to all human knowledge. As a first step in realizing this mission, it is proposed to create the Digital Library with a free-to-read, searchable collection of one million books, predominantly in Indian languages, available to everyone over the Internet. This portal will also become an aggregator of all the knowledge and digital content created by other digital library initiatives in India. Very soon we expect that this portal would provide a gateway to Indian digital libraries in science, arts, culture, music, movies, traditional medicine, palm leaves, and many more. The result will be a unique resource accessible to anyone in the world 24×7, without regard to socioeconomic background or nationality. |
5. Goals |
The primary long-term objective is to capture all books in digital format. As a first step we are planning to demonstrate the feasibility by undertaking to digitize one million books (less than one per cent of all books in all languages ever published) by 2005. A secondary objective of this project will be to provide a test bed that will support other researchers who are working on improved scanning techniques, optical character recognition, and indexing. |
6. Content Selection |
DLI envisages developing a collection of books by adopting an approach as described below. The DLI has adhered to the copyright law. |
|
Creating one digital copy and mirroring it in different locations will suffice, and will support simultaneous use by multiple readers at any time. Books describing ancient historical events of India, as well as cultural and social books in different languages, are digitized. These materials are obtained from authorized universities, institutes, libraries of religious organizations, and public libraries in India. Palm leaves, journals, and manuscripts are also digitized. 
|
Materials which are free of copyright as per the Indian Copyright Act, 1957, have been scanned for DLI. The first selected materials were government textbooks published in 11 of the 18 official languages of India. |
|
DLI will seek publisher permission to scan books that are not copyright free. However, there are numerous difficulties, in particular the lack of publisher records, the return of copyright to authors, and other circumstances. Publishers increasingly see that the digital presentation of their works can attract buyers, and they are interested in exploring ways in which their out-of-print titles may be returned to profitability. Continued work with publishers over the course of this project may attract many of them to it, which would greatly enrich the content made available in digital format to everyone. |
7. Workflow |
The procurement team identifies the books to be digitized (Ambati et al. 2006). The books are then sent to the various scanning locations operated under the regional mega scanning centre (RMSC). Prior to digitization, an expert librarian enters the regular metadata for the books. The metadata is then uploaded into the DLI system so that duplicates can be checked against the existing DLI records. Books are then digitized and sent back to the library; the digitized product is tested against quality standards and approved for uploading to the DLI servers. The process workflow for the digitization of books is summarized in terms of the following three major process elements (http://www.dli.gov.in):
The pre-scanning process involves the following stages:
The scanning process involves the following stages:
The post-scanning process comprises the following steps:
Once quality assurance certifies the quality of the content and meta information, the content is uploaded; otherwise, it remains offline until it has been corrected for the defects found. |
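The duplicate check that precedes scanning in the workflow above can be illustrated with a minimal Python sketch. It is not the DLI implementation: the record fields (`title`, `author`, `year`), the function names, and the sample records are all assumptions made for illustration. The idea is simply to normalize the metadata entered by the librarian into a comparison key, so that differences in case, spacing, or punctuation do not hide a book that is already on record.

```python
import unicodedata


def _normalize(text):
    """Casefold, replace punctuation with spaces, and collapse whitespace.

    Punctuation is detected by Unicode category so that vowel signs and other
    combining marks in Indian scripts are left untouched.
    """
    text = unicodedata.normalize("NFKC", str(text)).casefold()
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )
    return " ".join(cleaned.split())


def metadata_key(title, author, year):
    """Hypothetical comparison key built from three assumed metadata fields."""
    return (_normalize(title), _normalize(author), str(year).strip())


def split_new_and_duplicate(existing_records, incoming_records):
    """Separate incoming records into new books and likely duplicates."""
    seen = {
        metadata_key(r["title"], r["author"], r["year"]) for r in existing_records
    }
    new_books, duplicates = [], []
    for record in incoming_records:
        key = metadata_key(record["title"], record["author"], record["year"])
        if key in seen:
            duplicates.append(record)
        else:
            seen.add(key)  # also catches repeats within the incoming batch
            new_books.append(record)
    return new_books, duplicates


# Illustrative sample records only; they are not taken from the DLI database.
existing = [{"title": "Gitanjali", "author": "Tagore, Rabindranath", "year": 1913}]
incoming = [
    {"title": "  GITANJALI. ", "author": "Tagore Rabindranath", "year": "1913"},
    {"title": "Godan", "author": "Premchand", "year": 1936},
]
new_books, duplicates = split_new_and_duplicate(existing, incoming)
```

An exact match on a normalized key catches only trivial variations. Variant spellings, transliterations, and incomplete metadata would need fuzzier matching, which is outside the scope of this sketch.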
8. Tools Used |
The specialty tools developed, customized, and used for the DLI are as follows:
|
9. Cooperation and Collaboration |
The Indian Institute of Science (IISc), Carnegie Mellon University (CMU), International Institute of Information Technology, Hyderabad (IITH), and many other academic, religious, and government organizations acting as content creation centres, as mentioned below, have become partners in the DLI initiative for the digitization and preservation of Indian heritage present in the form of books, manuscripts, art, and music. The scanning operations and the preservation of digital data take place at different RMSCs across India. These RMSCs function as individual entities, each with several scanning units in different locations in its region, and operate in parallel and independently across the country. The functions of an RMSC include collecting books from different libraries of the region, distributing them among scanning locations within the region, returning the books after digitization, gathering the digitized content back from the scanning locations, and hosting it. Every scanning location has trained personnel to carry out the scanning and image processing operations. Each centre brings its own unique collection of literature, as well as that of libraries in the surrounding areas, into the digital library. Many other academic, religious, and other institutions, as well as many individual authors, have cooperated by contributing their collections and books to the DLI free of cost. The following institutions have collaborated in implementing the DLI: |
10. Coordination and Research Centres in India |
|
11. Academic Institutions |
|
12. Religious and Cultural Institutions |
|
13. Government and Research Agencies |
|
14. Industrial Partners |
|
15. Funding Resources |
The funding for the Million Book Project comes from multiple sources. The Office of the Principal Scientific Advisor to the Government of India funded the project at the Indian Institute of Science, Bangalore. Subsequently, the Department of Electronics and Information Technology (DeitY), Ministry of Communication and Information Technology (MCIT), Government of India, funded the project at the various partner centres implementing the DLI. Various centres have also pledged their local resources to make DLI a reality. The National Science Foundation has provided funding for scanners and for software research and development. A few book scanners and the software necessary for digital library processing have been provided by IISc Bangalore. |
16. Copyright Policy |
The copyright policy adopted for DLI follows the Indian Copyright Act, 1957. Materials which are free of copyright as per the Act have been scanned for DLI. However, in case of a possible error in copyright checking, if the author or publisher sends a written request for removal, the request will be validated and complied with. The following works are included in the DLI:
|
17. Present Status |
Table 2 provides the scanning centre-wise report as on February 10, 2016, as given on the DLI website, including the number of books and the number of pages available for access free of cost. The number of books scanned by these centres may be higher; the list covers only those books which are available on the website. The largest contributions available on the website are from Banasthali University, Rajasthan (101,774 books with 40,345,710 pages) and C-DAC, Noida (106,897 books with 33,107,306 pages). Table 3 provides the language-wise number of books and pages available on the website as on February 10, 2016. The largest share of books, more than 50 per cent, is in English: out of the 537,350 books available in the DLI, 288,576 (115,197,499 pages) are in English. More than 10 per cent of the books (54,220 books, 16,766,290 pages) are in Hindi. A substantial number of books are also available in Gujarati (39,605), Sanskrit (35,431), Urdu (32,360), Bengali (25,176), and Telugu (23,370). Table 4 provides the subject-wise number of books available on the DLI website as on February 10, 2016. The largest number of books are on Literature (65,631); substantial numbers are also found in Science (36,994), History (31,574), Geography (21,597), and Religion (21,030). It should be noted that 309,247 books have not yet been assigned any subject. |
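The percentage statements above follow directly from the counts quoted in the text. As a small worked check, the Python sketch below recomputes each share from those counts; only figures stated in this section are used, and the variable and function names are illustrative.

```python
# Counts quoted in the text (status as on February 10, 2016).
TOTAL_BOOKS = 537_350
BOOKS_WITHOUT_SUBJECT = 309_247

books_by_language = {
    "English": 288_576,
    "Hindi": 54_220,
    "Gujarati": 39_605,
    "Sanskrit": 35_431,
    "Urdu": 32_360,
    "Bengali": 25_176,
    "Telugu": 23_370,
}


def share_percent(count, total=TOTAL_BOOKS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


language_shares = {
    language: share_percent(count) for language, count in books_by_language.items()
}

# Books in languages other than the seven listed above.
other_language_books = TOTAL_BOOKS - sum(books_by_language.values())

# Share of the collection that has no subject assigned yet.
unassigned_subject_share = share_percent(BOOKS_WITHOUT_SUBJECT)

for language, share in language_shares.items():
    print(f"{language}: {share} per cent")
```

The same calculation shows that more than half of the collection still lacks a subject assignment, which is relevant to the metadata problems discussed under Challenges.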
18. Usage |
Table 5 shows the usage of the DLI website and books for the month of December 2010, and Table 6 shows the usage for October 2012; a graphical representation of DLI usage is given in Figure 1. Usage for the month of September 2015 is analysed in Table 7. These figures indicate that usage is very high, as people use DLI for their research, for personal reading, and to supplement their education. The tables also show the unique visitors, returning visitors, and first-time visitors to the DLI website, as well as the number of pages downloaded on particular days of the month. |
19. Challenges |
The process of digitizing and web uploading poses certain challenges; these are discussed as follows: |
|
DLI has been able to preserve the rich culture and heritage of India only in so far as it is recorded in books and paper media. It was found that usage is not as high as was expected at the planning stage. This could be because of the selection process and policy. As a matter of policy, DLI selects only books which are copyright free, since it is very difficult to provide free access to books under copyright. Not all copyright-free works are equally useful to readers, so this selection policy affects usage. In addition, many books were damaged during digitization, probably because of mishandling by the scanning staff, and as a result many libraries are reluctant to provide books to DLI for scanning. 
|
Even though the necessary precautions were taken, many duplicate books were supplied to DLI by the different RMSCs, for reasons unknown. DLI did not put them on the portal; however, the cost of digitizing those books had already been incurred. 
|
Because of initial negligence on the part of the staff, many metadata records were found at DLI to be incorrect or incomplete, which creates unnecessary problems and duplication of records. 
|
Data synchronization and management across the centres, in order to reduce duplication, is another problem that needs to be tackled. There is also no concrete solution for long-term digital preservation (Barroso et al. 2003). 
|
The next big challenge in this project would be related to full text indexing and searching of contents of a book. The major challenge in full text search is for Indian language books because there is no suitable Optical Character Recognition (OCR) software available that provides high accuracy. Significant work needs to be done for full text searching of Indian language books. |
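To make concrete what full text indexing involves once page text is available, the following minimal Python sketch builds an inverted index from words to the pages that contain them. It is purely illustrative and is not part of the DLI software: the page identifiers, function names, and sample text are assumptions, and the sketch presupposes that reasonably accurate text has already been obtained, which is exactly the step that remains difficult for Indian language books.

```python
from collections import defaultdict


def build_index(pages):
    """Build an inverted index.

    pages maps a (book_id, page_number) location to that page's text.
    The result maps each word to the set of locations containing it.
    """
    index = defaultdict(set)
    for location, text in pages.items():
        # Whitespace tokenization keeps words in any script intact.
        for word in text.casefold().split():
            index[word].add(location)
    return index


def search(index, query):
    """Return the locations that contain every word of the query."""
    words = query.casefold().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical page texts, standing in for OCR output.
sample_pages = {
    ("book-1", 1): "History of the digital library",
    ("book-1", 2): "The library preserves rare manuscripts",
    ("book-2", 1): "Rare books and the history of printing",
}
page_index = build_index(sample_pages)
hits = search(page_index, "rare history")
```

Because the index is built from the recognized text, any OCR error produces a word that will never match a correctly spelled query, which is why recognition accuracy directly limits the usefulness of full text search.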
20. Outreach |
Several outreach activities have been initiated to popularize DLI and increase its usage. Workshops have been organized in different parts of the country, and DLI has been presented at various conferences. DeitY, Ministry of Communication and Information Technology, Government of India, has also initiated two projects to organize workshops. However, most of these workshops were organized within libraries by library professionals, and the papers were presented at library science conferences where the participants were library professionals only. This was not enough to popularize DLI among the general public. Workshops should be conducted through community centres or public libraries, and more training should be imparted to the public, researchers, scholars, faculty, and student communities to popularize the DLI among them. |
21. Benefits |
The principal benefit of the DLI has been to supplement the formal education system by making knowledge available to anyone who can read and has access. DLI has enhanced the learning process by making a huge number of the works of mankind available free to everyone around the world, and it plays a vital role in the advancement of human society. This large knowledge repository is revolutionizing research at all levels of education and providing a much-needed boost, at minimal cost, to the national education infrastructure. The impact is further enhanced by the convenience of online access and the benefit of word-level and phrase-level full text searching. A secondary benefit of DLI is that it makes locating relevant information inside books far more reliable and much easier. Students' success in finding exactly what they seek has increased, and this has enhanced their willingness to perform research using this resource. The digital library is open all 168 hours of the week, every day of the year. More than one individual can use the same book at the same time from anywhere; thus, every book is available to a greater number of people all the time. The DLI has also produced an extensive and rich test bed for use in further textual language processing research. Many books are available in more than one language, providing a unique resource for example-based machine translation. Many believe that information is now doubling every two years. Machine summarization, intelligent indexing, and information mining are tools that individuals will need in order to keep up in their disciplines, their businesses, and their personal interests. This large digitization project may enable extensive research in these areas. DLI also works to stimulate research in Indian language technologies. Some of these are listed as follows:
|
22. Future Activities |
Some of the future activities required to plan and popularize the better usage of DLI have been enumerated as follows:
|
References |
Ambati V et al. 2006. The digital library of India project: Process, policies and architecture. ICDL Proceedings, New Delhi.
Balakrishnan et al. 2006. Digital library of India: A testbed for Indian language research. TCDL Bulletin 3(1).
Barroso et al. 2003. Web search for a planet: The Google cluster architecture. IEEE Micro 23(2): 22–28.
Kar Debal C. 2013. Digital Library of India. In Book of Abstract, 5th Quantitative and Qualitative Methods in Libraries International Conference (QQML2013), University of Piraeus Library, Rome, Italy, 4–7 June 2013, pp. 68–69. Available at <http://www.isast.org/images/Book_of_ABSTRACTS_2013.pdf>, last accessed on February 10, 2016.
Digital Library of India. Available at <http://www.dli.gov.in/> (between 25–31 March 2012, December 6, 2012 and February 10, 2016).
Digital Library of India. Available at <http://www.dli.ernet.in/> (between 25–31 March 2012, December 6, 2012 and February 10, 2016).
Census Data 2001. General Note. Census of India. |