Book scanning

Book scanning or book digitization (also: magazine scanning or magazine digitization) is the process of converting physical books and magazines into digital media such as images, electronic text, or electronic books (e-books) by using an image scanner.

Digital books can be easily distributed, reproduced, and read on-screen. Common file formats are DjVu, Portable Document Format (PDF), and Tagged Image File Format (TIFF). Optical character recognition (OCR) converts the raw page images into a digital text format such as ASCII, which reduces the file size and allows the text to be reformatted, searched, or processed by other applications.
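The size reduction from converting page images to text can be illustrated with rough arithmetic. The following Python sketch compares an uncompressed grayscale page image with the same page stored as ASCII text. The page dimensions (1800 by 2700 pixels) and the figure of 2000 characters per page are assumptions chosen for illustration; real files in PDF, DjVu, or TIFF are normally compressed, so actual ratios are smaller.

```python
def raw_image_bytes(width_px, height_px, bits_per_pixel=8):
    """Size of an uncompressed raster image in bytes."""
    return width_px * height_px * bits_per_pixel // 8


def text_bytes(text):
    """Size of a string stored as ASCII, one byte per character."""
    return len(text.encode("ascii"))


# Assumed values, for illustration only.
page_pixels = (1800, 2700)   # hypothetical page image size in pixels
page_text = "x" * 2000       # hypothetical 2000 characters of page text

image_size = raw_image_bytes(*page_pixels)
text_size = text_bytes(page_text)

print(image_size)               # 4860000
print(text_size)                # 2000
print(image_size // text_size)  # 2430
```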

Image scanners may be manual or automated. In an ordinary commercial image scanner, the book is placed on a flat glass plate (or platen), and a light and optical array moves across the book underneath the glass. In manual book scanners, the glass plate extends to the edge of the scanner, making it easier to line up the book's spine. Other book scanners place the book face up in a v-shaped frame, and photograph the pages from above. Pages may be turned by hand or by automated paper transport devices. Glass or plastic sheets are usually pressed against the page to flatten it.

After scanning, software adjusts the document images by aligning, cropping, and editing them, then converts them to text and to the final e-book form. Human proofreaders usually check the output for errors.
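As a minimal sketch of the cropping step, the Python example below finds the bounding box of the dark (printed) pixels in a grayscale page and trims the white margins. It represents an image as a plain list of rows of brightness values from 0 to 255, and the threshold of 200 is an assumed value; production software would typically work on real image files and also correct skew, which is not shown here.

```python
def content_bounding_box(image, white_threshold=200):
    """Return (top, left, bottom, right) around all pixels darker than the
    threshold, with bottom and right exclusive, or None for a blank page."""
    rows = [y for y, row in enumerate(image)
            if any(p < white_threshold for p in row)]
    if not rows:
        return None
    cols = [x for x in range(len(image[0]))
            if any(row[x] < white_threshold for row in image)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def crop(image, box):
    """Cut the image down to the given bounding box."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]


# A tiny hypothetical "page": 255 is white paper, low values are ink.
page = [
    [255, 255, 255, 255, 255],
    [255,   0,  30, 255, 255],
    [255, 255,  10, 255, 255],
    [255, 255, 255, 255, 255],
]

box = content_bounding_box(page)
print(box)              # (1, 1, 3, 3)
print(crop(page, box))  # [[0, 30], [255, 10]]
```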

Scanning at 118 dots/centimeter (300 dpi) is adequate for conversion to digital text output, but for archival reproduction of rare, elaborate or illustrated books, much higher resolution is used. High-end scanners capable of thousands of pages per hour can cost thousands of dollars, but do-it-yourself (DIY), manual book scanners capable of 1200 pages per hour have been built for US$300.[1]
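The relationship between the two resolution units quoted above, and the resulting image size, can be checked with a short Python sketch. The 6 by 9 inch page size is an assumed example, not a figure from the sources.

```python
CM_PER_INCH = 2.54


def dpi_to_dots_per_cm(dpi):
    """Convert dots per inch to dots per centimeter."""
    return dpi / CM_PER_INCH


def pixel_dimensions(width_in, height_in, dpi):
    """Pixel width and height of a page scanned at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


print(round(dpi_to_dots_per_cm(300)))  # 118
print(pixel_dimensions(6, 9, 300))     # (1800, 2700), for an assumed 6 x 9 inch page
```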

Internet Archive Scribe book scanner in 2011
Internet Archive book scanner

Commercial book scanners

Sketch of a V-shaped book scanner from Atiz
Sketch of a typical manual book scanner

Commercial book scanners differ from ordinary flatbed scanners: they usually consist of a high-quality digital camera, with light sources on either side, mounted on a frame that gives a person or machine easy access to turn the pages of the book. Some models use V-shaped book cradles, which support the book's spine and also center the book automatically.

The advantage of this type of scanner is its speed: it is much faster than an overhead scanner.

Large-scale projects

Projects like Project Gutenberg (est. 1971), Million Book Project (est. circa 2001), Google Books (est. 2004), and the Open Content Alliance (est. 2005) scan books on a large scale.

One of the main challenges is the sheer volume of books that must be scanned. In 2010 the total number of works appearing as books in human history was estimated to be around 130 million.[2] All of these must be scanned and then made searchable online for the public to use as a universal library. Large organizations currently rely on three main approaches: outsourcing, scanning in-house using commercial book scanners, and scanning in-house using robotic scanning solutions.

With outsourcing, books are often shipped to low-cost providers in India or China for scanning. Alternatively, for reasons of convenience, safety, and improved technology, many organizations choose to scan in-house, using either overhead scanners, which are time-consuming, or digital camera-based scanning machines, which are substantially faster and are used by the Internet Archive as well as Google. Traditional methods have included cutting off the book's spine and scanning the pages in a scanner with automatic page-feeding capability, with subsequent rebinding of the loose pages.

Once a page is scanned, its text is either entered manually or extracted via OCR, which is another major cost of book scanning projects.

Due to copyright issues, most scanned books are those that are out of copyright; however, Google Book Search is known to scan books still protected under copyright unless the publisher specifically prohibits this.

Collaborative projects

There are many collaborative digitization projects throughout the United States. Two of the earliest projects were the Collaborative Digitization Project in Colorado and NC ECHO – North Carolina Exploring Cultural Heritage Online,[3] based at the State Library of North Carolina.

These projects establish and publish best practices for digitization and work with regional partners to digitize cultural heritage materials. Additional criteria for best practices have more recently been established in the UK, Australia and the European Union.[4] Wisconsin Heritage Online[5] is a collaborative digitization project modeled after the Colorado Collaborative Digitization Project. Wisconsin uses a wiki[6] to build and distribute collaborative documentation. Georgia's collaborative digitization program, the Digital Library of Georgia,[7] presents a seamless virtual library on the state's history and life, including more than a hundred digital collections from 60 institutions and 100 agencies of government. The Digital Library of Georgia is a GALILEO[8] initiative based at the University of Georgia Libraries.

In the twentieth century, the Hill Museum and Manuscript Library photographed books in Ethiopia that were subsequently destroyed amidst political violence in 1975. The library has since worked to photograph manuscripts in Middle Eastern countries.[9]

In South Asia, the Nanakshahi Trust is digitizing manuscripts in Gurmukhī script.

In Australia, there have been many collaborative projects between the National Library of Australia and universities to improve the repository infrastructure in which digitized information would be stored.[10] These projects include the ARROW (Australian Research Repositories Online to the World) project and the APSR (Australian Partnership for Sustainable Repository) project.

Destructive scanning methods

For book scanning on a low budget, the least expensive method to scan a book or magazine is to cut off the binding. This converts the book or magazine into a sheaf of looseleaf papers, which can then be loaded into a standard automatic document feeder (ADF) and scanned using inexpensive and common scanning technology. While this is not a desirable solution for very old and uncommon books, it is a useful tool for book and magazine scanning where the book is not an expensive collector's item and replacement of the scanned content is easy. There are two technical difficulties with this process, first with the cutting and second with the scanning.


More precise and less destructive than cutting pages with a paper guillotine, razor, or scissors is the technique of meticulous unbinding by hand, assisted with tools. This technique has been successfully employed for tens of thousands of pages of archival original paper scanned for the Riazanov Library digital archive project from newspapers, magazines, and pamphlets, varying from 50 to 100 years old and more, and often composed of fragile, brittle paper. Although the monetary value for some collectors (and for most sellers of this sort of material) is destroyed by unbinding, unbinding in many cases actually greatly assists preservation of the physical pages themselves, making them more accessible to researchers and less likely to be damaged when subsequently examined. The downside is that unbound stacks of pages are "fluffed up", and therefore more exposed to oxygen in the air, which may in some cases (theoretically) speed deterioration. This can be addressed by putting weights on the pages after they are unbound and storing them in appropriate containers.

Hand unbinding preserves text that runs into the gutters of bindings and, most critically, allows easier and more complete high-quality scans of two-page-wide material, such as center cartoons, graphic art, and photos in magazines. The digital archive of The Liberator 1918-1924 on Marxist Internet Archive nicely demonstrates the quality of two-page-wide graphic art scans made possible by careful hand unbinding prior to flatbed or other scanning.

Unbinding techniques vary with the binding technology, from simply removing a few staples, to unbending and removing nails, to meticulously grinding down layers of glue on the spine of a book to precisely the right point, followed by laborious removal of the string used to hold the book together.

Note that with some newspapers (such as Labor Action 1950-1952) there are columns on the center facing pages that run across the fold between the pages. Chopping off part of the spine of a bound volume of such papers will lose part of this text. Even the Greenwood Reprint of this publication failed to preserve the text content of those center columns, cutting off significant amounts of text there. Only when bound volumes of the original newspaper were meticulously unbound, and the opened pair of center pages was scanned as a single page on a flatbed scanner, was the center column content made digitally available. Alternatively, one can present the two facing center pages as three scans: one of each individual page, and one of a page-sized area centered over the fold between the two pages.
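The three-scan presentation of a center spread can be sketched as three crop rectangles taken from a single scan of the opened pages. The Python example below is illustrative only: it assumes the fold lies exactly at the horizontal midpoint of the scan, and the function name and the (left, top, right, bottom) box convention are choices made for this sketch.

```python
def center_spread_crops(width, height):
    """Return crop boxes (left, top, right, bottom) for the left page,
    the right page, and a page-sized area centered on the fold."""
    half = width // 2
    quarter = width // 4
    left_page = (0, 0, half, height)
    right_page = (half, 0, width, height)
    center_area = (quarter, 0, quarter + half, height)
    return left_page, right_page, center_area


# Hypothetical spread scanned at 3600 x 2700 pixels.
for box in center_spread_crops(3600, 2700):
    print(box)
# (0, 0, 1800, 2700)
# (1800, 0, 3600, 2700)
# (900, 0, 2700, 2700)
```

The center box overlaps both pages by half a page width, so a column printed across the fold appears whole in the third image.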


One method of cutting a stack of 500 to 1000 pages in one pass is accomplished with a guillotine paper cutter. This is a large steel table with a paper vise that screws down onto the stack and firmly secures it before cutting. The cut is accomplished with a large sharpened steel blade which moves straight down and cuts the entire length of each sheet all at once. A lever on the blade permits several hundred pounds of force to be applied to the blade for a quick one-pass cut.

A clean cut through a thick stack of paper cannot be made with a traditional inexpensive sickle-shaped hinged paper cutter. These cutters are only intended for a few sheets, with up to ten sheets being the practical cutting limit. A large stack of paper applies torsional forces on the hinge, pulling the blade away from the cutting edge on the table. The cut becomes more inaccurate as the cut moves away from the hinge, and the force required to hold the blade against the cutting edge increases as the cut moves away from the hinge.

The guillotine cutting process dulls the blade over time, requiring that it be resharpened. Coated paper such as slick magazine paper dulls the blade more quickly than plain book paper, due to the kaolinite clay coating. Additionally, removing the binding of an entire hardcover book causes excessive wear due to cutting through the cover's stiff backing material. Instead the outer cover can be removed and only interior pages need be cut.

An alternate method of unbinding books is to use a table saw. While this method is potentially dangerous and does not leave as smooth an edge as the guillotine paper cutter method, it is more readily available to the average person. The ideal method is to clamp the book between two thick boards using heavy machine screws to provide the clamping force. The entire wood and book package is fed through the table saw using the rip fence as a guide. A sharp fine carbide tooth blade is ideal for generating an acceptable cut. The quality of the cut depends on the blade, feed rate, type of paper, paper coating, and binding material.


Once the paper is liberated from the spine, it can be scanned one sheet at a time using a traditional flatbed scanner or automatic document feeder.

Pages with a decorative riffled edging or curving in an arc due to a non-flat binding can be difficult to scan using an ADF, which is designed to scan pages of uniform shape and size; variably sized or shaped pages can lead to improper scanning. The riffled or curved edges can be guillotined off to render the outer edges flat and smooth before the binding is cut.

The coated paper of magazines and bound textbooks can make them difficult for the rollers in an ADF to pick up and guide along the paper path. An ADF which uses a series of rollers and channels to flip sheets over may jam or misfeed when fed coated paper. Generally there are fewer problems when the paper path is as straight as possible, with few bends and curves. The clay can also rub off the paper over time and coat the sticky pickup rollers, loosening their grip on the paper. The ADF rollers may need periodic cleaning to prevent this slipping.

Magazines can pose a bulk-scanning challenge due to small nonuniform sheets of paper in the stack, such as magazine subscription cards and fold out pages. These need to be removed before the bulk scan begins, and are either scanned separately if they include worthwhile content, or are simply left out of the scan process.

Non-destructive scanning

An example of a DIY non-destructive book scanner/digitizer, with the book downwards design, allowing gravity to flatten pages

Software-driven machines and robots have been developed to scan books without unbinding them, both preserving the physical document and creating a digital image archive of its current state. This trend is due in part to ever-improving imaging technologies that allow a high-quality digital archive image to be captured in a reasonably short time with little or no damage to a rare or fragile book.

The first fully automated book scanner was the DL (Digitizing Line) scanner, manufactured by 4DigitalBooks in Switzerland. The first known installation was at Stanford University in 2001.[11][12] The scanner received a Dow Jones Runner-Up award under Business Applications Category in 2001.[13]


Most high-end commercial robotic scanners use traditional air and suction technology to turn pages, while others use alternative approaches such as bionic fingers. Some scanners use ultrasonic or photoelectric sensors to detect when two pages are turned at once and so prevent pages from being skipped. With reports of machines being able to scan up to 2900 pages per hour,[14] robotic book scanners are specifically designed for large-scale digitization projects.

Google's patent 7508978 shows an infrared camera technology which allows detection and automatic adjustment of the three-dimensional shape of the page.[15][16] Researchers from the University of Tokyo have an experimental non-destructive book scanner[17] that includes a 3D surface scanner to allow images of a curved page to be straightened in software. Thus the book or magazine can be scanned as quickly as the operator can flip through the pages, about 200 pages per minute.
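The scanning rates quoted in this article can be compared with simple arithmetic. The Python sketch below converts each rate to the time needed for one book; the 300-page book is an assumed example, and the figures are quoted peak rates that ignore setup, book handling, and post-processing.

```python
# Rates quoted in this article, in pages per hour.
RATES_PAGES_PER_HOUR = {
    "DIY manual scanner": 1200,
    "robotic scanner": 2900,
    "page-flipping prototype": 200 * 60,  # 200 pages per minute
}


def minutes_to_scan(pages, pages_per_hour):
    """Minutes needed to scan a given number of pages at a steady rate."""
    return pages * 60 / pages_per_hour


BOOK_PAGES = 300  # assumed book length, for illustration only

for name, rate in RATES_PAGES_PER_HOUR.items():
    print(f"{name}: {minutes_to_scan(BOOK_PAGES, rate):.1f} minutes")
# DIY manual scanner: 15.0 minutes
# robotic scanner: 6.2 minutes
# page-flipping prototype: 1.5 minutes
```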



  1. "DIY High-Speed Book Scanner from Trash and Cheap Cameras". Retrieved 19 January 2014.
  2. Taycher, Leonid (2010-08-05). "As of Aug 5, 2010, Google estimates that there are 129,864,880 different books in the world". Retrieved 2014-08-08.
  3. "North Carolina ECHO: Exploring Cultural Heritage Online".
  4. "Digital Libraries: Principles and Practice in a Global Environment". Ariadne. April 2005.
  5. "Recollection Wisconsin". Recollection Wisconsin.
  6. "Wisconsin Heritage Online [licensed for non-commercial use only] / FrontPage".
  7. "Welcome to the Digital Library of Georgia".
  8. "GALILEO".
  9. "Codices decoded". The Economist. 18 December 2010. p. 151.
  10. Ferguson, Stuart, ed. (2007). Libraries in the twenty-first century: Charting new directions in information services. p. 84.
  11. Davies, John. "4DigitalBooks launches digital book scanner". PrintWeek.
  12. "Stanford University Libraries (SUL) Robotic Book Scanner". Stanford University Libraries (SUL).
  13. "Technology Innovation Awards: Winners 2001". Dow Jones. Archived from the original on 2015-09-23. Retrieved 2017-08-07.
  14. Rapp, David. "Product Watch: Library Scanners". Library Journal. Retrieved 11 May 2014.
  15. US 7508978, Lefevere, Francois-Marie & Marin Saric, "Detection of grooves in scanned images", issued March 24, 2009, assigned to Google.
  16. Clements, Maureen (April 30, 2009). "The Secret Of Google's Book Scanning Machine Revealed".
  17. Guizzo, Erico (2010-03-17). "Superfast Scanner Lets You Digitize Book By Flipping Pages". IEEE Spectrum. Retrieved 2014-08-08.

External links

Authors Guild

The Authors Guild is America's oldest and largest professional organization for writers and provides advocacy on issues of free expression and copyright protection. Since its founding in 1912 as the Authors League of America, it has counted among its board members notable authors of fiction, nonfiction, and poetry, including numerous winners of the Nobel and Pulitzer Prizes and National Book Awards. It has over 9,000 members, who receive free legal advice and guidance on contracts with publishers as well as insurance services and assistance with subsidiary licensing and royalties. The group lobbies at the national and state levels on censorship and tax concerns, and it has initiated or supported several major lawsuits in defense of authors' copyrights. In one of those, a class-action suit claiming that Google acted illegally when it scanned millions of copyrighted books without permission, the Authors Guild lost on appeal in the United States Court of Appeals for the Second Circuit.

Recently the Authors Guild has fought the consolidation of the publishing industry through the mergers of large publishers, and it has pressed the publishers to increase royalty rates for ebooks.

Digital library

A digital library, digital repository, or digital collection, is an online database of digital objects that can include text, still images, audio, video, or other digital media formats. Objects can consist of digitized content like print or photographs, as well as originally produced digital content like word processor files or social media posts. In addition to storing content, digital libraries provide means for organizing, searching, and retrieving the content contained in the collection.

Digital libraries can vary immensely in size and scope, and can be maintained by individuals or organizations. The digital content may be stored locally, or accessed remotely via computer networks. These information retrieval systems are able to exchange information with each other through interoperability and sustainability.


Digitization

Digitization, less commonly digitalization, is the process of converting information into a digital (i.e. computer-readable) format, in which the information is organized into bits. The result is the representation of an object, image, sound, document or signal (usually an analog signal) by generating a series of numbers that describe a discrete set of its points or samples. The result is called digital representation or, more specifically, a digital image, for the object, and digital form, for the signal. In modern practice, the digitized data is in the form of binary numbers, which facilitate computer processing and other operations, but, strictly speaking, digitizing simply means the conversion of analog source material into a numerical format; the decimal or any other number system can be used instead.

Digitization is of crucial importance to data processing, storage and transmission, because it "allows information of all kinds in all formats to be carried with the same efficiency and also intermingled". Unlike analog data, which typically suffers some loss of quality each time it is copied or transmitted, digital data can, in theory, be propagated indefinitely with absolutely no degradation. This is why it is a favored way of preserving information for many organisations around the world.


E-book

An electronic book, also known as an e-book or eBook, is a book publication made available in digital form, consisting of text, images, or both, readable on the flat-panel display of computers or other electronic devices. Although sometimes defined as "an electronic version of a printed book", some e-books exist without a printed equivalent. E-books can be read on dedicated e-reader devices, but also on any computer device that features a controllable viewing screen, including desktop computers, laptops, tablets and smartphones.

In the 2000s, there was a trend of print and e-book sales moving to the Internet, where readers buy traditional paper books and e-books on websites using e-commerce systems. With print books, readers are increasingly browsing through images of the covers of books on publisher or bookstore websites and selecting and ordering titles online; the paper books are then delivered to the reader by mail or another delivery service. With e-books, users can browse through titles online, and then when they select and order titles, the e-book can be sent to them online or the user can download the e-book. At the start of 2012 in the U.S., more e-books were published online than were distributed in hardcover. The main reasons for people buying e-books online are possibly lower prices, increased comfort (as they can buy from home or on the go with mobile devices) and a larger selection of titles. With e-books, "[e]lectronic bookmarks make referencing easier, and e-book readers may allow the user to annotate pages." "Although fiction and non-fiction books come in e-book formats, technical material is especially suited for e-book delivery because it can be [electronically] searched" for keywords. In addition, for programming books, code examples can be copied. The amount of e-book reading is increasing in the U.S.; by 2014, 28% of adults had read an e-book, compared to 23% in 2013. This is increasing because, by 2014, 50% of American adults had an e-reader or a tablet, compared to 30% owning such devices in 2013.

Eighteenth Century Collections Online

Eighteenth Century Collections Online (ECCO) is a digital collection of books published in Great Britain during the 18th century. Gale, an education publishing company in the United States, assembled the collection by digitally scanning microfilm reproductions of 136,291 titles. Documents scanned after 2002 are added to a second collection, ECCO II. As of January 2014, ECCO II comprises 46,607 titles.


Google

Google LLC is an American multinational technology company that specializes in Internet-related services and products, which include online advertising technologies, search engine, cloud computing, software, and hardware. It is considered one of the Big Four technology companies, alongside Amazon, Apple and Facebook. Google was founded in 1998 by Larry Page and Sergey Brin while they were Ph.D. students at Stanford University in California. Together they own about 14 percent of its shares and control 56 percent of the stockholder voting power through supervoting stock. They incorporated Google as a privately held company on September 4, 1998. An initial public offering (IPO) took place on August 19, 2004, and Google moved to its headquarters in Mountain View, California, nicknamed the Googleplex. In August 2015, Google announced plans to reorganize its various interests as a conglomerate called Alphabet Inc. Google is Alphabet's leading subsidiary and will continue to be the umbrella company for Alphabet's Internet interests. Sundar Pichai was appointed CEO of Google, replacing Larry Page who became the CEO of Alphabet.

The company's rapid growth since incorporation has triggered a chain of products, acquisitions, and partnerships beyond Google's core search engine (Google Search). It offers services designed for work and productivity (Google Docs, Google Sheets, and Google Slides), email (Gmail/Inbox), scheduling and time management (Google Calendar), cloud storage (Google Drive), social networking (Google+), instant messaging and video chat (Google Allo, Duo, Hangouts), language translation (Google Translate), mapping and navigation (Google Maps, Waze, Google Earth, Street View), video sharing (YouTube), note-taking (Google Keep), and photo organizing and editing (Google Photos). The company leads the development of the Android mobile operating system, the Google Chrome web browser, and Chrome OS, a lightweight operating system based on the Chrome browser. Google has moved increasingly into hardware; from 2010 to 2015, it partnered with major electronics manufacturers in the production of its Nexus devices, and it released multiple hardware products in October 2016, including the Google Pixel smartphone, Google Home smart speaker, Google Wifi mesh wireless router, and Google Daydream virtual reality headset. Google has also experimented with becoming an Internet carrier (Google Fiber, Project Fi, and Google Station). Google.com is the most visited website in the world. Several other Google services also figure in the top 100 most visited websites, including YouTube and Blogger. Google is the most valuable brand in the world as of 2017, but has received significant criticism involving issues such as privacy concerns, tax avoidance, antitrust, censorship, and search neutrality. Google's mission statement is "to organize the world's information and make it universally accessible and useful", and its unofficial slogan was "Don't be evil" until the phrase was removed from the company's code of conduct around May 2018.

Google Books

Google Books (previously known as Google Book Search and Google Print and by its codename Project Ocean) is a service from Google Inc. that searches the full text of books and magazines that Google has scanned, converted to text using optical character recognition (OCR), and stored in its digital database. Books are provided either by publishers and authors, through the Google Books Partner Program, or by Google's library partners, through the Library Project. Additionally, Google has partnered with a number of magazine publishers to digitize their archives. The Publisher Program was first known as Google Print when it was introduced at the Frankfurt Book Fair in October 2004. The Google Books Library Project, which scans works in the collections of library partners and adds them to the digital inventory, was announced in December 2004.

The Google Books initiative has been hailed for its potential to offer unprecedented access to what may become the largest online body of human knowledge and promoting the democratization of knowledge.

However, it has also been criticized for potential copyright violations, and lack of editing to correct the many errors introduced into the scanned texts by the OCR process.

As of October 2015, the number of scanned book titles was over 25 million, but the scanning process has slowed down in American academic libraries. Google estimated in 2010 that there were about 130 million distinct titles in the world, and stated that it intended to scan all of them.

Image scanner

An image scanner—often abbreviated to just scanner, although the term is ambiguous out of context (barcode scanner, CT scanner etc.)—is a device that optically scans images, printed text, handwriting or an object and converts it to a digital image. Commonly used in offices are variations of the desktop flatbed scanner where the document is placed on a glass window for scanning. Hand-held scanners, where the device is moved by hand, have evolved from text scanning "wands" to 3D scanners used for industrial design, reverse engineering, test and measurement, orthotics, gaming and other applications. Mechanically driven scanners that move the document are typically used for large-format documents, where a flatbed design would be impractical.

Modern scanners typically use a charge-coupled device (CCD) or a contact image sensor (CIS) as the image sensor, whereas drum scanners, developed earlier and still used for the highest possible image quality, use a photomultiplier tube (PMT) as the image sensor. A rotary scanner, used for high-speed document scanning, is a type of drum scanner that uses a CCD array instead of a photomultiplier. Non-contact planetary scanners essentially photograph delicate books and documents. All these scanners produce two-dimensional images of subjects that are usually flat, but sometimes solid; 3D scanners produce information on the three-dimensional structure of solid objects.

Digital cameras can be used for the same purposes as dedicated scanners. When compared to a true scanner, a camera image is subject to a degree of distortion, reflections, shadows, low contrast, and blur due to camera shake (reduced in cameras with image stabilization). Resolution is sufficient for less demanding applications. Digital cameras offer advantages of speed, portability and non-contact digitizing of thick documents without damaging the book spine. As of 2010 scanning technologies were combining 3D scanners with digital cameras to create full-color, photo-realistic 3D models of objects. In the biomedical research area, detection devices for DNA microarrays are called scanners as well. These scanners are high-resolution systems (up to 1 µm/pixel), similar to microscopes. The detection is done via CCD or a photomultiplier tube.

Internet Archive

The Internet Archive is a San Francisco–based nonprofit digital library with the stated mission of "universal access to all knowledge." It provides free public access to collections of digitized materials, including websites, software applications/games, music, movies/videos, moving images, and millions of public-domain books. In addition to its archiving function, the Archive is an activist organization, advocating for a free and open Internet.

The Internet Archive allows the public to upload and download digital material to its data cluster, but the bulk of its data is collected automatically by its web crawlers, which work to preserve as much of the public web as possible. Its web archive, the Wayback Machine, contains hundreds of billions of web captures. The Archive also oversees one of the world's largest book digitization projects.

Island Hermitage

Island Hermitage on (Polgasduwa) Dodanduwa Island, Galle District, Sri Lanka is a famous Buddhist forest monastery founded by Ven. Nyanatiloka Mahathera in 1911. It is a secluded place for Buddhist monks to study and meditate in the Buddhist tradition. It also has an excellent English and German library.

The Island Hermitage was the first centre of Theravāda Buddhist study and practice set up by and for Westerners. Its many prominent residents, monks and laymen, studied Theravada Buddhism and the Pali language, made translations of Pali scriptures, wrote books on Theravada Buddhism and practiced meditation. The Island Hermitage once formed an essential link with Theravāda Buddhism in the West.

In 1951 Nyanatiloka moved to the Forest Hermitage in Kandy, where he was then joined by Nyanaponika. Since 2003 the hermitage has been run by a group of young Sri Lankan monks. Currently there is only one Western monk, who has been living there for about four years.

Michigan Digitization Project

The Michigan Digitization Project is a project in partnership with Google Books to digitize the entire print collection of the University of Michigan Library. The digitized collection is available through the University of Michigan Library catalog, Mirlyn, the HathiTrust Digital Library, and Google Book Search. The full text of works that are out of copyright or in the public domain is available. According to the University of Michigan University Library, they embarked on this partnership for a number of reasons:

The project will create new ways for users to search and access Library content, opening up our library collections to our own users and to users throughout the world

Although we have engaged in large-scale (preservation-based) conversion of parts of the Library's collection for several years, we know that only through partnerships of this sort can something of this scale be achieved

We believe that, beyond providing basic access to Library collections, this activity is critically transformative, enabling the University Library to build on and reconceive vital Library services for the new millennium.

The project has received academic and media attention. In February 2008, the University of Michigan announced that over 1 million books from the University Library had been digitized. In September 2008, the University of Michigan announced the establishment of HathiTrust, a multi-institutional digital repository.

Million Book Project

The Million Book Project (or the Universal Library) was a book digitization project led by Carnegie Mellon University School of Computer Science and University Libraries from 2007 to 2008. Working with government and research partners in India (Digital Library of India) and China, the project scanned books in many languages, used OCR to enable full-text searching, and provided free-to-read access to the books on the web. As of 2007, the project had completed the scanning of 1 million books and made the entire catalog accessible online.


Noisebridge

Noisebridge is an award-winning anarchistic educational hackerspace in San Francisco, inspired by hackerspaces in Europe, like the Metalab in Vienna and c-base in Berlin. It is a registered non-profit California corporation, with IRS 501(c)(3) charitable status. According to the Noisebridge website's Vision page, "Noisebridge is a space for sharing, creation, collaboration, research, development, mentoring, and learning. Noisebridge is also more than a physical space, it's a community with roots extending around the world." It was organized and began regularly meeting in 2007 and has had permanent facilities since 2008.

Optical character recognition

Optical character recognition or optical character reader, often abbreviated as OCR, is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene photo (for example the text on signs and billboards in a landscape photo) or subtitle text superimposed on an image (for example from a television broadcast). Widely used as a form of information entry from printed paper data records – whether passport documents, invoices, bank statements, computerised receipts, business cards, mail, printouts of static data, or any suitable documentation – it is a common method of digitising printed texts so that they can be electronically edited, searched, stored more compactly, displayed online, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

Early versions needed to be trained with images of each character, and worked on one font at a time. Advanced systems capable of producing a high degree of recognition accuracy for most fonts are now common, and they support a variety of digital image file formats as input. Some systems are capable of reproducing formatted output that closely approximates the original page, including images, columns, and other non-textual components.
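
The idea of training on an image of each character in a single font can be illustrated with a toy template matcher. The sketch below is only an illustration under stated assumptions, not how any particular OCR product works: the 3x5 bitmaps, the names TEMPLATES, hamming and recognize, and the choice of pixel-difference (Hamming) distance are all made up for this example.

```python
# Toy template-matching recognizer: each known character is stored as a
# small binary bitmap (its "training image"), and an unknown glyph is
# assigned to the template that differs from it in the fewest pixels.
# The 3x5 bitmaps below are hypothetical and stand in for one font.

TEMPLATES = {
    "I": ["###",
          ".#.",
          ".#.",
          ".#.",
          "###"],
    "L": ["#..",
          "#..",
          "#..",
          "#..",
          "###"],
    "T": ["###",
          ".#.",
          ".#.",
          ".#.",
          ".#."],
}


def hamming(a, b):
    """Count the pixels that differ between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def recognize(glyph, templates=TEMPLATES):
    """Return the character whose template is closest to the glyph."""
    return min(templates, key=lambda ch: hamming(glyph, templates[ch]))


# A scanned "L" with one stray pixel, as scanning noise might add.
noisy_l = ["#..",
           "#..",
           "##.",
           "#..",
           "###"]

print(recognize(noisy_l))  # prints L
```

Because the matcher only knows the bitmaps it was given, a glyph from a different font would need its own set of templates, which is the one-font-at-a-time limitation described above.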

Out-of-print book

An out-of-print book is a book that is no longer being published. The term can apply to specific editions of more popular works, which may then go in and out of print repeatedly, or to the sole printed edition of a work, which is not picked up again by any future publishers for reprint. Most works that have ever been published are out of print at any given time, while certain highly popular books, such as the Bible, are always "in print". Less popular out-of-print books are often rare and may be difficult to acquire unless scanned or electronic copies of the books are available. With the advent of book scanning and print-on-demand technology, fewer and fewer works are now considered truly out of print.

A publisher creates a print run of a fixed number of copies of a new book. Print runs for most modern books number in the thousands. These books can be ordered in bulk by booksellers, and when all the bookseller's copies are sold, the bookseller has the option to order additional copies. If the initial print run sells out and demand still exists, the publisher will have more copies printed, if possible. When the book is no longer selling either at a rate fast enough to pay for the inventory costs, or to justify another print run, the publisher will cease to print additional copies, and may remainder or pulp the remaining unsold copies. When all of the books in a print run have been sold to booksellers, the book is said to be "out of print", meaning that a bookseller cannot get any further copies from the publisher. If a book sells out unexpectedly quickly, it may be considered out of print briefly when its initial print run is exhausted, but is usually soon reprinted.

Publishers may choose to list a book as "out of stock indefinitely", instead of declaring it out of print, as the publisher may have to give up copyright when declaring it out of print. Publishers will often let a book go out of stock for long periods, then reprint the book, usually with a new cover and formatting, to catch the presumably built up demand for the book. The author or their estate may have copyright reverted to them once the publisher has declared it out of print.

The longer a book has been out of print, the more difficult it may be to obtain a copy. If there is enough demand for an out-of-print book, and all copyright issues can be resolved, another publisher may republish the book in the same manner as the original publisher might have reprinted it. In some cases, an out-of-print book, even one that sold very poorly, may be republished if the author becomes popular again.

A reader who wishes to purchase an out-of-print book must either find a bookseller that still has a copy, wait for another print run, or find someone who will sell their own copy as a used book. The advent of the Internet has made this process much easier, as many websites sell used books offered by bookstores and individuals.

Some publishers intentionally limit the print run of some or all titles to fewer copies than the anticipated demand, creating limited editions marketed to collectors. In these cases, there is an implicit or explicit promise to collectors that the book will not be reprinted, at least in the same form as originally published. For instance, Madonna's book Sex, with a limited edition print run, is reportedly the most sought-after out-of-print book in the United States.

Project Gutenberg

Project Gutenberg (PG) is a volunteer effort to digitize and archive cultural works, to "encourage the creation and distribution of eBooks". It was founded in 1971 by American writer Michael S. Hart and is the oldest digital library. Most of the items in its collection are the full texts of public domain books. The project tries to make these as free as possible, in long-lasting, open formats that can be used on almost any computer. As of 23 June 2018, Project Gutenberg had reached 57,000 items in its collection of free eBooks. The releases are available in plain text but, wherever possible, other formats are included, such as HTML, PDF, EPUB, MOBI, and Plucker. Most releases are in the English language, but many non-English works are also available. Multiple affiliated projects provide additional content, including regional and language-specific works. Project Gutenberg is also closely affiliated with Distributed Proofreaders, an Internet-based community for proofreading scanned texts.


SimpleDL

SimpleDL is digital collection management software that allows for the upload, description, management and access of digital collections and is UTF-8 compatible. SimpleDL is not limited by format and is capable of handling documents, PDFs, images, videos, audio files, and data-only objects. In addition to simple digital files, SimpleDL can also connect content so multipage documents, scores, or books can be uploaded and organized into chapters, books or by page number. SimpleDL can also combine any number of images into one display object. SimpleDL is mostly used by libraries, archives, museums, government agencies, universities, corporations, historical societies, and other organizations that wish to host a digital collection.

Ted Striphas

Ted Striphas is an American academic, professor and author of The Late Age of Print.

Universal library

A universal library is a library with universal collections. This may be expressed in terms of it containing all existing information, useful information, all books, all works (regardless of format) or even all possible works. This ideal, although unrealizable, has influenced and continues to influence librarians and others as a goal to aspire to. Universal libraries are often assumed to have a complete set of useful features (such as finding aids, translation tools, alternative formats, etc.).


This page is based on a Wikipedia article written by its contributors.
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.