Google Print™, Million Book Project, and Google Scholar™
Presented to the FAO in Rome, February 2005
Gloriana St. Clair, Dean of University Libraries

“Commercialize the great research libraries with a handshake, suddenly and epochally.”
Rory Litwin, in Library Juice [1]

“This is the day the world changes.”
John Wilkin, University of Michigan [2]

Thesis
● Google’s new projects are exciting and, of course, commercial
● This talk will compare Google Print™ with the NSF-funded Million Book Project, and then touch briefly on Google Scholar™

Main Points
● Why / Genesis - Leaders, Partners
● Realities - Collections, Logistics
● Worries - Duplication, Copyright, Copyright, Copyright, Printing . . .

Sources For This Talk
News / web / talks / interviews, with help:
● Jean Alexander, Head, and the Hunt Library Reference Department
● Denise Troll Covey, Special Projects Librarian
● Missy Harvey, Computer Science Librarian
● Penn State Reference Department
● David Seaman, Digital Library Federation
● Anthony Tomasic, E-XMLMedia
● Michael Lesk, Rutgers University

Google Print™ Leaders/Partners
● Google, Inc.
● U. Michigan
● Stanford University
● Harvard University
● U. Oxford
● New York Public Library

Million Book Project Leaders/Partners in India
● Indian Institute of Science
● International Institute of Information Technology
● Indian Institute of Information Technology
● Anna University
● Mysore University
● University of Pune
● Goa University
● Tirumala Tirupati Devasthanams
● Shanmugha Arts, Science, Technology & Research Academy
● Arulmigu Kalasalingam College of Engineering
● Maharashtra Industrial Development Corporation

Million Book Project Leaders/Partners in China
● Chinese Academy of Science
● Chinese Ministry of Education
● Fudan University
● Nanjing University
● Peking University
● Tsinghua University
● Zhejiang University

Google Print™ Collections
● Stanford - entire collection
● Harvard - 40,000-volume pilot from a 15-million-volume collection
● U. Michigan - virtually the entire collection; adds seven million volumes to the search engine; Michigan to “receive and own a high quality digital copy” [3] and provide access
● New York Public Library - a subset of a 20-million-volume collection; selection criteria = in the public domain (published before 1923), interesting, not too fragile

Million Book Project Targeted Subcollections
● Books for College Libraries (best books)
● University presses / scholarly societies (copyright permissions work)
● U.N.’s Food and Agriculture Organization content

Google Print™ Handling the Copyright Issue
● Displays “a snippet of text” [4] online for books in copyright
● A ‘snippet’ is defined as three lines
● A search returns three snippets per book, and lists the number of times your search terms appear in the book
● BUY button

Million Book Project Handling the Copyright Issue
● After extensive work, we are having growing success in gaining permission from university presses / scholarly societies to digitize books in searchable full text

Million Book Project Research Initiatives
● Machine translation
● Massive distributed database
● Storage formats
● Use of digital libraries
● Distribution and sustainability
● Security
● Search engines
● Image processing
● Optical Character Recognition (OCR)
● Language processing
● Copyright laws

Google™ began as a research project at Stanford in 1995.

Google Print™ Logistics
● “Google will be doing all the digitizing with their own staff at Google headquarters and supposedly at Harvard and Michigan.” [5]
● Six-year time frame
● 2.25 books per minute
● Onsite

Million Book Project Logistics
● With scanning time at one page per second:
● 20,000 pages per day shift × 200 working days per year = 4 million pages per year per machine
● 100 years to scan 1 million books ÷ (number of operators/machines), which implies about 400 pages per book
● Several mega scanning centers are set up in India and China

Million Book Project Finances
● India - $25M annually to support a large set of language translation research projects
● China - $8.46M from Ministry of Education over 3 yrs (2006)
● United States - $3.63M from NSF over 4 yrs (2005); and equipment, staff and money from the Internet Archive

Google Print™ has funding of $???, but estimates costs at $10 per book.

Worries
Duplication
● “De-duplication is NOT part of the [Google Print™] process. NOTE Stanford is interested in having multiple copies of the same materials across various partners.” [6]
● Million Book Project will use OCLC’s Digital Registry as soon as batch loading is available.

Worries
Copyright
● “Google will be responsible for determining what’s in copyright.” [7]
● “A team is working on copyright issues but, in the meantime, Google is treating [copyright] conservatively.” [8]
Printing
● “Google will disable printing for out-of-copyright books.” [9]

More Worries
Google Print™
Rory Litwin, “On Google’s Monetization of Libraries” [10]
1. Privacy [cookies]
2. Introduction of commercial bias
3. Questions about democratization and equity of access
4. Disintermediation issues
5. Decontextualization of knowledge
6. Closing of the information commons

More Worries
Million Book Project
1. Getting it done
2. Sustainability
3. Cohesion of content
4. Usefulness

Google Scholar™ Beta
Reviewed by Péter’s Digital Reference Shelf, [11] Anthony Tomasic, and reference librarians at Carnegie Mellon and Penn State:
● Not as good as Citebase, Research Index, RePEc/LogEc (Péter’s)
● Not as good as CiteSeer (Tomasic)
● Not as suitable as CiteSeer (Lesk)
● Not as good as Google press releases indicate (St. Clair)

Google Scholar™ Beta
What: [12]
● Offers free access to bibliographic records and some abstracts
● May lead to full text if the university library subscribes or if the text is free to read
● May lead to a document delivery company
● Does not penetrate the invisible Web
● Has significantly enlarged the scope by crawling additional publishers, preprint and reprint servers
● Competes with other aggregators, such as SFX

Google Scholar™ Beta
What:
● Meets the needs of students looking for a different kind of material, and targets advertising to them
● It is easy for a human to identify a scholarly article, but it is a challenge for a machine (Tomasic)

Additional Challenges for a Better Scholarly Search Engine [13]
● Exploit highly structured and tagged web pages with rich metadata from scholarly publishers
● Create field-specific indexes for many distinct data elements
● Offer advanced navigation with pull-down menus for limited search by document type, publisher, publication year, journal
● Consolidate cited references
● Collect information from all relevant materials
● Develop utilities to help libraries find all materials subscribed to, not just one path

Thank you
Gloriana St. Clair
Dean of University Libraries
Carnegie Mellon University
email@example.com or 412-268-2447
If you would like an electronic copy of this talk, contact Cindy Carroll, firstname.lastname@example.org

Endnotes
1. Litwin, Rory. “On Google’s Monetization of Libraries.” Library Juice 7,26 (December 17, 2004). Available: http://www.libr.org/Juice/issues/vol7/LJ_7.26.html#3.
2. Wilkin, John. Quoted in “Google to Scan Books from Major Libraries.” MSNBC Tech News & Reviews. Available: http://www.msnbc.msn.com/id/6709342.
3. University of Michigan (Nancy Connell). “Google/U-M Project Opens the Way to Universal Access to Information.” University of Michigan News Service (December 14, 2004). Available: http://www.umich.edu/news/?Releases/2004/Dec04/library/index.
4. University of Michigan. “Google/U-M Project Questions and Answers.” The University Record Online (January 7, 2005). Available: http://www.umich.edu/~urecord/0405/Dec13_04/lib_qa.shtml.
5. Misseli. “The Google Deal (Down on the Farm).” Message posted by a Stanford staff member to Confessions of a Mad Librarian. Available: http://edwards.orcas.net/~misseli/blog/archives/000222.html.
6. Ibid.
7. Ibid.
8. Adam Smith, Senior Business and Product Manager for Google Print and Google Scholar, speaking informally with the ALA Electronic Text Centers Discussion Group. American Library Association Mid-Winter Conference (January 15, 2005).
9. Price, Gary. “Google Partners with Oxford, Harvard & Others to Digitize Libraries.” Search Engine Watch (December 14, 2004). Available: http://searchenginewatch.com/searchday/article.php/3447411.
10. Litwin.
11. Péter’s Digital Reference Shelf. “Google Scholar Beta.” (December 2004). Available: http://www.galegroup.com/servlet/HTMLFileServlet?imprint=9999&region=7&fileName=reference/archive/200412/googlescholar.html.
12. Ibid.
13. Ibid.