Google Books
Where we're going and how we got here
Jon Orwant
Engineering Manager
Google Books
Google Confidential and Proprietary
Overview
• Why and how Google scans books
• The Google Books settlement
• From pages to ideas
Why and How Google Scans Books
Google’s mission
Online content: billions of web pages
Offline content: billions of items becoming indexed
To organize the world’s information and make it universally accessible and useful.
Limited previews from publishers & authors
http://books.google.com
Google Books in a nutshell
Vital stats
Scans
• Number of books scanned: 15M+
• Number of pages: 4B
• Number of words: 2T
• Libraries: 40+
• Publishers: 30K+
Metadata
• Number of books: 130M
• Number of records: 4B
• Number of metadata fields: 1T
Identifying the book
            Library of Congress          Books in Print
title       Lord of the Rings, v.1       The Fellowship of the Ring
author      John Roland Reuel Tolkien    J.R.R. Tolkien
publisher   Houghton Mifflin             Ballantine Books
year        1954                         1994
How Google Handles Metadata
1. Collect data from 100+ sources (libraries, commercial aggregators, union catalogs, publishers, retailers)
2. Parse the records into our internal format
   MARC, ONIX, others...
   "UVA stores item data and call numbers in 955$a..."
3. Cluster the records into expressions and manifestations
4. Create a "best of" record for each cluster
5. Index and display elements of that record on books.google.com
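The clustering and "best of" steps can be illustrated with a minimal sketch. This is not Google's pipeline: the match key (normalized title plus author surname), the majority vote per field, and the sample records are all assumptions made for illustration, and the sketch clusters at a single level rather than separating expressions from manifestations.

```python
import re
from collections import Counter, defaultdict

ARTICLES = {"a", "an", "the"}

# Made-up sample records, as if parsed from different sources.
RECORDS = [
    {"title": "The Fellowship of the Ring", "author": "J.R.R. Tolkien",
     "publisher": "Houghton Mifflin", "year": "1954"},
    {"title": "Fellowship of the Ring", "author": "Tolkien, J.R.R.",
     "publisher": "Houghton Mifflin", "year": "1954"},
    {"title": "The fellowship of the ring.", "author": "J.R.R. Tolkien",
     "publisher": "Houghton Mifflin", "year": "1994"},
    {"title": "The Hobbit", "author": "J.R.R. Tolkien",
     "publisher": "Allen & Unwin", "year": "1937"},
]


def cluster_key(record):
    """Hypothetical match key: normalized title plus author surname."""
    title = re.sub(r"[^a-z0-9 ]", "", record["title"].lower())
    title = " ".join(w for w in title.split() if w not in ARTICLES)
    author = record["author"]
    # "Tolkien, J.R.R." puts the surname first; "J.R.R. Tolkien" puts it last.
    surname = author.split(",")[0] if "," in author else author.split()[-1]
    return title, surname.strip().lower()


def cluster(records):
    """Group records that share the same match key."""
    clusters = defaultdict(list)
    for record in records:
        clusters[cluster_key(record)].append(record)
    return dict(clusters)


def best_of(records):
    """Build one record by taking the most common value of each field."""
    fields = {f for r in records for f in r}
    return {
        f: Counter(r[f] for r in records if f in r).most_common(1)[0][0]
        for f in sorted(fields)
    }
```

Calling `cluster(RECORDS)` groups the three Fellowship records together and leaves The Hobbit on its own; `best_of` on the first group keeps the year 1954 because two of the three sources agree on it. A real system would need fuzzier matching than an exact key.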
478 languages
Kashubian: 14; Kara-kalpak: 102; Kabyle: 50; Kachin: 18; Kalaallisut: 82; Kamba: 29; Kannada: 2600; Karen: 50; Kashmiri: 289; Kanuri: 25; Kawi: 106; Kazakh: 1871
Kabardian: 16; Khasi: 78; Khoisan: 53; Khotanese: 21; Kikuyu, Gikuyu: 48; Kinyarwanda: 77; Kirghiz, Kyrgyz: 702; Kimbundu: 14; Konkani: 83; Komi: 48; Kongo: 134; Korean: 35905
Kosraean: 10; Kpelle: 6; Karachay-balkar: 17; Karelian: 28; Kru: 26; Kurukh: 30; Kuanyama: 9; Kumyk: 16; Kurdish: 220; Kutenai: 0; Klingon: 3; Kalmyk: 26
Translit-aware similarity metrics for names and titles
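One way such a metric could work, as a rough sketch only: transliterate to a common script, fold case and diacritics, then compare the folded strings. The tiny Cyrillic table below is a hypothetical, partial mapping chosen for the example, and `difflib.SequenceMatcher` stands in for whatever similarity measure is actually used.

```python
import unicodedata
from difflib import SequenceMatcher

# Hypothetical, partial Cyrillic-to-Latin table; a real one covers far more.
CYRILLIC = str.maketrans({
    "а": "a", "в": "v", "е": "e", "и": "i", "й": "y", "к": "k", "л": "l",
    "н": "n", "о": "o", "с": "s", "т": "t", "х": "kh", "ч": "ch",
})


def fold(text):
    """Transliterate, lowercase, and strip diacritics and punctuation."""
    # Transliterate before decomposing, since NFKD splits letters like "й".
    latin = text.casefold().translate(CYRILLIC)
    decomposed = unicodedata.normalize("NFKD", latin)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(c for c in no_marks if c.isalnum() or c.isspace()).strip()


def similarity(a, b):
    """Return a score in [0, 1]; 1.0 means identical after folding."""
    return SequenceMatcher(None, fold(a), fold(b)).ratio()
```

With this folding, `similarity("Чехов", "Chekhov")` and `similarity("Dvořák", "Dvorak")` both come out as exact matches, while unrelated names score low.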
Material content & form
<datafield tag="245" ind1=" " ind2=" "> <subfield code="a">[Turkey probe]</subfield> </datafield>
<datafield tag="260" ind1=" " ind2=" "> <subfield code="a">Syracuse : Betty Crocker Supplies, ca 1987</subfield> </datafield>
<datafield tag="300" ind1=" " ind2=" "> <subfield code="a">1 pointy thing , 46 cm. </subfield> </datafield>
<datafield tag="650" ind1=" " ind2=" "> <subfield code="a">Microwave cookery</subfield> </datafield>
<datafield tag="650" ind1=" " ind2=" "> <subfield code="a">April Fool's Day</subfield> </datafield>
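A record like this one can be read with a standard XML parser, and its physical description (field 300) hints that the item is not a book. The sketch below assumes the fields are wrapped in a `<record>` element without an XML namespace (real MARCXML normally declares one), and the `looks_like_book` heuristic is purely hypothetical.

```python
import re
import xml.etree.ElementTree as ET

RECORD = """<record>
  <datafield tag="245" ind1=" " ind2=" "><subfield code="a">[Turkey probe]</subfield></datafield>
  <datafield tag="260" ind1=" " ind2=" "><subfield code="a">Syracuse : Betty Crocker Supplies, ca 1987</subfield></datafield>
  <datafield tag="300" ind1=" " ind2=" "><subfield code="a">1 pointy thing , 46 cm. </subfield></datafield>
  <datafield tag="650" ind1=" " ind2=" "><subfield code="a">Microwave cookery</subfield></datafield>
  <datafield tag="650" ind1=" " ind2=" "><subfield code="a">April Fool's Day</subfield></datafield>
</record>"""


def subfield_values(record_xml, tag, code="a"):
    """Return the text of every matching subfield in the given field tag."""
    root = ET.fromstring(record_xml)
    values = []
    for field in root.iter("datafield"):
        if field.get("tag") != tag:
            continue
        for sub in field.iter("subfield"):
            if sub.get("code") == code and sub.text:
                values.append(sub.text.strip())
    return values


def looks_like_book(record_xml):
    """Hypothetical check: does the extent (300$a) mention pages or volumes?"""
    extents = subfield_values(record_xml, "300")
    return any(re.search(r"\b(?:p|pages|v|leaves)\b", e) for e in extents)
```

Here `subfield_values(RECORD, "650")` returns both subject headings, and `looks_like_book(RECORD)` is false because "1 pointy thing" names no pages or volumes.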
Cover generation
Parsing Uncertain Dates
• 18??
• [196-?]
• 1957/8
• late 14th century
• finita quarto nonas Januarias [1490]
• mense Septembri: Anno Millesimo q[ui]ngentesimo decimonono
• mense iulio, anno M.D.XXXX
• התשנ״א (Hebrew year 5751 = Gregorian 1990/1 CE)
• ١٣٧٣ (either Islamic year 1373 AH = Gregorian 1953/4 CE or Persian year 1373 AP = Gregorian 1994/5 CE)
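A few of these forms can be turned into year ranges with simple patterns. The sketch below is an assumption-laden illustration, not the parser described in the talk: the function names and rules are invented here, it returns the year as written without any calendar conversion (so ١٣٧٣ stays 1373, with AH versus AP left unresolved), and it gives up on written-out Latin years and Hebrew letter numerals.

```python
import re

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral; dots are skipped and additive XXXX is accepted."""
    values = [ROMAN[c] for c in numeral if c in ROMAN]
    total = 0
    for i, value in enumerate(values):
        # A smaller value before a larger one is subtracted (IV, XC, ...).
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value
        else:
            total += value
    return total


def parse_year_range(raw):
    """Return (earliest, latest) year as written, or None if unrecognized."""
    s = raw.strip().strip("[]")

    # "18??" or "196-?": known leading digits, unknown trailing digits.
    m = re.fullmatch(r"(\d{1,3})[-?]+", s)
    if m:
        missing = 4 - len(m.group(1))
        low = int(m.group(1)) * 10 ** missing
        return low, low + 10 ** missing - 1

    # "1957/8": a year followed by the changed final digits of the next one.
    m = re.fullmatch(r"(\d{4})/(\d{1,4})", s)
    if m:
        first, suffix = m.groups()
        return int(first), int(first[: 4 - len(suffix)] + suffix)

    # "late 14th century": fall back to the whole century.
    m = re.search(r"(\d{1,2})(?:st|nd|rd|th) century", s)
    if m:
        century = int(m.group(1))
        return (century - 1) * 100 + 1, century * 100

    # "anno M.D.XXXX": an uppercase Roman numeral of two or more letters.
    m = re.search(r"[MDCLXVI](?:\.?[MDCLXVI])+", s)
    if m:
        year = roman_to_int(m.group(0))
        return year, year

    # Any remaining four-digit year, such as "[1490]" in a Latin colophon.
    # \d and int() also accept Arabic-Indic digits such as "١٣٧٣".
    m = re.search(r"\d{4}", s)
    if m:
        year = int(m.group(0))
        return year, year
    return None
```

Returning a range rather than a single year keeps the uncertainty visible, so "18??" can still match any record dated within the nineteenth century.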
Annotations
The Google Books Settlement
Google Books Settlement
If approved, resolves lawsuit brought against Google by AAP & AG
Benefits:
o Rightsholder control
o Snippets => 20%
o Library subscriptions
o Free terminal in every US public library building
o Downloadable books for purchase
o Access for the print-disabled
o Book Rights Registry: a non-profit organization to find and pay rightsholders
o Research corpus
Linguistic Analysis
"Research that performs linguistic analysis over the Research Corpus to understand language, linguistic use, semantics and syntax as they evolve over time and across different genres or other classifications of Books."
From Pages to Ideas
Books as a corpus of human knowledge
• Understand one book
• Understand all books
• Understand relations between books
Insights into human progress
Source: Matthew Gray & Yuan K. Shen
Old-fashioned trigrams
• oxide of lead
• may be thus
• a heavy fire
• a striking proof
• miles distant from
• terms of peace
• presents the appearance
• more than mortal
• vexation of spirit
• zeal and devotion
New-fangled trigrams
• lesbian and gay
• health care professionals
• abuse and neglect
• the overall process
• shift away from
• the power elite
• a research project
• the poor countries
• probability of failure
• increased awareness of
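Lists like these can be approximated by counting trigrams in an older and a newer set of texts and ranking them by how lopsided the counts are. The sketch below is an illustration under stated assumptions, not the method used by Gray and Shen: the function names are invented, and the score is a plain add-one-smoothed ratio of raw counts, which a real corpus would need to normalize by corpus size.

```python
import re
from collections import Counter


def trigram_counts(texts):
    """Count every run of three consecutive words across the given texts."""
    counts = Counter()
    for text in texts:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(" ".join(words[i:i + 3]) for i in range(len(words) - 2))
    return counts


def distinctive_trigrams(old_texts, new_texts, top=10):
    """Return (old-fashioned, new-fangled) trigrams, most distinctive first."""
    old, new = trigram_counts(old_texts), trigram_counts(new_texts)

    def score(trigram):
        # Add-one smoothing avoids division by zero for unseen trigrams.
        return (old[trigram] + 1) / (new[trigram] + 1)

    ranked = sorted(set(old) | set(new), key=score, reverse=True)
    return ranked[:top], ranked[::-1][:top]
```

A high score marks a trigram that is common in the older texts and rare in the newer ones; a low score marks the reverse.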
Semantic Stack
Semantic Stack (video remix)
Reframing the Victorians (Cohen & Gibbs, GMU)
Victorian terms
Discipline-specific progress occurs by...
...moving up one level
...or improving the results at one level by creating a reusable data set
...or reasonably using one level as a proxy for a higher level
Reframing the Victorians
...reasonably using one level as a proxy for a higher level
Interdisciplinary progress occurs by...
...moving up one level
...or improving the results at one level
...by creating infrastructure that can be used by others
Intralanguage translations (Efron, U. Illinois)
Meeting the Challenge of Language Change in Text Retrieval with Machine Translation Techniques
Intralanguage translations
...improving the results at one level
...by creating infrastructure that can be used by others
Grammar inference (Abney & Szymanski, Univ. Michigan)
Automatic Identification and Extraction of Structured Linguistic Passages in Texts
Grammar inference
...moving up one level
...by creating infrastructure that can be used by others
The "Great Man" theory
Thank You!
Q&A