From the printed page to discoverable content library camp perth 2010

Date post: 18-Dec-2014
Upload: steven-miles
From the Printed Page to Discoverable Content the open source way Steven Miles @stevermiles stevenmiles.com.au Tuesday, 18 January 2011
Transcript
Page 1: From the printed page to discoverable content    library camp perth 2010

From the Printed Page to Discoverable Content

the open source way

Steven Miles

@stevermiles

stevenmiles.com.au

Tuesday, 18 January 2011

Page 2: From the printed page to discoverable content    library camp perth 2010

About Me

Page 3: From the printed page to discoverable content    library camp perth 2010

About Me

Web Application Developer @ State Library of Western Australia

Page 4: From the printed page to discoverable content    library camp perth 2010

About Me

Web Application Developer @ State Library of Western Australia

S.L.U.R.P. - Digital Content Ingestion & Integration with LMS

PC Reservation - PC Reservations and Booking System

PLO - Public Libraries Online

Venues Bookings - Venues Booking & Reservation System

P.URL - Permanent URL

Page 5: From the printed page to discoverable content    library camp perth 2010

WARNING !!!!

Lots of technical stuff!

Page 6: From the printed page to discoverable content    library camp perth 2010

How can I make scanned content more discoverable?

Digitisation
    Capture
        DIY Scanner - Dual Camera Setup, Single Camera Setup
        Existing Documents
        Commercial Scanners - Document Scanners, MFDs
    Image Processing - Rotation, Cropping, Normalisation, Levels Correction, Multi page
    OCR
        Open source - Cuneiform, Tesseract, Ocropus, GOCR
        Commercial - ABBYY FineReader, Acrobat
        Page Layout Analysis - leptonica
        Formats - hOCR, Text, XML

Tagging
    Manual
    Automatic - Persons, Locations, Dates, Organisations

Metadata
    Manual
    Import - Z39.50, SRU/SRW

Indexing
    Engine
        Zebra - XML, Z39.50
        RDBMS - Postgres, MySQL

Search
    Pull from LMS
    Search Multiple Databases
    Expose Web APIs
    Other Library Systems - Z39.50, SRU/SRW
    Results - Facets, Page Previews, Ranked, Sortable, Filters
    Web Accessible
    Simple Keyword Searching
    Encourage Exploration
    Tagging
    Advanced Search
    Saved Searches
    Social Sharing, Integration

Presentation
    Web Browser Accessible
    Auto Updating Downloadable PDFs
    User Correctable Text
    In Document Searching
    Highlight Search Results
    Potential Conversion to Other Formats

Page 7: From the printed page to discoverable content    library camp perth 2010

Most common process of digitisation for public consumption

Scan / Capture

Generate PDF

OCR

Indexed by Content Management System (Links only to Document)

Link to Downloadable PDF (Uncorrected OCR)

How can we do this better?

Page 8: From the printed page to discoverable content    library camp perth 2010

Inspirational Resources

National Library of Australia - Australian Newspapers
http://newspapers.nla.gov.au/

Google Docs
http://docs.google.com

Informit - Text Searchable Content

Page 9: From the printed page to discoverable content    library camp perth 2010

Proposed Digitisation Process

Scan / Capture - Semi Auto Cropping and Rotation Correction

Optimise Each Page for OCR

OCR Pages - Retain Positional Information (hOCR)

Post OCR Processing - Spell checking & correction of common OCR errors

Natural Language Processing - Auto Extract Names, Organisations, Locations & Dates from Text and Use for tagging

Store as XML - Generate Page Level XML Index Files

Add/Update XML Indexing Server

Fully Automated Process

Generate Searchable PDF

Generate Web Friendly Versions of each page

Full Text Search - Web Services & Z39.50

Downloadable PDF

Google Docs Style Interface - Individual Line Highlighting to Show search results
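
The slides do not say how the Post OCR Processing step corrects text, so the following is only a minimal Python sketch of one common clean-up: rejoining a word that OCR output splits across two lines with a trailing hyphen (as happens with "Chris-" / "tians" in the sample OCR text later in the deck). The function name `join_hyphenated_lines` is hypothetical. A real pipeline would probably also need a dictionary check so that genuine hyphenated compounds are not merged by mistake.

```python
def join_hyphenated_lines(lines):
    """Rejoin words that were split by an end-of-line hyphen.

    Assumption: any line ending in "-" carries the first half of a word
    whose second half starts the next line.
    """
    merged = []
    carry = ""  # partial word carried over from the previous line
    for line in lines:
        text = carry + line.strip()
        carry = ""
        if len(text) > 1 and text.endswith("-"):
            # Move the partial last word down to the next line.
            head, _, tail = text.rpartition(" ")
            carry = tail[:-1]  # drop the trailing hyphen
            text = head
        if text:
            merged.append(text)
    if carry:
        merged.append(carry)  # dangling partial word at end of input
    return merged


ocr_lines = [
    "IN a display of unity, Muslims and Chris-",
    "tians gathered at Dianella Uniting Church",
]
fixed_lines = join_hyphenated_lines(ocr_lines)
for fixed in fixed_lines:
    print(fixed)
```

Note that moving the partial word to the next line changes which words sit on which line, so any stored line coordinates would no longer match the text exactly. Whether that trade-off is acceptable depends on how the highlighting is done.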

Page 10: From the printed page to discoverable content    library camp perth 2010

Available Open Source Projects

Ocropus - Page Layout Analysis
http://code.google.com/p/ocropus/

Tesseract OCR - OCR
http://code.google.com/p/tesseract-ocr/

ImageMagick - Image Processing
http://www.imagemagick.org/

Index Data Zebra - XML Indexing
http://www.indexdata.com/zebra

Index Data Pazpar2 - Federated Search
http://www.indexdata.com/pazpar2

Existing Web Technologies - PHP, HTML, CSS etc

Page 11: From the printed page to discoverable content    library camp perth 2010

DIY Book Scanner Project

www.diybookscanner.org

Page 12: From the printed page to discoverable content    library camp perth 2010

Basic Architecture

Discovery Layer (PHP, HTML, CSS)

Federated Search - Using PazPar2 (Z39.50, SRU, SRW)

Full Text Search - Zebra XML Indexer via Z39.50

LMS & External Databases - Existing, via Z39.50

XML Data Files - MARC, Dublin Core, OAI-PMH

Document Viewer / Editor (PHP, HTML, CSS)

Ingest / Digitisation (PHP, HTML, CSS)

OCR & NLP (Document Processing, OCR & Natural Language Processing)

Downloadable Version - Automatic Generation of Searchable PDF, Text Files etc (Updated from User Alterations)

External Resources - Crowdsourcing OCR Corrections & Possible translation of handwritten documents

Page 13: From the printed page to discoverable content    library camp perth 2010

Converting Images for OCR

Convert to Grayscale

Generate Text Image Mask

Clean Up Background Noise

OCR Version

OCRopus - Page Layout Analysis

ImageMagick - Image Manipulation

Combined
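
As a rough illustration of the ImageMagick part of this step, the Python sketch below builds (but does not run) a `convert` command that rotates, crops and converts a captured page to grayscale. The rotate and crop values are the ones recorded for page 1 in the storage XML sample later in the deck. The helper name `build_convert_command`, the output file name and the order of operations are assumptions, not the author's actual settings. The text-mask and noise clean-up stages are not covered here.

```python
import shlex


def build_convert_command(src, dst, rotate=None, crop=None, grayscale=True):
    """Return an ImageMagick `convert` argument list (nothing is executed).

    Assumption: rotation is applied before cropping, so the crop geometry
    is expressed in the coordinates of the rotated image.
    """
    cmd = ["convert", src]
    if rotate is not None:
        cmd += ["-rotate", str(rotate)]
    if crop is not None:
        # +repage resets the virtual canvas left behind by -crop.
        cmd += ["-crop", crop, "+repage"]
    if grayscale:
        cmd += ["-colorspace", "Gray"]
    cmd.append(dst)
    return cmd


cmd = build_convert_command(
    "odd/IMG_0946.JPG",   # capture path taken from the XML sample
    "1-gray.png",         # hypothetical output name
    rotate=91,
    crop="2161x3247+374+274",
)
print(shlex.join(cmd))
```

The list form can be passed to `subprocess.run(cmd, check=True)` on a machine with ImageMagick installed. Newer ImageMagick releases use `magick` as the entry point instead of `convert`.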

Page 14: From the printed page to discoverable content    library camp perth 2010

Images to Text

Image for OCR Processing

Tesseract OCR to hOCR File

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <title></title>
  <meta http-equiv="Content-Type" content="text/html;charset=utf-8" >
  <meta name='ocr-system' content='tesseract'>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "/var/digindex/repository/eastern_reporter/2010/10/5/ocr/1-masked.png"; bbox 0 0 2161 3247'>
  <div class='ocr_carea' id='block_1_1' title="bbox 200 46 1858 233">
    <p class='ocr_par'>
      <span class='ocr_line' id='line_1_1' title="bbox 201 50 1858 230">
        <span class='ocr_word' id='word_1_1' title="bbox 1058 50 1211 196"><span class='xocr_word' id='xword_1_1' title="x_wconf -6">R</span></span>
        <span class='ocr_word' id='word_1_2' title="bbox 1319 88 1858 230"><span class='xocr_word' id='xword_1_2' title="x_wconf -4"> r </span></span>
      </span>
    </p>
  </div>
  <div class='ocr_carea' id='block_1_2' title="bbox 47 1855 241 1883">
    <p class='ocr_par'>
      <span class='ocr_line' id='line_1_2' title="bbox 47 1855 241 1882">
        <span class='ocr_word' id='word_1_3' title="bbox 47 1855 77 1882"><span class='xocr_word' id='xword_1_3' title="x_wconf -2">By</span></span>
        <span class='ocr_word' id='word_1_4' title="bbox 87 1855 153 1877"><span class='xocr_word' id='xword_1_4' title="x_wconf -3">LIAM</span></span>
        <span class='ocr_word' id='word_1_5' title="bbox 163 1856 241 1878"><span class='xocr_word' id='xword_1_5' title="x_wconf -2">CROY</span></span>
      </span>
    </p>
  </div>
  <div class='ocr_carea' id='block_1_3' title="bbox 43 1909 533 2404">
    <p class='ocr_par'>
      <span class='ocr_line' id='line_1_3' title="bbox 46 1910 531 1934">
        <span class='ocr_word' id='word_1_6' title="bbox 46 1910 72 1928"><span class='xocr_word' id='xword_1_6' title="x_wconf -3">IN</span></span>
        <span class='ocr_word' id='word_1_7' title="bbox 83 1914 94 1928"><span class='xocr_word' id='xword_1_7' title="x_wconf -2">a</span></span>
        <span class='ocr_word' id='word_1_8' title="bbox 105 1910 185 1933"><span

<document>
  <metadata>
    <title>Eastern Reporter Tuesday, October 5, 2010</title>
    <id>eastern_reporter/2010/10/5</id>
  </metadata>
  <pages>
    <page id="0" origWidth="3648" origHeight="2736" rotate="-90.5" crop="2199x3321+147+147"/>
    <page id="1" origWidth="3648" origHeight="2736" rotate="91" path="odd/IMG_0946.JPG" crop="2161x3247+374+274" width="2161" height="3247">
      <paragraph>
        <line id="line_1_1" top="50" left="201" width="1657" height="180">R r</line>
      </paragraph>
      <paragraph>
        <line id="line_1_2" top="1855" left="47" width="194" height="27">By LIAM CROY</line>
      </paragraph>
      <paragraph>
        <line id="line_1_3" top="1910" left="46" width="485" height="24">IN a display of unity, Muslims and Chris-</line>
        <line id="line_1_4" top="1937" left="45" width="486" height="26">tians gathered at Dianella Uniting Church</line>
        <line id="line_1_5" top="1965" left="45" width="485" height="26">last Thursday to share thei.r experiences</line>
        <line id="line_1_6" top="1993" left="45" width="212" height="24">and pray for peace.</line>
      </paragraph>
      <paragraph>
        <line id="line_1_7" top="2020" left="79" width="451" height="25">Sheikh Muhammad Agherdien of the</line>
      </paragraph>
      <paragraph>
        <line id="line_1_8" top="2048" left="46" width="484" height="25">Mirrabooka mosque opened the service</line>
        <line id="line_1_9" top="2076" left="46" width="484" height="26">with a verse of the Islamic religious text,</line>
        <line id="line_1_10" top="2103" left="45" width="117" height="20">the Koran:</line>
      </paragraph>
      <paragraph>
        <line id="line_1_11" top="2131" left="79" width="451" height="27">&#x201C;Oh People! Behold, we have created you</line>
      </paragraph>
      <paragraph>
        <line id="line_1_12" top="2158" left="46" width="331" height="22">all out ofa male and a female.</line>
      </paragraph>
      <paragraph>
        <line id="line_1_13" top="2187" left="79" width="451" height="25">&#x201C;And we have made you into nations</line>
      </paragraph>
      <paragraph>
        <line id="line_1_14" top="2214" left="46"
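
The two samples above (Tesseract hOCR, then the storage XML) suggest that each `ocr_line` bounding box `x0 y0 x1 y1` becomes `left`, `top`, `width = x1 - x0` and `height = y1 - y0`. The sketch below shows one way such a conversion might look in Python, using only the standard library. The class and function names are hypothetical. For brevity every line goes into a single `<paragraph>`, whereas the sample XML groups lines per hOCR paragraph.

```python
import re
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrLineCollector(HTMLParser):
    """Collect the id, bounding box and text of every ocr_line span."""

    def __init__(self):
        super().__init__()
        self.lines = []       # finished lines
        self._current = None  # line currently being read
        self._depth = 0       # open <span> depth inside that line

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        if self._current is not None:
            self._depth += 1
        elif attrs.get("class") == "ocr_line":
            match = BBOX.search(attrs.get("title") or "")
            if match:
                box = tuple(int(n) for n in match.groups())
                self._current = {"id": attrs.get("id", ""), "bbox": box, "text": []}
                self._depth = 1

    def handle_endtag(self, tag):
        if tag == "span" and self._current is not None:
            self._depth -= 1
            if self._depth == 0:
                self.lines.append(self._current)
                self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"].append(data)


def hocr_to_paragraph_xml(hocr):
    """Turn hOCR markup into a <paragraph> of positioned <line> elements."""
    collector = HocrLineCollector()
    collector.feed(hocr)
    collector.close()
    paragraph = ET.Element("paragraph")
    for line in collector.lines:
        x0, y0, x1, y1 = line["bbox"]
        element = ET.SubElement(
            paragraph, "line", id=line["id"],
            top=str(y0), left=str(x0),
            width=str(x1 - x0), height=str(y1 - y0),
        )
        # Collapse runs of whitespace between the word spans.
        element.text = " ".join("".join(line["text"]).split())
    return ET.tostring(paragraph, encoding="unicode")


# Fragment adapted from the hOCR sample (word-level confidence spans omitted).
sample_hocr = """<span class='ocr_line' id='line_1_2' title="bbox 47 1855 241 1882">
<span class='ocr_word' id='word_1_3' title="bbox 47 1855 77 1882">By</span>
<span class='ocr_word' id='word_1_4' title="bbox 87 1855 153 1877">LIAM</span>
<span class='ocr_word' id='word_1_5' title="bbox 163 1856 241 1878">CROY</span>
</span>"""

xml_out = hocr_to_paragraph_xml(sample_hocr)
print(xml_out)
```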

Convert hOCR to XML for Storage

Sample Auto Generated Tags

IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , tians gathered at [ORG Dianella Uniting Church ] , last Thursday to share thei.r experiences , and pray for peace.
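
To turn tagger output like the line above into tags, the bracketed spans need to be pulled out. The Python sketch below assumes the format is exactly `[LABEL text ]`, as in the sample, with no nesting. The function name `extract_entities` is hypothetical, and only the `MISC` and `ORG` labels seen in the sample are exercised.

```python
import re
from collections import defaultdict

# "[" LABEL, a space, the entity text, a space, then "]".
ENTITY = re.compile(r"\[(\w+) ([^\]]+?) \]")


def extract_entities(tagged_text):
    """Group bracketed entities by label, in order of appearance."""
    found = defaultdict(list)
    for label, text in ENTITY.findall(tagged_text):
        found[label].append(text)
    return dict(found)


tagged = (
    "IN a display of unity , [MISC Muslims ] and [MISC Chris- ] , "
    "tians gathered at [ORG Dianella Uniting Church ] , last Thursday "
    "to share thei.r experiences , and pray for peace."
)
entities = extract_entities(tagged)
print(entities)
```

The `Chris-` entity shows why line-break hyphenation is worth repairing before tagging: the tagger only sees half of the word.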

Page 15: From the printed page to discoverable content    library camp perth 2010

Demo

Page 16: From the printed page to discoverable content    library camp perth 2010

Prototype Interface for Ingesting Pages from Book Scanner

Page 17: From the printed page to discoverable content    library camp perth 2010

Perform Basic Image Rotation and Cropping

Rotation and Cropping can be replicated to other pages

Page 18: From the printed page to discoverable content    library camp perth 2010

Prototype Search Pages

The facets on the left of the results are auto-generated from the natural language processing tags
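
One plausible way to derive such facets is to count, for each tag label, how many pages carry each tag value. The Python sketch below illustrates that idea. It is not the prototype's actual code (the deck indicates PHP for the discovery layer). The function name `build_facets`, the page ids and the second page's tags are made up for illustration.

```python
from collections import Counter


def build_facets(page_tags):
    """Count pages per (label, value) tag.

    page_tags maps a page id to a list of (label, value) pairs.
    """
    facets = {}
    for tags in page_tags.values():
        # set() so a tag repeated on one page is counted once for that page.
        for label, value in set(tags):
            facets.setdefault(label, Counter())[value] += 1
    return facets


page_tags = {
    "eastern_reporter/2010/10/5#1": [
        ("ORG", "Dianella Uniting Church"),
        ("MISC", "Muslims"),
        ("ORG", "Dianella Uniting Church"),
    ],
    "hypothetical/page#2": [
        ("ORG", "Dianella Uniting Church"),
    ],
}
facets = build_facets(page_tags)
for label, counts in sorted(facets.items()):
    print(label, counts.most_common())
```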

Page 19: From the printed page to discoverable content    library camp perth 2010

Viewing Document Pages

Page 20: From the printed page to discoverable content    library camp perth 2010

Viewing Document Pages with Highlighted Results
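
Because the storage XML keeps `top`, `left`, `width` and `height` for every line, line-level highlighting can in principle be derived by scaling those boxes to the size of the web preview image. The Python sketch below assumes a simple case-insensitive substring match and a preview that keeps the page's aspect ratio. The function name `highlight_boxes` and the 1000-pixel preview width are hypothetical.

```python
import xml.etree.ElementTree as ET


def highlight_boxes(page_xml, query, preview_width):
    """Return scaled boxes for the lines of a <page> that contain the query."""
    page = ET.fromstring(page_xml)
    scale = preview_width / int(page.get("width"))
    boxes = []
    for line in page.iter("line"):
        if query.lower() in (line.text or "").lower():
            box = {"id": line.get("id")}
            for key in ("top", "left", "width", "height"):
                box[key] = round(int(line.get(key)) * scale)
            boxes.append(box)
    return boxes


# Two lines taken from the storage XML sample (page 1 is 2161 pixels wide).
page_xml = """<page id="1" width="2161" height="3247"><paragraph>
<line id="line_1_5" top="1965" left="45" width="485" height="26">last Thursday to share thei.r experiences</line>
<line id="line_1_6" top="1993" left="45" width="212" height="24">and pray for peace.</line>
</paragraph></page>"""

boxes = highlight_boxes(page_xml, "peace", preview_width=1000)
print(boxes)
```

Each returned box could then be drawn as an absolutely positioned overlay on top of the page preview.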

Page 21: From the printed page to discoverable content    library camp perth 2010

Editing Document with Auto Updating of Indexer

Page 22: From the printed page to discoverable content    library camp perth 2010

PazPar2 can be used to build alternative interfaces for searching multiple existing catalogs

Page 23: From the printed page to discoverable content    library camp perth 2010

Questions?

Page 24: From the printed page to discoverable content    library camp perth 2010

More Info & Credits

Tesseract-OCR
http://code.google.com/p/tesseract-ocr/

OCRopus
http://code.google.com/p/ocropus/

Do-It-Yourself Book Scanning
http://www.diybookscanner.org/

CHDK - Canon Hack Development Kit
http://chdk.wikia.com/wiki/CHDK

Zebra - XML Indexing
http://www.indexdata.com/zebra

PazPar2 - Federated Search
http://www.indexdata.com/pazpar2

Cuneiform
http://en.wikipedia.org/wiki/HOCR

EyeFi Python Server
http://returnbooleantrue.blogspot.com/2009/01/eye-fi-standalone-server.html/

hOCR - HTML OCR
http://en.wikipedia.org/wiki/HOCR

OpenNLP
http://www.indexdata.com/pazpar2

Illinois Named Entity Tagger
http://cogcomp.cs.illinois.edu/page/software_view/4
