Scanned Documents
LBSC 796/INFM 718R
Douglas W. Oard
Week 8, October 29, 2007
Expanding the Search Space
Scanned Docs
Identity: Harriet
“… Later, I learned that John had not heard …”
High Payoff Investments
[Figure: Searchable Fraction plotted against Transducer Capabilities (words recognized accurately / words produced); labeled points: OCR, MT, Handwriting, Speech]
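The transducer-capability ratio on this slide (words recognized accurately over words produced) can be sketched in a few lines. The slide does not say how a "correct" word is decided, so the bag-of-words matching below is an assumption, and `word_accuracy` and the sample OCR output are hypothetical.

```python
from collections import Counter

def word_accuracy(produced, reference):
    """Fraction of produced words that also occur in the reference.

    Assumption: bag-of-words matching, where each reference word
    can be matched at most once and word order is ignored.
    """
    if not produced:
        return 0.0
    ref_counts = Counter(reference)
    correct = 0
    for word in produced:
        if ref_counts[word] > 0:
            ref_counts[word] -= 1
            correct += 1
    return correct / len(produced)

# Hypothetical OCR output with two misrecognized words.
reference = "later i learned that john had not heard".split()
ocr_output = "later i leamed that john had not beard".split()
accuracy = word_accuracy(ocr_output, reference)
print(accuracy)
```

Here 6 of the 8 produced words match the reference, so the ratio is 0.75.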
The Big Picture
• Find the words
• Index the words
• Do ranked retrieval
• Use that system to find what you want
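The index and ranked-retrieval steps above can be sketched as follows, taking the "find the words" step (e.g. OCR) as already done. This is a minimal illustration using a TF-IDF weighting chosen for the example, not necessarily the ranking function used in the course; `tokenize`, `build_index`, `rank`, and the sample documents are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    # Lowercase and keep runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    # Inverted index: term -> {doc_id: term frequency}.
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def rank(query, index, num_docs):
    # Score each document by summed (1 + log tf) * idf over query terms.
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(num_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += (1 + math.log(tf)) * idf
    # Best score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical recognized text for three scanned pages.
docs = {
    "d1": "Later, I learned that John had not heard",
    "d2": "John wrote to Harriet about the harvest",
    "d3": "The harvest was poor that year",
}
index = build_index(docs)
results = rank("harriet john", index, len(docs))
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

With these documents, d2 ranks above d1 because it matches both query terms and "harriet" is the rarer term; d3 matches neither and is not returned.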
Some Issues
• Language-based search without language!
  – Shape codes
• Accuracy-selection effect of ranked retrieval
  – Poor recognition scatters in the query-term space
• Blind relevance feedback
  – Based on clean text
• Image-domain summaries
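To illustrate the "shape codes" idea, the sketch below maps each character to a coarse shape class (ascender, descender, x-height) so that words can be matched without recognizing individual letters. The slide names only the idea; the class assignments and the `shape_code` function are assumptions made for this example. A real system would derive the classes from image features, whereas this sketch simulates them from text.

```python
# Assumed shape classes (simplified and hypothetical).
ASCENDERS = set("bdfhklt")
DESCENDERS = set("gpqy")

def shape_code(word):
    """Map each character to a coarse shape class."""
    codes = []
    for ch in word:
        if ch.isupper() or ch.isdigit() or ch in ASCENDERS:
            codes.append("A")   # tall characters
        elif ch in DESCENDERS:
            codes.append("g")   # characters that drop below the baseline
        elif ch in "ij":
            codes.append(ch)    # dotted characters kept distinct
        elif ch.isalpha():
            codes.append("x")   # x-height characters
        else:
            codes.append(ch)    # punctuation passes through unchanged
    return "".join(codes)

print(shape_code("heard"))
print(shape_code("beard"))
print(shape_code("Harriet"))
```

A query word is encoded the same way and matched against the encoded page. The trade-off is ambiguity: "heard" and "beard" share a code, so a shape-code search cannot tell them apart.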
Some Applications
• Case management for litigation
• Duplicate detection for declassification productivity and anti-tiling
• Knowledge management from everything I have ever xeroxed or faxed
Some Applications
• Legacy Tobacco Documents Library
  – http://legacy.library.ucsf.edu/
• Google Books
  – http://books.google.com/
• George Washington’s Papers
  – http://ciir.cs.umass.edu/irdemo/hw-demo/