Date posted: 09-Jan-2017
Category: Data & Analytics
Uploaded by: jim-salmons
FactMiners & PRImA’s Knight News Challenge Entry
“Turn Text Soup into Smart Data in Newspaper & Magazine Archives”
A self-running video slideshow.
One slide every 15 seconds.
Pause as needed.
Solution: Technology
Machine Learning & Smart Data
Q: Can Robots* read magazines?
• Yes (mostly)… when looking at layout & text recognition within the individual page.
• No… in terms of recognizing the complex document structure of the whole issue.
• Our challenge is to move from individual-page to whole-issue document structure recognition.
FactMiners & PRImA: Knight News Challenge – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives”
* “Robot” = Software Agent (AKA, a computer program)
From page…
…to pages!
Q: What’s the 1st Step & Where to do it?
• We start by teaching Robot agents to find & understand the TOC (Table of Contents) and Advertiser Index pages of newspapers & magazines.
• The best place to do this applied research is in the collections of the Internet Archive.
Bring ‘em on!
I can’t get
enough TOC.
Q: Why TOCs & Advertiser Indexes?
• A: TOCs (Tables of Contents) & Advertiser Indexes reveal the complex document structure of newspapers & magazines.
• Like the pre-filled cells of a Sudoku puzzle, the TOC & Ad Index provide helpful “filled-in answers” about the types of content to be found on the pages of the newspaper or magazine.
Q: Why the Internet Archive?
• Thousands of “Text Soup era” newspaper & magazine collections that can be enhanced through research.
• The Archive’s Scanning Service flags TOC pages & generates a TOC-specific XML-encoded file during its standard digitization workflow.
• The current Archive TOC OCR analysis does not “see” & understand the complex TOCs of magazines & newspapers.
Q: What TOC Robots will we develop?
• TOC-Spotter is an Image/Scene Recognition software agent to crawl the Archive in search of TOC & Ad Index pages.
• TOC-Reader is a software agent that extends PRImA recognition & evaluation technologies with Machine Learning capabilities to do “deep reading” with the assistance of the TOC Pattern Reference Library.
Dot dot dot… Check!
Number… Check!
YO! Gotta TOC here!
Great! Let me take
a good look at it.
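The cues the robot calls out above (dot leaders, then a page number) can be illustrated with a small text heuristic. This is only a sketch under stated assumptions: TOC-Spotter is described as an Image/Scene Recognition agent, whereas this example swaps in a simple regular-expression check over already-OCR'd text lines. The names `TOC_LINE`, `parse_toc_line`, `looks_like_toc` and the `min_ratio` threshold are hypothetical and do not come from the project.

```python
import re

# Hypothetical pattern (an assumption, not the project's method):
# a title, then at least three dot leaders, then a page number.
TOC_LINE = re.compile(
    r"^(?P<title>.+?)\s*(?:\.\s?){3,}\s*(?P<page>\d{1,4})\s*$"
)


def parse_toc_line(line):
    """Return (title, page) if the line looks like a TOC entry, else None."""
    match = TOC_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("title").strip(), int(match.group("page"))


def looks_like_toc(ocr_lines, min_ratio=0.4):
    """Guess whether a page is a TOC from the share of entry-like lines.

    min_ratio is an illustrative threshold, not a tuned value.
    """
    lines = [ln for ln in ocr_lines if ln.strip()]
    if not lines:
        return False
    hits = sum(1 for ln in lines if parse_toc_line(ln) is not None)
    return hits / len(lines) >= min_ratio


if __name__ == "__main__":
    page = [
        "CONTENTS",
        "Editor's Letter ........ 4",
        "Feature: Robots Read . . . . 12",
        "Ad Index .......... 98",
    ]
    print(looks_like_toc(page))
    print([parse_toc_line(ln) for ln in page])
```

A text-only check like this would miss the multi-column and heavily designed TOCs the slides describe, which is one reason the proposal calls for image recognition and Machine Learning instead.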
Q: How will this help?
• By running our TOC-Agents early in digitization workflows, we can make smarter within-page layout recognition decisions during bulk OCR of the issue’s subsequent pages.
• We can generate “best guess” structure-revealing meta-tags in appropriate files as part of the standard Archive scanning workflow.
Let’s see… Based on my notes,
that’d be an ad, a feature article,
another ad…and there’s the
Ad Index!
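The robot's "notes" can be pictured as a lookup from page number to a best-guess content type. The sketch below is a minimal illustration, not the project's design: it assumes the TOC and Ad Index have already been parsed into hypothetical `(start_page, content_type)` pairs, and it makes the simplifying assumption that each item runs until the next listed start page.

```python
from bisect import bisect_right


def label_pages(toc_entries, last_page):
    """Map every page of an issue to a best-guess content type.

    toc_entries: list of (start_page, content_type) pairs, assumed to
    come from an already-parsed TOC and Ad Index (hypothetical input).
    Pages before the first listed entry are labeled "unknown".
    """
    entries = sorted(toc_entries)
    starts = [start for start, _ in entries]
    labels = {}
    for page in range(1, last_page + 1):
        # Index of the last entry that starts on or before this page.
        i = bisect_right(starts, page) - 1
        labels[page] = entries[i][1] if i >= 0 else "unknown"
    return labels


if __name__ == "__main__":
    notes = [(3, "ad"), (4, "feature article"), (9, "ad"), (10, "ad index")]
    for page, kind in label_pages(notes, last_page=10).items():
        print(page, kind)
```

Labels like these are only hints for the within-page layout recognition step; real issues interleave ads and articles in ways a start-page list alone cannot capture.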
Q: What will be accomplished?
• The structure-mapped text files generated by the TOC-Reader agent will be ready for FactMiners’ semantic tagging (AKA “fact-mining”) of the issue’s content.
• These files will be compatible with PRImA’s Aletheia program for use in crowdsourced Ground-Truth development of the TOC Pattern Reference Library.
Welcome to the
TOC Pattern Reference Library
We have a design to “tame” Text Soup and unlock “facts” in archive data.
• Our immediate PRImA-inspired technology agenda is to develop “Robot” assistance (software agents) to find, recognize & deeply understand the TOCs (Tables of Contents) and Advertiser Indexes in the Internet Archive’s magazine & newspaper collections.
• In our last slideshow, we describe the people dimension of our strategy to “fact-mine” Smart Data from newspaper & magazine digital archives…
FactMiners & PRImA: Our Knight News Challenge Entry
• “Turn Text Soup into Smart Data in Newspaper & Magazine Archives” https://goo.gl/99Vn5M
• Team
• Jim Salmons, FactMiners
• Timlynn Babitsky, FactMiners
• Apostolos Antonacopoulos, PRImA
• Christian Clausner, PRImA