FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Solution: Technology

FactMiners & PRImA’s Knight News Challenge Entry

“Turn Text Soup into Smart Data in Newspaper & Magazine Archives”

A self-running video slideshow.

One slide every 15 seconds.

Pause as needed.

Solution: Technology

Machine Learning & Smart Data

Q: Can Robots* read magazines?

• Yes (mostly)…when looking at layout & text recognition within the individual page

• No...in terms of recognizing the complex document structure of the whole issue

• Our challenge is to move from individual page to whole-issue document structure recognition.

FactMiners & PRImA: Knight News Challenge – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives”

* “Robot” = Software Agent (AKA, a computer program)

From page…

…to pages!

Q: What’s the 1st Step & Where to do it?

• We start by teaching Robot agents to find & understand the TOC (Table of Contents) and Advertiser Index pages of newspapers & magazines.

• The best place to do this applied research is in the collections of the Internet Archive.


Bring ‘em on! I can’t get enough TOC.

Q: Why TOCs & Advertiser Indexes?

• A: TOCs (Table of Contents) & Advertiser Indexes reveal the complex document structure of newspapers & magazines.

• Like a Sudoku puzzle, the TOC & Ad Index provide helpful “filled-in answers” about the types of content to be found within pages of the newspaper or magazine.


Q: Why the Internet Archive?

• Thousands of “Text Soup era” newspaper & magazine collections that can be enhanced through research.

• The Archive’s Scanning Service flags TOC pages & generates a TOC-specific XML-encoded file during its standard digitization workflow.

• The current Archive TOC OCR analysis does not “see” & understand the complex TOCs of magazines & newspapers.


Q: What TOC Robots will we develop?

• TOC-Spotter is an Image/Scene Recognition software agent to crawl the Archive in search of TOC & Ad Index pages.

• TOC-Reader is a software agent extending PRImA recognition & evaluation technologies with Machine Learning capabilities to do “deep reading” with the assistance of the TOC Pattern Reference Library.
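The cues a TOC agent might look for can be illustrated with a small sketch. The slides describe TOC-Spotter as an image/scene recognition agent; the code below is not that method but a much simpler text-only heuristic over OCR output, assuming that TOC entries often pair a title with dot leaders and a trailing page number. The names `parse_toc_line` and `looks_like_toc`, the pattern, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
import re

# Assumed entry shape: a title, at least three dot leaders, then a page number.
TOC_LINE = re.compile(r"^(?P<title>.+?)\s*(?:\.\s*){3,}(?P<page>\d{1,4})\s*$")


def parse_toc_line(line):
    """Return (title, page) if the line looks like a TOC entry, else None."""
    match = TOC_LINE.match(line.strip())
    if match is None:
        return None
    return match.group("title").strip(), int(match.group("page"))


def looks_like_toc(ocr_lines, min_fraction=0.5):
    """Guess whether a page is a TOC from the share of entry-like lines.

    min_fraction is an illustrative threshold, not a tuned value.
    """
    lines = [ln for ln in ocr_lines if ln.strip()]
    if not lines:
        return False
    hits = sum(1 for ln in lines if parse_toc_line(ln) is not None)
    return hits / len(lines) >= min_fraction
```

A real magazine TOC is rarely this regular (multi-column layouts, page numbers placed before titles, decorative type), which is the gap the proposed TOC Pattern Reference Library is meant to address.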


Dot dot dot… Check!

Number… Check!

YO! Gotta TOC here!

Great! Let me take a good look at it.

Q: How will this help?

• By running our TOC-Agents early in digitization workflows, we can make smarter within-page layout recognition decisions during bulk OCR of the issue’s subsequent pages.

• We can generate “best guess” structure-revealing meta-tags in appropriate files as part of the standard Archive scanning workflow.
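As a rough illustration of how parsed TOC entries could yield “best guess” structure hints for later pages, the sketch below labels each page with the most recent TOC entry that starts at or before it. It assumes entries arrive as hypothetical `(title, start_page)` pairs; the function name `label_pages` and the `UNLISTED` marker are invented for this example and are not part of any Archive or PRImA format.

```python
from bisect import bisect_right

# Illustrative marker for pages that precede the first TOC entry.
UNLISTED = "unlisted"


def label_pages(toc_entries, last_page):
    """Map each page number to the TOC entry that most plausibly covers it.

    toc_entries: iterable of (title, start_page) pairs (assumed shape).
    last_page: number of the final page in the issue.
    """
    entries = sorted(toc_entries, key=lambda entry: entry[1])
    starts = [start for _, start in entries]
    labels = {}
    for page in range(1, last_page + 1):
        # Index of the last entry whose start page is <= this page.
        idx = bisect_right(starts, page) - 1
        labels[page] = entries[idx][0] if idx >= 0 else UNLISTED
    return labels


toc = [("Editorial", 4), ("Game Reviews", 12), ("Ad Index", 20)]
page_labels = label_pages(toc, last_page=22)
```

This only guesses from start pages. Interleaved ads and articles that jump to later pages would need the Advertiser Index and layout analysis to resolve.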


Let’s see… Based on my notes, that’d be an ad, a feature article, another ad…and there’s the Ad Index!

Q: What will be accomplished?

• The structure-mapped text files generated by the TOC-Reader agent will be ready for FactMiners' semantic tagging (AKA “fact-mining”) of the issue’s content.

• These files will be compatible with PRImA’s Aletheia program for use in crowdsourced Ground-Truth development of the TOC Pattern Reference Library.


Welcome to the TOC Pattern Reference Library

We have a design to “tame” Text Soup and unlock “facts” in archive data.

• Our immediate PRImA-inspired technology agenda is to develop “Robot” assistance (software agents) to find, recognize & deeply understand the TOCs (Table of Contents) and Advertiser Indexes of magazines in the Internet Archive magazine & newspaper collections.

• In our last slideshow, we describe the people dimension of our strategy to “fact-mine” Smart Data from newspaper & magazine digital archives…


FactMiners & PRImA: Our Knight News Challenge Entry

• “Turn Text Soup into Smart Data in Newspaper & Magazine Archives” https://goo.gl/99Vn5M

• Team

• Jim Salmons, FactMiners

• Timlynn Babitsky, FactMiners

• Apostolos Antonacopoulos, PRImA

• Christian Clausner, PRImA



Date post:	09-Jan-2017
Category:	Data & Analytics
Upload:	jim-salmons
