FactMiners & PRImA's "Turning Text Soup into Smart Data" - The Solution: Technology

Date post: 09-Jan-2017
Upload: jim-salmons
Transcript

FactMiners & PRImA’s Knight News Challenge Entry

“Turn Text Soup into Smart Data in Newspaper & Magazine Archives”

A self-running video slideshow. One slide every 15 seconds. Pause as needed.

Solution: Technology

Machine Learning & Smart Data

Q: Can Robots* read magazines?

• Yes (mostly)… when looking at layout & text recognition within the individual page.

• No… in terms of recognizing the complex document structure of the whole issue.

• Our challenge is to move from individual-page to whole-issue document structure recognition.

FactMiners & PRImA: Knight News Challenge – “Turning Text Soup into Smart Data in Newspaper & Magazine Archives”

* “Robot” = Software Agent (AKA, a computer program)

From page…

…to pages!

Q: What’s the 1st Step & Where to do it?

• We start by teaching Robot agents to find & understand the TOC (Table of Contents) and Advertiser Index pages of newspapers & magazines.

• The best place to do this applied research is in the collections of the Internet Archive.

Bring ‘em on!

I can’t get

enough TOC.

Q: Why TOCs & Advertiser Indexes?

• A: TOCs (Tables of Contents) & Advertiser Indexes reveal the complex document structure of newspapers & magazines.

• Like the given numbers in a Sudoku puzzle, the TOC & Ad Index provide helpful “filled-in answers” about the types of content to be found within the pages of the newspaper or magazine.

Q: Why the Internet Archive?

• Thousands of “Text Soup era” newspaper & magazine collections that can be enhanced through research.

• The Archive’s Scanning Service flags TOC pages & generates a TOC-specific XML-encoded file during its standard digitization workflow.

• The current Archive TOC OCR analysis does not “see” & understand the complex TOCs of magazines & newspapers.

Q: What TOC Robots will we develop?

• TOC-Spotter is an Image/Scene Recognition software agent to crawl the Archive in search of TOC & Ad Index pages.

• TOC-Reader is a software agent that extends PRImA recognition & evaluation technologies with Machine Learning capabilities to do “deep reading” with the help of the TOC Pattern Reference Library.

Dot dot dot… Check!

Number… Check!

YO! Gotta TOC here!

Great! Let me take

a good look at it.
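
The checklist in the speech bubbles above (dot leaders, then a page number) can be illustrated with a small text heuristic. This is only a sketch under stated assumptions: the slides describe TOC-Spotter as an image/scene recognition agent, whereas this stand-in works on OCR text lines, and the names `TOC_LINE`, `toc_line_ratio`, `looks_like_toc` and the 0.5 threshold are hypothetical, not part of the project.

```python
import re

# Hypothetical pattern: some title text, a run of three or more dot
# leaders (optionally space-separated), then a trailing page number.
TOC_LINE = re.compile(
    r"^\s*(?P<title>\S.*?)\s*(?:\.\s?){3,}\s*(?P<page>\d{1,4})\s*$"
)


def toc_line_ratio(ocr_lines):
    """Return the fraction of non-empty lines shaped like 'Title .... 12'."""
    lines = [ln for ln in ocr_lines if ln.strip()]
    if not lines:
        return 0.0
    hits = sum(1 for ln in lines if TOC_LINE.match(ln))
    return hits / len(lines)


def looks_like_toc(ocr_lines, threshold=0.5):
    """Flag a page as a TOC candidate (assumed threshold, untuned)."""
    return toc_line_ratio(ocr_lines) >= threshold
```

Calling `looks_like_toc` on the OCR lines of a page returns `True` when at least half of its non-empty lines end in dot leaders and a number. A page flagged this way would still need the “deep reading” step, since complex magazine TOCs often lack dot leaders altogether.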

Q: How will this help?

• By running our TOC-Agents early in digitization workflows, we can make smarter within-page layout recognition decisions during bulk OCR of the issue’s subsequent pages.

• We can generate “best guess” structure-revealing meta-tags in appropriate files as part of the standard Archive scanning workflow.

Let’s see… Based on my notes,

that’d be an ad, a feature article,

another ad…and there’s the

Ad Index!
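
The “based on my notes” idea above, using TOC entries to form a best guess about each later page, might be sketched as follows. This is a minimal illustration, not the project's method: `TocEntry`, its `kind` labels, and `guess_page_labels` are hypothetical, and the sketch assumes each entry simply runs until the next one starts, which real magazines with interleaved ads and continued articles do not guarantee.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class TocEntry:
    title: str
    kind: str        # hypothetical labels, e.g. "feature", "department", "ad"
    start_page: int


def guess_page_labels(entries, last_page):
    """Label pages 1..last_page with the kind of the entry that most recently started."""
    ordered = sorted(entries, key=lambda e: e.start_page)
    starts = [e.start_page for e in ordered]
    labels = {}
    for page in range(1, last_page + 1):
        # Index of the last entry whose start page is <= this page.
        i = bisect_right(starts, page) - 1
        labels[page] = ordered[i].kind if i >= 0 else "unknown"
    return labels
```

The resulting labels are only a prior for the layout recognizer to confirm or overturn on each page; pages before the first listed entry are marked `"unknown"` rather than guessed.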

Q: What will be accomplished?

• The structure-mapped text files generated by the TOC-Reader agent will be ready for FactMiners’ semantic tagging (AKA “fact-mining”) of the issue’s content.

• These files will be compatible with PRImA’s Aletheia program for use in crowdsourced Ground-Truth development of the TOC Pattern Reference Library.

Welcome to the

TOC Pattern Reference Library

We have a design to “tame” Text Soup and unlock “facts” in archive data.

• Our immediate PRImA-inspired technology agenda is to develop “Robot” assistance (software agents) to find, recognize & deeply understand the TOCs (Tables of Contents) and Advertiser Indexes of magazines & newspapers in the Internet Archive collections.

• In our last slideshow, we describe the people dimension of our strategy to “fact-mine” Smart Data from newspaper & magazine digital archives…

FactMiners & PRImA: Our Knight News Challenge Entry

• “Turn Text Soup into Smart Data in Newspaper & Magazine Archives” https://goo.gl/99Vn5M

• Team:

• Jim Salmons, FactMiners

• Timlynn Babitsky, FactMiners

• Apostolos Antonacopoulos, PRImA

• Christian Clausner, PRImA
