Value Mining: How Entity Extraction Informs Analysis

Post on 30-Jun-2015


description

Learn how to create understanding from big data, and how entity extraction and open analytics create understanding from the deep web.

transcript

Value Mining: How Entity Extraction Informs Analysis

June 2012 | Andrew Strite

Agenda

• Big Data and Document Analysis
• Case Study: Federal Agency
  – Problem Definition
  – Open Analytics & Entity Extraction
  – Reporting and Visualization
  – Results Assessment
• Questions

The Big Data Problem

Data is becoming the new raw material of business: an economic input almost on par with capital and labor.

“Every day I wake up and ask, ‘How can I flow data better, manage data better, analyze data better?’”

Rollin Ford, the CIO of Wal-Mart

Solution: Document Analysis

“Document Analysis refers to computer-assisted analysis of large numbers of documents in order to answer questions about the content of a document set.”

Source: http://www.text-tech.com/docanalysis/definition.html

Document Analysis

• The goal is to:
  – Extract Entities (people, places, things)
  – Create Associations between entities (in the form of noun-verb-noun), e.g.:
    • John Doe lives in Washington, D.C.
    • John Doe is married to Jane Doe
    • John Doe is a Virgo
    • John Doe traveled to Mexico on July 6th, 2011

• And…
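The noun-verb-noun associations listed above can be held as simple subject-relation-object records. What follows is a minimal sketch, not the presenter's implementation: the `Association` class, its field names, and the two helper functions are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class Association(NamedTuple):
    """One noun-verb-noun statement (hypothetical structure for illustration)."""
    subject: str
    relation: str
    obj: str
    when: str = ""  # optional date, kept as free text


# The four example statements from the slide, written as records.
associations = [
    Association("John Doe", "lives in", "Washington, D.C."),
    Association("John Doe", "is married to", "Jane Doe"),
    Association("John Doe", "is a", "Virgo"),
    Association("John Doe", "traveled to", "Mexico", "July 6th, 2011"),
]


def entities(records):
    """Collect every distinct entity that appears as a subject or an object."""
    found = set()
    for r in records:
        found.add(r.subject)
        found.add(r.obj)
    return found


def about(records, entity):
    """Return all associations in which the entity is subject or object."""
    return [r for r in records if entity in (r.subject, r.obj)]
```

Calling `about(associations, "Jane Doe")` returns the single marriage statement, which is the kind of lookup that becomes possible once free text has been reduced to records.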

Document Analysis

• Turn Who, What, When and Where into a unified data structure that supports data analytics and visualization.

Who: people, organizations, facilities, company

What: events, summaries, facts, themes

When: past, present, future dates

Where: city, state, country, coordinate
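One way to picture such a unified structure is a single record per document with one field for each of the four dimensions. The sketch below is an assumption made for illustration, not the Infinit.e schema: `DocumentRecord`, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocumentRecord:
    """Hypothetical unified record: one entry per analyzed document."""
    source: str
    who: List[str] = field(default_factory=list)    # people, organizations, facilities
    what: List[str] = field(default_factory=list)   # events, summaries, facts, themes
    when: List[str] = field(default_factory=list)   # dates, kept as text as found
    where: List[str] = field(default_factory=list)  # city, state, country, coordinate

    def dimensions_present(self):
        """Names of the four dimensions that hold at least one value."""
        return [name for name in ("who", "what", "when", "where")
                if getattr(self, name)]


# Made-up example values, reusing the John Doe statements from the slides.
record = DocumentRecord(
    source="example.txt",  # placeholder file name
    who=["John Doe"],
    what=["travel"],
    when=["July 6th, 2011"],
    where=["Mexico"],
)
```

Because every document ends up in the same shape, downstream analytics and visualization code can filter or group on any dimension without knowing the original file format.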

Document Analysis

Case Study: Federal Agency

Overview

A Federal client produced reports for other DoD components and wanted to know:

“Did our reports meet customer needs?”

First step: assess historical reporting

“What were teams writing about and when?”

Problem: Unstructured Data

• Plenty of raw data, but no way to get at it
  – 6K+ unstructured documents
  – 15+ file types
  – No standard formats
    • Teams (Who)
    • Dates (When)
    • Topics (What)
  – Some content not relevant

Early Attempts

• Initial client attempts to solve the problem mostly involved manual review
  – High document volume = labor intensive
  – Assessing relevance = skilled labor
• Total process tied up skilled analysts for hundreds of man-hours.
• Manual review prone to error
  – Incomplete attempts corrupted data

Solution: Open Analytics

• Process to design and implement analytical solutions

• Joins open tools and agile engineering techniques

• Goal is to enable organizations to quickly deliver smart analysis and enable top line growth

Mechanism: Infinit.e

Infinit.e is a scalable framework for Collecting, Storing, Enriching, Retrieving, Analyzing, and Visualizing Unstructured documents & Structured records.

Infinit.e Concept

80% Unstructured
• Documents
• Presentations
• Spreadsheets
• Meeting notes
• Email
• IM chats
• Reports
• Social

20% Structured
• Log files
• Databases
• Apps

Unstructured and Structured Data

• Entities
• Events
• Facts
• Sentiment
• Geospatial
• Temporal
• Themes

Infinit.e Data Model

Tablet ownership levels hit 18% in China, the UK and US versus 3% in November 2010

Bernanke, 57 said in his testimony price increases “have begun to moderate” after a jump in oil costs earlier this year

Duke and Progress announced merger plans in January 2012

<Incident> <uid>20101043423</uid> <subject>1 person killed in armed attack by suspected Boko Haram in Maiduguri, Borno, Nigeria</subject> <multipleDays>No</multipleDays> <eventDate>06/04/2011</eventDate></Incident>

Who: people, organizations, facilities, company

What: events, summaries, facts, themes

When: past, present, future dates

Where: city, state, country, coordinate
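The structured `<Incident>` record above can be mapped onto the same dimensions with a few lines of standard-library XML parsing. This is a sketch under stated assumptions, not how Infinit.e ingests records: the function name and output keys are hypothetical, the date is left as text because the slide does not say whether it is month-first or day-first, and treating `Yes` as the only true value of `multipleDays` is a guess.

```python
import xml.etree.ElementTree as ET

# The structured record from the slide, reproduced as-is.
INCIDENT_XML = (
    "<Incident>"
    "<uid>20101043423</uid>"
    "<subject>1 person killed in armed attack by suspected Boko Haram "
    "in Maiduguri, Borno, Nigeria</subject>"
    "<multipleDays>No</multipleDays>"
    "<eventDate>06/04/2011</eventDate>"
    "</Incident>"
)


def incident_to_record(xml_text):
    """Map an <Incident> element onto What / When fields plus its id."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.findtext("uid"),
        "what": root.findtext("subject"),
        # Kept as text: the date format (MM/DD or DD/MM) is not stated.
        "when": root.findtext("eventDate"),
        # Assumption: "Yes" is the only value meaning a multi-day event.
        "multiple_days": root.findtext("multipleDays") == "Yes",
    }


incident = incident_to_record(INCIDENT_XML)
```

The Who and Where values (Boko Haram; Maiduguri, Borno, Nigeria) sit inside the free-text subject, so recovering them would still need an entity extraction step on the `what` field.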

Applying Infinit.e

Open Analytic and Agile Intelligence architecture

“What were teams writing about and when?”

Harvested Entities

Reporting and Visualization

• Queries performed on the data, providing breakouts by team, topic, and dates

• Flexible visualization
  – Built-in visualization framework
  – Multiple export options
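Once team, topic, and date exist as metadata on each document, a breakout is a grouped count. The sketch below uses plain Python rather than the Infinit.e query interface, which the slides do not show; the `breakout` helper, the field names, and every record in `documents` are made up for illustration.

```python
from collections import Counter

# Made-up records standing in for extracted metadata; the team names,
# topics, and months are placeholders, not data from the case study.
documents = [
    {"team": "Team A", "topic": "TOPIC1", "month": "2011-01"},
    {"team": "Team A", "topic": "TOPIC2", "month": "2011-01"},
    {"team": "Team B", "topic": "TOPIC1", "month": "2011-02"},
    {"team": "Team A", "topic": "TOPIC1", "month": "2011-02"},
]


def breakout(records, *keys):
    """Count documents for each combination of the given fields."""
    return Counter(tuple(r[k] for k in keys) for r in records)


by_team = breakout(documents, "team")
by_team_topic = breakout(documents, "team", "topic")
by_topic_month = breakout(documents, "topic", "month")
```

Here `by_team_topic` answers "what were teams writing about", and `by_topic_month` adds the "when".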

Finding Value

• Over the course of 2.5 weeks, we applied the entity-based data model to our client’s document analysis problem

• Major advantages to this approach were:
  – Agility
  – Precision
  – Relevance

Agility

• Automation reduced processing time:
  – Manual processing time: ~480 hours
  – Automated processing time: 2-3 hours
• Speed enabled iterative development
  – Extraction adapted alongside analysts’ understanding of data
  – Positive feedback loop
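The two processing times quoted above imply a speedup that is easy to work out. The short calculation below uses only the slide's figures; since "~480 hours" is itself approximate, the result should be read as an order of magnitude.

```python
MANUAL_HOURS = 480        # "~480 hours" from the slide
AUTOMATED_HOURS = (2, 3)  # "2-3 hours" from the slide


def speedup(manual, automated):
    """How many times faster the automated run is than manual review."""
    return manual / automated


fastest = speedup(MANUAL_HOURS, AUTOMATED_HOURS[0])  # 480 / 2
slowest = speedup(MANUAL_HOURS, AUTOMATED_HOURS[1])  # 480 / 3
print(f"Automation was roughly {slowest:.0f}x to {fastest:.0f}x faster")
# prints "Automation was roughly 160x to 240x faster"
```

A run measured in hours rather than weeks is what makes it practical to rerun the extraction each time the entity definitions change.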

Precision

• Entity definitions created from original data
  – Definitions improved based on feedback

• Automation ensures uniform application across data set

[Figure: entities (entity1, entity2, entity3) mapped to topics (TOPIC1, TOPIC2)]

Relevance

• Entity extraction informs quality control
  – Duplicates identified based on similar entities
  – Exclude documents based on missing entities
  – Minimizes risk of data corruption
  – Reduced need for analyst review

[Figure panels: Duplicates; Missing Meta-Data]
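Both quality-control checks can be expressed over each document's set of extracted entities. The slides do not say how similarity was measured, so the sketch below swaps in Jaccard overlap as one plausible choice. The function names, the 0.8 threshold, the entity type labels, and the sample documents are all assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Share of entities two documents have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def likely_duplicates(docs, threshold=0.8):
    """Pairs of document ids whose entity sets overlap at or above threshold."""
    return [
        (x, y)
        for x, y in combinations(sorted(docs), 2)
        if jaccard(docs[x], docs[y]) >= threshold
    ]


def missing_required(docs, required):
    """Ids of documents that lack at least one required entity type."""
    flagged = []
    for doc_id in sorted(docs):
        kinds = {kind for kind, _ in docs[doc_id]}
        if not set(required) <= kinds:
            flagged.append(doc_id)
    return flagged


# Made-up documents: each maps to a set of (entity type, value) pairs.
docs = {
    "doc1": {("team", "Team A"), ("date", "2011-01"), ("topic", "TOPIC1")},
    "doc2": {("team", "Team A"), ("date", "2011-01"), ("topic", "TOPIC1")},
    "doc3": {("topic", "TOPIC2")},
}

duplicates = likely_duplicates(docs)                # doc1 and doc2 match
excluded = missing_required(docs, ["team", "date"])  # doc3 has no team or date
```

In practice the threshold and the list of required entity types would be tuned with the analysts, in line with the feedback loop described under Agility.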

The Results

• Extracted entities became key meta-data

6K+ unstructured documents became…

…3.5K documents with value to the study

The Results

• Our client was able to complete the research shortly after final extraction

• Confidence in methodology and results bolstered the value of recommendations

• Considering similar approaches for future projects

Bottom Line

Using document analysis significantly…

… reduces the time to ingest data.

… cuts right to relevant information.

… builds a framework for future analysis.

Thank You!

Andrew Strite

www.ikanow.com

astrite@ikanow.com

301.513.1384