Making sense of unstructured data by turning strings into things

Post on 01-Nov-2014

description

We all know about the promise of Big Data Analytics to transform our understanding of the world. The analysis of structured data, such as inventory, transactions, close rates, and even clicks, likes and shares, is clearly valuable, but the curious fact about the immense volume of data being produced is that the vast majority of it is unstructured text. Content such as news articles, blog posts, product reviews, and yes, even the dreaded 140-character novella contains tremendous value, if only it could be connected to things in the real world – people, places and things. In this talk, we’ll discuss the challenges and opportunities that result when you extract entities from Big Text.

Speaker: Gregor Stewart – Director of Product Management for Text Analytics at Basis Technology

As Director of Product Management, Mr. Stewart helps to ensure that Basis Technology’s offerings stay ahead of the curve. Previously Mr. Stewart was the CTO of a storage services startup and a strategy consultant. He holds a Master’s in Natural Language Processing from the University of Edinburgh, a BA in PPE from the University of Oxford, and a Master’s from the London School of Economics.

Thanks to our amazing sponsors: Microsoft NERD (http://microsoftnewengland.com/) for Venue, Basis Technology (http://basistech.com) for Food and Kindle Raffle

transcript

Analyze

Extract

Match

Transform

Information

Revealed

Connect

Overview

• Very briefly introduce Basis

• Motivate the move from Strings to Things

• Review two enabling technologies:

– Entity Extraction: finding names in text (and classifying them)

– Entity Resolution: connecting names together and to things

• Give you three examples of things you can do:

– Entity-based search, illustrating:

• How entities and enriched typing can empower searchers

• How human corrections might be used to improve accuracy over time

– Get additional high-quality enrichments from knowledge sources

– Recognize anomalies/outliers, by establishing rich norms

Introduction: Basis Technology

Introduction: Gregor Stewart

Facebingler cares, should you?

Entity Extraction: What is it?

Entity Extraction: How is it done?

[Diagram: entity extraction pipeline, with these labeled components]

• Probabilistic Extractor: Supervised Model, Unsupervised Model

• Deterministic Extractor: Exact Match (Gazetteer), Pattern Match (Regex)

• Entity Redactor

• Stages: Joining, Filtering, Adjudication

• Inputs: Input Text, Domain Text, Annotated Text, User Defined Lists, User Defined Patterns

• Output: Tagged Text
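To make the deterministic branch of the diagram concrete, here is a minimal sketch of exact-match (gazetteer) and pattern-match (regex) extraction followed by a simple adjudication step. The list contents, the patterns, the type labels, and the "longest span wins" rule are all assumptions made for illustration; they are not taken from the talk, and the probabilistic extractor is left out entirely.

```python
import re
from typing import NamedTuple


class Mention(NamedTuple):
    start: int
    end: int
    text: str
    label: str
    source: str  # which extractor proposed the mention


# Hypothetical user-defined list (gazetteer): surface form -> entity type.
GAZETTEER = {
    "Basis Technology": "ORGANIZATION",
    "University of Edinburgh": "ORGANIZATION",
    "Edinburgh": "LOCATION",
    "Madrid": "LOCATION",
}

# Hypothetical user-defined patterns: entity type -> regular expression.
PATTERNS = {
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def gazetteer_matches(text):
    """Exact match: find every occurrence of every listed name."""
    found = []
    for name, label in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            found.append(Mention(m.start(), m.end(), m.group(), label, "gazetteer"))
    return found


def pattern_matches(text):
    """Pattern match: find every span matched by a user-defined regex."""
    return [
        Mention(m.start(), m.end(), m.group(), label, "regex")
        for label, rx in PATTERNS.items()
        for m in rx.finditer(text)
    ]


def adjudicate(mentions):
    """Resolve overlaps by keeping the longest span (an assumed rule)."""
    chosen = []
    for m in sorted(mentions, key=lambda m: (-(m.end - m.start), m.start)):
        if all(m.end <= c.start or m.start >= c.end for c in chosen):
            chosen.append(m)
    return sorted(chosen, key=lambda m: m.start)


def extract(text):
    """Join the candidates from both extractors, then adjudicate."""
    return adjudicate(gazetteer_matches(text) + pattern_matches(text))


if __name__ == "__main__":
    for mention in extract("He studied at the University of Edinburgh until 5/2/11."):
        print(mention)
```

In this sketch the gazetteer proposes both "University of Edinburgh" and "Edinburgh" for the same stretch of text, and adjudication keeps only the longer organization span.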

Entity Resolution: What is it? (1)

Entity Resolution: What is it? (2)

[Diagram: many mentions of “Alberto” scattered through text]

Name variants:

• Alberto Amos Fernandez…

• Alberto M. Fernandez…

• Alberto Fernandez…

• Alberto Fernandiz…

• Albert Fernandez…

The same name, different people:

• Alberto Fernandez… Chief of Cabinet… Argentina… Prof of Criminal Law…

• Alberto Fernandez… born Sept 7, 1984… cycling… Madrid

• Alberto Fernandez… born in Cuba… US Ambassador

Further mentions:

• Alburto Fernandez…

• Alberto Fernandez de la Puebla…

Questions posed on the slide:

• Ratio of Politicians to Sportsmen? 2:1

• Alberto Fernandez… Sportsmen? YES

• Nickname “El Galleta?” ?
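The first half of the problem, connecting spelling variants of a name, can be sketched with plain string similarity. This is only an illustration using Python's standard `difflib`; the normalization steps and the 0.8 threshold are assumptions, and the talk does not say which name-matching technique is actually used.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop periods, and collapse whitespace."""
    return " ".join(name.lower().replace(".", " ").split())


def similarity(a, b):
    """Character-level similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def variants_of(query, names, threshold=0.8):
    """Return the names whose similarity to the query meets the threshold."""
    return [n for n in names if similarity(query, n) >= threshold]


if __name__ == "__main__":
    mentions = [
        "Alberto Fernandiz",
        "Albert Fernandez",
        "Alburto Fernandez",
        "Gregor Stewart",
    ]
    print(variants_of("Alberto Fernandez", mentions))
```

String similarity alone cannot tell the Argentine politician, the cyclist, and the ambassador apart, since all three share the identical string "Alberto Fernandez". That part needs context, which is where resolution comes in.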

But it’s not just text (1)

But it’s not just text (2)

Entity Resolution: How is it Done? (1)

Entity Resolution: How is it Done? (2)

Entity Resolution: How is it Done? (3)

Entity Resolution: How is it Done? (4)

[Diagram: Resolution Engine, with these labeled components]

• Input: Entity Mention + Context

• Candidate Selection, using the Entity Index

• Ranking

• Link or Ghost

• Knowledge Base: Seeded, Learned
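The stages named in the diagram can be sketched as a tiny pipeline: select candidates from an index, rank them against the mention's context, then either link to the best candidate or leave the mention as a "ghost". Everything concrete here is an assumption for illustration: the entity IDs, the context terms (loosely based on the Alberto Fernandez slides), the overlap score, the threshold, and the reading of "ghost" as "a mention that cannot be linked to a known entity".

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    entity_id: str            # hypothetical identifier
    name: str
    context_terms: frozenset  # words expected near mentions of this entity


# Seeded knowledge base (toy data).
KNOWLEDGE_BASE = [
    Entity("E1", "Alberto Fernandez",
           frozenset({"chief", "cabinet", "argentina", "criminal", "law"})),
    Entity("E2", "Alberto Fernandez",
           frozenset({"cycling", "madrid", "1984"})),
    Entity("E3", "Alberto Fernandez",
           frozenset({"cuba", "ambassador"})),
]

# Entity index: normalized name -> entities carrying that name.
ENTITY_INDEX = {}
for _entity in KNOWLEDGE_BASE:
    ENTITY_INDEX.setdefault(_entity.name.lower(), []).append(_entity)


def select_candidates(mention):
    """Candidate selection: look the mention up in the entity index."""
    return ENTITY_INDEX.get(mention.lower(), [])


def score(entity, context):
    """Fraction of the entity's expected terms that appear in the context."""
    words = set(re.findall(r"\w+", context.lower()))
    return len(words & entity.context_terms) / len(entity.context_terms)


def resolve(mention, context, threshold=0.3):
    """Rank candidates, then link to the best one or report a ghost."""
    ranked = sorted(select_candidates(mention),
                    key=lambda e: score(e, context), reverse=True)
    if ranked and score(ranked[0], context) >= threshold:
        return ("link", ranked[0].entity_id)
    return ("ghost", None)


if __name__ == "__main__":
    print(resolve("Alberto Fernandez", "He won the cycling stage into Madrid."))
    print(resolve("Alberto Fernandez", "He opened a bakery."))
```

A learned knowledge base would go further, for example by turning confirmed ghosts into new entries. That step is not shown.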

A (Convenient) Fiction…

• In a nearby place, not so long ago… the CIA was asked by the President to assess the likelihood that the Syrian opposition would use chemical weapons by mid-2014.

• As part of building that analysis, and because there are Al-Qaeda elements in the Syrian opposition, Alice the analyst was asked to: characterize Al-Qaeda’s attitude to using chemical weapons against Middle Eastern governments.

From: Ayman Al-Zawahiri (?)

To: “Hafiz Sultan”

“Dear Brother, We need guidance from you on the issue of using chlorine gas technology. It was reported that the brothers in Iraq have used it, but this was implicitly denied in a statement issued by the Islamic State of Iraq.

The brothers where Mahmud is have the potential to use chlorine gas on the forces of the apostates, Jalal Talabani and Mas'ud Barzani, and have already considered using it.

However, I informed them that matters as serious as this require centralized [coordination] and permission from the senior [al-Qa'ida] leadership, because the gas could be difficult to control and might harm some people, which could tarnish our image, alienate people from us, and so on.”

A document that Alice needs to read (socom-2012-0000011)…

[Timeline, shown on a sequence of slides: 3/28/07 to Today, with marked dates 5/2/11, 5/3/11, 5/12/11 and 5/14/11]

Advanced enrichment: Topics?

• Some knowledge sources have rich connectivity between things, concepts, etc.

• Developers often ask for “Topic”

• Even advanced Topic approaches often yield “howlers”

• Better labels might be derived from node info or graph walking.
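One way to picture "graph walking" for labels: walk upward from each entity through broader categories and pick the closest category that all of them share. The sketch below does this over a toy graph. The graph, its node names, and the "smallest total distance" rule are hypothetical and only illustrate the idea.

```python
from collections import deque

# Hypothetical toy knowledge graph: node -> its broader (parent) nodes.
BROADER = {
    "cyclist": ["sportsperson"],
    "footballer": ["sportsperson"],
    "sportsperson": ["person"],
    "minister": ["public official"],
    "diplomat": ["public official"],
    "public official": ["person"],
    "person": [],
}


def ancestors(node):
    """Breadth-first walk up the graph; returns {ancestor: distance}."""
    distance = {}
    queue = deque([(node, 0)])
    while queue:
        current, d = queue.popleft()
        for parent in BROADER.get(current, []):
            if parent not in distance:
                distance[parent] = d + 1
                queue.append((parent, d + 1))
    return distance


def topic_label(nodes):
    """Closest ancestor shared by every node, or None if there is none."""
    walks = [ancestors(n) for n in nodes]
    if not walks:
        return None
    shared = set(walks[0]).intersection(*walks[1:])
    if not shared:
        return None
    # Prefer the shared ancestor with the smallest total distance;
    # break ties alphabetically so the result is deterministic.
    return min(shared, key=lambda a: (sum(w[a] for w in walks), a))


if __name__ == "__main__":
    print(topic_label(["cyclist", "footballer"]))
    print(topic_label(["cyclist", "minister"]))
```

A label chosen this way is always a real node in the knowledge source, which may help avoid some of the "howlers" that free-form topic models produce.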

Advanced enrichment: Norms?

• By walking the graph in very specific ways, we can build one or more efficient representations of what is normal or expected context for an entity.

• This could be focused on particular entities, types of entities, relationships, etc.

• We could use these representations to affect result rankings or raise alerts.

• Note again that this is not specific to text: elementary parts of other unstructured sources such as images and video might be connected/used in the same way.
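As a rough sketch of the "norms" idea: record which context terms usually accompany an entity, then score a new context by how much of it falls outside that norm. The representation (term frequencies across observed contexts), the `min_support` cutoff, and the sample data are assumptions for illustration, not the approach described in the talk.

```python
from collections import Counter


def build_norm(contexts):
    """Share of observed contexts in which each term appears."""
    counts = Counter(term for ctx in contexts for term in set(ctx))
    total = len(contexts)
    return {term: n / total for term, n in counts.items()}


def anomaly_score(norm, context, min_support=0.2):
    """Fraction of context terms that are rare or unseen in the norm.

    0.0 means every term is expected; 1.0 means none of them is.
    """
    terms = set(context)
    if not terms:
        return 0.0
    unexpected = [t for t in terms if norm.get(t, 0.0) < min_support]
    return len(unexpected) / len(terms)


if __name__ == "__main__":
    # Hypothetical contexts previously observed for one entity (a cyclist).
    observed = [
        {"cycling", "madrid", "race"},
        {"cycling", "team", "stage"},
        {"cycling", "race", "tour"},
    ]
    norm = build_norm(observed)
    for ctx in ({"cycling", "race"}, {"cycling", "embassy"}, {"ambassador", "embassy"}):
        print(sorted(ctx), anomaly_score(norm, ctx))
```

A score above some chosen cutoff could lower a result's rank or raise an alert. The same scoring would apply to features extracted from images or video, as long as they can be expressed as terms tied to an entity.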

Summary

• Extraction and resolution components like REX and RES can reliably connect Strings to Things in a range of texts.

• This allows existing knowledge to be usefully applied:

– We can add properties (like types), and other advanced enrichments

– We can discover where existing knowledge is lacking

• Thing-based search can allow each query to be more precise and productive

– Fewer queries, fewer adjustments, fewer results to read

• By using abundant human feedback, KB quality and resolution accuracy can be increased.

– More subtle distinctions between entities can be learned, example by example.

• But…

…these tools are like shoes…

Thank You!

gregor@basistech.com