
Post on 18-Jan-2016


Text Analytics
A Tool for Taxonomy Development

Tom Reamy
Chief Knowledge Architect

KAPS Group

Program Chair – Text Analytics World

Knowledge Architecture Professional Services

http://www.kapsgroup.com


Agenda

Introduction

Project: Update ACM taxonomy – after 12+ years

Information Environment

Text Mining / Text Analytics Multiple Methods / Reports

Conclusion


Introduction: KAPS Group

Knowledge Architecture Professional Services – Network of Consultants
Applied Theory – Faceted & emotion taxonomies, natural categories

Services:
– Strategy – IM & KM - Text Analytics, Social Media, Integration
– Taxonomy/Text Analytics, Social Media development, consulting
– Text Analytics Quick Start – Audit, Evaluation, Pilot

Partners – Smart Logic, Expert Systems, SAS, SAP, IBM, FAST, Concept Searching, Attensity, Clarabridge, Lexalytics

Clients: Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, World Bank, Dept. of Transportation, etc.

Program Chair – Text Analytics World – March 29-April 1 - SF
Presentations, Articles, White Papers – www.kapsgroup.com
Current – Book – Text Analytics: How to Conquer Information Overload, Get Real Value from Social Media, and Add Smart Text to Big Data


Introduction: Approach

Is Automatic Taxonomy Development Here Yet?
Not yet, but it is getting closer
Hybrid:
– Taxonomists, SMEs, database analysts, text analysts
– Text Mining software – basic text analysis – power
– Text analytics software – brains

New taxonomy terms & structure
– Old = indexing, authors adding tags & keywords
– New = auto-tagging, applications
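
The "New = auto-tagging" idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the category names, the rule terms, and the helper `auto_tag` are hypothetical and are not taken from the ACM taxonomy or from any text analytics product. The sketch only shows how lower-level terms could act as evidence for assigning a category.

```python
import re

# Hypothetical taxonomy fragment: category -> lower-level terms used as rule evidence.
# These names are illustrative only, not actual CCS categories or rules.
TAGGING_RULES = {
    "Information retrieval": ["search engine", "query expansion", "relevance ranking"],
    "Machine learning": ["neural network", "decision tree", "clustering"],
}

def auto_tag(text, rules=TAGGING_RULES, min_hits=1):
    """Return {category: hit count} for categories with at least min_hits term matches."""
    lowered = text.lower()
    tags = {}
    for category, terms in rules.items():
        # Count whole-phrase matches for every rule term of this category.
        hits = sum(
            len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
            for term in terms
        )
        if hits >= min_hits:
            tags[category] = hits
    return tags

abstract = "We train a neural network and a decision tree to improve relevance ranking."
print(auto_tag(abstract))
```

Raising `min_hits` makes the rules stricter; real auto-tagging rules would typically also weigh where a term occurs (title, abstract, body).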


Information Environment

Existing Taxonomy: Computing Classification System
Content:

– Database export of Guide to the Computing Literature bibliographic records (.txt; approximately 7GB in 58 files.)

– Statistical distribution of CCS categories across the Digital Library and Guide to Computing Literature (Excel; 4 files)

– ACM Digital Library full text files (PDFs and XML metadata, including CCS categories; approximately 170GB in 240,000 files)

– Ralston Encyclopedia of Computer Science (PDFs and HTML of each article with XML metadata, including CCS categories; approximately 350MB in 1,850 files)

Text Analytics in Taxonomy Development
Case Study – Multiple Methods

Text Mining - terms in documents – frequency, date, source, etc.
– Text Preparation – Create multiple filters
Quality – important terms, co-occurring terms
Time savings – only feasible way to scan documents
Clustering – suggested categories, chunking for editors
– Clustering within clusters - explore
Entity Extraction – people, organizations, programming languages, hardware/devices, etc.
Joint Work Sessions – interactive exploration
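
The "important terms, co-occurring terms" step can be sketched in a few lines. This is a minimal illustration, not the tooling used in the project: the noise-word list, the sample documents, and the helper names `tokenize` and `term_and_pair_counts` are all assumptions.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical noise-word list; a real project would build this iteratively.
NOISE_WORDS = {"the", "a", "an", "of", "and", "for", "in", "on", "to", "with"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not noise words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in NOISE_WORDS]

def term_and_pair_counts(documents):
    """Count, per document, which terms occur and which term pairs co-occur."""
    term_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        # Use each term once per document, sorted so pairs have a stable order.
        terms = sorted(set(tokenize(doc)))
        term_counts.update(terms)
        pair_counts.update(combinations(terms, 2))
    return term_counts, pair_counts

docs = [
    "Clustering of text documents for taxonomy development",
    "Entity extraction and clustering in text mining",
    "Taxonomy development with text analytics",
]
term_counts, pair_counts = term_and_pair_counts(docs)
print(term_counts.most_common(3))
print(pair_counts[("taxonomy", "text")])
```

Counting each term once per document gives document frequency rather than raw frequency, which keeps one long document from dominating the report.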


Case Study – Taxonomy Development


Multiple Sets of Reports

Keyword Frequency
– First Pass – 3,026
– Total – 508,941 (Get from Big Database)
– Sub-Totals
• Year Pre-1998, By Year, By 5 year blocks
• Map to other variables – Journals, Authors – basis for communities
Keywords in Abstract/Title
Cluster analysis of keyword-abstract-title
Search Terms in keyword-abstract-title
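
The keyword sub-totals (pre-1998, then 5-year blocks) can be sketched as a simple bucketing step. The sample records and the helpers `year_block` and `keyword_frequency_by_block` are hypothetical, and starting the first block at 1998 is an assumption based on the "Pre-1998" split above.

```python
from collections import Counter, defaultdict

def year_block(year, cutoff=1998, width=5):
    """Label a year as 'pre-1998' or as a five-year block such as '1998-2002'."""
    if year < cutoff:
        return f"pre-{cutoff}"
    start = cutoff + ((year - cutoff) // width) * width
    return f"{start}-{start + width - 1}"

def keyword_frequency_by_block(records):
    """records: iterable of (year, [keywords]). Returns {block label: Counter}."""
    totals = defaultdict(Counter)
    for year, keywords in records:
        # Normalize case so 'Databases' and 'databases' count as one keyword.
        totals[year_block(year)].update(k.lower() for k in keywords)
    return totals

# Hypothetical bibliographic records: (publication year, author keywords).
records = [
    (1995, ["Databases", "Compilers"]),
    (1999, ["Data Mining", "Databases"]),
    (2001, ["Data Mining"]),
    (2004, ["Cloud Computing", "Data Mining"]),
]
report = keyword_frequency_by_block(records)
for block in sorted(report):
    print(block, dict(report[block]))
```

The same grouping key could be swapped for a journal or author field to produce the "map to other variables" views.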


Entity Extraction – Company, Internet, Organization, Title


Multiple Methods - Reports

Spreadsheets – static reports
Database query reports
– Create multiple slices, views, filters
Working reports – eliminate more noise words
Multiple mapping – extractions, author tags & keywords
Map – frequency in abstracts, titles, articles
Search logs – terms and phrases
Date ranges – trend reports – per terms, new words
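
A trend report for "new words" per date range can be sketched as a running set difference. The period labels, sample terms, noise-word list, and the helper `new_terms_by_period` are assumptions for illustration only.

```python
# Hypothetical noise-word list; in practice it grows with each working report.
NOISE_WORDS = {"paper", "approach", "results", "novel"}

def new_terms_by_period(period_terms, noise_words=NOISE_WORDS):
    """period_terms: list of (label, set of terms) in chronological order.
    Returns {label: sorted terms that did not appear in any earlier period}."""
    seen = set()
    report = {}
    for label, terms in period_terms:
        # Normalize case and drop noise words before comparing periods.
        cleaned = {t.lower() for t in terms} - noise_words
        report[label] = sorted(cleaned - seen)
        seen |= cleaned
    return report

# Hypothetical terms extracted from abstracts, grouped by date range.
periods = [
    ("1998-2002", {"data mining", "xml", "paper"}),
    ("2003-2007", {"data mining", "web services", "novel"}),
    ("2008-2012", {"cloud computing", "web services", "xml"}),
]
trend = new_terms_by_period(periods)
for label, terms in trend.items():
    print(label, terms)
```

Terms that first appear in a recent period are candidates for new taxonomy nodes, subject to review by taxonomists and SMEs.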


Conclusions

Auto-taxonomy not here - Yet
Scale requires semi-automated solution
Human effort – initial design, text preparation
– Now would add more auto-categorization
Human effort – analysis & refinement – of queries, text mining, and taxonomy
Simple taxonomies are better – part of information ecosystem
– Lower levels of terms – into auto-tagging rules
Early 2015: New Book:
– Text Analytics: Everything You Need to Know to Conquer Information Overload, Mine Social Media for Real Value, and Turn Big Text Into Big Data
– Title might be shorter but it will cover all you need to know

Questions?

Tom Reamy
tomr@kapsgroup.com

KAPS Group

Knowledge Architecture Professional Services

http://www.kapsgroup.com