Text Analytics: A Tool for Taxonomy Development
Tom Reamy, Chief Knowledge Architect
KAPS Group
Program Chair – Text Analytics World
Knowledge Architecture Professional Services
http://www.kapsgroup.com
Agenda
Introduction
Project: Update ACM taxonomy – after 12+ years
Information Environment
Text Mining / Text Analytics
Multiple Methods / Reports
Conclusion
Introduction: KAPS Group
Knowledge Architecture Professional Services – Network of Consultants
Applied Theory – Faceted & emotion taxonomies, natural categories
Services:
– Strategy – IM & KM – Text Analytics, Social Media, Integration
– Taxonomy/Text Analytics, Social Media development, consulting
– Text Analytics Quick Start – Audit, Evaluation, Pilot
Partners – Smart Logic, Expert Systems, SAS, SAP, IBM, FAST, Concept Searching, Attensity, Clarabridge, Lexalytics
Clients: Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, World Bank, Dept. of Transportation, etc.
Program Chair – Text Analytics World – March 29-April 1 – SF
Presentations, Articles, White Papers – www.kapsgroup.com
Current – Book – Text Analytics: How to Conquer Information Overload, Get Real Value from Social Media, and Add Smart Text to Big Data
Introduction: Approach
Is Automatic Taxonomy Development Here Yet?
Not yet, but it is getting closer
Hybrid:
– Taxonomists, SMEs, database analysts, text analysts
– Text mining software – basic text analysis – power
– Text analytics software – brains
New taxonomy terms & structure
– Old = indexing, authors adding tags & keywords
– New = auto-tagging, applications
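As a rough illustration of the "new" auto-tagging use, the sketch below assigns taxonomy categories to a text when any of their trigger terms appear. The category names, trigger terms, and the `auto_tag` function are hypothetical and do not come from the ACM project; commercial text analytics software uses much richer rules.

```python
import re

# Hypothetical rules: each category lists lower-level terms that trigger the tag.
TAGGING_RULES = {
    "Information retrieval": ["search engine", "query log", "relevance ranking"],
    "Machine learning": ["neural network", "decision tree", "clustering"],
}

def auto_tag(text, rules=TAGGING_RULES, min_hits=1):
    """Return the sorted categories whose trigger terms occur at least min_hits times."""
    lowered = text.lower()
    tags = []
    for category, terms in rules.items():
        # Count whole-phrase matches of every trigger term for this category.
        hits = sum(
            len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
            for term in terms
        )
        if hits >= min_hits:
            tags.append(category)
    return sorted(tags)

tags = auto_tag("We apply clustering to query log data from a search engine.")
print(tags)
```

Raising `min_hits` is one simple way to trade recall for precision when a single mention is too weak a signal.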
Information Environment
Existing Taxonomy: Computing Classification System
Content:
– Database export of Guide to the Computing Literature bibliographic records (.txt; approximately 7GB in 58 files)
– Statistical distribution of CCS categories across the Digital Library and Guide to Computing Literature (Excel; 4 files)
– ACM Digital Library full text files (PDFs and XML metadata, including CCS categories; approximately 170GB in 240,000 files)
– Ralston Encyclopedia of Computer Science (PDFs and HTML of each article with XML metadata, including CCS categories; approximately 350MB in 1,850 files)
Text Analytics in Taxonomy Development
Case Study – Multiple Methods
Text Mining – terms in documents – frequency, date, source, etc.
– Text Preparation – create multiple filters
Quality – important terms, co-occurring terms
Time savings – only feasible way to scan documents
Clustering – suggested categories, chunking for editors
– Clustering within clusters – explore
Entity Extraction – people, organizations, programming languages, hardware/devices, etc.
Joint Work Sessions – interactive exploration
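The term-frequency and co-occurrence steps above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the text mining software used in the project; the noise-word list, the sample documents, and the function names are all assumptions.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical noise-word filter; a real project would maintain a much longer list.
NOISE_WORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "with", "is", "are"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not noise words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in NOISE_WORDS]

def term_and_pair_counts(documents):
    """Count how many documents each term, and each unordered term pair, appears in."""
    term_counts = Counter()
    pair_counts = Counter()
    for doc in documents:
        terms = sorted(set(tokenize(doc)))  # one vote per document, stable pair order
        term_counts.update(terms)
        pair_counts.update(combinations(terms, 2))
    return term_counts, pair_counts

docs = [
    "Clustering of text documents for taxonomy development",
    "Entity extraction and clustering in text analytics",
    "A taxonomy for text analytics",
]
term_counts, pair_counts = term_and_pair_counts(docs)
print(term_counts.most_common(3))
print(pair_counts.most_common(3))
```

Counting document frequency rather than raw occurrences keeps one long document from dominating the report; either choice is defensible.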
Case Study – Taxonomy Development
Multiple Sets of Reports
Keyword Frequency
– First Pass – 3,026
– Total – 508,941 (Get from Big Database)
– Sub-Totals
• Year Pre-1998, By Year, By 5-year blocks
• Map to other variables – Journals, Authors – basis for communities
Keywords in Abstract/Title
Cluster analysis of keyword-abstract-title
Search Terms in keyword-abstract-title
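A keyword-frequency report with a pre-1998 bucket and 5-year blocks might be produced along these lines. The record layout, the block boundaries (aligned to start at 1998), and the function names are assumptions for illustration; the actual reports came from database queries.

```python
from collections import Counter, defaultdict

def year_block(year, cutoff=1998, width=5):
    """Return a label such as 'pre-1998' or '1998-2002' for a publication year."""
    if year < cutoff:
        return f"pre-{cutoff}"
    start = cutoff + ((year - cutoff) // width) * width
    return f"{start}-{start + width - 1}"

def keyword_counts_by_block(records):
    """Tally keyword frequency per year block from (year, keywords) records."""
    counts = defaultdict(Counter)
    for year, keywords in records:
        # Normalize case and whitespace so variant spellings are counted together.
        counts[year_block(year)].update(k.strip().lower() for k in keywords)
    return counts

# Hypothetical bibliographic records: (publication year, author keywords).
records = [
    (1995, ["Algorithms", "Databases"]),
    (1999, ["Data mining", "Databases"]),
    (2002, ["data mining"]),
    (2004, ["Cloud computing"]),
]
report = keyword_counts_by_block(records)
for block in sorted(report):
    print(block, dict(report[block]))
```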
13
Entity Extraction – Company, Internet, Organization, Title
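A very simple form of entity extraction is a dictionary (gazetteer) lookup plus a pattern for internet addresses, sketched below. The gazetteer entries, the sample sentence, and the `extract_entities` function are hypothetical; commercial extraction tools rely on far larger dictionaries and on contextual rules.

```python
import re

# Hypothetical gazetteer; entries are illustrative only.
GAZETTEER = {
    "Company": ["Acme Corp"],
    "Organization": ["Example University"],
    "Title": ["Professor"],
}
# Rough pattern for web addresses; trailing punctuation is trimmed afterwards.
URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s,;)]+")

def extract_entities(text):
    """Return a sorted list of (entity_type, matched_text) pairs found in the text."""
    found = set()
    for entity_type, names in GAZETTEER.items():
        for name in names:
            if re.search(r"\b" + re.escape(name) + r"\b", text):
                found.add((entity_type, name))
    for match in URL_PATTERN.findall(text):
        found.add(("Internet", match.rstrip(".")))
    return sorted(found)

sample = ("Professor Lee of Example University and Acme Corp "
          "share data at http://www.example.org.")
entities = extract_entities(sample)
for entity_type, value in entities:
    print(entity_type, "-", value)
```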
Multiple Methods - Reports
Spreadsheets – static reports
Database query reports
– Create multiple slices, views, filters
Working reports – eliminate more noise words
Multiple mapping – extractions, author tags & keywords
Map – frequency in abstracts, titles, articles
Search logs – terms and phrases
Date ranges – trend reports – per terms, new words
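A date-range trend report that shows per-term counts and flags new words could look like the sketch below. The `(term, year)` input format, the periods, and the `trend_report` function are assumptions made for illustration.

```python
from collections import Counter

def trend_report(term_years, periods):
    """For each (start, end) period, return term counts and terms first seen in it."""
    # Record the earliest year in which each term was observed.
    first_seen = {}
    for term, year in term_years:
        first_seen[term] = min(year, first_seen.get(term, year))
    report = {}
    for start, end in periods:
        counts = Counter(t for t, y in term_years if start <= y <= end)
        new_terms = sorted(t for t, y in first_seen.items() if start <= y <= end)
        report[(start, end)] = {"counts": counts, "new_terms": new_terms}
    return report

# Hypothetical observations: one (term, year) pair per occurrence.
observations = [
    ("databases", 1996), ("databases", 2001), ("data mining", 2001),
    ("data mining", 2006), ("cloud computing", 2007),
]
report = trend_report(observations, [(1995, 1999), (2000, 2004), (2005, 2009)])
for period, row in report.items():
    print(period, dict(row["counts"]), "new:", row["new_terms"])
```

Terms that are new in a recent period are natural candidates for review as additions to the taxonomy.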
Conclusions
Auto-taxonomy not here – yet
Scale requires semi-automated solution
Human effort – initial design, text preparation
– Now would add more auto-categorization
Human effort – analysis & refinement – of queries, text mining, and taxonomy
Simple taxonomies are better – part of information ecosystem
– Lower levels of terms – into auto-tagging rules
Early 2015: New Book:
– Text Analytics: Everything You Need to Know to Conquer Information Overload, Mine Social Media for Real Value, and Turn Big Text Into Big Data
– Title might be shorter but it will cover all you need to know
Questions?
KAPS Group
Knowledge Architecture Professional Services
http://www.kapsgroup.com