Text Analytics Software Choosing the Right Fit Tom Reamy Chief Knowledge Architect KAPS Group Text...

Post on 20-Jan-2016


Text Analytics Software: Choosing the Right Fit

Tom Reamy, Chief Knowledge Architect

KAPS Group

http://www.kapsgroup.com

Text Analytics World

October 20 New York


Agenda

Introduction – Text Analytics Basics
Evaluation Process & Methodology
  – Two Stages – Initial Filters & POC
Proof of Concept
  – Methodology – Results
Text Analytics and “Text Analytics”
Conclusions


KAPS Group: General

Knowledge Architecture Professional Services
Virtual Company: Network of consultants – 8-10
Partners – SAS, SAP, FAST, Smart Logic, Concept Searching, etc.
Consulting, Strategy, Knowledge architecture audit
Services:
  – Taxonomy/Text Analytics development, consulting, customization
  – Evaluation of Enterprise Search, Text Analytics
  – Text Analytics Assessment, Fast Start
  – Technology Consulting – Search, CMS, Portals, etc.
  – Knowledge Management: Collaboration, Expertise, e-learning
  – Applied Theory – Faceted taxonomies, complexity theory, natural categories


Introduction to Text Analytics
Text Analytics Features

Noun Phrase Extraction
  – Catalogs with variants, rule based dynamic
  – Multiple types, custom classes – entities, concepts, events
  – Feeds facets
Summarization
  – Customizable rules, map to different content
Fact Extraction
  – Relationships of entities – people-organizations-activities
  – Ontologies – triples, RDF, etc.
Sentiment Analysis
  – Rules – Objects and phrases
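
To make "catalogs with variants" that "feed facets" concrete, here is a minimal sketch in Python. It is not any vendor's implementation: the catalog contents, the entity classes, and the function names (`build_patterns`, `extract_facets`) are all assumptions made for illustration. Each canonical entity lists its known variants, and any variant found in the text is reported under the canonical name, grouped by class.

```python
import re
from collections import defaultdict

# Hypothetical catalog: canonical entity -> (entity class, known variants).
CATALOG = {
    "SAS Institute": ("organization", ["SAS Institute", "SAS"]),
    "Government Accountability Office": (
        "organization",
        ["Government Accountability Office", "GAO"],
    ),
    "New York": ("place", ["New York", "NYC"]),
}


def build_patterns(catalog):
    """Compile one whole-word pattern per canonical entity, longest variant first."""
    patterns = {}
    for canonical, (_cls, variants) in catalog.items():
        ordered = sorted(variants, key=len, reverse=True)
        alternation = "|".join(re.escape(v) for v in ordered)
        patterns[canonical] = re.compile(r"\b(?:" + alternation + r")\b")
    return patterns


def extract_facets(text, catalog=CATALOG):
    """Return {entity class: sorted canonical names whose variants occur in text}."""
    facets = defaultdict(set)
    for canonical, pattern in build_patterns(catalog).items():
        if pattern.search(text):
            facets[catalog[canonical][0]].add(canonical)
    return {cls: sorted(names) for cls, names in facets.items()}


facets = extract_facets("The GAO evaluated SAS at a workshop in NYC.")
print(facets)
```

Matching here is case-sensitive and purely literal. A production catalog would normally add rules for disambiguation, which this sketch leaves out.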


Introduction to Text Analytics
Text Analytics Features

Auto-categorization
  – Training sets – Bayesian, Vector space
  – Terms – literal strings, stemming, dictionary of related terms
  – Rules – simple – position in text (Title, body, url)
  – Semantic Network – Predefined relationships, sets of rules
  – Boolean – Full search syntax – AND, OR, NOT
  – Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE
This is the most difficult to develop
Build on a Taxonomy
Combine with Extraction
  – If any of list of entities and other words
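
As an illustration of a Boolean categorization rule with a proximity operator, the sketch below combines AND, NOT, and a DIST-like test in Python. The exact semantics of DIST(#) differ by product. The reading used here (two terms within a given number of token positions, in either order) is an assumption, and the rule, its terms, and the function names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase word tokens; a deliberately simple stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def within_distance(tokens, a, b, max_gap):
    """True if a and b occur at most max_gap token positions apart, in any order."""
    pos_a = [i for i, t in enumerate(tokens) if t == a]
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


def billing_complaint_rule(text):
    """Hypothetical rule: ('bill' within 5 tokens of 'wrong') AND NOT 'refunded'."""
    tokens = tokenize(text)
    return within_distance(tokens, "bill", "wrong", 5) and "refunded" not in tokens


print(billing_complaint_rule("My bill was wrong again this month"))
print(billing_complaint_rule("The bill was wrong but it was refunded"))
```

An ordered variant (in the spirit of ORDDIST) would require the position of the first term to precede that of the second. Sentence and paragraph scoping would need the text split into those units before the test is applied.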

Case Study – Categorization & Sentiment


Evaluation Process & Methodology
Overview

Start with Self Knowledge
  – Think Big, Start Small, Scale Fast
Eliminate the unfit
  – Filter One – Ask Experts – reputation, research – Gartner, etc.
    • Market strength of vendor, platforms, etc.
    • Feature scorecard – minimum, must have, filter to top 3
  – Filter Two – Technology Filter – match to your overall scope and capabilities – Filter not a focus
  – Filter Three – In-Depth Demo – 3-6 vendors
Deep POC (2) – advanced, integration, semantics
Focus on working relationship with vendor.
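
The Filter One feature scorecard can be sketched as a two-step filter: drop every vendor that lacks a must-have feature, then rank the rest and keep the top three. Everything in the Python example below is hypothetical, including the vendor names, feature names, and weights. A real scorecard would be driven by your own requirements.

```python
# Hypothetical must-have features and illustrative weights for optional ones.
MUST_HAVE = {"auto_categorization", "entity_extraction"}
WEIGHTS = {"sentiment": 3, "multilingual": 2, "api": 2, "summarization": 1}

# Hypothetical vendors and the features each one offers.
VENDORS = {
    "Vendor A": {"auto_categorization", "entity_extraction", "sentiment", "api"},
    "Vendor B": {"auto_categorization", "sentiment", "multilingual", "api"},
    "Vendor C": {"auto_categorization", "entity_extraction", "multilingual"},
    "Vendor D": {"auto_categorization", "entity_extraction", "summarization"},
    "Vendor E": {
        "auto_categorization", "entity_extraction",
        "sentiment", "multilingual", "api",
    },
}


def shortlist(vendors, must_have, weights, top_n=3):
    """Drop vendors missing any must-have, then keep the top_n by weighted score."""
    qualified = {n: f for n, f in vendors.items() if must_have <= f}
    scored = [
        (sum(weights.get(feature, 0) for feature in feats), name)
        for name, feats in qualified.items()
    ]
    # Highest score first; ties broken alphabetically for a stable result.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _score, name in scored[:top_n]]


top_three = shortlist(VENDORS, MUST_HAVE, WEIGHTS)
print(top_three)
```

The must-have check acts only as a filter. A vendor with many optional features is still eliminated if a required capability is missing, as Vendor B is here.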


Design of the Text Analytics Selection Team

Traditional Candidates – IT, Business, Library
IT – Experience with software purchases, needs assessment, budget
  – Search/Categorization is unlike other software, deeper look
Business – understand business, focus on business value
They can get executive sponsorship, support, and budget
  – But don’t understand information behavior, semantic focus
Library, KM – Understand information structure
Experts in search experience and categorization
  – But don’t understand business or technology


Design of the Text Analytics Selection Team

Interdisciplinary Team, headed by Information Professionals
Relative Contributions
  – IT – Set necessary conditions, support tests
  – Business – provide input into requirements, support project
  – Library – provide input into requirements, add understanding of search semantics and functionality
Much more likely to make a good decision
Create the foundation for implementation


Evaluating Taxonomy/Text Analytics Software
Start with Self Knowledge

Strategic and Business Context
Info Problems – what, how severe
Strategic Questions – why, what value from the text analytics, how are you going to use it
  – Platform or Applications?
Formal Process – KA audit – content, users, technology, business and information behaviors, applications – or informal for smaller organizations
Text Analytics Strategy/Model – forms, technology, people
  – Existing taxonomic resources, software
Need this foundation to evaluate and to develop


Varieties of Taxonomy/ Text Analytics Software

Taxonomy Management
  – Synaptica, SchemaLogic
Full Platform
  – SAS, SAP, Smart Logic, Linguamatics, Concept Searching, Expert System, IBM, GATE
Embedded – Search or Content Management
  – FAST, Autonomy, Endeca, Exalead, etc.
  – Nstein, Interwoven, Documentum, etc.
Specialty / Ontology (other semantic)
  – Sentiment Analysis – Lexalytics, Clarabridge, Lots of players
  – Ontology – extraction, plus ontology

Vendors of Taxonomy/ Text Analytics Software

  – Attensity
  – Business Objects – Inxight
  – Clarabridge
  – ClearForest
  – Concept Searching
  – Data Harmony / Access Innovations
  – Expert Systems
  – GATE (Open Source)
  – IBM Infosphere
  – Lexalytics
  – Multi-Tes
  – Nstein
  – SAS
  – SchemaLogic
  – Smart Logic
  – Synaptica


Initial Evaluation – Factors

Traditional Software Evaluation – Deeper
Basic & Advanced Capabilities
Lack of Essential Feature
  – No Sentiment Analysis, Limited language support
Customization vs. OOB
  – Strongest OOB – highest customization cost
Company experience, multiple products vs. platform
Ease of integration – APIs, Java
  – Internal and External Applications
  – Technical Issues, Development Environment
Total Cost of Ownership and support, initial price
POC Candidates – 1-4


Initial Evaluation – Factors
Case Studies

Amdocs
  – Customer Support Notes – short, badly written, millions of documents
  – Total Cost, multiple languages, Integration with their application
  – Distributed expertise – Platform – resell full range of services, Sentiment Analysis
  – Twenty to Four to POC (Two) to SAS
GAO
  – Library of 200 page PDF formal documents, plus public web site
  – People – library staff – 3-4 taxonomists – centralized expertise
  – Enterprise search, general public
  – Twenty to POC with SAS

Phase II - Proof Of Concept - POC

Measurable Quality of results is the essential factor
4 weeks POC – bake off / or short pilot
Real life scenarios, categorization with your content
2 rounds of development, test, refine / Not OOB
Need SMEs as test evaluators – also to do an initial categorization of content
Majority of time is on auto-categorization
Need to balance uniformity of results with vendor unique capabilities – have to determine at POC time
Taxonomy Developers – expert consultants plus internal taxonomists


POC Design: Evaluation Criteria & Issues

Basic Test Design – categorize test set
  – Score – by file name, human testers
Categorization & Sentiment – Accuracy 80-90%
  – Effort Level per accuracy level
Quantify development time – main elements
Comparison of two vendors – how score?
  – Combination of scores and report
Quality of content & initial human categorization
  – Normalize among different test evaluators
Quality of taxonomists – experience with text analytics software and/or experience with content and information needs and behaviors
Quality of taxonomy – structure, overlapping categories
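
One possible way to score a test set "by file name" against several human testers is shown below. It is a sketch under stated assumptions, not the method used in the POCs described here. The assumptions are that each file gets one category per evaluator, that the majority category serves as the reference, and that files without a strict majority are left out of the score. The file names, categories, and function names are invented for the example.

```python
from collections import Counter


def consensus_labels(human_labels):
    """Map file name -> majority category; files with no strict majority are skipped."""
    consensus = {}
    for file_name, labels in human_labels.items():
        if not labels:
            continue
        top, count = Counter(labels).most_common(1)[0]
        if count * 2 > len(labels):
            consensus[file_name] = top
    return consensus


def accuracy(system_labels, consensus):
    """Fraction of consensus files where the system's category matches."""
    scored = [f for f in consensus if f in system_labels]
    if not scored:
        return 0.0
    hits = sum(1 for f in scored if system_labels[f] == consensus[f])
    return hits / len(scored)


# Hypothetical categories from three evaluators per file.
human = {
    "doc1.txt": ["billing", "billing", "network"],
    "doc2.txt": ["network", "network", "network"],
    "doc3.txt": ["billing", "network", "handset"],  # no majority, so not scored
    "doc4.txt": ["handset", "handset", "billing"],
}
# Hypothetical output of the software under test.
system = {
    "doc1.txt": "billing",
    "doc2.txt": "billing",
    "doc3.txt": "handset",
    "doc4.txt": "handset",
}

reference = consensus_labels(human)
score = accuracy(system, reference)
print(f"accuracy on {len(reference)} files: {score:.2f}")
```

The share of files dropped for lack of a majority is itself worth reporting, since it gives a rough measure of how much the human evaluators disagree.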

Text Analytics POC Outcomes
Evaluation Factors

Variety & Limits of Content
  – Twitter to large formal libraries
Quality of Categorization
  – Scores – Recall, Precision (harder)
  – Operators – NOT, DIST, START
Development Environment & Methodology
  – Toolkit or Integrated Product
  – Effort Level and Usability
Importance of relevancy – can be used for precision, applications
Combination of workbench, statistical modeling
Measures – scores, reports, discussions
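
For reference, recall and precision for a single category can be computed as in the Python sketch below. The standard definitions are used: precision is the share of documents the software put in the category that belong there, and recall is the share of documents that belong in the category that the software found. The document ids, categories, and function name are hypothetical.

```python
def precision_recall(predicted, expected, category):
    """predicted and expected map document id -> set of categories."""
    pred_docs = {d for d, cats in predicted.items() if category in cats}
    true_docs = {d for d, cats in expected.items() if category in cats}
    true_positives = len(pred_docs & true_docs)
    precision = true_positives / len(pred_docs) if pred_docs else 0.0
    recall = true_positives / len(true_docs) if true_docs else 0.0
    return precision, recall


# Hypothetical human reference categorization.
expected = {
    "d1": {"billing"},
    "d2": {"billing"},
    "d3": {"network"},
    "d4": {"billing", "network"},
}
# Hypothetical output of the software under test.
predicted = {
    "d1": {"billing"},
    "d2": {"network"},
    "d3": {"network"},
    "d4": {"billing"},
}

precision, recall = precision_recall(predicted, expected, "billing")
print(f"billing: precision={precision:.2f} recall={recall:.2f}")
```

In this example every document tagged "billing" is correct, yet one billing document was missed, so precision is perfect while recall is not. A rule that is too narrow shows up as low recall, and one that is too broad shows up as low precision.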


POC and Early Development: Risks and Issues

CTO Problem – This is not a regular software process
Semantics is messy, not just complex
  – 30% accuracy isn’t 30% done – could be 90%
Variability of human categorization
Categorization is iterative, not “the program works”
  – Need realistic budget and flexible project plan
Anyone can do categorization
  – Librarians often overdo, SMEs often get lost (keywords)
Meta-language issues – understanding the results
  – Need to educate IT and business in their language


Text Analytics and “Text Analytics” – Text Mining

TA is pre-processing for text mining
TA adds huge dimensions of unstructured text
  – Now 85-90% of all content, Social Media
TA can improve the quality of text
  – Categorization, Disambiguated metadata extraction
Unstructured text into data – What are the possibilities?
  – New Kinds of Taxonomies – emotion, small smart modular
  – Information Overload – search, facets, auto-tagging, etc.
  – Behavior Prediction – individual actions (cancel or not?)
  – Customer & Business Intelligence – new relationships
  – Crowd sourcing – technical support
  – Expertise Analysis – documents, authors, communities


Conclusion

Start with self-knowledge – what will you use it for?
  – Current Environment – technology, information
Basic Features are only filters, not scores
Integration – need an integrated team (IT, Business, KA)
  – For evaluation and development
POC – your content, real world scenarios – not scores
Foundation for development, experience with software
  – Development is better, faster, cheaper
Categorization is essential, time consuming
Text Analytics opens up new worlds of applications


Questions?

Tom Reamy
tomr@kapsgroup.com

KAPS Group

Knowledge Architecture Professional Services

http://www.kapsgroup.com