Some Commercial Text Mining Systems

Xuanhui Wang, UIUC

March 29th, 2007

Why Text Mining?

• A large portion of all available information today exists in the form of unstructured text (information overload).
  – Books, magazine articles, research papers, product manuals, memorandums, e-mails, and of course the Web all contain textual information in natural-language form.
• A lot of critical information is in textual format
  – The voice of customers: customer email, customer complaints
  – Product reviews
• Thus, making correct decisions often requires analyzing large volumes of textual information
  – Business Intelligence

Text Mining (From Wikipedia)

• Refers generally to the process of deriving high-quality information from text.
• High-quality information is typically derived through the divining of patterns and trends through means such as statistical pattern learning.
• Process
  – Structuring the input text
  – Deriving patterns within the structured data
  – Finally, evaluating and interpreting the output
• Tasks
  – Text categorization, text clustering, concept/entity extraction, sentiment analysis, document summarization, and entity relation modeling

Structuring the input text: Information Extraction

• Named Entity recognition (NE)
  – Finds and classifies names, places, etc.
• Coreference resolution (CO)
  – Identifies identity relations between entities.
• Template Element construction (TE)
  – Adds descriptive information to NE results (using CO).
• Template Relation construction (TR)
  – Finds relations between TE entities.
• Scenario Template production (ST)
  – Fits TE and TR results into specified event scenarios

Dummy Example

“The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.”

• NE discovers that the entities present are the rocket, Tuesday, Dr. Head and We Build Rockets Inc.
• CO discovers that “it” refers to the rocket.
• TE discovers that the rocket is shiny red and that it is Head’s brainchild.
• TR discovers that Dr. Head works for We Build Rockets Inc.
• ST discovers that there was a rocket launching event in which the various entities were involved.
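
The first stages of this example can be imitated with a few hand-written rules. The sketch below is only a toy under stated assumptions: the gazetteer patterns, the entity labels, the "same type and same last word" coreference rule and the single relation pattern are all invented for this one passage, and it covers NE, a name-matching part of CO, and TR only. Real IE systems rely on far richer linguistic analysis.

```python
import re

TEXT = ("The shiny red rocket was fired on Tuesday. "
        "It is the brainchild of Dr. Big Head. "
        "Dr. Head is a staff scientist at We Build Rockets Inc.")

# Hypothetical hand-written gazetteer: surface pattern -> entity type.
GAZETTEER = {
    r"\brocket\b": "ARTIFACT",
    r"\bTuesday\b": "DATE",
    r"\bDr\. (?:Big )?Head\b": "PERSON",
    r"\bWe Build Rockets Inc\.": "ORGANIZATION",
}

def named_entities(text):
    """NE: find and classify entity mentions, returned in text order."""
    found = []
    for pattern, label in GAZETTEER.items():
        for m in re.finditer(pattern, text):
            found.append((m.start(), m.group(), label))
    return [(mention, label) for _, mention, label in sorted(found)]

def coreference(entities):
    """CO (names only): mentions with the same type and last word share
    the first mention seen as their canonical form."""
    canonical = {}
    for mention, label in entities:
        canonical.setdefault((label, mention.split()[-1]), mention)
    return {mention: canonical[(label, mention.split()[-1])]
            for mention, label in entities}

def relations(text, coref):
    """TR: one assumed pattern, '<person> is a ... at <organization>'."""
    m = re.search(r"(Dr\. \w+) is a [\w ]+ at ([\w ]+ Inc\.)", text)
    if not m:
        return []
    person, org = m.group(1), m.group(2)
    return [(coref.get(person, person), "works_for", org)]

entities = named_entities(TEXT)
coref = coreference(entities)
print(entities)
print(relations(TEXT, coref))
```

Resolving the pronoun "It", filling template elements (TE) and recognising the launch scenario (ST) are left out on purpose, since they need more than pattern matching.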

Some Systems

• Attensity
• Inxight
• Anderson
• ClearForest
• TextAnalyst
• Linguamatics

Attensity

• http://www.attensity.com/
• Founded in early 2000
• Culmination of over a decade of research in computational linguistics at the University of Utah
• The technology allows users to extract and analyze facts like who, what, where, when and why
• Allows users to drill down to understand people, places and events and how they are related
• It then creates output in XML and in a structured relational data format that is fused with existing structured data
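
To make the last point concrete, here is a minimal sketch of emitting extracted facts as XML and as relational rows that can be joined with existing structured data. Everything in it is an assumption for illustration: the fact records, the element names, the table and column names, and the `dealer` table are hypothetical and do not reflect Attensity's actual output format.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical extracted facts (who, what, where, when).
facts = [
    {"who": "technician", "what": "replaced fuel pump",
     "where": "Dallas", "when": "2007-03-01"},
    {"who": "customer", "what": "reported brake noise",
     "where": "Chicago", "when": "2007-03-04"},
]

def facts_to_xml(facts):
    """Serialize each fact as a <fact> element with one child per field."""
    root = ET.Element("facts")
    for fact in facts:
        node = ET.SubElement(root, "fact")
        for field, value in fact.items():
            ET.SubElement(node, field).text = value
    return ET.tostring(root, encoding="unicode")

def load_facts(conn, facts):
    """Store the same facts as rows of a relational table."""
    conn.execute(
        "CREATE TABLE fact (who TEXT, what TEXT, location TEXT, occurred_on TEXT)")
    conn.executemany(
        "INSERT INTO fact VALUES (?, ?, ?, ?)",
        [(f["who"], f["what"], f["where"], f["when"]) for f in facts])

conn = sqlite3.connect(":memory:")
# Existing structured data (assumed): which region each dealer city belongs to.
conn.execute("CREATE TABLE dealer (city TEXT, region TEXT)")
conn.executemany("INSERT INTO dealer VALUES (?, ?)",
                 [("Dallas", "South"), ("Chicago", "Midwest")])
load_facts(conn, facts)

# "Fusing": join facts mined from text with the structured dealer table.
rows = conn.execute(
    "SELECT d.region, f.what FROM fact f "
    "JOIN dealer d ON d.city = f.location ORDER BY d.region").fetchall()
print(facts_to_xml(facts))
print(rows)
```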

Architecture

Attensity: Information Extraction Engine

• The foundation of all the applications

• Target extraction
  – When you know what you are looking for
  – Entity and event definitions
  – Creating rules and dictionaries specific to your particular domain
  – Graphical user interface that allows users to rapidly create definitions
• Exhaustive extraction
  – When you are trying to understand what is in your text and you don't know exactly what you are looking for
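
The contrast between the two modes can be sketched as follows. This is an illustrative toy, not Attensity's engine: the service notes, the issue dictionary and the use of word bigrams as a stand-in for "exhaustive" extraction are all assumptions.

```python
import re
from collections import Counter

# Hypothetical service notes.
NOTES = [
    "Customer reports brake noise after 2000 miles.",
    "Replaced fuel pump; customer also mentioned brake noise.",
    "Sunroof leak found during inspection.",
]

# Targeted extraction: a domain dictionary of issues we already know about.
ISSUE_DICTIONARY = {"brake noise": "BRAKES", "fuel pump": "FUEL_SYSTEM"}

def targeted(notes):
    """Count only the issues defined in the dictionary."""
    hits = Counter()
    for note in notes:
        for phrase, category in ISSUE_DICTIONARY.items():
            if phrase in note.lower():
                hits[category] += 1
    return hits

def exhaustive(notes):
    """Count every word bigram so that unanticipated issues can surface."""
    bigrams = Counter()
    for note in notes:
        words = re.findall(r"[a-z]+", note.lower())
        bigrams.update(zip(words, words[1:]))
    return bigrams

print(targeted(NOTES))
print(exhaustive(NOTES).most_common(3))
```

The dictionary never sees the sunroof leak, while the exhaustive pass records it along with everything else, at the cost of much noisier output.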

Attensity: Applications

• Discovery
  – Mining relations: uncover who, what, where, when, and why
• Analytics
  – Lets users drill down
  – Visualization tools to slice, dice and analyze important facts
  – Aggregations of facts
• Text search
  – Allows approximate matching of query words
  – Seamlessly combined with the text analysis
• Classify
  – Enables users to define document groups
• Alert
  – Provides timely visibility into frequent and emerging issues
  – E.g., product problems trigger emails or notifications
• http://www.attensity.com/www/products/applications.php
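
Approximate matching of query words, mentioned under Text search, can be illustrated with Python's standard `difflib` module. This is a sketch under assumptions: the vocabulary and the 0.8 similarity cutoff are invented, and the source does not say which matching technique Attensity uses.

```python
import difflib

# Hypothetical index vocabulary; a real system would draw it from the indexed text.
VOCABULARY = ["transmission", "transition", "brake", "battery", "warranty"]

def expand_query(query, vocabulary=VOCABULARY, cutoff=0.8):
    """Map each query word to its closest vocabulary terms, best match first."""
    return {
        word: difflib.get_close_matches(word.lower(), vocabulary, n=3, cutoff=cutoff)
        for word in query.split()
    }

# Misspelled query words are still mapped to indexed terms.
matches = expand_query("tranmission waranty")
print(matches)
```

A lower cutoff tolerates more spelling variation but also admits more unrelated terms.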

Examples Using Attensity

• Attensity boasts customers within Global 2000 organizations as well as government agencies

• Warranty Improvement
  – Reviewing warranty data contained in unstructured, text-based sources such as technician reports, customer surveys and dealer-provided information (to reduce warranty cost)
• Understand Voice of the Customer
  – Using both structured and unstructured data to detect product problems and gauge customer satisfaction
• Government Intelligence
  – Identify suspicious activities and relationships, detect threats to improve homeland security, and monitor the Internet to uncover illegal activities
  – Improve the reliability and supportability of a variety of military vehicles, weapons and components by converting unstructured data from service notes and repair logs into relational tables

Inxight

• http://www.inxight.com/
• Founded in 1997
• Spun out from Xerox PARC
• Based on 25+ years of research at Xerox PARC
• Able to “read” text in more than 30 languages
• Inxight takes information search, retrieval and analysis to an entirely new level.

Components

• Federated & Desktop Search
  – Supports hundreds of high-value information sources through a single, user-friendly interface.
  – Search results are automatically clustered on the fly by extracting and analyzing the most relevant people, places and events.
  – Provides alerts on new information (be alerted when competitors' websites change; monitor a single web page to track changes in a product’s price).
  – Supports different types of search functionality ("More Like This" searching).
  – Offers a Google Desktop Search extender.
• Text Analysis
  – Extracts the "who," "what," "where" and "when" in each document (more than 35 types of information).
  – Automated entity, concept, event and relation extraction, categorization and summarization

Components Cont’d

• Data Cleansing
  – Human experts can review and clean the extracted data
• Visualization
  – Relationships: StarTree
  – Trends: TableLens
  – Timelines: TimeWall
  – Several demos: http://www.inxight.com/products/vizserver/

Examples Using Inxight

• Customers: More than 350 Global 2000 customers

• Financial Data Analysis

• Crime Analysis

• Pharmaceutical Research

Anderson

• Designed especially for customer behavior

• Market Research
  – Collecting external business information (from customers, competitors, and the market)
  – Qualitative (answers the “why”) vs. quantitative (answers the “how much/many”)
  – Hybrid
• Business Intelligence
  – Collecting and analyzing internal business information
  – Focus on business transactions and communications
  – Sales data, supply logs, financial records

ClearForest

• http://www.clearforest.com/
• Tagging Engine
  – Information extraction
  – Document categorization
• Analytics
  – Improve Early Warning Visibility: Include text-based information to better assess and trigger organizational responses.
  – Discover Insights: Identify trends, patterns, and complex inter-document relationships within large text collections.
  – Create Links with Structured Data: Incorporation enhances quality of business intelligence by forging links not previously possible.
  – Become an Expert: Rapidly comprehend and synthesize complex issues before making key decisions.
• See the simple demo
  – Automatically identifies the people, companies, organizations, geographies and products on a web page

TextAnalyst

• Based on a semantic network
  – A list of the most important words from the text and the relations between them
• Functionalities
  – Textbase Navigation: concepts in the semantic network are linked to sentences, and sentences to documents.
  – Topic Structure: transforms the semantic network into a tree-like list of nested topics
  – Clustering: eliminates weak links in the topic structure
  – Summarization: uses the semantic network to score sentences.
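
As a rough illustration of these ideas, the sketch below builds a word co-occurrence network, prunes weak links, and ranks sentences by the weight of the words they contain. It is not TextAnalyst's algorithm, which the slides do not describe; the co-occurrence definition, the stopword list, the pruning threshold and the scoring formula are all assumptions.

```python
import re
from collections import Counter
from itertools import combinations

STOPWORDS = {"the", "a", "is", "of", "and", "in", "to", "from", "it"}

def content_words(sentence):
    """Lowercased words of a sentence, minus a small assumed stopword list."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower())
            if w not in STOPWORDS]

def build_network(sentences):
    """Link words that co-occur in a sentence; weight = number of co-occurrences."""
    edges = Counter()
    for s in sentences:
        for a, b in combinations(sorted(set(content_words(s))), 2):
            edges[(a, b)] += 1
    return edges

def prune(edges, min_weight=2):
    """Clustering step (simplified): drop links weaker than min_weight."""
    return {pair: w for pair, w in edges.items() if w >= min_weight}

def rank_sentences(sentences, edges):
    """Summarization step: score a sentence by the link weight of its words."""
    weight = Counter()
    for (a, b), w in edges.items():
        weight[a] += w
        weight[b] += w
    return sorted(sentences,
                  key=lambda s: sum(weight[w] for w in set(content_words(s))),
                  reverse=True)

SENTENCES = [
    "Text mining derives information from text.",
    "Text mining systems extract entities and relations.",
    "The weather is nice.",
]
network = build_network(SENTENCES)
print(prune(network))
print(rank_sentences(SENTENCES, network))
```

Sentences built from strongly connected words rise to the top, while an off-topic sentence with isolated words sinks to the bottom.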

Linguamatics

• Interactive information extraction (I2E)
  – Powerful queries (e.g., "John Smith is the chairman of which company?")
  – Graphical interface
  – Structured output
  – http://www.linguamatics.com/technology/ie/search_results.html
• Can take existing ontologies
  – Synonyms and canonicalisation
  – Class information: providing sub- and super-classes (in the life science domain, relationships between protein families can point to potential relationships between specific proteins)
  – Balancing precision and recall: by moving up/down the hierarchy
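
The use of an ontology for query expansion can be sketched as below. The mini-ontology, the synonym table and the `levels_up` parameter are hypothetical and are not part of I2E. The idea shown is only that a query term can be expanded to its synonyms and sub-classes, and that moving up the hierarchy before expanding broadens the query, which tends to raise recall at the expense of precision.

```python
# Hypothetical mini-ontology: child class -> parent class, plus synonyms.
PARENT = {
    "kinase": "enzyme",
    "protease": "enzyme",
    "tyrosine kinase": "kinase",
    "enzyme": "protein",
}
SYNONYMS = {"tyrosine kinase": ["TK"], "protease": ["peptidase"]}

def descendants(term):
    """Return the term followed by all of its sub-classes (depth first)."""
    found = [term]
    for child, parent in PARENT.items():
        if parent == term:
            found.extend(descendants(child))
    return found

def expand(term, levels_up=0):
    """Expand a query term to sub-classes and synonyms.

    levels_up > 0 first climbs to a super-class, giving a broader query.
    """
    for _ in range(levels_up):
        term = PARENT.get(term, term)
    terms = []
    for t in descendants(term):
        terms.append(t)
        terms.extend(SYNONYMS.get(t, []))
    return terms

print(expand("kinase"))               # narrower query
print(expand("kinase", levels_up=1))  # broader query via the super-class
```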

Commonalities

• Information extraction is very important for commercial text mining systems
• Both structured and unstructured data are considered and combined for analysis
• Alerts are considered very important
• Search and mining are highly integrated

An IE Toolkit: GATE

• General Architecture for Text Engineering
  – University of Sheffield, since 1995
  – More than 10 years old
  – Free open source software
  – Implemented in Java
  – Used in language analysis contexts including Information Extraction in English, Greek, Spanish, Swedish, German, Italian and French
  – Easily pluggable and used in a lot of other projects
  – Also provides an interface as a standalone application
  – Pretty slow and memory-consuming

IE in GATE

• Named ANNIE: A Nearly-New Information Extraction system (show the PDF file for some examples)
• Tokeniser
• Gazetteer
• Sentence Splitter
• Part of Speech Tagger
• Semantic Tagger
• Orthographic Coreference (OrthoMatcher)
• Pronominal Coreference

Thanks

