Text Mining 101
Manohar Swamynathan, August 2012

Transcript

Agenda

o Text Mining Process Steps
o Calculate Term Weight
o Similarity Distance Measure
o Common Text Mining Techniques
o Appendix
  - Required R packages for Text Mining
  - Implemented Examples
    o R code for obtaining and analyzing tweets
    o RTextTools – Ensemble Classification
o References


Step 1 – Data Assemble

[Diagram: common text data sources feeding the text corpus – flat files, social media, corporate databases.]

Step 2 – Data Processing

Python packages – textmining, nltk; R packages – tm, qdap, openNLP

Data Processing Step: Brief Description

Explore corpus through Exploratory Data Analysis: Understand the types of variables, their functions, permissible values, and so on. Some formats, including HTML and XML, contain tags and other data structures that provide additional metadata.

Convert text to lowercase: This avoids distinguishing between words simply on case.

Remove numbers (if required): Numbers may or may not be relevant to our analyses.

Remove punctuation: Punctuation can provide grammatical context which supports understanding. For initial analyses we often ignore punctuation; later, punctuation can support the extraction of meaning.

Remove English stop words: Stop words are common words found in a language. Words like "for", "of", and "are" are common stop words.

Remove own stop words (if required): Along with English stop words, we could instead, or in addition, remove our own stop words. The choice of own stop words might depend on the domain of discourse and might not become apparent until we've done some analysis.

Strip whitespace: Eliminate extra whitespace, i.e. any space beyond the single spaces that occur between words in a sentence.

Stemming: Stemming uses an algorithm that removes common word endings from English words, such as "es", "ed", and "s". Example: "computer" & "computers" become "comput".

Lemmatization: Transform to the dictionary base form, i.e. "produce" & "produced" become "produce".

Sparse terms: We are often not interested in infrequent terms in our documents. Such "sparse" terms should be removed from the document-term matrix.

Document-term matrix: A document-term matrix is simply a matrix with documents as the rows, terms as the columns, and a count of the frequency of words as the cells of the matrix.
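As a concrete illustration, here is a minimal sketch of this pipeline using the tm package (with SnowballC for stemming); the two example sentences and the "own" stop word are made up for demonstration:

library(tm)        # text mining framework
library(SnowballC) # stemmer backing stemDocument

docs <- c("Statistics skills and programming skills are equally important!",
          "I like reading books and travelling in 2012.")
corpus <- Corpus(VectorSource(docs))

corpus <- tm_map(corpus, content_transformer(tolower))      # lowercase
corpus <- tm_map(corpus, removeNumbers)                     # numbers (if required)
corpus <- tm_map(corpus, removePunctuation)                 # punctuation
corpus <- tm_map(corpus, removeWords, stopwords("english")) # English stop words
corpus <- tm_map(corpus, removeWords, c("like"))            # own stop words
corpus <- tm_map(corpus, stripWhitespace)                   # extra whitespace
corpus <- tm_map(corpus, stemDocument)                      # stemming

dtm <- DocumentTermMatrix(corpus)   # documents as rows, terms as columns
dtm <- removeSparseTerms(dtm, 0.99) # drop the sparsest terms
inspect(dtm)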

Step 3 – Data Visualization

Frequency chart, word cloud, correlation plot

Step 4 – Models

Clustering, classification, sentiment analysis

Calculate Term Weight (TF-IDF)

Term Frequency – how frequently a term appears:
TF(t) = (Number of times term t appears in a document) / (Total number of terms in the document)

Inverse Document Frequency – how important a term is:
Document Frequency DF = d (number of documents containing a given term) / D (the size of the collection of documents)

To normalize, take log(d/D); but since often D > d, log(d/D) will give a negative value, so we invert the ratio inside the log expression. Essentially we are compressing the scale of values so that very large or very small quantities are smoothly compared:
IDF(t) = log(Total number of documents / Number of documents with term t in it)

Example:
- Assume we have overall 10 million documents and the word "spindle" appears in one thousand of these.
- Consider 2 documents containing 100 total words each, each containing the term "spindle" some number of times.

Document | spindle frequency | Total words | TF | IDF | TF × IDF
1 | 3 | 100 | 3/100 = 0.03 | log(10,000,000/1,000) = 4 | 0.03 × 4 = 0.12
2 | 30 | 100 | 30/100 = 0.3 | log(10,000,000/1,000) = 4 | 0.3 × 4 = 1.2
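The spindle example can be reproduced in a few lines of base R (a sketch; base-10 log matches the slide's arithmetic):

total_docs  <- 10000000   # overall collection size
docs_with_t <- 1000       # documents containing "spindle"
idf <- log10(total_docs / docs_with_t)    # = 4

tf <- c(doc1 = 3 / 100, doc2 = 30 / 100)  # term count over document length
tf * idf                                  # TF-IDF weights: 0.12 and 1.2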

Similarity Distance Measure

Example:
Text 1: statistics skills and programming skills are equally important for analytics
Text 2: statistics skills and domain knowledge are important for analytics
Text 3: I like reading books and travelling

The three term-frequency vectors are:
T1 = (1, 2, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0)
T2 = (1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0)
T3 = (0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

Degree of Similarity (T1 & T2) = (T1 . T2) / (sqrt(sum(T1^2)) * sqrt(sum(T2^2))) = 0.77
Degree of Similarity (T1 & T3) = (T1 . T3) / (sqrt(sum(T1^2)) * sqrt(sum(T3^2))) = 0.12

Additional Reading: Here are detailed papers comparing the efficiency of different distance measures for text documents:
1) http://home.iitk.ac.in/~spranjal/cs671/project/report.pdf
2) http://users.dsic.upv.es/~prosso/resources/BarronEtAl_ICON09.pdf

Term:   statistics skills and programming knowledge are equally important for analytics domain I like reading books travelling
Text 1: 1 2 1 1 0 1 1 1 1 1 0 0 0 0 0 0
Text 2: 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 0
Text 3: 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1

[Figure: Euclidean distance vs. cosine angle between two document vectors X and Y.]

- The cosine value will be a number between 0 and 1.
- The smaller the angle, the bigger the cosine value/similarity.
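The cosine figures above can be verified with a short base R sketch:

# Cosine similarity between the term-frequency vectors above
T1 <- c(1,2,1,1,0,1,1,1,1,1,0,0,0,0,0,0)
T2 <- c(1,1,1,0,1,1,0,1,1,1,1,0,0,0,0,0)
T3 <- c(0,0,1,0,0,0,0,0,0,0,0,1,1,1,1,1)

cosine <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

round(cosine(T1, T2), 2)  # 0.77 – overlapping vocabulary
round(cosine(T1, T3), 2)  # 0.12 – little shared vocabulary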

Common Text Mining Techniques

• N-grams
• Shallow Natural Language Processing
• Deep Natural Language Processing

Example: "defense attorney for liberty and montecito"
1-gram: defense; attorney; for; liberty; and; montecito
2-gram: defense attorney; attorney for; for liberty; liberty and; and montecito
3-gram: defense attorney for; attorney for liberty; for liberty and; liberty and montecito
4-gram: defense attorney for liberty; attorney for liberty and; for liberty and montecito
5-gram: defense attorney for liberty and; attorney for liberty and montecito

n-gram Definition

• An n-gram is a contiguous sequence of n items from a given sequence of text.
• The items can be letters, words, syllables, or base pairs according to the application.

Application
- Probabilistic language models for predicting the next item in a sequence, in the form of an (n − 1)-order Markov model.
- Widely used in probability, communication theory, computational linguistics, and biological sequence analysis.

Advantage
- Relatively simple.
- By simply increasing n, the model can be used to store more context.

Disadvantage
- Semantic value of the item is not considered.
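For illustration, a small base-R helper (a hypothetical function, not from the original deck) that generates word n-grams:

# Word n-grams from a sentence (base R sketch)
ngrams <- function(text, n) {
  w <- strsplit(tolower(text), "\\s+")[[1]]
  if (length(w) < n) return(character(0))
  sapply(seq_len(length(w) - n + 1),
         function(i) paste(w[i:(i + n - 1)], collapse = " "))
}

ngrams("defense attorney for liberty and montecito", 2)
# "defense attorney" "attorney for" "for liberty" "liberty and" "and montecito"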

Shallow NLP Technique

Definition
- Assign a syntactic label (noun, verb, etc.) to a chunk.
- Knowledge extraction from text through a semantic/syntactic analysis approach.

Application
- Taxonomy extraction (predefined terms and entities).
  Entities: people, organizations, locations, times, dates, prices, genes, proteins, diseases, medicines.
- Concept extraction (main idea or theme).

Advantage
- Less noisy than n-grams.

Disadvantage
- Does not specify the role of items in the main sentence.

Shallow NLP Technique – Example

Sentence: "The driver from Europe crashed the car with the white bumper"

Convert to lowercase & PoS tag:

1-gram: the, driver, from, europe, crashed, the, car, with, the, white, bumper
PoS:    DT, NN, IN, NNP, VBD, DT, NN, IN, DT, JJ, NN

(DT – determiner; NN – noun, singular or mass; IN – preposition or subordinating conjunction; NNP – proper noun, singular; VBD – verb, past tense; JJ – adjective)

Concept extraction: remove stop words, retain only nouns & verbs.

Bi-grams with nouns & verbs retained:

Bi-gram          PoS
car white        NN JJ
crashed car      VBD NN
driver europe    NN NNP
europe crashed   NNP VBD
white bumper     JJ NN

3-grams with nouns & verbs retained:

3-gram                  PoS
car white bumper        NN JJ NN
crashed car white       VBD NN JJ
driver europe crashed   NN NNP VBD
europe crashed car      NNP VBD NN

Conclusion:
1-gram: reduced noise, however no clear context.
Bi-gram & 3-gram: increased context, however there is information loss.
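The lowercase-and-tag step can be sketched with the openNLP package listed earlier (this assumes the openNLPmodels.en model package is installed; tags on the lowercased text may differ slightly from the slide, e.g. "europe" may lose its NNP tag):

library(NLP)
library(openNLP)  # also needs the openNLPmodels.en model package

s <- as.String("the driver from europe crashed the car with the white bumper")

# Tokenize into sentences and words, then attach part-of-speech tags
a <- annotate(s, list(Maxent_Sent_Token_Annotator(),
                      Maxent_Word_Token_Annotator()))
a <- annotate(s, Maxent_POS_Tag_Annotator(), a)

w <- subset(a, type == "word")
data.frame(token = s[w], POS = sapply(w$features, `[[`, "POS"))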

Deep NLP Technique

Definition
- Extension to shallow NLP.
- Detected relationships are expressed as complex constructions to retain the context.
- Example relationships: located in, employed by, part of, married to.

Applications
- Develop features and representations appropriate for complex interpretation tasks.
- Fraud detection.
- Life science: prediction activities based on complex RNA sequences.

Example
The sentence above ("The driver from Europe crashed the car with the white bumper") can be represented using triples (Subject, Predicate, [Modifier], Object) without losing the context.

Triples:
driver crash car
driver crash with bumper
driver be from Europe
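Triple extraction (see the triplet-extraction paper in the references) typically starts from a full syntactic parse. A hedged sketch using openNLP's Parse_Annotator, from which subject/predicate/object nodes can be read off (again assumes openNLPmodels.en is installed):

library(NLP)
library(openNLP)  # Parse_Annotator needs the openNLPmodels.en package

s <- as.String("The driver from Europe crashed the car with the white bumper")
a <- annotate(s, list(Maxent_Sent_Token_Annotator(),
                      Maxent_Word_Token_Annotator()))
p <- annotate(s, Parse_Annotator(), a)

# Penn-style bracketed constituency tree for the sentence; triples are
# extracted by walking the NP (subject/object) and VP (predicate) nodes.
sapply(subset(p, type == "sentence")$features, `[[`, "parse")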

Techniques – Summary

Technique: N-gram
General steps: convert to lowercase; remove punctuation; remove special characters.
Pros: simple technique.
Cons: extremely noisy.

Technique: Shallow NLP
General steps:
- POS tagging
- Lemmatization, i.e. transform to dictionary base form: "produce" & "produced" become "produce"
- Stemming, i.e. transform to root word: 1) "computer" & "computers" become "comput"; 2) "product", "produce" & "produced" become "produc"
- Chunking, i.e. identify the phrasal constituents in a sentence (noun/verb phrases etc.) and split the sentence into chunks of semantically related words
Pros: less noisy than n-grams.
Cons: computationally expensive for analyzing the structure of texts; does not specify the internal structure or the role of words in the sentence.

Technique: Deep NLP
General steps: generate syntactic relationships between each pair of words; extract subject, predicate, negation, object, and named entities to form triples.
Pros: context of the sentence is retained.
Cons: sentence-level analysis is too structured.

Appendix

R – Text Mining Process Overview

Step 1 – Data Assemble: build the corpus from the web, documents, and databases.

Step 2 – Data Processing:
2A - Explore corpus through EDA
2B - Convert text to lowercase
2C - a) Remove numbers (if required), b) remove punctuation, c) remove English stop words, d) remove own stop words (if required), e) strip whitespace, f) lemmatization/stemming, g) remove sparse terms
2D - Create document-term matrix

Step 3 – Visualization: frequency chart, word cloud, correlation plot.

Step 4 – Build Model(s): clustering, classification, sentiment analysis.

R – Required Packages for Text Mining

Package Name | Category | Description
tm | Text Mining | A framework for text mining applications
topicmodels | Topic Modelling | Fit topic models with Latent Dirichlet Allocation (LDA) and Correlated Topic Models (CTM)
wordcloud | Visualization | Plot a cloud comparing the frequencies of words across documents
lda | Topic Modelling | Fit topic models with Latent Dirichlet Allocation
wordnet | Text Mining | Database of English commonly used in linguistics and text mining
RTextTools | Text Mining | Automatic text classification via supervised learning
qdap | Sentiment Analysis | Transcript analysis, text mining, and natural language processing
tm.plugin.dc | Text Mining | A plug-in for package tm to support distributed text mining
tm.plugin.mail | Text Mining | A plug-in for package tm to handle mail
textir | Text Mining | A suite of tools for inference about text documents and associated sentiment
tau | Text Mining | Utilities for text analysis
textcat | Text Mining | N-gram based text categorization
SnowballC | Text Mining | Word stemmer
twitteR | Text Mining | Provides an interface to the Twitter web API
ROAuth | Text Mining | Allows users to authenticate to the server of their choice (e.g. Twitter)
RColorBrewer | Visualization | Provides palettes for drawing nice maps shaded according to a variable
ggplot2 | Visualization | Graphing package implemented on top of R, inspired by Leland Wilkinson's seminal work The Grammar of Graphics
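As a small illustration of the visualization step, a sketch that draws a word cloud from the document-term matrix built in the tm example earlier (dtm is assumed to come from that sketch):

library(wordcloud)     # word cloud plotting
library(RColorBrewer)  # color palettes

freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)  # corpus-wide term counts
wordcloud(words = names(freq), freq = freq, min.freq = 1,
          colors = brewer.pal(8, "Dark2"), random.order = FALSE)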

Example 1 – Obtaining and analyzing tweets

Objective: R code for analyzing tweets relating to AAA2011 (text mining, topic modelling, network analysis, clustering, and sentiment analysis).

What does the code do? The code details ten steps in the analysis and visualization of the tweets:

1. Acquiring the raw Twitter data
2. Calculating some basic statistics with the raw Twitter data
3. Calculating some basic retweet statistics
4. Calculating the ratio of retweets to tweets
5. Calculating some basic statistics about URLs in tweets
6. Basic text mining for token frequency and token association analysis (word cloud)
7. Calculating sentiment scores of tweets, including on subsets containing tokens of interest
8. Hierarchical clustering of tokens based on multiscale bootstrap resampling
9. Topic modelling the tweet corpus using latent Dirichlet allocation
10. Network analysis of tweeters based on retweets

Code Source: Code was taken from the following link and tweaked/extended where required to ensure it runs fine: https://github.com/benmarwick/AAA2011-Tweets

How to run or test the code: from the Word doc, copy the R code in the given sequence (highlighted in yellow) and paste it into your R console.
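The acquisition step (step 1) looks roughly like the following sketch with the twitteR and ROAuth packages listed above; the search string and the OAuth setup are placeholders:

library(twitteR)  # Twitter web API interface
library(ROAuth)   # OAuth authentication

# OAuth setup omitted: register an app with Twitter and authenticate first.
tweets <- searchTwitter("#AAA2011", n = 1500)  # acquire the raw Twitter data
df <- twListToDF(tweets)                       # one row per tweet
head(df[, c("screenName", "created", "text")])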

Example 2 – RTextTools: supervised learning for text classification using an ensemble

RTextTools is a free, open-source R machine learning package for automatic text classification.

The package includes nine algorithms for ensemble classification (SVM, SLDA, boosting, bagging, random forests, glmnet, decision trees, neural networks, and maximum entropy), comprehensive analytics, and thorough documentation.

Users may use n-fold cross-validation to calculate the accuracy of each algorithm on their dataset and determine which algorithms to use in their ensemble.

(Using a four-ensemble agreement approach, Collingwood and Wilkerson (2012) found that when four of their algorithms agree on the label of a textual document, the machine label matches the human label over 90% of the time. The rate is just 45% when only two algorithms agree on the text label.)

Code Source: The code is readily available for download and usage from the following link: https://github.com/timjurka/RTextTools. The code can be run without modification for testing; however, it is set up such that changes can be incorporated easily based on our requirements.

Additional Reading: http://www.rtexttools.com/about-the-project.html
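A minimal sketch of the package's workflow, following the pattern of its documentation (the bundled USCongress sample data and the train/test split are illustrative):

library(RTextTools)
data(USCongress)  # sample data shipped with the package: labeled bill titles

dtm <- create_matrix(USCongress$text, language = "english",
                     removeNumbers = TRUE, stemWords = TRUE)
container <- create_container(dtm, USCongress$major,
                              trainSize = 1:4000, testSize = 4001:4449,
                              virgin = FALSE)

# Train a few of the nine algorithms and classify the held-out documents
models  <- train_models(container, algorithms = c("SVM", "MAXENT", "TREE"))
results <- classify_models(container, models)

# Analytics include per-algorithm accuracy and ensemble agreement coverage
analytics <- create_analytics(container, results)
summary(analytics)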

References

Penn Treebank PoS tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
Stanford InfoLab – Finding Similar Items: http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
Triplet Extraction from Sentences: http://ailab.ijs.si/delia_rusu/Papers/is_2007.pdf
Shallow and Deep NLP Processing for Ontology Learning: a Quick Overview: http://azouaq.athabascau.ca/publications/Conferences%20Workshops%20Books/%5BBC2%5D_KDW_2010.pdf


crashed car white VBD NN JJ

driver europe crashed NN NNP VBD

europe crashed car NNP VBD NN

- 3-gram with Nounrsquos amp Verbrsquos retained

Conclusion

1-gram Reduced noise however no clear context Bi-gram amp 3-gram Increased context however there

is a information loss

Shallow NLP Technique

12

Stop words NounVerb

Definition

- Extension to the shallow NLP

- Detected relationships are expressed as complex construction to retain the context

- Example relationships Located in employed by part of married to

Applications

- Develop features and representations appropriate for complex interpretation tasks

- Fraud detection - Life science prediction activities based on complex RNA-Sequence

Deep NLP technique

Example

The above sentence can be represented using triples (Subject Predicate [Modifier] Object) without loosing the context

Triples

driver crash car

driver crash with bumper

driver be from Europe 13

Technique General Steps Pros Cons

N-Gram

- Convert to lowercase

- Remove punctuations - Remove special characters Simple technique Extremely noisy

Shallow NLP

technique

- POS tagging

- Lemmatization ie transform to

dictionary base form ie produce amp produced become produce

- Stemming ie transform to root word

ie 1) computer amp computers become

comput

2) product produce amp produced

become produc

- Chunking ie identify the phrasal

constituents in a sentence including

nounverb phrase etc and splits the

sentence into chunks of semantically

related words

Less noisy than N-

Grams

Computationally

expensive

solution for

analyzing the

structure of texts

Does not specify

the internal

structure or the

role of words in

the sentence

Deep NLP

technique

- Generate syntactic relationship

between each pair of words

- Extract subject predicate nagation

objecct and named entity to form triples

Context of the

sentence is

retained

Sentence level

analysis is too

structured

Techniques - Summary

14

Appendix

15

2A - Explore Corpus through EDA 2B - Convert text to lowercase 2C - Remove

a) Numbers(if required) b) Punctuations c) English stop words d) Own stop words(if

required)

e) Strip whitespace f) LemmatizationStemming g) Sparse terms

2D - Create document term matrix

Step 3 - Visualization

Corpus

Web

Documents

Step 1 ndash Data Assemble

Step 2 ndash Data Processing

Step 4 ndash Build Model(s)

Clustering

Classification

Sentiment Analysis

Fre

qu

en

cy

Ch

art

W

ord

Clo

ud

C

orr

ela

tio

n P

lot

R - Text Mining Process Overview

16

DB

Package Name

Category Description

tm Text Mining A framework for text mining applications

topicmodels Topic Modelling Fit topic models with Latent Dirichlet Allocation (LDA) and Comparative Text Mining (CTM)

wordcloud Visualization Plot a cloud comparing the frequencies of words across documents

lda Topic Modelling Fit topic models with Latent Dirichlet Allocation

wordnet Text Mining Database of English which is commonly used in linguistics and text mining

RTextTools Text Mining Automatic text classification via supervised learning

qdap Sentiment analysis

Transcript analysis text mining and natural language processing

tmplugindc Text Mining A plug-in for package tm to support distributed text mining

tmpluginmail Text Mining A plug-in for package tm to handle mail

textir Text Mining A suite of tools for inference about text documents and associated sentiment

tau Text Mining Utilities for text analysis

textcat Text Mining N-gram based text categorization

SnowballC Text Mining Word stemmer

twitteR Text Mining Provides an interface to the Twitter web API

ROAuth Text Mining Allows users to authenticate to the server of their choice (like Twitter)

RColorBrewer Visualization The packages provides palettes for drawing nice maps shaded according to a variable

ggplot2 Visualization

Graphing package implemented on top of the R statistical package Inspired by the Grammar of Graphics seminal work of Leland Wilkinson

R ndash Required packages for Text Mining

17

Example 1 - Obtaining and analyzing tweets

Objective R code for analyzing tweets relating to AAA2011 (text mining topic modelling network analysis clustering and sentiment analysis)

What does the code do

The code details ten steps in the analysis and visualization of the tweets

Acquiring the raw Twitter data

Calculating some basic statistics with the raw Twitter data

Calculating some basic retweet statistics

Calculating the ratio of retweets to tweets

Calculating some basic statistics about URLs in tweets

Basic text mining for token frequency and token association analysis (word cloud)

Calculating sentiment scores of tweets including on subsets containing tokens of interest

Hierarchical clustering of tokens based on multi scale bootstrap resampling

Topic modelling the tweet corpus using latent Dirichlet allocation

Network analysis of tweeters based on retweets

Code Source Code was taken from following link and tweakedadded additional bits where required to ensure code runs fine httpsgithubcombenmarwickAAA2011-Tweets

How to Run or Test the code - From the word doc copy the R code in the given sequence highlighted in yellow color and paste on your R console

18

RTextTools ndash Example for supervised Learning for Text Classification using Ensemble

RTextTools is a free open source R machine learning package for automatic text classification

The package includes nine algorithms for ensemble classification (svm slda boosting bagging random forests glmnet decision trees neural networks and maximum entropy) comprehensive analytics and thorough documentation

Users may use n-fold cross validation to calculate the accuracy of each algorithm on their dataset and determine which algorithms to use in their ensemble

(Using a four-ensemble agreement approach Collingwood and Wilkerson (2012) found that when four of their algorithms agree on the label of a textual document the machine label matches the human label over 90 of the time The rate is just 45 when only two algorithms agree on the text label)

Code Source The codes is readily available for download and usage from the following link httpsgithubcomtimjurkaRTextTools The code can be implemented without modification for testing however itrsquos set up such that changes can be incorporated easily based on our requirement

Additional Reading httpwwwrtexttoolscomabout-the-projecthtml

19

Example 2 - RTextTools

Penn Treebank - httpswwwlingupenneducoursesFall_2003ling001penn_treebank_poshtml

Stanford info lab - Finding Similar Items httpinfolabstanfordedu~ullmanmmdsch3pdf

TRIPLET EXTRACTION FROM SENTENCES URL - httpailabijssidelia_rusuPapersis_2007pdf

Shallow and Deep NLP Processing for ontology learning a Quick Overview httpazouaqathabascaucapublicationsConferences20Workshops20Books5BBC25D_KDW_2010pdf

References

20

Step 4 ndash Models

Clustering

Classification

Sentiment Analysis

Document

Term Frequency - How frequently term appears Term Frequency TF(t) = (Number of times term t appears in a document) (Total number of terms in the document)

Example

Calculate Term Weight (TF IDF)

Inverse Document Frequency - How important a term is

Document Frequency DF = d (number of documents containing a given term) D (the size of the collection of documents)

To normalize take log(dD) but often D gt d and log(dD) will give negative value So invert the ratio inside log expression Essentially we are compressing the scale of values so that very large or very small quantities are smoothly compared

Inverse Document Frequency IDF(t) = log(Total number of documents Number of documents with term t in it)

7

- Assume we have overall 10 million documents and the word spindle appears in one thousand of these - Consider 2 documents containing 100 total words each and contains term spindle x number of times

Document spindle ndash Frequency Total Words TF IDF TF IDF

1 3 100 3100 = 003 log(100000001000) = 4 003 4 = 012

2 30 100 30100 = 3 log(100000001000) = 4 03 4 = 12

Similarity Distance Measure

Example Text 1 statistics skills and programming skills are equally important for analytics

Text 2 statistics skills and domain knowledge are important for analytics

Text 3 I like reading books and travelling

The three vectors are

T1 = (1211011111000000)

T2 = (1110110111100000)

T3 = (0010000000011111)

Degree of Similarity (T1 amp T2) = (T1 T2) (sqrt(sum(T1^2)) sqrt(sum(T2^2))) = 77

Degree of Similarity (T1 amp T3) = (T1 T3) (sqrt(sum(T1^2)) sqrt(sum(T3^2))) = 12

Additional Reading Here is a detailed paper on comparing the efficiency of different distance measures for text documents

URL ndash 1) httphomeiitkacin~spranjalcs671projectreportpdf

2) httpusersdsicupves~prossoresourcesBarronEtAl_ICON09pdf

statistics skills and programming knowledge are equally important for analytics domain I like reading books travelling

Text 1 1 2 1 1 0 1 1 1 1 1 0 0 0 0 0 0

Text 2 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 0

Text 3 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1

X

Y

Euclidean

Cosine

- cosine value will be a number between 0 and 1 - Smaller the angel bigger the cosine valuesimilarity

8

Common Text Mining Techniques

bull N-grams

bull Shallow Natural Language Processing

bull Deep Natural Language Processing

Example defense attorney for liberty and

montecitordquo 1-gram defense attorney for liberty and montecito 2-gram defense attorney for liberty and montecito attorney for liberty and attorney for 3-gram defense attorney for liberty and montecito attorney for liberty for liberty and liberty and montecito 4-gram defense attorney for liberty attorney for liberty and for liberty and montecito 5-gram defense attorney for liberty and montecito attorney for liberty and montecito

Application

Probabilistic language model for predicting the

next item in a sequence in the form of a (n minus 1)

Widely used in probability communication

theory computational linguistics biological

sequence analysis

Advantage

Relatively simple

Simply increasing n model can be used to store

more context

Disadvantage

Semantic value of the item is not considered

n-gram Definition

bull n-gram is a contiguous sequence of n items from a given sequence of text

bull The items can be letters words syllables or base pairs according to the application

10

Application

- Taxonomy extraction (predefined terms and entities)

- Entities People organizations locations times dates prices genes proteins diseases

medicines

- Concept extraction (main idea or a theme)

Advantage

- Less noisy than n-grams

Disadvantage

- Does not specify role of items in the main sentence

Shallow NLP Technique Definition

- Assign a syntactic label (noun verb etc) to a chunk

- Knowledge extraction from text through semanticsyntactic analysis approach

11

Sentence - ldquoThe driver from Europe crashed the car with the white bumperrdquo

1-gram

the

driver

from

europe

crashed

the

car

with

the

white

bumper

Part of Speech

DT ndash Determiner

NN - Noun singular or mass

IN - Preposition or subordinating conjunction

NNP - Proper Noun singular

VBD - Verb past tense

DT ndash Determiner

NN - Noun singular or mass

IN - Preposition or subordinating conjunction

DT ndash Determiner

JJ ndash Adjective

NN - Noun singular or mass

- Convert to lowercase amp PoS tag

Concept Extraction

- Remove Stop words

- Retain only Nounrsquos amp Verbrsquos

- Bi-gram with Nounrsquos amp Verbrsquos retained

Bi-gram PoS

car white NN JJ

crashed car VBD NN

driver europe NN NNP

europe crashed NNP VBD

white bumper JJ NN

3-gram PoS

car white bumper NN JJ NN

crashed car white VBD NN JJ

driver europe crashed NN NNP VBD

europe crashed car NNP VBD NN

- 3-gram with Nounrsquos amp Verbrsquos retained

Conclusion

1-gram Reduced noise however no clear context Bi-gram amp 3-gram Increased context however there

is a information loss

Shallow NLP Technique

12

Stop words NounVerb

Definition

- Extension to the shallow NLP

- Detected relationships are expressed as complex construction to retain the context

- Example relationships Located in employed by part of married to

Applications

- Develop features and representations appropriate for complex interpretation tasks

- Fraud detection - Life science prediction activities based on complex RNA-Sequence

Deep NLP technique

Example

The above sentence can be represented using triples (Subject Predicate [Modifier] Object) without loosing the context

Triples

driver crash car

driver crash with bumper

driver be from Europe 13

Technique General Steps Pros Cons

N-Gram

- Convert to lowercase

- Remove punctuations - Remove special characters Simple technique Extremely noisy

Shallow NLP

technique

- POS tagging

- Lemmatization ie transform to

dictionary base form ie produce amp produced become produce

- Stemming ie transform to root word

ie 1) computer amp computers become

comput

2) product produce amp produced

become produc

- Chunking ie identify the phrasal

constituents in a sentence including

nounverb phrase etc and splits the

sentence into chunks of semantically

related words

Less noisy than N-

Grams

Computationally

expensive

solution for

analyzing the

structure of texts

Does not specify

the internal

structure or the

role of words in

the sentence

Deep NLP

technique

- Generate syntactic relationship

between each pair of words

- Extract subject predicate nagation

objecct and named entity to form triples

Context of the

sentence is

retained

Sentence level

analysis is too

structured

Techniques - Summary

14

Appendix

15

2A - Explore Corpus through EDA 2B - Convert text to lowercase 2C - Remove

a) Numbers(if required) b) Punctuations c) English stop words d) Own stop words(if

required)

e) Strip whitespace f) LemmatizationStemming g) Sparse terms

2D - Create document term matrix

Step 3 - Visualization

Corpus

Web

Documents

Step 1 ndash Data Assemble

Step 2 ndash Data Processing

Step 4 ndash Build Model(s)

Clustering

Classification

Sentiment Analysis

Fre

qu

en

cy

Ch

art

W

ord

Clo

ud

C

orr

ela

tio

n P

lot

R - Text Mining Process Overview

16

DB

Package Name

Category Description

tm Text Mining A framework for text mining applications

topicmodels Topic Modelling Fit topic models with Latent Dirichlet Allocation (LDA) and Comparative Text Mining (CTM)

wordcloud Visualization Plot a cloud comparing the frequencies of words across documents

lda Topic Modelling Fit topic models with Latent Dirichlet Allocation

wordnet Text Mining Database of English which is commonly used in linguistics and text mining

RTextTools Text Mining Automatic text classification via supervised learning

qdap Sentiment analysis

Transcript analysis text mining and natural language processing

tmplugindc Text Mining A plug-in for package tm to support distributed text mining

tmpluginmail Text Mining A plug-in for package tm to handle mail

textir Text Mining A suite of tools for inference about text documents and associated sentiment

tau Text Mining Utilities for text analysis

textcat Text Mining N-gram based text categorization

SnowballC Text Mining Word stemmer

twitteR Text Mining Provides an interface to the Twitter web API

ROAuth Text Mining Allows users to authenticate to the server of their choice (like Twitter)

RColorBrewer Visualization The packages provides palettes for drawing nice maps shaded according to a variable

ggplot2 Visualization

Graphing package implemented on top of the R statistical package Inspired by the Grammar of Graphics seminal work of Leland Wilkinson

R ndash Required packages for Text Mining

17

Example 1 - Obtaining and analyzing tweets

Objective R code for analyzing tweets relating to AAA2011 (text mining topic modelling network analysis clustering and sentiment analysis)

What does the code do

The code details ten steps in the analysis and visualization of the tweets

Acquiring the raw Twitter data

Calculating some basic statistics with the raw Twitter data

Calculating some basic retweet statistics

Calculating the ratio of retweets to tweets

Calculating some basic statistics about URLs in tweets

Basic text mining for token frequency and token association analysis (word cloud)

Calculating sentiment scores of tweets including on subsets containing tokens of interest

Hierarchical clustering of tokens based on multi scale bootstrap resampling

Topic modelling the tweet corpus using latent Dirichlet allocation

Network analysis of tweeters based on retweets

Code Source Code was taken from following link and tweakedadded additional bits where required to ensure code runs fine httpsgithubcombenmarwickAAA2011-Tweets

How to Run or Test the code - From the word doc copy the R code in the given sequence highlighted in yellow color and paste on your R console

18

RTextTools ndash Example for supervised Learning for Text Classification using Ensemble

RTextTools is a free open source R machine learning package for automatic text classification

The package includes nine algorithms for ensemble classification (svm slda boosting bagging random forests glmnet decision trees neural networks and maximum entropy) comprehensive analytics and thorough documentation

Users may use n-fold cross validation to calculate the accuracy of each algorithm on their dataset and determine which algorithms to use in their ensemble

(Using a four-ensemble agreement approach Collingwood and Wilkerson (2012) found that when four of their algorithms agree on the label of a textual document the machine label matches the human label over 90 of the time The rate is just 45 when only two algorithms agree on the text label)

Code Source The codes is readily available for download and usage from the following link httpsgithubcomtimjurkaRTextTools The code can be implemented without modification for testing however itrsquos set up such that changes can be incorporated easily based on our requirement

Additional Reading httpwwwrtexttoolscomabout-the-projecthtml

19

Example 2 - RTextTools

Penn Treebank - httpswwwlingupenneducoursesFall_2003ling001penn_treebank_poshtml

Stanford info lab - Finding Similar Items httpinfolabstanfordedu~ullmanmmdsch3pdf

TRIPLET EXTRACTION FROM SENTENCES URL - httpailabijssidelia_rusuPapersis_2007pdf

Shallow and Deep NLP Processing for ontology learning a Quick Overview httpazouaqathabascaucapublicationsConferences20Workshops20Books5BBC25D_KDW_2010pdf

References

20

Term Frequency - How frequently term appears Term Frequency TF(t) = (Number of times term t appears in a document) (Total number of terms in the document)

Example

Calculate Term Weight (TF IDF)

Inverse Document Frequency - How important a term is

Document Frequency DF = d (number of documents containing a given term) D (the size of the collection of documents)

To normalize take log(dD) but often D gt d and log(dD) will give negative value So invert the ratio inside log expression Essentially we are compressing the scale of values so that very large or very small quantities are smoothly compared

Inverse Document Frequency IDF(t) = log(Total number of documents Number of documents with term t in it)

7

- Assume we have overall 10 million documents and the word spindle appears in one thousand of these - Consider 2 documents containing 100 total words each and contains term spindle x number of times

Document spindle ndash Frequency Total Words TF IDF TF IDF

1 3 100 3100 = 003 log(100000001000) = 4 003 4 = 012

2 30 100 30100 = 3 log(100000001000) = 4 03 4 = 12

Similarity Distance Measure

Example Text 1 statistics skills and programming skills are equally important for analytics

Text 2 statistics skills and domain knowledge are important for analytics

Text 3 I like reading books and travelling

The three vectors are

T1 = (1211011111000000)

T2 = (1110110111100000)

T3 = (0010000000011111)

Degree of Similarity (T1 amp T2) = (T1 T2) (sqrt(sum(T1^2)) sqrt(sum(T2^2))) = 77

Degree of Similarity (T1 amp T3) = (T1 T3) (sqrt(sum(T1^2)) sqrt(sum(T3^2))) = 12

Additional Reading Here is a detailed paper on comparing the efficiency of different distance measures for text documents

URL ndash 1) httphomeiitkacin~spranjalcs671projectreportpdf

2) httpusersdsicupves~prossoresourcesBarronEtAl_ICON09pdf

statistics skills and programming knowledge are equally important for analytics domain I like reading books travelling

Text 1 1 2 1 1 0 1 1 1 1 1 0 0 0 0 0 0

Text 2 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 0

Text 3 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1

X

Y

Euclidean

Cosine

- cosine value will be a number between 0 and 1 - Smaller the angel bigger the cosine valuesimilarity

8

Common Text Mining Techniques

bull N-grams

bull Shallow Natural Language Processing

bull Deep Natural Language Processing

Example defense attorney for liberty and

montecitordquo 1-gram defense attorney for liberty and montecito 2-gram defense attorney for liberty and montecito attorney for liberty and attorney for 3-gram defense attorney for liberty and montecito attorney for liberty for liberty and liberty and montecito 4-gram defense attorney for liberty attorney for liberty and for liberty and montecito 5-gram defense attorney for liberty and montecito attorney for liberty and montecito

Application

Probabilistic language model for predicting the

next item in a sequence in the form of a (n minus 1)

Widely used in probability communication

theory computational linguistics biological

sequence analysis

Advantage

Relatively simple

Simply increasing n model can be used to store

more context

Disadvantage

Semantic value of the item is not considered

n-gram Definition

bull n-gram is a contiguous sequence of n items from a given sequence of text

bull The items can be letters words syllables or base pairs according to the application

10

Application

- Taxonomy extraction (predefined terms and entities)

- Entities People organizations locations times dates prices genes proteins diseases

medicines

- Concept extraction (main idea or a theme)

Advantage

- Less noisy than n-grams

Disadvantage

- Does not specify role of items in the main sentence

Shallow NLP Technique Definition

- Assign a syntactic label (noun verb etc) to a chunk

- Knowledge extraction from text through semanticsyntactic analysis approach

11

Sentence - ldquoThe driver from Europe crashed the car with the white bumperrdquo

1-gram

the

driver

from

europe

crashed

the

car

with

the

white

bumper

Part of Speech

DT ndash Determiner

NN - Noun singular or mass

IN - Preposition or subordinating conjunction

NNP - Proper Noun singular

VBD - Verb past tense

DT ndash Determiner

NN - Noun singular or mass

IN - Preposition or subordinating conjunction

DT ndash Determiner

JJ ndash Adjective

NN - Noun singular or mass

- Convert to lowercase amp PoS tag

Concept Extraction

- Remove Stop words

- Retain only Nounrsquos amp Verbrsquos

- Bi-gram with Nounrsquos amp Verbrsquos retained

Bi-gram PoS

car white NN JJ

crashed car VBD NN

driver europe NN NNP

europe crashed NNP VBD

white bumper JJ NN

3-gram PoS

car white bumper NN JJ NN

crashed car white VBD NN JJ

driver europe crashed NN NNP VBD

europe crashed car NNP VBD NN

- 3-gram with Nounrsquos amp Verbrsquos retained

Conclusion

1-gram Reduced noise however no clear context Bi-gram amp 3-gram Increased context however there

is a information loss

Shallow NLP Technique

12

Stop words NounVerb

Definition

- Extension to the shallow NLP

- Detected relationships are expressed as complex construction to retain the context

- Example relationships Located in employed by part of married to

Applications

- Develop features and representations appropriate for complex interpretation tasks

- Fraud detection - Life science prediction activities based on complex RNA-Sequence

Deep NLP technique

Example

The above sentence can be represented using triples (Subject Predicate [Modifier] Object) without loosing the context

Triples

driver crash car

driver crash with bumper

driver be from Europe 13

Technique General Steps Pros Cons

N-Gram

- Convert to lowercase

- Remove punctuations - Remove special characters Simple technique Extremely noisy

Shallow NLP

technique

- POS tagging

- Lemmatization ie transform to

dictionary base form ie produce amp produced become produce

- Stemming ie transform to root word

ie 1) computer amp computers become

comput

2) product produce amp produced

become produc

- Chunking ie identify the phrasal

constituents in a sentence including

nounverb phrase etc and splits the

sentence into chunks of semantically

related words

Less noisy than N-

Grams

Computationally

expensive

solution for

analyzing the

structure of texts

Does not specify

the internal

structure or the

role of words in

the sentence

Deep NLP

technique

- Generate syntactic relationship

between each pair of words

- Extract subject predicate nagation

objecct and named entity to form triples

Context of the

sentence is

retained

Sentence level

analysis is too

structured

Techniques - Summary

14

Appendix

15

2A - Explore Corpus through EDA 2B - Convert text to lowercase 2C - Remove

a) Numbers(if required) b) Punctuations c) English stop words d) Own stop words(if

required)

e) Strip whitespace f) LemmatizationStemming g) Sparse terms

2D - Create document term matrix

Step 3 - Visualization

Corpus

Web

Documents

Step 1 ndash Data Assemble

Step 2 ndash Data Processing

Step 4 ndash Build Model(s)

Clustering

Classification

Sentiment Analysis

Fre

qu

en

cy

Ch

art

W

ord

Clo

ud

C

orr

ela

tio

n P

lot

R - Text Mining Process Overview

16

DB

Package Name

Category Description

tm Text Mining A framework for text mining applications

topicmodels Topic Modelling Fit topic models with Latent Dirichlet Allocation (LDA) and Comparative Text Mining (CTM)

wordcloud Visualization Plot a cloud comparing the frequencies of words across documents

lda Topic Modelling Fit topic models with Latent Dirichlet Allocation

wordnet Text Mining Database of English which is commonly used in linguistics and text mining

RTextTools Text Mining Automatic text classification via supervised learning

qdap Sentiment analysis

Transcript analysis text mining and natural language processing

tmplugindc Text Mining A plug-in for package tm to support distributed text mining

tmpluginmail Text Mining A plug-in for package tm to handle mail

textir Text Mining A suite of tools for inference about text documents and associated sentiment

tau Text Mining Utilities for text analysis

textcat Text Mining N-gram based text categorization

SnowballC Text Mining Word stemmer

twitteR Text Mining Provides an interface to the Twitter web API

ROAuth Text Mining Allows users to authenticate to the server of their choice (like Twitter)

RColorBrewer Visualization The packages provides palettes for drawing nice maps shaded according to a variable

ggplot2 Visualization

Graphing package implemented on top of the R statistical package Inspired by the Grammar of Graphics seminal work of Leland Wilkinson

R ndash Required packages for Text Mining

17

Example 1 - Obtaining and analyzing tweets

Objective R code for analyzing tweets relating to AAA2011 (text mining topic modelling network analysis clustering and sentiment analysis)

What does the code do

The code details ten steps in the analysis and visualization of the tweets

Acquiring the raw Twitter data

Calculating some basic statistics with the raw Twitter data

Calculating some basic retweet statistics

Calculating the ratio of retweets to tweets

Calculating some basic statistics about URLs in tweets

Basic text mining for token frequency and token association analysis (word cloud)

Calculating sentiment scores of tweets including on subsets containing tokens of interest

Hierarchical clustering of tokens based on multi scale bootstrap resampling

Topic modelling the tweet corpus using latent Dirichlet allocation

Network analysis of tweeters based on retweets

Code Source Code was taken from following link and tweakedadded additional bits where required to ensure code runs fine httpsgithubcombenmarwickAAA2011-Tweets

How to Run or Test the code - From the word doc copy the R code in the given sequence highlighted in yellow color and paste on your R console

18

RTextTools ndash Example for supervised Learning for Text Classification using Ensemble

RTextTools is a free open source R machine learning package for automatic text classification

The package includes nine algorithms for ensemble classification (svm slda boosting bagging random forests glmnet decision trees neural networks and maximum entropy) comprehensive analytics and thorough documentation

Users may use n-fold cross validation to calculate the accuracy of each algorithm on their dataset and determine which algorithms to use in their ensemble

(Using a four-ensemble agreement approach Collingwood and Wilkerson (2012) found that when four of their algorithms agree on the label of a textual document the machine label matches the human label over 90 of the time The rate is just 45 when only two algorithms agree on the text label)

Code Source The codes is readily available for download and usage from the following link httpsgithubcomtimjurkaRTextTools The code can be implemented without modification for testing however itrsquos set up such that changes can be incorporated easily based on our requirement

Additional Reading httpwwwrtexttoolscomabout-the-projecthtml

19

Example 2 - RTextTools

Penn Treebank - httpswwwlingupenneducoursesFall_2003ling001penn_treebank_poshtml

Stanford info lab - Finding Similar Items httpinfolabstanfordedu~ullmanmmdsch3pdf

TRIPLET EXTRACTION FROM SENTENCES URL - httpailabijssidelia_rusuPapersis_2007pdf

Shallow and Deep NLP Processing for ontology learning a Quick Overview httpazouaqathabascaucapublicationsConferences20Workshops20Books5BBC25D_KDW_2010pdf

References

20

Similarity Distance Measure

Example Text 1 statistics skills and programming skills are equally important for analytics

Text 2 statistics skills and domain knowledge are important for analytics

Text 3 I like reading books and travelling

The three vectors are

T1 = (1211011111000000)

T2 = (1110110111100000)

T3 = (0010000000011111)

Degree of Similarity (T1 amp T2) = (T1 T2) (sqrt(sum(T1^2)) sqrt(sum(T2^2))) = 77

Degree of Similarity (T1 amp T3) = (T1 T3) (sqrt(sum(T1^2)) sqrt(sum(T3^2))) = 12

Additional Reading Here is a detailed paper on comparing the efficiency of different distance measures for text documents

URL ndash 1) httphomeiitkacin~spranjalcs671projectreportpdf

2) httpusersdsicupves~prossoresourcesBarronEtAl_ICON09pdf

statistics skills and programming knowledge are equally important for analytics domain I like reading books travelling

Text 1 1 2 1 1 0 1 1 1 1 1 0 0 0 0 0 0

Text 2 1 1 1 0 1 1 0 1 1 1 1 0 0 0 0 0

Text 3 0 0 1 0 0 0 0 0 0 0 0 1 1 1 1 1

X

Y

Euclidean

Cosine

- cosine value will be a number between 0 and 1 - Smaller the angel bigger the cosine valuesimilarity

8

Common Text Mining Techniques

bull N-grams

bull Shallow Natural Language Processing

bull Deep Natural Language Processing

Example defense attorney for liberty and

montecitordquo 1-gram defense attorney for liberty and montecito 2-gram defense attorney for liberty and montecito attorney for liberty and attorney for 3-gram defense attorney for liberty and montecito attorney for liberty for liberty and liberty and montecito 4-gram defense attorney for liberty attorney for liberty and for liberty and montecito 5-gram defense attorney for liberty and montecito attorney for liberty and montecito

Application

Probabilistic language model for predicting the

next item in a sequence in the form of a (n minus 1)

Widely used in probability communication

theory computational linguistics biological

sequence analysis

Advantage

Relatively simple

Simply increasing n model can be used to store

more context

Disadvantage

Semantic value of the item is not considered

n-gram Definition

bull n-gram is a contiguous sequence of n items from a given sequence of text

bull The items can be letters words syllables or base pairs according to the application

10

Application

- Taxonomy extraction (predefined terms and entities)

- Entities People organizations locations times dates prices genes proteins diseases

medicines

- Concept extraction (main idea or a theme)

Advantage

- Less noisy than n-grams

Disadvantage

- Does not specify role of items in the main sentence

Shallow NLP Technique Definition

- Assign a syntactic label (noun verb etc) to a chunk

- Knowledge extraction from text through semanticsyntactic analysis approach

11

Sentence - ldquoThe driver from Europe crashed the car with the white bumperrdquo

1-gram

the

driver

from

europe

crashed

the

car

with

the

white

bumper

Part of Speech

DT ndash Determiner

NN - Noun singular or mass

IN - Preposition or subordinating conjunction

NNP - Proper Noun singular

VBD - Verb past tense

DT ndash Determiner

NN - Noun singular or mass

IN - Preposition or subordinating conjunction

DT ndash Determiner

JJ ndash Adjective

NN - Noun singular or mass

- Convert to lowercase amp PoS tag

Concept Extraction

- Remove Stop words

- Retain only Nounrsquos amp Verbrsquos

- Bi-gram with Nounrsquos amp Verbrsquos retained

Bi-gram PoS

car white NN JJ

crashed car VBD NN

driver europe NN NNP

europe crashed NNP VBD

white bumper JJ NN

3-gram PoS

car white bumper NN JJ NN

crashed car white VBD NN JJ

driver europe crashed NN NNP VBD

europe crashed car NNP VBD NN

- 3-gram with Nounrsquos amp Verbrsquos retained

Conclusion

1-gram Reduced noise however no clear context Bi-gram amp 3-gram Increased context however there

is a information loss

Shallow NLP Technique

12

Stop words NounVerb

Definition

- Extension to the shallow NLP

- Detected relationships are expressed as complex construction to retain the context

- Example relationships Located in employed by part of married to

Applications

- Develop features and representations appropriate for complex interpretation tasks

- Fraud detection - Life science prediction activities based on complex RNA-Sequence

Deep NLP technique

Example

The above sentence can be represented using triples (Subject Predicate [Modifier] Object) without loosing the context

Triples

driver crash car

driver crash with bumper

driver be from Europe 13

Technique General Steps Pros Cons

N-Gram

- Convert to lowercase

- Remove punctuations - Remove special characters Simple technique Extremely noisy

Shallow NLP

technique

- POS tagging

- Lemmatization ie transform to

dictionary base form ie produce amp produced become produce

- Stemming ie transform to root word

ie 1) computer amp computers become

comput

2) product produce amp produced

become produc

- Chunking ie identify the phrasal

constituents in a sentence including

nounverb phrase etc and splits the

sentence into chunks of semantically

related words

Less noisy than N-

Grams

Computationally

expensive

solution for

analyzing the

structure of texts

Does not specify

the internal

structure or the

role of words in

the sentence

Deep NLP

technique

- Generate syntactic relationship

between each pair of words

- Extract subject predicate nagation

objecct and named entity to form triples

Context of the

sentence is

retained

Sentence level

analysis is too

structured

Techniques - Summary

14

Appendix

15

2A - Explore Corpus through EDA 2B - Convert text to lowercase 2C - Remove

a) Numbers(if required) b) Punctuations c) English stop words d) Own stop words(if

required)

e) Strip whitespace f) LemmatizationStemming g) Sparse terms

2D - Create document term matrix

Step 3 - Visualization

Corpus

Web

Documents

Step 1 ndash Data Assemble

Step 2 ndash Data Processing

Step 4 ndash Build Model(s)

Clustering

Classification

Sentiment Analysis

Fre

qu

en

cy

Ch

art

W

ord

Clo

ud

C

orr

ela

tio

n P

lot

R - Text Mining Process Overview

16

DB

Package Name

Category Description

tm Text Mining A framework for text mining applications

topicmodels Topic Modelling Fit topic models with Latent Dirichlet Allocation (LDA) and Comparative Text Mining (CTM)

wordcloud Visualization Plot a cloud comparing the frequencies of words across documents

lda Topic Modelling Fit topic models with Latent Dirichlet Allocation

wordnet Text Mining Database of English which is commonly used in linguistics and text mining

RTextTools Text Mining Automatic text classification via supervised learning

qdap Sentiment analysis

Transcript analysis text mining and natural language processing

tmplugindc Text Mining A plug-in for package tm to support distributed text mining

tmpluginmail Text Mining A plug-in for package tm to handle mail

textir Text Mining A suite of tools for inference about text documents and associated sentiment

tau Text Mining Utilities for text analysis

textcat Text Mining N-gram based text categorization

SnowballC Text Mining Word stemmer

twitteR Text Mining Provides an interface to the Twitter web API

ROAuth Text Mining Allows users to authenticate to the server of their choice (like Twitter)

RColorBrewer Visualization The packages provides palettes for drawing nice maps shaded according to a variable

ggplot2 Visualization

Graphing package implemented on top of the R statistical package Inspired by the Grammar of Graphics seminal work of Leland Wilkinson

R ndash Required packages for Text Mining

17

Example 1 - Obtaining and analyzing tweets

Objective R code for analyzing tweets relating to AAA2011 (text mining topic modelling network analysis clustering and sentiment analysis)

What does the code do

The code details ten steps in the analysis and visualization of the tweets

Acquiring the raw Twitter data

Calculating some basic statistics with the raw Twitter data

Calculating some basic retweet statistics

Calculating the ratio of retweets to tweets

Calculating some basic statistics about URLs in tweets

Basic text mining for token frequency and token association analysis (word cloud)

Calculating sentiment scores of tweets including on subsets containing tokens of interest

Hierarchical clustering of tokens based on multi scale bootstrap resampling

Topic modelling the tweet corpus using latent Dirichlet allocation

Network analysis of tweeters based on retweets

Code Source Code was taken from following link and tweakedadded additional bits where required to ensure code runs fine httpsgithubcombenmarwickAAA2011-Tweets

How to Run or Test the code - From the word doc copy the R code in the given sequence highlighted in yellow color and paste on your R console

18

RTextTools ndash Example for supervised Learning for Text Classification using Ensemble

RTextTools is a free open source R machine learning package for automatic text classification

The package includes nine algorithms for ensemble classification (svm slda boosting bagging random forests glmnet decision trees neural networks and maximum entropy) comprehensive analytics and thorough documentation

Users may use n-fold cross validation to calculate the accuracy of each algorithm on their dataset and determine which algorithms to use in their ensemble

(Using a four-ensemble agreement approach Collingwood and Wilkerson (2012) found that when four of their algorithms agree on the label of a textual document the machine label matches the human label over 90 of the time The rate is just 45 when only two algorithms agree on the text label)

Code Source The codes is readily available for download and usage from the following link httpsgithubcomtimjurkaRTextTools The code can be implemented without modification for testing however itrsquos set up such that changes can be incorporated easily based on our requirement

Additional Reading httpwwwrtexttoolscomabout-the-projecthtml

19

Example 2 - RTextTools

Penn Treebank - httpswwwlingupenneducoursesFall_2003ling001penn_treebank_poshtml

Stanford info lab - Finding Similar Items httpinfolabstanfordedu~ullmanmmdsch3pdf

TRIPLET EXTRACTION FROM SENTENCES URL - httpailabijssidelia_rusuPapersis_2007pdf

Shallow and Deep NLP Processing for ontology learning a Quick Overview httpazouaqathabascaucapublicationsConferences20Workshops20Books5BBC25D_KDW_2010pdf

References

20

Common Text Mining Techniques

bull N-grams

bull Shallow Natural Language Processing

bull Deep Natural Language Processing

Example defense attorney for liberty and

montecitordquo 1-gram defense attorney for liberty and montecito 2-gram defense attorney for liberty and montecito attorney for liberty and attorney for 3-gram defense attorney for liberty and montecito attorney for liberty for liberty and liberty and montecito 4-gram defense attorney for liberty attorney for liberty and for liberty and montecito 5-gram defense attorney for liberty and montecito attorney for liberty and montecito

Application

Probabilistic language model for predicting the

next item in a sequence in the form of a (n minus 1)

Widely used in probability communication

theory computational linguistics biological

sequence analysis

Advantage

Relatively simple

Simply increasing n model can be used to store

more context

Disadvantage

Semantic value of the item is not considered

n-gram Definition

bull n-gram is a contiguous sequence of n items from a given sequence of text

bull The items can be letters words syllables or base pairs according to the application

10

Application

- Taxonomy extraction (predefined terms and entities)

- Entities People organizations locations times dates prices genes proteins diseases

medicines

- Concept extraction (main idea or a theme)

Advantage

- Less noisy than n-grams

Disadvantage

- Does not specify role of items in the main sentence

Shallow NLP Technique Definition

- Assign a syntactic label (noun verb etc) to a chunk

- Knowledge extraction from text through semanticsyntactic analysis approach

11

Sentence - ldquoThe driver from Europe crashed the car with the white bumperrdquo

1-gram

the

driver

from

europe

crashed

the

car

with

the

white

bumper

Part of Speech

DT ndash Determiner

NN - Noun singular or mass

IN - Preposition or subordinating conjunction

NNP - Proper Noun singular

VBD - Verb past tense

DT ndash Determiner

NN - Noun singular or mass

IN - Preposition or subordinating conjunction

DT ndash Determiner

JJ ndash Adjective

NN - Noun singular or mass

- Convert to lowercase amp PoS tag

Concept Extraction

- Remove Stop words

- Retain only Nounrsquos amp Verbrsquos

- Bi-gram with Nounrsquos amp Verbrsquos retained

Bi-gram PoS

car white NN JJ

crashed car VBD NN

driver europe NN NNP

europe crashed NNP VBD

white bumper JJ NN

3-gram PoS

car white bumper NN JJ NN

crashed car white VBD NN JJ

driver europe crashed NN NNP VBD

europe crashed car NNP VBD NN

- 3-gram with Nounrsquos amp Verbrsquos retained

Conclusion

1-gram Reduced noise however no clear context Bi-gram amp 3-gram Increased context however there

is a information loss

Shallow NLP Technique

12

Stop words NounVerb

Definition

- Extension to the shallow NLP

- Detected relationships are expressed as complex construction to retain the context

- Example relationships Located in employed by part of married to

Applications

- Develop features and representations appropriate for complex interpretation tasks

- Fraud detection - Life science prediction activities based on complex RNA-Sequence

Deep NLP technique

Example

The above sentence can be represented using triples (Subject Predicate [Modifier] Object) without loosing the context

Triples

driver crash car

driver crash with bumper

driver be from Europe 13

Technique General Steps Pros Cons

N-Gram

- Convert to lowercase

- Remove punctuations - Remove special characters Simple technique Extremely noisy

Shallow NLP

technique

- POS tagging

- Lemmatization ie transform to

dictionary base form ie produce amp produced become produce

- Stemming ie transform to root word

ie 1) computer amp computers become

comput

2) product produce amp produced

become produc

- Chunking ie identify the phrasal

constituents in a sentence including

nounverb phrase etc and splits the

sentence into chunks of semantically

related words

Less noisy than N-

Grams

Computationally

expensive

solution for

analyzing the

structure of texts

Does not specify

the internal

structure or the

role of words in

the sentence

Deep NLP

technique

- Generate syntactic relationship

between each pair of words

- Extract subject predicate nagation

objecct and named entity to form triples

Context of the

sentence is

retained

Sentence level

analysis is too

structured

Techniques - Summary

14

Appendix

15

2A - Explore Corpus through EDA 2B - Convert text to lowercase 2C - Remove

a) Numbers(if required) b) Punctuations c) English stop words d) Own stop words(if

required)

e) Strip whitespace f) LemmatizationStemming g) Sparse terms

2D - Create document term matrix

Step 3 - Visualization

Corpus

Web

Documents

Step 1 ndash Data Assemble

Step 2 ndash Data Processing

Step 4 ndash Build Model(s)

Clustering

Classification

Sentiment Analysis

Fre

qu

en

cy

Ch

art

W

ord

Clo

ud

C

orr

ela

tio

n P

lot

R - Text Mining Process Overview

16

DB

Package Name

Category Description

tm Text Mining A framework for text mining applications

topicmodels Topic Modelling Fit topic models with Latent Dirichlet Allocation (LDA) and Comparative Text Mining (CTM)

wordcloud Visualization Plot a cloud comparing the frequencies of words across documents

lda Topic Modelling Fit topic models with Latent Dirichlet Allocation

wordnet Text Mining Database of English which is commonly used in linguistics and text mining

RTextTools Text Mining Automatic text classification via supervised learning

qdap Sentiment analysis

Transcript analysis text mining and natural language processing

tmplugindc Text Mining A plug-in for package tm to support distributed text mining

tmpluginmail Text Mining A plug-in for package tm to handle mail

textir Text Mining A suite of tools for inference about text documents and associated sentiment

tau Text Mining Utilities for text analysis

textcat Text Mining N-gram based text categorization

SnowballC Text Mining Word stemmer

twitteR Text Mining Provides an interface to the Twitter web API

ROAuth Text Mining Allows users to authenticate to the server of their choice (like Twitter)

RColorBrewer Visualization The packages provides palettes for drawing nice maps shaded according to a variable

ggplot2 Visualization

Graphing package implemented on top of the R statistical package Inspired by the Grammar of Graphics seminal work of Leland Wilkinson

R ndash Required packages for Text Mining

17

Example 1 - Obtaining and analyzing tweets

Objective R code for analyzing tweets relating to AAA2011 (text mining topic modelling network analysis clustering and sentiment analysis)

What does the code do

The code details ten steps in the analysis and visualization of the tweets

Acquiring the raw Twitter data

Calculating some basic statistics with the raw Twitter data

Calculating some basic retweet statistics

Calculating the ratio of retweets to tweets

Calculating some basic statistics about URLs in tweets

Basic text mining for token frequency and token association analysis (word cloud)

Calculating sentiment scores of tweets including on subsets containing tokens of interest

Hierarchical clustering of tokens based on multi scale bootstrap resampling

Topic modelling the tweet corpus using latent Dirichlet allocation

Network analysis of tweeters based on retweets

Code Source Code was taken from following link and tweakedadded additional bits where required to ensure code runs fine httpsgithubcombenmarwickAAA2011-Tweets

How to Run or Test the code - From the word doc copy the R code in the given sequence highlighted in yellow color and paste on your R console

18

RTextTools ndash Example for supervised Learning for Text Classification using Ensemble

RTextTools is a free open source R machine learning package for automatic text classification

The package includes nine algorithms for ensemble classification (svm slda boosting bagging random forests glmnet decision trees neural networks and maximum entropy) comprehensive analytics and thorough documentation

Users may use n-fold cross validation to calculate the accuracy of each algorithm on their dataset and determine which algorithms to use in their ensemble

(Using a four-ensemble agreement approach Collingwood and Wilkerson (2012) found that when four of their algorithms agree on the label of a textual document the machine label matches the human label over 90 of the time The rate is just 45 when only two algorithms agree on the text label)

Code Source The codes is readily available for download and usage from the following link httpsgithubcomtimjurkaRTextTools The code can be implemented without modification for testing however itrsquos set up such that changes can be incorporated easily based on our requirement

Additional Reading httpwwwrtexttoolscomabout-the-projecthtml

19

Example 2 - RTextTools

Penn Treebank - httpswwwlingupenneducoursesFall_2003ling001penn_treebank_poshtml

Stanford info lab - Finding Similar Items httpinfolabstanfordedu~ullmanmmdsch3pdf

TRIPLET EXTRACTION FROM SENTENCES URL - httpailabijssidelia_rusuPapersis_2007pdf

Shallow and Deep NLP Processing for ontology learning a Quick Overview httpazouaqathabascaucapublicationsConferences20Workshops20Books5BBC25D_KDW_2010pdf

References

20

Example defense attorney for liberty and

montecitordquo 1-gram defense attorney for liberty and montecito 2-gram defense attorney for liberty and montecito attorney for liberty and attorney for 3-gram defense attorney for liberty and montecito attorney for liberty for liberty and liberty and montecito 4-gram defense attorney for liberty attorney for liberty and for liberty and montecito 5-gram defense attorney for liberty and montecito attorney for liberty and montecito

Application

Probabilistic language model for predicting the

next item in a sequence in the form of a (n minus 1)

Widely used in probability communication

theory computational linguistics biological

sequence analysis

Advantage

Relatively simple

Simply increasing n model can be used to store

more context

Disadvantage

Semantic value of the item is not considered

n-gram Definition

bull n-gram is a contiguous sequence of n items from a given sequence of text

bull The items can be letters words syllables or base pairs according to the application

10

Application

- Taxonomy extraction (predefined terms and entities)

- Entities People organizations locations times dates prices genes proteins diseases

medicines

- Concept extraction (main idea or a theme)

Advantage

- Less noisy than n-grams

Disadvantage

- Does not specify role of items in the main sentence

Shallow NLP Technique Definition

- Assign a syntactic label (noun verb etc) to a chunk

- Knowledge extraction from text through semanticsyntactic analysis approach

11

Sentence - ldquoThe driver from Europe crashed the car with the white bumperrdquo

1-gram

the

driver

from

europe

crashed

the

car

with

the

white

bumper

Part of Speech

DT ndash Determiner

NN - Noun singular or mass

IN - Preposition or subordinating conjunction

NNP - Proper Noun singular

VBD - Verb past tense

DT ndash Determiner

NN - Noun singular or mass

IN - Preposition or subordinating conjunction

DT ndash Determiner

JJ ndash Adjective

NN - Noun singular or mass

- Convert to lowercase amp PoS tag

Concept Extraction

- Remove Stop words

- Retain only Nounrsquos amp Verbrsquos

- Bi-gram with Nounrsquos amp Verbrsquos retained

Bi-gram PoS

car white NN JJ

crashed car VBD NN

driver europe NN NNP

europe crashed NNP VBD

white bumper JJ NN

3-gram PoS

car white bumper NN JJ NN

crashed car white VBD NN JJ

driver europe crashed NN NNP VBD

europe crashed car NNP VBD NN

- 3-gram with Nounrsquos amp Verbrsquos retained

Conclusion

1-gram Reduced noise however no clear context Bi-gram amp 3-gram Increased context however there

is a information loss

Shallow NLP Technique

12

Stop words NounVerb

Definition

- Extension to the shallow NLP

- Detected relationships are expressed as complex construction to retain the context

- Example relationships Located in employed by part of married to

Applications

- Develop features and representations appropriate for complex interpretation tasks

- Fraud detection - Life science prediction activities based on complex RNA-Sequence

Deep NLP technique

Example

The above sentence can be represented using triples (Subject Predicate [Modifier] Object) without loosing the context

Triples

driver crash car

driver crash with bumper

driver be from Europe 13

Techniques – Summary

N-gram

- General steps: convert to lowercase; remove punctuation; remove special characters.
- Pros: simple technique.
- Cons: extremely noisy.

Shallow NLP technique

- General steps: PoS tagging; lemmatization, i.e. transform to the dictionary base form (produce & produced become produce); stemming, i.e. transform to the root word (computer & computers become comput; product, produce & produced become produc); chunking, i.e. identify the phrasal constituents in a sentence, including noun/verb phrases, and split the sentence into chunks of semantically related words.
- Pros: less noisy than n-grams.
- Cons: computationally expensive for analyzing the structure of texts; does not specify the internal structure or the role of words in the sentence.

Deep NLP technique

- General steps: generate syntactic relationships between each pair of words; extract the subject, predicate, negation, object and named entities to form triples.
- Pros: the context of the sentence is retained.
- Cons: sentence-level analysis is too structured.

Appendix


R – Text Mining Process Overview

Step 1 – Data Assemble: build the corpus from sources such as the web, documents and databases (DB).

Step 2 – Data Processing:
- 2A – Explore the corpus through EDA
- 2B – Convert text to lowercase
- 2C – Remove: a) numbers (if required); b) punctuation; c) English stop words; d) own stop words (if required); e) strip whitespace; f) lemmatization/stemming; g) sparse terms
- 2D – Create the document term matrix

Step 3 – Visualization: frequency chart, word cloud, correlation plot.

Step 4 – Build Model(s): clustering, classification, sentiment analysis.
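A rough sketch of Steps 2 and 3 using tm, SnowballC and wordcloud; the two toy documents, the custom stop word and the 0.95 sparsity threshold are illustrative assumptions.

library(tm)
library(SnowballC)  # provides the stemmer behind stemDocument
library(wordcloud)

docs <- c("Text mining discovers useful patterns in text",
          "Mining text data requires cleaning the text first")

corpus <- VCorpus(VectorSource(docs))                        # Step 1
corpus <- tm_map(corpus, content_transformer(tolower))       # 2B
corpus <- tm_map(corpus, removeNumbers)                      # 2C-a
corpus <- tm_map(corpus, removePunctuation)                  # 2C-b
corpus <- tm_map(corpus, removeWords, stopwords("english"))  # 2C-c
corpus <- tm_map(corpus, removeWords, c("first"))            # 2C-d own stop words
corpus <- tm_map(corpus, stripWhitespace)                    # 2C-e
corpus <- tm_map(corpus, stemDocument)                       # 2C-f
dtm <- DocumentTermMatrix(corpus)                            # 2D
dtm <- removeSparseTerms(dtm, 0.95)                          # 2C-g

freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)     # Step 3
barplot(head(freq, 10), las = 2)                             # frequency chart
wordcloud(names(freq), freq, min.freq = 1)                   # word cloud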

R – Required packages for Text Mining

- tm (Text Mining): a framework for text mining applications.
- topicmodels (Topic Modelling): fit topic models with Latent Dirichlet Allocation (LDA) and Correlated Topic Models (CTM).
- wordcloud (Visualization): plot a cloud comparing the frequencies of words across documents.
- lda (Topic Modelling): fit topic models with Latent Dirichlet Allocation.
- wordnet (Text Mining): interface to WordNet, a lexical database of English commonly used in linguistics and text mining.
- RTextTools (Text Mining): automatic text classification via supervised learning.
- qdap (Sentiment Analysis): transcript analysis, text mining and natural language processing.
- tm.plugin.dc (Text Mining): a plug-in for package tm to support distributed text mining.
- tm.plugin.mail (Text Mining): a plug-in for package tm to handle mail.
- textir (Text Mining): a suite of tools for inference about text documents and associated sentiment.
- tau (Text Mining): utilities for text analysis.
- textcat (Text Mining): n-gram based text categorization.
- SnowballC (Text Mining): word stemmer.
- twitteR (Text Mining): provides an interface to the Twitter web API.
- ROAuth (Text Mining): allows users to authenticate to the server of their choice (such as Twitter).
- RColorBrewer (Visualization): provides palettes for drawing nice maps shaded according to a variable.
- ggplot2 (Visualization): graphing package inspired by Leland Wilkinson's seminal work The Grammar of Graphics.
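Continuing from the document term matrix built in the preprocessing sketch above, a minimal topicmodels example; k = 2 and the seed are arbitrary assumptions, and LDA requires every document in the matrix to contain at least one term.

library(topicmodels)
lda_fit <- LDA(dtm, k = 2, control = list(seed = 1234))
terms(lda_fit, 5)  # top 5 terms per topic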

Example 1 - Obtaining and analyzing tweets

Objective: R code for analyzing tweets relating to AAA2011 (text mining, topic modelling, network analysis, clustering and sentiment analysis).

What does the code do? It details ten steps in the analysis and visualization of the tweets (a minimal sketch of steps 1 and 6 appears after this example):

1. Acquiring the raw Twitter data
2. Calculating some basic statistics with the raw Twitter data
3. Calculating some basic retweet statistics
4. Calculating the ratio of retweets to tweets
5. Calculating some basic statistics about URLs in tweets
6. Basic text mining for token frequency and token association analysis (word cloud)
7. Calculating sentiment scores of tweets, including on subsets containing tokens of interest
8. Hierarchical clustering of tokens based on multiscale bootstrap resampling
9. Topic modelling the tweet corpus using Latent Dirichlet Allocation
10. Network analysis of tweeters based on retweets

Code source: the code was taken from the following link and tweaked, with additional bits added where required, to ensure it runs cleanly: https://github.com/benmarwick/AAA2011-Tweets

How to run or test the code: from the Word document, copy the R code in the given sequence (highlighted in yellow) and paste it into your R console.
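A minimal sketch of steps 1 and 6, assuming the twitteR, tm and wordcloud packages, registered Twitter API credentials (the strings below are placeholders) and an arbitrary n = 1500; this is not the repository's code.

library(twitteR)
library(tm)
library(wordcloud)

# Step 1 - acquire the raw Twitter data (credential strings are placeholders)
setup_twitter_oauth("CONSUMER_KEY", "CONSUMER_SECRET",
                    "ACCESS_TOKEN", "ACCESS_SECRET")
tweets <- searchTwitter("AAA2011", n = 1500)
df <- twListToDF(tweets)  # one row per tweet

# Step 6 - token frequency and token association analysis
corpus <- VCorpus(VectorSource(df$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
tdm  <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
findAssocs(tdm, names(freq)[1], 0.3)           # associations with the top token
wordcloud(names(freq), freq, max.words = 100)  # word cloud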

Example 2 - RTextTools: supervised learning for text classification using an ensemble

RTextTools is a free, open-source R machine learning package for automatic text classification.

The package includes nine algorithms for ensemble classification (SVM, SLDA, boosting, bagging, random forests, glmnet, decision trees, neural networks and maximum entropy), comprehensive analytics and thorough documentation.

Users may use n-fold cross-validation to calculate the accuracy of each algorithm on their dataset and determine which algorithms to use in their ensemble.

(Using a four-ensemble agreement approach, Collingwood and Wilkerson (2012) found that when four of their algorithms agree on the label of a textual document, the machine label matches the human label over 90% of the time. The rate is just 45% when only two algorithms agree on the text label.)

Code source: the code is readily available for download and use from the following link: https://github.com/timjurka/RTextTools. It can be run without modification for testing, and it is set up so that changes can be incorporated easily as required. A minimal sketch of the workflow follows below.

Additional reading: http://www.rtexttools.com/about-the-project.html
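A minimal sketch of the standard RTextTools workflow, assuming a data frame df of 200 labelled documents with columns text and label; the train/test split and the four algorithms chosen are illustrative assumptions.

library(RTextTools)

# Build a document term matrix from the raw text
dtm <- create_matrix(df$text, language = "english",
                     removeStopwords = TRUE, stemWords = TRUE)

# First 150 documents train, last 50 test
container <- create_container(dtm, df$label,
                              trainSize = 1:150, testSize = 151:200,
                              virgin = FALSE)

models  <- train_models(container, algorithms = c("SVM", "MAXENT", "TREE", "RF"))
results <- classify_models(container, models)

# Per-algorithm accuracy plus ensemble agreement summaries
analytics <- create_analytics(container, results)
summary(analytics)

# n-fold cross-validation for a single algorithm
cross_validate(container, nfold = 4, algorithm = "SVM")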

References

- Penn Treebank PoS tags: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
- Stanford InfoLab, "Finding Similar Items" (Mining of Massive Datasets, ch. 3): http://infolab.stanford.edu/~ullman/mmds/ch3.pdf
- "Triplet Extraction from Sentences": http://ailab.ijs.si/delia_rusu/Papers/is_2007.pdf
- "Shallow and Deep NLP Processing for Ontology Learning: a Quick Overview": http://azouaq.athabascau.ca/publications/Conferences%20Workshops%20Books/%5BBC2%5D_KDW_2010.pdf
