We love NLTK

Post on 02-Jul-2015


description

NLTK + Data Matching? Yep!

transcript

[('We', 'PRP'), ('<3', 'VBP'), ('NLTK', 'NNP')]

Dhiana Deva | Gabriel Fonseca

Data Matching @ UFRJ

“NLTK” == “Natural Language ToolKit”

+ Python library for NLP
+ Created in 2001 at University of Pennsylvania
+ Very extensive
+ Many examples
+ Built-in support for 84 datasets (today!)
+ Great documentation
+ Open source ;)
+ Active community

Lots of modules!

corpus: standardized interfaces to corpora and lexicons
tokenize: tokenizers!
stem: stemmers!
collocations: t-test, chi-squared, point-wise mutual information
classify: decision tree, maximum entropy, naive Bayes
cluster: EM, k-means
chunk: regular expression, n-gram, named-entity
metrics: distances, precision, recall, agreement coefficients
probability: frequency distributions, smoothed probability distributions
parse: chart, feature-based, unification, probabilistic, dependency
tag: part-of-speech tagging, n-gram, backoff, Brill, HMM, TnT
...

Can I haz Data Matching?

☑ Accuracy score

☑ Precision score

☑ Recall score

☑ F-measure score

☐ Reduction ratio

☑ Stop-words (11 languages)

★ Punkt sentence tokenizer

★ Punkt word tokenizer

☑ N-gram (words and chars)

☑ Tf-idf

☑ Levenshtein distance

☑ Damerau-Levenshtein distance

☑ Binary distance... Durr!

★ Krippendorff's distance

★ Masi distance

☑ Jaccard distance

☐ Jaro distance

☐ Jaro-Winkler distance

☐ Monge-Elkan distance

☐ Soundex

☐ Phonex

☐ NYSIIS

☐ ONCA

☐ Double-Metaphone

☐ Fuzzy Soundex

☑ Decision tree

☑ SVM

☑ Naive Bayes

★ MaxEnt
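Two of the checked boxes above, Levenshtein distance and Jaccard distance over character n-grams, fit in a few lines. The block below is a dependency-free sketch for illustration only, not NLTK's own code; NLTK ships its own versions as `edit_distance` and `jaccard_distance` in `nltk.metrics.distance`. The helper names `levenshtein` and `char_ngrams`, and the choice to return 0.0 for two empty sets, are assumptions made here.

```python
def levenshtein(a: str, b: str) -> int:
    """Fewest single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def char_ngrams(text: str, n: int = 2) -> set:
    """Set of all character n-grams of text (bigrams by default)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard_distance(x: set, y: set) -> float:
    """1 - |x & y| / |x | y|; two empty sets count as identical (assumption)."""
    if not x and not y:
        return 0.0
    return 1.0 - len(x & y) / len(x | y)


if __name__ == "__main__":
    print(levenshtein("queen", "qween"))
    print(jaccard_distance(char_ngrams("queen"), char_ngrams("qween")))
```

Edit distance counts character-level typos directly, while the n-gram Jaccard distance compares the sets of short substrings the two strings contain, so it puts less weight on where in the string a difference occurs.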

Fun fun fun!

+ Sentiment analysis
+ Spelling correction
+ Spam detection
+ Topic modeling
+ Recommender systems
+ Data deduplication

Why not song matching?!

+ Grooveshark: online music streaming service
+ Songs uploaded by record labels, independent artists and users
+ Lots of duplicates!
+ Tinysong: Grooveshark’s open RESTful API
+ Our goal: No repeated songs!

(remixes and live versions are okay!)

Bohemian Rhapsody by Qween-?!

{
  "Url": "http:\/\/tinysong.com\/PBCJ",
  "SongID": 33834073,
  "SongName": "Bohemian Rhapsody",
  "ArtistID": 2324,
  "ArtistName": "Queen",
  "AlbumID": 1071492,
  "AlbumName": "Greatest Hits"
},
...
{
  "Url": "http:\/\/tinysong.com\/CYxG",
  "SongID": 28835215,
  "SongName": "Bohemian Rhapsody",
  "ArtistID": 1731732,
  "ArtistName": "Qween -",
  "AlbumID": 2364353,
  "AlbumName": "A Night at the Opera"
}
...
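One way the two records above could be flagged as the same song is sketched below. This is an illustration under stated assumptions, not the matching pipeline used in the talk: it swaps in `difflib.SequenceMatcher` from the Python standard library as the string similarity, the helper names `normalize`, `similarity` and `is_duplicate` are made up here, and the 0.75 threshold is an arbitrary choice. The `live` record is hypothetical and only shows how a live version stays separate.

```python
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between the normalized strings (difflib ratio)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_duplicate(song_a: dict, song_b: dict, artist_threshold: float = 0.75) -> bool:
    """Same normalized title and a close-enough artist name.

    The album is ignored on purpose: the same song can appear on several albums.
    The threshold value is an assumption, not a tuned parameter.
    """
    same_title = normalize(song_a["SongName"]) == normalize(song_b["SongName"])
    close_artist = similarity(song_a["ArtistName"], song_b["ArtistName"]) >= artist_threshold
    return same_title and close_artist


# Fields copied from the two Tinysong records above.
queen = {"SongID": 33834073, "SongName": "Bohemian Rhapsody", "ArtistName": "Queen"}
qween = {"SongID": 28835215, "SongName": "Bohemian Rhapsody", "ArtistName": "Qween -"}

# Hypothetical record (not from the API response) for a live version.
live = {"SongID": 0, "SongName": "Bohemian Rhapsody (Live)", "ArtistName": "Queen"}

if __name__ == "__main__":
    print(is_duplicate(queen, qween))
    print(is_duplicate(queen, live))
```

Requiring an exact match on the normalized title is what keeps remixes and live versions apart here, since their titles usually carry an extra marker such as "(Live)".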

Next steps

+ Other textual data
+ Machine learning
+ Acoustic features: loudness, BPM, liveness
+ Acoustic fingerprinting for supervised learning

Yes, songs have fingerprints too!

Our “sentiment”

+ Quick and easy!
+ Exteeeeeeeeeeeeeeeeensive!
+ Docs & community!
+ Internationalization
- Time performance
- Memory usage
- No online or active learning

Want more?!

+ jellyfish: Jaro-Winkler, Hamming, Soundex, Metaphone, NYSIIS, …
+ nltk-trainer: Command-line NLTK classifiers!
+ scikit-learn: More machine learning! Memory efficient!
+ pattern: Web mining. Out-of-the-box!
+ gensim: Topic modeling. Out-of-the-box!

Thanks! ;)