
Analyzing data with python

Date post: 22-Feb-2016
Page 1: Analyzing data with python

Sarah Guido
@sarah_guido
Reonomy
OSCON 2014

ANALYZING DATA WITH PYTHON

Page 2: Analyzing data with python

Data scientist at Reonomy
University of Michigan graduate
NYC Python organizer
PyGotham organizer

ABOUT ME

Page 3: Analyzing data with python

Bird’s-eye overview: not comprehensive explanation of these tools!

Take data from start to finish
  Preprocessing: Pandas
  Analysis: scikit-learn
  Analysis: NLTK
  Data pipeline: MRJob
  Visualization: matplotlib

What next?

ABOUT THIS TALK

Page 4: Analyzing data with python

So many tools
  Preprocessing, analysis, statistics, machine learning, natural language processing, network analysis, visualization, scalability

Community support
“Easy” language to learn
Both a scripting language and a production-ready language

WHY PYTHON?

Page 5: Analyzing data with python

How to find the best tool(s)?
The 90/10 rule
Simple is better than complex

FROM POINT A TO POINT…X?

Page 6: Analyzing data with python

Available resources
  Documentation, tutorials, books, videos
Ease of use (with a grain of salt)
Community support and continuous development
Widely used

WHY I CHOSE THESE TOOLS

Page 7: Analyzing data with python

The importance of data preprocessing
  AKA wrangling, munging, manipulating, and so on
Preprocessing is also getting to know your data
  Missing values? Categorical/continuous? Distribution?

PREPROCESSING

Page 8: Analyzing data with python

Data analysis and modeling
Similar to R and Excel
Easy-to-use data structures
  DataFrame
Data wrangling tools
  Merging, pivoting, etc.

PANDAS

Page 9: Analyzing data with python

Keep everything in Python
Community support/resources
Use for preprocessing
  File I/O, cleaning, manipulation, etc.
Combinable with other modules
  NumPy, SciPy, statsmodels, matplotlib

PANDAS

Page 10: Analyzing data with python

File I/O

PANDAS
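
The code shown on this slide is not preserved in the transcript. As a minimal sketch of pandas file I/O, assuming a small CSV with hypothetical `name`, `borough`, and `price` columns (none of these names come from the talk), reading and writing might look like the following. An in-memory `io.StringIO` buffer stands in for a file on disk; `read_csv` and `to_csv` also accept a file path in the same position.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = (
    "name,borough,price\n"
    "A,Manhattan,100\n"
    "B,Brooklyn,\n"
    "C,Queens,80\n"
)

# read_csv accepts either a path or a file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())

# Write the frame back out as CSV, leaving out the row index.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
```

The empty `price` cell is read in as a missing value (`NaN`), which is what the next slides deal with.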

Page 11: Analyzing data with python

Finding missing values

PANDAS
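
The slide's code is missing from the transcript. One common way to find missing values in pandas, sketched here on a hypothetical four-row frame (the column names and values are assumptions for illustration), is to combine `isnull()` with `sum()` or `any()`.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "borough": ["Manhattan", "Brooklyn", None, "Queens"],
    "price": [100.0, np.nan, 90.0, np.nan],
})

# Count of missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Rows that have at least one missing value.
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```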

Page 12: Analyzing data with python

Removing missing values

PANDAS
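
The original code for this slide is not in the transcript. A minimal sketch of the usual options, again on hypothetical data: drop incomplete rows with `dropna()`, or keep the rows and fill the gaps with `fillna()`. Filling with the column mean is only one possible choice and is used here purely as an illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "borough": ["Manhattan", "Brooklyn", None, "Queens"],
    "price": [100.0, np.nan, 90.0, np.nan],
})

# Drop every row that has any missing value.
complete_rows = df.dropna()

# Drop rows only when a specific column is missing.
priced_rows = df.dropna(subset=["price"])

# Keep all rows, replacing missing prices with the column mean.
filled = df.fillna({"price": df["price"].mean()})
```

Which option is appropriate depends on why the values are missing and how much data can be spared.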

Page 13: Analyzing data with python

Pivoting

PANDAS
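
The pivoting code from the slide is not preserved. As a sketch under assumed data (hypothetical `borough`, `year`, and `price` columns), `pivot_table` reshapes long-format rows into a grid with one row per borough and one column per year, aggregating duplicates with the mean.

```python
import pandas as pd

# Hypothetical long-format data: one row per sale.
df = pd.DataFrame({
    "borough": ["Manhattan", "Manhattan", "Brooklyn", "Brooklyn", "Brooklyn"],
    "year": [2013, 2014, 2013, 2014, 2014],
    "price": [100, 120, 80, 90, 110],
})

# One row per borough, one column per year, mean price in each cell.
table = df.pivot_table(index="borough", columns="year",
                       values="price", aggfunc="mean")
print(table)
```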

Page 14: Analyzing data with python

Other things
  Statistical methods
  Merge/join like SQL
  Time series
  Has some visualization functionality

PANDAS

Page 15: Analyzing data with python

Application of algorithms that learn from examples

Representation and generalization
Useful in everyday life
Especially useful in data analysis

MACHINE LEARNING

Page 16: Analyzing data with python

Supervised learning
  Classification and regression
Unsupervised learning
  Clustering and dimensionality reduction

MACHINE LEARNING

Page 17: Analyzing data with python

Machine learning module
Open-source
Built-in datasets
Good resources for learning

SCIKIT-LEARN

Page 18: Analyzing data with python

Scikit-learn: your data has to be numeric

Here’s what one observation/label looks like:

SCIKIT-LEARN

Page 19: Analyzing data with python

Transform categorical values/labels

SCIKIT-LEARN
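
The slide's code is not in the transcript. One way to turn categorical labels into numbers is scikit-learn's `LabelEncoder`; the sketch below uses made-up label strings, and whether the talk used this exact class is an assumption.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical labels.
labels = ["spam", "ham", "ham", "spam", "eggs"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

# classes_ holds the distinct labels in sorted order; each label's
# position in that array is its integer code.
print(list(encoder.classes_))
print(list(encoded))

# inverse_transform maps integer codes back to the original labels.
print(list(encoder.inverse_transform([0, 2])))
```

`LabelEncoder` is intended for target labels. For categorical feature columns, a one-hot encoding is often the safer choice because it does not imply an ordering between categories.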

Page 20: Analyzing data with python

Classification

SCIKIT-LEARN

Page 21: Analyzing data with python

Classification

SCIKIT-LEARN
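
The classification code from these two slides is not preserved. The following is a minimal sketch of a typical scikit-learn workflow, not the talk's own example: it assumes the built-in iris dataset and a k-nearest-neighbors classifier, and it imports `train_test_split` from `sklearn.model_selection`, which is where current scikit-learn releases keep it.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Built-in dataset: 150 flowers, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data to measure generalization.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Fit on the training split, predict on the held-out split.
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
print("accuracy:", accuracy)
```

Most scikit-learn estimators share the same `fit` and `predict` methods, so swapping in a different classifier usually changes only the line that constructs `model`.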

Page 22: Analyzing data with python

Other things
  Very comprehensive set of machine learning algorithms
  Preprocessing tools
  Methods for testing the accuracy of your model

SCIKIT-LEARN

Page 23: Analyzing data with python

Concerned with interactions between computers and human languages

Derive meaning from text
Many NLP algorithms are based on machine learning

NATURAL LANGUAGE PROCESSING

Page 24: Analyzing data with python

Natural Language Toolkit
Access to over 50 corpora
  Corpus: body of text
NLP tools
  Stemming, tokenizing, etc.
Resources for learning

NLTK

Page 25: Analyzing data with python

Stopword removal

NLTK

Page 26: Analyzing data with python

Stopword removal

NLTK
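
The stopword-removal code from these slides is missing from the transcript. The sketch below shows the idea in plain Python with a tiny hand-written stopword set, which is an assumption made so the example runs without downloads. With NLTK, the full English list is available as `nltk.corpus.stopwords.words("english")` once `nltk.download("stopwords")` has been run. The sample sentence is hypothetical.

```python
import string

# Tiny illustrative stopword set (an assumption, not NLTK's list).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "was", "be", "in", "it", "is"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, strip punctuation, and drop stopwords."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in stopwords]


tokens = remove_stopwords("Miss Taylor was the first to be wed.")
print(tokens)
```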

Page 27: Analyzing data with python

Stemming

NLTK
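
The stemming code from the slide is not preserved. A minimal sketch using NLTK's `PorterStemmer`, which needs no corpus download, is shown below; the word list is hypothetical, and whether the talk used the Porter stemmer specifically is an assumption.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Hypothetical tokens, e.g. the output of stopword removal.
words = ["running", "cats", "weddings", "fathers"]

# Reduce each word to its stem so related forms are counted together.
stems = [stemmer.stem(word) for word in words]
print(stems)
```

Stems are not always dictionary words. If readable output matters, lemmatizing is an alternative.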

Page 28: Analyzing data with python

Other things
  Lemmatizing, tokenization, tagging, parse trees
  Classification
  Chunking
  Sentence structure

NLTK

Page 29: Analyzing data with python

Data that takes too long to process on your machine
  Not “big data” but larger data
Solution: MapReduce!
  Processing large datasets with a parallel, distributed algorithm
  Map step
  Reduce step

PROCESSING LARGE DATA

Page 30: Analyzing data with python

Map step
  Takes a series of key/value pairs
  Ex. Word counts: break line into words, return word and count within line
Reduce step
  Once for each unique key: iterates through values associated with that key
  Ex. Word counts: returns word and sum of all counts

PROCESSING LARGE DATA
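
The map and reduce steps described above can be sketched in plain Python, without Hadoop or MRJob. The function names and the five sample lines are hypothetical. The grouping step in the middle, often called the shuffle, is normally handled by the framework and is written out here only to make the data flow visible.

```python
from collections import Counter, defaultdict


def map_step(line):
    """Emit (word, count-within-line) pairs for one line of text."""
    return list(Counter(line.split()).items())


def group_by_key(pairs):
    """Collect all values that share a key (the framework's shuffle)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(word, counts):
    """Called once per unique word: sum that word's per-line counts."""
    return word, sum(counts)


# Hypothetical stemmed lines.
lines = ["miss taylor miss", "taylor first wed", "first wed", "father", "father"]

mapped = [pair for line in lines for pair in map_step(line)]
reduced = dict(reduce_step(word, counts)
               for word, counts in group_by_key(mapped).items())
print(reduced)
```

Each line can be mapped independently and each key reduced independently, so both steps can be spread across machines.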

Page 31: Analyzing data with python

Write MapReduce jobs in Python
Test code locally without installing Hadoop
Lots of thorough documentation
A few things to know
  Keep everything in one class
  MRJob program in a separate file
  Output to new file if doing something like word counts

MRJOB

Page 32: Analyzing data with python

Stemmed file

Line 1: (‘miss’, 2), (‘taylor’, 1)
Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1)
And so on…

MRJOB

Page 33: Analyzing data with python

Map
  Line 1: (‘miss’, 2), (‘taylor’, 1)
  Line 2: (‘taylor’, 1), (‘first’, 1), (‘wed’, 1)
  Line 3: (‘first’, 1), (‘wed’, 1)
  Line 4: (‘father’, 1)
  Line 5: (‘father’, 1)

Reduce
  (‘miss’, 2)
  (‘taylor’, 2)
  (‘first’, 2)
  (‘wed’, 2)
  (‘father’, 2)

MRJOB

Page 34: Analyzing data with python

Let’s count all words in the Gutenberg file

Map step

MRJOB

Page 35: Analyzing data with python

Reduce (and run) step

MRJOB

Page 36: Analyzing data with python

Results
  Mapped counts reduced
  Key/val pairs

MRJOB

Page 37: Analyzing data with python

Other things
  Run on Hadoop clusters
  Can write highly complex jobs
  Works with Elasticsearch

MRJOB

Page 38: Analyzing data with python

The “final step”
Conveying your results in a meaningful way
Literally see what’s going on

DATA VISUALIZATION

Page 39: Analyzing data with python

2D visualization library
Very VERY widely used
Wide variety of plots
Easy to feed in results from other modules (like Pandas, scikit-learn, NumPy, SciPy, etc.)

MATPLOTLIB

Page 40: Analyzing data with python

Remember this?

MATPLOTLIB

Page 41: Analyzing data with python

Bar chart of distribution

MATPLOTLIB
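
The chart from this slide is not preserved. As a sketch of a bar chart of a label distribution, assuming a hypothetical list of class labels, the counts can be computed with `collections.Counter` and passed to `Axes.bar`. The `Agg` backend and the in-memory buffer are choices made so the example runs without a display; `savefig` also accepts a filename.

```python
import io
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # headless backend; no window is opened
import matplotlib.pyplot as plt

# Hypothetical class labels.
labels = ["ham", "spam", "ham", "eggs", "ham", "spam"]
counts = Counter(labels)

fig, ax = plt.subplots()
bars = ax.bar(list(counts.keys()), list(counts.values()))
ax.set_xlabel("Label")
ax.set_ylabel("Number of observations")
ax.set_title("Label distribution")

# Render to an in-memory PNG.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```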

Page 42: Analyzing data with python

Let’s graph our word count frequencies
(Hint: It’s a power law distribution!)

MATPLOTLIB

Page 43: Analyzing data with python

High frequency of low numbers, low frequency of high numbers

MATPLOTLIB
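
The plot itself is not in the transcript. One way to look at this shape, sketched with hypothetical word counts standing in for the MapReduce output, is to count how many distinct words occur once, twice, and so on, and draw that on log-log axes. A toy sample like this only illustrates the plotting calls; it does not demonstrate a power law by itself.

```python
import io
from collections import Counter

import matplotlib
matplotlib.use("Agg")  # headless backend; no window is opened
import matplotlib.pyplot as plt

# Hypothetical word counts (word -> number of occurrences).
word_counts = {
    "the": 50, "and": 30, "emma": 12, "miss": 5, "taylor": 3,
    "wed": 2, "first": 2, "father": 1, "govern": 1, "indulg": 1,
    "affection": 1,
}

# How many distinct words share each occurrence count.
count_frequency = Counter(word_counts.values())
xs = sorted(count_frequency)
ys = [count_frequency[x] for x in xs]

fig, ax = plt.subplots()
ax.loglog(xs, ys, marker="o", linestyle="none")
ax.set_xlabel("Occurrences of a word")
ax.set_ylabel("Number of distinct words")
ax.set_title("Word count frequencies")

buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```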

Page 44: Analyzing data with python

Other things
  Many different kinds of graphs
  Customizable
  Time series

MATPLOTLIB

Page 45: Analyzing data with python

Phew!
Which tool to choose depends on your needs
Workflow:
  Preprocess
  Analyze
  Visualize

WHAT NEXT?

Page 46: Analyzing data with python

Pandas: http://pandas.pydata.org/

scikit-learn: http://scikit-learn.org/

NLTK: http://www.nltk.org/

MRJob: http://mrjob.readthedocs.org/

matplotlib: http://matplotlib.org/

RESOURCES

Page 47: Analyzing data with python

Twitter: @sarah_guido

LinkedIn: https://www.linkedin.com/in/sarahguido

NYC Python: http://www.meetup.com/nycpython/

CONTACT ME!

Page 48: Analyzing data with python

Questions?

THE END!

