Date post: | 18-Aug-2015 |
Upload: | li-wei-yang |
DATA SCIENTIST’S DAILY LIFE
AGENDA
• Data scientist?
• Big data and data scientist
• Data scientist’s Toolbox
• Data is the biggest
Derive knowledge from big data, efficiently and intelligently
FROM BACKEND TO FRONTEND
https://doubleclix.wordpress.com/2012/12/15/what-or-who-is-a-data-scientist/
WHAT IS BIG DATA?
WHERE DOES THE DATA COME FROM?
• Web Log data
• Machine data
• Transactional data
• Social media data
• …
https://plus.google.com/+DigitalStrategyIE
A WEB SERVICE RECEIVES MORE THAN 50 GB OF LOG DATA PER DAY
TOTAL SPACE USED IN THE LAST THREE MONTHS: 4,500 GB
TOTAL SPACE USED IN THE LAST YEAR: 18,000 GB (17.6 TB)
• Data storage / backup
• 2 TB per HDD
• How do you store more than 2 TB of data?
• $0.3 USD per gigabyte
• Pay 900 USD just to KEEP the data, without doing anything else with it.
• Read/write speed
• Read: 131.6 MB/s / Write: 131.4 MB/s
• Reading just ONE day of data takes 393 s (about 6 min).
• A large number of transactions arriving at once
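The storage and read-time figures above can be checked with a little arithmetic. The sketch below only uses numbers quoted on the slides; the 360-day year (12 months of 30 days) is an assumption made here so that the yearly total matches the 18,000 GB figure, and the variable names are illustrative.

```python
# Back-of-the-envelope check of the storage and read-time figures.
DAILY_GB = 50                  # log volume per day ("more than 50 GB")
READ_MB_PER_S = 131.6          # sequential read speed of a single HDD
HDD_CAPACITY_GB = 2 * 1024     # one 2 TB disk

three_months_gb = DAILY_GB * 90
one_year_gb = DAILY_GB * 360   # assumption: 12 months of 30 days each
one_year_tb = one_year_gb / 1024

# Time to read one day of logs from a single disk, in seconds.
read_one_day_s = DAILY_GB * 1024 / READ_MB_PER_S

# Number of 2 TB disks needed for one year of logs (ceiling division).
disks_for_one_year = -(-one_year_gb // HDD_CAPACITY_GB)

print(f"3 months: {three_months_gb} GB")
print(f"1 year:   {one_year_gb} GB ({one_year_tb:.1f} TB)")
print(f"read one day from one disk: {read_one_day_s:.0f} s")
print(f"2 TB disks for one year: {disks_for_one_year}")
```

With exactly 50 GB per day the read time comes out slightly under the 393 s on the slide, which is consistent with a daily volume of "more than 50 GB". Either way, a single disk is a bottleneck for both capacity and throughput.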
HADOOP AND MAPREDUCE
HADOOP AND HDFS
http://www.fraudtechwire.com/f-level-guide-to-hadoop-hdfs/
“The world will change when data is distributed.”
– DISTRIBUTED ALGORITHM
MAPREDUCE
http://www.milanor.net/blog/?p=853
https://chamibuddhika.wordpress.com/2012/02/26/joins-with-map-reduce/
http://blog.agro-know.com/?p=3810
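The MapReduce idea pictured in the diagrams credited above can be sketched in plain Python. This is a single-process illustration of the map, shuffle and reduce steps on a word count, not Hadoop's actual API; all function names and the sample lines are hypothetical.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


sample_lines = ["big data big value", "data beats opinion"]
counts = word_count(sample_lines)
print(counts)
```

On a real cluster the map calls run in parallel on the nodes that hold the data blocks, and the shuffle moves the pairs across the network so that each reducer sees every value for its keys.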
PERFORMANCE OF HADOOP?
• Not good, but at least it runs.
• Counting 86,389,084 rows (one day of data) takes 39 s on 10 nodes, each with 64 GB RAM and 2 × 8-core E5 CPUs.
• How about 39 s × 30 days?
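A quick calculation answers the question above. It assumes the job scales linearly with the number of days, which is a simplification made here and not a measurement from the slides.

```python
# Rough scaling of the daily row count to a full month.
ROWS_PER_DAY = 86_389_084
SECONDS_PER_DAY_OF_DATA = 39   # measured on the 10-node cluster
DAYS = 30

rows_per_second = ROWS_PER_DAY / SECONDS_PER_DAY_OF_DATA

# Assumption: run time grows linearly with the amount of data.
month_seconds = SECONDS_PER_DAY_OF_DATA * DAYS
month_minutes = month_seconds / 60

print(f"throughput: {rows_per_second:,.0f} rows/s")
print(f"30 days: {month_seconds} s ({month_minutes} min)")
```

Roughly twenty minutes for a plain row count over one month is why the slide calls the performance "not good".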
BEFORE ANALYTICS…
EXTRACT TRANSFORM LOAD
http://www.wisdomjobs.com/e-university/data-warehouse-etl-toolkit-tutorial-201/surrounding-the-requirements-1319/architecture-8029.html
http://www.slideshare.net/capgemini/emc-world-2014-breakout-move-to-the-business-data-lake-not-as-hard-as-it-sounds
http://www.slideshare.net/hortonworks/modern-data-architecture-for-a-data-lake-with-informatica-and-hortonworks-data-platform
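A minimal sketch of the three ETL steps, assuming a small CSV log as the source and an in-memory SQLite database as the target. The column names, the sample rows and the table name `access_log` are hypothetical and only serve the illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; the second data row is deliberately malformed.
RAW_LOG = """timestamp,username,bytes
2015-08-18T10:00:00,alice,1200
2015-08-18T10:00:05,bob,not_a_number
2015-08-18T10:00:09,alice,800
"""


def extract(text):
    """Extract: parse the raw CSV text into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop malformed rows and keep (day, username, bytes)."""
    clean = []
    for row in rows:
        try:
            size = int(row["bytes"])
        except ValueError:
            continue  # skip rows whose byte count is not a number
        clean.append((row["timestamp"][:10], row["username"], size))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS access_log "
        "(day TEXT, username TEXT, bytes INTEGER)"
    )
    conn.executemany("INSERT INTO access_log VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_LOG)), conn)
total = conn.execute("SELECT SUM(bytes) FROM access_log").fetchone()[0]
print(total)
```

Keeping the three steps as separate functions makes it easy to schedule and rerun them independently, which is what ETL and job-scheduling tools do at a larger scale.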
DATA SCIENTIST’S TOOLBOX
LINUX
• The best server choice
• Free of charge and free as in freedom
• Easy to control system
• Easy data processing
• Hadoop is built for and mainly deployed on Linux
POWERFUL SHELL SCRIPTS
SQL DATABASE
• MySQL, PostgreSQL, Hive, MongoDB (NoSQL)
• Standard SQL Language
• Store and Manage data
RELATIONAL DATABASE
TABLE RELATIONS
https://cloudant.com/blog/foundbites-data-model-relational-db-vs-nosql-on-cloudant/
http://ghtorrent.org/relational.html
SQL SYNTAX
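To make the SQL syntax concrete, here is a small sketch using Python's built-in `sqlite3` module. The two tables, `users` and `orders`, and their rows are hypothetical; they illustrate a table relation (a foreign key) and a typical SELECT with JOIN, GROUP BY and ORDER BY.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Two related tables: every order points at a user through user_id.
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    user_id INTEGER REFERENCES users(id),
    amount REAL
);
INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 20.0), (3, 2, 5.0);
""")

# Join the tables, aggregate per user, and sort by total spend.
query = """
SELECT u.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
FROM users AS u
JOIN orders AS o ON o.user_id = u.id
GROUP BY u.name
ORDER BY total DESC
"""
rows = conn.execute(query).fetchall()
for name, n_orders, total in rows:
    print(name, n_orders, total)
```

The same query text would run with little or no change on MySQL, PostgreSQL or Hive, which is the practical benefit of a standard query language.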
R & PYTHON
• Basic Analysis Tools
• Easy to Learn
• Many Packages
• Example
• http://bryannotes.blogspot.tw/2014/08/r-ptt-wantedsocial-network-analysis.html
• http://bryannotes.blogspot.tw/2014/10/python-k-means-script.html
ETC…
• Excel
• Google Analytics
• Visualisation tools (Tableau)
• Web Crawler
• Version control management (Git)
• ETL and job scheduling tools (Jenkins)
• …
DATA IS THE BIGGEST
“Person who is better at statistics than any software engineer and better at software engineering than any statistician.”
– JOSH WILLS
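STATISTICS
WHY DO WE NEED MACHINE LEARNING?
• Clustering: how many groups can these people be divided into?
• Classification: which group does each person belong to?
• Regression: what is the probability that an event occurs, or that a person belongs to a given group?
• Dimensionality reduction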
CLUSTERING
http://simplystatistics.org/2014/02/18/k-means-clustering-in-a-gif/
source http://humble-developer.blogspot.tw/2011/01/kmeans-clustering-algorithm-part-1.html
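The k-means procedure animated in the sources credited above can be sketched in a few lines of plain Python. This is an illustrative version, not the script behind those figures: the sample points are made up, and the farthest-point initialisation is a choice made here to keep the result deterministic (random initialisation is more common).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def init_centroids(points, k):
    """Start from the first point, then repeatedly add the point that is
    farthest from all centroids chosen so far (an assumption made here)."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(sq_dist(p, c) for c in centroids)
        )
        centroids.append(farthest)
    return centroids


def kmeans(points, k, max_iter=100):
    centroids = init_centroids(points, k)
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            mean(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # assignments no longer change
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, clusters = kmeans(points, 2)
print(centroids)
print(clusters)
```

The algorithm alternates the assignment and update steps until the centroids stop moving. The number of clusters `k` has to be chosen by the analyst, which is the "how many groups" question from the previous slide.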
CLASSIFICATION
http://letsmakerobots.com/content/tcs3200-color-sensor-with-k-nearest-neighbor-classification-algorithm
http://www.astroml.org/sklearn_tutorial/
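One of the sources credited above classifies colour-sensor readings with k-nearest neighbours. A minimal sketch of that idea follows; the RGB training readings and labels are invented for the illustration and do not come from that project.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """Return the majority label among the k training samples
    closest to the query. `train` is a list of (features, label)."""
    by_distance = sorted(train, key=lambda item: sq_dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labelled RGB readings.
train = [
    ((255, 0, 0), "red"), ((250, 10, 5), "red"),
    ((0, 255, 0), "green"), ((10, 240, 20), "green"),
    ((0, 0, 255), "blue"), ((15, 5, 245), "blue"),
]

print(knn_predict(train, (200, 30, 30)))
print(knn_predict(train, (20, 20, 200)))
```

Unlike clustering, classification needs labelled examples: the model only answers "which known group does this new item belong to?".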
LOGISTIC REGRESSION
https://www.coursera.org/instructor/andrewng
COST FUNCTION
https://www.coursera.org/instructor/andrewng
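As a rough illustration of logistic regression and its cost function, the sketch below fits a one-feature model with batch gradient descent on the standard cross-entropy cost. The data set (hours studied versus pass/fail), the learning rate and the step count are assumptions chosen for the example, not values from the course material credited above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Estimated probability that the label is 1 for input x."""
    return sigmoid(w * x + b)


def cost(w, b, xs, ys):
    """Average cross-entropy between predictions and 0/1 labels."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = predict(w, b, x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(xs)


def fit(xs, ys, lr=0.1, steps=1000):
    """Batch gradient descent on the cost above, starting from zero."""
    w = b = 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [predict(w, b, x) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / m
        b -= lr * sum(errors) / m
    return w, b


# Hypothetical data: hours studied and whether the exam was passed.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

initial_cost = cost(0.0, 0.0, xs, ys)
w, b = fit(xs, ys)
final_cost = cost(w, b, xs, ys)
print(f"cost before: {initial_cost:.4f}, after: {final_cost:.4f}")
print(f"P(pass | 0.5 h) = {predict(w, b, 0.5):.2f}")
print(f"P(pass | 4.0 h) = {predict(w, b, 4.0):.2f}")
```

The cost penalises confident wrong predictions heavily, so minimising it pushes the predicted probabilities toward the observed labels.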
OVERFITTING
https://www.coursera.org/instructor/andrewng
OH MY GOD! HOW TO CHOOSE A
MACHINE LEARNING ALGORITHM?
http://amueller.github.io/sklearn_tutorial/
STATISTICS VS ML
STATISTICS
• Focus on understanding data in terms of models
• Interpretability, hypothesis testing
MACHINE LEARNING
• Focus on the analysis of learning algorithms
• Greater focus on prediction
SYSTEMATIZATION AND AUTOMATION
http://www.slideshare.net/CetasAnalytics/cetas-e-baymeetupprezofinal
http://mlg.postech.ac.kr/projects/
SHOW YOUR DATA AND FINDINGS
http://hortonworks.com/wp-content/uploads/2012/06/Tableau2.png
http://www.tableau.com
THE REAL CASE
HOW TO START?
• Codecademy http://www.codecademy.com/ Covers many programming languages, e.g. Python, JavaScript, and even shell scripting and SQL.
• Coursera https://www.coursera.org/ A well-known MOOC website for self-paced learning.
http://nirvacana.com/thoughts/becoming-a-data-scientist/