BIG DATA ANALYTICS TO THE MASSES
JOSE LUIS LÓPEZ PINO, DATA ENGINEER, GETYOURGUIDE
Big Data Analytics to the masses
Why it has failed and how we can fix it
Jose Luis Lopez Pino
Who am I?
BI Consultant
Large-Scale & Distributed
Founding
Data Engineer
Big Data is like Tourism
It seems easy to do
But if you aren’t an expert, you can’t make the most of it
Struggle to analyze Big Data
Harlan Harris, Sean Murphy, and Marck Vaisman. Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work. O’Reilly Media, Inc., 2013
Also: Sean Kandel, Andreas Paepcke, Joseph M Hellerstein, and Jeffrey Heer. Enterprise data analysis and visualization: An interview study. Visualization and Computer Graphics, IEEE Transactions
Tools
Volker Markl. Breaking the chains: On declarative data analysis and data independence in the big data era. Proceedings of the VLDB Endowment, 7(13), 2014
Tools (October 2014)
Original: Volker Markl. Breaking the chains: On declarative data analysis and data independence in the big data era. Proceedings of the VLDB Endowment, 7(13), 2014
Deep analytics
Libraries!
We need libraries...
Query languages
Write your own MR/RDD/Transformations
… comprehensive ones!
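To make the gap between "write your own transformations" and "use a library" concrete, here is a minimal Python sketch. It is not from the talk: the function name `mapreduce_variance` and the sample data are assumptions for illustration. It computes a population variance by hand in map/reduce style, then with a single standard-library call.

```python
from functools import reduce
import statistics


def mapreduce_variance(values):
    """Population variance written by hand in a map/reduce style (illustrative only)."""
    # Map: turn each value into a (count, sum, sum of squares) triple.
    mapped = map(lambda x: (1, x, x * x), values)
    # Reduce: add triples pairwise. The combine step is associative,
    # which is what would let a cluster run it in parallel.
    n, s, sq = reduce(
        lambda a, b: (a[0] + b[0], a[1] + b[1], a[2] + b[2]), mapped
    )
    mean = s / n
    # Variance from the aggregated moments: E[x^2] - E[x]^2.
    return sq / n - mean * mean


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # made-up sample data
by_hand = mapreduce_variance(data)
with_library = statistics.pvariance(data)  # the "library" route: one call
```

Both routes give the same number, but the hand-written version forces the analyst to think about moments and combiners instead of the analysis itself.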
Say it with memes!
When you do deep analytics in small data
using R and CRAN packages
When you do deep analytics in BIG data
using R and CRAN packages
When you try to program it using MapReduce
When you try to program it using Apache Spark / Apache Flink
When you try to use a library scalable to large data sets
Can’t we do it better?
- Make it similar to normal R programs.
- Hide complexity.
- Make file manipulation easier.
- Part of the computing in the cluster and part of the computing in the client.
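The goals above can be sketched in a few lines of Python. This is a hypothetical illustration, not the system presented in the talk: `CLUSTER_THRESHOLD`, `local_sum`, `cluster_sum`, and `total` are invented names, and the "cluster" is simulated in-process. The point is only that the user-facing call looks the same whether the work runs on the client or is split across workers.

```python
from typing import Sequence

# Hypothetical cut-off above which work is sent to the (simulated) cluster.
CLUSTER_THRESHOLD = 1000


def local_sum(values: Sequence[float]) -> float:
    """Runs entirely on the client."""
    return float(sum(values))


def cluster_sum(values: Sequence[float], partitions: int = 4) -> float:
    """Stand-in for a cluster job: partial sums per chunk, combined on the client."""
    size = max(1, len(values) // partitions)
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [sum(chunk) for chunk in chunks]  # would run on the workers
    return float(sum(partials))  # final combine happens on the client


def total(values: Sequence[float]) -> float:
    """User-facing call: same signature regardless of where the work runs."""
    backend = cluster_sum if len(values) >= CLUSTER_THRESHOLD else local_sum
    return backend(values)


small = total([1, 2, 3])           # handled locally
large = total(list(range(2000)))   # routed to the simulated cluster
```

The user only ever calls `total`; the choice of backend and the split between worker-side and client-side computation stay hidden behind it.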
Our approach
Behind the scenes: Before
Behind the scenes: After
Without writing significantly different code
Competitive with, or even faster than, native R code on small data
And it scales
Some relevant findings
- Transmission time was not significant.
- Stratosphere/Flink was competitive in highly iterative programs.
- We were not able to do it keeping the code 100% the same.
- Ensemble scenarios are the most exciting ones.
4 Takeaways from this talk
- We still need to bring Big Data to the right people in the right place.
- We need comprehensive libraries.
- We need to move data back and forth.
- Use a syntax that the users are familiar with.
That’s all!
- Have you found this talk interesting?
- Follow me: @jllopezpino
- Interested in a job as SEM Data Analyst (Berlin)?
- Ask me for the details:
- Are you interested in Data + Energy?
- Keep in touch:
17TH ~ 18TH NOV 2014, MADRID (SPAIN)