Date post: | 16-Jul-2015 |
Category: |
Technology |
Upload: | roi-blanco |
What is Big Data?
• A fashionable term used by some IT vendors to re-market old-fashioned hardware and software
• “The term itself is vague, but it is getting at something that is real… Big Data is a tagline for a process that has the potential to transform everything.” Jon Kleinberg
• What I want to talk about:
– Big Data science, cool use cases
– Access to data, tools to process the data (Hadoop and friends’ ecosystem)
– What’s next (now!)
Data?
• Advances in digital sensors, communications, computation, and storage have created huge collections of data, capturing information of value to business, science, government, and society.
• Example: search engine companies – transformed how people find and make use of information on a daily basis.
• Other forms of big data are transforming the activities of companies and scientific researchers.
• Machine learning on large data-sets for decision making, product shaping.
Motivation
• BIG DATA is an OPEN SOURCE Software Revolution
• BIG DATA Analytics 2.0
• What is happening right now
• Why do we need new tools?
• Improve decision making:
• Measure and react in REAL-TIME
Real Time Decision Making
Companies need to know:
• what is happening right now, in real time, to be able to
– react
– anticipate and detect new business opportunities.
Controversy of Big Data
• All data is BIG now
• Hype to sell Hadoop-based systems
• Ethical concerns about accessibility
• Limited access to Big Data creates new digital divides
Controversy of Big Data
• Statistical Significance:
– When the number of variables grows, the number of fake correlations also grows
– Leinweber: S&P 500 stock index correlated with butter production in Bangladesh
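The effect is easy to reproduce with pure noise. The sketch below is illustrative only: the function names, the sample size of 20 and the 0.5 correlation threshold are assumptions chosen for the demo, not values from the talk. It generates one random target series and counts how many unrelated random variables happen to correlate with it.

```python
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def count_spurious(num_variables, num_samples=20, threshold=0.5, seed=0):
    """Count random variables whose |correlation| with a random target
    reaches the threshold, although every series is independent noise."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(num_samples)]
    hits = 0
    for _ in range(num_variables):
        candidate = [rng.gauss(0, 1) for _ in range(num_samples)]
        if abs(pearson(target, candidate)) >= threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    for k in (10, 100, 1000, 10000):
        print(k, "variables ->", count_spurious(k), "fake correlations")
```

With a fixed seed the candidates are drawn in the same order, so testing more variables can only add fake correlations, never remove them.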
Need for Big Data
McKinsey Global Institute (MGI) Report on Big Data, 2011
• WEF defined data as an asset just like gold or currency
• Business opportunities to exploit by companies that can analyze information in the right way
• What do your customers need?
• What will they demand in the future?
Need for Big Data
• How do you know the investment was worth it?
• In successful cases, predictive analysis has led to income improvements of ~70%
McKinsey Global Institute (MGI) Report on Big Data, 2011
Data Analysis
• Most businesses still run on small data!
• Is more data always better?
– Hardly – past a certain point, the return on adding more data diminishes to the point that you’re only wasting time gathering more
• Do you need data?
– Of course – … but the right data (+ interpretation)
• Unbiased, context
• Big data is not a magic wand for inferring causality
• Most AI problems have been tackled from a data perspective
– Still, unsolved (Google’s cat detector).
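One way to see the diminishing return on more data, under the assumption (not from the talk) that the goal is to estimate a mean from independent samples: the standard error shrinks as sigma / sqrt(n), so every 100x increase in data buys only a 10x improvement in precision. The function name and sample sizes below are illustrative.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean of n independent samples
    drawn from a distribution with standard deviation sigma."""
    return sigma / math.sqrt(n)


if __name__ == "__main__":
    # Each step multiplies the data volume by 100.
    for n in (100, 10_000, 1_000_000):
        print(f"n = {n:>9,d}  standard error = {standard_error(1.0, n):.4f}")
```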
Why is interest in Machine Learning increasing?
• Data is everywhere
– Increasingly captured
– Increasingly comprehensive
• Storage is now much cheaper, and so is processing
– In-house Hadoop clusters
– Cloud-based processing (Amazon EC2)
• Data is important
– Machine learning provides an effective development methodology
– … when you cannot program a solution by hand
– … but you have data available
• Let the data figure out the program
• Any company with large data sets will have an interest
Big Data Challenges
“Fat” servers imply high cost
– use cheap commodity nodes instead
A large number of cheap nodes implies frequent failures
– leverage automatic fault-tolerance
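A back-of-the-envelope sketch of why many cheap nodes mean frequent failures. It assumes nodes fail independently and uses a hypothetical per-node daily failure probability of 0.1%; neither assumption comes from the talk. The chance that at least one of N nodes fails is 1 - (1 - p)^N.

```python
def prob_any_failure(num_nodes, p_node_fails):
    """Probability that at least one of num_nodes independent nodes fails,
    given each fails with probability p_node_fails in the same period."""
    return 1.0 - (1.0 - p_node_fails) ** num_nodes


if __name__ == "__main__":
    p = 0.001  # hypothetical daily failure probability of a single node
    for n in (1, 100, 1_000, 10_000):
        print(f"{n:>6,d} nodes -> P(at least one failure today) = "
              f"{prob_any_failure(n, p):.4f}")
```

At cluster scale a failure on any given day is close to certain under these assumptions, which is why recovery has to be automatic rather than an operator task.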
Big Data Challenges
We need a new data-parallel programming model for clusters of commodity machines
MapReduce
Published in 2004 by Google
– MapReduce: Simplified Data Processing on Large Clusters
Popularized by the Apache Hadoop project, started by Yahoo!
– Now used by virtually everybody else: Facebook, Twitter, Amazon, …
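A minimal single-process sketch of the programming model, using the classic word-count example. This is not Hadoop code: the function names are made up for illustration, and a real framework would run the map and reduce steps in parallel across machines and handle the shuffle, scheduling and failures itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters of cheap nodes"]))
```

The user writes only the map and reduce functions; everything between them is the framework's job, which is the "hide complexity" part of the model.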
Map Reduce Philosophy
– hide complexity
– make it scalable
– make it cheap
1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Should Move to Data
4. Simple Core, Modular and Extensible
Hadoop High-Level Architecture
• Name Node: maintains the mapping of file blocks to data node slaves
• Job Tracker: schedules jobs across task tracker slaves
• Data Node: stores and serves blocks of data
• Hadoop Client: contacts the Name Node for data or the Job Tracker to submit jobs
• Task Tracker: runs tasks (work units) within a job
• Data Node and Task Tracker share a physical node
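A toy in-memory model of the storage side of this architecture, to make the division of labour concrete. Everything here is a simplification for illustration: the class and function names are hypothetical, blocks are 4 characters long, placement is round-robin, and there is no replication. It is not the HDFS API.

```python
class DataNode:
    """Stores and serves blocks of data."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def serve(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Keeps only the mapping of file blocks to data nodes, never the data."""

    def __init__(self):
        self.block_map = {}  # filename -> ordered list of (block_id, node_id)

    def add_block(self, filename, block_id, node_id):
        self.block_map.setdefault(filename, []).append((block_id, node_id))

    def locate(self, filename):
        return list(self.block_map[filename])


def write_file(name_node, data_nodes, filename, data, block_size=4):
    """Client write: split into blocks, place them round-robin, record the map."""
    for offset in range(0, len(data), block_size):
        index = offset // block_size
        node = data_nodes[index % len(data_nodes)]
        block_id = f"{filename}#{index}"
        node.store(block_id, data[offset:offset + block_size])
        name_node.add_block(filename, block_id, node.node_id)


def read_file(name_node, data_nodes, filename):
    """Client read: ask the name node where blocks live, then fetch them."""
    by_id = {node.node_id: node for node in data_nodes}
    return "".join(by_id[node_id].serve(block_id)
                   for block_id, node_id in name_node.locate(filename))


if __name__ == "__main__":
    name_node = NameNode()
    data_nodes = [DataNode(0), DataNode(1)]
    write_file(name_node, data_nodes, "log.txt", "hello big data")
    print(name_node.locate("log.txt"))
    print(read_file(name_node, data_nodes, "log.txt"))
```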
Pig
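-- Load three integer columns from 'data'
A = LOAD 'data' USING PigStorage() AS (f1:int, f2:int, f3:int);
-- Group rows by f1, then count the tuples in each group
B = GROUP A BY f1;
C = FOREACH B GENERATE group, COUNT(A);
DUMP C;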
Pig: Similar to SQL
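To make the comparison concrete, here is roughly what the grouping-and-counting script above looks like as a SQL GROUP BY, run through Python's built-in sqlite3 module. The table name A and the sample rows are assumptions for the demo. The match is approximate: Pig's COUNT ignores tuples whose first field is null, while COUNT(*) counts every row.

```python
import sqlite3


def count_by_f1(rows):
    """Load (f1, f2, f3) rows into an in-memory table and count rows per f1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE A (f1 INTEGER, f2 INTEGER, f3 INTEGER)")
    conn.executemany("INSERT INTO A VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT f1, COUNT(*) FROM A GROUP BY f1 ORDER BY f1"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    sample = [(1, 2, 3), (1, 5, 6), (4, 7, 8)]  # made-up sample data
    print(count_by_f1(sample))
```

Pig spells out each step as a named relation (load, group, aggregate), where SQL states the whole result in one declarative query.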
HBase
• Apache HBase™ is the Hadoop database, a distributed, scalable, big key-value store
– Linear and modular scalability.
– Strictly consistent reads and writes.
– Automatic and configurable sharding of tables
– Failover support
– Interoperable with Java, Hadoop
Hive
• Apache project for querying and analyzing datasets in HDFS
– Tools to enable easy data extract/transform/load (ETL)
– A mechanism to impose structure on a variety of data formats
– Access to files stored either directly in Apache HDFS™ or in other data storage systems such as Apache HBase™
– Query execution via MapReduce
Future
• Process data fast enough
– BI analytics
• Key drivers: connected devices/services
– Tablets, smartphones, etc.
– Your data is “always connected to the cloud”
– Low latency (again)/enormous amount of data
• User data
– Categorize data to infer knowledge about a user
• Targeting, personalization
• 100B events per day
– ML: from information to knowledge
– Behavioral targeting (user features)
• How likely am I to be interested in fashion? For how long?
• Map to behavioral targeting categories, segment for targeting
Future (II)
• Data processed in batches
– There are gaps!
– Things you calculated half an hour ago
– OK for monthly reports, not for online near-real-time (NRT) prediction
– Think of GEO targeting
• You can’t go fast enough with MR
– From big long windows to small incremental iterations
– Micro-batches updating user knowledge
• Use cases
– Ad campaign allocation
• Delay between click and deducting budget from an advertiser (overspending)
– Personalization and targeting
• Y! Homepage
• Use every event on the stream to detect the interest
– How do we train machine learning models when the data is arriving non-stop?
• You want parameters to adapt, to change slowly
• Maybe 99% of the data is the same! Updating incrementally is better
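One common answer to training on non-stop data, offered as a sketch and not as the method used in the talk, is online stochastic gradient descent: the model sees each event once, nudges its parameters slightly, and never retrains from scratch. The class name, learning rate and toy event stream below are assumptions for illustration.

```python
import math


class OnlineLogisticRegression:
    """Logistic regression updated one event at a time with SGD."""

    def __init__(self, num_features, learning_rate=0.1):
        self.weights = [0.0] * num_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        # Numerically stable sigmoid for large positive or negative z.
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        e = math.exp(z)
        return e / (1.0 + e)

    def update(self, x, y):
        """Consume one event (features x, label y in {0, 1}) and adapt slowly."""
        error = self.predict_proba(x) - y
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error


if __name__ == "__main__":
    model = OnlineLogisticRegression(num_features=1)
    # Toy stream: a positive feature means a click, a negative one means no click.
    for _ in range(50):
        model.update([1.0], 1)
        model.update([-1.0], 0)
    print(model.predict_proba([1.0]), model.predict_proba([-1.0]))
```

A small learning rate keeps the parameters changing slowly, so repeated near-identical events reinforce the model without any single event dominating it.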
Beyond Hadoop
• YARN
– What if you just want to interact with the data in Hadoop?
• Hive (SQL-like), HBase (NoSQL) and Pig (scripted data access)
– Those apps are great but limited to running as a single application system with MapReduce at the core
– Spark (see below) and Storm have been ported to YARN already
• Streaming
– SAMOA
• RDDs
– Spark
• Shark (Hive on Spark)
• Analytics Architecture
– Visualization http://visualize.yahoo.com/mail/
Future Challenges for Big Data
• Evaluation
• Time evolving data
• Distributed mining
• Compression
• Visualization
• Hidden Big Data
Hadoop 2.0
• No longer “only” running MR jobs
– MR + low-latency and streaming processing
• Iterative processing
– Hold data in memory to re-process
• Figure out what to do with the data
– BI that wants to explore the data really fast
• Possible thanks to YARN + Storm (S4) + Spark + … ?
– 350PB of data
– >30K nodes with YARN
– 400K jobs per day (6 jobs/sec)
– 10M hours of compute with YARN
Future key take-aways
• Scalability
• Performance
• Flexibility
• Programming paradigms
– MAP/MAP/MAP .. OR REDUCE/REDUCE/REDUCE
Big Data Myths
• Big Data is new
• Big Data is objective
• Big Data doesn’t discriminate
• Big Data makes things smart
• Big Data is anonymous
• You can opt-out