Date posted: 21-Aug-2015
Category: Technology
Uploaded by: peter-morgan
Table of Contents
1. Definition and Overview
2. Data Sources
3. Databases
4. Data Analytics
Glossary
References
Four main components
• Data
– Structured and unstructured
• Databases
– Proprietary and open source
• Query language
– Querying the database
• Analytics
– Analysing the data
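The four components can be seen working together in a very small sketch. It uses Python's built-in sqlite3 module as a stand-in for a real database; the sensor readings, table name and column names are made up for illustration and are not from the slides.

```python
import sqlite3

# Data: hypothetical (city, temperature in Celsius) sensor readings.
readings = [
    ("London", 14.0),
    ("London", 16.0),
    ("Paris", 18.0),
    ("Paris", 22.0),
]

# Database: an in-memory SQLite database stands in for a real data store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (city TEXT, temp_c REAL)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)

# Query language: SQL selects and groups the stored rows.
rows = conn.execute(
    "SELECT city, AVG(temp_c) FROM readings GROUP BY city ORDER BY city"
).fetchall()
conn.close()

# Analytics: a simple summary computed from the query result.
average_by_city = dict(rows)
warmest_city = max(average_by_city, key=average_by_city.get)
print(average_by_city)
print(warmest_city)
```

At big data scale each step would be handled by a dedicated system, but the division of labour is the same.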
How big is big?
• Large data sets
– Greater than 1,000 Terabytes (1 Petabyte)?
– 1,000,000 Terabytes (1 Exabyte)?
• Excel 2013 can have 1,048,576 rows by 16,384 columns
– About 10 Gigabytes of data
• Only going to get bigger
– 90% of all data was produced in the past two years!
– Rate is increasing
• Recall
– Giga = 10⁹
– Tera = 10¹²
– Peta = 10¹⁵
– Exa = 10¹⁸
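The arithmetic behind these figures can be checked in a few lines. The sketch below uses the decimal prefixes listed on the slide and the Excel 2013 worksheet limits. The bytes-per-cell figure is an assumption chosen only for illustration, since real storage per cell depends on the content and the file format.

```python
# Decimal (SI) prefixes from the slide, expressed in bytes.
GIGA = 10**9
TERA = 10**12
PETA = 10**15
EXA = 10**18

# Excel 2013 worksheet limits quoted on the slide.
rows, columns = 1_048_576, 16_384
cells = rows * columns

# Assumption: a flat 1 byte per cell, purely for illustration.
# Real storage per cell varies widely with content and file format.
ASSUMED_BYTES_PER_CELL = 1
sheet_bytes = cells * ASSUMED_BYTES_PER_CELL

print(f"Cells in a full worksheet: {cells:,}")
print(f"Size at {ASSUMED_BYTES_PER_CELL} byte per cell: {sheet_bytes / GIGA:.1f} GB")
print(f"Terabytes in a petabyte: {PETA // TERA:,}")
print(f"Terabytes in an exabyte: {EXA // TERA:,}")
```

A full worksheet holds about 17 billion cells, so even at one byte per cell it lands in the same order of magnitude as the slide's figure of roughly 10 Gigabytes, and about five orders of magnitude short of a Petabyte.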
Where does the data come from?
• Science – particle physics, astrophysics
• Industry – oil, finance, telecom
– Actually all verticals
• Social – Facebook, LinkedIn, Twitter
• Medicine – genome, neuroscience
• Government – census, education, police
• Sports – statistics
• Environment – weather, sensors
Unstructured Data
• 80% of data is unstructured
• NoSQL
• Document based
– Documents
– Texts, tweets
– Emails
– Machine logs
– Blogs
– Web pages
– Photos
– Videos (YouTube)
• Graph based
– Social media sites
– Facebook has 1.1 billion users (MicroStrategy, July 27, 2013)
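The difference between the document-based and graph-based models is easiest to see in the shape of the data. The following is a minimal sketch in plain Python rather than the API of any particular NoSQL product; the tweet, the user names and the followers_of helper are all hypothetical.

```python
import json

# Document based: one self-describing record; fields can vary per document.
# The tweet below is made-up sample data.
tweet = {
    "user": "alice",
    "text": "Big data is getting bigger",
    "tags": ["bigdata", "analytics"],
}
tweet_json = json.dumps(tweet)      # one way a document could be serialised
restored = json.loads(tweet_json)

# Graph based: users are nodes, "follows" relationships are edges.
# The names are hypothetical.
follows = {
    "alice": {"bob", "carol"},
    "bob": {"carol"},
    "carol": set(),
}

def followers_of(user, graph):
    """Return the sorted list of users who follow `user`."""
    return sorted(u for u, followed in graph.items() if user in followed)

print(restored["tags"])
print(followers_of("carol", follows))
```

A document store is organised around whole records like the tweet, whereas a graph store is organised around relationships like the follows edges.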
Why do we need to use big data?
Use in public and private sector to:
• Make faster and more accurate business decisions
• Make accurate predictions
• Gain competitive advantage
• Implement smarter marketing – CRM
• Discover new opportunities
• Enhance Business Intelligence
• Enable fraud detection
• Reduce crime
• Improve scientific research
• Speed up analysis (up to real time)
– Weeks, days, minutes, seconds
Big Data Startup - Case Study
• Rocket Fuel
• No. 4 on Forbes' 2013 Most Promising Companies In America list
• Digital advertising startup
• Screens over 26 billion ads per day
• “Advertising that learns” big data platform
• Distributed planet-scale computing engine
• Hadoop implementation
• Founders from Yahoo!, Salesforce.com, DoubleClick
• Targeting algorithms use lifestyle, purchase intent and social data
Relational databases – SQL
Proprietary
• Oracle DB
• IBM DB2
• Microsoft SQL Server
• SAP
• EMC
Open Source
• MySQL
• PostgreSQL
• Drizzle
• Firebird
Non-relational databases – NoSQL
• BigTable – Google
• Cassandra – Facebook
• DynamoDB – Amazon
• HBase – Hadoop
• MongoDB – 10gen
• Neo4j – Neo Technology
• CouchDB – Apache
• Couchbase
• Riak – Basho
• Redis – Pivotal
Big Data Analytics - Incumbents
• Oracle – Exadata, Exalytics
• Microsoft – HDInsight, xVelocity
• IBM – Netezza, Cognos, BigInsights
• SAP – HANA, Business Objects
• EMC – Pivotal (Greenplum)
• HP – Vertica, HAVEn
• All run on Hadoop
Big Data Analytics – Pure Plays
• Pure plays – definition:
– Have been around for more than 20 years
– Purely data analytics companies
• Teradata – Aster
• SAS
• MicroStrategy
Big Data Analytics – New Entrants
• Hortonworks
• Cloudera
• MapR
• Acunu
• Pentaho
• Tableau
• Talend
• Splunk
(Some of) IBM’s Big Data Acquisitions
• Algorithmics – Oct 2011, $400 million
• OpenPages – Oct 2010, ?
• Netezza – Sept 2010, $1.7 billion
• SPSS – Jan 2010, $1.2 billion
• Cognos – Jan 2008, $4.9 billion
• About $10 billion in four years
http://en.wikipedia.org/wiki/List_of_mergers_and_acquisitions_by_IBM
Big Data Hadoop Stack
• Hadoop is the de facto big data operating system
• Based on designs published by Google (MapReduce and the Google File System) and developed further at Yahoo! (2005)
• It is distributed, open source and managed by the Apache Software Foundation
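Hadoop's original processing model is MapReduce, and the idea behind it can be sketched without a cluster. The example below is plain single-process Python, not the Hadoop API: the input strings and the function names map_phase, shuffle and reduce_phase are illustrative assumptions. On a real cluster the map and reduce steps would run in parallel on many machines.

```python
from collections import defaultdict

# Made-up input: each string stands in for a block of a file that a
# distributed file system would store on a different node.
blocks = [
    "big data is big",
    "data about data",
]

def map_phase(block):
    """Emit a (word, 1) pair for every word in one block."""
    return [(word, 1) for word in block.split()]

def shuffle(pairs):
    """Group the emitted values by key (the word)."""
    grouped = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped

def reduce_phase(grouped):
    """Sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

mapped = [pair for block in blocks for pair in map_phase(block)]
word_counts = reduce_phase(shuffle(mapped))
print(word_counts)
```

Because each block is mapped independently and each word is reduced independently, the work can be spread across as many machines as there are blocks and keys.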
Analytic Technologies
• A/B testing
• Genetic algorithms
• Machine learning
• Natural language processing
• Neural networks
• Pattern recognition
• Anomaly detection
• Decision trees
• Predictive modeling
• Regression analysis
• Sentiment analysis
• Signal processing
• Simulations
• Time series analysis
• Visualization
• Multivariate analysis
• Text analytics
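As a concrete taste of one item on this list, anomaly detection, here is a minimal sketch that flags values lying far from the mean. It uses a simple z-score rule, which is only one of many possible approaches; the transaction counts, the find_anomalies name and the threshold of 2 standard deviations are assumptions made for illustration.

```python
from statistics import mean, pstdev

# Made-up daily transaction counts; the spike at index 5 is deliberate.
daily_counts = [102, 98, 101, 99, 100, 180, 97, 103]

def find_anomalies(values, threshold=2.0):
    """Return (index, value) pairs more than `threshold` standard
    deviations away from the mean of `values`."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [
        (i, v) for i, v in enumerate(values)
        if abs(v - mu) / sigma > threshold
    ]

anomalies = find_anomalies(daily_counts)
print(anomalies)
```

With this sample data only the spike of 180 is flagged. In a fraud detection setting the same idea would be applied to far larger data sets and with more robust statistics.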
Glossary
• OLTP = Online Transaction Processing
• OLAP = Online Analytical Processing
• ODBC = Open Database Connectivity
• IMDB = In-Memory Database
• CRUD = Create, Read, Update, Delete
• ETL = Extract, Transform and Load
• CDO = Chief Data Officer
• NLP = Natural Language Processing
• GQL = Graph Query Language
• AaaS = Analytics as a Service
• EDW = Enterprise Data Warehouse
References
• MicroStrategy website, Michael Saylor presentation at MicroStrategy World 2013, 27 July 2013, http://www.microstrategy.com/
• Teradata website, www.teradata.com
• Wikipedia, http://en.wikipedia.org/wiki/
• Google Images, www.google.co.uk
• IBM website, www.ibm.com
• YouTube, www.youtube.com
• Hadoop, www.hortonworks.com