ISO/IEC JTC1 SC32N2383 1
BIG DATA & NEXT GENERATION ANALYTICS
Krishna Kulkarni
Keith W. Hare
ISO/IEC JTC1 SC32 Opening Plenary
May 27, 2013, Gyeongju, Korea
Introduction
• Goal of this talk is to provide additional input to the discussion.
• “Next Generation Analytics is essentially dealing with Big Data – with the same concepts for predictive analysis in discovering hidden patterns, discovering unknown correlations by analyzing huge volumes of transactional data and other untapped data (data mining, data warehouses, unstructured etc.), and, essentially using the same toolsets (NoSQL, Hadoop etc.).” Baba Piprani
Next Generation Analytics Goals
• Cost of acquiring and storing data is rapidly decreasing
• Enterprises are collecting huge amounts of extremely fine-grained data
• Enable enterprises to get newer actionable business insights from vast amounts of raw fine-grained data dramatically faster than is possible today
Sample use case – Retail
• Utilize transactional and query logs collected by retail companies
• Finer segmentation of customers for direct marketing campaigns
• Generate differentiated pricing structures
• Predict future customer demands
For retail use cases…
• Critical to narrow time gap between:
  • Data acquisition
  • And acting on a business decision based on the data
• Referred to as:
  • Near Real-time Business Analytics
  • Or Operational Business Intelligence
• For example, a retailer would
  • Decide on promotions for the next week based on the data collected during this week
  • For on-line stores, take action based on data even more quickly
  • Real-time marketing, e.g., as customers are walking down the street
Sample use case – Medical
• Cancer treatment regimen
  • 100% effective in 80% of the patients
  • Completely ineffective in 20% of patients
• Need to identify the 20%
  • Sufficient to identify correlations
  • Causations can come later
Requirements for Achieving Goals
• Handling diverse data formats/structures
• Handling high speed of data collection
• Analytics capability beyond what traditional business intelligence offers
• Low cost, highly scalable analytics platforms
• Heterogeneous infrastructure
Diversity of data
• Small fraction is in structured formats: relational, XML, etc.
• Fair amount is semi-structured: web logs, etc.
• Rest of the data is unstructured: text, photographs, etc.
Very difficult to implement a single data model that can handle the diversity
Velocity of data
• Continuously streaming data
  • Need to analyze data in-flight
  • Combine with data at-rest
• Need a good answer quickly
• A precisely correct answer
  • May not exist
  • May not be required
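A minimal sketch of combining data at-rest with data in-flight to get a good answer quickly. It seeds a running mean from stored values and refines it as each streamed value arrives, so an estimate is available at every step. The `RunningMean` class and the sample numbers are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunningMean:
    """Incrementally maintained mean (hypothetical helper for illustration)."""
    count: int = 0
    mean: float = 0.0

    @classmethod
    def from_at_rest(cls, values):
        """Seed the estimate from values that are already stored (data at-rest)."""
        estimate = cls()
        for value in values:
            estimate.update(value)
        return estimate

    def update(self, value):
        """Fold one in-flight value into the estimate and return the current mean."""
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


# Stored history (at-rest) followed by values arriving on a stream (in-flight).
estimate = RunningMean.from_at_rest([10.0, 20.0, 30.0])
stream_means = [estimate.update(v) for v in [40.0, 50.0]]
```

Each call to `update` costs constant time and memory, which is why this style of estimate suits continuously streaming data where a precisely correct answer is not required.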
Analytics capability
• Current technologies are not sufficient or are too static:
  • Business Intelligence (BI) techniques
  • Data Warehousing (static, batch-oriented style)
  • Built-in analytic functions in SQL
  • Data Mining
• “Machine learning” viewed as a key technology that will unlock novel insights in data
• Statistical packages
  • R (The R Project) – open source
  • SAS – proprietary
  • SPSS – proprietary
Effective use of machine learning tool kits requires an understanding of probability and statistics.
Significant challenges in identifying deep insights from data
• How to identify relevant fragments of data easily from a multitude of data sources?
• How to use data cleaning techniques across multiple data sources?
• How to sample results of a query progressively?
• How to obtain rich visualization?
Best successes so far have been vertically integrated machine learning software packages for use in specific use cases, e.g., detection of credit card fraud
Significant Challenges in Storing Data
• Next Generation Analytics operate on “Big Data”
• Data storage may span
  • Multiple servers
  • Multiple storage subsystems
  • Multiple data centers
• NoSQL databases often used to store “Big Data”
  • Large variety of products
  • Diverse sets of features
  • No standard interface
Low Cost, Highly Scalable Analytics Platforms
• Infrastructure based on MapReduce framework emerging as a popular retrieval and consolidation solution
• However, this infrastructure is very low-level
  • Responsibility for exploiting the platform is on the user
  • Lacks much of the maturity of the relational world
Integration with existing relational/BI platforms is a must for long-term success
Significant Challenges for Retrieving Data
• MapReduce
  • Framework for managing partitioned query & retrieval of distributed data
  • Retrieves data from distributed data stores and presents it to the analysis layer
  • Custom Map operation
  • Custom Reduce operation
  • No high-level declarative language
  • Languages specific to underlying data stores
No automated way to apply MapReduce to extremely complex questions
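The custom Map and Reduce operations listed above can be sketched in a few lines. This is a single-process illustration of the pattern only, not the API of Hadoop or any other framework; the function names and the sales records are hypothetical.

```python
from collections import defaultdict


def map_sale(record):
    """Custom Map: emit (key, value) pairs from one input record."""
    store, amount = record
    yield store, amount


def reduce_total(key, values):
    """Custom Reduce: combine all values that share a key."""
    return key, sum(values)


def run_map_reduce(partitions, map_fn, reduce_fn):
    """Map every record, group the emitted pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for partition in partitions:  # each partition stands in for one data store
        for record in partition:
            for key, value in map_fn(record):
                grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in sorted(grouped.items()))


# Two partitions of (store, amount) records, as if held on separate servers.
partitions = [
    [("north", 5), ("south", 7)],
    [("north", 3), ("east", 2)],
]
totals = run_map_reduce(partitions, map_sale, reduce_total)
```

Even for this simple total per store, the user writes both operations by hand, which illustrates the slide's point that there is no high-level declarative language.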
Summary
• Community experimentation and understanding are evolving rapidly
• Need a complete ecosystem to make this all work
• Standards are essential: niche solutions will lead to vendor lock-in
How the pieces fit together
[Diagram; box labels as extracted]
• Statistical Analysis Engine
• Machine Learning Engine
• Big Data
• NoSQL
• Relational
• XML
• Data Retrieval & Summary
• MapReduce
Sources
• Chaudhuri, S., "What next?: A half-dozen data management research goals for big data and the cloud", In Proceedings of the 31st Symposium on Principles of Database Systems, ACM, 2012.
• "Big Data Now: 2012 Edition", http://oreilly.com/data/radarreports/big-data-now-2012.csp
Additional Discussion…
• The following slides were incomplete and beyond the scope of this presentation, but worth preserving for future discussions.
Domain, Range, & Function
• In traditional mathematics
  • Given a domain and a function, solve for range
  • Given a domain and a range, identify a function, if it exists
• Example:
  • Given the set of pairs {(2,-3),(4,6),(3,-1),(6,6),(2,3)}
    • Domain of relation is set {2,3,4,6}
    • Range is {-3,-1,3,6}
    • Is the relation a function? The answer is no
      • One X value (2) produces 2 different Y values
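The example can be checked mechanically. The sketch below computes the domain and range of the relation and reports any X value that maps to more than one Y value; the helper name `analyze_relation` is hypothetical.

```python
from collections import defaultdict


def analyze_relation(pairs):
    """Return (domain, range, is_function, conflicts) for a set of (x, y) pairs."""
    images = defaultdict(set)
    for x, y in pairs:
        images[x].add(y)
    domain = set(images)
    value_range = {y for ys in images.values() for y in ys}
    # An x with more than one distinct y means the relation is not a function.
    conflicts = {x: sorted(ys) for x, ys in images.items() if len(ys) > 1}
    return domain, value_range, not conflicts, conflicts


pairs = [(2, -3), (4, 6), (3, -1), (6, 6), (2, 3)]
domain, value_range, is_function, conflicts = analyze_relation(pairs)
```

Here `conflicts` holds the X value 2 with its two Y values, -3 and 3, which is why `is_function` is false.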
In Analytics
• Determine the range, given a set of candidate domains
• Solve for function that will give range for candidate domains
National Security Example
• Range:
  • Find candidate national security issues related to attacks on American assets
• Candidate domains:
  • Banking records
  • Money flows
  • E-mail
  • Social media networks
  • Telephone calls
  • Reports from human intelligence
  • Satellite photos
• Find function(s) that use those domains to produce the range
• Data is always incomplete
Cancer Research Example
• Range
  • Identify patients who will not respond to specific treatment
• Domains
  • Genotype
  • Health history
  • Family history
  • Geology of residence
  • Work history
• Find function(s) that use those domains to produce the range
• Data is always incomplete