Data science innovations

Data Science Innovations: Natural Language Generation, Systems of Insight & Deep Learning

August 2017

Suresh Sood, PhD

@soody

linkedin.com/in/sureshsood

Areas for Conversation

Data Science

Data Science Innovation(s)

Democratisation of big data

Gartner & Forrester Trends

Natural Language Generation

Systems of Insight

Deep Learning

Woodside & Sood (2017), Vignettes in the two-step arrival of the internet of things and its reshaping of marketing management’s service-dominant logic, Journal of Marketing Management, Volume 33, Issue 1-2: The Internet of Things (IoT) and Marketing: The State of Play, Future Trends and the Implications for Marketing

Statistics, Data Mining or Data Science?

• Statistics

– precise deterministic causal analysis over precisely collected data

• Data Mining

– deterministic causal analysis over re-purposed, carefully sampled data

• Data Science

– trending/correlation analysis over existing data using the bulk of the population, i.e. big data

– extraction of actionable knowledge directly from data through a process of discovery, hypothesis, and hypothesis testing

Adapted from: NIST Big Data taxonomy draft report (see http://bigdatawg.nist.gov/show_InputDoc.php)

Useful References Big Data

• NIST Big Data Interoperability Framework (NBDIF) V1.0 Final Version (September 2015)

Big Data Definitions: http://dx.doi.org/10.6028/NIST.SP.1500-1

Big Data Taxonomies: http://dx.doi.org/10.6028/NIST.SP.1500-2

Big Data Use Cases and Requirements: http://dx.doi.org/10.6028/NIST.SP.1500-3

Big Data Security and Privacy: http://dx.doi.org/10.6028/NIST.SP.1500-4

Big Data Architecture White Paper Survey: http://dx.doi.org/10.6028/NIST.SP.1500-5

Big Data Reference Architecture: http://dx.doi.org/10.6028/NIST.SP.1500-6

Big Data Standards Roadmap: http://dx.doi.org/10.6028/NIST.SP.1500-7

• Apache Spark 2.1.0 Documentation

Machine Learning Library (MLlib) Guide http://spark.apache.org/docs/latest/ml-guide.html

GraphX Programming Guide http://spark.apache.org/docs/latest/graphx-programming-guide.html

SparkR (R on Spark) http://spark.apache.org/docs/latest/sparkr.html#sparkdataframe

Spark SQL, DataFrames and Datasets Guide http://spark.apache.org/docs/latest/sql-programming-guide.html

Data Science Innovation

Data science innovation is something an organization has not done before or even something nobody anywhere has done before. A data science innovation focuses on discovering and using new or untraditional data sources to solve new problems.

Adapted from: Franks, B. (2012), Taming the Big Data Tidal Wave, p. 255, John Wiley & Sons

Data Science Algorithms

Companies are reimagining Business Processes with Algorithms and there is “evidence of significant, even exponential, business gains in customer’s customer engagement, cost & revenue performance”

Wilson, H., Alter A. and Shukla, P. (2016), Companies Are Reimagining Business Processes with Algorithms, Harvard Business Review, February

Variety of Data Types & Big Data Challenge

1. Astronomical

2. Documents

3. Earthquake

4. Email

5. Environmental sensors

6. Fingerprints

7. Health (personal) Images

8. Graph data (social network)

9. Location

10. Marine

11. Particle accelerator

12. Satellite

13. Scanned survey data

14. Sound

15. Text

16. Transactions

17. Video

Big Data consists of extensive datasets primarily in the characteristics of volume, variety, velocity, and/or variability that require a scalable architecture for efficient storage, manipulation, and analysis.

Computational portability is the movement of the computation to the location of the data.

http://www.trendhunter.com/id/273409

• The data collected in a single day would take nearly two million years to play back on an MP3 player

• Generates enough raw data to fill 15 million 64GB iPods every day

• The central computer has the processing power of about one hundred million PCs

• Uses enough optical fiber linking up all the radio telescopes to wrap twice around the Earth

• The dishes, when fully operational, will produce 10 times the global internet traffic as of 2013

• The supercomputer will perform 10^18 operations per second, equivalent to the number of stars in three million Milky Way galaxies, in order to process all the data produced

• Sensitivity to detect an airport radar on a planet 50 light years away

• Thousands of antennas with a combined collecting area of 1,000,000 square meters (1 sq km)

• Previous mapping of the Centaurus A galaxy took a team 12,000 hours of observations and several years; SKA ETA 5 minutes!

To the scientists involved, however, the SKA is no testbed, it’s a transformative instrument which, according to Luijten, will lead to “fundamental discoveries of how life and planets and matter all came into existence. As a scientist, this is a once in a lifetime opportunity.”

Sources: http://bit.ly/amazin-facts & http://bit.ly/astro-ska

Galileo

Square Kilometer Array Construction (SKA1 - 2018-23; SKA2 - 2023-30)

Centaurus A

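The following BigQuery query selects GDELT Global Knowledge Graph records whose themes mention the Islamic State together with terror or suicide-attack themes (note that the wildcard on "TAX_WEAPONS_SUICIDE_" catches suicide vests, suicide bombers, suicide bombings, suicide jackets, and so on):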

SELECT DATE, DocumentIdentifier, SourceCommonName, V2Themes, V2Locations, V2Tone, SharingImage, TranslationInfo
FROM [gdeltv2.gkg]
-- the record must mention the group under any of its theme codes
WHERE (V2Themes LIKE '%TAX_TERROR_GROUP_ISLAMIC_STATE%'
    OR V2Themes LIKE '%TAX_TERROR_GROUP_ISIL%'
    OR V2Themes LIKE '%TAX_TERROR_GROUP_ISIS%'
    OR V2Themes LIKE '%TAX_TERROR_GROUP_DAASH%')
-- and must also carry a terror or suicide-attack theme
  AND (V2Themes LIKE '%TERROR%TERROR%'
    OR V2Themes LIKE '%SUICIDE_ATTACK%'
    OR V2Themes LIKE '%TAX_WEAPONS_SUICIDE_%')
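To see which records the theme filter in the query above would keep, the LIKE patterns can be emulated locally. The following is a minimal Python sketch, assuming standard SQL LIKE semantics (% matches any run of characters, _ matches exactly one). The helper names and the sample theme strings in the usage note are hypothetical and are not taken from GDELT.

```python
import re

def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern into an anchored regular expression."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")          # % matches any run of characters
        elif ch == "_":
            parts.append(".")           # _ matches exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("^" + "".join(parts) + "$", re.DOTALL)

# Patterns copied from the WHERE clause of the query.
GROUP_PATTERNS = [
    "%TAX_TERROR_GROUP_ISLAMIC_STATE%",
    "%TAX_TERROR_GROUP_ISIL%",
    "%TAX_TERROR_GROUP_ISIS%",
    "%TAX_TERROR_GROUP_DAASH%",
]
ATTACK_PATTERNS = [
    "%TERROR%TERROR%",
    "%SUICIDE_ATTACK%",
    "%TAX_WEAPONS_SUICIDE_%",
]

def matches_any(themes: str, patterns) -> bool:
    """True if the themes string satisfies at least one LIKE pattern."""
    return any(like_to_regex(p).match(themes) for p in patterns)

def row_selected(themes: str) -> bool:
    """Mirror the query logic: (any group pattern) AND (any attack pattern)."""
    return matches_any(themes, GROUP_PATTERNS) and matches_any(themes, ATTACK_PATTERNS)
```

For example, row_selected("TAX_TERROR_GROUP_ISIL,10;TAX_WEAPONS_SUICIDE_BOMBER,77;") is true, while a string that carries only a group theme and an unrelated theme is rejected.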

The GDELT Project pushes the boundaries of “big data,” weighing in at over a quarter-billion rows with 59 fields for each record, spanning the geography of the entire planet, and covering a time horizon of more than 35 years. The GDELT Project is the largest open-access database on human society in existence. Its archives contain nearly 400M latitude/longitude geographic coordinates spanning over 12,900 days, making it one of the largest open-access spatio-temporal datasets as well.

GDELT + BigQuery = Query The Planet

Oil reserves shipment monitoring

Ras Tanura Najmah compound, Saudi Arabia

Source: http://www.skyboximaging.com/blog/monitoring-oil-reserves-from-space

https://nodexl.codeplex.com/

Sherman and Young (2016), When Financial Reporting Still Falls Short, Harvard Business Review, July-August

Sood (2015), Truth, Lies and Brand Trust The Deceit Algorithm, http://datafication.com.au/

New Analytical Tools Can Help

Deception Algorithm

(1) Self words, e.g. “I” and “me”, decrease when someone distances themselves from content

(2) Exclusive words, e.g. “but” and “or”, decrease with fabricated content owing to the complexity of maintaining deception

(3) Negative emotion words, e.g. “hate”, increase in usage owing to feelings of shame or guilt

(4) Motion verbs, e.g. “go” or “move”, increase as exclusive words go down, to keep the story on track
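The four cues above can be counted directly from text. The following is a minimal Python sketch, not the author's algorithm. Only “I”, “me”, “but”, “or”, “hate”, “go” and “move” come from the slide; the other words in each list, and the unweighted combination into a single score, are assumptions made for illustration.

```python
import re
from collections import Counter

# Illustrative word lists. Only "i", "me", "but", "or", "hate", "go" and "move"
# come from the slide; the remaining entries are assumptions.
CUE_WORDS = {
    "self": {"i", "me", "my", "mine", "myself"},
    "exclusive": {"but", "or", "except", "without"},
    "negative_emotion": {"hate", "angry", "afraid", "worthless"},
    "motion": {"go", "move", "walk", "carry"},
}

def cue_rates(text: str) -> dict:
    """Return the share of tokens that fall into each cue category."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    counts = Counter()
    for tok in tokens:
        for cue, words in CUE_WORDS.items():
            if tok in words:
                counts[cue] += 1
    return {cue: (counts[cue] / total if total else 0.0) for cue in CUE_WORDS}

def deception_score(text: str) -> float:
    """Cues said to rise with deception minus cues said to fall (unweighted, an assumption)."""
    r = cue_rates(text)
    return (r["negative_emotion"] + r["motion"]) - (r["self"] + r["exclusive"])
```

A higher score only means that the text leans towards the cue profile described above. Scores are best compared across texts of similar length and topic rather than read as absolute values.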

Language on Twitter Tracks Rates of Coronary Heart Disease, Psychological Science, January 2015

The findings show that expressions of negative emotions such as anger, stress, and fatigue in the tweets from people in a given county were associated with higher heart disease risk in that county. On the other hand, expressions of positive emotions like excitement and optimism were associated with lower risk.

The results suggest that using Twitter as a window into a community’s collective mental state may provide a useful tool in epidemiology…So predictions from Twitter can actually be more accurate than using a set of traditional variables.

http://www.analyzewords.com

2017 Hype Cycle for Data Science and Machine Learning, 29 July, http://www.gartner.com/document/3772081

Gartner (2017)

Strategic Predictions for 2017 and Beyond, research note, 14 October, http://www.gartner.com/document/3471568

By 2020-22:

100 million consumers shop in augmented reality

30% of web browsing sessions without a screen

Algorithms positively alter behavior of over 1B

Blockchain-based business worth $10B

IoT will save consumers/businesses $1T a year

40% of employees cut healthcare costs via fitness tracker

Smart Data Discovery Will Enable New Class of Citizen Data Scientist

“With the addition of NLG [Natural Language Generation], smart data discovery platforms automatically present a written or spoken context-based narrative of findings in the data that, alongside the visualization, inform the user about what is most important for them to act on in the data.”

Gartner, 29 June, 2015
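A written narrative of findings can be approximated with simple templates. The following is a minimal Python sketch of template-based NLG, not a description of any vendor's platform. The function names, the 1% "flat" threshold and the ranking by size of change are assumptions, and each previous value is assumed to be non-zero.

```python
def pct_change(previous: float, current: float) -> float:
    """Percentage change between two periods (previous is assumed non-zero)."""
    return (current - previous) / previous * 100.0

def describe_change(metric: str, previous: float, current: float, period: str = "month") -> str:
    """Render one finding as a sentence."""
    change = pct_change(previous, current)
    if abs(change) < 1:                      # assumed threshold for "no real change"
        trend = "held broadly flat"
    elif change > 0:
        trend = f"rose {change:.1f}%"
    else:
        trend = f"fell {abs(change):.1f}%"
    return f"{metric} {trend} versus the previous {period} ({previous:,.0f} to {current:,.0f})."

def narrative(findings, period: str = "month") -> str:
    """Order findings by size of change so the most important one is read first."""
    ranked = sorted(findings, key=lambda f: abs(pct_change(f[1], f[2])), reverse=True)
    return " ".join(describe_change(m, p, c, period) for m, p, c in ranked)
```

Calling narrative([("Orders", 1000, 1200), ("Returns", 200, 150), ("Visits", 5000, 5020)]) leads with the Returns sentence because it has the largest percentage change. That ordering is one simple way to surface "what is most important" to the reader.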

Systems of Insight

Automated pattern extraction

Outlier detection

Correlation

Time series

Analytics integration with process, app or IoT

https://ubereats.com/melbourne/

Forrester Research, 2016

[Figure: Traditional analytics versus System of Insight (SoI)]

Traditional Analytics (led by Data Analyst or Scientist): Slow & Expensive, 80% of time sifting through data. Stages shown: Data Aggregation or Data Set; Reports & Analysis; Visualisation & Interpretation; Write Data/Business “Story”; Insights

System of Insight (SoI) (SME owner, Machine Learning and Natural Language Generation): Fast & Cost Effective, 80% of time in decision making with client. Stages shown: Data Aggregation; Detect & Extract Patterns and Relationships; Generate Insights & Story; Operationalise (Process, Application, IoT)

Fusion of data science, business knowledge & creativity for maximum ROI

Outlier-detection techniques “allow detecting a significant fraction of fraudulent cases…different in nature from historical fraud…resulting in a novel fraud pattern”

Baesens, B., Van Vlasselaer, V., and Verbeke, W., 2015, Fraud Analytics Using Descriptive, Predictive, and Social Network Techniques: A Guide to Data Science for Fraud Detection, Wiley
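Outlier detection needs no labelled history of fraud, which is why it can surface novel patterns. The following is a minimal Python sketch using a robust z-score based on the median absolute deviation (MAD). It is one simple technique chosen for illustration and is not necessarily the method used in the book cited above. The function name, the 3.5 threshold and the sample amounts are assumptions.

```python
from statistics import median

def mad_outliers(values, threshold: float = 3.5):
    """Return the indexes of values whose robust z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)   # median absolute deviation
    if mad == 0:
        return []                                # no spread, so nothing can be flagged
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score
    # when the data are roughly normal.
    return [i for i, v in enumerate(values) if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical transaction amounts; the seventh value is deliberately extreme.
amounts = [42.0, 38.5, 45.2, 40.1, 39.9, 41.3, 980.0, 43.7]
suspicious = mad_outliers(amounts)
```

The median and MAD are used instead of the mean and standard deviation because a single extreme value barely moves them, so the extreme value cannot mask itself.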

Online tenure leads to more spending per customer

High engagement leads to more orders, more categories purchased, and more spend

https://www.quillengage.com

Better customer experiences . . .

. . . and half the inventory-carrying costs of other online fashion retailers.

Forrester, 2016

The ANZ Heavy Traffic Index comprises flows of vehicles weighing more than 3.5 tonnes (primarily trucks) on 11 selected roads around NZ. It is contemporaneous with GDP growth.

The ANZ Light Traffic Index is made up of light or total traffic flows (primarily cars and vans) on 10 selected roads around the country. It gives a six month lead on GDP growth in normal circumstances (but cannot predict sudden adverse events such as the Global Financial Crisis).

ANZ Truckometer: http://www.anz.co.nz/about-us/economic-markets-research/truckometer/
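A lead such as the six months described for the Light Traffic Index can be checked by correlating an indicator with a later value of the target series. The following is a minimal Python sketch with synthetic numbers, not ANZ or GDP data. It assumes quarterly observations, so a six-month lead corresponds to a lag of 2, and the function names are hypothetical.

```python
from math import sqrt

def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def lagged_correlation(indicator, target, lag: int) -> float:
    """Correlate indicator[t] with target[t + lag]."""
    if lag == 0:
        return pearson(indicator, target)
    return pearson(indicator[:-lag], target[lag:])

def best_lead(indicator, target, max_lag: int) -> int:
    """Lag (in periods) at which the indicator correlates most strongly with the target."""
    return max(range(max_lag + 1), key=lambda k: lagged_correlation(indicator, target, k))

# Synthetic quarterly series: the target is built to follow the indicator two periods later.
light_traffic = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 3.0, 7.0, 5.0, 8.0]
gdp_growth = [0.0, 0.0] + [0.5 * v for v in light_traffic[:-2]]
lead_in_quarters = best_lead(light_traffic, gdp_growth, max_lag=3)
```

On real data the correlations at different lags are usually close to one another, so the lag chosen this way should be treated as a starting point for further testing rather than a finding.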

Systems of Insight

• Helps move away from “crisis levels” in talent

• Traditional 5-step analytics process reduced to 2 steps, from data to action

• Reimagine business processes through “machine engineering”

• Minimise messy data issues and data preparation time

Deep Learning Libraries, Platforms, APIs and Hardware

Next Step

Start using Data Science Innovations

Systems of Insight and innovative data sources

Natural Language Generation

Deep Learning

Data Science Resources

The future is impossible to predict.

However, one thing is certain:

The company that can excite its customers’ dreams is out ahead in the race to business success

Selling Dreams, Gian Luigi Longinotti

Date post:	22-Jan-2018
Category:	Education
Upload:	suresh-sood