
BIG DATA & DATA MINING

LECTURE 3, 7.9.2015

INTRODUCTION TO COMPUTATIONAL SOCIAL SCIENCE (CSS01)

LAURI ELORANTA

• LECTURE 1: Introduction to Computational Social Science [DONE]

• Tuesday 01.09. 16:00 – 18:00, U35, Seminar room 114

• LECTURE 2: Basics of Computation and Modeling [DONE]

• Wednesday 02.09. 16:00 – 18:00, U35, Seminar room 113

• LECTURE 3: Big Data and Information Extraction [TODAY]

• Monday 07.09. 16:00 – 18:00, U35, Seminar room 114

• LECTURE 4: Network Analysis

• Monday 14.09. 16:00 – 18:00, U35, Seminar room 114

• LECTURE 5: Complex Systems

• Tuesday 15.09. 16:00 – 18:00, U35, Seminar room 114

• LECTURE 6: Simulation in Social Science

• Wednesday 16.09. 16:00 – 18:00, U35, Seminar room 113

• LECTURE 7: Ethical and Legal issues in CSS

• Monday 21.09. 16:00 – 18:00, U35, Seminar room 114

• LECTURE 8: Summary

• Tuesday 22.09. 17:00 – 19:00, U35, Seminar room 114

LECTURE SCHEDULE

• PART 1: BIG DATA DEFINED

• PART 2: DATA MINING PROCESS

• PART 3: WHERE TO GET DATA

• PART 4 : DATA VISUALIZATION

LECTURE 3 OVERVIEW

BIG DATA DEFINED

• The term big data is used quite loosely, with various definitions depending on the context

• Typically big data is misunderstood to refer only to big volumes of data

• One of the most used definitions in the field of IT is by Gartner:

“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (Gartner 2014.)

• Gartner analyst Doug Laney introduced the 3Vs concept in a 2001 MetaGroup research publication, 3D data management: Controlling data volume, variety and velocity.

BIG DATA DEFINED

(Gartner 2014.)

• Called the three “V”s of Big Data

1. Volume refers to the big quantities of data

2. Velocity refers to the typically high speed at which data is generated

3. Variety refers to different kinds and types of data

• Other Vs suggested as well: Variability, Veracity

VOLUME, VELOCITY & VARIETY

(Gartner 2014.)

• “Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value.”

• (De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A consensual definition and a review of key research topics. 4th International Conference on Integrated Information, Madrid)

DE MAURO, GRECO & GRIMALDI 2014: DEFINITION

• Strong instrumental component in relation to how you get “value” out of big data

• Answering research questions

• Answering business problems

• Instead of just one particular technology, big data also refers to a large set of different technologies used in various ways

BIG DATA IS ABOUT USING BIG DATA

(Sicular 2013.)

• “Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.” (IBM 2014a.)

• Underlines the volume component of big data.

IBM’S DEFINITION

IBM’S FOUR VS

(IBM 2014b.)

• E.g. 7 views from Elliott 2013:

• Big Data as

1. Volume, Velocity & Variety (dictionary definition)

2. Set of technologies and tools

3. Set of different categories and types of data

4. Means of predicting the future (big data as signals)

5. New possibilities that were previously impossible (value)

6. Metaphor for a global neural network (combining all data)

7. As a capitalist/neoliberal concept (critical view)

MANY VIEWPOINTS TO BIG DATA

(Elliott 2013)

• Lately in social sciences big data has been defined either in quite vague terms or by underlining only the volume component of big data

• ”Big Data, that is, data that are too big for standard database software to process, or the more future-proof, ‘capacity to search, aggregate, and cross-reference large data sets’.” (Eynon 2013.)

• “Today, our more-than-ever digital lives leave significant footprints in cyberspace. Large scale collections of these socially generated footprints, often known as big data --“ (Yasseri & Bright 2013.)

• "These emitted shadows of ‘big data’ can take a variety of forms, but most are manifestations or byproducts of human/machine interactions in code/spaces and coded spaces. We now see hundreds of millions of connected people, billions of sensors, and trillions of communications, information transfers, and transactions producing unfathomably large data shadows --" (Graham 2013.)

TYPICALLY NO COMMON DEFINITION IN SOCIAL SCIENCE RESEARCH

DATA MINING PROCESS

• The data mining process aims at answering research questions based on large sets of data (in other words, big data)

• New insights and information are “mined” from the data with automated computation

• For a variety of research purposes with many different kinds of data

• Long traditions: Quantitative content analysis and register-based research, for example, could be seen as forms of data mining

• NOTE! To be specific, in computer science the term data mining refers only to the pre-processing and analysis part of the whole process

DATA MINING PROCESS IN CSS

1. Formulating research questions

2. Selecting source raw data

3. Gathering source raw data

4. Preprocessing

5. Analysis

6. Communication

(Cioffi-Revilla 2014.)

• Everything starts with a research question

• Three main types of research questions in relation to data

• 1. Inductive = Data-driven. The data tells something new.

• 2. Deductive = Theory-driven. The data tells something about a theory. E.g. data can be used to test hypotheses.

• 3. Abductive = Mixed model, in between inductive and deductive research

RESEARCH QUESTIONS IN DATA MINING

(Cioffi-Revilla 2014.)

• Main guiding factor: the research question

• Not just text: many different forms of data

• Text / Numeric data

• Images

• Video

• Audio

• Sensor-data

• Register data

• Where to get the data?

• Data and its selection come with many problems: ethics, legality, privacy, public vs. private. (These matters will have a lecture of their own.)

SELECTING AND GATHERING RAW DATA

(Cioffi-Revilla 2014.)

• Data needs to be pre-processed before it can be analyzed: typically this can take a very big part of the data mining process

• Cioffi-Revilla 2014 mentions these (mainly from a textual content analysis perspective):

• Scanning = generating machine-readable files

• Cleaning = making the data set more concise (removing unnecessary noise)

• Filtering = there may be a need to filter the data based on some rules or categories even before the analysis

• Reformatting = changing the structure of the data, for example dividing data into smaller parts

• Content proxy extraction = extracting the proxies in text that denote latent entities

PREPROCESSING DATA

(Cioffi-Revilla 2014.)
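The cleaning, filtering and reformatting steps above can be sketched in a few lines of Python. This is a minimal illustration with invented documents and an invented stopword list, not a full preprocessing pipeline:

```python
import re

def preprocess(documents, stopwords):
    """Clean, filter and reformat raw text documents for analysis."""
    processed = []
    for doc in documents:
        # Cleaning: lowercase the text and strip punctuation (noise)
        text = re.sub(r"[^\w\s]", "", doc.lower())
        # Reformatting: split the document into word tokens
        tokens = text.split()
        # Filtering: drop stopwords that carry little content
        tokens = [t for t in tokens if t not in stopwords]
        processed.append(tokens)
    return processed

docs = ["The data, it seems, is BIG!", "Mining the data..."]
print(preprocess(docs, stopwords={"the", "it", "is"}))
# → [['data', 'seems', 'big'], ['mining', 'data']]
```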

• This is the main automated information extraction part: data is “mined” to reveal new information

• Many different analysis method classes, typically combining techniques from statistics, machine learning, artificial intelligence and database systems.

• Main types of analysis (according to Fayyad et al. 1996): Classification, Clustering, Regression Analysis, Summarization, Dependency Modeling, Anomaly Detection

• There are many, many others, which can be seen as combining and mixing the main types given above

DATA ANALYSIS

(Fayyad et al. 1996)

• Classification maps (classifies) data items into one or several predefined classes

• Classification algorithms are learning algorithms in the sense that they need a data set that defines how to categorize the data: thus, one needs to teach the classification algorithm what classes to look for

• For example

• Classification of images into different categories

• Classification of news items into different categories

• Classification of email into spam and normal mail

CLASSIFICATION

(Fayyad et al. 1996)
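As a toy version of the spam example above, the following Python sketch “teaches” a classifier from a handful of labeled training texts and then assigns a new message to the class whose training vocabulary it overlaps most. The training sentences are invented, and a real classifier (e.g. naive Bayes) would weigh words probabilistically rather than by raw counts:

```python
from collections import Counter

def train(examples):
    """Learn word counts per class from labeled (text, label) pairs."""
    counts = {}
    for text, label in examples:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts

def classify(counts, text):
    """Assign the class whose training words overlap the text most."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    return max(scores, key=scores.get)

training = [
    ("win money now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda attached", "mail"),
    ("lecture notes attached", "mail"),
]
model = train(training)
print(classify(model, "win cheap money"))   # → spam
print(classify(model, "lecture agenda"))    # → mail
```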

• Clustering groups a set of data objects in such a way that objects in the same group (cluster) are more similar to each other than to those in other groups (clusters).

• Not one specific algorithm, but a general task with many different solutions and algorithms

• Connectivity-based clustering (based on distance)

• Centroid-based clustering (e.g. K-means clustering)

• Distribution-based clustering (objects most likely belonging to the same distribution)

• Density-based clustering

CLUSTERING

(Fayyad et al. 1996)
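A minimal centroid-based (K-means style) clustering sketch in pure Python, run on two obviously separated groups of invented 2-D points. For simplicity it initializes centroids from the first k points; real implementations use random or smarter initialization:

```python
def kmeans(points, k, iterations=10):
    """Centroid-based clustering: assign each point to its nearest
    centroid, recompute centroids as cluster means, and repeat."""
    centroids = points[:k]  # simplified deterministic initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign the point to the centroid with smallest squared distance
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2
                            + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update each centroid to the mean of its cluster
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return clusters

points = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
print(kmeans(points, 2))
# → [[(1, 1), (1, 2), (2, 1)], [(10, 10), (10, 11), (11, 10)]]
```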

• Helsingin Sanomat (the biggest news corporation in Finland) opened their Finnish parliament election 2015 questionnaire data to the public

• The data contained questions and their answers from election candidates for the Finnish parliament

• The data could be analyzed via clustering and factor analysis to find out what different groups (clusters) of thought the candidates actually represent (in comparison to their actual party).

• Try it out: http://users.aalto.fi/~leinona1/vaalit2015/

CLUSTERING EXAMPLE

• Does what it says on the tin! Finding compact descriptions of subsets of data.

• For example, calculating means and standard deviations over different data attributes (dimensions)

• Summarization techniques are often applied to interactive exploratory data analysis and automated report generation.

SUMMARIZATION

(Fayyad et al. 1996)
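The mean-and-standard-deviation example above can be written as a short Python sketch; the attribute names and data rows are invented for illustration:

```python
import math

def summarize(rows, names):
    """Compact description of a data set: mean and (population)
    standard deviation for each attribute (column)."""
    summary = {}
    for i, name in enumerate(names):
        column = [row[i] for row in rows]
        mean = sum(column) / len(column)
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / len(column))
        summary[name] = (round(mean, 2), round(std, 2))
    return summary

data = [(25, 3100), (35, 4200), (45, 5300)]  # invented (age, income) rows
print(summarize(data, ["age", "income"]))
```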

• Estimating the relationships among variables (with a regression function)

• Includes many techniques for modeling and analyzing

• Focuses on the relationship between a dependent variable and one or more independent variables.

• The regression function is learned from the data

• Applications in prediction

REGRESSION ANALYSIS

(Fayyad et al. 1996)

REGRESSION EXAMPLE: LINEAR REGRESSION

(Image is public domain, from Wikipedia 2015, Regression Analysis)
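A linear regression like the one pictured can be fitted with ordinary least squares in a few lines of Python. The data points below are invented and constructed to lie exactly on a line, so the fit recovers the intercept and slope exactly:

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b*x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Slope: covariance of x and y divided by variance of x
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    # Intercept: the line passes through the point of means
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]   # exactly y = 1 + 2x
a, b = fit_line(xs, ys)
print(a, b)  # → 1.0 2.0
```

The learned function can then be used for prediction, e.g. `a + b * 5` gives the estimate for a new x value.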

• Finds significant dependencies between the data variables

• Two levels

• Structural level defining which variables are dependent (can be in graphical form)

• Quantitative level defining the strength of the dependency in numeric form

• E.g. Correlation analysis

• E.g. Probabilistic density networks

DEPENDENCY MODELING

(Fayyad et al. 1996)
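As a quantitative-level example, the Pearson correlation coefficient expresses the strength of a linear dependency between two variables as a number in [-1, 1]. A minimal Python sketch with invented series:

```python
import math

def pearson(xs, ys):
    """Pearson correlation: strength of a linear dependency in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson([1, 2, 3, 4], [2, 4, 6, 8]), 6))   # → 1.0  (perfect positive)
print(round(pearson([1, 2, 3, 4], [8, 6, 4, 2]), 6))   # → -1.0 (perfect negative)
```

As the XKCD strip on the next slide reminds us, a strong correlation still says nothing about causation.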

CORRELATION DOES NOT IMPLY CAUSATION

(XKCD: Correlation, http://imgs.xkcd.com/comics/correlation.png)

• Change and deviation detection

• Has the data changed from some previously known stable state or from some previously measured normative values (“normal range”)?

• Time scales matter: a short-term anomaly may actually be normal in the long term.

• Synchronic change (anomalies in stable processes) and diachronic change (deeper change in the generative structures of the process)

• Quite a dynamic category

ANOMALY DETECTION

(Fayyad et al. 1996)
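A minimal “normal range” detector in Python: values further than a threshold number of standard deviations from the series mean are flagged as anomalies. The sensor readings are invented, and the note about time scales above still applies; a short window like this can mislabel long-term behavior:

```python
import math

def anomalies(values, threshold=2.0):
    """Flag values deviating from the series mean by more than
    `threshold` standard deviations."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [v for v in values if abs(v - mean) > threshold * std]

readings = [10, 11, 9, 10, 12, 10, 11, 48]  # one clear outlier
print(anomalies(readings))  # → [48]
```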

• Cioffi-Revilla (2014) lists, for example, vocabulary analysis, correlation, lexical analysis, spatial analysis, semantic analysis, sentiment analysis, similarity analysis, clustering, network analysis, sequence analysis, intensity analysis, anomaly detection and sonification analysis

• The most important thing is to understand the ins and outs of the analysis model you are using: what it is for and how it behaves under the hood

• The relationship of the model to your research question

AND MANY OTHERS…

• Basically means that a data analysis algorithm is able to “learn” and enhance its performance iteratively from the data

• 1. Supervised machine learning

• The algorithm is trained on some known labeled data (input/target pairs)

• e.g. Netflix is able to suggest better movies based on how you use it: by watching and rating films you are teaching the machine how to suggest better movies to you

• 2. Semi-supervised machine learning

• The algorithm is trained with a small set of labeled data (input/target pairs) and a set of unlabeled data

• 3. Unsupervised machine learning

• No result-set data is given for the machine to learn from

• The algorithm finds patterns and structures in the data automatically without any pre-learning

• 4. Reinforcement machine learning

• The algorithm has a certain goal and interacts with a dynamic environment, which gives it rewards based on its actions

MACHINE LEARNING

WHERE TO GET DATA

• Ready Data Sets = Many public data sets provided by different institutions

• Web APIs = Application programming interfaces that give you data in a structured format. For example, Facebook and Twitter have APIs for getting data

• Web Scraping = Gathering the information automatically from webpages, when it is allowed.

• Databases = Querying databases directly with query languages (e.g. SQL)

• Custom data gathering process = the traditional research data gathering (surveys, interviews…)

• Open Data and Open Science are growing trends: governments providing APIs and data sets for different kinds of public data (e.g. fiscal information, expenses)

DATA SOURCES: MAIN TYPES
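Querying a database with SQL can be tried directly with Python's built-in sqlite3 module, which keeps a whole database in memory. The "responses" table and its rows below are hypothetical, for illustration only:

```python
import sqlite3

# Build a small in-memory database and query it with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE responses (respondent TEXT, age INTEGER, answer TEXT)")
conn.executemany(
    "INSERT INTO responses VALUES (?, ?, ?)",
    [("r1", 25, "yes"), ("r2", 34, "no"), ("r3", 41, "yes")],
)
# Count answers per category, ordered alphabetically
rows = conn.execute(
    "SELECT answer, COUNT(*) FROM responses GROUP BY answer ORDER BY answer"
).fetchall()
print(rows)  # → [('no', 1), ('yes', 2)]
conn.close()
```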

OLDIE BUT GOLDIE… GOVERNMENTAL REGISTRIES

FINNISH SOCIAL SCIENCE DATA ARCHIVE

CSC.FI: ETSIN & AAVA

STATISTICS FINLAND

HELSINKI REGION INFOSHARE

GAPMINDER DATA

• The Internet is full of open datasets of different kinds! Some examples:

• Economics

• American Economic Ass. (AEA): http://www.aeaweb.org/RFE/toc.php?show=complete

• Gapminder: http://www.gapminder.org/data/

• UMD: http://inforumweb.umd.edu/econdata/econdata.html

• World bank: http://data.worldbank.org/indicator

• Finance

• CBOE Futures Exchange: http://cfe.cboe.com/Data/

• Google Finance: https://www.google.com/finance (R)

• Google Trends: http://www.google.com/trends?q=google&ctab=0&geo=all&date=all&sort=0

• St Louis Fed: http://research.stlouisfed.org/fred2/ (R)

• NASDAQ: https://data.nasdaq.com/

• OANDA: http://www.oanda.com/ (R)

• Quandl: http://www.quandl.com/

• Yahoo Finance: http://finance.yahoo.com/ (R)

• Social Sciences

• General Social Survey: http://www3.norc.org/GSS+Website/

• ICPSR: http://www.icpsr.umich.edu/icpsrweb/ICPSR/access/index.jsp

• Pew Research: http://www.pewinternet.org/datasets/pages/2/

• SNAP: http://snap.stanford.edu/data/index.html

• UCLA Social Sciences Archive: http://dataarchives.ss.ucla.edu/Home.DataPortals.htm

• UPJOHN INST: http://www.upjohn.org/erdc/erdc.html

• FROM: http://www.inside-r.org/howto/finding-data-internet

INTERNET IS FULL OF DATA

WEB SCRAPING, APIS & DATABASES

[Diagram: the data provider organisation exposes its data through an API (Application Programming Interface) and public WWW pages, both accessed via the Internet through API calls or automated web scraping. The database itself is typically accessed only from inside the organisation, not via the Internet.]

• Web services and applications (such as Twitter, Facebook, …) provide Web APIs so that others are able to build their services using some functionality or data based on the data provider’s Web API / Web service

• Using APIs is the structured and “right” way to get data from a web service

• The use of APIs is controlled by the data provider: they are thus used with the data provider’s permission

• Some APIs cost according to usage, some have other conditions of use

• Connecting to an API requires programming

API (APPLICATION PROGRAMMING INTERFACE)
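A rough sketch of the two halves of an API call, using only Python's standard library: building a request URL with query parameters, and parsing a JSON response. The endpoint and the response body are hypothetical, and real APIs such as Twitter's also require authentication, which is omitted here:

```python
import json
from urllib.parse import urlencode

# 1. Build the request URL (the endpoint is made up for illustration)
base = "https://api.example.com/v1/search"
params = {"q": "computational social science", "count": 2}
url = base + "?" + urlencode(params)
print(url)

# 2. Parse the JSON the service would return
# (a canned response body standing in for a real HTTP call)
response_body = '{"results": [{"user": "a", "text": "hello"}, {"user": "b", "text": "world"}]}'
data = json.loads(response_body)
texts = [item["text"] for item in data["results"]]
print(texts)  # → ['hello', 'world']
```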

TWITTER REST APIS

FACEBOOK GRAPH API

• Web scraping (web harvesting or web data extraction) is a computer software technique of extracting information from websites. (Wikipedia 2015, Web Scraping)

• Transforms unstructured data in HTML format into some structured format for further analysis

• Used when you do not have access to the original database or when there are no APIs

• NOTE! Always make sure that scraping is allowed and legal! This is not always the case, as some websites and services explicitly forbid web scraping.

• Numerous tools, varying from manual to semi-manual to fully automatic

• High-level scraping services

• Browser plugin tools

• Programming libraries

WEB SCRAPING
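A minimal scraping sketch using only Python's standard-library html.parser (the libraries listed later, such as BeautifulSoup, make this far more convenient). The HTML page and the choice to collect `<h2>` headlines are invented for illustration:

```python
from html.parser import HTMLParser

class HeadlineScraper(HTMLParser):
    """Collect the text of all <h2> elements from an HTML page."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.headlines = []
    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
    def handle_data(self, data):
        # Keep only text that appears inside an <h2> element
        if self.in_h2:
            self.headlines.append(data.strip())

html = "<html><body><h2>Election results</h2><p>text</p><h2>Budget vote</h2></body></html>"
scraper = HeadlineScraper()
scraper.feed(html)
print(scraper.headlines)  # → ['Election results', 'Budget vote']
```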

SERVICES FOR WEB SCRAPING: IMPORT.IO

https://www.youtube.com/watch?v=ghvsVLkTKLk

SERVICES FOR WEB SCRAPING: KIMONOLABS.COM

SERVICES FOR WEB SCRAPING: WEBHOSE.IO

BROWSER PLUGINS FOR WEB SCRAPING: DATA MINER

• Python

• Scrapy: http://scrapy.org

• BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/

• Scrapemark: http://arshaw.com/scrapemark/ (not maintained anymore)

• R

• rvest: http://cran.r-project.org/web/packages/rvest/index.html

WEB SCRAPING LIBRARIES

• Watch “The Beauty of Data Visualization” by David McCandless: http://www.ted.com/talks/david_mccandless_the_beauty_of_data_visualization?language=en

VISUALIZING DATA

• Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI Magazine, 17(3), 37.

• De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A consensual definition and a review of key research topics. 4th International Conference on Integrated Information, Madrid

LECTURE 3 READING

• Cioffi-Revilla, C. 2014. Introduction to Computational Social Science. Springer-Verlag, London

• Elliott, T. 2013. 7 Definitions of Big Data You Should Know About. http://timoelliott.com/blog/2013/07/7-definitions-of-big-data-you-should-know-about.html

• Eynon, R. 2013. The rise of Big Data: what does it mean for education, technology, and media research? Learning, Media and Technology, 38:3, 237-240, DOI: 10.1080/17439884.2013.771783.

• Gartner, 2014. IT Glossary: Big Data. http://www.gartner.com/it-glossary/big-data/

• Graham, M. 2013. The Virtual Dimension. Global City Challenges: Debating a Concept, Improving the Practice, M. Acuto and W. Steele, 2013. London: Palgrave. 117-139.

• De Mauro, A., Greco, M., Grimaldi, M. 2014. What is big data? A consensual definition and a review of key research topics. 4th International Conference on Integrated Information, Madrid

• Fayyad, U., Piatetsky-Shapiro, G., & Smyth, P. (1996). From data mining to knowledge discovery in databases. AI magazine, 17(3), 37.

• IBM, 2014a. What is big data? http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html

• IBM, 2014b. The Four V’s of Big Data. http://www.ibmbigdatahub.com/sites/default/files/infographic_file/4-Vs-of-big-data.jpg

• Sicular, S. 2013. Gartner's Big Data Definition Consists of Three Parts, Not to Be Confused with Three "V"s. Forbes, 3/27/2013. http://www.forbes.com/sites/gartnergroup/2013/03/27/gartners-big-data-definition-consists-of-three-parts-not-to-be-confused-with-three-vs/

• Yasseri, T.; Bright, J. 2013. Can electoral popularity be predicted using socially generated big data? Oxford Internet Institute, University of Oxford. 2013.

REFERENCES

Thank You!

Questions and comments?

twitter: @laurieloranta

