Big data and the dark arts - Jisc Digital Media 2015

Date post: 15-Jul-2015
Upload: jisc

Big data and the dark arts: Demystifying the world of big data

Catherine Grout, Jisc

http://fc00.deviantart.net/fs71/f/2013/073/5/e/defence_against_the_dark_arts_lesson_by_asiapasek-d5y0oc7.jpg

» Introduction to the topic and its importance to education and research

» Presentations from some key projects at the coal face of this issue

› COSMOS - Collaborative online social media observatory (Pete Burnap)

› Mining Biodiversity - Enriching biodiversity heritage with text mining and social media (Riza Batista-Navarro)

› Trees and Tweets - combining Twitter data with family trees (Jack Grieve)

Structure of session

10/03/2015 Jisc Digital Festival, 9-10 March 2015, ICC Birmingham

In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization."[16] Additionally, a new V "Veracity" is added by some organizations to describe it.[17]

(http://en.wikipedia.org/wiki/Big_data)

» Better use of Big data through high performance analytics could add £216 billion to the UK economy by 2017 (CEBR via sas.com)

» Data has moved from a backroom issue to a boardroom issue (strategy insight and competitive advantage) chiefdataofficersummit.com/

» Therefore data ownership is also a very important issue

» Tim Berners-Lee (as paraphrased in the Guardian): “the data we create about ourselves should be owned by each of us, not the large companies that harvest it” theguardian.com/technology/2014/oct/08/sir-tim-berners-lee-speaks-out-on-data-ownership

Big data: big issue

» Total investment is in the region of £550 m (2012-15)

» This is across all 7 research councils but also includes collaborative programmes (17 programmes)

» Includes production of:

› Methodologies, tools and new aggregated datasets

› Infrastructure - giving access to public and private data

› Infrastructure - providing storage, compute

› Centres of Expertise - Capacity and skills development

» RCUK overview of Big data investments rcuk.ac.uk/research/infrastructure/big-data/

RCUK “Big data” investment overview

» Power

» Responsibility

» Opportunity

Big data for Universities

» Enterprise Data: about learners, researchers and staff and the University as a business (including research grants)

› Held in structured systems, databases but maybe not all interoperable

» Research Data (generally not structured or centrally held; Jisc is supporting universities to address this challenge through “Research at Risk”)

› But Open Access publications (and some other material) in Institutional Repositories (about 125 universities have one)

» Sensitive Data (e.g. medical data – secure networks, anonymised etc.)

» Activity data (data about performance, benchmarking, student and researcher behaviour)

Big data for Universities

» Big data enables much better analytics - a key area for universities and for Jisc to support

» Jisc-HESA Business Intelligence Service (in development)

» LAMP (shared academic library analytics service)

» Effective Learner Analytics challenge

» All designed to help support effective analytics at institutional and national (aggregate) level

Big data: analytics

» “Your recent Amazon purchases, Tweet score and location history makes you 23.5% welcome here.”

(Cartoon critical of big data application, by T. Gregorius, en.wikipedia.org/wiki/Big_data)

» Big data research is not all about analysing very big data

» It can be about bringing data together from different sources

» It can be about techniques from the big data field to build more interesting ways of interacting with digital libraries

» It can be about using and building new techniques, tools to interact with data and address research questions

» Project presentations will illustrate this

Big data: For research

» Issues around curation and preservation of research data (variable size and condition)

» Performance of infrastructure required

» Why should we share and re-use research data?

» What tools, methodologies, techniques can be used?

» Do researchers have the right skills to exploit data effectively?

» How does all of the above impact on research and the research process?

Big data: For research

» Two of the projects presenting today are part of the Digging into Data Challenge

» Digging into Data has been addressing many of the challenges that were flagged earlier

» Digging into Data brings together 10 funders in four countries (UK, US, Canada, NL)

» 36 projects funded since 2011

» Addresses “big data for research” in the humanities and social sciences

Big data: Digging into data

» Pioneered and legitimised big data based research in the humanities – for computer scientists and others. (from zero to hero)

» “digital humanities” and “computational social sciences” working together

» Engaged the GLAM sector and others and encouraged them to make their data available in forms useful to researchers and to work with them (encourages joint data curation)

Digging into data: Achievements so far

» Progress on the policy side toward reforming copyright and IP to allow for big data research on cultural heritage materials - (more to do here)

» International & multidisciplinary cooperation had high impact (more than anticipated). Increased visibility also strengthened research, bringing new teams together

Digging into data: Achievements so far

» Bringing data together to make Big data can create exciting research opportunities

» Article in Nature 2013

» Mummies reveal that clogged arteries plagued the ancient world

» Based on a Digging into Data programme project that brought together CT scans of 137 mummies from four very different ancient populations: Egyptian, Peruvian, the Ancestral Puebloans of southwest America and the Unangans of the Aleutian Islands in Alaska

» nature.com/news/mummies-reveal-that-clogged-arteries-plagued-the-ancient-world-1.12568

Big data: For research

» “Big data” covers a very wide set of activities

» But it has inspired, and is still inspiring, major investments and changes in practice

» Jisc is helping to support institutions in making the most of big data through:

› Developing shared services, advice and guidance to help manage research data effectively and comply with funders' requirements (Research at Risk Challenge)

› Promoting effective use of data analytics and delivering some key analytics services

› Working with the Research Councils to help exploit the benefits of big data for research

Big data: In summary

Gerd Leonhard, Big Data and the Future of Journalism, flickr.com/photos/gleonhard/8978372783/

Find out more…

Contact…

Catherine Grout

Head of change – research, Jisc

[email protected]

Collaborative Online Social Media Observatory

COSMOS

Dr. Pete Burnap (@pbFeed)

Cardiff School of Computer Science and Informatics

Cardiff University, UK

With Matthew Williams, Jeffrey Morgan, Omer Rana, Luke Sloan, Alex Voss, Adam Edwards, William Housley and Rob Procter

What is COSMOS?

• Aim to establish a coordinated interdisciplinary response to “Big Social Data”

• Led from Cardiff (Computer Science and Social Sciences), Warwick and St. Andrews

• Additional input from Edinburgh, UCL, Leeds, Manchester and Wolverhampton

• Brings together social, computer, political, health and mathematical scientists to study the methodological, theoretical, and empirical dimensions of Big Data in technical, social and policy contexts

• Developing a research programme to help understand and explain how social processes and interactions manifest on the Web, with a focus upon the challenges posed by big social data to government, digital economy and civil society

• Development of new methodological tools and technical/data solutions for UK academia and public sector…a Web Observatory

What is COSMOS?

• COSMOS has attracted 17 research grants amounting to over £1.25M in funding from JISC/ESRC/EPSRC/AHRC and £500K from the public and private sectors (DoH/FSA/HPC Wales).

• A significant proportion of these funds has been awarded to collect and analyse social media data in the contexts of Societal Safety and Security, e.g. social tension, hate speech, crime reporting and fear of crime, suicidal ideation

Research Programme

Digital Social Research Tools, Tension Indicators and Safer Communities: A demonstration of COSMOS (ESRC DSR)

COSMOS: Supporting Empirical Social Scientific Research with a Virtual Research Environment (JISC)

Small items of research equipment at Cardiff University (EPSRC)

Hate Speech and Social Media: Understanding Users, Networks and Information Flows (ESRC Google)

Social Media and Prediction: Crime Sensing, Data Integration and Statistical Modelling (ESRC NCRM)

Understanding the Role of Social Media in the Aftermath of Youth Suicides (Department of Health)

Scaling the Computational Analysis of “Big Social Data” & Massive Temporal Social Media Datasets (HPC Wales)

Digital Wildfire: (Mis)information flows, propagation and responsible governance (ESRC Global Uncertainties)

Public perceptions of the UK food system: public understanding and engagement, and the impact of crises and scares (ESRC/FSA)

(Timeline: 2011 to 2016)

COSMOS Web Observatory

Integrated

Open (“plug and play”)

Scalable (MongoDB data stores / Hadoop back end)

Usable – developed with social scientists for social scientists

Reproducible/Citable Research - export/share workflow

Burnap, P. et al. (2014) ‘COSMOS: Towards an Integrated and Scalable Service for Analyzing Social Media on Demand’, International Journal of Parallel, Emergent and Distributed Systems

Web Observatory Features

• Data Collection

– Persistent connection to Twitter 1% Stream (~4 billion)

– ONS/Police API

– Drag and drop RSS

– Import CSV/JSON

• Data Transformation

– Word Frequency

– Point data frequency over time

– Social Network Analysis

– Geospatial Clustering

– Sentiment Analysis

– …API to plug new modules and benchmark tools

Observing Events

COSMOS Infrastructure

COSMOS Desktop

• Small local datasets

• Users’ API credentials

• Local analysis

• Sept ’14 launch (>100 downloads in 17 countries)

COSMOS Cloud

• Scalable storage

– Massive datasets

• Scalable compute

– On-demand nodes

– Fast search & retrieve

– Fast analysis

• Workflow management

• Collaboration support

• 2015 launch

Web Observatory Examples

• Policy/impact driven (benefit to society/economy)

• Focus on ethical research into human safety and security

• Augment terrestrial methods

• Comparison to existing methods

• Experimental applied stats & machine learning

• Provide examples of machine intelligence tasks integrated into social research workflow…

• Radio 5 Live Hit List (#5LiveHitList) - biggest impact stories across social media and online

Questions?

Pete Burnap (@pbFeed)

[email protected]

Mining Biodiversity: Enriching biodiversity literature with OCR corrections and text-mined semantic metadata

Riza Batista-Navarro, National Centre for Text Mining, University of Manchester

Mining biodiversity

The Partners

[Partner logos labelled A to E; D: Social Media Lab]

» Transform BHL into a next-generation social digital library

» Bring together strengths from multiple disciplines:

› Text mining

› Machine learning

› Data visualisation

› History

› Library and information science

› Social media

Project aims

What do we want to accomplish?

Social Media

Semantic Metadata

Visualisation

» A consortium of botanical and natural history libraries

» Stores digitised legacy literature on biodiversity

» Currently holds 130,000 volumes = millions of pages (PDFs and OCR-generated text)

» Open-access

Biodiversity Heritage Library (BHL)

» Supports keyword-based search

» Species annotated and linked to the Encyclopedia of Life

» Integrates automatic taxonomic name finding tools

» Data access through export functionalities and Web services

BHL: Current features

BHL: Keyword-based search and Browsing

BHL: Metadata included in advanced search functionality

BHL: Page viewing

Page in PDF/image format

OCR – generated text

Annotated species names

Enhanced BHL: Proposed search functionalities

Faceted search

Time-sensitive search

Automatically generated questions

Enhanced BHL: Proposed page view

Page in PDF/image format

OCR – corrected text with annotations

Big data analytics: OCR correction and text mining

Big data analytics: Compilation and visualisation of (evolving) terms

Sample OCR errors detected and corrected

» Original

I mean by habit, that law in virtiie of which all the actions and the characters of living beings tend to repeat and to T)err)etuatf

vi I'REFACE.

themselves, not only in tlie individual but in its offspring.

» Result

I mean by habit, that law in virtue of which all the actions and the characters of living beings tend to repeat and to perpetuate

vi PREFACE.

themselves, not only in the individual but in its offspring.
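
The kind of correction shown above can be illustrated with a very small sketch. This is not the project's method, which the slides do not describe; it simply assumes a known vocabulary and replaces an unknown token with its closest vocabulary entry using Python's standard `difflib`. The vocabulary `VOCAB` and the function `correct_token` are hypothetical names chosen for illustration, and heavily damaged tokens may well be left unchanged by an approach this simple.

```python
import difflib

# Hypothetical reference vocabulary; a real system would use a large lexicon.
VOCAB = ["virtue", "the", "perpetuate", "preface", "habit", "law"]


def correct_token(token, vocab=VOCAB, cutoff=0.6):
    """Return the closest vocabulary word, or the token itself if none is close."""
    lowered = token.lower()
    if lowered in vocab:
        # Already a known word: leave it untouched.
        return token
    # get_close_matches ranks candidates by SequenceMatcher similarity.
    matches = difflib.get_close_matches(lowered, vocab, n=1, cutoff=cutoff)
    return matches[0] if matches else token


print(correct_token("virtiie"))  # virtue
print(correct_token("habit"))    # habit
```

A similarity cutoff trades recall against the risk of "correcting" rare but valid words, which matters for legacy scientific text full of taxonomic names.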

Semantic metadata generation

» Entity types

› Taxonomic entities

› Geographic locations

› Habitats

› Anatomical entities

› Qualities

› Temporal expressions

› Persons

» Association types

› Observation

› Habitation

› Nutrition

› Trait

Examples of semantic metadata (annotations)

» Observation

» Habitation

Examples of semantic metadata (annotations)

» Nutrition

» Trait

» Web-based, graphical TM workbench

» Conforms with the Unstructured Information Management Architecture (UIMA) standard

» Facilitates the straightforward integration of various analytics into workflows

» Allows for the validation of annotations

: Automatic annotation by text mining (TM)

Main interface

Reconfigurable, reusable, modular workflows

ENVO

Catalogue of Life

PATO

GAZ

Validation interface

» Semantic metadata is generated and visualised using big data analytics

» Enhanced searching through historical archives is facilitated

» Outcomes

› More informative search results

› Discovery of novel associations

In summary…

Find out more…

Contact…

Riza Batista-Navarro

Research associate, NaCTeM

[email protected]

nactem.ac.uk/

Big data for lexical research

Jack Grieve, Aston University

» The problem with analyzing the lexicon is that most words are very rare. For example, a majority of the 100,000 most common words in English occur on average less than once per 25 million words. However, even the largest standard linguistic datasets (e.g. the British National Corpus) are smaller than 100 million words

» To observe the usage of most words, we therefore require access to incredibly large corpora, which is now possible with the availability of big data

Big data for lexical research
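
The arithmetic behind this point can be made explicit. The sketch below takes the rate quoted on the slide (one occurrence per 25 million words) and compares a 100 million word corpus with the 8.9 billion word Twitter corpus described later in this talk; the function name is an assumption for illustration.

```python
def expected_occurrences(corpus_size_words, words_per_occurrence):
    """Expected count of a word that occurs once per `words_per_occurrence` words."""
    return corpus_size_words / words_per_occurrence


RATE = 25_000_000                # one occurrence per 25 million words (from the slide)
STANDARD_CORPUS = 100_000_000    # upper bound for standard corpora (from the slide)
TWITTER_CORPUS = 8_900_000_000   # Twitter corpus size given in this talk

small = expected_occurrences(STANDARD_CORPUS, RATE)
large = expected_occurrences(TWITTER_CORPUS, RATE)

print(small)  # 4.0
print(large)  # 356.0
```

At most four expected occurrences is too few to study how a word is used, whereas several hundred begins to support frequency and mapping analyses.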

» Today, I’m going to demonstrate how taking advantage of big data mined from Twitter allows us to study for the first time how newly emerging words enter and spread within a language

» In particular, I’ll be analysing an 8.9 billion word corpus of American Tweets posted by over 7 million different users using geo-enabled smart phones from October 2013 – November 2014, which was collected for the Digging into Data Challenge

Big data for lexical research

» To find newly emerging words we looked for words that were very rare at the start of the period represented by our corpus but that rose considerably over the course of this period by analysing the relative frequency of the 67,000 most common words in our corpus over each day of the corpus

Finding newly emerging words
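
The general idea can be sketched in a few lines, under stated assumptions. The slides do not give the exact statistic used, so the sketch below substitutes a simple comparison: a word counts as "emerging" if its mean relative frequency in the first few days is very low and its mean in the last few days is clearly higher. The thresholds, the toy data and the function names are all hypothetical.

```python
from collections import Counter


def relative_frequencies(daily_tokens):
    """Return one {word: relative frequency} dict per day."""
    result = []
    for tokens in daily_tokens:
        counts = Counter(tokens)
        total = sum(counts.values())
        result.append({w: c / total for w, c in counts.items()} if total else {})
    return result


def emerging_words(daily_tokens, window=2, max_start=0.05, min_end=0.1, min_growth=3.0):
    """Words that are rare in the first `window` days and common in the last."""
    freqs = relative_frequencies(daily_tokens)
    start, end = freqs[:window], freqs[-window:]
    vocab = set().union(*freqs)
    found = []
    for word in sorted(vocab):
        start_mean = sum(day.get(word, 0.0) for day in start) / len(start)
        end_mean = sum(day.get(word, 0.0) for day in end) / len(end)
        rare_at_start = start_mean <= max_start
        risen = end_mean >= min_end and end_mean >= min_growth * start_mean
        if rare_at_start and risen:
            found.append(word)
    return found


# Toy corpus: one token list per day (illustrative data only).
days = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat is unbothered on the mat".split(),
    "unbothered cat unbothered dog the end".split(),
]

print(emerging_words(days))  # ['unbothered']
```

On a real corpus the thresholds would be far smaller, and a trend measure over the whole period would be more robust than comparing two windows.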

» “Unbothered by the negativity and foolishness”

» “I starting to enjoying being unbothered”

» “What's that new s**t bitches are saying. Unbothered whatever that means”

» “I'm always Unbothered I have no need to worry about the next person.”

» “I'm so unbothered omg I've never felt more in my zone”

» “The FACT That Beyoncé Was So Unbothered About Michelle Falling”

Unbothered examples

» In addition to finding newly emerging words, we can also map the spread of these words across space for the first time, by taking advantage of the geocoded information provided by Twitter, which consists of a longitude and latitude for each tweet

Mapping newly emerging words
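
One simple way to turn per-tweet coordinates into a map is sketched below, as an assumption rather than the method used in the talk: each tweet is assigned to a latitude/longitude grid cell, and the relative frequency of a word is computed per cell. The grid size, the sample coordinates and the function names are hypothetical.

```python
import math
from collections import defaultdict


def cell_of(lat, lon, size_deg=1.0):
    """Map a coordinate to a grid cell identified by its south-west corner index."""
    return (math.floor(lat / size_deg), math.floor(lon / size_deg))


def word_rate_by_cell(tweets, word, size_deg=1.0):
    """Relative frequency of `word` among all tokens tweeted in each grid cell."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for lat, lon, text in tweets:
        tokens = text.lower().split()
        cell = cell_of(lat, lon, size_deg)
        totals[cell] += len(tokens)
        hits[cell] += tokens.count(word)
    return {cell: hits[cell] / totals[cell] for cell in totals if totals[cell]}


# Illustrative tweets: (latitude, longitude, text). Not real data.
tweets = [
    (33.7, -84.4, "so unbothered today"),
    (33.9, -84.1, "unbothered and happy"),
    (40.7, -74.0, "coffee in the city"),
]

rates = word_rate_by_cell(tweets, "unbothered")
for cell, rate in sorted(rates.items()):
    print(cell, round(rate, 3))
```

Normalising by the total number of tokens in each cell keeps densely populated areas from dominating the map simply because they produce more tweets.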

» By taking advantage of big data we are thus able to investigate language in far greater detail than was previously possible, including identifying and mapping the spread of newly emerging words

» Big data is therefore incredibly useful for understanding complex systems that involve very large numbers of rare events, including the lexicon of modern languages

Conclusion

Find out more…

Contact…

Jack Grieve

Aston University

[email protected]

@JWGrieve

