http://fc00.deviantart.net/fs71/f/2013/073/5/e/defence_against_the_dark_arts_lesson_by_asiapasek-d5y0oc7.jpg
» Introduction to the topic and its importance for education and research
» Presentations from some key projects at the coal face of this issue
› COSMOS - Collaborative online social media observatory (Pete Burnap)
› Mining Biodiversity - Enriching biodiversity heritage with text mining and social media (Riza Batista-Navarro)
› Trees and Tweets - combining Twitter data with family trees (Jack Grieve)
Structure of session
10/03/2015 Jisc Digital Festival, 9-10 March 2015, ICC Birmingham
In 2012, Gartner updated its definition as follows: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." Additionally, some organizations add a new V, "Veracity", to describe it.
(http://en.wikipedia.org/wiki/Big_data)
» Better use of Big data through high performance analytics could add £216 billion to the UK economy by 2017 (CEBR via sas.com)
» Data has moved from a backroom issue to a boardroom issue (strategy insight and competitive advantage) chiefdataofficersummit.com/
» Data ownership is therefore also a very important issue
» Tim Berners-Lee (as paraphrased in the Guardian): “the data we create about ourselves should be owned by each of us, not the large companies that harvest it” (theguardian.com/technology/2014/oct/08/sir-tim-berners-lee-speaks-out-on-data-ownership)
Big data: big issue
» Total investment is in the region of £550m (2012-15)
» This is across all 7 research councils but also includes collaborative programmes (17 programmes)
» Includes production of:
› Methodologies, tools and new aggregated datasets
› Infrastructure - giving access to public and private data
› Infrastructure - providing storage, compute
› Centres of Expertise - Capacity and skills development
» RCUK overview of Big data investments rcuk.ac.uk/research/infrastructure/big-data/
RCUK “Big data” investment overview
» Power
» Responsibility
» Opportunity
Big data for Universities
» Enterprise Data: about learners, researchers and staff and the University as a business (including research grants)
› Held in structured systems and databases, though not all of them may be interoperable
» Research Data (generally not structured or centrally held; Jisc is supporting universities to address this challenge through “Research at Risk”)
› But Open Access publications (and some other material) are held in Institutional Repositories (about 125 universities have one)
» Sensitive Data (e.g. medical data – secure networks, anonymised etc.)
» Activity data (data about performance, benchmarking, student and researcher behaviour)
Big data for Universities
» Big data enables much better analytics: a key area for universities and for Jisc to support
» Jisc-HESA Business Intelligence Service (in development)
» LAMP (shared academic library analytics service)
» Effective Learner Analytics challenge
» All designed to help support effective analytics at institutional and national (aggregate) level
Big data: analytics
» “Your recent Amazon purchases, Tweet score and location history makes you 23.5% welcome here.”
(Cartoon critical of big data application, by T. Gregorius, en.wikipedia.org/wiki/Big_data)
» Big data research is not all about analysing very big data
» It can be about bringing data together from different sources
» It can be about using techniques from the big data field to build more interesting ways of interacting with digital libraries
» It can be about using and building new techniques and tools to interact with data and address research questions
» Project presentations will illustrate this
Big data: For research
» Issues around curation and preservation of research data (variable size and condition)
» Performance of infrastructure required
» Why should we share and re-use research data?
» What tools, methodologies, techniques can be used?
» Do researchers have the right skills to exploit data effectively?
» How does all of the above impact on research and the research process?
Big data: For research
» Two of the projects presenting today are part of the Digging into Data Challenge
» Digging into Data has been addressing many of the challenges that were flagged earlier
» Digging into Data brings together 10 funders in four countries (UK, US, Canada, NL)
» 36 projects funded since 2011
» Addresses “big data for research” in the humanities and social sciences
Big data: Digging into data
» Pioneered and legitimised big data based research in the humanities – for computer scientists and others. (from zero to hero)
» “digital humanities” and “computational social sciences” working together
» Engaged the GLAM sector and others, encouraging them to make their data available in forms useful to researchers and to work with them (encourages joint data curation)
Digging into data: Achievements so far
» Progress on the policy side toward reforming copyright and IP to allow for big data research on cultural heritage materials (more to do here)
» International & multidisciplinary cooperation had high impact (more than anticipated). Increased visibility also strengthened research (bringing new teams together)
Digging into data: Achievements so far
» Bringing data together to make Big data can create exciting research opportunities
» Article in Nature 2013
» Mummies reveal that clogged arteries plagued the ancient world
» Based on a Digging into Data programme project that brought together CT scans of 137 mummies from four very different ancient populations: Egyptian, Peruvian, the Ancestral Puebloans of southwest America and the Unangans of the Aleutian Islands in Alaska
» nature.com/news/mummies-reveal-that-clogged-arteries-plagued-the-ancient-world-1.12568
Big data: For research
» “Big data” covers a very wide set of activities
» But it has inspired, and continues to inspire, major investments and changes in practice
» Jisc is helping to support institutions in making the most of big data through:
› Developing shared services, advice and guidance to help manage research data effectively and comply with funders' requirements (Research at Risk Challenge)
› Promoting effective use of data analytics and delivering some key analytics services
› Working with the Research Councils to help exploit the benefits of big data for research
Big data: In summary
Gerd Leonhard, Big Data and the Future of Journalism, flickr.com/photos/gleonhard/8978372783/
Collaborative Online Social Media Observatory
COSMOS
Dr. Pete Burnap (@pbFeed)
Cardiff School of Computer Science and Informatics
Cardiff University, UK
With Matthew Williams, Jeffrey Morgan, Omer Rana, Luke Sloan, Alex Voss, Adam Edwards, William Housley and Rob Procter
What is COSMOS?
• Aim to establish a coordinated interdisciplinary response to “Big Social Data”
• Led from Cardiff (Computer Science and Social Sciences), Warwick and St. Andrews
• Additional input from Edinburgh, UCL, Leeds, Manchester and Wolverhampton
• Brings together social, computer, political, health and mathematical scientists to study the methodological, theoretical, and empirical dimensions of Big Data in technical, social and policy contexts
• Developing a research programme to help understand and explain how social processes and interactions manifest on the Web, with a focus upon the challenges posed by big social data to government, digital economy and civil society
• Development of new methodological tools and technical/data solutions for UK academia and public sector…a Web Observatory
What is COSMOS?
• COSMOS has attracted 17 research grants amounting to over £1.25M in funding from JISC/ESRC/EPSRC/AHRC, and £500K from the public and private sectors (DoH/FSA/HPC Wales).
• A significant proportion of these funds have been awarded to collect and analyse social media data in the contexts of Societal Safety and Security, e.g. social tension, hate speech, crime reporting and fear of crime, suicidal ideation
Research Programme
• Digital Social Research Tools, Tension Indicators and Safer Communities: A demonstration of COSMOS (ESRC DSR)
• COSMOS: Supporting Empirical Social Scientific Research with a Virtual Research Environment (JISC)
• Small items of research equipment at Cardiff University (EPSRC)
• Hate Speech and Social Media: Understanding Users, Networks and Information Flows (ESRC Google)
• Social Media and Prediction: Crime Sensing, Data Integration and Statistical Modelling (ESRC NCRM)
• Understanding the Role of Social Media in the Aftermath of Youth Suicides (Department of Health)
• Scaling the Computational Analysis of “Big Social Data” & Massive Temporal Social Media Datasets (HPC Wales)
• Digital Wildfire: (Mis)information flows, propagation and responsible governance (ESRC Global Uncertainties)
• Public perceptions of the UK food system: public understanding and engagement, and the impact of crises and scares (ESRC/FSA)
(Timeline on slide: 2011 to 2016)
COSMOS Web Observatory
Integrated
Open (“plug and play”)
Scalable (MongoDB data stores / Hadoop back end)
Usable – developed with social scientists for social scientists
Reproducible/Citable Research - export/share workflow
Burnap, P. et al. (2014) ‘COSMOS: Towards an Integrated and Scalable Service for Analyzing Social Media on Demand’, International Journal of Parallel, Emergent and Distributed Systems
Web Observatory Features
• Data Collection
– Persistent connection to Twitter 1% Stream (~4 billion)
– ONS/Police API
– Drag and drop RSS
– Import CSV/JSON
• Data Transformation
– Word Frequency
– Point data frequency over time
– Social Network Analysis
– Geospatial Clustering
– Sentiment Analysis
– …API to plug new modules and benchmark tools
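As a rough illustration of two of the transformations listed above (word frequency and point frequency over time), here is a minimal Python sketch. It is not COSMOS code: the sample records, the field names `text` and `created_at`, and both function names are assumptions made for this example.

```python
from collections import Counter
from datetime import datetime
import re

# Hypothetical sample records; the real COSMOS data model is not shown in the slides.
TWEETS = [
    {"text": "Flooding reported in Cardiff", "created_at": "2015-03-09T10:15:00"},
    {"text": "More flooding near the river", "created_at": "2015-03-09T18:40:00"},
    {"text": "Roads reopen in Cardiff", "created_at": "2015-03-10T08:05:00"},
]

def word_frequency(tweets):
    """Count lower-cased word tokens across all tweet texts."""
    counts = Counter()
    for tweet in tweets:
        counts.update(re.findall(r"[a-z']+", tweet["text"].lower()))
    return counts

def frequency_over_time(tweets):
    """Count tweets per calendar day (ISO date string -> count)."""
    per_day = Counter()
    for tweet in tweets:
        day = datetime.fromisoformat(tweet["created_at"]).date().isoformat()
        per_day[day] += 1
    return dict(per_day)

word_counts = word_frequency(TWEETS)
daily_counts = frequency_over_time(TWEETS)
```

Both functions take plain lists of dictionaries, so the same sketch would work on records imported from CSV or JSON.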
COSMOS Infrastructure
COSMOS Desktop
• Small local datasets
• Users’ API credentials
• Local analysis
• Sept ‘14 launch (>100 downloads in 17 countries)
COSMOS Cloud
• Scalable storage
– Massive datasets
• Scalable compute
– On-demand nodes
– Fast search & retrieve
– Fast analysis
• Workflow management
• Collaboration support
• 2015 launch
Web Observatory Examples
• Policy/impact driven (benefit to society/economy)
• Focus on ethical research into human safety and security
• Augment terrestrial methods
• Comparison to existing methods
• Experimental applied stats & machine learning
• Provide examples of machine intelligence tasks integrated into social research workflow…
• Radio 5 Live Hit List (#5LiveHitList) - biggest impact stories across social media and online
Mining Biodiversity: Enriching biodiversity literature with OCR corrections and text-mined semantic metadata
Riza Batista-Navarro, National Centre for Text Mining, University of Manchester
Mining biodiversity
The Partners
(Logos of partners A to E; partner D: Social Media Lab)
» Transform BHL into a next-generation social digital library
» Bring together strengths from multiple disciplines:
› Text mining
› Machine learning
› Data visualisation
› History
› Library and information science
› Social media
Project aims
What do we want to accomplish?
Social media
Semantic metadata
Visualisation
» A consortium of botanical and natural history libraries
» Stores digitised legacy literature on biodiversity
» Currently holds 130,000 volumes = millions of pages (PDFs and OCR-generated text)
» Open-access
Biodiversity Heritage Library (BHL)
» Supports keyword-based search
» Species annotated and linked to the Encyclopedia of Life
» Integrates automatic taxonomic name finding tools
» Data access through export functionalities and Web services
BHL: Current features
BHL: Keyword-based search and Browsing
BHL: Metadata included in advanced search functionality
BHL: Page viewing
Page in PDF/image format
OCR-generated text
Annotated species names
Enhanced BHL: Proposed search functionalities
Faceted search
Time-sensitive search
Automatically generated questions
Enhanced BHL: Proposed page view
Page in PDF/image format
OCR-corrected text with annotations
Big data analytics: OCR correction and text mining
Big data analytics: Compilation and visualisation of (evolving) terms
Sample OCR errors detected and corrected
» Original
I mean by habit, that law in virtiie of which all the actions and the characters of living beings tend to repeat and to T)err)etuatf
vi I'REFACE.
themselves, not only in tlie individual but in its offspring.
» Result
I mean by habit, that law in virtue of which all the actions and the characters of living beings tend to repeat and to perpetuate
vi PREFACE.
themselves, not only in the individual but in its offspring.
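The slides do not say how the project detects and corrects OCR errors. Purely as an illustration of one common idea (replace an unknown token with the nearest dictionary word by edit distance), here is a minimal Python sketch. The tiny `VOCABULARY`, the function names and the distance threshold are all assumptions for this example, not the project's method.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

# Toy vocabulary (an assumption); a real system would need a full lexicon.
VOCABULARY = {"virtue", "the", "in", "of", "which", "perpetuate", "individual"}

def correct_token(token, vocabulary, max_distance=2):
    """Return the closest vocabulary word within max_distance, else the token unchanged."""
    lowered = token.lower()
    if lowered in vocabulary:
        return token
    best = min(vocabulary, key=lambda w: (edit_distance(lowered, w), w))
    return best if edit_distance(lowered, best) <= max_distance else token

fixed_virtue = correct_token("virtiie", VOCABULARY)
fixed_the = correct_token("tlie", VOCABULARY)
unchanged = correct_token("T)err)etuatf", VOCABULARY)
```

With this threshold the lightly damaged tokens are repaired, while a heavily garbled one such as "T)err)etuatf" is left alone; recovering "perpetuate" from it would need a more permissive threshold or contextual evidence, which this sketch does not attempt.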
Semantic metadata generation
» Entity types
› Taxonomic entities
› Geographic locations
› Habitats
› Anatomical entities
› Qualities
› Temporal expressions
› Persons
» Association types
› Observation
› Habitation
› Nutrition
› Trait
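One way to picture the entity and association types listed above is as a small data model. The Python sketch below is only an assumption about how such annotations could be represented; the class names, field names and the example sentence are invented for illustration and are not taken from the project.

```python
from dataclasses import dataclass
from enum import Enum

class EntityType(Enum):
    TAXON = "Taxonomic entity"
    LOCATION = "Geographic location"
    HABITAT = "Habitat"
    ANATOMICAL = "Anatomical entity"
    QUALITY = "Quality"
    TEMPORAL = "Temporal expression"
    PERSON = "Person"

class AssociationType(Enum):
    OBSERVATION = "Observation"
    HABITATION = "Habitation"
    NUTRITION = "Nutrition"
    TRAIT = "Trait"

@dataclass(frozen=True)
class Entity:
    text: str          # surface form as it appears in the page text
    kind: EntityType
    start: int         # character offset where the mention starts
    end: int           # character offset just past the mention

@dataclass(frozen=True)
class Association:
    kind: AssociationType
    participants: tuple  # the entities linked by this association

# Invented example sentence, used only to show the structure.
sentence = "Quercus robur grows in woodland."
taxon = Entity("Quercus robur", EntityType.TAXON, 0, 13)
habitat = Entity("woodland", EntityType.HABITAT, 23, 31)
habitation = Association(AssociationType.HABITATION, (taxon, habitat))
```

Storing character offsets alongside the text keeps each annotation tied to its position on the page, which is what a page view with highlighted annotations would need.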
Examples of semantic metadata (annotations)
» Observation
» Habitation
Examples of semantic metadata (annotations)
» Nutrition
» Trait
» Web-based, graphical TM workbench
» Conforms with the Unstructured Information Management Architecture (UIMA) standard
» Facilitates the straightforward integration of various analytics into workflows
» Allows for the validation of annotations
: Automatic annotation by text mining (TM)
Main interface
Reconfigurable, reusable, modular workflows
ENVO
Catalogue of Life
PATO
GAZ
Validation interface
» Semantic metadata is generated and visualised using big data analytics
» Enhanced searching through historical archives is facilitated
» Outcomes
› More informative search results
› Discovery of novel associations
In summary…
Find out more…
Contact…
Riza Batista-Navarro, Research associate, NaCTeM
nactem.ac.uk/
» The problem with analyzing the lexicon is that most words are very rare. For example, a majority of the 100,000 most common words in English occur on average less than once per 25 million words. However, even the largest standard linguistic datasets (e.g. the British National Corpus) are smaller than 100 million words
» To observe the usage of most words, we therefore require access to incredibly large corpora, which is now possible with the availability of big data
Big data for lexical research
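The arithmetic behind this argument can be checked with a few lines of Python. This is a back-of-the-envelope sketch using only figures quoted in the talk (one occurrence per 25 million words, a corpus of under 100 million words, and the 8.9 billion word Twitter corpus); the function name is an assumption.

```python
def expected_occurrences(corpus_size_words, words_per_occurrence=25_000_000):
    """Expected count of a word that occurs once per `words_per_occurrence` words."""
    return corpus_size_words / words_per_occurrence

# A corpus of 100 million words (an upper bound for the standard datasets mentioned).
in_100m_corpus = expected_occurrences(100_000_000)

# The 8.9 billion word Twitter corpus described in the talk.
in_twitter_corpus = expected_occurrences(8_900_000_000)
```

A word at that frequency would be expected about 4 times in a 100 million word corpus, too few to study its usage, but about 356 times in the Twitter corpus.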
» Today, I’m going to demonstrate how taking advantage of big data mined from Twitter allows us to study for the first time how newly emerging words enter and spread within a language
» In particular, I’ll be analysing an 8.9 billion word corpus of American Tweets posted by over 7 million different users using geo-enabled smart phones from October 2013 – November 2014, which was collected for the Digging into Data Challenge
Big data for lexical research
» To find newly emerging words, we analysed the relative frequency of the 67,000 most common words in our corpus on each day it covers, looking for words that were very rare at the start of the period but rose considerably over its course
Finding newly emerging words
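The search described above could be sketched as follows. This is a minimal Python illustration, not the project's actual procedure: the toy corpus, the window size and both thresholds are assumptions chosen so the example is easy to follow.

```python
from collections import Counter

# Hypothetical toy corpus: one list of tokens per day, in chronological order.
DAILY_TOKENS = [
    ["so", "tired", "today", "so", "bored"],
    ["so", "tired", "of", "today", "unbothered"],
    ["unbothered", "today", "so", "unbothered", "tired"],
    ["unbothered", "and", "unbothered", "so", "unbothered"],
]

def relative_frequencies(daily_tokens):
    """Return one {word: count / total tokens that day} mapping per day."""
    result = []
    for tokens in daily_tokens:
        counts = Counter(tokens)
        total = len(tokens)
        result.append({word: count / total for word, count in counts.items()})
    return result

def emerging_words(daily_tokens, window=2, max_start=0.15, min_rise=0.3):
    """Words that are rare on average in the first `window` days and whose
    average relative frequency is at least `min_rise` higher in the last ones."""
    freqs = relative_frequencies(daily_tokens)
    vocabulary = set().union(*daily_tokens)
    found = []
    for word in sorted(vocabulary):
        start = sum(day.get(word, 0.0) for day in freqs[:window]) / window
        end = sum(day.get(word, 0.0) for day in freqs[-window:]) / window
        if start <= max_start and end - start >= min_rise:
            found.append(word)
    return found

rising = emerging_words(DAILY_TOKENS)
```

Using relative rather than raw frequency matters because the number of tweets varies from day to day; on a real corpus the thresholds would need tuning, and some smoothing over time would probably be required.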
» “Unbothered by the negativity and foolishness”
» “I starting to enjoying being unbothered”
» “What's that new s**t bitches are saying. Unbothered whatever that means”
» “I'm always Unbothered I have no need to worry about the next person.”
» “I'm so unbothered omg I've never felt more in my zone”
» “The FACT That Beyoncé Was So Unbothered About Michelle Falling”
Unbothered examples
» In addition to finding newly emerging words, we can also map the spread of these words across space for the first time, by taking advantage of the geocoded information provided by Twitter, which consists of a longitude and latitude for each tweet
Mapping newly emerging words
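To make the mapping idea concrete, the Python sketch below bins geocoded tweets into grid cells and computes, for each cell, the share of tweets containing a given word. The talk does not describe the regions or statistics actually used, so the grid, the sample coordinates and the function names are assumptions for illustration only.

```python
from collections import defaultdict
import math

# Hypothetical geocoded tweets: (longitude, latitude, text).
TWEETS = [
    (-84.39, 33.75, "feeling unbothered today"),
    (-84.42, 33.71, "unbothered as always"),
    (-84.35, 33.80, "traffic is awful"),
    (-122.33, 47.61, "rain again"),
    (-122.30, 47.65, "coffee time"),
]

def grid_cell(lon, lat, cell_size=1.0):
    """Map a coordinate to the south-west corner of its grid cell."""
    return (math.floor(lon / cell_size) * cell_size,
            math.floor(lat / cell_size) * cell_size)

def word_share_by_cell(tweets, word, cell_size=1.0):
    """Return {cell: share of tweets in that cell that contain `word`}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for lon, lat, text in tweets:
        cell = grid_cell(lon, lat, cell_size)
        totals[cell] += 1
        if word in text.lower().split():
            hits[cell] += 1
    return {cell: hits[cell] / totals[cell] for cell in totals}

shares = word_share_by_cell(TWEETS, "unbothered")
```

Computing these shares separately for successive time slices would give a sequence of maps showing where a word appears first and how it spreads.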
» By taking advantage of big data we are thus able to investigate language in far greater detail than was previously possible, including identifying and mapping the spread of newly emerging words
» Big data is therefore incredibly useful for understanding complex systems that involve very large numbers of rare events, including the lexicon of modern languages
Conclusion