INFO310, autumn 2017, session 1
Session 2: Semantics for Big Data
Andreas L. Opdahl <[email protected]>
Themes
• “Hangovers” (from S1):
– big data are disruptive!
– the essays
– the programming projects
• Hadoop and Big Data
– technology introduction
• Paper presentations
– learning to read and present scholarly work
– examples of recent research
– set of starting references for essays, theses...
– varying difficulty – will try to even out
Big Data as a Disruption
• Disruptive technology:
– a technology that displaces established ones, and shakes up existing industries or creates new ones
– e.g., PCs, the internet, digital media, social media
• Big data is disruptive:
– it creates new data-driven organisational forms
– new ways of doing research and science
– new ways of creating and maintaining products and services
– new threats to privacy and social order
• ...too easy to shrug off as (just) hype or a buzzword
Data-driven organisations
• “The next phase of the knowledge economy, reshaping the mode of production” (RK, p. 16)
– inward: monitor and evaluate performance in real time; reduce waste and fraud; improve strategy, planning and decision making
– outward: design new commodities, identify and target new markets, implement dynamic pricing, realise untapped potential, gain competitive advantage
• Goals: run more intelligently; flexibility and innovation; reduced risk, cost and losses; improved customer experience, return on investment and profit
• Changing organisational practice in all these areas
– and in a coordinated / integrated way
New ways of doing business
• Retail (Walmart, Kohl's): analyse sales, pricing, economic, demographic and weather data to tailor local product selection and price markdowns
• Online dating: sift through personal characteristics, reactions and communications to improve matches
• NY Police: analyse data on past arrests, paydays, sporting events, weather and holidays to deploy officers optimally
• Professional sports: massaging sports statistics to spot undervalued players
• Education: analyse data from learning management systems to improve teaching / studying
Steve Lohr (2012): The Age of Big Data, NYTimes.com
The Essays
Individual essays
• The essay shall present and discuss selected theory, technology and tools related to semantic technologies, backed by scholarly and other references
– counts for 60% of the final grade
– presentations: November 8th
– deadline: November 9th, 14:00
– send me a brief informal email proposal by next Thursday!
• Encouraged:
– more than a paper
– social media contributions (Wikipedia, Wikidata...)
– vocabulary / ontology proposals
• Previous essays available in the wiki!
Some previous essay themes
• Semanticare - Semantic web for Gestational Diabetes
• Faulty science and Big data
• SEMANTIC TECHNOLOGIES IN SEARCH ENGINES: GOOGLE AND COMPETITORS
• The future of semantics in ‘Scientific Workflows’
• Privacy in Linked Health Data
• Using Classified ads for Semantic web – Applied in the problem of immigration labor in Mexico.
• Semantic Web Technology in the Internet of Things: A Survey
• Discovering Semantic Technologies
• Utilizing the data from biofeedback-capable gaming equipment
• Visualisation of big semantic data
• Privacy and profiling
• The use of wearables and adding semantics to wearable data
• Semantically analyzing tweets: Discover sentiment and context in 140 characters
• Ontology Matching in the Semantic Web, Progress and a Futuristic Approach
• Big Data - is it trustworthy?
• Participatory Sensing: A further step. Sensing through Social Media feeds.
• Uses of the Semantic web technologies applied to social networks
• Sentiment Analysis: semantic techniques and machine learning approach
• How Norwegian industry uses open data and semantic solutions.
• SEMANTICS TECHNOLOGIES IN STREAMING SERVICES
• Ontology evolution: A survey on Change Discovery approaches
The Programming Project
Group programming project
• The project shall develop an application that uses semantic technologies. The development and run-time platform are a free choice, as is the programming language. The project should be carried out in groups of at most three; working individually or in pairs is not recommended.
• Counts for 40% of the final grade.
• Final presentation: Thursday November 23rd
• Submission deadline: Monday December 18th, 14:00
Group programming project
• Examples:
– big data for emergency management
– lifting selected Norwegian public data sources
– (bot) projects for Wikipedia, Wikidata
– RDF, JSON-LD interfaces to a FLOSS project
– semantic web about public information systems
– a natively semantic proof-of-concept IS
– smart visualiser for semantic datasets
– take up a public challenge
– <<your own suggestions here>>
• Can we find a joint programming project for several groups / the whole class?
Big data for emergency management
• Social media emergency dashboard: develop a dashboard that aggregates web resources, social media, and other sources (e.g., radio feeds) in an emergency
– may be focused on a particular geographic area or emergency type
• Social media analysis for the Barcelona attack: analyse hashtag usage in the first 3 hours of the attack
– or of another emergency event (Texas?)
• Big linked dataset summariser: make semantic datasets quickly retrievable by pre-analysing their spatiality, temporality, theme, etc.
• Useful resources and datasets:– <https://www.bigdata.vestforsk.no/links/#links-home>
• related links for the BDEM project
– <http://humanitariancomp.referata.com/wiki/Big_Crisis_Data:_Social_Media_in_Disasters_and_Time-Critical_Situations>
• companion wiki to Castillo’s book
Lifting public data sources
• data.norge.no (Open Public Data in Norway)– ...or other public data sources (EU, other...)
• There are lots of open data out there – but not much of it is in semantic formats
– lifting required, developing:
• semantic wrappers around APIs
• auto-lifters for annual datasets in XLS, CSV...
• use existing lifting technologies
• Challenges: not one-off lifting: automate as much as possible, make it work over time
• Risk: supply-side only, will it be used?
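As a rough illustration of “lifting”, a few lines of Python can turn CSV rows into RDF triples in N-Triples syntax. The column names, base URI and the `population` property below are invented for the example; a real lifter would map columns to terms from an established vocabulary:

```python
import csv
import io

# Invented example data and vocabulary; only foaf:name is a real property.
BASE = "http://example.org/municipality/"

raw = "name,population\nBergen,280000\nOslo,670000\n"

def lift_row(row):
    # One CSV row -> two N-Triples lines (datatypes omitted for brevity)
    subject = f"<{BASE}{row['name']}>"
    return [
        f'{subject} <http://xmlns.com/foaf/0.1/name> "{row["name"]}" .',
        f'{subject} <{BASE}population> "{row["population"]}" .',
    ]

triples = [t for row in csv.DictReader(io.StringIO(raw)) for t in lift_row(row)]
```

The point of automating this per dataset is exactly the challenge above: the same lifter should keep working when next year's CSV arrives.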
Bots for Wikipedia, Wikidata...
• Programming Wikipedia, Wikidata
– Wikidata is natively semantic
– not natively RDF, but interfaced
– numerous bot requests
– perhaps also other relevant development tasks
• Challenges:
– less experience, (mostly) not Java-based, not natively RDF, bots can go very wrong, the bot tasks can be rather mundane, sparse documentation
• Risk:
– new type of project for us
– most bot requests are quite trivial (semantically)
Semantic web of government ISs
• There are lots of government information systems out there
– Norwegian, other national, transnational (e.g., EU)
– what information do they contain?
– how do they exchange information?
– where does our information end up?
– lift, structure, extend and use available information (e.g., in Wikipedia, Wikidata)
– provide a nice interface to the public
• Challenges: data collection needed, data may be hard to get, huge task: we can only provide partial example solutions
• Risk: bordering on an essay project
Natively semantic proof-of-concept IS
• Lots of conventional SQL-based information systems could be made semantic
– is it possible to make an information system that stores and manages all its information in RDF?
– not only semantic working data, but also information about accounts/users, access rights, user interface, workflow
• Example:
– developing spikes for a natively semantic ERP system
• Risk: many interlocking parts
Semantic interfaces to a FLOSS project
• Idea:
– lots of open-source community projects out there
– could some of them make use of semantic interfaces?
– export/import data in semantic formats
– offer SPARQL endpoints and semantic web services
• Example:
– Sindre Njøsen's Master's thesis explored adding semantic web services to Drupal
• Challenges:
– programming language, complex code base, perhaps difficult to split tasks, so-so documentation, ongoing activities
• Risk: choosing the right FLOSS project
Smart visualiser for semantic datasets
• Semantic datasets have meaning
– which we can glean from the vocabulary used
• Sgvizler (and similar tools) offer different visualisations of semantic datasets
– certain meanings may fit certain visualisations
• Example:
– the provenance vocabulary talks about sequences of activities
– perhaps a fit with Sgvizler's Gantt charts...
• Tasks:
– identify as many connections as possible between common vocabularies and standard visualisations
– provide the technology that makes it work!
• Risk: there may be existing tools
Take up a public challenge
• To direct research in fruitful directions, research organisations and conferences sometimes publish challenges
• See project HOBBIT:
– <https://project-hobbit.eu/challenges/>
• Ensures that the tasks you do are relevant :-)
• Some of the other ideas could fit
• Risk: these are research challenges, may be a bit hard
Should we find a coordinated project task?
Hadoop and MapReduce
Hadoop and MapReduce
• Hadoop is a software framework
– for massively distributed computing
• on terabytes of data and beyond
• over thousands of computing nodes (or on a laptop)
– mostly written in Java
– components:
• HDFS – Hadoop Distributed File System
• MapReduce – distributed computing model
• YARN – job tracking and process monitoring
• Common – libraries and utilities (.jar files)
– most can be run separately
– part of a bigger ecology of big-data technologies
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Figure: https://www.mindtory.com/an-introduction-to-hadoop/
Hadoop ecosystem
• Based on Google's MapReduce and GFS designs; developed at Yahoo!, today maintained by Apache
– (one of) the first big data frameworks
– today many newer frameworks exist
– still a good reference and starting point
...and much more!
HDFS
• Distributed, scalable, portable file system / data store
– optimised for mostly immutable files
• Cluster of data nodes
– splits large files (TBs to PBs) into blocks (many MBs) that are
– replicated across nodes (and racks / switches) – sharding
• Single name node
– keeps track of the blocks
– can be replicated
• TCP/IP
• Appears to clients as a single logical file storage
HDFS
• Distributed, scalable, portable file system / data store
– optimised for mostly immutable files
• Cluster of data nodes
– splits large files (TBs to PBs) into blocks (many MBs)
– blocks are replicated across nodes (and racks / switches)
• Single name node
– keeps track of the blocks
– can be replicated
• TCP/IP
• Appears to clients as a single logical file storage
Figure: https://cvw.cac.cornell.edu/mapreduce/dfs?AspxAutoDetectCookieSupport=1
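The block-splitting and replication described above can be sketched as a toy computation. The 128 MB block size and 3 replicas mirror common HDFS defaults, but the node names and the round-robin placement are invented for illustration; real HDFS placement is rack-aware:

```python
import math

# Toy illustration of HDFS-style splitting and replication.
# Node names and round-robin placement are invented.

def split_into_blocks(file_size, block_size=128 * 1024**2):
    # number of blocks a file of file_size bytes occupies
    return math.ceil(file_size / block_size)

def place_replicas(block_id, nodes, replication=3):
    # assign each replica of a block to a distinct data node
    return [nodes[(block_id + i) % len(nodes)] for i in range(replication)]

blocks = split_into_blocks(1 * 1024**4)  # a 1 TB file -> 8192 blocks
placement = {b: place_replicas(b, ["n1", "n2", "n3", "n4"]) for b in range(3)}
```

The name node's job is essentially to maintain (and repair) the `placement` map for every file in the cluster.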
Hadoop cluster
• MapReduce and HDFS may run on the same nodes
• Master node:
– runs the job tracker and name node (in Hadoop 1)
– and can be a slave too
• Slave nodes:
– run task trackers and data nodes
– possibly not both
• Tasks can be run close to their data
– moving tasks to data
Figure: https://en.wikipedia.org/wiki/Apache_Hadoop
• Single JobTracker:
– YARN ResourceManager (in Hadoop 2)
– receives MapReduce jobs from client applications
– starts a YARN MRAppMaster per application
– pushes the work to task trackers close to the data
– (simple) scheduling and rescheduling
• Multiple TaskTrackers:
– YARN NodeManager
– has task slots available (called containers)
– spawns (lots of) separate JVMs
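The “push work to task trackers close to the data” idea can be sketched as a toy scheduler. The block locations, node names and slot counts below are invented:

```python
# Toy data-locality scheduler: prefer a node that already stores the
# block and has a free task slot; otherwise fall back to the node with
# the most free slots. All names and numbers are invented.
block_locations = {"block-1": {"n1", "n2"}, "block-2": {"n2", "n3"}}
free_slots = {"n1": 0, "n2": 2, "n3": 1}

def schedule(block):
    local = [n for n in block_locations[block] if free_slots[n] > 0]
    node = sorted(local)[0] if local else max(free_slots, key=free_slots.get)
    free_slots[node] -= 1
    return node

assignment = {b: schedule(b) for b in ["block-1", "block-2"]}
# both blocks land on "n2": it stores both blocks and has free slots
```

A real JobTracker / ResourceManager adds rack-awareness, speculative execution and rescheduling of failed tasks on top of this basic preference.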
• ...so what are these tasks?
MapReduce programming model
• Two main tasks with an intermediate step:
– Map: filters and sorts local input data
– Shuffle: redistributes intermediate data across nodes
– Reduce: summarises the data in each node
...all three steps are parallel (and more can be added)
• Example:
– the inputs are texts
– we want to count the occurrences of each word
Figure: Hafeng Li - “Big Data Analytics: MapReduce”
MapReduce and HDFS
• MapReduce can run on other file systems...
• HDFS can support other computing models...
Figure: http://www.glennklockwood.com/data-intensive/hadoop/overview.html
Figure: https://wikis.nyu.edu/display/NYUHPC/Big+Data+Tutorial+1%3A+MapReduce
Data are (key, value)-pairs
• Map(k1, v1) → list(k2, v2)
– Map(<file_id>, <text>) → list(<word>, <count>)
• Reduce(k2, list(v2)) → (k2, v3) (or a list, or nothing)
– Reduce(<word>, list(<count>)) → (<word>, <count>)
– associativity and commutativity are helpful
Example
• Pseudocode:
function map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    emit (w, 1)

function reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  sum = 0
  for each pc in partialCounts:
    sum += pc
  emit (word, sum)
See the full implementation at:
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
https://en.wikipedia.org/wiki/MapReduce
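The pseudocode above can be run end-to-end as a minimal single-process sketch. Hadoop would spread many map and reduce tasks over separate JVMs on many nodes; here all three phases run in one process, and the sample documents are invented:

```python
from collections import defaultdict

# Single-process sketch of the MapReduce word-count example.

def map_phase(doc_id, document):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word
    return [(word, 1) for word in document.split()]

def shuffle(mapped_pairs):
    # Shuffle: group all intermediate values by key
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, partial_counts):
    # Reduce(k2, list(v2)) -> (k2, v3): sum the partial counts
    return (word, sum(partial_counts))

documents = {"d1": "big data big ideas", "d2": "big data"}
mapped = [pair for doc_id, text in documents.items()
          for pair in map_phase(doc_id, text)]
counts = dict(reduce_phase(w, vs) for w, vs in shuffle(mapped).items())
# counts == {"big": 3, "data": 2, "ideas": 1}
```

Because addition is associative and commutative, the partial counts can be summed in any order on any node, which is exactly why word count parallelises so well.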
Figure: https://stackoverflow.com/questions/15144578/
MapReduce internals
• Data shuffling over HTTP
• Highly configurable TaskTracker nodes
– in-memory versus on-disk tradeoffs
Figure: Mathew Rathbone - “Real World Hadoop”
More powerful MapReduce processing
• Chained MapReduce jobs
• More complex (key, value)-structures
– e.g., (IRI, (IRI, IRI)) and (IRI, (IRI, <literal>))
• Different maps as input to the same reduce:
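As a sketch of such richer (key, value)-structures, RDF-style triples can be shuffled into (subject, list of (predicate, object)) groups with the same mechanics as word count. The prefixed IRIs and the sample data are invented:

```python
from collections import defaultdict

# Invented mini-dataset of (subject, predicate, object) triples
triples = [
    ("ex:Alice", "foaf:knows", "ex:Bob"),
    ("ex:Alice", "foaf:name", '"Alice"'),
    ("ex:Bob", "foaf:name", '"Bob"'),
]

grouped = defaultdict(list)       # Shuffle: group by subject
for s, p, o in triples:
    grouped[s].append((p, o))     # Map emitted (subject, (predicate, object))

# Reduce: for example, count the properties attached to each resource
property_counts = {s: len(pos) for s, pos in grouped.items()}
# property_counts == {"ex:Alice": 2, "ex:Bob": 1}
```

The same pattern, keyed on predicate or object instead, underlies many distributed RDF statistics and join algorithms.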
More powerful MapReduce processing
• Parameter sweep:
– input data are used as control parameters
– the actual data to be analysed is shared as configuration
• Additional task and control types:
– Reduce: can be skipped
– CompressionCodec: such as gzip / gunzip
– InputSplit: before mapping, respecting record boundaries
– Combiner: local aggregation of the Map results; cuts down data transfer Mapper → Reducer
– Partitioner: controls which keys (and hence values) go to which Reducers (and thus how many are needed)
– Comparator: to control sorting of values in the Reducer
– Counter: to report Mapper and Reducer statistics
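Two of these, the Combiner and the Partitioner, can be sketched for the word-count setting. Hadoop's default partitioner hashes the key modulo the reducer count; Python's `hash()` stands in for Java's `hashCode()` here, and the sample pairs are invented:

```python
from collections import Counter

def combine(mapped_pairs):
    # Combiner: local aggregation of Map output before the shuffle,
    # so many (word, 1) pairs collapse into one (word, n) pair
    totals = Counter()
    for word, count in mapped_pairs:
        totals[word] += count
    return sorted(totals.items())

def partition(key, num_reducers):
    # Partitioner: decides which Reducer receives this key
    return hash(key) % num_reducers

combined = combine([("big", 1), ("data", 1), ("big", 1)])
# combined == [("big", 2), ("data", 1)]: fewer pairs to transfer
```

A Combiner is only safe when the reduce function is associative and commutative, as the previous slide noted, because it applies the reduction early and locally.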
Configuration
• Map:
– one map task can map one block of data
– 10-100 map tasks suggested per computing node
– should take minutes each, due to setup overhead
– example: 10 TB input data, 128 MB block size → 82 000 maps → 800-8 000 computing nodes
– output buffer size
• Reduce:
– 0.95 or 1.75 * #computing-nodes * #max-containers
– not 1 or 2 * ...: leave containers for rescheduling etc.
– 0.95 * ...: all reduces start simultaneously
– 1.75 * ...: first round can start when data available
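The sizing arithmetic above can be checked directly. The node and container counts used for the reduce heuristic are assumed example values, not slide content:

```python
# Reproducing the slide's sizing arithmetic.
TB, MB = 1024**4, 1024**2

num_maps = (10 * TB) // (128 * MB)   # one map task per 128 MB block
# num_maps == 81920, the slide's "82 000 maps"

# with 10-100 map tasks per node, the cluster needs roughly:
nodes_low, nodes_high = num_maps // 100, num_maps // 10  # 819 to 8192 nodes

# reduce-task heuristics for an assumed 1000-node cluster, 8 containers each
nodes, max_containers = 1000, 8
reduces_one_wave = int(0.95 * nodes * max_containers)    # all start at once
reduces_two_waves = int(1.75 * nodes * max_containers)   # a second wave follows
```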
Beyond MapReduce
• Newer technologies extend / replace MapReduce:
– less disk dependency
– data schemas and optimisation
– move from batch to streaming (live inputs / outputs)
– more inter-task communication
– higher-level interfaces and programming abstractions
– SQL-on-Hadoop, OLAP-on-Hadoop
– embedded support, e.g., for machine learning
MapReduce and semantic technologies
• Not much explored so far:
– the largest linked open datasets were smaller
– focus on native triple stores
– focus on automated reasoning
• More interest in big data + semantics in recent years:
– more research papers since ~2014
– linked big data
– higher-capacity triple stores
– Apache Jena Elephas (early beta):
• supports Hadoop MapReduce with Jena
• https://jena.apache.org/documentation/hadoop/
– also, e.g., Apache Giraph for graph processing
Run the Tutorial:
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Check out the JavaDoc:http://hadoop.apache.org/docs/r2.7.4/api/index.html
What to do in Two Weeks?
...and in the meantime :-)