INFO310, autumn 2017, session 1
Session 2: Semantics for Big Data
Andreas L. Opdahl <[email protected]>
Themes
• “Hangovers” (from S1):
– big data are disruptive!
– the essays
– the programming projects
• Hadoop and Big Data
– technology introduction
• Paper presentations
– learning to read and present scholarly work
– examples of recent research
– set of starting references for essays, theses...
– varying difficulty – will try to even out
Big Data as a Disruption
• Disruptive technology:
– a technology that displaces established ones, and shakes up existing industries or creates new ones
– e.g., PCs, the internet, digital media, social media
• Big data is disruptive:
– it creates new data-driven organisational forms
– new ways of doing research and science
– new ways of creating and maintaining products and services
– new threats to privacy and social order
• ...too easy to shrug off as (just) hype or a buzzword
Data-driven organisations
• “The next phase of the knowledge economy, reshaping the mode of production” (RK, p. 16)
– inward: monitor and evaluate performance in real time; reduce waste and fraud; improve strategy, planning and decision making
– outward: design new commodities, identify and target new markets, implement dynamic pricing, realise untapped potential, gain competitive advantage
• Goals: run more intelligently; flexibility and innovation; reduced risk, cost and losses; improved customer experience, return on investment and profit
• Changing organisational practice in all these areas
– and in a coordinated / integrated way
New ways of doing business
• Retail (Walmart, Kohl's): analyse sales, pricing, economic, demographic and weather data to tailor local product selection and price markdowns
• Online dating: sift through personal characteristics, reactions and communications to improve matches
• NY Police: analyse data on past arrests, paydays, sporting events, weather and holidays to deploy officers optimally
• Professional sports: massaging sports statistics to spot undervalued players
• Education: analyse data from learning management systems to improve teaching / studying
Steve Lohr (2012): The Age of Big Data, NYTimes.com
The Essays
Individual essays
• The essay shall present and discuss selected theory, technology and tools related to semantic technologies, backed by scholarly and other references
– counts for 60% of the final grade
– presentations: November 8th
– deadline: November 9th, 14:00
– send me a brief informal email proposal by next Thursday!
• Encouraged:
– more than a paper
– social media contributions (Wikipedia, Wikidata...)
– vocabulary / ontology proposals
• Previous essays available in the wiki!
Some previous essay themes
• Semanticare - Semantic web for Gestational Diabetes
• Faulty science and Big data
• SEMANTIC TECHNOLOGIES IN SEARCH ENGINES: GOOGLE AND COMPETITORS
• The future of semantics in ‘Scientific Workflows’
• Privacy in Linked Health Data
• Using Classified ads for Semantic web – Applied in the problem of immigration labor in Mexico.
• Semantic Web Technology in the Internet of Things: A Survey
• Discovering Semantic Technologies
• Utilizing the data from biofeedback-capable gaming equipment
• Visualisation of big semantic data
• Privacy and profiling
• The use of wearables and adding semantics to wearable data
• Semantically analyzing tweets: Discover sentiment and context in 140 characters
• Ontology Matching in the Semantic Web, Progress and a Futuristic Approach
• Big Data - is it trustworthy?
• Participatory Sensing: A further step. Sensing through Social Media feeds.
• Uses of the Semantic web technologies applied to social networks
• Sentiment Analysis: semantic techniques and machine learning approach
• How Norwegian industry uses open data and semantic solutions.
• SEMANTICS TECHNOLOGIES IN STREAMING SERVICES
• Ontology evolution: A survey on Change Discovery approaches
The Programming Project
Group programming project
• The project shall develop an application that uses semantic technologies. The development and run-time platform are a free choice, as is the programming language. The project should be carried out in groups of at most three; working individually or in pairs is not recommended.
• Counts for 40% of the final grade.
• Final presentation: Thursday November 23rd
• Submission deadline: Monday December 18th, 14:00
Group programming project
• Examples:
– big data for emergency management
– lifting selected Norwegian public data sources
– (bot) projects for Wikipedia, Wikidata
– RDF, JSON-LD interfaces to a FLOSS project
– semantic web about public information systems
– a natively semantic proof-of-concept IS
– smart visualiser for semantic datasets
– take up a public challenge
– <<your own suggestions here>>
• Can we find a joint programming project for several groups / the whole class?
Big data for emergency management
• Social media emergency dashboard: develop a dashboard that aggregates web resources, social media, and other sources (e.g., radio feeds) in an emergency
– may be focused on a particular geographic area or emergency type
• Social media analysis for the Barcelona attack: analyse hashtag usage in the first 3 hours of the attack
– or of another emergency event (Texas?)
• Big linked dataset summariser: make semantic datasets quickly retrievable by pre-analysing their spatiality, temporality, theme, etc.
• Useful resources and datasets:– <https://www.bigdata.vestforsk.no/links/#links-home>
• related links for the BDEM project
– <http://humanitariancomp.referata.com/wiki/Big_Crisis_Data:_Social_Media_in_Disasters_and_Time-Critical_Situations>
• companion wiki to Castillo’s book
Lifting public data sources
• data.norge.no (Open Public Data in Norway)– ...or other public data sources (EU, other...)
• There are lots of open data out there – but not much of it is in semantic formats
– lifting required, developing:
• semantic wrappers around APIs
• auto-lifters for annual datasets in XLS, CSV...
• use existing lifting technologies
• Challenges: not one-off lifting: automate as much as possible, make it work over time
• Risk: supply-side only, will it be used?
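As a rough illustration of “lifting”, a few lines of Python can turn CSV rows into RDF triples in N-Triples syntax. The column names, base URI and the `population` property below are invented for the example; a real lifter would map columns to terms from an established vocabulary:

```python
import csv
import io

# Invented example data and vocabulary; only foaf:name is a real property.
BASE = "http://example.org/municipality/"

raw = "name,population\nBergen,280000\nOslo,670000\n"

def lift_row(row):
    # One CSV row -> two N-Triples lines (datatypes omitted for brevity)
    subject = f"<{BASE}{row['name']}>"
    return [
        f'{subject} <http://xmlns.com/foaf/0.1/name> "{row["name"]}" .',
        f'{subject} <{BASE}population> "{row["population"]}" .',
    ]

triples = [t for row in csv.DictReader(io.StringIO(raw)) for t in lift_row(row)]
```

The point of automating this per dataset is exactly the challenge above: the same lifter should keep working when next year's CSV arrives.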
Bots for Wikipedia, Wikidata...
• Programming Wikipedia, Wikidata
– Wikidata is natively semantic
– not natively RDF, but interfaced
– numerous bot requests
– perhaps also other relevant development tasks
• Challenges:
– less experience, (mostly) not Java-based, not natively RDF, bots can go very wrong, the bot tasks can be rather mundane, sparse documentation
• Risk:
– new type of project for us
– most bot requests are quite trivial (semantically)
Semantic web of government ISs
• There are lots of government information systems out there
– Norwegian, other national, transnational (e.g., EU)
– what information do they contain?
– how do they exchange information?
– where does our information end up?
– lift, structure, extend and use available information (e.g., in Wikipedia, Wikidata)
– provide a nice interface to the public
• Challenges: data collection needed, data may be hard to get, huge task: we can only provide partial example solutions
• Risk: bordering on an essay project
Natively semantic proof-of-concept IS
• Lots of conventional SQL-based information systems could be made semantic
– is it possible to make an information system that stores and manages all its information in RDF?
– not only semantic working data, but also information about accounts/users, access rights, user interface, workflow
• Example:
– developing spikes for a natively semantic ERP system
• Risk: many interlocking parts
Semantic interfaces to a FLOSS project
• Idea:
– lots of open-source community projects out there
– could some of them make use of semantic interfaces?
– export/import data in semantic formats
– offer SPARQL endpoints and semantic web services
• Example:
– Sindre Njøsen's Master's thesis explored adding semantic web services to Drupal
• Challenges:
– programming language, complex code base, perhaps difficult to split tasks, so-so documentation, ongoing activities
• Risk: choosing the right FLOSS project
Smart visualiser for semantic datasets
• Semantic datasets have meaning
– which we can glean from the vocabulary used
• Sgvizler (and similar tools) offer different visualisations of semantic datasets
– certain meanings may fit certain visualisations
• Example:
– the provenance vocabulary talks about sequences of activities
– perhaps a fit with Sgvizler's Gantt charts...
• Tasks:
– identify as many connections as possible between common vocabularies and standard visualisations
– provide the technology that makes it work!
• Risk: there may be existing tools
Take up a public challenge
• To direct research in fruitful directions, research organisations and conferences sometimes publish challenges
• See project HOBBIT:
– <https://project-hobbit.eu/challenges/>
• Ensures that the tasks you do are relevant :-)
• Some of the other ideas could fit
• Risk: these are research challenges, may be a bit hard
Should we find a coordinated project task?
Hadoop and MapReduce
Hadoop and MapReduce
• Hadoop is a software framework
– for massively distributed computing
• on terabytes of data and beyond
• over thousands of computing nodes (or on a laptop)
– mostly written in Java
– components:
• HDFS – Hadoop Distributed File System
• MapReduce – distributed computing model
• YARN – job tracking and process monitoring
• Common – libraries and utilities (.jar files)
– most can be run separately
– part of a bigger ecology of big-data technologies
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Figure: https://www.mindtory.com/an-introduction-to-hadoop/
Hadoop ecosystem
• Based on Google's MapReduce and GFS designs; developed at Yahoo!, today maintained by Apache
– (one of) the first big data frameworks
– today many newer frameworks exist
– still a good reference and starting point
...and much more!
HDFS
• Distributed, scalable, portable file system / data store
– optimised for mostly immutable files
• Cluster of data nodes
– splits large files (TBs to PBs) into blocks (many MBs) that are
– replicated across nodes (and racks / switches) – sharding
• Single name node
– keeps track of the blocks
– can be replicated
• TCP/IP
• Appears to clients as a single logical file storage
HDFS
• Distributed, scalable, portable file system / data store
– optimised for mostly immutable files
• Cluster of data nodes
– splits large files (TBs to PBs) into blocks (many MBs)
– blocks are replicated across nodes (and racks / switches)
• Single name node
– keeps track of the blocks
– can be replicated
• TCP/IP
• Appears to clients as a single logical file storage
Figure: https://cvw.cac.cornell.edu/mapreduce/dfs?AspxAutoDetectCookieSupport=1
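The block-splitting and replication described above can be sketched as a toy computation. The 128 MB block size and 3 replicas mirror common HDFS defaults, but the node names and the round-robin placement are invented for illustration; real HDFS placement is rack-aware:

```python
import math

# Toy illustration of HDFS-style splitting and replication.
# Node names and round-robin placement are invented.

def split_into_blocks(file_size, block_size=128 * 1024**2):
    # number of blocks a file of file_size bytes occupies
    return math.ceil(file_size / block_size)

def place_replicas(block_id, nodes, replication=3):
    # assign each replica of a block to a distinct data node
    return [nodes[(block_id + i) % len(nodes)] for i in range(replication)]

blocks = split_into_blocks(1 * 1024**4)  # a 1 TB file -> 8192 blocks
placement = {b: place_replicas(b, ["n1", "n2", "n3", "n4"]) for b in range(3)}
```

The name node's job is essentially to maintain (and repair) the `placement` map for every file in the cluster.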
Hadoop cluster
• MapReduce and HDFS may run on the same nodes
• Master node:
– runs the job tracker and name node (in Hadoop 1)
– and can be a slave too
• Slave nodes:
– run task trackers and data nodes
– possibly not both
• Tasks can be run close to their data
– moving tasks to data
Figure: https://en.wikipedia.org/wiki/Apache_Hadoop
• Single JobTracker:
– YARN ResourceManager (in Hadoop 2)
– receives MapReduce jobs from client applications
– starts a YARN MRAppMaster per application
– pushes the work to task trackers close to the data
– (simple) scheduling and rescheduling
• Multiple TaskTrackers:
– YARN NodeManager
– has task slots available (called containers)
– spawns (lots of) separate JVMs
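The “push work to task trackers close to the data” idea can be sketched as a toy scheduler. The block locations, node names and slot counts below are invented:

```python
# Toy data-locality scheduler: prefer a node that already stores the
# block and has a free task slot; otherwise fall back to the node with
# the most free slots. All names and numbers are invented.
block_locations = {"block-1": {"n1", "n2"}, "block-2": {"n2", "n3"}}
free_slots = {"n1": 0, "n2": 2, "n3": 1}

def schedule(block):
    local = [n for n in block_locations[block] if free_slots[n] > 0]
    node = sorted(local)[0] if local else max(free_slots, key=free_slots.get)
    free_slots[node] -= 1
    return node

assignment = {b: schedule(b) for b in ["block-1", "block-2"]}
# both blocks land on "n2": it stores both blocks and has free slots
```

A real JobTracker / ResourceManager adds rack-awareness, speculative execution and rescheduling of failed tasks on top of this basic preference.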
• ...so what are these tasks?
MapReduce programming model
• Two main tasks with an intermediate step:
– Map: filters and sorts local input data
– Shuffle: redistributes intermediate data across nodes
– Reduce: summarises the data in each node
...all three steps are parallel (and more can be added)
• Example:
– the inputs are texts
– we want to count the occurrences of each word
Figure: Hafeng Li - “Big Data Analytics: MapReduce”
MapReduce and HDFS
• MapReduce can run on other file systems...
• HDFS can support other computing models...
Figure: http://www.glennklockwood.com/data-intensive/hadoop/overview.html
Figure: https://wikis.nyu.edu/display/NYUHPC/Big+Data+Tutorial+1%3A+MapReduce
Data are (key, value)-pairs
• Map(k1, v1) → list(k2, v2)
– Map(<file_id>, <text>) → list(<word>, <count>)
• Reduce(k2, list(v2)) → (k2, v3) (or a list, or nothing)
– Reduce(<word>, list(<count>)) → (<word>, <count>)
– associativity and commutativity are helpful
Example
• Pseudocode:
function map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    emit (w, 1)

function reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  sum = 0
  for each pc in partialCounts:
    sum += pc
  emit (word, sum)
See the full implementation at:
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
https://en.wikipedia.org/wiki/MapReduce
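The pseudocode above can be run end-to-end as a minimal single-process sketch. Hadoop would spread many map and reduce tasks over separate JVMs on many nodes; here all three phases run in one process, and the sample documents are invented:

```python
from collections import defaultdict

# Single-process sketch of the MapReduce word-count example.

def map_phase(doc_id, document):
    # Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word
    return [(word, 1) for word in document.split()]

def shuffle(mapped_pairs):
    # Shuffle: group all intermediate values by key
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(word, partial_counts):
    # Reduce(k2, list(v2)) -> (k2, v3): sum the partial counts
    return (word, sum(partial_counts))

documents = {"d1": "big data big ideas", "d2": "big data"}
mapped = [pair for doc_id, text in documents.items()
          for pair in map_phase(doc_id, text)]
counts = dict(reduce_phase(w, vs) for w, vs in shuffle(mapped).items())
# counts == {"big": 3, "data": 2, "ideas": 1}
```

Because addition is associative and commutative, the partial counts can be summed in any order on any node, which is exactly why word count parallelises so well.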
Figure: https://stackoverflow.com/questions/15144578/
MapReduce internals
• Data shuffling over HTTP
• Highly configurable TaskTracker nodes
– in-memory versus on-disk tradeoffs
Figure: Mathew Rathbone - “Real World Hadoop”
More powerful MapReduce processing
• Chained MapReduce jobs
• More complex (key, value)-structures
– e.g., (IRI, (IRI, IRI)) and (IRI, (IRI, <literal>))
• Different maps as input to the same reduce:
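As a sketch of such richer (key, value)-structures, RDF-style triples can be shuffled into (subject, list of (predicate, object)) groups with the same mechanics as word count. The prefixed IRIs and the sample data are invented:

```python
from collections import defaultdict

# Invented mini-dataset of (subject, predicate, object) triples
triples = [
    ("ex:Alice", "foaf:knows", "ex:Bob"),
    ("ex:Alice", "foaf:name", '"Alice"'),
    ("ex:Bob", "foaf:name", '"Bob"'),
]

grouped = defaultdict(list)       # Shuffle: group by subject
for s, p, o in triples:
    grouped[s].append((p, o))     # Map emitted (subject, (predicate, object))

# Reduce: for example, count the properties attached to each resource
property_counts = {s: len(pos) for s, pos in grouped.items()}
# property_counts == {"ex:Alice": 2, "ex:Bob": 1}
```

The same pattern, keyed on predicate or object instead, underlies many distributed RDF statistics and join algorithms.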
More powerful MapReduce processing
• Parameter sweep:
– input data are used as control parameters
– the actual data to be analysed is shared as configuration
• Additional task and control types:
– Reduce: can be skipped
– CompressionCodec: such as gzip / gunzip
– InputSplit: before mapping, respecting record boundaries
– Combiner: local aggregation of the Map results; cuts down data transfer Mapper → Reducer
– Partitioner: controls which keys (and hence values) go to which Reducers (and thus how many are needed)
– Comparator: to control sorting of values in the Reducer
– Counter: to report Mapper and Reducer statistics
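Two of these, the Combiner and the Partitioner, can be sketched for the word-count setting. Hadoop's default partitioner hashes the key modulo the reducer count; Python's `hash()` stands in for Java's `hashCode()` here, and the sample pairs are invented:

```python
from collections import Counter

def combine(mapped_pairs):
    # Combiner: local aggregation of Map output before the shuffle,
    # so many (word, 1) pairs collapse into one (word, n) pair
    totals = Counter()
    for word, count in mapped_pairs:
        totals[word] += count
    return sorted(totals.items())

def partition(key, num_reducers):
    # Partitioner: decides which Reducer receives this key
    return hash(key) % num_reducers

combined = combine([("big", 1), ("data", 1), ("big", 1)])
# combined == [("big", 2), ("data", 1)]: fewer pairs to transfer
```

A Combiner is only safe when the reduce function is associative and commutative, as the previous slide noted, because it applies the reduction early and locally.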
Configuration
• Map:
– one map task can map one block of data
– 10-100 map tasks suggested per computing node
– should take minutes each, due to setup overhead
– example: 10 TB input data, 128 MB block size → 82 000 maps → 800-8 000 computing nodes
– output buffer size
• Reduce:
– 0.95 or 1.75 * #computing-nodes * #max-containers
– not 1 or 2 * ...: leave containers for rescheduling etc.
– 0.95 * ...: all reduces start simultaneously
– 1.75 * ...: first round can start when data available
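The sizing arithmetic above can be checked directly. The node and container counts used for the reduce heuristic are assumed example values, not slide content:

```python
# Reproducing the slide's sizing arithmetic.
TB, MB = 1024**4, 1024**2

num_maps = (10 * TB) // (128 * MB)   # one map task per 128 MB block
# num_maps == 81920, the slide's "82 000 maps"

# with 10-100 map tasks per node, the cluster needs roughly:
nodes_low, nodes_high = num_maps // 100, num_maps // 10  # 819 to 8192 nodes

# reduce-task heuristics for an assumed 1000-node cluster, 8 containers each
nodes, max_containers = 1000, 8
reduces_one_wave = int(0.95 * nodes * max_containers)    # all start at once
reduces_two_waves = int(1.75 * nodes * max_containers)   # a second wave follows
```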
Beyond MapReduce
• Newer technologies extend / replace MapReduce:
– less disk dependency
– data schemas and optimisation
– move from batch to streaming (live inputs / outputs)
– more inter-task communication
– higher-level interfaces and programming abstractions
– SQL-on-Hadoop, OLAP-on-Hadoop
– embedded support, e.g., for machine learning
MapReduce and semantic technologies
• Not much explored so far:
– the largest linked open datasets were smaller
– focus on native triple stores
– focus on automated reasoning
• More interest in big data + semantics in recent years:
– more research papers since ~2014
– linked big data
– higher-capacity triple stores
– Apache Jena Elephas (early beta):
• supports Hadoop MapReduce with Jena
• https://jena.apache.org/documentation/hadoop/
– also, e.g., Apache Giraph for graph processing
Run the Tutorial:
https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html
Check out the JavaDoc:http://hadoop.apache.org/docs/r2.7.4/api/index.html
What to do in Two Weeks?
...and in the meantime :-)