Semantic Analysis in Language Technology http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm
Semantic Word Clouds
Marina Santini [email protected]
Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden
Spring 2016
Previous lecture: Ontologies
2
Semantic Web & Ontologies
• The goal of the Semantic Web is to allow web information and services to be more effectively exploited by humans and automated tools.
• Essentially, the focus of the Semantic Web is to share data instead of documents.
• This data must be ”meaningful” both for humans and for machines (i.e. automated tools and web applications).
• Q: How are we going to represent meaning and knowledge on the web?
• A: … via annotation.
• Knowledge is represented in the form of rich conceptual schemas/formalisms called ontologies.
• Therefore, ontologies are the backbone of the Semantic Web.
• Ontologies give formally defined meanings to the terms used in annotations, transforming them into semantic annotations.
3
Ontologies are… • … concepts that are hierarchically organized
4
Tree of Porphyry, 3rd century AD
WordNet, 21st century AD (see Lect 5, e.g. similarity measures)
Reasoning: RDF/OWL vs Databases (and other data structures) OWL axioms behave like inference rules rather than database constraints.
Class: Phoenix
    SubClassOf: isPetOf only Wizard

Individual: Fawkes
    Types: Phoenix
    Facts: isPetOf Dumbledore
• Fawkes is said to be a Phoenix and to be the pet of Dumbledore, and it is also stated that only a Wizard can have a pet Phoenix.
• In OWL, this leads to the implication that Dumbledore is a Wizard. That is, if we were to query the ontology for instances of Wizard, then Dumbledore would be part of the answer.
• In a database setting the schema could include a similar statement about the Phoenix class, but in this case it would be interpreted as a constraint on the data: adding the fact that Fawkes isPetOf Dumbledore without Dumbledore being already known to be a Wizard would lead to an invalid database state, and such an update would therefore be rejected by a database management system as a constraint violation (see the sketch below).
5
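The difference can be made concrete with a minimal pure-Python sketch (illustrative only; this is not an OWL reasoner or a DBMS, and all names come from the example above): the same axiom is read once as an inference rule and once as a constraint.

# Toy illustration: "only a Wizard can have a pet Phoenix", read two ways.
facts = {("Fawkes", "type", "Phoenix"),
         ("Fawkes", "isPetOf", "Dumbledore")}

def owl_style_inference(triples):
    """Open-world reading: the axiom is an inference rule.
    If x is a Phoenix and x isPetOf y, then y is inferred to be a Wizard."""
    inferred = set(triples)
    for (x, p, y) in triples:
        if p == "isPetOf" and (x, "type", "Phoenix") in triples:
            inferred.add((y, "type", "Wizard"))
    return inferred

def database_style_check(triples):
    """Closed-world reading: the axiom is a constraint.
    The isPetOf fact is only valid if the owner is already known to be a Wizard."""
    for (x, p, y) in triples:
        if p == "isPetOf" and (x, "type", "Phoenix") in triples:
            if (y, "type", "Wizard") not in triples:
                return False  # constraint violation -> update rejected
    return True

print(owl_style_inference(facts))   # contains ('Dumbledore', 'type', 'Wizard')
print(database_style_check(facts))  # False: invalid database state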
So, what is an ontology for us?
6
“An ontology is a FORMAL, EXPLICIT specification of a SHARED conceptualization”
Studer, Benjamins, Fensel. Knowledge Engineering: Principles and Methods. Data and Knowledge Engineering. 25 (1998) 161-197
An ontology is an explicit specification of a conceptualization
Gruber, T. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition. Vol. 5. 1993. 199-220
Abstract model and simplified view of some phenomenon in the world that we want to represent
Machine-readable
Concepts, properties, relations, functions, constraints and axioms are explicitly defined
Consensual Knowledge
How to build an ontology
Generally speaking (and roughly put), when designing an ontology, four main components are used:
1. Classes
2. Relations
3. Axioms
4. Instances
7
Practical Activity: emotions
8
Your remarks:
• Emotions are ambiguous:
e.g. happiness can also be ill-directed
• The polarity of some emotions cannot be assessed…
• etc.
Classes, Relations, Axioms, Instances, etc.
Occupational psychology (Wikipedia)
• Industrial and organizational psychology (also known as I–O psychology, occupational psychology, work psychology, WO psychology, IWO psychology and business psychology) is the scientific study of human behavior in the workplace and applies psychological theories and principles to organizations and individuals in their workplace.
• I–O psychologists are trained in the scientist–practitioner model. I–O psychologists contribute to an organization's success by improving the performance, motivation, job satisfaction, occupational safety and health as well as the overall health and well-being of its employees. An I–O psychologist conducts research on employee behaviors and attitudes, and how these can be improved through hiring practices, training programs, feedback, and management systems.
9
In summary…
Why build an ontology?
• To share a common understanding of the structure of information among people or machines
• To make domain assumptions explicit
• Often based on a controlled vocabulary
• To analyze domain knowledge
• To enable reuse of domain knowledge
10
Ontologies and Tags
• Ontologies and tagging systems are two different ways to organize the knowledge present on the Web.
• The first has a formal foundation that derives from description logic and artificial intelligence. Domain experts decide the terms.
• The second is simpler and integrates heterogeneous content; it is based on the collaboration of users in Web 2.0: user-generated annotation.
11
Folksonomies
• Tagging facilities within Web 2.0 applications have shown how it might be possible for user communities to collaboratively annotate web content, and create simple forms of ontology via the development of loosely-hierarchically organised sets of tags, often called folksonomies….
12
Folksonomy = Social Tagging
• Folksonomies (also known as social tagging) are user-defined metadata collections.
• Users do not deliberately create folksonomies and there is rarely a prescribed purpose, but a folksonomy evolves when many users create or store content at particular sites and identify what they think the content is about.
• “Tag clouds” pinpoint the frequency of certain tags.
13
• A common way to organize tags is in tag clouds…
14
Automatic folksonomy construction
• The collective knowledge expressed through user-generated tags has great potential.
• However, we need tools to efficiently aggregate data from large numbers of users with highly idiosyncratic vocabularies and invented words or expressions.
• Many approaches to automatic folksonomy construction combine tags using statistical methods ...
• Ample space for improvement…
15
Ontology, taxonomy, folksonomy, etc.
• Many different definitions…
• A good summary and interpretation is here: http://www.ideaeng.com/taxonomies-ontologies-0602
16
Today…
• We will talk more generally about word clouds…
17
Further Reading
Semantic Similarity from Natural Language and Ontology Analysis by Sébastien Harispe, Sylvie Ranwez, Stefan Janaqi, and Jacky Montmain. Synthesis Lectures on Human Language Technologies, May 2015, Vol. 8, No. 1
• The two state-of-the-art approaches for estimating and quantifying semantic similarities/relatedness of semantic entities are presented in detail: the first one relies on corpora analysis and is based on Natural Language Processing techniques and semantic models, while the second is based on more or less formal, computer-readable and workable forms of knowledge such as semantic networks, thesauri or ontologies.
18
Previous lecture: the end
19
Acknowledgements
This presentation is based on the following paper:
• Barth et al. (2014). Experimental Comparison of Semantic Word Clouds. In Experimental Algorithms, Volume 8504 of the series Lecture Notes in Computer Science, pp. 247-258
– Link: https://www.cs.arizona.edu/~kobourov/wordle2.pdf
Some slides have been borrowed from Sergey Pupyrev.
20
Today
• Experiments on semantics-preserving word clouds, in which semantically related words are close to each other.
21
Outline
• What is a Word Cloud?
• 3 early algorithms
• 3 new algorithms
• Metrics & Quantitative Evaluation
22
Word Clouds
• Word clouds have become a standard tool for abstracting, visualizing and comparing texts…
• We could apply the same or similar techniques to the huge amounts of tags produced by users interacting in social networks
23
Comparison & Conceptualization Tool
24
• Word Clouds as a tool for ”conceptualizing” documents. Cf. ontologies
• Ex: 2008, comparison of speeches: Obama vs McCain
• Cf. Lect 10: Extractive summarization & Abstractive summarization
Word Clouds and Tag Clouds…
• … are often used to represent importance among terms (e.g. band popularity) or serve as a navigation tool (e.g. Google search results).
25
The Problem…
• How to compute semantics-preserving word clouds in which semantically related words are close to each other?
26
Wordle http://www.wordle.net
• Practical tools, like Wordle, make word cloud visualization easy.
They offer an appealing way to SUMMARIZE text…
Shortcoming: they do not capture the relationships between words in any way, since word placement is independent of context
27
Many word clouds are arranged randomly (look also at the scattered colours)
28
Patterns and Vicinity/Adjacency
Humans are natural pattern-seekers: if they see two words close to each other in a word cloud, they spontaneously think they are related…
29
In Linguistics and NLP…
• This natural tendency to link spatial vicinity to semantic relatedness is exploited as evidence that words are semantically related or semantically similar…
Remember: ”You shall know a word by the company it keeps” (Firth, J. R. 1957:11)
30
So, it makes sense to place such related words close to each other (look also at the color distribution)
31
Semantic word clouds have higher user satisfaction compared to other layouts…
32
All recent word cloud visualization tools aim to incorporate semantics in the layout…
33
… but none of them provide any guarantee about the quality of the layout in terms of semantics
34
Early algorithms: Force-Directed Graph Layout
• Most of the existing algorithms are based on force-directed graph layout.
• Force-directed graph drawing algorithms are a class of algorithms for drawing graphs in an aesthetically pleasing way:
– Attractive forces between pairs of words reduce empty space
– Repulsive forces ensure that words do not overlap
– A final force preserves semantic relations between words.
35
Some of the most flexible algorithms for calculating layouts of simple undirected graphs belong to a class known as force-directed algorithms. Such algorithms calculate the layout of a graph using only information contained within the structure of the graph itself, rather than relying on domain-specific knowledge. Graphs drawn with these algorithms tend to be aesthetically pleasing, exhibit symmetries, and tend to produce crossing-free layouts for planar graphs.
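As a rough sketch of how such forces work (illustrative only, assuming numpy is available; this is a generic Fruchterman–Reingold-style step, not the exact procedure of any of the tools discussed here):

import numpy as np

def force_directed_step(pos, edges, k=1.0, step=0.05):
    """One iteration of a basic force-directed layout.
    pos:   (n, 2) array of current word positions
    edges: list of (i, j, weight) pairs of semantically related words"""
    n = len(pos)
    disp = np.zeros_like(pos)
    # repulsive forces between all pairs: keep words from crowding each other
    for i in range(n):
        for j in range(n):
            if i != j:
                d = pos[i] - pos[j]
                dist = max(np.linalg.norm(d), 1e-9)
                disp[i] += (d / dist) * (k * k / dist)
    # attractive forces along edges: pull semantically related words together
    for i, j, w in edges:
        d = pos[i] - pos[j]
        dist = max(np.linalg.norm(d), 1e-9)
        pull = (d / dist) * (dist * dist / k) * w
        disp[i] -= pull
        disp[j] += pull
    return pos + step * disp

# toy usage: three words, the first two semantically related
positions = np.random.rand(3, 2)
for _ in range(100):
    positions = force_directed_step(positions, edges=[(0, 1, 1.0)])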
Newer Algorithms: rectangle representation of graphs
• Vertex-weighted and edge-weighted graph:
– The vertices of the graph are the words
• Their weight corresponds to some measure of importance (e.g. word frequency)
– The edges capture the semantic relatedness of pairs of words (e.g. co-occurrence)
• Their weight corresponds to the strength of the relation
– Each vertex is drawn as a box (rectangle) with dimensions determined by its weight
– The realized adjacency is the sum of the edge weights for all pairs of touching boxes.
– The goal is to maximize the realized adjacencies (a sketch of this computation follows below).
36
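A minimal sketch of how the realized adjacency of a given layout could be computed (the box format, function names and toy numbers are my own, not taken from the paper):

def boxes_touch(a, b, eps=1e-6):
    """True if two axis-aligned boxes (x, y, width, height) share a boundary
    segment, i.e. they are adjacent but do not overlap."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    share_vertical_edge = (abs((ax + aw) - bx) < eps or abs((bx + bw) - ax) < eps) \
        and min(ay + ah, by + bh) - max(ay, by) > eps
    share_horizontal_edge = (abs((ay + ah) - by) < eps or abs((by + bh) - ay) < eps) \
        and min(ax + aw, bx + bw) - max(ax, bx) > eps
    return share_vertical_edge or share_horizontal_edge

def realized_adjacency(boxes, edge_weights):
    """Sum of semantic-relatedness weights over all pairs of touching boxes.
    boxes: dict word -> (x, y, w, h); edge_weights: dict (word1, word2) -> weight."""
    total = 0.0
    for (u, v), w in edge_weights.items():
        if u in boxes and v in boxes and boxes_touch(boxes[u], boxes[v]):
            total += w
    return total

# toy usage: "data" touches "computer" on its right side, "apricot" is far away
layout = {"data": (0, 0, 4, 2), "computer": (4, 0, 6, 2), "apricot": (20, 20, 5, 2)}
weights = {("data", "computer"): 0.8, ("data", "apricot"): 0.1}
print(realized_adjacency(layout, weights))  # 0.8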
Purpose of the experiments that are shown here:
• Semantics preservation in terms of closeness/vicinity/adjacency
37
Example
• A contact of two boxes is a common boundary.
• The contact of two boxes is interpreted as semantic relatedness.
• The contact of two boxes can be calculated, so the adjacency can be computed and evaluated.
38
Preprocessing: 1) Term Extraction 2) Ranking 3) Similarity/Dissimilarity Computation (a sketch of steps 1–2 follows below)
39
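A minimal sketch of preprocessing steps 1–2 (illustrative assumptions: simple regex tokenization, a tiny stopword list, frequency-based ranking). Step 3, the similarity computation, is illustrated with the cosine example further below.

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}  # tiny illustrative list

def extract_terms(text):
    """1) Term extraction: lowercase tokens, drop stopwords and very short words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

def rank_terms(terms, n=50):
    """2) Ranking: order terms by frequency; the rank later determines the box size."""
    return Counter(terms).most_common(n)

text = "The word cloud summarizes the text: related words in the text co-occur."
print(rank_terms(extract_terms(text), n=5))
# e.g. [('text', 2), ('word', 1), ('cloud', 1), ('summarizes', 1), ('related', 1)]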
• Similarity/dissimilarity matrix
40
Lect 6: Repetition

              large   data   computer
apricot         1       0       0
digital         0       1       2
information     1       6       1
41
Which pair of words is more similar?
cosine(apricot, information) = ?
cosine(digital, information) = ?
cosine(apricot, digital) = ?

cos(v, w) = \frac{v \cdot w}{\lVert v \rVert \, \lVert w \rVert} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^{2}} \, \sqrt{\sum_{i=1}^{N} w_i^{2}}}

cosine(apricot, information) = (1 + 0 + 0) / (√(1+0+0) · √(1+36+1)) = 1/√38 ≈ .16
cosine(digital, information) = (0 + 6 + 2) / (√(0+1+4) · √(1+36+1)) = 8/(√5 · √38) ≈ .58
cosine(apricot, digital) = (0 + 0 + 0) / (√(1+0+0) · √(0+1+4)) = 0
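The same computation as a small sketch (assuming numpy; the vectors are the co-occurrence rows from the table above):

import numpy as np

def cosine(v, w):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# co-occurrence counts with the context words (large, data, computer)
apricot     = np.array([1, 0, 0])
digital     = np.array([0, 1, 2])
information = np.array([1, 6, 1])

print(round(cosine(apricot, information), 2))  # 0.16
print(round(cosine(digital, information), 2))  # 0.58
print(round(cosine(apricot, digital), 2))      # 0.0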
Lect 06: Other possible similarity measures
42
Input – Output
• The input for all algorithms is:
– a collection of n rectangles, each with a fixed width and height proportional to the rank of the word
– a similarity/dissimilarity matrix
• The output is a set of non-overlapping positions for the rectangles.
43
Early Algorithms
1. Wordle (Random)
2. Context-Preserving Word Cloud Visualization (CPWCV)
3. Seam Carving
44
Wordle → Random
• The Wordle algorithm places one word at a time in a greedy fashion, i.e. aiming to use space as efficiently as possible.
• First the words are sorted by weight/rank in decreasing order.
• Then, for each word in that order, a position is picked at random (a toy sketch follows below).
45
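A toy sketch of this greedy random strategy (heavily simplified and purely illustrative: the real Wordle additionally resolves collisions by moving a word outward along a spiral and works with actual glyph shapes):

import random

def overlaps(a, b):
    """Axis-aligned rectangle overlap test; rectangles are (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def random_greedy_layout(words, canvas=(100, 100), tries=1000):
    """Place words one at a time, heaviest first, at random non-overlapping positions.
    words: list of (word, width, height), already sized by weight/rank."""
    words = sorted(words, key=lambda t: t[1] * t[2], reverse=True)  # biggest boxes first
    placed = {}
    for word, w, h in words:
        for _ in range(tries):
            x = random.uniform(0, canvas[0] - w)
            y = random.uniform(0, canvas[1] - h)
            box = (x, y, w, h)
            if not any(overlaps(box, other) for other in placed.values()):
                placed[word] = box
                break
    return placed

print(random_greedy_layout([("cloud", 30, 10), ("word", 20, 8), ("tag", 12, 6)]))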
1: Random
46
2: Random
47
3: Random
48
4: Random
49
5: Random
50
6: Random
51
Context-Preserving Word Cloud Visualization (CPWCV)
• First, a dissimilarity matrix is computed and Multidimensional Scaling (MDS) is performed
• Second, an effort is made to create a compact layout
52
Multidimensional Scaling (MDS) aims at detecting meaningful underlying dimensions in the data.
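A minimal sketch of this first step (assumes scikit-learn is installed; the 3×3 matrix below is simply 1 − cosine for the apricot/digital/information example from earlier): MDS maps the pairwise dissimilarities to 2D positions, so dissimilar words start out far apart.

import numpy as np
from sklearn.manifold import MDS

# illustrative dissimilarity matrix for (apricot, digital, information):
# small value = semantically close, large value = unrelated
dissimilarity = np.array([
    [0.00, 1.00, 0.84],
    [1.00, 0.00, 0.42],
    [0.84, 0.42, 0.00],
])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
positions = mds.fit_transform(dissimilarity)  # one (x, y) point per word
print(positions)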
1: Context-Preserving
53
2: Context-Preserving: repulsive force
54
3: Context-Preserving: attractive force
55
Seam Carving
• Basically, an algorithm for image resizing
• It was invented at Mitsubishi Electric Research Laboratories (MERL)
56
1: Seam Carving
57
2: Seam Carving: space is divided into regions
58
3: Seam Carving: empty paths trimmed out iteratively
59
4: Seam Carving
60
5: Seam Carving
61
6: Seam Carving: space divided into regions
62
7: Seam Carving
63
3 New Algorithms
1. Inflate and Push
2. Star Forest
3. Cycle Cover
64
Inflate-and-Push
• Simple heuristic method for word layout, which aims to preserve semantic relations between pairs of words.
• Based on:
1. Heuristics: scaling down all word rectangles by some constant;
2. Computing MDS (multidimensional scaling) on the dissimilarity matrix;
3. Iteratively increasing the size of the rectangles by 5% (i.e. ”inflating” words);
4. When words overlap, applying a force-directed algorithm to ”push” words away (a sketch follows below).
65
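A very rough sketch of how the four steps could fit together (my own simplified reconstruction under stated assumptions — numpy and scikit-learn available, boxes treated as axis-aligned rectangles centred on MDS coordinates — not the authors' implementation):

import numpy as np
from sklearn.manifold import MDS

def inflate_and_push(sizes, dissimilarity, shrink=0.1, grow=1.05, iters=200):
    """Simplified Inflate-and-Push sketch.
    sizes:         (n, 2) full word box sizes (width, height)
    dissimilarity: (n, n) pairwise dissimilarity matrix
    Returns box centres and current box sizes."""
    sizes = np.asarray(sizes, dtype=float)
    n = len(sizes)
    # 1) scale all word rectangles down by a constant
    boxes = sizes * shrink
    # 2) MDS on the dissimilarity matrix gives the initial centres
    centres = MDS(n_components=2, dissimilarity="precomputed",
                  random_state=0).fit_transform(np.asarray(dissimilarity, dtype=float))
    for _ in range(iters):
        # 3) inflate: grow every rectangle by 5% (capped at its full size)
        boxes = np.minimum(boxes * grow, sizes)
        # 4) push: move pairs of overlapping boxes apart, force-directed style
        for i in range(n):
            for j in range(i + 1, n):
                d = centres[j] - centres[i]
                half = (boxes[i] + boxes[j]) / 2.0  # required separation per axis
                if abs(d[0]) < half[0] and abs(d[1]) < half[1]:
                    dist = max(np.linalg.norm(d), 1e-9)
                    push = (min(half - np.abs(d)) / 2.0) * (d / dist)
                    centres[i] -= push
                    centres[j] += push
    return centres, boxes

# toy usage with three words and the illustrative dissimilarity matrix from above
centres, boxes = inflate_and_push(
    [[6, 2], [5, 2], [9, 2]],
    [[0.00, 1.00, 0.84], [1.00, 0.00, 0.42], [0.84, 0.42, 0.00]])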
Inflate: starting point
66
Inflate: scaling down
67
Inflate: semantically related words are placed close to each other. Apply ”inflate words” (5%) iteratively.
68
Inflate: ”push words”: repulsive force to resolve overlaps
69
Inflate: final stage
70
Star Forest
• A star is a tree in which one central vertex is adjacent to all the other vertices (the leaves).
• A star forest is a forest whose connected components are all stars.
71
Repetition: trees and graphs
• A tree is a special form of graph, i.e. a minimally connected graph having only one path between any two vertices.
• In a general graph there can be more than one path, i.e. a graph can have uni-directional or bi-directional paths (edges) between nodes.
72
Three steps
1. Extracting the star forest: partition a graph into disjoint stars
2. Realising a star: build a word cloud for every star
3. Pack all the stars together
73
Star Forest: star = tree
1. Extract stars greedily from a dissimilarity matrix → disjoint stars = star forest (a sketch of this step follows below)
2. Compute the optimal stars, i.e. the best set of words to be adjacent
3. Attractive force to get a compact layout
74
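A sketch of step 1 only (greedy star extraction; this is my own simplified reading of "extract stars greedily", not the paper's exact procedure, and assumes numpy):

import numpy as np

def greedy_star_forest(similarity, leaves_per_star=4):
    """Partition word indices into disjoint stars (centre + leaves),
    greedily, from a pairwise similarity matrix."""
    sim = np.array(similarity, dtype=float)
    remaining = set(range(len(sim)))
    stars = []
    while remaining:
        idx = list(remaining)
        # pick as centre the word most similar to everything still unassigned
        centre = max(idx, key=lambda i: sim[i, idx].sum())
        others = [i for i in idx if i != centre]
        # attach its most similar unassigned neighbours as leaves
        leaves = sorted(others, key=lambda i: sim[centre, i], reverse=True)[:leaves_per_star]
        stars.append((centre, leaves))
        remaining -= {centre, *leaves}
    return stars

# toy usage with 5 words
S = np.array([[1.0, 0.9, 0.1, 0.2, 0.0],
              [0.9, 1.0, 0.2, 0.1, 0.0],
              [0.1, 0.2, 1.0, 0.8, 0.7],
              [0.2, 0.1, 0.8, 1.0, 0.6],
              [0.0, 0.0, 0.7, 0.6, 1.0]])
print(greedy_star_forest(S, leaves_per_star=1))  # e.g. [(2, [3]), (0, [1]), (4, [])]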
Cycle Cover
• This algorithm is based on a similarity matrix.
• First, a similarity path is created.
• Then, the optimal level of compactness is computed.
75
Quantitative Metrics
76
1. Realized Adjacencies – how close are similar words to each other?
2. Distortion – how distant are dissimilar words?
3. Uniform Area Utilization – uniformity of the distribution (overpopulated vs sparse areas in the word cloud)
4. Compactness – how well utilized is the drawing area?
5. Aspect Ratio – width and height of the bounding box
6. Running Time – execution time
2 datasets
(1) WIKI, a set of 112 plain-text articles extracted from the English Wikipedia, each consisting of at least 200 distinct words
(2) PAPERS, a set of 56 research papers published in conferences on experimental algorithms (SEA and ALENEX) in 2011-2012.
77
Cycle Cover wins
78
Seam Carving wins
79
Random wins
80
Inflate wins
81
Random and Seam Carving win
82
All ok except Seam Carving
83
Demo
84
The end
85