Semantic Analysis in Language Technology http://stp.lingfil.uu.se/~santinim/sais/2016/sais_2016.htm
Semantic Word Clouds
Marina Santini [email protected]
Department of Linguistics and Philology, Uppsala University, Uppsala, Sweden
Spring 2016
Previous lecture: Ontologies
2
Semantic Web & Ontologies
• The goal of the Semantic Web is to allow web information and services to be more effectively exploited by humans and automated tools.
• Essentially, the focus of the Semantic Web is to share data instead of documents.
• This data must be ”meaningful” both for humans and for machines (i.e. automated tools and web applications).
• Q: How are we going to represent meaning and knowledge on the web?
• A: … via annotation.
• Knowledge is represented in the form of rich conceptual schemas/formalisms called ontologies.
• Therefore, ontologies are the backbone of the Semantic Web.
• Ontologies give formally defined meanings to the terms used in annotations, transforming them into semantic annotations.
3
Ontologies are… • … concepts that are hierarchically organized
4
Tree of Porphyry, 3rd century AD
WordNet, 21st century AD (see Lect 5, e.g. similarity measures)
Reasoning: RDF/OWL vs Databases (and other data structures) OWL axioms behave like inference rules rather than database constraints.
Class: Phoenix
    SubClassOf: isPetOf only Wizard

Individual: Fawkes
    Types: Phoenix
    Facts: isPetOf Dumbledore
• Fawkes is said to be a Phoenix and to be the pet of Dumbledore, and it is also stated that only a Wizard can have a pet Phoenix.
• In OWL, this leads to the implication that Dumbledore is a Wizard. That is, if we were to query the ontology for instances of Wizard, then Dumbledore would be part of the answer.
• In a database setting the schema could include a similar statement about the Phoenix class, but in this case it would be interpreted as a constraint on the data: adding the fact that Fawkes isPetOf Dumbledore without Dumbledore being already known to be a Wizard would lead to an invalid database state, and such an update would therefore be rejected by a database management system as a constraint violation (see the sketch below).
5
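The difference can be made concrete with a minimal pure-Python sketch (illustrative only; this is not an OWL reasoner or a DBMS, and all names come from the example above): the same axiom is read once as an inference rule and once as a constraint.

# Toy illustration: "only a Wizard can have a pet Phoenix", read two ways.
facts = {("Fawkes", "type", "Phoenix"),
         ("Fawkes", "isPetOf", "Dumbledore")}

def owl_style_inference(triples):
    """Open-world reading: the axiom is an inference rule.
    If x is a Phoenix and x isPetOf y, then y is inferred to be a Wizard."""
    inferred = set(triples)
    for (x, p, y) in triples:
        if p == "isPetOf" and (x, "type", "Phoenix") in triples:
            inferred.add((y, "type", "Wizard"))
    return inferred

def database_style_check(triples):
    """Closed-world reading: the axiom is a constraint.
    The isPetOf fact is only valid if the owner is already known to be a Wizard."""
    for (x, p, y) in triples:
        if p == "isPetOf" and (x, "type", "Phoenix") in triples:
            if (y, "type", "Wizard") not in triples:
                return False  # constraint violation -> update rejected
    return True

print(owl_style_inference(facts))   # contains ('Dumbledore', 'type', 'Wizard')
print(database_style_check(facts))  # False: invalid database state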
So, what is an ontology for us?
6
“An ontology is a FORMAL, EXPLICIT specification of a SHARED conceptualization”
Studer, Benjamins, Fensel. Knowledge Engineering: Principles and Methods. Data and Knowledge Engineering. 25 (1998) 161-197
An ontology is an explicit specification of a conceptualization
Gruber, T. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition. Vol. 5. 1993. 199-220
Abstract model and simplified view of some phenomenon in the world that we want to represent
Machine-readable
Concepts, properties, relations, functions, constraints and axioms are explicitly defined
Consensual Knowledge
How to build an ontology
Generally speaking (and roughly put), when designing an ontology, four main components are used:
1. Classes
2. Relations
3. Axioms
4. Instances
7
Practical Activity: emotions
8
Your remarks:
• Emotions are ambiguous:
e.g. happiness can also be ill-directed
• The polarity of some emotions cannot be assessed…
• etc.
Classes, Relations, Axioms, Instances, etc.
Occupational psychology (Wikipedia)
• Industrial and organizational psychology (also known as I–O psychology, occupational psychology, work psychology, WO psychology, IWO psychology and business psychology) is the scientific study of human behavior in the workplace and applies psychological theories and principles to organizations and individuals in their workplace.
• I–O psychologists are trained in the scientist–practitioner model. I–O psychologists contribute to an organization's success by improving the performance, motivation, job satisfaction, occupational safety and health as well as the overall health and well-being of its employees. An I–O psychologist conducts research on employee behaviors and attitudes, and how these can be improved through hiring practices, training programs, feedback, and management systems.
9
In summary…
Why build an ontology?
• To share a common understanding of the structure of information among people or machines
• To make domain assumptions explicit
• Often based on a controlled vocabulary
• To analyze domain knowledge
• To enable reuse of domain knowledge
10
Ontologies and Tags
• Ontologies and tagging systems are two different ways to organize the knowledge present on the Web.
• The first has a formal foundation that derives from description logic and artificial intelligence. Domain experts decide the terms.
• The second is simpler and integrates heterogeneous content; it is based on the collaboration of users in Web 2.0: user-generated annotation.
11
Folksonomies
• Tagging facilities within Web 2.0 applications have shown how it might be possible for user communities to collaboratively annotate web content, and create simple forms of ontology via the development of loosely-hierarchically organised sets of tags, often called folksonomies….
12
Folksonomy = Social Tagging
• Folksonomies (also known as social tagging) are user-defined metadata collections.
• Users do not deliberately create folksonomies and there is rarely a prescribed purpose, but a folksonomy evolves when many users create or store content at particular sites and identify what they think the content is about.
• “Tag clouds” pinpoint the frequency of certain tags.
13
• A common way to organize tags is in tag clouds…
14
Automatic folksonomy construction
• The collective knowledge expressed through user-generated tags has great potential.
• However, we need tools to efficiently aggregate data from large numbers of users with highly idiosyncratic vocabularies and invented words or expressions.
• Many approaches to automatic folksonomy construction combine tags using statistical methods ...
• Ample space for improvement…
15
Ontology, taxonomy, folksonomy, etc.
• Many different definitions…
• A good summary and interpretation is here: http://www.ideaeng.com/taxonomies-ontologies-0602
16
Today…
• We will talk more generally about word clouds…
17
Further Reading
Semantic Similarity from Natural Language and Ontology Analysis by Sébastien Harispe, Sylvie Ranwez, Stefan Janaqi, and Jacky Montmain. Synthesis Lectures on Human Language Technologies, May 2015, Vol. 8, No. 1
• The two state-of-the-art approaches for estimating and quantifying semantic similarities/relatedness of semantic entities are presented in detail: the first one relies on corpora analysis and is based on Natural Language Processing techniques and semantic models, while the second is based on more or less formal, computer-readable and workable forms of knowledge such as semantic networks, thesauri or ontologies.
18
Previous lecture: the end
19
Acknowledgements
This presentation is based on the following paper:
• Barth et al. (2014). Experimental Comparison of Semantic Word Clouds. In Experimental Algorithms, Volume 8504 of the series Lecture Notes in Computer Science, pp. 247-258
– Link: https://www.cs.arizona.edu/~kobourov/wordle2.pdf
Some slides have been borrowed from Sergey Pupyrev.
20
Today
• Experiments on semantics-preserving word clouds, in which semantically related words are close to each other.
21
Outline
• What is a Word Cloud?
• 3 early algorithms
• 3 new algorithms
• Metrics & Quantitative Evaluation
22
Word Clouds
• Word clouds have become a standard tool for abstracting, visualizing and comparing texts…
• We could apply the same or similar techniques to the huge amounts of tags produced by users interacting in social networks
23
Comparison & Conceptualization Tool
24
• Word Clouds as a tool for ”conceptualizing” documents. Cf. ontologies
• Ex: 2008, comparison of speeches: Obama vs McCain
• Cf. Lect 10: Extractive summarization & Abstractive summarization
Word Clouds and Tag Clouds…
• … are often used to represent importance among terms (e.g. band popularity) or serve as a navigation tool (e.g. Google search results).
25
The Problem…
• How to compute semantics-preserving word clouds in which semantically related words are close to each other?
26
Wordle http://www.wordle.net
• Practical tools, like Wordle, make word cloud visualization easy.
They offer an appealing way to SUMMARIZE text…
Shortcoming: they do not capture the relationships between words in any way, since word placement is independent of context
27
Many word clouds are arranged randomly (look also at the scattered colours)
28
Patterns and Vicinity/Adjacency
Humans are natural pattern-seekers: if they see two words close to each other in a word cloud, they spontaneously think they are related…
29
In Linguistics and NLP…
• This natural tendency to link spatial vicinity to semantic relatedness is exploited as evidence that words are semantically related or semantically similar…
Remember: ”You shall know a word by the company it keeps” (Firth, J. R. 1957:11)
30
So, it makes sense to place such related words close to each other (look also at the color distribution)
31
Semantic word clouds have higher user satisfaction compared to other layouts…
32
All recent word cloud visualization tools aim to incorporate semantics in the layout…
33
… but none of them provide any guarantee about the quality of the layout in terms of semantics
34
Early algorithms: Force-Directed Graph Layout
• Most of the existing algorithms are based on force-directed graph layout.
• Force-directed graph drawing algorithms are a class of algorithms for drawing graphs in an aesthetically pleasing way:
– Attractive forces between pairs of words reduce empty space
– Repulsive forces ensure that words do not overlap
– A final force preserves semantic relations between words.
35
Some of the most flexible algorithms for calculating layouts of simple undirected graphs belong to a class known as force-directed algorithms. Such algorithms calculate the layout of a graph using only information contained within the structure of the graph itself, rather than relying on domain-specific knowledge. Graphs drawn with these algorithms tend to be aesthetically pleasing, exhibit symmetries, and tend to produce crossing-free layouts for planar graphs.
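As a rough sketch of how such forces work (illustrative only, assuming numpy is available; this is a generic Fruchterman–Reingold-style step, not the exact procedure of any of the tools discussed here):

import numpy as np

def force_directed_step(pos, edges, k=1.0, step=0.05):
    """One iteration of a basic force-directed layout.
    pos:   (n, 2) array of current word positions
    edges: list of (i, j, weight) pairs of semantically related words"""
    n = len(pos)
    disp = np.zeros_like(pos)
    # repulsive forces between all pairs: keep words from crowding each other
    for i in range(n):
        for j in range(n):
            if i != j:
                d = pos[i] - pos[j]
                dist = max(np.linalg.norm(d), 1e-9)
                disp[i] += (d / dist) * (k * k / dist)
    # attractive forces along edges: pull semantically related words together
    for i, j, w in edges:
        d = pos[i] - pos[j]
        dist = max(np.linalg.norm(d), 1e-9)
        pull = (d / dist) * (dist * dist / k) * w
        disp[i] -= pull
        disp[j] += pull
    return pos + step * disp

# toy usage: three words, the first two semantically related
positions = np.random.rand(3, 2)
for _ in range(100):
    positions = force_directed_step(positions, edges=[(0, 1, 1.0)])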
Newer Algorithms: rectangle representation of graphs
• Vertex-weighted and edge-weighted graph:
– The vertices of the graph are the words
• Their weight corresponds to some measure of importance (e.g. word frequency)
– The edges capture the semantic relatedness of pairs of words (e.g. co-occurrence)
• Their weight corresponds to the strength of the relation
– Each vertex is drawn as a box (rectangle) with dimensions determined by its weight
– The realized adjacency is the sum of the edge weights for all pairs of touching boxes.
– The goal is to maximize the realized adjacencies (a sketch of this computation follows below).
36
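A minimal sketch of how the realized adjacency of a given layout could be computed (the box format, function names and toy numbers are my own, not taken from the paper):

def boxes_touch(a, b, eps=1e-6):
    """True if two axis-aligned boxes (x, y, width, height) share a boundary
    segment, i.e. they are adjacent but do not overlap."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    share_vertical_edge = (abs((ax + aw) - bx) < eps or abs((bx + bw) - ax) < eps) \
        and min(ay + ah, by + bh) - max(ay, by) > eps
    share_horizontal_edge = (abs((ay + ah) - by) < eps or abs((by + bh) - ay) < eps) \
        and min(ax + aw, bx + bw) - max(ax, bx) > eps
    return share_vertical_edge or share_horizontal_edge

def realized_adjacency(boxes, edge_weights):
    """Sum of semantic-relatedness weights over all pairs of touching boxes.
    boxes: dict word -> (x, y, w, h); edge_weights: dict (word1, word2) -> weight."""
    total = 0.0
    for (u, v), w in edge_weights.items():
        if u in boxes and v in boxes and boxes_touch(boxes[u], boxes[v]):
            total += w
    return total

# toy usage: "data" touches "computer" on its right side, "apricot" is far away
layout = {"data": (0, 0, 4, 2), "computer": (4, 0, 6, 2), "apricot": (20, 20, 5, 2)}
weights = {("data", "computer"): 0.8, ("data", "apricot"): 0.1}
print(realized_adjacency(layout, weights))  # 0.8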
Purpose of the experiments that are shown here:
• Semantics preservation in terms of closeness/vicinity/adjacency
37
Example
• A contact of two boxes is a common boundary.
• The contact of two boxes is interpreted as semantic relatedness.
• The contact of two boxes can be calculated, so the adjacency can be computed and evaluated.
38
Preprocessing: 1) Term Extraction 2) Ranking 3) Similarity/Dissimilarity Computation (a sketch of steps 1–2 follows below)
39
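A minimal sketch of preprocessing steps 1–2 (illustrative assumptions: simple regex tokenization, a tiny stopword list, frequency-based ranking). Step 3, the similarity computation, is illustrated with the cosine example further below.

import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}  # tiny illustrative list

def extract_terms(text):
    """1) Term extraction: lowercase tokens, drop stopwords and very short words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS and len(t) > 2]

def rank_terms(terms, n=50):
    """2) Ranking: order terms by frequency; the rank later determines the box size."""
    return Counter(terms).most_common(n)

text = "The word cloud summarizes the text: related words in the text co-occur."
print(rank_terms(extract_terms(text), n=5))
# e.g. [('text', 2), ('word', 1), ('cloud', 1), ('summarizes', 1), ('related', 1)]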
• Similarity/dissimilarity matrix
40
Lect 6: Repetition

              large   data   computer
apricot         1       0       0
digital         0       1       2
information     1       6       1
41
Which pair of words is more similar?
cosine(apricot, information) = ?
cosine(digital, information) = ?
cosine(apricot, digital) = ?

cos(v, w) = \frac{v \cdot w}{\lVert v \rVert \, \lVert w \rVert} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^{2}} \, \sqrt{\sum_{i=1}^{N} w_i^{2}}}

cosine(apricot, information) = (1 + 0 + 0) / (√(1+0+0) · √(1+36+1)) = 1/√38 ≈ .16
cosine(digital, information) = (0 + 6 + 2) / (√(0+1+4) · √(1+36+1)) = 8/(√5 · √38) ≈ .58
cosine(apricot, digital) = (0 + 0 + 0) / (√(1+0+0) · √(0+1+4)) = 0
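The same computation as a small sketch (assuming numpy; the vectors are the co-occurrence rows from the table above):

import numpy as np

def cosine(v, w):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    return np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w))

# co-occurrence counts with the context words (large, data, computer)
apricot     = np.array([1, 0, 0])
digital     = np.array([0, 1, 2])
information = np.array([1, 6, 1])

print(round(cosine(apricot, information), 2))  # 0.16
print(round(cosine(digital, information), 2))  # 0.58
print(round(cosine(apricot, digital), 2))      # 0.0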
Lect 06: Other possible similarity measures
42
Input – Output
• The input for all algorithms is:
– a collection of n rectangles, each with a fixed width and height proportional to the rank of the word
– a similarity/dissimilarity matrix
• The output is a set of non-overlapping positions for the rectangles.
43
Early Algorithms
1. Wordle (Random)
2. Context-Preserving Word Cloud Visualization (CPWCV)
3. Seam Carving
44
Wordle → Random
• The Wordle algorithm places one word at a time in a greedy fashion, i.e. aiming to use space as efficiently as possible.
• First the words are sorted by weight/rank in decreasing order.
• Then, for each word in that order, a position is picked at random (a toy sketch follows below).
45
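A toy sketch of this greedy random strategy (heavily simplified and purely illustrative: the real Wordle additionally resolves collisions by moving a word outward along a spiral and works with actual glyph shapes):

import random

def overlaps(a, b):
    """Axis-aligned rectangle overlap test; rectangles are (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def random_greedy_layout(words, canvas=(100, 100), tries=1000):
    """Place words one at a time, heaviest first, at random non-overlapping positions.
    words: list of (word, width, height), already sized by weight/rank."""
    words = sorted(words, key=lambda t: t[1] * t[2], reverse=True)  # biggest boxes first
    placed = {}
    for word, w, h in words:
        for _ in range(tries):
            x = random.uniform(0, canvas[0] - w)
            y = random.uniform(0, canvas[1] - h)
            box = (x, y, w, h)
            if not any(overlaps(box, other) for other in placed.values()):
                placed[word] = box
                break
    return placed

print(random_greedy_layout([("cloud", 30, 10), ("word", 20, 8), ("tag", 12, 6)]))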
1: Random
46
2: Random
47
3: Random
48
4: Random
49
5: Random
50
6: Random
51
Context-Preserving Word Cloud Visualization (CPWCV)
• First, a dissimilarity matrix is computed and Multidimensional Scaling (MDS) is performed
• Second, an effort is made to create a compact layout
52
Multidimensional Scaling (MDS) aims at detecting meaningful underlying dimensions in the data.
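A minimal sketch of this first step (assumes scikit-learn is installed; the 3×3 matrix below is simply 1 − cosine for the apricot/digital/information example from earlier): MDS maps the pairwise dissimilarities to 2D positions, so dissimilar words start out far apart.

import numpy as np
from sklearn.manifold import MDS

# illustrative dissimilarity matrix for (apricot, digital, information):
# small value = semantically close, large value = unrelated
dissimilarity = np.array([
    [0.00, 1.00, 0.84],
    [1.00, 0.00, 0.42],
    [0.84, 0.42, 0.00],
])

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
positions = mds.fit_transform(dissimilarity)  # one (x, y) point per word
print(positions)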
1: Context-Preserving
53
2: Context-Preserving: repulsive force
54
3: Context-Preserving: attractive force
55
Seam Carving
• Basically, an algorithm for image resizing
• It was invented at Mitsubishi Electric Research Laboratories (MERL)
56
1: Seam Carving
57
2: Seam Carving: space is divided into regions
58
3: Seam Carving: empty paths trimmed out iteratively
59
4: Seam Carving
60
5: Seam Carving
61
6: Seam Carving: space divided into regions
62
7: Seam Carving
63
3 New Algorithms
1. Inflate and Push
2. Star Forest
3. Cycle Cover
64
Inflate-and-Push
• Simple heuristic method for word layout, which aims to preserve semantic relations between pairs of words.
• Based on:
1. Heuristics: scaling down all word rectangles by some constant;
2. Computing MDS (multidimensional scaling) on the dissimilarity matrix;
3. Iteratively increasing the size of the rectangles by 5% (i.e. ”inflating” words);
4. When words overlap, applying a force-directed algorithm to ”push” words away (a sketch follows below).
65
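A very rough sketch of how the four steps could fit together (my own simplified reconstruction under stated assumptions — numpy and scikit-learn available, boxes treated as axis-aligned rectangles centred on MDS coordinates — not the authors' implementation):

import numpy as np
from sklearn.manifold import MDS

def inflate_and_push(sizes, dissimilarity, shrink=0.1, grow=1.05, iters=200):
    """Simplified Inflate-and-Push sketch.
    sizes:         (n, 2) full word box sizes (width, height)
    dissimilarity: (n, n) pairwise dissimilarity matrix
    Returns box centres and current box sizes."""
    sizes = np.asarray(sizes, dtype=float)
    n = len(sizes)
    # 1) scale all word rectangles down by a constant
    boxes = sizes * shrink
    # 2) MDS on the dissimilarity matrix gives the initial centres
    centres = MDS(n_components=2, dissimilarity="precomputed",
                  random_state=0).fit_transform(np.asarray(dissimilarity, dtype=float))
    for _ in range(iters):
        # 3) inflate: grow every rectangle by 5% (capped at its full size)
        boxes = np.minimum(boxes * grow, sizes)
        # 4) push: move pairs of overlapping boxes apart, force-directed style
        for i in range(n):
            for j in range(i + 1, n):
                d = centres[j] - centres[i]
                half = (boxes[i] + boxes[j]) / 2.0  # required separation per axis
                if abs(d[0]) < half[0] and abs(d[1]) < half[1]:
                    dist = max(np.linalg.norm(d), 1e-9)
                    push = (min(half - np.abs(d)) / 2.0) * (d / dist)
                    centres[i] -= push
                    centres[j] += push
    return centres, boxes

# toy usage with three words and the illustrative dissimilarity matrix from above
centres, boxes = inflate_and_push(
    [[6, 2], [5, 2], [9, 2]],
    [[0.00, 1.00, 0.84], [1.00, 0.00, 0.42], [0.84, 0.42, 0.00]])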
Inflate: starting point
66
Inflate: scaling down
67
Inflate: semantically related words are placed close to each other. Apply ”inflate words” (5%) iteratively.
68
Inflate: ”push words”: repulsive force to resolve overlaps
69
Inflate: final stage
70
Star Forest
• A star is a tree in which one central vertex is adjacent to all the other vertices (the leaves).
• A star forest is a forest whose connected components are all stars.
71
Repetition: trees and graphs
• A tree is a special form of graph, i.e. a minimally connected graph having only one path between any two vertices.
• In a general graph there can be more than one path, i.e. a graph can have uni-directional or bi-directional paths (edges) between nodes.
72
Three steps
1. Extracting the star forest: partition a graph into disjoint stars
2. Realising a star: build a word cloud for every star
3. Pack all the stars together
73
Star Forest: star = tree
1. Extract stars greedily from a dissimilarity matrix → disjoint stars = star forest (a sketch of this step follows below)
2. Compute the optimal stars, i.e. the best set of words to be adjacent
3. Attractive force to get a compact layout
74
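A sketch of step 1 only (greedy star extraction; this is my own simplified reading of "extract stars greedily", not the paper's exact procedure, and assumes numpy):

import numpy as np

def greedy_star_forest(similarity, leaves_per_star=4):
    """Partition word indices into disjoint stars (centre + leaves),
    greedily, from a pairwise similarity matrix."""
    sim = np.array(similarity, dtype=float)
    remaining = set(range(len(sim)))
    stars = []
    while remaining:
        idx = list(remaining)
        # pick as centre the word most similar to everything still unassigned
        centre = max(idx, key=lambda i: sim[i, idx].sum())
        others = [i for i in idx if i != centre]
        # attach its most similar unassigned neighbours as leaves
        leaves = sorted(others, key=lambda i: sim[centre, i], reverse=True)[:leaves_per_star]
        stars.append((centre, leaves))
        remaining -= {centre, *leaves}
    return stars

# toy usage with 5 words
S = np.array([[1.0, 0.9, 0.1, 0.2, 0.0],
              [0.9, 1.0, 0.2, 0.1, 0.0],
              [0.1, 0.2, 1.0, 0.8, 0.7],
              [0.2, 0.1, 0.8, 1.0, 0.6],
              [0.0, 0.0, 0.7, 0.6, 1.0]])
print(greedy_star_forest(S, leaves_per_star=1))  # e.g. [(2, [3]), (0, [1]), (4, [])]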
Cycle Cover
• This algorithm is based on a similarity matrix.
• First, a similarity path is created.
• Then, the optimal level of compactness is computed.
75
Quantitative Metrics
76
1. Realized Adjacencies – how close are similar words to each other?
2. Distortion – how distant are dissimilar words?
3. Uniform Area Utilization – uniformity of the distribution (overpopulated vs sparse areas in the word cloud)
4. Compactness – how well utilized is the drawing area?
5. Aspect Ratio – width and height of the bounding box
6. Running Time – execution time
2 datasets
(1) WIKI, a set of 112 plain-text articles extracted from the English Wikipedia, each consisting of at least 200 distinct words
(2) PAPERS, a set of 56 research papers published in conferences on experimental algorithms (SEA and ALENEX) in 2011-2012.
77
Cycle Cover wins
78
Seam Carving wins
79
Random wins
80
Inflate wins
81
Random and Seam Carving win
82
All ok except Seam Carving
83
Demo
84
The end
85