Visualization Taxonomies and Techniques

Text: Words, phrases, sentences, …

University of Texas – Pan American

CSCI 6361, Spring 2014

Introduction

• Text is ubiquitous
– Documents, and more generally text, are a primary information source
• (Verbal has its place!)

– Access to documents and text has grown exponentially in recent years due to networking infrastructure

• WWW • Digital libraries • Social media

• Visualization to aid users in understanding and gathering information from text and document collections

Introduction

• Visualization can aid in performing tasks

• For example:
– Which documents contain text on topic XYZ?
– Which documents are of interest to me?
– Are there other documents that are similar to this one (so they are worthwhile)?
– How are different words used in a document or a document collection?
– What are the main themes and ideas in a document or a collection?
– Which documents have an angry tone?
– How are certain words or themes distributed through a document?
– Identify “hidden” messages or stories in this document collection.
– How does one set of documents differ from another set?
– Quickly gain an understanding of a document or collection in order to subsequently do XYZ.
– Understand the history of changes in a document.
– Find connections between documents.

From Stasko, 2013

Introduction: Challenges of Text Visualization

• Text is unlike other data types seen so far, for example:

• Context and Semantics
– Context is relevant to understanding and meaning
– Indeed, natural language understanding is a challenge of the (n+1)th century

• Dimensionality
– Inherently “not dimensional”, so must create a “visually realizable” visual encoding
– Often, first step is n-D, then 2- or 3-D

• Modeling Abstraction
– Consider level of “understanding” required for task
– Match analysis task with appropriate tools and models

Introduction: Related topics

• Information Retrieval
– Active search process that brings back particular/specific items (will discuss that some today, but not always the focus)
– InfoVis and HCI can help some…

• Visualization may be most useful when not sure precisely what you’re looking for when retrieving information
– More of a browsing paradigm than a search one
– But, this is part of the information retrieval task
• Define information need, formulate “query”, examine/evaluate results, … repeat

• Sensemaking
– Gaining better understanding of facts at hand in order to take some next steps
• A principal focus in visual analytics
– Visualization can help make a large document collection more understandable more rapidly
• Which is good: “Overview, zoom and filter, details on demand”

Recall, Visualization Pipeline: Visualization Stages

• Data transformations:
– Map raw data (idiosyncratic form) into data tables (relational descriptions including metatags)

• Text is nominal data
– A word, or any text unit, does not map easily to any quantitative representation!
– The “Raw data --> Data Table” mapping is a principal element of creating any visual representation
• How do you get numbers from words, sentences, …?
– Will see several solutions
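
One common way to get numbers from words is simply to count them. The sketch below is a minimal illustration, not a method taken from the slides: it assumes a crude regular-expression tokenizer, and the names `tokenize` and `term_document_table` are made up for this example. It turns raw text into a data table with one row per document and one column per term.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (crude assumption)."""
    return re.findall(r"[a-z]+", text.lower())

def term_document_table(docs):
    """Build a data table: one row per document, one column per vocabulary term."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

docs = ["The cat sat on the mat.", "The dog sat."]
vocabulary, rows = term_document_table(docs)
for doc, row in zip(docs, rows):
    print(row, doc)
```

Each row is now a quantitative vector that a visual mapping can work with; word order and sentence structure are discarded in the process.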

[Figure: visualization pipeline. Stages: Raw Information, Dataset, Visual Form, Views. Transformations: Data Transformations, Visual Mappings, View Transformations (F, F^-1). User - Task, Interaction, Visual Perception.]

Recall, Visualization Pipeline: Visualization Stages

• Visual Mappings:
– Transform data tables into visual structures that combine spatial substrates, marks, and graphical properties

• And … visual mappings, as well, require at least “the usual level” of creativity

Understanding Text Content

• Visual representations of words, phrases, and sentences
– Main goal of understanding, versus search

• Visual presentation always part of text presentation
– Standard typography uses layout, font, style, color …
– Electronic media, especially: pick a web page
– “Single text content”

Single Text Content: Word Counts

• 2012 National Conventions
• NY Times: http://www.nytimes.com/interactive/2012/08/28/us/politics/convention-word-counts.html

Tag / Word Clouds

• Lots of popular interest
– E.g., on web

• Idea is to show word/concept importance through visual means
– Tags: User-specified metadata (descriptors) about something
– Sometimes generalized to just reflect word frequencies

• Not a new technique
– Milgram’s ‘76 experiment to have people label landmarks in Paris
– Flanagan’s ‘97 “Search referral Zeitgeist”
– Fortune’s ‘01 Money Makes the World Go Round
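
The core idea, showing importance through visual means, usually comes down to mapping a count to a font size. The sketch below is one plausible mapping, not the formula of any particular tag cloud tool: the point-size range, the optional log scaling, and the function name `font_sizes` are all assumptions for illustration.

```python
import math
from collections import Counter

def font_sizes(word_counts, min_pt=10.0, max_pt=48.0, log_scale=True):
    """Map each word's count to a font size between min_pt and max_pt."""
    if not word_counts:
        return {}
    # Log scaling (an assumed choice) keeps a few very frequent words from dwarfing the rest.
    scale = math.log if log_scale else float
    values = {w: scale(c) for w, c in word_counts.items()}
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    sizes = {}
    for word, v in values.items():
        t = (v - lo) / span if span else 1.0  # normalize to [0, 1]
        sizes[word] = min_pt + t * (max_pt - min_pt)
    return sizes

counts = Counter("economy jobs economy tax jobs economy".split())
sizes = font_sizes(counts, log_scale=False)
for word, pt in sorted(sizes.items(), key=lambda kv: -kv[1]):
    print(f"{word:10s} {pt:.1f}pt")
```

Note that this sizes by font height only, so longer words still cover more area than shorter words with the same count.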

Tag / Word Clouds Example: US State of the Union Speeches

• Guardian
• http://www.guardian.co.uk/news/datablog/2011/jan/25/state-of-the-union-text-obama#
• http://image.guardian.co.uk/sys-files/Guardian/documents/2011/01/26/State_of_the_union_2011.pdf?guni=Graphic:in body link

Flickr Tag Cloud

delicious Tag Cloud

Alternate Order

Many Eyes Tag Cloud

• Word pairs

Wordle

Wordle: “Beautiful Word Clouds”, http://www.wordle.net/

• Tightly packed words
– Horizontal, vertical or diagonal
• Size correlated with frequency
• Multiple color palettes
• User gets some control
• Layout Algorithm
– Details not published
– Sort words by weight, in decreasing order
– For each word: initial position randomly chosen according to distribution for target shape
– Position update moves out radially

• Course schedule, table of topics, and assignments
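
The layout steps listed for Wordle (sort by weight, random initial position, move outward) can be sketched as a greedy placement loop. Since the slides note that the details are not published, everything specific here is an assumption: words are approximated by axis-aligned rectangles, the initial position is drawn from a Gaussian around the centre, and a word moves outward along a spiral until it no longer overlaps anything already placed.

```python
import math
import random

def overlaps(a, b):
    """Axis-aligned rectangle intersection test; rectangles are (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def layout(weights, seed=0, char_width=0.6, step=0.5):
    """Place words greedily, heaviest first, moving each outward until it fits."""
    rng = random.Random(seed)
    placed = {}
    for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
        # Crude bounding box: the weight is used directly as the font height.
        w, h = char_width * weight * len(word), weight
        # Initial position: random point near the centre (assumed distribution).
        x0, y0 = rng.gauss(0, 5), rng.gauss(0, 5)
        angle, radius = rng.uniform(0, 2 * math.pi), 0.0
        while True:
            x = x0 + radius * math.cos(angle) - w / 2
            y = y0 + radius * math.sin(angle) - h / 2
            box = (x, y, w, h)
            if not any(overlaps(box, other) for other in placed.values()):
                placed[word] = box
                break
            radius += step  # move outward radially
            angle += 0.3    # and rotate a little, tracing a spiral
    return placed

weights = {"visualization": 10, "text": 8, "word": 5, "cloud": 5, "tag": 3}
boxes = layout(weights)
for word, (x, y, w, h) in boxes.items():
    print(f"{word:14s} x={x:7.2f} y={y:7.2f} w={w:6.2f} h={h:5.2f}")
```

Because each word's position depends on a random start and on the words placed before it, small changes in the input can produce a very different picture.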

Can be many variations …

• A bit more order
• Order the words more by frequency

Mani-Wordle: User control

• Mani-Wordle
– Start with nice default algorithm
– Give user more control over design
• Alter color (within a palette)
• Pin words, redo the rest
• Move and rotate words
– http://www.cg.tuwien.ac.at/courses/InfoVis/HallOfFame/2012/Gruppe03/Homepage/index.html
– Koh et al TVCG (InfoVis) ‘10

Tag / Word Clouds: Conclusions

• Weaknesses
– Sub-optimal visual encoding (size vs. position)
– Inaccurate size encoding (long words are bigger)
– Font sizes are hard to compare
– May not facilitate comparison (unstable layout)
– Word frequency may not be meaningful
• Most use words vs. stems
– Does not show structure of the text
– Studies have even shown they underperform (Gruen et al CHI ’06)

• Why so popular?
– OK for “quick look”
– Serve as social signifiers that provide a friendly atmosphere and a point of entry into a complex site
– Act as individual and group mirrors
– Fun, not business-like

BTW - Text Analysis Tools: voyeur, http://voyeurtools.org/

• Book
• + tools for text analysis and visualization

Visualization and Information Retrieval

• Examples so far have focused on representing a single document
– …, or, really, a set of words, as there is no consideration of even word order, let alone sentence structure, etc.

• Principal question is how might visual representations aid text, or document, search
– I.e., how to find the proverbial needle in a haystack, where the haystack is all the documents on the WWW or a digital library
– The term information retrieval refers to this search, and its history antedates computers

• IR entails:
– Determine information need
– Query formulation
– Retrieval
– Assessment of results
– Reformulation of query or even information need
– Repeat (until information need met)

• Provide visual representations that help during this process
– Show document collection visually, support browsing, …
• Even for determining information need!
– Show query results visually
– Show how query terms relate to results
– … any aspect

From Stasko, 2013

Evaluating Query Results: TileBars, Hearst, 1996

• Hearst points out that query responses do not include:
– How strong the match is
– How frequent each term is
– How each term is distributed in the document
– Overlap between terms
– Length of document

• Document ranking is opaque

• Inability to compare between results

• Input limits term relationships

TileBars: Overview

• Goal: Minimize time and effort for deciding which documents to view in detail

• Show the role of the query terms in the retrieved documents, making use of document structure

• Graphical representation of term distribution and overlap

• Simultaneously indicate:
– Relative document length
– Frequency of term sets in document
– Distribution of term sets with respect to the document and each other

From Stasko, 2013
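
The data behind a TileBar can be sketched as a small matrix: one row per query term set, one column per segment of the document, with each cell holding the number of hits. The sketch below is a simplification under stated assumptions, not Hearst's implementation: it cuts the document into fixed-size word windows (`tile_size` is a made-up parameter) rather than using document structure, and it renders cells with text characters instead of gray levels.

```python
import re

def tilebar(document, term_sets, tile_size=50):
    """Return one row per term set; each cell counts hits within one tile of text."""
    tokens = re.findall(r"[a-z]+", document.lower())
    # Fixed-size tiles (assumption): longer documents yield more columns.
    tiles = [tokens[i:i + tile_size] for i in range(0, len(tokens), tile_size)]
    return [[sum(tok in terms for tok in tile) for tile in tiles] for terms in term_sets]

def render(rows, shades=" .:#"):
    """Draw each row as text; denser characters stand in for darker tiles."""
    top = len(shades) - 1
    return ["".join(shades[min(count, top)] for count in row) for row in rows]

doc = ("cancer research funding grows while cancer prevention lags "
       "and funding for prevention research shrinks")
rows = tilebar(doc, [{"cancer"}, {"prevention", "research"}], tile_size=4)
for line in render(rows):
    print("|" + line + "|")
```

Row length conveys relative document length, cell darkness conveys frequency, and columns where both rows are non-empty show where the term sets overlap.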

TileBars: Screen

• TileBars screen:

From Stasko, 2013

TileBars: Document representation

• Visual representation of retrieved documents

• Video: TileBars-80mb-chi96_05_m1.mpeg

From Stasko, 2013

TileBars: Conclusions

• Clearly visually provides the information intended about each document

• Ease/effort/time of comparison?
– Surely would improve with use

• … ?

Evaluating Query Results: Sparkler

• Abstract result documents more
– Havre et al InfoVis ‘01

• Show “distance” from query in order to give user better feel for quality of match(es)

• Also shows documents in response to multiple queries

• Visualizing One Query
– Triangle: query
– Square: document

• Distance between query and documents represents their relevance

From Stasko, 2013

Sparkler

• Visualizing Multiple Queries
• Six queries here
• Bullseye allows viewer to select quality results

From Stasko, 2013

Sparkler

• Test Example
• Text Retrieval Conference (TREC-3) test document collection
• AP news stories from June 24–30, 1990
• TREC topic: Japan Protectionist Measures
• Sparkler found 16 of 17 relevant documents

From Stasko, 2013

Evaluating Query Results: RankSpiral

• Compare search results from different search engines
– Spoerri InfoVis ’04 poster

From Stasko, 2013

RankSpiral

• Color represents different search engines

From Stasko, 2013

Evaluating Query Results: ResultMaps

• Treemap-style vis for showing query results in a digital library
– Clarkson, Desai & Foley TVCG (InfoVis) ‘09

From Stasko, 2013

Representing Multiple Documents

• Previously, have seen various techniques for comparing multiple documents that are results of query, i.e., a subset of all documents

• Also, may want to just show everything, and then let user do “manual search”, or user-directed search

• Such displays of all documents also support the type of search common in visual analytics

– Query, browse, connect, drill-down

• Will see:
– Parallel word clouds
– Tree layout of synonyms
– …

Multiple Documents: Parallel Tag Clouds

• Tag clouds increase size of word as f(frequency)
• Showing multiple documents as tag clouds allows visual inspection
– Automated and user directed, visual analytics

• Parallel Tag Clouds - name says it all
– Video - Collins et al VAST ‘09 – different circuit courts
– http://www.youtube.com/watch?v=rL3Ga6xBgLw

Multiple Documents: Do different district courts differ in cases they handle?

Multiple Documents: Counting Words, Overview & Timeline

• E.g., can count words across speeches

• State of the Union Addresses

• http://www.nytimes.com/ref/washington/20070123_STATEOFUNION.html?initialWord=iraq

• NY Times demo

Multiple Document Word Use: DocuBurst

• Sets of synonyms grouped together
– Uses WordNet
– Show words from a document in terms of their hypernym (IS-A) links
– Size: # of leaves in subtree
– Hue: different synsets of word
– Shade: frequency of use

• Demo, etc.
– http://vialab.science.uoit.ca/portfolio/docuburst-visualizing-document-content-using-language-structure
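
The size and shade encodings listed for DocuBurst can be illustrated on a toy IS-A tree. This is only a sketch: the hierarchy below is a hand-made, hypothetical stand-in for WordNet, and summing counts over a subtree is just one possible reading of "frequency of use".

```python
from collections import Counter

# Hypothetical toy IS-A hierarchy (child -> parent); DocuBurst itself uses WordNet.
HYPERNYM = {
    "dog": "canine", "wolf": "canine", "canine": "animal",
    "cat": "feline", "feline": "animal",
}

def children_of(hypernym):
    """Invert the child -> parent map into parent -> list of children."""
    kids = {}
    for child, parent in hypernym.items():
        kids.setdefault(parent, []).append(child)
    return kids

def leaf_count(node, kids):
    """Number of leaves under node; a leaf counts as 1 (would drive wedge size)."""
    if node not in kids:
        return 1
    return sum(leaf_count(c, kids) for c in kids[node])

def subtree_frequency(node, kids, counts):
    """Occurrences of node and all its hyponyms in the document (would drive shade)."""
    return counts[node] + sum(subtree_frequency(c, kids, counts) for c in kids.get(node, []))

kids = children_of(HYPERNYM)
counts = Counter("the dog chased the cat and the dog met a wolf".split())
for node in ["animal", "canine", "feline", "dog"]:
    print(node, leaf_count(node, kids), subtree_frequency(node, kids, counts))
```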

FeatureLens

• Show patterns of words or n-grams – Don et al. CIKM ‘07

• Video

Combinations of words, phrases, and sentences

Multiple Sentences: SeeSoft Display

• Originally for software visualization

• One line of text on each horizontal line

• Color highlight for attributes

– E.g., for software, how often modified, days since modification

– E.g., for text, where a particular word appears in a sentence

• Conversations might be revealed

• Detail view in pop up window

Multiple Sentences: TextArc - Simple Single Document Visualization

• Visualize an entire book
– Word appearances
– Sentences
– …
– http://textarc.org

• Sentences laid out on circumference in order of appearance in spiral

• Frequently occurring words inside spiral

• Selecting a word draws lines to the sentences containing that word

– A kind of “visual concordance”

• Significant interaction

TextArc

Concordances and Word Frequencies

• From field of literary analysis

• Concordance
– An alphabetical index of the principal words in a book or the works of an author with their immediate context

• Word of interest in center, with text in which it appears to left and right

• As KWIC
– Key word in context
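
A key-word-in-context listing is straightforward to compute. The sketch below is a minimal version under simple assumptions (a `\w+` tokenizer, a fixed window of `width` words on each side); the function names are made up for this example.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for every occurrence."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

def format_kwic(lines):
    """Right-align the left contexts so the keyword forms a centre column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{pad}}  {kw}  {right}" for left, kw, right in lines]

text = ("In the beginning God created the heaven and the earth. "
        "And the earth was without form.")
hits = kwic(text, "earth", width=2)
for line in format_kwic(hits):
    print(line)
```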

Word Tree

• Shows context of a word or words
– Follow word with all the phrases that follow it

• Wattenberg & Viégas TVCG (InfoVis) ‘08

• Font size shows frequency of appearance
• Continue branch until hitting unique phrase
• Clicking on phrase makes it the focus
• Ordered alphabetically, by frequency, or by first appearance
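
The structure behind a word tree is a prefix tree of the phrases that follow the chosen word. The sketch below is an illustrative simplification, not the published implementation: it splits sentences on punctuation, stops each branch at a fixed `depth` rather than at the first unique phrase, and stores a count at each node that could drive font size.

```python
import re

def build_word_tree(text, root, depth=3):
    """Nested dict: each node maps a following word to [count, subtree]."""
    tree = {}
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = re.findall(r"[a-z']+", sentence)
        for i, tok in enumerate(tokens):
            if tok != root:
                continue
            node = tree
            for nxt in tokens[i + 1:i + 1 + depth]:
                entry = node.setdefault(nxt, [0, {}])
                entry[0] += 1  # how many phrases pass through this branch
                node = entry[1]
    return tree

def show(tree, indent=0):
    """Print branches ordered by frequency, most common first."""
    for word, (count, sub) in sorted(tree.items(), key=lambda kv: -kv[1][0]):
        print(" " * indent + f"{word} ({count})")
        show(sub, indent + 2)

tree = build_word_tree("I have a dream. I have a plan. I have no fear.", "have")
show(tree)
```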

Word Tree: Interaction

Word Tree: From King James Bible

Word Tree: Many Eyes

Finding Structure: Phrase Nets

• Concordances show local, repeated structure of word context
• Phrase Nets in Many Eyes, van Ham et al.

• Other types of patterns
– Lexical: <A> at <B>, <A> and <B>, <A> (is|are|was|were) <B>
– Syntactic: <Noun> <Verb> <Object>

• Visualize extracted patterns in a node-link view
– Occurrences -> Node size
– Pattern position -> Edge direction
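
Extracting a lexical pattern for a phrase net can be sketched with a regular expression. This is an assumed, minimal approach rather than the Many Eyes implementation: it matches a word, a connector, and the following word, records a directed edge from the first word to the second, and counts occurrences per word for node size. The function name `phrase_net` and its `connector` parameter are made up for this example.

```python
import re
from collections import Counter

def phrase_net(text, connector="and"):
    """Count directed edges (A, B) for every 'A connector B' match in the text."""
    # The lookahead captures B without consuming it, so 'A and B and C'
    # yields both (A, B) and (B, C).
    pattern = re.compile(rf"\b(\w+)\s+(?:{connector})\s+(?=(\w+))", re.IGNORECASE)
    edges = Counter()
    nodes = Counter()
    for match in pattern.finditer(text):
        a, b = match.group(1).lower(), match.group(2).lower()
        edges[(a, b)] += 1  # pattern position gives the edge direction
        nodes[a] += 1       # occurrences give the node size
        nodes[b] += 1
    return edges, nodes

edges, nodes = phrase_net("Salt and pepper. Pepper and salt and vinegar.")
for (a, b), n in edges.items():
    print(f"{a} -> {b} ({n})")
```

Passing a connector such as `"is|are|was|were"` reuses the same function for the other lexical patterns.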

Phrase Net: Portrait of the Artist as a Young Man, <A> and <B>

Phrase Nets: The Bible, <A> begat <B>

Phrase Nets: Old and New Testaments, <A> of <B>

Phrase Nets: (<A> and <B>) and (<A> at <B>)
