

Tools and Methods for Processing and Visualizing Large Corpora

Gerold Schneider, University of Zurich and University of Konstanz

Mennatallah El-Assady, University of Konstanz

Hans Martin Lehmann, University of Zurich

d2e Conference, Helsinki


Overview and Contents


Several approaches and methods which we develop or use to create workflows from data to evidence.

1.  Do-it-yourself: from words via numbers to trends, as we use and develop them at the English Department in Zurich

•  Searching specific instances: Dependency Bank

•  Data-Driven Overuse: Lightside

•  Visualize your trends: GoogleViz

2.  Interactive Visualizations, as we use and develop them at the Data Analysis and Visualization Group at the University of Konstanz

•  Topic Matrix View

•  Statistical Visualizations: Tableau

•  Lexical Episode Plot

1.1 DIY d2e: Dependency Bank


•  Search specific instances and sum by period, genre, etc.: Dependency Bank, e.g. ‘education system’ vs. ‘system of education’
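The "search and sum by period" step can be illustrated with a minimal Python sketch. It assumes the query hits have already been exported as (period, variant) pairs; the hit list below is invented toy data, not output from Dependency Bank, and `share_by_period` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical hits as (decade, variant) pairs; a real query result
# from a parsed corpus would take the place of this toy list.
hits = [
    (1900, "system of education"),
    (1900, "system of education"),
    (1900, "education system"),
    (1990, "education system"),
    (1990, "education system"),
    (1990, "system of education"),
    (1990, "education system"),
]

def share_by_period(hits, variant):
    """Return {period: share of `variant` among all hits in that period}."""
    totals = Counter(period for period, _ in hits)
    wanted = Counter(period for period, v in hits if v == variant)
    return {period: wanted[period] / totals[period] for period in sorted(totals)}

shares = share_by_period(hits, "education system")
for period, share in shares.items():
    print(period, round(share, 2))
```

Reporting the share of one variant per period, rather than raw counts, keeps periods with different corpus sizes comparable.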

1.1 DIY d2e: Dependency Bank


‘education system’ vs. ‘system of education’: trend confirmed by Google Books

Noun-noun compounds are generally increasing

Relative Frequency of open-form noun-noun compounds


confirms and extends Leech et al. 2009

Syntactic query: stative verbs in the progressive

“I’m loving it” in the BNC


1.2 DIY d2e: Data-driven Overuse


Machine Learning for the Masses: Lightside
http://lightsidelabs.com/what/research/

Lightside allows you to do machine learning without programming skills. E.g. can we classify American speeches as Republican or Democrat? What are their typical linguistic features?

We use the CORPS II corpus: 8 million words, 3618 speeches (Guerini et al. 2013).

To do document classification with Lightside you must

•  Create TAB-separated input.

•  Understand the broad idea of Naïve Bayes or regression, because

   •  you need to interpret the results

   •  you don’t want to crash your computer

•  Be patient during the massive calculations
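Creating the TAB-separated input could look roughly like the following Python sketch. The column names `party` and `text`, the helper `to_tsv` and the two example speeches are assumptions made for illustration; they are not a layout prescribed by Lightside or taken from CORPS II.

```python
import csv
import io

# Hypothetical documents: (party label, speech text). The column names
# "party" and "text" are illustrative, not a format required by any tool.
speeches = [
    ("democrat", "We must invest in\tschools and health care."),
    ("republican", "We will cut taxes\nand reduce spending."),
]

def to_tsv(rows, header=("party", "text")):
    """Serialise rows as tab-separated text, one document per line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for label, text in rows:
        # Collapse tabs and line breaks inside the text so that each
        # document stays in a single cell on a single line.
        writer.writerow([label, " ".join(text.split())])
    return buffer.getvalue()

tsv = to_tsv(speeches)
print(tsv)
```

Normalising whitespace before writing matters in practice: a stray tab or line break inside a speech would otherwise shift columns or split one document into two rows.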


1.2 DIY d2e: Data-driven Overuse: Lightside


•  you need to interpret the results
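To make "the broad idea of Naïve Bayes" and the interpretation of its results more concrete, here is a minimal from-scratch sketch in Python. It is not what Lightside runs internally; the training sentences are invented, and add-one smoothing is an assumption of this sketch. The `log_odds` helper shows one simple way to read off which words lean towards which class.

```python
import math
from collections import Counter

# Toy training data (invented for illustration, not from CORPS II).
train = [
    ("democrat", "health care for working families"),
    ("democrat", "invest in schools and health"),
    ("republican", "cut taxes and reduce spending"),
    ("republican", "lower taxes for families"),
]

def train_nb(docs):
    """Collect class counts and per-class word counts."""
    class_counts = Counter()
    word_counts = {}
    for label, text in docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def log_p_word(word, label, word_counts, vocab):
    """log p(word | class) with add-one (Laplace) smoothing."""
    counts = word_counts[label]
    return math.log((counts[word] + 1) / (sum(counts.values()) + len(vocab)))

def classify(text, model):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # class prior
        for word in text.split():
            if word in vocab:  # words never seen in training are ignored
                score += log_p_word(word, label, word_counts, vocab)
        scores[label] = score
    return max(scores, key=scores.get)

def log_odds(word, model, a="democrat", b="republican"):
    """Positive values lean towards class a, negative towards class b."""
    _, word_counts, vocab = model
    return (log_p_word(word, a, word_counts, vocab)
            - log_p_word(word, b, word_counts, vocab))

model = train_nb(train)
print(classify("cut taxes", model))
print(classify("health care", model))
print(round(log_odds("taxes", model), 2), round(log_odds("health", model), 2))
```

Sorting the vocabulary by `log_odds` gives a ranked list of the features most typical of each class, which is the kind of output that needs linguistic interpretation.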

1.3 DIY d2e: Google Visualization


•  Back to noun-noun compounds: relative frequency of the alternation

1.3 DIY d2e: Google Visualization


Back to noun-noun compounds: relative frequency of the alternation in COHA

You need basic programming skills in R

Google Viz is an R library

There are excellent instructions by Martin Hilpert: thanks!

2. Interactive Visualizations

We use and develop Interactive Visualizations at the Data Analysis and Visualization Group at the University of Konstanz.

Visualization for Digital Humanities

Digital Humanities projects are a main driving force for visualization in linguistics. Visualization is needed because the massive amounts of information can’t easily be viewed or understood using the traditional method of reading the texts.

Idea: Spot concepts, and zoom in to read the interesting parts.

And how do we spot concepts? By looking at keywords, such as words that are overused in particular documents/sections/paragraphs.

Unfortunately, the mapping from words to concepts is not 1:1

•  The same word can mean many things

•  Different words can refer to the same concept

•  Meanings change over time (Tony McEnery’s plenary)

•  Proper names are often just actors and witnesses in bigger concepts.

Vita brevis, arma longa

Noun-noun compounds suffer less from these problems; they denote (new) concepts
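Spotting keywords as "words that are overused in particular documents/sections/paragraphs" can be sketched as a comparison of relative frequencies. The Python sketch below uses a smoothed log ratio; this is one of several common keyness measures and is chosen here only for brevity, not because the slides name it. The two token lists are invented.

```python
import math
from collections import Counter

def log_ratio(target_tokens, reference_tokens, smoothing=0.5):
    """Return {word: log2 of relative frequency in target vs. reference}.

    `smoothing` is added to every count so that words absent from one
    side do not cause a division by zero (an assumption of this sketch).
    """
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    n_target = len(target_tokens)
    n_reference = len(reference_tokens)
    scores = {}
    for word in target:
        rel_target = (target[word] + smoothing) / n_target
        rel_reference = (reference[word] + smoothing) / n_reference
        scores[word] = math.log2(rel_target / rel_reference)
    return scores

# Invented toy texts, only to show the mechanics.
section = "the moon mission and the lunar module reached the moon".split()
rest = "the government and the senate debated the tax and the budget".split()

scores = log_ratio(section, rest)
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)
```

Positive scores mark words overused in the section relative to the rest; function words such as "the" end up near or below zero.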


2.1 Firthian Hypothesis and Topic Models

Unfortunately, the mapping from words to concepts is not 1:1

•  The same word can mean many things: but contexts disambiguate

•  Different words can refer to the same concept: but only if they are in similar contexts

•  Word senses change over time: we need data-driven, context-based methods

“words with similar distributional properties have similar meanings” (Sahlgren, 2006: 21)

•  Words in immediate context → collocations (syntagmatic axis)

•  Words in larger context → semantic associations, topics (paradigmatic axis)
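The distributional idea in the Sahlgren quote can be illustrated with a small Python sketch: each word is represented by the counts of the words around it, and two words are compared by the cosine of those count vectors. The window size of 2 and the toy text are assumptions of this sketch, not settings from the slides.

```python
import math
from collections import Counter

def context_vectors(tokens, window=2):
    """Map each word to a Counter of words seen within `window` positions."""
    vectors = {}
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]
        vectors.setdefault(word, Counter()).update(context)
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Invented toy text: "rocket" and "spacecraft" occur in similar contexts.
tokens = ("the rocket reached the moon . the spacecraft reached the moon . "
          "the senate passed the budget .").split()

vectors = context_vectors(tokens)
print(round(cosine(vectors["rocket"], vectors["spacecraft"]), 2))
print(round(cosine(vectors["rocket"], vectors["senate"]), 2))
```

The counts inside one vector are the raw material for collocations, whereas the comparison between vectors captures words that are interchangeable in similar contexts.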

Topics are clusters of words that frequently co-occur. Many approaches to topic modelling (e.g. LDA) use a probability distribution model maximizing

p(topic | document) ⋅ p(word | topic)

We apply a deterministic topic model approach (IHTM) to 60,000 news articles from 1860 to 2000 (COHA)
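As a small worked example of the product p(topic | document) ⋅ p(word | topic): in probabilistic topic models of the LDA type, summing this product over all topics gives the probability of a word in a document. The numbers below are invented toy values, not output of LDA or IHTM, and the function name is hypothetical.

```python
# Invented toy distributions; a fitted model would supply real values.
p_topic_given_doc = {"space": 0.8, "politics": 0.2}
p_word_given_topic = {
    "space": {"moon": 0.5, "orbit": 0.4, "senate": 0.1},
    "politics": {"moon": 0.1, "orbit": 0.1, "senate": 0.8},
}

def p_word_given_doc(word):
    """Sum p(topic | document) * p(word | topic) over all topics."""
    return sum(
        p_topic * p_word_given_topic[topic].get(word, 0.0)
        for topic, p_topic in p_topic_given_doc.items()
    )

for word in ("moon", "orbit", "senate"):
    print(word, round(p_word_given_doc(word), 2))
```

A document dominated by the "space" topic thus assigns most probability to "moon" and "orbit", while "senate" still receives some mass through the minor "politics" topic.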

2.1 Topic Matrix View

The First Moon Missions

lunar module moon orbit command flight space land surface spacecraft rocket craft mission ship walk

2.1 Selected topics and their keywords


Topics of War and Peace

2.2 Topic Evolution over Time


2.3 Lexical Episode Plots

Distant Reading, Zooming and Highlighting, Close Reading

[Figure: occurrences of a word at Index 4, Index 17, Index 23 and Index 94; panels: Actual Distribution, Equidistance Distribution]

Lexical Episode

= a portion of the word sequence of a corpus in which a certain word appears more densely than expected from its frequency in the whole text.
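The definition can be made concrete with a deliberately simplified Python sketch. It is not the detection algorithm behind the Lexical Episode Plots: it only assumes that neighbouring occurrences belong to one episode when they are closer together than the word's average spacing in the whole text, i.e. the spacing it would have under an equidistant distribution. The `min_hits` threshold and the toy token sequence are assumptions as well.

```python
def lexical_episodes(tokens, word, min_hits=3):
    """Return (start, end) token spans where `word` is denser than expected.

    Simplifying assumption of this sketch: two neighbouring occurrences
    belong to the same episode if their distance is smaller than the
    average distance len(tokens) / frequency, i.e. the spacing the word
    would have if it were spread evenly over the text.
    """
    positions = [i for i, t in enumerate(tokens) if t == word]
    if len(positions) < min_hits:
        return []
    expected_gap = len(tokens) / len(positions)
    episodes = []
    run = [positions[0]]
    for pos in positions[1:]:
        if pos - run[-1] < expected_gap:
            run.append(pos)
        else:
            if len(run) >= min_hits:
                episodes.append((run[0], run[-1]))
            run = [pos]
    if len(run) >= min_hits:
        episodes.append((run[0], run[-1]))
    return episodes

# Invented toy sequence of 100 tokens with "moon" clustered near the start.
tokens = ["x"] * 100
for i in (4, 7, 9, 12, 60, 95):
    tokens[i] = "moon"

print(lexical_episodes(tokens, "moon"))
```

The returned spans are token offsets, so they can be mapped back to the text for zooming in from the distant view to close reading.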



•  Prohibition in the United States was a nationwide constitutional ban on the sale, production, importation, and transportation of alcoholic beverages that remained in place from 1920 to 1933.

•  https://en.wikipedia.org/wiki/Unemployment_and_Farm_Relief_Act 1930; some pointers also to the Great Depression

•  radio broadcast: peak in the 1930s

•  Hydrogen bomb: one small peak in 1945

•  world peace: from 1945 on

•  tax income revenue, late 1940s: something happened there.

•  government coalition, early 1950s: https://en.wikipedia.org/wiki/Attlee_ministry ??

•  Sports: 1990s

•  TV shows: 1990s