Tools and Methods for Processing and Visualizing Large Corpora · 2016. 1. 21.

Tools and Methods for Processing and Visualizing

Large Corpora

Gerold Schneider, University of Zurich and University of Konstanz

Mennatallah El-Assady, University of Konstanz

Hans Martin Lehmann, University of Zurich

d2e Conference, Helsinki


Overview and Contents


We present several approaches and methods that we develop or use to create workflows from data to evidence.

1. Do-it-yourself, from words via numbers to trends, as we use and develop them at the English Department in Zurich

•  Searching specific instances: Dependency Bank
•  Data-Driven Overuse: Lightside
•  Visualize your trends: GoogleViz

2. Interactive Visualizations, as we use and develop them at the Data Analysis and Visualization Group at the University of Konstanz

•  Topic Matrix View
•  Statistical Visualizations: Tableau
•  Lexical Episode Plot


1.1 DIY d2e: Dependency Bank


•  Search specific instances & sum by period/genre/etc.: Dependency Bank
   e.g. ‘education system’ vs. ‘system of education’


‘education system’ vs. ‘system of education’: the trend is confirmed by Google Books.

Noun-noun compounds are generally increasing.
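The "search specific instances & sum by period" step can be illustrated with a minimal sketch. It is not the Dependency Bank interface: it uses plain string patterns instead of a syntactic query, and the documents, the `DOCS` list and the function name are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical toy input: (year, text) pairs standing in for corpus documents.
DOCS = [
    (1905, "The system of education was reformed. A new system of education emerged."),
    (1985, "The education system was reformed, but the system of education debate went on."),
    (1995, "Critics of the education system praised the new education system."),
]

# Surface patterns for the two variants (plain strings, not a syntactic query).
VARIANTS = {
    "N-N": re.compile(r"\beducation system\b", re.IGNORECASE),
    "N-of-N": re.compile(r"\bsystem of education\b", re.IGNORECASE),
}


def count_by_decade(docs):
    """Return {decade: Counter({variant: hits})}, summed over all documents."""
    table = {}
    for year, text in docs:
        decade = year // 10 * 10
        counts = table.setdefault(decade, Counter())
        for name, pattern in VARIANTS.items():
            counts[name] += len(pattern.findall(text))
    return table


if __name__ == "__main__":
    for decade, counts in sorted(count_by_decade(DOCS).items()):
        print(decade, dict(counts))
```

The same loop could sum by genre instead of decade by swapping the grouping key.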


Relative Frequency of open-form noun-noun compounds

University of Zurich, English Department, Hans Martin Lehmann

Confirms and extends Leech et al. 2009.


Syntactic query: stative verbs in the progressive

“I’m loving it” in the BNC
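As a rough illustration of what such a query looks for, here is a surface-level approximation in Python. It is only a sketch: the real query is syntactic, whereas this pattern matches a form of BE, an optional -ly adverb, and an -ing form from a short, hand-picked verb list (`STATIVE_ING` is an assumption and far from complete). Possessive 's will produce false positives.

```python
import re

# Small illustrative list of stative verbs in -ing form (an assumption).
STATIVE_ING = ["loving", "knowing", "wanting", "believing", "liking", "hating"]

# A form of BE (full or contracted), an optional -ly adverb, a stative -ing form.
PATTERN = re.compile(
    r"(?:\b(?:am|is|are|was|were|be|been)\b|['’](?:m|re|s)\b)"
    r"\s+(?:\w+ly\s+)?"
    r"(?:" + "|".join(STATIVE_ING) + r")\b",
    re.IGNORECASE,
)


def find_stative_progressives(text):
    """Return the matched strings, in order of appearance."""
    return [match.group(0) for match in PATTERN.finditer(text)]


if __name__ == "__main__":
    print(find_stative_progressives("I’m loving it, and she was really wanting a break."))
```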



1.2 DIY d2e: Data-driven Overuse


Machine Learning for the Masses: Lightside
http://lightsidelabs.com/what/research/

Lightside allows you to do machine learning without programming skills. E.g. can we classify American speeches as Republican or Democrat? What are their typical linguistic features?

We use the CORPS II corpus: 8 million words, 3,618 speeches (Guerini et al. 2013).

To do document classification with Lightside you must

•  Create TAB-separated input.
•  Understand the broad idea of Naïve Bayes or regression, just because
    •  you need to interpret the results
    •  you don’t want to crash your computer
•  Be patient during the massive calculations
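To give a feel for the broad idea, here is a minimal sketch of Naïve Bayes on tab-separated input, written from scratch in Python. It is independent of Lightside's own implementation; the column names `party` and `text`, the four toy rows and all function names are assumptions made for illustration, not CORPS II data.

```python
import csv
import io
import math
from collections import Counter, defaultdict

# Hypothetical tab-separated input: a header row, then one labelled text per row.
TSV = "\n".join([
    "party\ttext",
    "republican\ttax cuts freedom tax",
    "republican\tfreedom tax border",
    "democrat\thealth care jobs",
    "democrat\tcare jobs health care",
])


def read_rows(tsv_text):
    """Parse the TSV into (label, tokens) pairs."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [(row["party"], row["text"].lower().split()) for row in reader]


def train(rows):
    """Count documents per label and words per label."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    for label, tokens in rows:
        doc_counts[label] += 1
        word_counts[label].update(tokens)
    vocab = {word for counts in word_counts.values() for word in counts}
    return doc_counts, word_counts, vocab


def log_p_word(word, label, word_counts, vocab):
    """Add-one smoothed log p(word | label)."""
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + 1) / (total + len(vocab)))


def classify(tokens, model):
    """Pick the label maximising log p(label) + sum of log p(word | label)."""
    doc_counts, word_counts, vocab = model
    n_docs = sum(doc_counts.values())
    scores = {}
    for label in doc_counts:
        score = math.log(doc_counts[label] / n_docs)
        for word in tokens:
            if word in vocab:  # unseen words are ignored
                score += log_p_word(word, label, word_counts, vocab)
        scores[label] = score
    return max(scores, key=scores.get)


def typical_features(label, other, model, top=3):
    """Words with the highest log-odds for `label` against `other`."""
    _, word_counts, vocab = model
    odds = {
        word: log_p_word(word, label, word_counts, vocab)
        - log_p_word(word, other, word_counts, vocab)
        for word in vocab
    }
    return sorted(odds, key=odds.get, reverse=True)[:top]


if __name__ == "__main__":
    model = train(read_rows(TSV))
    print(classify("tax freedom".split(), model))
    print(typical_features("democrat", "republican", model))
```

The log-odds ranking in `typical_features` is one simple way to read off which words a class overuses; a regression model would expose comparable information through its feature weights.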




1.3 DIY d2e: Google Visualization


•  Back to noun-noun compounds: relative frequency of the alternation in COHA
•  You need basic programming skills in R
•  Google Viz (the googleVis package) is an R library

There are excellent instructions by Martin Hilpert: thanks!
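Before any chart can be drawn, the counts have to be turned into relative frequencies of the alternation. The sketch below shows that preparation step in Python rather than R; the counts in `COUNTS` are made up and are not COHA figures, and the column names are assumptions.

```python
import csv
import io

# Made-up counts per decade, purely illustrative (not COHA figures).
COUNTS = {
    1900: {"N-N": 2, "N-of-N": 18},
    1950: {"N-N": 10, "N-of-N": 10},
    2000: {"N-N": 27, "N-of-N": 3},
}


def relative_frequencies(counts):
    """Return long-format rows: one row per decade and variant, with its share."""
    rows = []
    for decade in sorted(counts):
        total = sum(counts[decade].values())
        for variant, hits in sorted(counts[decade].items()):
            share = hits / total if total else 0.0
            rows.append({"decade": decade, "variant": variant, "share": round(share, 3)})
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["decade", "variant", "share"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(relative_frequencies(COUNTS)))
```

A long-format table of this kind (one identifier column, one time column, one value column) is the usual shape expected by charting tools, and it can be read into R as a data frame.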


2. Interactive Visualizations

We use and develop interactive visualizations at the Data Analysis and Visualization Group at the University of Konstanz.

Visualization for Digital Humanities

Digital Humanities projects are a main driving force for visualization in linguistics. Visualization is needed because the massive amounts of information can’t easily be viewed or understood using the traditional method of reading the texts.

Idea: Spot concepts, and zoom in to read the interesting parts.

And how do we spot concepts? By looking at keywords, such as words that are overused in particular documents/sections/paragraphs.
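One common way to quantify overuse is the log-likelihood keyness score (G2), which compares a word's frequency in a target text with its frequency in a reference text. The sketch below is one possible implementation, not necessarily the measure used in the tools shown here; the function names and the toy token lists are assumptions, and both token lists must be non-empty.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """G2 for a word occurring a times in a target of c tokens
    and b times in a reference of d tokens."""
    expected_target = c * (a + b) / (c + d)
    expected_reference = d * (a + b) / (c + d)
    g2 = 0.0
    if a:
        g2 += a * math.log(a / expected_target)
    if b:
        g2 += b * math.log(b / expected_reference)
    return 2 * g2


def overused_words(target_tokens, reference_tokens, top=5):
    """Rank the words that are relatively more frequent in the target by G2."""
    target, reference = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scored = []
    for word, a in target.items():
        b = reference[word]
        if a / c > b / d:  # keep overuse only, drop underused words
            scored.append((log_likelihood(a, b, c, d), word))
    return [word for _, word in sorted(scored, reverse=True)[:top]]


if __name__ == "__main__":
    section = "moon orbit moon the the lunar moon".split()
    rest = "the the the a of the war the".split()
    print(overused_words(section, rest))
```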

Unfortunately, the mapping from words to concepts is not 1:1

•  The same word can mean many things
•  Different words can refer to the same concept
•  Meanings change over time (Tony McEnery’s plenary)
•  Proper names are often just actors and witnesses in bigger concepts.

Vita brevis, arma longa

Noun-noun compounds suffer less from these problems: they denote (new) concepts


2.1 Firthian Hypothesis and Topic Models

Unfortunately, the mapping from words to concepts is not 1:1

•  The same word can mean many things, but contexts disambiguate
•  Different words can refer to the same concept, but only if they are in similar contexts
•  Word senses change over time: we need data-driven, context-based methods

“words with similar distributional properties have similar meanings” (Sahlgren, 2006: 21)

•  Words in immediate context → collocations (syntagmatic axis)
•  Words in larger context → semantic associations, topics (paradigmatic axis)

Topics are clusters of words that frequently co-occur. Many approaches to topic modelling (e.g. LDA) use a probability distribution model maximizing

p(topic | document) ⋅ p(word | topic)

We apply a deterministic topic model approach (IHTM) to 60,000 news articles from 1860 to 2000 (COHA).
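To make the product p(topic | document) ⋅ p(word | topic) concrete: in mixture-style topic models, summing this product over all topics gives the probability of a word in a document. The sketch below only illustrates that arithmetic. It is not LDA training and not IHTM, and the two toy distributions are made up.

```python
# Made-up toy distributions, for illustration only.
P_TOPIC_GIVEN_DOC = {"space": 0.8, "war": 0.2}
P_WORD_GIVEN_TOPIC = {
    "space": {"moon": 0.5, "orbit": 0.4, "army": 0.1},
    "war": {"moon": 0.0, "orbit": 0.1, "army": 0.9},
}


def p_word_given_doc(word, p_topic_doc, p_word_topic):
    """Sum over topics of p(topic | document) * p(word | topic)."""
    return sum(
        p_topic_doc[topic] * p_word_topic[topic].get(word, 0.0)
        for topic in p_topic_doc
    )


def most_likely_topic(word, p_topic_doc, p_word_topic):
    """Topic with the largest product p(topic | document) * p(word | topic)."""
    return max(
        p_topic_doc,
        key=lambda topic: p_topic_doc[topic] * p_word_topic[topic].get(word, 0.0),
    )


if __name__ == "__main__":
    for word in ("moon", "orbit", "army"):
        print(
            word,
            round(p_word_given_doc(word, P_TOPIC_GIVEN_DOC, P_WORD_GIVEN_TOPIC), 3),
            most_likely_topic(word, P_TOPIC_GIVEN_DOC, P_WORD_GIVEN_TOPIC),
        )
```

Even in a document dominated by the "space" topic, a word like "army" is best explained by the minority "war" topic, because p(word | topic) outweighs the lower topic share.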


2.1 Topic Matrix View

The First Moon Missions

lunar module moon orbit command flight space land surface spacecraft rocket craft mission ship walk


2.1 Selected topics and their keywords


Topics of War and Peace


2.2 Topic Evolution over Time


Topic Evolution over Time


2.3 Lexical Episode Plots

Distant Reading

Zooming and Highlighting

Close Reading


Lexical Episodes

= portion within the word sequence of a corpus where a certain word appears more densely than expected from its frequency in the whole text.

[Figure: actual distribution of a word (Index 4, Index 17, Index 23, Index 94) compared with an equidistance distribution; further labels: 25, 100; one dense stretch is marked as an episode.]
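The definition of a lexical episode can be turned into a small heuristic. This is a simplified sketch, not the algorithm behind the Lexical Episode Plots: it compares each gap between consecutive hits with the gap an equidistant distribution would give, and groups hits that lie closer together than that. The example values (hits at index 4, 17, 23 and 94) are borrowed from the labels on this slide; reading 100 as the text length is our assumption, as are the function name and the `min_hits` parameter.

```python
def find_episodes(positions, text_length, min_hits=2):
    """Return (start, end) index pairs of stretches where consecutive hits
    are closer together than under an equidistant distribution."""
    if not positions:
        return []
    positions = sorted(positions)
    expected_gap = text_length / len(positions)  # spacing if hits were equidistant
    episodes = []
    run = [positions[0]]
    for pos in positions[1:]:
        if pos - run[-1] < expected_gap:
            run.append(pos)  # denser than expected: extend the current run
        else:
            if len(run) >= min_hits:
                episodes.append((run[0], run[-1]))
            run = [pos]  # gap too large: start a new run
    if len(run) >= min_hits:
        episodes.append((run[0], run[-1]))
    return episodes


if __name__ == "__main__":
    print(find_episodes([4, 17, 23, 94], text_length=100))
```

With four hits in 100 words the equidistant gap is 25, so the hits at 4, 17 and 23 form one episode, while the isolated hit at 94 does not.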


Lexical Episodes





•  Prohibition in the United States was a nationwide constitutional ban on the sale, production, importation, and transportation of alcoholic beverages that remained in place from 1920 to 1933.
•  https://en.wikipedia.org/wiki/Unemployment_and_Farm_Relief_Act (1930): some pointers also to the Great Depression
•  radio broadcast: peak in the 1930s
•  Hydrogen bomb: one small peak in 1945
•  world peace: from 1945 on
•  tax income revenue: late 1940s; something happened there.
•  government coalition: early 1950s: https://en.wikipedia.org/wiki/Attlee_ministry ??
•  Sports: 1990s
•  TV shows: 1990s

