Headline Analysis

John Qiu

William McKeehan

Joshua Chavarria

Test Questions

1. What is graph clustering?

2. What is one of the graph clustering algorithms that was implemented in our headline analysis?

3. What is the name of the API used to collect our data?

John Qiu

• Born in China, came to America at age 2 - Grew up in Franklin, TN

• BBA in Economics, Minor in Math - May 2014

• MS in Business Analytics - Dec 2016

• Work at Oak Ridge National Lab - Health Data Sciences Institute

• Focus on Natural Language Processing

William McKeehan

www.mckeehan.info

Joshua Chavarria

• Computer Science Major

• Hometown: Los Angeles, CA

• Interests:

• Gaming

• Soccer

• Guitar

• Traveling

Introduction

● With headline analysis we are clustering keywords in headlines from a variety of sources in order to compare them.

● Our hypothesis is that sources with different perspectives are going to have different associations within their headlines.

● (For example, CNN is more likely to have Trump in a headline with Russia, whereas Fox might have Trump mentioned with Business.)

Motivation

• We believe that by looking at the associations within the headlines of the sources, we can identify the different narratives of each source.

• Goal: Compare a subset of news sources in order to show that sources with differing perspectives have different associations within their headlines.

Outline

• Approach

• Overview

• Algorithms

• Applications

• Implementation

• Open Issues

Approach

1) Gather news sources

2) Extract entities

3) Note relationships between co-occurrences

4) Use clustering algorithms to aggregate the relationships and compare sources

Overview of Cluster Analysis

• Cluster Analysis is not a single algorithm, but rather a group of algorithms.

• Any nonuniform data contains underlying structure due to the heterogeneity of the data. The process of identifying this structure in terms of grouping the data elements is called clustering.

• Graph clustering is the process of finding sets of related vertices in a graph and grouping them into “clusters”.

• This is a common technique in various fields, such as statistical data analysis, data mining, and pattern recognition.

Overview of Cluster Analysis: Visual Example

Overview of Cluster Analysis

• Given a data set, the goal of clustering is to divide the data set into clusters such that the elements assigned to a particular cluster are similar or connected.

• Desirable Cluster Properties in Graphs:

• At least one path connecting each pair of vertices within a cluster.

• If vertex u can’t reach vertex v, they should not be in the same cluster.

• A subset of vertices forms a good cluster if the induced subgraph is dense.
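
The desirable properties above can be checked mechanically. The following is a minimal sketch, not the presenters' code: it assumes the graph is stored as a dictionary of neighbor sets, and it reads "path within a cluster" as a path that stays inside the cluster. The function names and the toy graph are hypothetical.

```python
from collections import deque

def is_connected(adj, cluster):
    """True if every pair of vertices in `cluster` is joined by a path inside the cluster."""
    cluster = set(cluster)
    if not cluster:
        return True
    start = next(iter(cluster))
    seen = {start}
    queue = deque([start])
    while queue:  # breadth-first search restricted to the cluster
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v in cluster and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == cluster

def density(adj, cluster):
    """Fraction of possible edges that are present in the subgraph induced by `cluster`."""
    cluster = set(cluster)
    n = len(cluster)
    if n < 2:
        return 1.0
    # Each undirected edge is seen from both endpoints, hence the division by 2.
    edges = sum(1 for u in cluster for v in adj.get(u, ()) if v in cluster) / 2
    return edges / (n * (n - 1) / 2)

# Toy undirected graph (hypothetical data): a triangle and a separate pair.
adj = {
    "trump": {"russia", "putin"},
    "russia": {"trump", "putin"},
    "putin": {"trump", "russia"},
    "health": {"bill"},
    "bill": {"health"},
}

print(is_connected(adj, {"trump", "russia", "putin"}), density(adj, {"trump", "russia", "putin"}))
print(is_connected(adj, {"trump", "health"}), density(adj, {"trump", "russia", "health"}))
```

A candidate cluster that fails the connectivity check, or whose density is low, would be a poor cluster under the criteria listed on this slide.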

Graph Clustering Algorithms: Intro

In a graph setting, clustering means partitioning the graph so that edge weights within a group are large and edge weights across groups are small.
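
One way to make this criterion concrete is to total the edge weight that falls inside groups and the weight that crosses between groups. The sketch below assumes a weighted undirected graph stored as a dictionary from vertex pairs to weights; the function name, the labels, and the weights are hypothetical illustration data.

```python
def cut_weights(edges, labels):
    """Return (total weight inside groups, total weight across groups)."""
    within = 0.0
    between = 0.0
    for (u, v), w in edges.items():
        if labels[u] == labels[v]:
            within += w
        else:
            between += w
    return within, between

# Hypothetical weighted edges (co-occurrence counts) and a candidate partition.
edges = {
    ("trump", "russia"): 5,
    ("russia", "putin"): 4,
    ("health", "bill"): 6,
    ("trump", "bill"): 1,
}
labels = {"trump": 0, "russia": 0, "putin": 0, "health": 1, "bill": 1}

within, between = cut_weights(edges, labels)
print(within, between)
```

A partition with a large first value and a small second value matches the description above.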

Algorithms: Hierarchical Clustering

• A global clustering algorithm that creates a hierarchical decomposition of a set of objects using a similarity matrix.

• Two Methods

• Agglomerative Approach (Bottom-Up)

• Divisive Approach (Top-Down)

• Advantages:

• Easy to implement and more robust to noise.

• Disadvantages:

• Computationally demanding for large data sets.

• Hard to identify clusters from the dendrogram.

Agglomerative Hierarchical Clustering Pseudo Code:

Using Cosine Similarity as Similarity Measure:

• Initialize all vertices as individual clusters

• Using the adjacency matrix, calculate pairwise similarity between all vertices

• Merge the two most similar clusters, where the similarity of two clusters is either:

• the similarity of their most similar pair of vertices (single-linkage clustering), or

• the similarity of their least similar pair of vertices (complete-linkage clustering)

• Update the matrix to reflect the merged cluster

• Repeat until all vertices are in a single cluster (or the desired number of clusters remains)

Complexity:
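
The steps above can be sketched in plain Python. This is a naive illustration under stated assumptions, not the implementation used for the results in this deck. It compares vertices by the cosine similarity of their adjacency-matrix rows, and it places a 1 on the diagonal of the toy matrix (a self-loop) so that two adjacent vertices with no other neighbors still look similar. The function names and the matrix are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def agglomerative(adjacency, num_clusters, linkage="single"):
    """Merge clusters bottom-up until `num_clusters` remain."""
    n = len(adjacency)
    sim = [[cosine(adjacency[i], adjacency[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]  # every vertex starts as its own cluster
    # Single linkage scores two clusters by their most similar pair of vertices,
    # complete linkage by their least similar pair.
    pick = max if linkage == "single" else min
    while len(clusters) > num_clusters:
        best, best_pair = None, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = pick(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best_pair is None or score > best:
                    best, best_pair = score, (a, b)
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical weighted adjacency matrix: vertices 0-2 are tightly linked, as are 3-4.
adjacency = [
    [1, 2, 1, 0, 0],
    [2, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 3],
    [0, 0, 0, 3, 1],
]

clusters = agglomerative(adjacency, num_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

Recomputing cluster similarities from scratch on every pass keeps the sketch short but is slow; library implementations maintain the similarity matrix incrementally.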

Applications

• Clustering is often used to automatically generate feature representations for data according to a defined similarity measure.

• Specific uses include:

• Dimensionality reduction

• Multi-objective optimization

• Outlier/Anomaly detection

• Segmentation

• Applications:

• Recommendation systems - classifying users based on preferences

• Image Segmentation - classifying sections of images based on similar pixels

Implementation: Data Collection

• Collect/compare headlines

• EventRegistry.org

• Free

• Over 100,000 news publishers

• API

• Python Library

Bad Data Examples

• “DA seeks to revoke bond for accused drunk driver”

• “Levant Mediterranean dishes up small plates with big flavor”

• “Manalapan (2) at Colts Neck (19) - Girls Lacrosse”

• “Checheche Catholic priest in sex scandal - Nehanda Radio”

Headline Data Summary

Native Media   Source                # Articles   Political Affiliation
Agency         Reuters               426          NA
Agency         Associated Press      688          NA
Cable          Fox News              184          Trump
Cable          MSNBC                 60           Clinton
Internet       Breitbart             81           Trump
Internet       The Huffington Post   254          Clinton
Network        ABC News              134          Both
Network        CBS News              78           Both
Network        NBC News              96           Both
Newspaper      New York Times        306          Clinton
Radio          NPR.org               158          Clinton

Descriptive Statistics

How Can We Use Clustering to Analyze Our Headlines and Compare Sources?

We will be working with weighted undirected graphs to represent our data in two ways:

Word Level Representation:

Clustering on a single source’s word co-occurrence graph yields an abstraction of related content that can be compared between sources.

Document Level Representation:

Use document representation similarity measures for all documents to reveal similarities.
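
As an illustration of the word-level representation, the sketch below builds a weighted undirected co-occurrence graph from a list of headlines. It is a simplified sketch, not the presenters' pipeline: it treats every lowercase token as a vertex rather than extracting entities, and the function names and sample headlines are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def tokenize(headline):
    """Lowercase the headline and return its distinct tokens in sorted order."""
    return sorted(set(re.findall(r"[a-z0-9]+", headline.lower())))

def cooccurrence_graph(headlines):
    """Map each unordered word pair to the number of headlines containing both words."""
    weights = Counter()
    for headline in headlines:
        # Tokens are sorted, so each pair (u, v) has u < v and the graph is undirected.
        for u, v in combinations(tokenize(headline), 2):
            weights[(u, v)] += 1
    return weights

# Hypothetical headlines, for illustration only.
headlines = [
    "Trump meets Putin",
    "Putin and Trump discuss Russia",
    "Senate debates health bill",
]

graph = cooccurrence_graph(headlines)
print(graph[("putin", "trump")], graph[("russia", "trump")], graph[("health", "trump")])
```

Building one such graph per source gives the edge weights that a clustering algorithm can then group and compare across sources.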

How Do Computers See/Read/Get Information From Text?

1) Learn to count words

2) Learn which words to count

3) Learn to produce representations of words

1) Term Document Vector/Matrix (Salton 1968)

Definition: A document D from a corpus with n unique terms can be represented by a Term Document Vector D = [d1,...,dn] of length n, where di is the number of times term i occurs in D.

Pros:

• Quick to generate/normalize.

• Simple to interpret

• Introduced similarity measures to text data - Euclidean Distance and Centroid clustering (Salton 1975)

Cons:

• Huge dimensionality but really sparse

• No language structure - word order is lost

• Not how words work
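
A term document matrix can be sketched with the standard library alone. The example below is an illustration under assumptions, not the code behind the output that follows: the `min_df` parameter is a hypothetical name for a minimum document frequency cutoff, which is one plausible reading of the "mindf2" label in that output, and the sample documents are made up.

```python
import math
import re
from collections import Counter

def term_document_matrix(docs, min_df=1):
    """Return (vocab, matrix) where matrix[d][t] counts term t in document d."""
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(t for tokens in tokenized for t in set(tokens))
    vocab = sorted(t for t, count in df.items() if count >= min_df)
    index = {t: i for i, t in enumerate(vocab)}
    matrix = []
    for tokens in tokenized:
        row = [0] * len(vocab)
        for t in tokens:
            if t in index:
                row[index[t]] += 1
        matrix.append(row)
    return vocab, matrix

def euclidean(a, b):
    """Euclidean distance between two term document vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical headlines, for illustration only.
docs = [
    "Trump meets Putin",
    "Putin and Trump discuss Russia",
    "Senate debates Russia sanctions",
]

vocab, matrix = term_document_matrix(docs, min_df=2)
print(vocab)
print(matrix)
print(euclidean(matrix[0], matrix[1]))
```

Raising the cutoff shrinks the vocabulary, which reduces the dimensionality problem listed under Cons at the cost of dropping rare words.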

---- Reuters -------------------------------------
num articles: 426
orig vocab size 1587
mindf2 vocab size 607
vocab size 607
clust finished in 0.463397979736
words related to trump: right, rutte, fillon

---- Associated Press -------------------------------------
num articles: 688
vocab size 860
words related to trump: conservative

---- Fox News -------------------------------------
num articles: 184
orig vocab size 938
vocab size 277
words related to trump: to, 2016, struggle, starts, but, own

---- MSNBC -------------------------------------
num articles: 60
orig vocab size 274
vocab size 64
words related to trump: up

---- Breitbart -------------------------------------
num articles: 81
orig vocab size 559
vocab size 137
words related to trump: gorsuch, for, or, clinton

---- The Huffington Post -------------------------------------
num articles: 254
vocab size 362
words related to trump: election, didn, nomination, now, moonlight

---- ABC News -------------------------------------
num articles: 134
orig vocab size 633
vocab size 177
words related to trump: lawmakers, aca, bill, listening, her, blueprint, himself, prosecutor

---- CBS News -------------------------------------
num articles: 78
orig vocab size 457
vocab size 87
words related to trump: putin, health, russia

---- NBC News -------------------------------------
num articles: 96
orig vocab size 513
vocab size 176
words related to trump: flynn

---- The New York Times -------------------------------------
num articles: 306
vocab size 345
words related to trump: independence, let, pen, post, france, america, nears, ties, looks, foreign, pennsylvania, being, stories, at

Results

total num articles: 2465
vocab size 2332
words related to trump: governing, negotiate, feeling, that, camp, bad, citizens, gay, backing, demands, beijing, sparks, homes, partner, hike

2) Better Representations from Labeled Datasets

Part of Speech Tagging:

Brown Corpus (1960s): about 1,000,000 words, each tagged with its part of speech

Lemmatization - mapping words to a root form:

E.g. [Franch, French] -> French

Open Issues

• Parameter selection

• Scalability

• Evaluation

• Fake News

Issue - Parameter selection

• How do you determine the parameter values to give as input to the clustering algorithm?

Issue - Scalability

• How do the runtime and memory consumption of the algorithm behave for massive input graphs?

Issue - Evaluation

• How do we decide which clustering is the best?

Issue - Fake News

References

http://www.lsi.upc.edu/~bejar/amlt/articulos/Graph%20Clustering03.pdf

http://world.mathigon.org/Graph_Theory

http://micans.org/mcl/

http://searchengineland.com/google-news-ranking-stories-30424

http://cs-people.bu.edu/mp/images/pap101a.pdf

https://en.wikipedia.org/wiki/Named-entity_recognition

https://en.wikipedia.org/wiki/Parse_tree

Discussion