Headline Analysis
John Qiu
William McKeehan
Joshua Chavarria
Test Questions
1. What is graph clustering?
2. What is one of the graph clustering algorithms that was implemented in our
headline analysis?
3. What is the name of the API used to collect our data?
John Qiu
• Born in China, came to America at age 2 - Grew up in Franklin, TN
• BBA in Economics, Minor in Math - May 2014
• MS in Business Analytics - Dec 2016
• Work at Oak Ridge National Lab - Health Data Sciences Institute
• Focus on Natural Language Processing
William McKeehan
www.mckeehan.info
Joshua Chavarria
• Computer Science Major
• Hometown: Los Angeles, CA
• Interests:
• Gaming
• Soccer
• Guitar
• Traveling
Introduction
● With headline analysis we are
clustering keywords in headlines
from a variety of sources in order
to compare them.
● Our hypothesis is that sources
with different perspectives are
going to have different
associations within their headlines
● (For example, CNN is more likely
to have Trump in a headline with
Russia, whereas Fox might have
Trump mentioned with Business.)
Motivation
• We believe that by looking at the associations within the
headlines of the sources, we can identify the different narratives
of each source.
• Goal: Compare a subset of news sources in order to show that
sources with differing perspectives would have different
associations within their headlines
Outline
• Approach
• Overview
• Algorithms
• Applications
• Implementation
• Open Issues
Approach
1) Gather news sources
2) Extract Entities
3) Note Relationships between co-occurrences
4) Use clustering algorithms to aggregate
the relationships and compare sources
Overview of Cluster Analysis
• Cluster Analysis is not an algorithm, but rather a group of algorithms
• Any nonuniform data contains underlying structure due to the
heterogeneity of the data. The process of identifying this structure in
terms of grouping the data elements is called clustering
• Graph clustering is the process of finding sets of related vertices in a
graph and grouping them into “clusters”.
• This is a common technique amongst various fields, such as
statistical data analysis, data mining, and pattern recognition.
Overview of Cluster Analysis: Visual Example
Overview of Cluster Analysis
• Given a data set, the goal of clustering is
to divide the data set into clusters such
that the elements assigned to a particular
cluster are similar or connected.
• Desirable Cluster Properties in Graphs:
• At least one path connecting each pair
of vertices within a cluster.
• If vertex u can’t reach vertex v, they
should not be in the same cluster.
• A subset of vertices forms a good
cluster if the induced subgraph is
dense.
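The desirable properties above can be checked mechanically. The following is a minimal sketch, assuming an undirected graph stored as a dictionary of neighbor sets; the helper names (`induced_subgraph`, `is_connected`, `density`) and the toy graph are illustrative assumptions, not part of the project code.

```python
from collections import deque

def induced_subgraph(adj, vertices):
    """Restrict an adjacency dict {u: set(neighbors)} to the given vertices."""
    keep = set(vertices)
    return {u: adj[u] & keep for u in keep}

def is_connected(adj):
    """True if every pair of vertices is joined by at least one path (BFS)."""
    if not adj:
        return True
    start = next(iter(adj))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(adj)

def density(adj):
    """Fraction of the possible undirected edges that are present."""
    n = len(adj)
    if n < 2:
        return 0.0
    edges = sum(len(nbrs) for nbrs in adj.values()) / 2
    return edges / (n * (n - 1) / 2)

# Hypothetical toy graph: a triangle {a, b, c}, a pendant vertex d, an isolated vertex e.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}

triangle = induced_subgraph(graph, ["a", "b", "c"])
mixed = induced_subgraph(graph, ["a", "d", "e"])

print(is_connected(triangle), density(triangle))
print(is_connected(mixed), density(mixed))
```

In this toy case the triangle is a good cluster candidate (connected and dense), while {a, d, e} is not, because its vertices cannot reach each other inside the induced subgraph.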
Graph Clustering Algorithms: Intro
In a graph setting, clustering means partitioning the graph so that the total edge
weight within groups is large and the total edge weight across groups is small
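One way to make this criterion concrete is to sum the edge weights that fall inside clusters and those that cross between clusters. The sketch below assumes a weighted undirected graph stored as a dictionary of edge weights; the function name `cut_weights`, the vertex labels, and the weights are made up for illustration.

```python
def cut_weights(edges, assignment):
    """Return (within, across): total edge weight inside and between clusters."""
    within = across = 0.0
    for (u, v), weight in edges.items():
        if assignment[u] == assignment[v]:
            within += weight
        else:
            across += weight
    return within, across

# Hypothetical weighted undirected graph: two tight groups joined by one weak edge.
edges = {
    ("a", "b"): 5.0,
    ("a", "c"): 4.0,
    ("b", "c"): 3.0,
    ("d", "e"): 4.0,
    ("e", "f"): 2.0,
    ("a", "e"): 1.0,
}

# Two candidate partitions of the same vertices into clusters 0 and 1.
good = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
bad = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}

good_within, good_across = cut_weights(edges, good)
bad_within, bad_across = cut_weights(edges, bad)
print(good_within, good_across)
print(bad_within, bad_across)
```

A partition that keeps most of the weight within groups, like `good` here, is the kind of clustering the definition favors.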
Algorithms: Hierarchical Clustering
• A global clustering algorithm that creates a
hierarchical decomposition of sets of objects
using a similarity matrix.
• Two Methods
• Agglomerative Approach (Bottom-Up)
• Divisive Approach (Top-Down)
• Advantages:
• Easy to implement and more robust to
noise.
• Disadvantages:
• Computationally demanding for large
data sets.
• Hard to identify clusters from the dendrogram
Agglomerative Hierarchical Clustering Pseudo Code:
Using Cosine Similarity as Similarity Measure:
• Initialize all vertices as individual clusters
• Using Adjacency Matrix, calculate pairwise similarity between all vertices
• Either:
• Merge the two clusters whose most similar members are the most similar
(single-linkage clustering) or
• Merge the two clusters whose least similar members are the most similar
(complete-linkage clustering)
• Update Adjacency Matrix
• Repeat until all vertices are in a single cluster
Complexity:
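The pseudo code can be sketched in plain Python as follows. This is an illustrative sketch, not the implementation used for the results in these slides. It assumes the adjacency matrix includes self-loops (ones on the diagonal) so that directly connected vertices share a nonzero entry, and it recomputes linkage scores from the original pairwise similarities on each pass instead of updating a matrix in place. The function names and the toy matrix are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two adjacency-matrix rows."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def agglomerate(adjacency, n_clusters, linkage="single"):
    """Merge clusters bottom-up until n_clusters remain."""
    n = len(adjacency)
    sim = [[cosine(adjacency[i], adjacency[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]  # every vertex starts as its own cluster
    # Single linkage scores a cluster pair by its most similar members,
    # complete linkage by its least similar members.
    pick = max if linkage == "single" else min
    while len(clusters) > n_clusters:
        best, best_pair = float("-inf"), None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = pick(sim[i][j] for i in clusters[a] for j in clusters[b])
                if score > best:
                    best, best_pair = score, (a, b)
        a, b = best_pair
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters

# Hypothetical graph with self-loops: vertices 0-2 form one group, 3-4 another.
adjacency = [
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]

single = agglomerate(adjacency, 2, linkage="single")
complete = agglomerate(adjacency, 2, linkage="complete")
print(single)
print(complete)
```

Stopping at a requested number of clusters is one possible choice. Running until a single cluster remains and recording each merge would instead give the full dendrogram.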
Applications
• Clustering is often used to automatically generate feature representation for
data corresponding to a defined similarity measure.
• Specific uses include:
• Dimensionality reduction
• Multi-objective optimization
• Outlier/Anomaly detection
• Segmentation
• Applications:
• Recommendation systems - classifying users based on preferences
• Image Segmentation - classifying sections of images based on similar
pixels
Implementation: Data Collection
• Collect/compare headlines
• EventRegistry.org
• Free
• Over 100,000 news publishers
• API
• Python Library
Bad Data Examples
• “DA seeks to revoke bond for accused drunk driver”
• “Levant Mediterranean dishes up small plates with big
flavor”
• “Manalapan (2) at Colts Neck (19) - Girls Lacrosse”
• “Checheche Catholic priest in sex scandal - Nehanda
Radio”
Headline Data Summary
Native Media   Source                 # Articles   Political Affiliation
Agency         Reuters                426          NA
               Associated Press       688          NA
Cable          Fox News               184          Trump
               MSNBC                  60           Clinton
Internet       Breitbart              81           Trump
               The Huffington Post    254          Clinton
Network        ABC News               134          Both
               CBS News               78           Both
               NBC News               96           Both
Newspaper      New York Times         306          Clinton
Radio          NPR.org                158          Clinton
Descriptive Statistics
How Can We use Clustering to Analyze our Headlines
And Compare Sources?
We will be working with weighted undirected graphs to represent our data in two ways:
Word Level Representation:
Clustering a single source’s word co-occurrence graph gives an abstraction of
related content that can be compared between sources.
Document Level Representation:
Use similarity measures on the document representations of all documents to
reveal similarities.
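As a rough illustration of the word level representation, the sketch below builds a weighted undirected co-occurrence graph in which an edge weight counts the headlines that contain both words. The whitespace tokenization, the function name `cooccurrence_graph`, and the sample headlines are assumptions made for this example. They are not taken from the collected data.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_graph(headlines):
    """Map each unordered word pair to the number of headlines containing both."""
    weights = Counter()
    for headline in headlines:
        # Deduplicate and sort so each pair is stored once as (smaller, larger).
        words = sorted(set(headline.lower().split()))
        for u, v in combinations(words, 2):
            weights[(u, v)] += 1
    return weights

# Hypothetical headlines, invented for illustration only.
headlines = [
    "Trump meets Putin",
    "Putin praises Trump",
    "Senate passes bill",
]

graph = cooccurrence_graph(headlines)
for (u, v), weight in sorted(graph.items()):
    print(u, v, weight)
```

Building one such graph per source gives the per-source structures that clustering can then summarize and compare.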
How do Computers See/Read/Get Information
From Text?
1) Learn to Count Words
2) Learn which Words to count
3) Learn to produce representations of words
1) Term Document Vector/Matrix (Salton 1968)
Definition: A document D from a corpus with n unique
terms can be represented by a Term Document
Vector D = [d1,...,dn ] of length n
Pros:
• Quick to generate/normalize.
• Simple to interpret
• Introduced similarity measures for text data:
Euclidean distance and centroid clustering
(Salton 1975)
Cons:
• Huge dimensionality, but very sparse
• No language structure (e.g. word order)
• Not how words work
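To make the definition concrete, here is a minimal sketch that turns a small corpus into term document vectors and compares them with Euclidean distance. The whitespace tokenization, the helper names, and the three sample documents are assumptions made for illustration.

```python
import math

def build_vocabulary(documents):
    """Sorted list of the unique terms in the corpus."""
    return sorted({word for doc in documents for word in doc.lower().split()})

def term_vector(document, vocabulary):
    """Count how often each vocabulary term occurs in the document."""
    words = document.lower().split()
    return [words.count(term) for term in vocabulary]

def euclidean(u, v):
    """Euclidean distance between two term vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical documents, invented for illustration only.
documents = [
    "trump meets putin",
    "trump meets senate",
    "senate passes bill",
]

vocabulary = build_vocabulary(documents)
vectors = [term_vector(doc, vocabulary) for doc in documents]

print(vocabulary)
for vector in vectors:
    print(vector)
print(euclidean(vectors[0], vectors[1]), euclidean(vectors[0], vectors[2]))
```

Even in this tiny corpus most entries are zero, which is the sparsity listed under Cons, and reordering the words of a document leaves its vector unchanged.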
Reuters
num articles: 426
orig vocab size 1587
mindf2 vocab size 607
vocab size 607
clust finished in 0.463397979736
words related to trump
right
rutte
fillon
Associated Press
---- Associated Press -------------------------------------
num articles: 688
orig vocab size 1687
mindf2 vocab size 860
vocab size 860
clust finished in 0.377697944641
words related to trump
conservative
---- Fox News -------------------------------------
num articles: 184
orig vocab size 938
mindf2 vocab size 277
vocab size 277
clust finished in 0.170491933823
words related to trump
to
2016
struggle
starts
but
own
---- MSNBC -------------------------------------
num articles: 60
orig vocab size 274
mindf2 vocab size 64
vocab size 64
clust finished in 0.00706195831299
words related to trump
up
Breitbart
num articles: 81
orig vocab size 559
mindf2 vocab size 137
vocab size 137
clust finished in 0.0235621929169
words related to trump
gorsuch
for
or
clinton
The Huffington Post
num articles: 254
orig vocab size 1239
mindf2 vocab size 362
vocab size 362
clust finished in 0.130997180939
words related to trump
election
didn
nomination
now
moonlight
ABC News
num articles: 134
orig vocab size 633
mindf2 vocab size 177
vocab size 177
clust finished in 0.0359399318695
words related to trump
lawmakers
aca
bill
listening
her
blueprint
himself
prosecutor
CBS News
num articles: 78
orig vocab size 457
mindf2 vocab size 87
vocab size 87
clust finished in 0.0120129585266
words related to trump
putin
health
russia
NBC News
num articles: 96
orig vocab size 513
mindf2 vocab size 176
vocab size 176
clust finished in 0.0229661464691
words related to trump
flynn
The New York Times
num articles: 306
orig vocab size 1262
mindf2 vocab size 345
vocab size 345
clust finished in 0.0602300167084
words related to trump
independence
let
pen
post
france
america
nears
ties
looks
foreign
pennsylvania
being
stories
at
Results
total num articles: 2465
orig vocab size 4636
mindf2 vocab size 2332
vocab size 2332
clust finished in 11.3553888798
words related to trump
governing
negotiate
feeling
that
camp
bad
citizens
gay
backing
demands
beijing
sparks
homes
partner
hike
2) Better Representations from Labeled Datasets
Part of Speech Tagging:
Brown Corpus (1960): 1,000,000 words tagged with part of speech
Lemmatization - mapping words to a root form:
E.g. [Franch, French] -> French
Open Issues
• Parameter selection
• Scalability
• Evaluation
• Fake News
Issue - Parameter selection
• How do you
determine the
parameter values to
give as input to the
clustering algorithm?
Issue - Scalability
• How does the runtime and
memory consumption of the
algorithm behave for massive
input graphs?
Issue - Evaluation
• How do you decide which clustering is the best?
Issue - Fake News
Discussion