Date post: | 18-Feb-2017 |
Category: |
Data & Analytics |
Upload: | paragonscienceinc |
Finding Emerging Topics Using Chaos and Community Detection in Social Media Graphs
Steve Kramer, Ph.D.
President & Chief Scientist
Paragon Science, Inc.
September 2015
Copyright © 2006-2015 Paragon Science, Inc. All rights reserved.
Overview
• Background Information about Paragon Science
• Example 1: Ebola Twitter Analysis 2014
• Example 2: Stock Market Analysis via Twitter
• Q & A
Paragon Science, Inc. 2
About Paragon Science
• Advisory Board Company: Analysis of Healthcare Data
• Digital Motorworks/CDK Global: Vehicle Pricing Analytics
• Houston Law Firm: Email Analysis for Patent Lawsuit
• Place IQ: Mobile Phone Data Analysis
• RetailMeNot: Web Analytics for Online Coupons
• Vast.com: Web User Click Patterns
Founder: Dr. Steve Kramer
• PhD in computational physics (nonlinear dynamics)
• Self-funded data science entrepreneur
• 22 years of research and high-tech experience
• Manager and consultant at software companies
• Reviewer for scientific journals and conferences
• Member of StartOut Austin steering committee
http://affinityincmagazine.com/paragon-science-puts-patented-technology-to-work-for-range-of-clients/
What Are We Doing?
Using our patented anomaly detection software to find the “unknown unknowns”: unusual changes that represent revenue opportunities to exploit or risks to mitigate
Many possible application areas:
• Social media alerting and sentiment change detection
• Pricing and market trend analysis and alerting
• Fraud prevention (banking, insurance, online auctions, …)
Key advantages:
• No machine learning or training required
• Robust to missing or erroneous data
• Highly scalable and parallelizable
How Is It Done Today?
Existing approaches:
• Standard SNA metrics
• Rule-based systems (transaction profiling, etc.)
• Bayesian and other statistical/probabilistic models
• Machine learning tools (neural nets, HMMs, etc.)
Some limitations of existing methods:
• Training requirements can be large for neural nets.
• For rule-based systems, it is difficult to effectively predict or define new “bad” anomalies or patterns in advance.
• Many current methods are not scalable to real-world operational requirements.
What Is New in Our Patented Approach?
A powerful anomaly detection approach that incorporates nonlinear time series analysis methods
• US Patent #8738652 (1.usa.gov/1kkyVD9): “Systems and Methods for Dynamic Anomaly Detection”
Key questions answered:
• Which entities behave or evolve differently than others in the data set?
• Which entities have shifted their behavior unexpectedly?
What Is New in Our Approach? (Cont’d.)
Our framework inherently captures the dynamics of the entities under study, without having to specify in advance normal vs. abnormal behavior.
We can simultaneously analyze the time evolution of
• Network structures
• Any associated attributes (text terms, geospatial position, etc.)
Our technique is robust with respect to missing or erroneous data.
As a result, we can
• Find key players in rapidly changing networks
• Provide early warning of viral videos and online documents
• Focus attention on the most-anomalous events or transactions
Dynamic Anomaly Detection Overview
A general approach that incorporates nonlinear time series analysis methods
• Complexity measures
• Finite-time Lyapunov exponents (FTLEs)
Input data
• Communications or transactional data streams
• General time-dependent data sets
Key questions
• Which entities behave or evolve differently than others in the data set?
• Which entities have shifted their behavior unexpectedly?
Finite-Time Lyapunov Exponents (FTLEs)
General dynamical system
Flow map
• Advects points in the state space
• Describes the time evolution of the system
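In conventional notation, a general dynamical system and its flow map can be written as shown below. The symbols v (velocity field), phi (flow map), t_0 (start time), and T (integration time) are assumptions taken from the standard FTLE literature rather than from the slides.

```latex
\begin{align}
  \dot{\mathbf{x}}(t) &= \mathbf{v}\bigl(\mathbf{x}(t), t\bigr), \qquad \mathbf{x}(t_0) = \mathbf{x}_0 \\
  \phi_{t_0}^{t_0+T}(\mathbf{x}_0) &= \mathbf{x}(t_0 + T;\, t_0, \mathbf{x}_0)
\end{align}
```

The first line is the evolution law; the second says that the flow map carries an initial state x_0 to the state reached after time T.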
Finite-Time Lyapunov Exponents (FTLEs)
FTLEs characterize the amount of stretching or contraction about a point x0 during a time interval T
• Stability
• Predictability
Definition
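A minimal sketch of how an FTLE can be computed, assuming the widely used definition sigma = (1/|T|) ln sqrt(lambda_max(J^T J)), where J is the Jacobian of the flow map at x0. The 2-D restriction, the finite-difference Jacobian, the function names, and the saddle-flow example are illustrative assumptions, not Paragon Science's patented implementation.

```python
import math

def jacobian_2d(flow_map, x0, y0, h=1e-6):
    """Central-difference Jacobian of a 2-D flow map at (x0, y0)."""
    fxp = flow_map(x0 + h, y0)
    fxm = flow_map(x0 - h, y0)
    fyp = flow_map(x0, y0 + h)
    fym = flow_map(x0, y0 - h)
    return [
        [(fxp[0] - fxm[0]) / (2 * h), (fyp[0] - fym[0]) / (2 * h)],
        [(fxp[1] - fxm[1]) / (2 * h), (fyp[1] - fym[1]) / (2 * h)],
    ]

def ftle_2d(flow_map, x0, y0, T, h=1e-6):
    """FTLE = ln(sqrt(lambda_max(J^T J))) / |T| for a 2-D flow map."""
    (a, b), (c, d) = jacobian_2d(flow_map, x0, y0, h)
    # Entries of the symmetric Cauchy-Green tensor C = J^T J.
    c11 = a * a + c * c
    c12 = a * b + c * d
    c22 = b * b + d * d
    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    mean = 0.5 * (c11 + c22)
    radius = math.sqrt((0.5 * (c11 - c22)) ** 2 + c12 ** 2)
    lam_max = mean + radius
    return math.log(math.sqrt(lam_max)) / abs(T)

def make_saddle_flow(T):
    """Exact flow map of the linear saddle dx/dt = x, dy/dt = -y over time T."""
    return lambda x, y: (x * math.exp(T), y * math.exp(-T))

T = 2.0
sigma = ftle_2d(make_saddle_flow(T), 0.3, -0.1, T)
print(sigma)  # close to 1.0, the expansion rate of the saddle
```

For this saddle the stretching rate along x is exactly 1 per unit time, so the computed value serves as a sanity check. A positive FTLE indicates local divergence of nearby states, which is what ties the measure to stability and predictability.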
Derived Jacobian Vectors
Similarly, characteristic vectors derived from the flow map’s Jacobian can describe the generalized directions of the local stretching or contraction.
Possible derivation approaches:
• Weight-based column sampling
• Singular value decomposition (SVD)
• Principal component analysis (PCA)
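As a rough illustration of the SVD option, the dominant stretching direction of a Jacobian J is its leading right singular vector. The sketch below approximates that vector by power iteration on J^T J instead of running a full SVD; the helper names and the shear-matrix example are hypothetical, and the start vector is assumed not to be orthogonal to the dominant direction.

```python
import math

def mat_vec(J, v):
    """Return J v for a matrix stored as a list of rows."""
    return [sum(row[j] * v[j] for j in range(len(v))) for row in J]

def mat_t_vec(J, w):
    """Return J^T w."""
    return [sum(J[i][j] * w[i] for i in range(len(J))) for j in range(len(J[0]))]

def dominant_stretch(J, iterations=200):
    """Approximate the leading right singular vector and singular value of J."""
    n = len(J[0])
    v = [1.0 / math.sqrt(n)] * n  # arbitrary unit start vector
    for _ in range(iterations):
        u = mat_t_vec(J, mat_vec(J, v))  # u = (J^T J) v
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:
            break
        v = [x / norm for x in u]
    # The stretch factor along v is the length of J v.
    sigma = math.sqrt(sum(x * x for x in mat_vec(J, v)))
    return v, sigma

# Simple shear: its largest singular value is the golden ratio.
J_shear = [[1.0, 1.0], [0.0, 1.0]]
direction, stretch = dominant_stretch(J_shear)
print(direction, stretch)
```

Power iteration keeps the example dependency-free; for large or ill-conditioned Jacobians a library SVD routine would be the more robust choice.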
Paragon Dynamic Anomaly Detection
[Flowchart. Box labels: Representation of Data at t=ti; Cluster Resolution; Feature Vector Encoding; Outlier Detection at t=ti; decision “3+ Time Intervals?” (Yes/No); Clustering/Segmentation; Dynamic Anomaly Detection; Nonlinear Time Series Analysis (FTLEs, Dynamic Thresholds, etc.); Pattern Classification; Outlier Detection; Domain-Specific Filtering (Threat Signatures, Risk Profiles, etc.)]
Example 1: Ebola Twitter Analysis 2014
Sample data set from Twitter API collected using twittertap
• Date range: 11/8/2014 – 11/16/2014
• 2,541,812 tweets
• 4,708,678 generated links with hashtags, URLs, and user replies
Research plan
• Perform k-core decomposition
• Run anomaly detection software on sub-networks of nodes in the central core to find the most influential users and most viral URLs
• Carry out community detection and topic detection
Twitter-Induced Social Networks
[Diagram. Node labels: User A, User B, User C; URL 1, URL 2; Hash Tag 1, Hash Tag 2. Link labels: “replies to”, “mentions”, “references”, “uses”.]
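A minimal sketch of how such typed links might be generated from raw tweets. The record fields ("user", "text", "reply_to"), the regular expressions, and the pairing of link labels with node types ("references" for URLs, "uses" for hashtags) are assumptions made for illustration; the slides do not specify them.

```python
import re
from collections import Counter

MENTION = re.compile(r"@(\w+)")
HASHTAG = re.compile(r"#(\w+)")
URL = re.compile(r"https?://\S+")

def tweet_to_links(tweet):
    """Turn one tweet record into (source, link_type, target) triples."""
    user = ("user", tweet["user"])
    text = tweet["text"]
    reply_to = tweet.get("reply_to")
    links = []
    if reply_to:
        links.append((user, "replies to", ("user", reply_to)))
    for url in URL.findall(text):
        links.append((user, "references", ("url", url)))
    # Strip URLs first so '#' or '@' inside a URL is not misread.
    stripped = URL.sub(" ", text)
    for name in MENTION.findall(stripped):
        if name != reply_to:  # a reply already covers this user
            links.append((user, "mentions", ("user", name)))
    for tag in HASHTAG.findall(stripped):
        links.append((user, "uses", ("hashtag", tag.lower())))
    return links

# Hypothetical sample records.
tweets = [
    {"user": "alice", "reply_to": "bob",
     "text": "@bob see http://example.org/a #Ebola"},
    {"user": "carol", "reply_to": None,
     "text": "Thanks @alice http://example.org/a #ebola #CDC"},
]
links = [link for t in tweets for link in tweet_to_links(t)]
link_counts = Counter(kind for _, kind, _ in links)
print(link_counts)
```

Keeping the node type inside each node key lets users, URLs, and hashtags share one graph without name collisions.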
K-core Decomposition
The k-core of a graph is a maximal subgraph in which each vertex has degree at least k.
• The coreness of a vertex is k if it belongs to the k-core but not to the (k+1)-core.
• The k-core decomposition is performed by recursively removing all the vertices (along with their respective edges) that have degrees less than k.
The k-core decomposition of a network can be very effective in identifying the individuals within a network who are best positioned to spread or share information.
• M. Kitsak, et al., “Identification of influential spreaders in complex networks,” arXiv:1001.5285v1 [physics.soc-ph] (2010).
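The peeling procedure described above can be sketched in a few lines. This is a simple, unoptimized version for an undirected graph stored as a dictionary of neighbor sets; the function name and the toy graph are hypothetical, and it is not the Lanet-vi implementation.

```python
def core_numbers(adjacency):
    """Return {vertex: coreness} by repeatedly peeling low-degree vertices."""
    # Work on a copy so the caller's graph is left untouched.
    remaining = {v: set(nbrs) for v, nbrs in adjacency.items()}
    coreness = {}
    k = 0
    while remaining:
        # Vertices whose current degree is at most k cannot be in the (k+1)-core.
        peel = [v for v, nbrs in remaining.items() if len(nbrs) <= k]
        if not peel:
            k += 1  # everything left has degree > k, so move to the next shell
            continue
        for v in peel:
            coreness[v] = k
            for u in remaining.pop(v):
                remaining[u].discard(v)
    return coreness

# Toy graph: a 4-clique (a, b, c, d) with a tail a - e - f.
graph = {
    "a": {"b", "c", "d", "e"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"a", "b", "c"},
    "e": {"a", "f"},
    "f": {"e"},
}
cores = core_numbers(graph)
print(cores)
```

Vertices in the highest shell (here the clique) form the central core that the analysis focuses on. Each pass rescans all remaining vertices, which is fine for small graphs but slow for networks with millions of links.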
K-Core Decomposition of the Ebola Network
http://sourceforge.net/projects/lanet-vi/
Central Core of the Ebola Network
Top URLs in the Central Core
URL | K Shell | Degree
http://goo.gl/pFg3Z2 | 49 | 279
http://goo.gl/BFEUgy | 49 | 233
http://goo.gl/S37kHT | 49 | 212
http://goo.gl/silISF | 47 | 364
http://invst.rs/7MKWHB | 22 | 779
http://cnn.it/1wlIlUe | 22 | 741
http://trib.al/YKSMCSN | 22 | 734
http://nyp.st/136BPG3 | 22 | 698
http://nypost.com/2014/10/29/cdc-admits-droplets-from-a-sneeze-could-spread-ebola/ | 22 | 415
http://fxn.ws/1oVgLwc | 22 | 406
Top-Ranked Website (URLs 1, 2, and 4)
UMA MENTIRA CHAMADA ,,EBOLA,, VEJAM !!! | NOTICIÁRIO DA WEB
A statement made by a man in Ghana called Nana Kwame rocked the internet in recent days. The following information has to reach people. We need to see the Ebola for what it really is. It's time to wake up the world agenda behind this whole story.
Follow what this man has to say about what is happening in their country of origin:
People in the world need to know what is happening here in West Africa. They are lying! The “Ebola” as a virus does not exist and is not contagious. The Red Cross brought a disease to four specific countries, for four specific reasons and is only contracted by those who receive treatments and injections of the Red Cross. That's why Liberians and Nigerians began to expel the Red Cross in their countries!
5th Ranked Website
6th Ranked Website
Topic Detection in the Ebola Twitter Network
[Diagram. Node labels: User A, User B, User C; URL 1, URL 2; Term 1, Term 2, Term 3, …, Term N; Topic 1, Topic 2, …, Topic M. Link labels: “replies to”, “mentions”, “references”.]
Applicable “Soft” Clustering Methods
K-Groups/Group Discovery Algorithm (GDA)
• J. Kubica, A. Moore, and J. Schneider, “Tractable group detection on large link data sets,” The Third IEEE International Conference on Data Mining (2003).
Clique Percolation (http://www.cfinder.org/)
• G. Palla, et al., “Uncovering the overlapping community structure of complex networks in nature and society,” Nature, 435, p. 814 (2005).
Louvain Modularity Optimization
• V. Blondel, et al., “Fast unfolding of communities in large networks,” Journal of Statistical Mechanics: Theory and Experiment, 10, P10008 (2008).
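Of the three methods, Louvain is the easiest to illustrate, because it greedily searches for the partition that maximizes modularity Q. The sketch below computes only Q for a given partition of an unweighted, undirected graph; it does not perform the Louvain search itself, and the function name and toy graph are hypothetical.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Q = sum over communities of (internal_edges / m) - (degree_sum / 2m)^2."""
    m = len(edges)
    internal = defaultdict(int)    # edges with both ends in the community
    degree_sum = defaultdict(int)  # total degree of the community's vertices
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)

# Two triangles joined by a single bridge edge (c, d).
edges = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"),
         ("c", "d")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
single = {v: 0 for v in split}
q_split = modularity(edges, split)
q_single = modularity(edges, single)
print(q_split, q_single)
```

Splitting the graph at the bridge scores higher than lumping everything together, which is the signal a modularity optimizer follows when it merges or separates communities.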
Summary of Top 200 Topic Anomalies
Topic     Peak Start Time   Peak End Time     Max Change Metric  # Anomalies
Topic 99  2014-11-06 06:18  2014-11-12 10:18  2.97               40
Topic 8   2014-11-05 20:18  2014-11-07 07:18  2.891              34
Topic 59  2014-11-06 20:18  2014-11-11 19:18  2.43               28
Topic 1   2014-11-05 17:18  2014-11-05 19:18  2.32               3
Topic 52  2014-11-05 17:18  2014-11-05 18:18  2.30               2
Topic 50  2014-11-05 19:18  2014-11-06 15:18  2.22               11
Topic 32  2014-11-05 18:18  2014-11-05 19:18  2.18               2
Topic 20  2014-11-05 20:18  2014-11-06 02:18  2.11               7
Topic 2   2014-11-07 07:18  2014-11-12 16:18  2.10               33
Topic 28  2014-11-05 20:18  2014-11-05 22:18  2.00               3
Topic 29  2014-11-08 02:18  2014-11-12 18:18  1.96               21
Topic 97  2014-11-06 09:18  2014-11-07 03:18  1.91               4
Topic 30  2014-11-05 20:18  2014-11-05 20:18  1.84               1
Topic 22  2014-11-05 23:18  2014-11-06 02:18  1.79               4
Topic 18  2014-11-05 17:18  2014-11-05 17:18  1.65               1
Topic 15  2014-11-05 19:18  2014-11-05 19:18  1.63               1
Topic 4   2014-11-08 14:18  2014-11-12 15:18  1.61               5
Key Sites Related to Top 5 Ebola Topic Anomalies
• Topic 99 (max change metric 2.973, peak datetime 2014-11-06 17:18:27)
  Top related URL title: FACT SHEET: Emergency Funding Request to Enhance the U.S. Government’s Response to Ebola at Home and Abroad | The White House
• Topic 8 (max change metric 2.888, peak datetime 2014-11-05 20:18:27)
  Top related URL title: BBC News - Ebola outbreak: Barack Obama 'to ask Congress for $6bn'
• Topic 59 (max change metric 2.426, peak datetime 2014-11-07 02:18:27)
  Top related URL title: » Obama Caught Ordering Press to Cover Up Ebola Alex Jones' Infowars: There's a war on for your mind!
• Topic 1 (max change metric 2.321, peak datetime 2014-11-05 17:18:27)
  Top related URL title: UMA MENTIRA CHAMADA ,,EBOLA,, VEJAM !!! | NOTICIÁRIO DA WEB
• Topic 52 (max change metric 2.296, peak datetime 2014-11-05 17:18:27)
  Top related URL title: Nigeria Property: Ebola Virus Originated From US Bio-warfare Labs In West Africa – American Prof
Example: Topic 99 URL-to-User Links
Topic 99a: Economic Consequences
Topic 99b: Mobile Data to Prevent Ebola
Topic 99c: ISIS and Ebola
Topic 99d: @ebolafiles (Twitter user)
Topic 99e: Emergency Funding Request
Topic 99f: Follow Ebola
Follow Ebola | Updated every second & see what the #CDC & #WHO is not telling you about #Ebola
Example 2: Stock Market Analysis via Twitter
Twitter Stock Market Data Set
Date range: August 5-29, 2015
175,246 tweets sent by 28,754 users
Network graph generated includes these links:
• symbol links to URL: 430,842 (74,034 distinct URLs)
• user links to URL: 149,117
• user mentions user: 74,247
• user references hash tag: 176,670
• user references symbol: 501,165
• user replies to user: 10,698
Goal:
• Identify key influencers and emerging topics that could influence prices
• Provide high-quality input for Moodzee predictive models
Twitter Stock Market Graph for August 2015
Twitter Stock Market Graph (Zoom 1)
Twitter Stock Market Graph (Zoom 2)
Identifying Key Influencers
Perform k-core decomposition
Results:
• 50 k-shells
• 102 users at the center of the network
• Examine stock symbol -> URL links for the central users using uncertainty scores for the content of the web pages
Twitter User | # Links
DayTradersGroup | 855
diggingplatinum | 652
Benzinga | 261
WrigleyTom | 203
SeekingAlpha | 182
OpenOutcrier | 126
theflynews | 125
WallStJesus | 119
Istock8 | 96
valuewalk | 93
Network of 102 Central Users and 2910 Neighbors
Network of 102 Central Users and Neighbors (Zoom 1)
Network of 102 Central Users and Neighbors (Zoom 2)
Using Financial Sentiment Scores: Uncertainty
• Predicting Is Hard Business | Seeking Alpha
  URL(s): http://seekingalpha.com/article/3422496-predicting-is-hard-business?source=feed_f
  Uncertainty: 69
• In Today's Overheated Market, Control Risk In Your Retirement Portfolios With Sound Valuation | Seeking Alpha
  URL(s): http://seekingalpha.com/article/3455116-in-todays-overheated-market-control-risk-in-your-retirement-portfolios-with-sound-valuat
  Uncertainty: 63
• Comments On The Market Correction; Focus On Biotechs: Large Caps - Regeneron Pharmaceuticals, Inc. (NASDAQ:REGN) | Seeking Alpha
  URL(s): http://seekingalpha.com/article/3468626-comments-on-the-market-correction-focus-on-biotechs-large-caps?source=feed_f
  Uncertainty: 55
• TradingView: Free Stock Charts and Forex Charts Online.
  URL(s): http://www.tradingview.com
  Uncertainty: 51
• A MASSIVE New Platinum Pick Is Being Released At 9:30 am Today! Get On The List For Early Access To This New Play. | Blog
  URL(s): http://tinyurl.com/oea3bjx, http://tr.im/oCRrP, http://bit.ly/1JhlgVb
  Uncertainty: 49
• Our Pick On VGTL Has Gained 242.86% For Our Subscribers, In 2 Months! | Blog
  URL(s): http://bit.ly/1OOMiY9, http://tr.im/6hNJf
  Uncertainty: 47
• After 550% Gains On Our Picks In 5 Weeks, We Have A Major New Pick Coming Tomorrow! It is ONLY being released to Platinum Members Tomorrow, So Go Platinum To Get It Early! | Blog
  URL(s): http://ow.ly/QrGNn
  Uncertainty: 47
• Our Picks Gained Over 550% In The Past Month! And We Have A MASSIVE New Pick Coming To Our Platinum Members! Subscribe To Get It Early. | Blog
  URL(s): http://bit.ly/1UjdodT, http://goo.gl/r34fP7, http://tr.im/mZn9y
  Uncertainty: 47
• Our Pick On VGTL Has Gained 242.86% For Our Subscribers, In 2 Months! | Blog
  URL(s): http://tinyurl.com/qjwxxwk
  Uncertainty: 47
• What To Find Before Seeking Alpha: Position Size | Seeking Alpha
  URL(s): http://seekingalpha.com/article/3444516-what-to-find-before-seeking-alpha-position-size?source=twitter_sa_factset
  Uncertainty: 37
Loughran and McDonald Financial Sentiment Dictionaries: Tim Loughran and Bill McDonald, 2011, “When is a Liability not a Liability? Textual Analysis, Dictionaries, and 10-Ks,” Journal of Finance, 66:1, 35-65
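A minimal sketch of dictionary-based uncertainty scoring, assuming the score is a raw count of page words that appear in an uncertainty word list. The slides do not state how their scores are computed or scaled, and the short word set below is a hypothetical stand-in, not the actual Loughran and McDonald dictionary.

```python
import re

# Hypothetical stand-in for an "uncertainty" word list; the real
# Loughran-McDonald dictionary is much larger.
UNCERTAINTY_WORDS = {
    "may", "might", "could", "possibly", "uncertain",
    "uncertainty", "risk", "approximately", "volatile",
}

def uncertainty_count(text, lexicon=UNCERTAINTY_WORDS):
    """Count how many word tokens in the text belong to the lexicon."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(1 for w in words if w in lexicon)

# Hypothetical page texts keyed by an illustrative page name.
pages = {
    "page_a": "Earnings may fall and could possibly miss; the risk is real.",
    "page_b": "Revenue rose and margins improved last quarter.",
}
scores = {name: uncertainty_count(text) for name, text in pages.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

Ranking pages by such a score gives a table like the one above. Normalizing by page length would be a natural variation if long pages should not dominate.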
Anomaly Scores for Symbols -> URL Links
Largest jump in the anomaly scores: $BIDU on 8/13/2015
$BIDU Network at First Uncertainty Surge
Topic Detection in the Twitter URL Network
[Diagram. Node labels: User A, User B, User C; URL 1, URL 2; Term 1, Term 2, Term 3, …, Term N; Topic 1, Topic 2, …, Topic M. Link labels: “replies to”, “mentions”, “references”.]
Topic Detection: Network of 698 Web Pages Shared by 102 Central Users
215 topics detected
Network of 698 Web Pages Shared by 102 Central Users
Nodes colored by topic #
Web Site Titles in Largest Topic
SPY ETF Turns Negative For Year Before Clawing Back - Investors.com
$TSLA $GE $JCP $JWN $LOCO $KING $DD $JPM $AMAT $BAC $CBK: Stocks to Watch: Tesla, GE, JC Penney, Nordstrom | Stock News Hour
$AAPL Apple has completed a 6-month complex H&S top
$MU $SYMC $AAPL $ATML $SYNA $QLGC $CRUS $FCS $YHOO $BABA $AKAM $FSLR: It’s Not Just Apple: Yahoo!, Micron, Synaptics Fall on China Fears | Stock News Hour
$GS $NVDA $BRCM $MU $SWKS $QCOM $INTC $WYNN $AAPL $YHOO $CAT $GM $T $VZ: China Damage Spreading | Stock News Hour
$GOOGL $CAT $AAPL $SHAK $KHC $TW $JASO $RRGB $CSC $SYMC $CREE: Investors eye positive catalysts in oil, Google | Stock News Hour
$GOOGL $PCLN $CTRP $BIDU $FB $AMZN $BABA $EXPE $LONG $QUNR $AWAY: The only US Web company that’s figured out China | Stock News Hour
New Partner Company: Moodzee
Text analytics for financial markets
• Predictive models
• Advance warning of price-moving events
Initial target users: Hedge funds
Price correlations done; now back-testing, then paper trading, then real trading
[Diagram. Component labels: Alerts; Correlation Analysis; Downloader; Sentiment; Price-Movers; Anomalies]
What Are the Payoffs?
• Find the “unknown unknowns” in dynamic data sets
• Quickly identify key influencers and trends in online networks
• Provide early warning of viral videos, anomalous web events, or unusual network traffic
• Enable enhanced business intelligence without having to specify normal vs. abnormal behavior in advance
Third-Party Software Acknowledgements
Paragon Science gratefully acknowledges the following researchers and software providers:
• Cytoscape (http://www.cytoscape.org/)
• dynnetwork Cytoscape plugin (https://code.google.com/p/dynnetwork/)
• Lanet-vi (http://sourceforge.net/projects/lanet-vi/)
  ◦ J. Alvarez-Hamelin, et al., “Understanding Edge Connectivity in the Internet through Core Decomposition,” Internet Mathematics 7 (1): 45–66, 2011.
• Louvain community detection software (http://perso.crans.org/aynaud/communities/)
  ◦ V. Blondel, et al., “Fast Unfolding of Communities in Large Networks,” Journal of Statistical Mechanics: Theory and Experiment, 10, P10008, 2008.
• NetworkX (https://networkx.github.io/)
  ◦ A. Hagberg, D. Conway, “Hacking social networks using the Python programming language (Module II - Why do SNA in NetworkX),” Sunbelt 2010: International Network for Social Network Analysis.
Q & A