1 © 2005
“Business Intelligence Mining in Web 2.0:
Data, Text and Web Mining for Finance,
Accounting and Marketing Applications”
Hsinchun Chen, Ph.D.
Director, Artificial Intelligence Lab
Director, NSF COPLINK and Dark Web Research Centers
University of Arizona
Acknowledgements: NSF, LOC, ITIC/KDD, DHS, DOJ
2 © 2005
My Background
• NCTU → SUNY Buffalo → NYU → U Arizona (MIS #4)
• MS, MIS, Design Science, AI, Search Engine, Digital
Library, Medical Informatics, Intelligence & Security
Informatics, Business Intelligence
• AI Lab, 25+ researchers; $25M funding ($1.5M/year),
180 top SCI papers (20+ papers/year); DL (#1), MIS
(#8); Scientific Advisor: NLC, NLM, Academia Sinica;
Chair, ICADL, IEEE ISI
• AE in ten top SCI journals, IEEE and AAAS Fellow
• DL/SE; GeneScene & BioPortal; COPLINK & Dark Web
(NYT, USA Today, Associated Press, etc.); Knowledge
Computing Corporation ($100M)
• Business Intelligence Mining???
3 © 2005
The Peta Age
The End of Theory
4 © 2005
Outline
• Web 2.0 + Data Mining, Text mining, Web mining
• Intelligence and Security Informatics
• Case Studies, Examples, and Lessons Learned:
Business Intelligence Data, Text and Web mining
• Opportunities and Future Directions: Finance,
Accounting, and Marketing Applications
5 © 2005
Web 2.0, Data Mining, Text
Mining, and Web Mining
6 © 2005 6
Web 2.0, by O’Reilly
• http://www.oreilly.com, “What is Web 2.0? Design Patterns and Business Models for the Next Generation of Software,” by Tim O’Reilly, 9/30/2005 (O’Reilly Media Web 2.0 Conference, 2004)
• Examples of Web 2.0: Google AdSense, Flickr, Napster, Wikipedia, blogging, search engine optimization, web services, participation, tagging (folksonomy), syndication, etc.
7 © 2005 7
Web 2.0, by O’Reilly
• Strategic positioning: “The Web as Platform”
• User positioning: “You control your own data”
• Core competencies:
– Services, not packaged software
– Architecture of participation
– Cost-effective scalability
– Remixable data sources and data transformations
– Software above the level of a single device
– Harnessing collective intelligence
8 © 2005 8
Web 2.0 Lessons
• The value of the software is proportional to the scale and dynamism of the data it helps to manage.
• Leverage customer-self service and algorithmic data management to reach out to the entire web, to the edges and not just the center, to the long tail and not just the head.
• The service automatically gets better the more people use it.
• Blogging and the wisdom of the crowds.
• Network effects from user participation are the key to market dominance in the Web 2.0 era.
• We, the media.
• Data is the next Intel inside.
9 © 2005 9
Web 2.0 Lessons (cont’d)
• Operations must become a core competency.
• The perpetual beta.
• Support lightweight programming models that allow for loosely coupled systems. (SOAP, REST, AJAX, etc.)
• Think syndication, not coordination.
• Innovation in assembly. The Mashups.
• Design for “hackability” and remixability.
• Some rights reserved.
10 © 2005 10
Web 2.0, Wikipedia
• “Web 2.0 is a trend in the use of the WWW technology and web design that aims to facilitate creativity, information sharing, and collaboration among users.”
• “Web 2.0 is the business revolution in the computer industry caused by the move to the Internet as platform, and an attempt to understand the rules for success on that new platform.”
11 © 2005 11
Web 2.0 Characteristics
• Rich user experience
• User participation
• Dynamic content
• Metadata
• Web standards and scalability
• Openness
• Freedom
• Collective intelligence
12 © 2005 12
Web 2.0 Features/Technologies
• Technological infrastructure: server software, content syndication, messaging protocols, browsers with plug-ins and extensions, various client applications.
• Cascading Style Sheets (CSS) to separate presentation from content
• Folksonomy (collective tagging)
• Microformats extending pages with semantics
• REST, XML, JSON based APIs
• Rich Internet application techniques based on AJAX
• RSS or Atom feeds for syndication and notification of data
• Mashups of content from different sources
• Weblog publishing, and wikis
13 © 2005 13
Web 2.0 Criticism
• “Web 2.0 as a piece of jargon,” by Tim Berners-Lee
• “A second bubble”
• “Bubble 2.0”
• “A mere augmentation of current cultural information exchanges that are bound by existing political and societal structures.”
14 © 2005
Web Programming with Amazon,
Google, and eBay APIs
15 © 2005
What are Web Services?
• Web Services:
– A new way to reuse and integrate third-party software or
legacy systems
– No matter where the software is, what platform it
resides on, or which language it was written in
– Based on XML and Internet protocols (HTTP,
SMTP…)
• Benefits:
– Ease of integration
– Develop applications faster
16 © 2005
Web Services Architecture
• Simple Object Access Protocol (SOAP)
• Web Service Description Language (WSDL)
• Universal Description, Discovery and Integration
(UDDI)
17 © 2005
New Breeds of Web Services
• Representational State Transfer (REST) – Use HTTP Get method to invoke remote services (not XML)
– The response of remote service can be in XML or any textual format
– Benefits: • Easy to develop
• Easy to debug (with standard browser)
• Leverage existing web application infrastructure
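A minimal sketch of the REST style described above: the service is invoked with a plain HTTP GET URL and the reply is ordinary text. The endpoint `api.example.com`, its parameters, and the canned JSON reply are hypothetical and serve only as illustration; no real provider's API is implied.

```python
import json
from urllib.parse import urlencode


def build_rest_url(base_url, params):
    """Return the GET URL that would invoke a REST-style service."""
    return f"{base_url}?{urlencode(params)}"


def parse_json_response(body):
    """Decode a textual (here JSON) response body into Python objects."""
    return json.loads(body)


# Hypothetical endpoint and parameters, for illustration only.
url = build_rest_url("http://api.example.com/search",
                     {"q": "web mining", "page": 1})

# A canned reply stands in for the text a real service would return.
sample_body = '{"query": "web mining", "results": [{"title": "Web Mining", "rank": 1}]}'
response = parse_json_response(sample_body)

print(url)
print(response["results"][0]["title"])
```

Because the request is just a URL, it can be pasted into a standard browser for debugging, which is the benefit the slide points to.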
18 © 2005
Server Responses in REST
• Really Simple Syndication (RSS, Atom)
– XML-based standard
– Designed for news-oriented websites to “Push” content to
readers
– Excellent to monitor new content from websites
• JavaScript Object Notation (JSON)
– Lightweight data-interchange format
– Human readable and writable and also machine friendly
– Wide support from most languages (Java, C, C#, PHP,
Ruby, Python…)
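As a rough illustration of monitoring new content through a feed, the sketch below parses a small RSS 2.0 document and returns only the items not seen before. The feed text, the `example.com` links, and the helper name `new_items` are made up for the example; a real feed would be fetched over HTTP first.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 document used in place of a downloaded feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>First story</title><link>http://example.com/1</link></item>
    <item><title>Second story</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""


def new_items(feed_xml, seen_links):
    """Return (title, link) pairs for feed items whose link is not yet seen."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link not in seen_links:
            items.append((item.findtext("title"), link))
    return items


print(new_items(SAMPLE_FEED, {"http://example.com/1"}))
```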
19 © 2005
Rich Interactivity Web - AJAX
• AJAX: Asynchronous JavaScript + XML
• AJAX incorporates: – standards-based presentation using XHTML and CSS;
– dynamic display and interaction using the Document Object Model;
– data interchange and manipulation using XML and XSLT;
– asynchronous data retrieval using XMLHttpRequest;
– and JavaScript binding everything together.
• Examples: – http://www.gmail.com
– http://www.kiko.com
• More info: http://www.adaptivepath.com/publications/essays/archives/000385.php
20 © 2005
AJAX Application Model
21 © 2005
Amazon Web Services (AWS)
• Amazon E-Commerce Service – Search catalog, retrieve product information, images and customer reviews
– Retrieve wish list, wedding registry…
– Search seller and offer
• Alexa Services – Retrieve information such as site rank, traffic rank, thumbnail, and related sites,
among others, given a target URL
• Amazon Historical Pricing – Programmatic access to over three years of actual sales data
• Amazon Simple Queue and Storage Service – A distributed resource manager to store web services results
• Amazon Elastic Compute Cloud – Sell computing capacity by the amount you use
22 © 2005
Google Web APIs
• Google has a long list of APIs
– http://code.google.com/apis/
• Google Search
– AJAX Search API
– SOAP Search API (deprecated)
– Custom search engine with Google Co-op
• Google Map API
• Google Data API (GData)
– Blogger, Google Base, Calendar, Gmail, Spreadsheets, and a lot more
• Google Talk XMPP for communication and IM
• Google Translation (http://www.oreillynet.com/pub/h/4807)
• Many more undocumented/unlisted APIs to be discovered in
Google Blog
23 © 2005
eBay API
• Buyers: – Get the current list of eBay categories
– View information about items listed on eBay
– Display eBay listings on other sites
– Leave feedback about other users at the conclusion of a commerce transaction
• Sellers: – Submit items for listing on eBay
– Get high bidder information for items you are selling
– Retrieve lists of items a particular user is currently selling through eBay
– Retrieve lists of items a particular user has bid on
24 © 2005
Other Services/APIs Providers
• Yahoo! http://developer.yahoo.com/
– Search (web, news, video, audio, image…)
– Flickr, del.icio.us, MyWeb, Answers API
• Windows Live http://msdn2.microsoft.com/en-us/live/default.aspx
– Search (SOAP, REST)
– Spaces (blog), Virtual Earth, Live ID
• Wikipedia
– Downloadable database http://en.wikipedia.org/wiki/Wikipedia:Technical_FAQ#Is_it_possible_to_download_the_contents_of_Wikipedia.3F
• Many more at Programmableweb.com
– http://www.programmableweb.com/apis
25 © 2005
Services by Category
• Search
– Google, MSN, Yahoo
• E-Commerce
– Amazon, Ebay, Google Checkout
– TechBargain, DealSea, FatWallet
• Mapping
– Google, Yahoo!, Microsoft
• Community
– Blogger, MySpace, MyWeb
– del.icio.us, StumbleUpon
• Photo/ Video
– YouTube, Google Video, Flickr
• Identity/ Authentication
– Microsoft, Google, Yahoo
• News
– Various news feed websites including Reuters, Yahoo! and many more.
26 © 2005
Mashup:
A Novel Form of Web Reuse
• “A mashup is a website or application that combines
content from more than one source into an integrated
experience.” – Wikipedia
• API X + API Y = mashup Z
• Business model: Advertisement
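The "API X + API Y = mashup Z" idea can be sketched as a join of two data sources on a shared key. The listings, the coordinates, and the function name `mashup` below are invented stand-ins for what two real web APIs might return.

```python
def mashup(listings, geocodes):
    """Attach coordinates (source Y) to listings (source X), joined on city."""
    combined = []
    for listing in listings:
        coords = geocodes.get(listing["city"])
        if coords is not None:
            combined.append({**listing, "lat": coords[0], "lon": coords[1]})
    return combined


# Hypothetical output of "API X": items for sale.
listings = [
    {"item": "bike", "city": "Tucson"},
    {"item": "desk", "city": "Phoenix"},
    {"item": "lamp", "city": "Unknown"},
]
# Hypothetical output of "API Y": a geocoding lookup.
geocodes = {"Tucson": (32.22, -110.97), "Phoenix": (33.45, -112.07)}

for row in mashup(listings, geocodes):
    print(row)
```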
27 © 2005
Web Mining: Machine Learning for
Web Applications
Hsinchun Chen and Michael Chau
ARIST, 38, 2004
28 © 2005
What is Web Mining?
• The term Web Mining was coined by Etzioni (1996) to denote
the use of Data Mining techniques to automatically discover
Web documents and services, extract information from Web
resources, and uncover general patterns on the Web.
• In this article, we have adopted a broad definition that considers
Web mining to be “the discovery and analysis of useful
information from the World Wide Web” (Cooley et al., 1997).
• Also, web mining research overlaps substantially with other
areas, including data mining, text mining, information retrieval,
and web retrieval.
29 © 2005
30 © 2005
Machine Learning Paradigms
• Machine learning algorithms can be classified as
– Supervised learning: Training examples contain input/output pair patterns. Learn how to predict the output values of new examples.
– Unsupervised learning: Training examples contain only the input patterns and no explicit target output. The learning algorithm needs to generalize from the input patterns to discover the output values.
• We have identified the following five major Machine Learning paradigms:
– Probabilistic models
– Symbolic learning and rule induction
– Neural networks
– Analytic learning and fuzzy logic
– Evolution-based models
• Hybrid approaches
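A toy contrast between the two learning settings, using one-dimensional data. The supervised part is a nearest-centroid classifier trained on labeled examples; the unsupervised part is a two-cluster k-means that sees only the inputs. Both techniques and all data values are chosen here purely for illustration and are not taken from the talk.

```python
def train_nearest_centroid(points, labels):
    """Supervised: learn one mean value per label from input/output pairs."""
    sums, counts = {}, {}
    for x, label in zip(points, labels):
        sums[label] = sums.get(label, 0.0) + x
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


def two_means(points, iterations=10):
    """Unsupervised: split unlabeled points into two clusters (1-D k-means).

    Assumes the points contain at least two distinct values.
    """
    lo, hi = min(points), max(points)
    for _ in range(iterations):
        left = [x for x in points if abs(x - lo) <= abs(x - hi)]
        right = [x for x in points if abs(x - lo) > abs(x - hi)]
        lo, hi = sum(left) / len(left), sum(right) / len(right)
    return lo, hi


points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
labels = ["low", "low", "low", "high", "high", "high"]

centroids = train_nearest_centroid(points, labels)
print(predict(centroids, 4.0))   # uses the labels seen in training
print(two_means(points))         # finds two groups without any labels
```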
31 © 2005
Machine Learning for Information
Retrieval: Pre-Web
• Learning techniques had been applied in Information Retrieval
(IR) applications long before the recent advances of the Web.
• In this section, we will briefly survey some of the research in this
area, covering the use of Machine Learning in
– Information extraction
– Relevance feedback
– Information filtering
– Text classification and text clustering
32 © 2005
Web Mining
• Web Mining research can be classified into three categories:
– Web content mining refers to the discovery of useful information from Web contents, including text, images, audio, video, etc.
– Web structure mining studies the model underlying the link structures of the Web.
It has been used for search engine result ranking and other Web applications (e.g., Brin & Page, 1998; Kleinberg, 1998).
– Web usage mining focuses on using data mining techniques to analyze search logs to find interesting patterns.
One of the main applications of Web usage mining is its use to learn user profiles (e.g., Armstrong et al., 1995; Wasfi et al., 1999).
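To make the link-structure idea concrete, here is a simplified PageRank-style power iteration on a four-page graph. It is a sketch in the spirit of the link-based ranking cited above, not a reproduction of any production algorithm; the graph, the damping value 0.85, and the iteration count are assumptions for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank pages by link structure; links maps a page to the pages it cites."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Pass this page's rank on in equal shares to the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical link graph: A, B and D all link to C; C links back to A.
graph = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```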
33 © 2005
Intelligence and Security
Informatics:
COPLINK and Dark Web
34 © 2005
• Intelligence and Security Informatics (ISI): “Development of advanced information technologies, systems, algorithms, and databases for national security related applications, through an integrated technological, organizational, and policy-based approach” (Chen et al., 2003a)
• Data, text, and web mining
• From COPLINK to Dark Web
H. Chen, computer scientist, artificial intelligence, U. of Arizona (2006)
35 © 2005 35
A knowledge discovery research
framework for ISI
36 © 2005 36
ISI Research: KDD Techniques
• Information Sharing and Collaboration
• Crime Association Mining
• Crime Classification and Clustering
• Intelligence Text Mining
• Crime Spatial and Temporal Mining
• Criminal Network Analysis
37 © 2005
COPLINK
• 1996-, DOJ, NIJ, NSF, ITIC, DHS
• Connect
• Detect
• Agent
• STV (Spatio-Temporal Visualization)
• CAN (Criminal Activity Network)
• BorderSafe (Mutual Information)
• AI Lab → Knowledge Computing Corporation
• Tucson, Phoenix AZ → 1600 agencies, 20 states
38 © 2005
• Newsweek Magazine, March 3, 2003
• ABC News, April 15, 2003
• The New York Times, November 2, 2002
39 © 2005
Dark Web
• 2002-, ITIC, NSF, LOC
• Discussions: FBI, DOD/Dept of Army, NSA, DHS
• Connection:
– Web site spidering
– Forum spidering
– Video spidering
• Analysis and Visualization:
– Link and content analysis (web sites)
– Web metrics analysis (web sites sophistication)
– Authorship analysis (forums; CyberGate)
– Sentiment analysis (forums; CyberGate)
– Video coding and analysis (videos; MCT)
40 © 2005
The Dark Web project in the Press
Project Seeks to Track Terror Web
Posts, 11/11/2007
Researchers say tool could trace online posts
to terrorists, 11/11/2007
Mathematicians Work to Help Track Terrorist
Activity, 9/14/2007
Team from the University of Arizona
identifies and tracks terrorists on
the Web, 9/10/2007
41 © 2005 41
COPLINK Connect
Consolidating & Sharing Information promotes problem
solving and collaboration
Records
Management
Systems (RMS)
Mugshots
Database
Gang Database
42 © 2005 42
COPLINK Detect
Consolidated information enables targeted problem solving via powerful
investigative criminal association analysis
43 © 2005 43
COPLINK Detect 2.0/2.5
44 © 2005 44
Association Retrieval and Visualization
45 © 2005 45
Spatio-temporal Analysis and Visualization
46 © 2005 46
Border Crossing
• An aerial photograph of a
typical U.S. port of entry
(southern border).
• Vehicle lanes are backed up
with dozens of vehicles
during peak times.
• Criminal vehicles operate in
groups.
– If one is caught others turn
back into Mexico.
• They may join the lines one
at a time or use turn-out
points.
Vehicle lanes
Turn-out points
Turn-out points
Port of Entry
(Check points)
© 2006 Google – Imagery © 2006 DigitalGlobe, Map data ©2006 NAVTEQTM
47 © 2005 47
A Vehicle to Watch (via SNA)?
Violent crimes Narcotics crimes Violent & Narcotics
Shape Indicates Object Type
circles are people
rectangles are vehicles
Color Denotes Activity History
Larger Size Indicates higher
levels of activity
Border Crossing Plates are
outlined in Red
Gang related
48 © 2005
Dark Web Collection
Where/how to find them?
49 © 2005
Web Site Example: Links to Multimedia and Manuals
• Source: http://www.al-ghazawat.110mb.com/, French and Arabic Web Site
• Link to “The General of Islam” Radio Station
• Linked items: Azzam speeches; Berg beheading; videos of Zarqawi; others; complete 65-page manual of a 50 caliber rifle in pdf
50 © 2005
Web Site Example: Links to Web Sites and Forums
• Links to Several Iraqi
Jihadist Web Sites and
Forums
• Source:
http://almaaber.jeeran.com/,
Arabic Web Site
51 © 2005
Web 2.0 Example: Blog
On a personal blog, http://salafnews.wordpress.com/, the blogger provides links to
many Islamic Jihadi video clips he posted on YouTube.
Using selected Arabic lexicons, we also find quite a few terrorism-related videos on YouTube.
The blogger keeps posting new videos on YouTube
even if his previous videos were removed by YouTube.
52 © 2005
Web 2.0 Example: Second Life
• Recently, public media such as The Economist and The Australian reported
that Jihadists have set up ‘residents’ in Second Life, a famous
online 3D game.
• We do find several extremist groups in the game.
Group: Terrorist of SL
“AS SL's premier terrorist roleplay group, our assumed identity, is that of
the Al-Quaeda style terrorist group, fighting a just and holy crusade against
the government of a distant tyrannical, imperialistic and overbearing superpower…”
53 © 2005
System Design
54 © 2005
Middle East Terrorist Web Collection File Type Breakdown
• Dynamic files (e.g., PHP, ASP, JSP, etc.) are widely used in terrorist Web sites, indicating a high level of technical sophistication.
• Multimedia is also heavily used in terrorist Web sites.
Terrorist Collection        # of Files    Volume (Bytes)
Total                          222,687    12,362,050,865
Indexable Files                179,223     4,854,971,043
  HTML Files                    44,334     1,137,725,685
  Word Files                       278        16,371,586
  PDF Files                      3,145       542,061,545
  Dynamic Files                130,972     3,106,537,495
  Text Files                       390        45,982,886
  Powerpoint Files                   6         6,087,168
  XML Files                         98           204,678
Multimedia Files                35,164     5,915,442,276
  Image Files                   31,691       525,986,847
  Audio Files                    2,554     3,750,390,404
  Video Files                      919     1,230,046,468
Archive Files                    1,281       483,138,149
Non-Standard Files               7,019     1,108,499,397
[Pie chart] Number of Files Distribution (Arabic): Indexable Files 80%, Multimedia Files 16%, Archive Files 0%, Non-Standard Files 4%
[Pie chart] Volume Distribution (Arabic): Indexable Files 39%, Multimedia Files 48%, Archive Files 4%, Non-Standard Files 9%
55 © 2005
Dark Web Forums Identification
Forum Identification -- Overall Distribution by ISP Providers (# of Forums)
Provider         Middle-Eastern   Latin-American   US Domestic
Websites                     48                4            18
Yahoo! Groups                20               11            31
Google Groups                 0               32            47
MSN                           0                5             9
AOL                           0                0             5
Local ISP                     0                8             0
56 © 2005
Dark Web Analysis and
Visualization
57 © 2005
System Design: CyberGate System Design
58 © 2005
Social Network and Content Analysis
Who links to whom and who
influences whom?
How are the sites used?
Which sites are more sophisticated?
59 © 2005
MDS Visualization of Arab Group Web Sites
[Figure] Clusters shown: Hizb-Ut-Tahrir; Jihad Supporters; Palestinian supporters; Hizballah Cluster; Palestinian terrorists
60 © 2005
Comparison - Content Analysis
[Bar chart] U.S. Domestic Terrorist Web sites: Normalized Content Levels (0 to 0.9) for Black Separatists, Christian Identity, Militia, Neo-confederates, Neo-Nazis/White Supremacists, Eco-Terrorism
[Bar chart] Middle Eastern Terrorist Web sites: Normalized Content Levels (0 to 1) for Hizb-ut-Tahrir, Hizbollah, Al-Qaeda Linked Websites, Jihad Sympathizers, Palestinian terrorist groups
Content categories in both charts: Communications, Fundraising, Ideology (Sharing Ideology), Propaganda (insiders), Propaganda (outsiders), Virtual Community, Command and Control, Recruitment and Training
61 © 2005
Sentiment Analysis
Which forums and who are more
violent and radical?
62 © 2005
Model Building – Training Data Annotation
Coding columns: Sentiment, Racism, Hate, Anger, Violence; followed by the English translation and the Arabic text
     Sentiment                                         Racism   Hate   Anger   Violence
1.   0.4 (positive to God)                             1.0      0.6    0.2     0.3
     English: In the name of God the most merciful, leading the faithful to victory and defeating the unbelievers and polytheists
     Arabic: بسم الله الرحمن الرحيم الحمد لله ناصر المؤمنين وهازم الكفرة والمشركين
2.   -0.5 (negative to America and its collaborators)  0.6      0.8    0.4     0
     English: We say to America and its collaborators: live in horror
     Arabic: نقول لأمريكا وعملائها عيشوا على الرعب
3.   -0.4 (negative to the enemies)                    0        0      0.4     1.0
     English: Oh God, destroy your enemies and the enemies of Muslims
     Arabic: اللهم دمر أعداءك وأعداء المسلمين
4.   0.4 (positive to Jihad)                           0        0      0       0.2
     English: Jihad is fighting God’s enemies
     Arabic: الجهاد هو قتال أعداء الله
63 © 2005
7. Results: Intensity Scores
U.S. and Middle Eastern Intensity Scores
U.S.                                  Middle Eastern
Forum            Racism   Violence    Forum            Racism   Violence
Angelic Adolf     5.513      0.962    Azzamy           30.182     19.833
Aryan Nation      9.921      5.683    Friends           2.076      6.238
CCNU              3.712     14.546    Islamic Union     2.657      9.198
Neo-Nazi          5.458      5.614    Kataeb            2.610      6.605
NSM              10.740     10.740    Kataeb Qassam    25.203     18.670
Smash Nazi       12.424     10.591    Taybah           14.989     15.348
White Knights    19.313      6.353    Osama Lover      14.369     14.584
World Knights     2.468      2.234    Wa Islamah        4.075      9.193
All Forums       10.988      6.902    All Forums       11.892     12.644
64 © 2005
7. Results: Intensity Relationship
Affect Regression Analysis: Message Level
[Scatter plots] U.S. Forum Scores and Middle Eastern Forum Scores: Violence Scores (y-axis, 0 to 400) against Hate Scores (x-axis, 0 to 400)
                U.S.      Middle Eastern
N               4676      3349
beta (slope)    0.079     0.682
t-Stat          21.354    48.265
P-Value         0.000     0.000
R-Square        0.076     0.486
Strong hate and violence correlation, especially for the
Middle-Eastern group.
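The quantities reported in the regression table (slope, t-statistic, R-square) can be computed for a simple one-predictor least-squares fit as in the sketch below. The five data points are invented for illustration and are not the forum scores; the function name `simple_regression` is likewise hypothetical, and the code assumes the fit is not perfect (nonzero residuals).

```python
import math


def simple_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x, with t and R-square."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Sum of squared residuals around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_square = 1.0 - sse / syy
    # Standard error of the slope, with n - 2 degrees of freedom.
    std_err = math.sqrt(sse / (n - 2) / sxx)
    return {"n": n, "slope": slope, "t_stat": slope / std_err, "r_square": r_square}


# Made-up per-message scores: hate on the x-axis, violence on the y-axis.
hate = [0, 1, 2, 3, 4]
violence = [1, 1, 2, 2, 4]
fit = simple_regression(hate, violence)
print(fit)
```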
65 © 2005
7. Results: Intensity Relationship
Affect Regression Analysis: Forum Level
                U.S.      Middle Eastern
N               8         8
beta (slope)    0.347     0.471
t-Stat          1.760     10.306
P-Value         0.139     0.000
R-Square        0.383     0.947
[Scatter plots] U.S. and Middle Eastern: Racism (y-axis, 0 to 35) against Violence (x-axis, 0 to 20)
66 © 2005
Number of Posts By Month
• Al-Firdaws has consistently had between 2,500 and 3,000 posts per month since the second half of 2006.
• Montada was very active in 2002 and 2005.
[Chart] Al-Firdaws Posts By Month: # posts (0 to 3,500), Jan-05 to Jul-07
[Chart] Montada Posts By Month: # posts (0 to 25,000), Sep-00 to May-07
67 © 2005
Affect Intensities – Temporal View
[Charts] Four panels: Al-Firdaws - Anger; Montada - Anger; Al-Firdaws - Violence; Montada - Violence
Al-Firdaws has considerably higher violence and also greater anger intensity.
68 © 2005
Authorship/Writeprint Analysis
Who are the opinion leaders and
where are they?
69 © 2005
Arabic Feature Set
Feature Set (418)
• Lexical (79)
– Char-Based (48): Char-Level (4), Letter Frequency (35), Special Char. (9)
– Word-Based (31): Word-Level (6), Vocab. Richness (8), Word Length Dist. (15), Elongation (2)
• Syntactic (262)
– Punctuation (12)
– Function Words (200)
– Word Roots (50)
• Structural (62)
– Word Structure (48) and Technical Structure (14), covering Message Level, Paragraph Level, Contact Information, Font Color, Font Size, Embedded Images, Hyperlinks (per-feature counts on the slide: 5, 6, 3, 29, 8, 4, 7)
• Content Specific (15)
– Race/Nationality (11)
– Violence (4)
70 © 2005
Sliding Window Algorithm Illustration
1. Compute eigenvectors for 2 principal components of feature group
   Eigenvectors:
      x         y
    0.533     0.956
   -0.541     0.445
    0.034     0.089
    0.653     0.456
    0.975    -0.085
    0.143    -0.381
2. Extract feature usage vectors from the message text
   Feature usage vector Z, e.g. 1,0,0,2,1,2 and 0,1,3,0,1,0
3. Transform into 2-dimensional space
   x = Zx
   y = Zy
Repeat steps 2 and 3
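A small sketch of the sliding-window steps illustrated above, reusing the numbers shown on the slide. It assumes that the six pairs of values are the rows of the eigenvector matrix (first column for x, second for y), that each feature usage vector Z lines up with those rows, and that a window is a fixed number of consecutive tokens; the helper names `window_vectors` and `project` are hypothetical.

```python
def window_vectors(tokens, features, size, step=1):
    """Step 2: count each feature's usage inside every sliding window."""
    vectors = []
    for start in range(0, max(len(tokens) - size, 0) + 1, step):
        window = tokens[start:start + size]
        vectors.append([window.count(f) for f in features])
    return vectors


def project(feature_vector, eigenvectors):
    """Step 3: map a feature usage vector Z to a point (x, y)."""
    x = sum(z * e[0] for z, e in zip(feature_vector, eigenvectors))
    y = sum(z * e[1] for z, e in zip(feature_vector, eigenvectors))
    return x, y


# Eigenvector values copied from the slide (assumed layout: one row per feature).
EIGENVECTORS = [(0.533, 0.956), (-0.541, 0.445), (0.034, 0.089),
                (0.653, 0.456), (0.975, -0.085), (0.143, -0.381)]

# The two feature usage vectors shown on the slide.
windows = [[1, 0, 0, 2, 1, 2], [0, 1, 3, 0, 1, 0]]
writeprint = [project(w, EIGENVECTORS) for w in windows]
print(writeprint)

# Toy window extraction over made-up tokens and features.
print(window_vectors(["a", "b", "a", "c"], ["a", "b"], size=2))
```

Plotting the successive (x, y) points for one author's messages gives the kind of two-dimensional pattern the deck calls a writeprint.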
71 © 2005
Anonymous Messages Author Writeprints
Author A: 10 messages
Author B: 10 messages
72 © 2005
ClearGuidance.com
• Toronto plot forum Member Interaction Network – Blue nodes indicate members with the greatest number of in-links.
– These members are the core set of forum “experts” and propagandists
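Identifying the members with the most in-links amounts to counting how often each member is the target of an interaction. A minimal sketch follows, with made-up member names and reply pairs; how the original network was actually built from forum data is not stated on the slide.

```python
from collections import Counter


def in_link_counts(interactions):
    """Count, per member, the links that point at them (source -> target)."""
    return Counter(target for _source, target in interactions)


def core_members(interactions, top_n):
    """Return the top_n members ranked by number of in-links."""
    return [member for member, _ in in_link_counts(interactions).most_common(top_n)]


# Hypothetical (replier, replied-to) pairs extracted from forum threads.
interactions = [
    ("u1", "expertA"), ("u2", "expertA"), ("u3", "expertA"),
    ("u1", "expertB"), ("u4", "expertB"),
    ("expertA", "u2"),
]
print(in_link_counts(interactions))
print(core_members(interactions, 2))
```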
73 © 2005
Forum “Experts”
The series of overlapping circular patterns for bag-of-word
features indicates that the author’s discussion revolves around a
related set of topics.
Bag-of-words are predominantly
related to religious topics.
Many large red blots indicative
of the presence of features
unique to this author.
This author attempts to use his
religious “expertise”.
74 © 2005
This author was later arrested as a major culprit in
the Toronto terror plot (“Soldier of God”). He uses
many violent affect terms.
Radar chart showing violent
affect feature usages.
Comparison to mean shows
several high occurrence terms
(e.g., jihad, martyrdom).
Selected feature is use of term
“jihad” which is the highest in
the forum .
Text annotation view showing
key bag-of-words highlighted.
Selected feature (i.e., “jihad”) is
shown in red.
This author constantly attempts
to justify acts of violence and
terrorism.
“…there are so many paid sheikhs
stuck in this life….no point going to
them for fatwas…personally
speaking…cuz they don’t even
agree with jihad in the first place”
75 © 2005
From Cyberspace to Virtual Worlds
Where are they heading?
How do they attract young audiences
(20 and younger)?
76 © 2005
Terrorists of SL
77 © 2005
Terrorism in SL
Group Name in SL                      No. of Members
terrorists of SL                                 228
Elite terrorist combat unit                        9
Sl terrorist (S. L. T)                             5
Second life terrorist association                  5
Terrorists                                         4
The alkida terrorists                              4
Shadows terrorists                                 4
Jihad terrorists                                   3
Elite jihad terrorist group                        2
Automation jihad                                   2
78 © 2005
Terrorism in SL
79 © 2005
Terrorism in SL
80 © 2005
Case Studies, Examples, and
Lessons Learned: Business
Intelligence Data, Text and Web
Mining
81 © 2005
Data Mining for Credit Rating
82 © 2005
Credit Rating Analysis with Support
Vector Machines and Neural
Networks:
A Market Comparative Study
Zan Huang, Hsinchun Chen,
Chia-jung Hsu, Andy W. Chen,
Soushan Wu
Decision Support Systems, 37(4),
2004
83 © 2005
Our Study
• Apply a relatively new machine learning
technique, Support Vector Machines, alongside a
classic technique, Neural Networks
• Interpretation of the model
– Variable contribution analysis
• Cross market analysis
– United States and Taiwan market
84 © 2005
Statistical Methods
• Ordinary Least Squares (OLS) – Fisher 1959, Horrigan 1966, Pogue 1969, West 1970
• Multiple Discriminant Analysis (MDA) – Pinches and Mingo 1973,1975
• Logistic Regression Analysis – Ederington 1985
• Probit Analysis – Gentry 1988, Jackson
• Prediction Accuracy: 50 – 70%
• Frequently used financial variables – measures of size, financial leverage, long-term capital intensiveness,
return on investment, short-term capital intensiveness, earnings stability and debt coverage stability
85 © 2005
Artificial Intelligence Methods (cont.)
Study | Bond rating categories | Method | Accuracy | Data | Sample size | Benchmark statistical methods
Dutta and Shekhar 1988 | 2 (AA vs. non-AA) | BP | 83.30% | US | 30/17 | LinR (64.7%)
Singleton and Surkan 1990 | 2 (Aaa vs. A1, A2 or A3) | BP | 88% | US (Bell companies) | 126 | MDA (39%)
Garavaglia 1991 | 3 | BP | 84.90% | US S&P | 797 | N/A
Kim 1993 | 6 | BP, RBS | 55.17% (BP), 31.03% (RBS) | US S&P | 110/58/60 | LinR (36.21%), MDA (36.20%), LogR (43.10%)
Moody and Utans 1995 | 16 | BP | 36.2%, 63.8% (5 classes), 85.2% (3 classes) | US S&P | N/A | N/A
86 © 2005
Artificial Intelligence Methods (cont.)
Study | Bond rating categories | Method | Accuracy | Data | Sample size | Benchmark statistical methods
Maher and Sen 1997 | 6 | BP | 70% (7), 66.67% (5) | US Moody's | 299 | LogR (61.66%), MDA (58-61%)
Kwon et al. 1997 | 5 | BP (with OPP) | 71-73% (with OPP), 66-67% (without OPP) | Korean | 126 | MDA (58-62%)
Kwon and Lim 1998 | 5 | ACLS, BP | 59.9% (ACLS), 72.5% (BP) | Korean | 126 | MDA (61.6%)
Chaveesuk et al. 1999 | 6 | BP, RBF, LVQ | 56.7% (BP), 38.3% (RBF), 36.7% (LVQ) | US S&P | 60/60 (10 for each category) | LogR (53.3%)
Shin and Han 2001 | 5 | CBR, GA | 75.5% (CBR, GA combined), 62.0% (CBR), 53-54% (ID3) | Korean | 3886 | MDA (58.4-61.6%)
BP: Backpropagation Neural Networks, RBS: Rule-based System, ACLS: Analog Concept Learning System,
RBF: Radial Basis Function, LVQ: Learning Vector Quantization, CBR: Case-based Reasoning, GA: Genetic
Algorithm, MDA: Multiple Discriminant Analysis, LinR: Linear Regression, LogR: Logistic Regression, OPP:
Ordinary Pairwise Partitioning. Sample size: Training/tuning/testing.
87 © 2005
Taiwan Data Set
• Taiwan Ratings Corporation – Established in 1997, partnering with Standard &
Poor’s.
• Securities and Futures Institute – Quarterly financial statements and financial ratios of publicly
traded companies
• Data Preparation – Used the credit rating and the company’s financial
variables from 2 quarters before the rating release date
– 74 data points, 21 financial variables, 25 financial institutions, 1998-2002
88 © 2005
United States Data Set
• A comparable US data set from Standard & Poor’s Compustat
– Comparable financial variables
– S&P senior debt rating for all commercial banks (DNUM 6021)
– 36 commercial banks, 265 data points, 1991-2000.
TW data             US data
twAAA       8       AA        20
twAA       11       A        181
twA        31       BBB       56
twBBB      23       BB         7
twBB        1       B          1
Total      74       Total    265
89 © 2005
Variable Selection
• ANOVA test
– Whether the differences of each financial variable
among different rating classes were significant.
– 5 uninformative variables removed from the data set
• Final data sets
– Taiwan: 14 financial ratios and 2 balance measures
– United States: 12 financial ratios and 2 balance
measures
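The ANOVA screening step can be sketched as a one-way F statistic per financial variable: a large F means the variable's mean differs across rating classes. The sample values below are invented, and converting F to the p-values listed on the next slide would additionally require the F distribution, which this standard-library sketch leaves out.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up values of one financial ratio, grouped by rating class.
ratio_by_class = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
print(anova_f(ratio_by_class))
```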
90 © 2005
Financial Variables
ID    Financial Ratio Name/ Description                                                ANOVA Between-Group P-Value
X1    Total assets                                                                     0
X2    Total liabilities                                                                0
X3    Long-term debts/ total invested capital                                          0.12
X4    Debt ratio                                                                       0
X5    Current ratio                                                                    0.36
X6    Times interest earned (EBIT/interest)                                            0
X7    Operating profit margin                                                          0
X8    (Shareholders’ equity + long-term debt)/ fixed assets                            0
X9    Quick ratio                                                                      0.37
X10   Return on total assets                                                           0.01
X11   Return on equity                                                                 0.04
X12   Operating income/ received capitals                                              0
X13   Net income before tax/ received capitals                                         0
X14   Net profit margin                                                                0
X15   Earnings per share                                                               0
X16   Gross profit margin                                                              0.02
X17   Non-operating income/ sales                                                      0.81
X18   Net income before tax/ sales                                                     0
X19   Cash flow from operating activities/ current liabilities                         0.84
X20   (Cash flow from operating activities / (capital expenditures + increase in inventory + cash dividends)) in last 5 years    0.64
X21   (Cash flow from operating activities – cash dividends)/ (fixed assets + other assets + working capital)                    0.08
91 © 2005
Experiment Results
• 4 Models (Frequently used variables, full set of
variables) – TW I: Rating = f(X1,X2,X3,X4,X6,X7)
– TW II: Rating = f(X1, X2, X3, X4, X6, X7, X8, X10, X11, X12, X13,
X14, X15, X16, X18, X21)
– US I: Rating = f(X1,X2,X3,X6,X7)
– US II: Rating = f(X1, X2, X3, X6, X7, X8, X10, X11, X12, X13,
X14, X15, X16, X21)
92 © 2005
Experiment Results (cont.)
• Results
– SVM did not consistently outperform neural networks (differences ranged from -1.13% to 4.05%).
– The small set of frequently used financial variables contained most of the relevant information.
Experiment Results
          SVM Results   NN Results   Difference
TW I      79.73%        75.68%        4.05%
TW II     77.03%        75.68%        1.35%
US I      78.87%        80.00%       -1.13%
US II     80.00%        79.25%        0.75%
[Bar chart] Experiment Results: SVM Results and NN Results (73.00% to 81.00%) for TW I, TW II, US I, US II
93 © 2005
Measure of Relative Importance
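• First order derivatives of the network parameters
– Neural network model
<y1, y2, …, yn> = f(<x1, x2, …, xm>)
– $\partial y_i / \partial x_j$
• Contribution measure $\mathrm{Con}_{ik}$: relative contribution of input i on output k
Connection strengths between input, hidden and output layers are
denoted as $w_{ji}$ and $v_{jk}$.
– Garson 1991 (without direction):
\[
\mathrm{Con}_{ik} = \frac{\sum_{j=1}^{J} \dfrac{|w_{ji}|\,|v_{jk}|}{\sum_{i'=1}^{I} |w_{ji'}|}}{\sum_{i'=1}^{I} \sum_{j=1}^{J} \dfrac{|w_{ji'}|\,|v_{jk}|}{\sum_{i''=1}^{I} |w_{ji''}|}}
\]
– Yoon 1994 (with direction):
\[
\mathrm{Con}_{ik} = \frac{\sum_{j=1}^{J} w_{ji}\, v_{jk}}{\sum_{i'=1}^{I} \sum_{j=1}^{J} w_{ji'}\, v_{jk}}
\]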
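A sketch of the two contribution measures for a one-hidden-layer network, with w[j][i] the weight from input i to hidden unit j and v[j][k] the weight from hidden unit j to output k. Garson's measure uses absolute weights, so it carries no direction; the signed measure keeps the sign of the weight products. Normalizing the signed measure by the plain sum of the products is an assumption about the slide's formula; a variant that normalizes by absolute values would avoid a zero or negative denominator. The weights are made up.

```python
def garson(w, v, k):
    """Unsigned contribution of each input to output k (shares sum to 1)."""
    n_hidden, n_inputs = len(w), len(w[0])
    shares = []
    for i in range(n_inputs):
        total = 0.0
        for j in range(n_hidden):
            # Input i's share of hidden unit j, scaled by that unit's output weight.
            row_sum = sum(abs(w_ji) for w_ji in w[j])
            total += abs(w[j][i]) * abs(v[j][k]) / row_sum
        shares.append(total)
    norm = sum(shares)
    return [s / norm for s in shares]


def yoon(w, v, k):
    """Signed contribution of each input to output k (assumed normalization)."""
    n_hidden, n_inputs = len(w), len(w[0])
    products = [sum(w[j][i] * v[j][k] for j in range(n_hidden))
                for i in range(n_inputs)]
    norm = sum(products)  # assumes the signed products do not cancel to zero
    return [p / norm for p in products]


# Made-up weights: 2 inputs, 2 hidden units, 1 output.
w = [[1.0, 3.0], [2.0, 2.0]]
v = [[2.0], [-1.0]]
print(garson(w, v, 0))
print(yoon(w, v, 0))
```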
94 © 2005
Variable Contribution Analysis
• Garson’s measure
• Optimal set of variables for the two markets
– TW III: Rating = f(X1, X2, X3, X4, X6, X7, X8)
– US III: Rating = f(X1, X2, X3, X4, X7, X11)
Financial Variable Name/ Description
X1 Total assets
X2 Total liabilities
X3 Long-term debts/ total invested capital
X4 Debt ratio
X6 Times interest earned (EBIT/interest)
X7 Operating profit margin
X8 (Shareholders’ equity + long-term debt)/ fixed assets
X11 Return on equity
95 © 2005
Contribution Analysis Results
[Bar chart] Variable Contribution (United States): Contribution Measure (0 to 0.35) by Financial Variable (X1, X2, X3, X4, X7, X11) for rating classes AA, A, BBB, BB, B
[Bar chart] Variable Contribution (Taiwan): Contribution Measure (0 to 0.3) by Financial Variable (X1, X2, X3, X4, X6, X7, X8) for rating classes twAAA, twAA, twA, twBBB, twBB
Financial Variable Name/ Description
X1 Total assets
X2 Total liabilities
X3 Long-term debts/ total invested capital
X4 Debt ratio
X6 Times interest earned (EBIT/interest)
X7 Operating profit margin
X8 (Shareholders’ equity + long-term debt)/ fixed assets
X11 Return on equity
96 © 2005
Cross Market Analysis
• US Model
– X1, X2, X3, X7 | X4, X11
– Most important: total assets, total liabilities, long-
term debts/total invested capital
• TW Model
– X4, X7, X8 | X1, X2, X3, X6
– Most important: operating profit margin, debt ratio
97 © 2005
Future Directions
• Data mining + text mining
– Add important financial variables from the text
format annual report
• Larger scale cross market analysis
– Mainland China, Taiwan, Hong Kong and
United States markets
• Multidimensional financial data visualization
and exploration
98 © 2005
Stock Prediction Based on
Breaking News
99 © 2005
Textual Analysis of Stock Market Prediction
Using Breaking Financial News
*The Effect of Momentum and Contrarian
Selection Strategies
*The Effect of Industry Classification
Robert P Schumaker and Hsinchun Chen
*JASIST, 59(2), 2008
With special thanks to Zan Huang and Daniel McDonald
100 © 2005
Introduction
• Stock Market Prediction
– Appealing
• Numerous attempts have been made
– Difficult to accurately predict human behavior
– Two Common Philosophies (Technical Analysis, 2005)
• Fundamental Analysis
– Stock Market activity can be predicted from the security’s relative data, statistics, earnings and management
• Technical Analysis
– Stock Market price trends are identified using charts and modeling techniques
– This philosophy is a form of market analysis that studies the supply and demand for securities based on historical trading volume and market price
101 © 2005
Introduction
• The Use of Textual Data in Prediction
– Text Classification Techniques
• Determine stock price direction
• Promising directional results on aggregate indices
– Limitations of Prior Studies
• Discrete Stock Price prediction from textual data
has not been performed
• Comparisons of regression-based machine learning
methods have not been performed
• Most prior studies on textual data limit themselves
to a ‘Bag of Words’ approach
102 © 2005
Literature Review
• Financial News Articles
– Large amounts of news articles exist for
securities
• Required reports, governmental compliance
• Unexpected reports lead to share price changes
– Can be capitalized on by NLP and text-processing
techniques
– Automated techniques can capitalize on information
quicker than human counterparts
» Cuts the lag time between information release and the
effect on stock price
103 © 2005
Literature Review
• Linguistic Techniques
– Bag of Words – all words from a document can be potentially used for machine learning
• Usually strip stop words and perform stemming
• Prior financial news article research used this method (Lavrenko et al. 2000a and Gidofalvi, 2001)
– De facto method of financial article research
– Noun Phrases – only noun phrases from a document are used for machine learning
– Named Entities – Entities such as people, places, and organizations are used for machine learning
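A Bag of Words representation with stop-word removal and stemming can be sketched as follows. This is a minimal illustration, not the system's implementation: the stop-word list is a tiny hypothetical sample, and `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the study).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "as", "but", "for", "it", "its"}


def crude_stem(word):
    """Strip a few common suffixes; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if (word.endswith(suffix) and not word.endswith("ss")
                and len(word) - len(suffix) >= 3):
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Lowercase, tokenize on letters, drop stop words, stem, and count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)


bag = bag_of_words(
    "Schwab shares fell in morning trading; the shares later recouped some of the loss."
)
```

Noun Phrase and Named Entity representations would replace the tokenizer with a phrase or entity extractor and keep the same counting step.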
104 © 2005
Literature Review
• Linguistic Techniques continued…
– Building on Bag of Words
• Noun Phrases – only noun phrases from a
document are used for machine learning
• Still encompasses important article concepts (Tolle
and Chen 2000)
• Handles article scaling better
• Syntactic rules and lexicons are used in
identification
105 © 2005
Literature Review
• Linguistic Techniques continued…
– Building on Noun Phrases
• Named Entities – Entities such as people, places,
and organizations are used for machine learning
• Uses a semantic lexical hierarchy (McDonald et al.
2005)
– Nouns and Noun Phrases are classified as person,
organization, or location (Sekine and Nobata 2003)
– Still encompasses the important article concepts
– Provides a more abstract representation than Bag of
Words or Noun Phrases
106 © 2005
Literature Review
• Textual Stock Market Prediction Taxonomy
Algorithm | Classification | Source Material | Examples
Genetic Algorithm | 2 tier | Undisclosed number of chatroom postings | Thomas & Sycara, 2002
Naïve Bayesian | 3 tier | Over 5,000 articles borrowed from Lavrenko | Gidofalvi et al. 2001
Naïve Bayesian | 5 tier | 38,469 articles | Lavrenko et al. 2000a
Naïve Bayesian | 5 tier | 6,239 articles | Seo et al. 2002
SVM | 3 tier | About 350,000 articles | Fung et al. 2002
SVM | 3 tier | 6,602 articles | Mittermayer, 2004
3 classes – Typically consists of the classes: Up, Down, Unchanged
5 classes – Good, Good uncertain, Neutral, Bad uncertain, Bad
107 © 2005
Literature Review
• Evaluation Methods
– Prior studies:
• Measures of Closeness (Cho et al. 1998)
– How close the predicted price is to the actual price
– Measured using Mean Squared Error (MSE)
• Directional Accuracy (Gidofalvi, 2001)
– Did the predicted stock price follow the same direction of movement as the actual stock price
– Measured using classification bins (Up, Down, Unchanged)
• Simulated Trading (Lavrenko et al. 2000a)
– If we were to invest money in the system, what percentage gain/loss would we expect
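The three evaluation methods can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the function names, the toy prices, and in particular the trading rule (go long or short only when the predicted move is at least 1% of the starting price, then average the realized returns), which is a hypothetical stand-in for any study's actual trading engine.

```python
def mean_squared_error(predicted, actual):
    """Closeness: average squared gap between predicted and actual prices."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def direction(start, end):
    """Classification bin for a price move: Up, Down, or Unchanged."""
    if end > start:
        return "Up"
    if end < start:
        return "Down"
    return "Unchanged"


def directional_accuracy(start, predicted, actual):
    """Share of cases where predicted and actual moves fall in the same bin."""
    hits = sum(direction(s, p) == direction(s, a)
               for s, p, a in zip(start, predicted, actual))
    return hits / len(actual)


def simulated_trading_return(start, predicted, actual, threshold=0.01):
    """Average realized return over trades triggered by the (hypothetical) rule."""
    returns = []
    for s, p, a in zip(start, predicted, actual):
        predicted_move = (p - s) / s
        if abs(predicted_move) < threshold:
            continue  # predicted move too small to trade on
        realized = (a - s) / s
        returns.append(realized if predicted_move > 0 else -realized)
    return sum(returns) / len(returns) if returns else 0.0


# Toy data: price at article release, predicted price, actual later price.
start = [10.0, 20.0, 30.0]
predicted = [10.5, 19.0, 30.1]
actual = [10.2, 20.5, 30.0]

mse = mean_squared_error(predicted, actual)
accuracy = directional_accuracy(start, predicted, actual)
trading_return = simulated_trading_return(start, predicted, actual)
```

The toy data shows why the three measures can disagree: a prediction can be close in price yet wrong in direction, and only trades that clear the threshold affect the simulated return.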
108 © 2005
Literature Review
• Textual Financial Information Taxonomy
Textual Financial Source | Types | Examples | Description
Company Generated Sources | Quarterly & Annual Reports | 8K | SEC-mandated report on significant company changes
Company Generated Sources | Quarterly & Annual Reports | 10K | SEC-mandated Annual reports
Independently Generated Sources | Analyst Created | Recommendations | Buy/Hold/Sell based on expert assessment
Independently Generated Sources | Analyst Created | Stock Alerts | Alerts triggered by barriers such as support/resistance levels
Independently Generated Sources | News Outlets | Financial Times | Provides news stories on company activities
Independently Generated Sources | News Outlets | Wall Street Journal | Provides news stories on company activities
Independently Generated Sources | News Wire Services | PRNewsWire | Provides breaking financial news articles
Independently Generated Sources | News Wire Services | Yahoo Finance | Compilation of 45 independent financial news wire sources
Independently Generated Sources | Financial Discussion Boards | The Motley Fool | A forum for investors to share stock-related information
109 © 2005
Literature Review
• Company Generated Sources
– Quarterly & Annual Reports (Kloptchenko et al. 2004)
• Provides a linguistic structure to indicate how the
company may perform in the future
• Textual information may contain important
information not shown in the financial ratios
• Independently Generated Sources
– Analyst Created
• Neutral professional recommendations on
performance
110 © 2005
Literature Review
• Independently Generated Sources continued…
– News Outlets
• Centers that publish available financial information at specific
intervals
– Bloomberg, Dow Jones, Financial Times, Reuters, Wall Street Journal
(Cho, 1999)
– CNN Financial News, Business Wire, Forbes (Seo et al. 2002)
– News Wire Services
• Centers that publish available financial information as soon as it is
publicly released or discovered
– Financial Discussion Boards
• Financial Nuggets may be contained in Web Bulletin Boards (Thomas
& Sycara, 2002)
• Susceptible to Noise
111 © 2005
Literature Review
• News Wire Services
– Several sources release news articles to the market
• Comtex – real-time but subscription-based
• PRNewsWire – free real-time financial news service
– Has free XML/RSS feeds
– Has a free breaking news component
– One of the avenues through which Market Makers receive their news
• Yahoo Finance – free real-time financial news service from a
compilation of sources (45 total)
– Associated Press
– Financial Times
– PRNewsWire
112 © 2005
Literature Review
• Intraday Stock Quote Gathering
– Most financial services provide end of day
quotes or intraday charts
– Historical intraday quotes can be gathered in
increments of 1, 5, 15 or 60 min
• One minute increments provide the most
information and are of sufficient granularity
for data analysis
113 © 2005
Research Questions
• How effective is the prediction of discrete
stock price values using textual financial
news articles?
• Which combination of textual analysis
techniques is the most valuable in stock
price prediction?
114 © 2005
System Design
[Diagram components: News Articles and Stock Quotes as inputs; Textual Analysis (Bag of Words, Noun Phrases, Named Entities); Stock Quotation; DB; Model Building (Regression Analysis; Machine Learning Algorithm (MLA): SVR); evaluation by Closeness, Directional Accuracy, and Simulated Trading; Error Analysis]
115 © 2005
System Design
• System Training:
• For each news article:
– Determine the stock price trend for the prior 60 minutes [-60, 0]
» Use linear regression to obtain trend slope
– Determine actual stock price 20 minutes after article release
[Figure: Stock Price (dollars) vs. Time (minutes), marking -60, 0 (News Article release), and +20]
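The training step can be sketched with an ordinary least-squares line fitted to the 60 minutes of quotes before the article. This is a minimal illustration under assumptions: the quotes are a hypothetical dictionary keyed by minute, `training_example` is an invented helper name, and extrapolating the fitted line to +20 minutes is one plausible reading of a "regressed stock price estimate".

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def training_example(quotes, release_minute):
    """Build the price-based values for one article.

    quotes: dict mapping minute-of-day to price (hypothetical data layout).
    """
    # Quotes in the window [-60, 0] relative to the article release.
    window = sorted((m - release_minute, p) for m, p in quotes.items()
                    if -60 <= m - release_minute <= 0)
    xs = [m for m, _ in window]
    ys = [p for _, p in window]
    slope, intercept = fit_line(xs, ys)
    return {
        "trend_slope": slope,
        "price_at_release": quotes[release_minute],
        "regressed_estimate_plus20": intercept + slope * 20,
        "actual_plus20": quotes[release_minute + 20],
    }


# Toy quotes: price rises one cent per minute; article released at minute 60.
quotes = {m: 15.0 + 0.01 * m for m in range(0, 81)}
example = training_example(quotes, release_minute=60)
```

On this perfectly linear toy series the extrapolated estimate equals the actual +20 minute price; with real quotes the gap between the two is what the text features are meant to help explain.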
116 © 2005
System Design
• Model Parameters
– Several parameters can be tested and
included in our models
• Most based on prior research
– Model Building
• M1: uses only extracted article terms to predict price
• M2: uses terms and stock price at article release
• M3: uses terms and regressed stock price estimate
117 © 2005
System Design
• News Article through the three Representations
Schwab shares fell as much as 5.3 percent in morning trading on the New York Stock Exchange but later recouped some of the loss. San Francisco-based
Schwab expects fourth-quarter profit of about 14 cents per share two cents below what it reported for the third quarter citing the impact of fee waivers a
new national advertising campaign and severance charges. Analysts polled by Reuters Estimates on average had forecast profit of 16 cents per share for
the fourth quarter. In September Schwab said it would drop account service fees and order handling charges its seventh price cut since May 2004. Chris
Dodds the company s chief financial officer in a statement said the fee waivers and ad campaign will reduce fourth-quarter pre-tax profit by $40 million
while severance charges at Schwab s U.S. Trust unit for wealthy clients will cut profit by $10 million. The NYSE fined Schwab for not adequately
protecting clients from investment advisers who misappropriated assets using such methods as the forging of checks and authorization letters. The
improper activity took place from 1998 through the first quarter of 2003 the NYSE said. This case is a stern reminder that firms must have adequate
procedures to supervise and control transfers of assets from customer accounts said Susan Merrill the Big Board s enforcement chief. It goes to the heart
of customers expectations that their money is safe. Schwab also agreed to hire an outside consultant to review policies and procedures for the
disbursement of customer assets and detection of possible misappropriations the NYSE said. Company spokeswoman Alison Wertheim said neither
Schwab nor its employees were involved in the wrongdoing which she said was largely the fault of one party. She said Schwab has implemented a state-
of-the-art surveillance system and improved its controls to monitor independent investment advisers. According to the NYSE Schwab serves about 5 000
independent advisers who handle about 1.3 million accounts. Separately Schwab said October client daily average trades a closely watched indicator of
customer activity rose 10 percent from September to 258 900 though total client assets fell 1 percent to $1.152 trillion. Schwab shares fell 36 cents to
$15.64 in morning trading on the Big Board after earlier falling to $15.16. (Additional reporting by Dan Burns and Karey Wutkowski)
Bag of Words Noun Phrases Named Entities
fourth Reuters Reuters
fined NYSE fourth quarter
Schwab fourth quarter Schwab
profit profit
fell Schwab
NYSE
quarter
118 © 2005
Experimental Findings
• How effective is the prediction of discrete stock price values using textual financial news articles?
– Closeness measures for the different models (MSE)
• Model M2 (using article terms and the stock price at article release) had consistently lower MSE scores than linear regression (Regress) counterparts for each textual representation (p-values < 0.05)
• Named Entities had consistently lower MSE scores for each model compared against the other textual representations (p-values < 0.05)
MSE | Regress | M1 | M2 | M3
Bag of Words | 0.07279 | 930.87 | 0.04422 | 0.12605
Noun Phrases | 0.07279 | 863.50 | 0.04887 | 0.17944
Named Entities | 0.07065 | 741.83 | 0.03407 | 0.07711
Average | 0.07212 | 848.15 | 0.04261 | 0.12893
119 © 2005
Experimental Findings
• Directional Accuracy of the models
• Measures the percentage of the time that the predicted stock price matches the +20min stock price direction
• Model M2 performed better on average (49.9%) than the other models predicting stock price direction
Directional Accuracy Regress M1 M2 M3
Bag of Words 47.8% 45.4% 49.9% 50.0%
Noun Phrases 47.7% 49.4% 50.8% 49.8%
Named Entities 46.9% 47.7% 48.8% 49.4%
Totals 47.5% 47.5% 49.9% 49.7%
120 © 2005
Experimental Findings
• Simulated Trading results of the models
– Percentage Return on money invested
• Model M2 had a better average return (2.09%)
than the other models
Trading Engine Regress M1 M2 M3
Bag of Words -1.81% -0.34% 1.59% 0.98%
Noun Phrases -1.81% 0.62% 2.57% 1.17%
Named Entities -2.26% -0.47% 2.02% 2.97%
Totals -1.94% -0.05% 2.09% 1.43%
121 © 2005
Conclusions
• Model M2 performed the best of the Models
– Consistently performed better than the other
models
• Closeness (0.04261 to Regress at 0.07228)
• Directional Accuracy (49.9% to Model M3 at 49.7%)
• Simulated Trading (2.09% to Model M3 at 1.43%)
– This is the result of capitalizing on the article
terms and stock price at the time of article
release for prediction
122 © 2005
Conclusions
• Proper Nouns performed the best of the Textual Representations in Model M2
– Performed better in 2 of the 3 evaluation metrics
• Directional Accuracy (50.9% to Noun Phrases at 50.8%)
• Simulated Trading (2.84% to Noun Phrases at 2.57%)
123 © 2005
Future Directions
• Explore other stocks, e.g., NASDAQ
– Look at stocks outside of the S&P 500
• Look at the effect of “breaking” news articles on
different industries
• Explore news categories and news sentiments
124 © 2005
Product Opinion Classification in
Multilingual Web Forums
125 © 2005
Sentiment Analysis in Multiple
Languages: Feature Selection for
Opinion Classification in Web Forums
Ahmed Abbasi, Hsinchun Chen, and Arab
Salem
JCDL, 2007; ACM TOIS 26(3), 2008;
MISQ Forthcoming, 2008
126 © 2005 126
Sentiment Analysis
• Sentiment analysis attempts to identify and
analyze opinions and emotions.
• Hearst (1992) originally proposed the idea of
mining direction-based text.
• In recent years it has been applied to various
forms of web-based discourse (Agrawal et al.,
2003; Efron, 2004).
• Application to web group forums can provide
insight into important discussion and trends.
127 © 2005 127
Sentiment Analysis
• Traditional forms of content analysis, such as topical analysis may not be effective for forums.
• Nigam and Hurst (2004) found that only 3% of USENET sentences contained topical information.
• In contrast, web discourse is rich in sentiment related information (Subasic & Huettner, 2001).
128 © 2005 128
Sentiment Analysis Characteristics
• Tasks – Classification or trend analysis.
• Features – Attributes that are the most effective discriminators of
sentiment polarity.
• Techniques – Analytical methods used to discriminate between
sentiments.
• Domain – Reviews (movies, products, etc.), Web Discourse
(forums, blogs, web pages), and news articles.
129 © 2005 129
Sentiment Analysis Domains
• Reviews
– Movie, product, and music reviews
• (Morinaga et al., 2002; Pang et al., 2002; Turney, 2002)
• Discourse
– Includes web forums, newsgroups, and blogs.
– Sentiments about specific issues/topics
• Abortion, Gun Control, Politics (Agrawal et al., 2003; Efron, 2004)
– General sentiments
• Donath et al. (1999) evaluated the USENET forum alt.soc.greek for sentiments relating to anger and aggression.
• News Articles/Documents
– (Yi et al., 2003; Wilson et al., 2005)
130 © 2005 130
Taxonomy of Sentiment Analysis Research
Category | Description | Label
Tasks
Classification | Classifying sentiment polarity | C1
Trend Analysis | Evaluating sentiment balance and temporal trends | C2
Features
Syntactic | Word N-grams, POS tags, punctuation | F1
Semantic | Polarity tags, appraisal groups, semantic orientation | F2
Link Based | Web links, send/reply patterns, and document citations | F3
Stylometric | Features such as average sentence length, special character frequencies | F4
Techniques
Machine Learning | Techniques such as SVM, Naïve Bayes, etc. | T1
Link Analysis | Techniques such as citation analysis and message send/reply patterns | T2
Similarity Score | Phrase pattern matching, semantic orientation, etc. | T3
Visualization | Loom, radar charts, etc. | T4
Domains
Reviews | Product and movie reviews | D1
Discourse | Web forums and blogs | D2
News Articles | Online news articles and documents | D3
131 © 2005
Previous Sentiment Analysis Studies
Study Task Features Feature Reduction Techniques Data Type Multilingual Data
C1 C2 F1 F2 F3 F4 Yes/No T1 T2 T3 T4 D1 D2 D3 Yes/No
Donath et al., 1999 √ √ No √ √ No
Subasic&Huett, 2001 √ √ No √ √ √ No
Tong, 2001 √ √ √ √ No √ √ √ No
Morinaga et al., 2002 √ √ √ Yes √ √ √ No
Pang et al., 2002 √ √ No √ √ No
Turney, 2002 √ √ No √ √ No
Agrawal et al., 2003 √ √ √ No √ √ √ No
Dave et al., 2003 √ √ No √ √ √ No
Nasukawa & Yi, 2003 √ √ √ No √ √ No
Yi et al., 2003 √ √ Yes √ √ √ No
Yu & Hatzivassil, 2003 √ √ √ No √ √ √ No
Beineke et al., 2004 √ √ No √ √ √ No
Efron, 2004 √ √ √ No √ √ √ No
132 © 2005
Previous Sentiment Analysis Studies
Study Task Features Feature Reduction Techniques Data Type Multilingual Data
C1 C2 F1 F2 F3 F4 Yes/No T1 T2 T3 T4 D1 D2 D3 Yes/No
Fei et al., 2004 √ √ No √ √ No
Gamon, 2004 √ √ √ Yes √ √ No
Grefenstette etal.,2004 √ √ No √ √ No
Hu & Liu, 2004 √ √ √ No √ √ No
Kanayama et al., 2004 √ √ √ No √ √ Yes
Kim & Hovy, 2004 √ √ No √ √ No
Pang & Lee, 2004 √ √ No √ √ √ No
Mullen & Collier, 2004 √ √ √ No √ √ No
Nigam & Hurst, 2004 √ √ √ No √ √ No
Liu et al., 2005 √ √ √ √ No √ √ √ No
Mishne, 2005 √ √ √ √ No √ √ No
Whitelaw et al., 2005 √ √ √ No √ √ No
Wilson et al., 2005 √ √ √ No √ √ No
133 © 2005 133
System Design - Overview
The system design has two major components:
• A feature extractor that derives the extended feature set
• The Ink Blot technique, which can be used for text classification and analysis
134 © 2005 134
Authorship Feature Set (Abbasi & Chen, 2005)
[Figure: feature set tree (418 features) with four top-level groups: Lexical, Syntactic, Structural, and Content Specific. Node labels include Char-Based, Word-Based, Char-Level, Letter Frequency, Special Char., Word-Level, Vocab. Richness, Word Length Dist., Elongation, Punctuation, Function Words, Word Structure, Word Roots, Message Level, Paragraph Level, Technical Structure, Contact Information, Font Color, Font Size, Embedded Images, Hyperlinks, Race/Nationality, and Violence.]
135 © 2005
System Design – Feature Set
• The feature set comprises stylistic, topical, and sentiment features.
• A minimum frequency threshold of 10 is used to select the n-gram features.
Group | Category | Quantity | Description
Style | Word-Level Lexical | 5 | total words, % char. per word
 | Character-Level Lexical | 5 | total char., % char. per message
 | Character N-Grams | < 18,278 | count of letters, bigrams, trigrams
 | Digits N-Grams | < 1,110 | count of digits, bigrams, trigrams
 | Word Length Dist. | 20 | frequency of 1-20 letter words
 | Vocabulary Richness | 8 | richness (e.g., hapax legomena)
 | Special Characters | 21 | occurrences of char. (e.g., @#$%^&*+)
 | Function Words | 300 | frequency of function words (e.g., of, for)
 | Punctuation | 8 | occurrence of punctuation (e.g., !;:,.?)
 | POS Tag N-Grams | varies | frequency of tag n-grams (e.g., NP VB)
 | Message Structure | 6 | e.g., has greeting, has url
 | Paragraph Structure | 8 | e.g., no. of sentences per paragraph
 | Technical Structure | 50 | e.g., file extensions, fonts, use of images
 | Misspelled Words | < 5,513 | common misspellings (e.g., “beleive”)
Topic | Word N-Grams | varies | bag-of-word n-grams (e.g., “senior editor”)
 | Noun Phrases | varies | e.g., “New York, United States”
 | Named Entities | varies | e.g., “McDonalds”, “KFC”, “AOL”
Sentiment | Polar Adjectives | 3,000 | positive and negative sentiment terms
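The frequency-threshold step for n-gram features can be sketched as follows. This is an illustration only: the helper names and the toy messages are hypothetical, and the sketch covers just letter unigrams, bigrams, and trigrams (whose upper bound of 26 + 26² + 26³ = 18,278 matches the Character N-Grams row).

```python
import re
from collections import Counter


def letter_ngrams(text, max_n=3):
    """Count letter unigrams, bigrams, and trigrams inside each word."""
    counts = Counter()
    for word in re.findall(r"[a-z]+", text.lower()):
        for n in range(1, max_n + 1):
            for i in range(len(word) - n + 1):
                counts[word[i:i + n]] += 1
    return counts


def select_features(messages, min_freq=10):
    """Keep only n-grams that occur at least min_freq times in the corpus."""
    total = Counter()
    for message in messages:
        total.update(letter_ngrams(message))
    return sorted(g for g, c in total.items() if c >= min_freq)


def vectorize(message, features):
    """Turn one message into a count vector over the selected n-grams."""
    counts = letter_ngrams(message)
    return [counts[g] for g in features]


# Toy corpus (made up): ten short messages.
messages = ["great camera"] * 5 + ["bad lens"] * 5
features = select_features(messages, min_freq=10)
vector = vectorize("bad lens", features)
```

On a real forum corpus the threshold removes the long tail of rare n-grams, which keeps the feature vectors far shorter than the theoretical maximum.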
136 © 2005 136
System Design: Ink Blots
Ink Blot Technique Steps
1) Separate input text into two classes (one for the class of interest, one containing all remaining texts).
2) Extract feature vectors for messages.
3) Input vectors into DTM as binary class problem.
4) For each feature in the computed decision tree, determine blot size and color based on DTM weight and feature usage.
5) Overlay feature blots onto their respective occurrences in text.
6) Repeat steps 1-5 for each class.
137 © 2005 137
Ink Blots: Pirated Software Sales Data
[Figure: ink blot visualizations for Authors A, B, C, and D, with Message 1 and Message 2 shown]
138 © 2005 138
Ink Blot Categorization on Shorter Messages
[Figure: ink blot categorization of shorter messages for Authors A, B, C, and D]
139 © 2005 139
Evaluation - Hypotheses
• We propose the following research hypotheses relating to web forum text analysis:
• H1: There is no performance difference between the Ink Blot technique and the benchmark SVM technique for the categorization of topical/sentiment/genre information – SVM vs. Ink Blots
• H2: There is no performance difference between the use of bag-of-word features (baseline) and the extended feature set – Extended feature set vs. Baseline
140 © 2005 140
Evaluation – Experiment 1: Topics
• Topic Categorization
– Objective to test effectiveness of features and techniques for capturing topical information.
– Test bed = 10 topics taken from Enron email corpus (100 emails per topic).
– Two experiment settings were run, one using 5 topics and the other one using all 10 topics.
• Both techniques were run using 10-fold cross validation.
• For Ink Blots, the class with the highest ratio of red to blue blot area was assigned the anonymous message.
– Extended feature set = bag-of-words and noun phrases
• Both effective in prior research (Dumais et al., 1998; Chen et al., 2003).
141 © 2005 141
Evaluation – Experiment 1: Topics
• The extended feature set significantly outperformed the bag-of-words baseline.
• Both techniques coupled with the extended features achieved accuracy over 90% in all instances.
• However, SVM outperformed the Ink Blot technique for the 5 and 10 topic experiment settings.
• In both cases, SVM's performance advantage was statistically significant based on the p-values for the pairwise t-tests.
# Topics
Techniques 5 Topics 10 Topics
SVM 95.70 93.25
Ink Blots 92.25 90.10
Baseline 88.75 86.55
142 © 2005 142
Evaluation – Experiment 2: Sentiments
• Sentiment Classification
– Objective to test effectiveness of features and techniques for capturing opinions.
– Test bed of 2000 digital camera product reviews taken from www.epinions.com.
• 1000 positive (4-5 star) and 1000 negative (1-2 star) reviews
• 500 for each star level (i.e., 1,2,4,5)
– Two experimental settings were tested
• Classifying 1 star versus 5 star (extreme polarity)
• Classifying 1+2 star versus 4+5 star (milder polarity)
– Extended feature set encompassed word n-grams and a lexicon of 3000 positively or negatively oriented adjectives (Pang et al., 2002; Turney & Littman, 2003).
143 © 2005 143
Evaluation – Experiment 2: Sentiments
• SVM marginally outperformed Ink Blots
– However, the enhanced performance was not statistically significant (p-values on pairwise t-tests > 0.05).
• The extended feature set significantly outperformed the bag-of-words baseline.
• The overall accuracies for both SVM and Ink Blots were consistent with previous work (i.e., in the 85%-95% range).
Sentiments
Techniques Extreme Polarity Mild Polarity
SVM 93.00 89.40
Ink Blots 92.20 86.80
Baseline 83.00 77.10
144 © 2005 144
Evaluation – Experiment 3: Genres
• Genre Classification
– Objective to test effectiveness of features and techniques for capturing genres.
– Test bed of 3000 forum postings from the Sun Technology Forum (forum.java.sun.com)
• Genres included questions, informative messages, and general messages (no information, just comments).
• 1000 messages used for each genre.
– Two experimental settings were run:
• Questions (1000 messages) versus non-questions (500 informative, 500 comments)
• All three genres (1000 messages each)
– The extended feature set consisted of lexical, syntactic, structural, content-specific, and n-gram features.
145 © 2005 145
Evaluation – Experiment 3: Genres
• Ink Blots marginally outperformed SVM
– However, the enhanced performance was not statistically significant based on pairwise t-tests (p-values > 0.05).
• The extended feature set significantly outperformed the bag-of-words baseline.
• The overall accuracies for both SVM and Ink Blots were consistent with previous results dealing with 2-3 genres
Genres
Techniques | Questions vs. Non-Questions | All Three Genres
SVM | 98.10 | 96.40
Ink Blots | 98.55 | 96.50
Baseline | 90.10 | 86.00
146 © 2005 146
Conclusions
• In this work we presented a CMC archive visualization system consisting of:
– An extended feature set for various CMC text mining tasks (e.g., topics, sentiments, affects, genres)
– The Ink Blot technique.
• We used the system to provide DL exploration services:
– Categorization
– Analysis
– Visualization: To help identify pertinent and significant text features (suitable for human inspection and validation)
• Several analysis illustrations were presented and experiments were used to evaluate the categorization capabilities of the system.
147 © 2005 147
Conclusions and Future Directions
• Our research contributions are twofold:
– First, we are unaware of any prior research using such an extensive set of features for representing CMC text.
– Second, we presented the Ink Blot technique for visualizing these features.
• We are expanding our feature sets and exploring other feature reduction and visualization techniques for CMC text analysis.
• We are testing selected techniques for opinion mining, internet frauds, and security informatics applications.
148 © 2005
Opportunities and Future
Directions
149 © 2005
Finance & Accounting Data Sources
in US and Taiwan: From Data to Text
and Web 2.0
Hsinmin Lu, Yida Chen & Hsinchun Chen
Artificial Intelligence Lab
University of Arizona
150 © 2005
US Financial Databases
151 © 2005
Types of Financial Data
• Company Financials:
– Balance sheets
– Income statement
– Company management and ownership
– Earnings forecasts and analysts’ recommendations
– Mergers and acquisitions
– Audit information
– Banks and insurance companies
– Major company events
• Financial Markets and Prices:
– Stock prices
– Market indices and factors
– Mutual funds
– Bonds
– Derivatives
152 © 2005
Types of Financial Data
• Macroeconomics:
– GDP, production indices, consumer price index, wages, unemployment rate
• Financial News:
– Newspapers: Wall Street Journal, Financial Times
– Newswire: Reuters, PR Newswire
• Financial Blogs and Forums
153 © 2005
Data Providers
• Government
– Securities and Exchange Commission (EDGAR)
– US Census Bureau
– Bureau of Labor Statistics
– Federal Reserve Banks
• Commercial Data Services
– Wharton Research Data Services (WRDS)
– Bloomberg
– Reuters
– Lexis Nexis
• Financial Web Sites
– Yahoo Finance
– Google Finance
– Market Watch
– CNN Money
154 © 2005
Data Types and Data Providers
Columns: Company Financial, Financial Markets, News, Macroeconomics
GOV: SEC (EDGAR) X
GOV: Census X
GOV: Bureau of Labor Stat. X
GOV: US Treasury X
WRDS X X X
Bloomberg X X X X
Reuters X X X X
Yahoo Finance X X X X
Google Finance X X X X
Market Watch X X X X
CNN Money X X X X
Lexis Nexis X X
155 © 2005
User Interface
• Web interface for all government websites
– EDGAR provides FTP service
• WRDS: SSH and Web Interface
• Bloomberg: proprietary software and web
interface
• Reuters: proprietary software
• Lexis Nexis: web interface
156 © 2005
Data Standards
• Government
– EDGAR: plain text, HTML, and XBRL (an XML-based standard for business reporting)
– Other government websites usually provide data download in CSV or text format
• Commercial Data Services
– WRDS: CSV, text, SAS data file, STATA data file
– Bloomberg and Reuters: text, CSV and Excel
– Lexis Nexis: HTML
• Financial Websites, Blogs and Forums
– HTML, CSV, text (spidering needed)
157 © 2005
Company Financial Data
158 © 2005
Company Financial Data
• EDGAR
– 10-K and 10-Q: yearly and quarterly reports
– 8-K: reports major corporate events such as mergers or changes in the registrant's certifying accountant
– Comment and response letters to company filings
– Form 3, 4, 5: Insider trading reports
Data Example:
<SEC-DOCUMENT>0001193125-06-001869.txt : 20060105
<SEC-HEADER>0001193125-06-001869.hdr.sgml : 20060105
<ACCEPTANCE-DATETIME>20060105170941
ACCESSION NUMBER: 0001193125-06-001869
CONFORMED SUBMISSION TYPE: 10-Q
PUBLIC DOCUMENT COUNT: 7
CONFORMED PERIOD OF REPORT: 20051130
[Data Truncated]
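The "KEY: VALUE" lines in a filing header like the one in the data example can be pulled into a dictionary with a few lines of Python. This is a minimal sketch: `parse_header` is a hypothetical helper, the sample text is limited to the lines shown on the slide, and real filings contain many more fields and nested sections that this sketch does not handle.

```python
from datetime import datetime

# Header lines as shown in the data example (the rest of the filing is omitted).
SAMPLE_HEADER = """<SEC-DOCUMENT>0001193125-06-001869.txt : 20060105
<SEC-HEADER>0001193125-06-001869.hdr.sgml : 20060105
<ACCEPTANCE-DATETIME>20060105170941
ACCESSION NUMBER: 0001193125-06-001869
CONFORMED SUBMISSION TYPE: 10-Q
PUBLIC DOCUMENT COUNT: 7
CONFORMED PERIOD OF REPORT: 20051130
"""


def parse_header(text):
    """Collect 'KEY: VALUE' lines, skipping the tag lines that start with '<'."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("<"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


header = parse_header(SAMPLE_HEADER)
# Dates in the header use the YYYYMMDD layout.
period_end = datetime.strptime(header["CONFORMED PERIOD OF REPORT"], "%Y%m%d").date()
```

Fields such as the submission type make it easy to filter a batch of downloaded filings down to, say, 10-Q reports before any text mining is attempted.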
159 © 2005
Company Financial Data (cont’)
• COMPUSTAT (in WRDS)
– Provided by Standard & Poor's
– More than 24,000 active and inactive publicly held companies.
– Annual and quarterly income statement, balance sheet, statement of cash flows, and supplemental data items
– Also contains information on aggregates, industry segments, banks, market prices, dividends, and earnings.
– Available in various data formats (CSV, XLS, SAS data file, STATA data file)
160 © 2005
Company Financial Data (cont’)
• COMPUSTAT Data Example (Annual company financial information):
datadate | fyear | tic | cusip | conm | curncd | ni (Net Income)
20051231 | 2005 | IBM | 459200101 | INTL BUSINESS MACHINES CORP | USD | 7934
nopi (Nonoperating Income) | np (Note payable) | oancf (Operating Activities - Net Cash Flow) | pi (Pretax Income) | pidom (PI domestic)
1434 | 4228 | 14874 | 12226 | 7450
161 © 2005
Company Financial Data (cont’)
• Yahoo Finance
– Provides an easy-to-use interface to company profiles, key statistics, SEC filings, and competitors
– Usually does not provide longitudinal data
Data Example:
162 © 2005
Company Financial Data (cont’)
• IBES
– Provides analysts' earnings estimates, recommendations (buy-hold-sell), and actual reported earnings
– Data Example:
OFTIC MEASURE ANALYS FPI FPEDATS ESTDATS VALUE actual REPDATS
INTC EPS 48357 6 31-Mar-98 19-Feb-98 0.235 0.2025 14-Apr-98
INTC EPS 48357 6 31-Mar-98 5-Mar-98 0.1775 0.2025 14-Apr-98
INTC EPS 938 6 31-Mar-98 5-Mar-98 0.18 0.2025 14-Apr-98
INTC EPS 40196 6 31-Mar-98 3-Feb-98 0.2225 0.2025 14-Apr-98
INTC EPS 40196 6 31-Mar-98 5-Mar-98 0.1675 0.2025 14-Apr-98
INTC EPS 10635 6 31-Mar-98 14-Jan-98 0.2275 0.2025 14-Apr-98
INTC EPS 10635 6 31-Mar-98 5-Mar-98 0.1825 0.2025 14-Apr-98
INTC EPS 9077 6 31-Mar-98 28-Jan-98 0.24 0.2025 14-Apr-98
INTC EPS 9077 6 31-Mar-98 5-Mar-98 0.1925 0.2025 14-Apr-98
INTC EPS 9077 6 31-Mar-98 12-Mar-98 0.1875 0.2025 14-Apr-98
INTC EPS 9236 6 31-Mar-98 15-Jan-98 0.2275 0.2025 14-Apr-98
163 © 2005
Company Financial Data (cont’)
• AuditAnalytics: Audit information on over 1,200 accounting firms and 15,000 publicly registered companies
– Who is auditing whom
– How much they are paying for what services
– Create reports by auditor, fees, location, industry
COMPANY_FKEY | AUDITOR_NAME | FISCAL_YEAR | FISCAL_YEAR_ENDED | AUDIT_FEES | NON_AUDIT_FEES | TOTAL_FEES
51143 | PricewaterhouseCoopers LLP | 2003 | 31-Dec-03 | 11300000 | 40900000 | 52200000
51143 | Ernst & Young LLP | 2003 | 31-Dec-03 | 2500000 | 8700000 | 11200000
51143 | PricewaterhouseCoopers LLP | 2004 | 31-Dec-04 | 21600000 | 55100000 | 76700000
51143 | Ernst & Young LLP | 2004 | 31-Dec-04 | 3300000 | 1400000 | 4700000
51143 | PricewaterhouseCoopers LLP | 2005 | 31-Dec-05 | 25300000 | 32000000 | 57300000
789019 | Deloitte & Touche LLP | 2003 | 30-Jun-03 | 10700000 | 16800000 | 27500000
164 © 2005
Financial Markets and Prices
165 © 2005
Financial Markets and Prices
• Yahoo Finance
– Provides daily securities and indices prices and
trading volumes
– Provides charting tools and data download
functionalities
Data Example:
166 © 2005
Financial Markets and Prices (cont’)
• The Center for Research in Security Prices (CRSP)
– Maintained by CRSP at the Graduate School of Business of the University of Chicago
– Comprehensive collection of security price, return, and volume data for the NYSE, AMEX and NASDAQ stock markets
• Various stock indices and mutual funds are also included
– Data frequency: daily, monthly
– Often merged with Compustat for research purposes
167 © 2005
Financial Markets and Prices (cont’)
• CRSP Data Example:
DATE TICKER DIVAMT SHROUT BIDLO ASKHI PRC VOL RET BID ASK
20070126 IBM 1506352 96.84 97.83 97.45 5771100 -0.00062 97.45 97.46
20070129 IBM 1506352 97.45 98.66 98.54 7294800 0.011185 98.56 98.59
20070130 IBM 1506352 98.5 99.45 99.37 7178000 0.008423 99.37 99.4
20070131 IBM 1506352 98.35 99.48 99.15 6446400 -0.00221 99.1 99.15
20070201 IBM 1506352 97.96 99.18 99 6612400 -0.00151 99 99.02
20070202 IBM 1506352 98.88 99.73 99.17 6657000 0.001717 99.18 99.23
20070205 IBM 1506352 98.9 100.44 100.38 8184800 0.012201 100.25 100.33
20070206 IBM 1506352 99.54 100.4 99.85 6532800 -0.00528 99.85 99.86
20070207 IBM 0.3 1506352 99.12 100.36 99.54 7698200 -0.0001 99.54 99.58
20070208 IBM 1506352 98.65 99.74 99.62 6152300 0.000804 99.66 99.67
20070209 IBM 1506352 97.81 99.7 98.55 6101100 -0.01074 98.56 98.58
20070212 IBM 1506352 98.22 99.2 98.58 5331043 0.000304 98.6 98.64
20070213 IBM 1506352 97.8 98.74 98.29 5702815 -0.00294 98.27 98.34
20070214 IBM 1506352 98.25 99.43 99.2 5644733 0.009258 99.22 99.26
20070215 IBM 1506352 98.48 99.52 98.92 5568600 -0.00282 98.95 99.04
20070216 IBM 1506352 98.63 99.25 98.99 4800700 0.000708 98.97 98.99
20070220 IBM 1506352 98.55 99.46 99.35 4124200 0.003637 99.35 99.39
20070221 IBM 1506352 98.7 99.37 99.09 4302400 -0.00262 99.09 99.16
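A minimal sketch of how the RET column relates to PRC and DIVAMT in the sample above. It assumes RET is a simple one-day holding-period return that includes cash dividends, which is consistent with the rows shown; it ignores splits and other adjustments and is not a statement of CRSP's full methodology. The function names are illustrative.

```python
# (DATE, DIVAMT, PRC, RET) copied from the IBM rows above; a blank DIVAMT is stored as 0.0.
ROWS = [
    ("20070126", 0.0, 97.45, -0.00062),
    ("20070129", 0.0, 98.54, 0.011185),
    ("20070130", 0.0, 99.37, 0.008423),
    ("20070131", 0.0, 99.15, -0.00221),
    ("20070201", 0.0, 99.00, -0.00151),
    ("20070202", 0.0, 99.17, 0.001717),
    ("20070205", 0.0, 100.38, 0.012201),
    ("20070206", 0.0, 99.85, -0.00528),
    ("20070207", 0.3, 99.54, -0.0001),
    ("20070208", 0.0, 99.62, 0.000804),
]


def total_return(prev_price, price, dividend=0.0):
    """Simple return from the previous close, counting any dividend paid today."""
    return (price + dividend) / prev_price - 1.0


def recompute_returns(rows):
    """Return (date, recomputed return, reported RET) for every row after the first."""
    out = []
    for prev, cur in zip(rows, rows[1:]):
        date, dividend, price, reported = cur
        out.append((date, total_return(prev[2], price, dividend), reported))
    return out


if __name__ == "__main__":
    for date, computed, reported in recompute_returns(ROWS):
        print(date, round(computed, 6), reported)
```

Note the 20070207 row: the price fell from 99.85 to 99.54, but the 0.3 dividend brings the return to roughly -0.0001, matching the reported value.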
167
168 © 2005
Financial Markets and Prices (cont’)
• Trade and Quote (TAQ): Intraday transactions data (trades and quotes) for all securities listed on the New York Stock Exchange (NYSE) and American Stock Exchange (AMEX), as well as Nasdaq National Market System (NMS) and SmallCap issues.
• Data Example:
symbol  DATE      TIME     PRICE    SIZE
BRKB    7-Nov-02  9:37:48  2458     320
BRKB    7-Nov-02  9:37:49  2458     100
BRKB    7-Nov-02  9:37:51  2458     90
BRKB    7-Nov-02  9:37:51  2458     10
BRKB    7-Nov-02  9:37:56  2458     30
BRKB    7-Nov-02  9:39:13  2455.1   40
BRKB    7-Nov-02  9:39:21  2455.11  500
BRKB    7-Nov-02  9:41:10  2456     30
BRKB    7-Nov-02  9:41:35  2460     110
BRKB    7-Nov-02  9:42:24  2460     100
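Tick data like the TAQ sample above is usually aggregated before analysis. The sketch below computes a volume-weighted average price (VWAP) and one-minute bars from the ten BRKB trades shown; the tuple layout and function names are assumptions made for illustration.

```python
# (TIME, PRICE, SIZE) copied from the BRKB trades above (7-Nov-02).
TRADES = [
    ("9:37:48", 2458.0, 320),
    ("9:37:49", 2458.0, 100),
    ("9:37:51", 2458.0, 90),
    ("9:37:51", 2458.0, 10),
    ("9:37:56", 2458.0, 30),
    ("9:39:13", 2455.1, 40),
    ("9:39:21", 2455.11, 500),
    ("9:41:10", 2456.0, 30),
    ("9:41:35", 2460.0, 110),
    ("9:42:24", 2460.0, 100),
]


def vwap(trades):
    """Volume-weighted average price over all trades."""
    volume = sum(size for _, _, size in trades)
    return sum(price * size for _, price, size in trades) / volume


def minute_bars(trades):
    """Aggregate time-ordered trades into open/high/low/close/volume bars per minute."""
    bars = {}
    for time, price, size in trades:
        minute = time.rsplit(":", 1)[0]  # "9:37:48" -> "9:37"
        bar = bars.setdefault(
            minute,
            {"open": price, "high": price, "low": price, "close": price, "volume": 0},
        )
        bar["high"] = max(bar["high"], price)
        bar["low"] = min(bar["low"], price)
        bar["close"] = price
        bar["volume"] += size
    return bars
```

Bars at a fixed frequency can then be aligned with timestamped newswire articles for intraday event studies.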
169 © 2005
Financial Markets and Prices (cont’)
• Federal Reserve Banks: federal funds rate, interest rates, foreign exchange rates
170 © 2005
Macroeconomics
171 © 2005
Macroeconomics
• Government:
– US Census Bureau: income, economic census
– Bureau of Labor Statistics: wages, earnings,
labor productivity, consumer price index
172 © 2005
Financial News
173 © 2005
Financial News
• Financial newspaper:
– Financial Times: archived by Academic OneFile and Lexis Nexis
– Wall Street Journal: archived by ABI/Inform
– Partial current news articles are available from Yahoo News, Google News and the newspapers’ own websites
• Newswire:
– PR Newswire: archived by Lexis Nexis (timestamps precise to the minute) and General OneFile (full text and date)
– Reuters: to the best of my knowledge, no third-party archive is available
– Partial current newswires are available from Yahoo Finance and the newswires’ own websites (mostly with timestamps)
• Data format: HTML (can be spidered)
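Since the data format is HTML, a spider needs at least a link and headline extractor. The sketch below uses only Python's standard html.parser module; the sample markup and paths are hypothetical, and real news pages will differ in structure (and may restrict crawling in their terms of use).

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical listing page; the structure is an assumption for illustration.
SAMPLE = (
    '<ul><li><a href="/news/1.html">First headline</a></li>'
    '<li><a href="/news/2.html">Second headline</a></li>'
    "<li><a>No link here</a></li></ul>"
)

if __name__ == "__main__":
    for href, title in extract_links(SAMPLE):
        print(href, title)
```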
174 © 2005
Financial Newspaper: WSJ
• The WSJ provides abstracts, classification codes, and the institutions involved
– Only for selected articles
– May be a good source of training data for entity extraction
• Data Example:
Title: Automotive Brief: Proton Holdings Bhd. (Eastern edition).
Date: Jun 1, 2006. pg. A13
Classification Codes: 9175 Western Europe, 9179 Asia & the
Pacific, 8680 Transportation equipment industry
Companies: Proton Holdings Bhd, PSA Peugeot Citroen SA (NAICS: 336111 )
Column Name: Business Brief Publication
Abstract: (Document Summary) Malaysia's national car maker Proton Holdings
Bhd. said it is pursuing alliance talks with France-based PSA Peugeot Citroen SA
and plans to introduce six new models by 2008 after reporting sharply lower profit
for its most recent fiscal year.
Full Text (140 words) Malaysia's national car maker Proton Holdings Bhd. said it
is pursuing alliance talks with [document truncated]
175 © 2005
Social Network in the Wall Street Journal
176 © 2005
Newswire: PR Newswire
• Lexis Nexis provides complete historical collection for the past 7 years (displaying New York time)
• A subset of news articles can be downloaded from Yahoo Finance (displaying GMT time)
• Compare the number of articles from the two sources (year 2008):
– Yahoo Finance contains most of the articles
– Missing articles are mostly related to politics
               Jan. 21 (Monday)  Jan. 22  Jan. 23  Jan. 24  Jan. 25  Jan. 26  Jan. 27
Lexis Nexis    331               944      860      836      439      49       14
Yahoo Finance  348               899      810      770      379      10       13
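The comparison above can be made explicit with a per-day and overall coverage ratio. This sketch uses the seven daily counts from the table; it compares volumes only and does not check whether the same articles appear in both sources. The variable and function names are illustrative.

```python
DAYS = ["Jan. 21", "Jan. 22", "Jan. 23", "Jan. 24", "Jan. 25", "Jan. 26", "Jan. 27"]
LEXIS_NEXIS = [331, 944, 860, 836, 439, 49, 14]
YAHOO_FINANCE = [348, 899, 810, 770, 379, 10, 13]


def coverage_by_day(reference, subset):
    """Ratio of subset counts to reference counts, day by day."""
    return [s / r for r, s in zip(reference, subset)]


def overall_coverage(reference, subset):
    """Ratio of total subset count to total reference count."""
    return sum(subset) / sum(reference)


if __name__ == "__main__":
    for day, ratio in zip(DAYS, coverage_by_day(LEXIS_NEXIS, YAHOO_FINANCE)):
        print(day, round(ratio, 3))
    print("overall", round(overall_coverage(LEXIS_NEXIS, YAHOO_FINANCE), 3))
```

A ratio above 1 on a given day (as on Monday, Jan. 21) may simply reflect the different time zones the two sources use to assign articles to dates.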
177 © 2005
Newswire: PR Newswire
• Data Example (from Lexis Nexis)
Date: January 21, 2008 Monday 11:01 PM GMT
Title: Oil States Announces Fourth Quarter 2007 Earnings Conference
Call;
Wednesday, February 20, 2008 at 11:00 am Eastern Time
Length: 399 words
DATELINE: HOUSTON Jan. 21
Fulltext: HOUSTON, Jan. 21 /PRNewswire-FirstCall/ -- Oil States
International (NYSE:OIS) announced today that it has scheduled its
fourth quarter 2007 earnings conference call for Wednesday, February
20, 2008 at 11:00 am Eastern time. During the call, the company will
discuss the results for the quarter ended December 31, 2007, which are
expected to be released on February 19, after markets close.
[document truncated]
Web site: http://www.oilstatesintl.com/
178 © 2005
Current News Collection
• Wall Street Journal (full text)
– 283,280 articles; 24,994 institutions
– 8/4/1999 to 3/2/2007
• New York Times
– 673,142 articles
– 1/1/2000 to 3/1/2007
• Washington Post
– 440,500 articles
– 1/1/2000 to 3/1/2007
• Financial Times
– 476,000 articles
– 1/1/2000 to 3/1/2007
179 © 2005
Current Collection (cont’)
• PR Newswire (full text with date)
– 1,315,000 articles
– 1/1/2000 to 5/31/2007
• PR Newswire (title and timestamp)
– Collecting: 1/1/2000 to 5/31/2007
• PR Newswire (from Yahoo Finance; full text with timestamp)
– From 1/1/2008
– About 700 articles per day
180 © 2005
Current Collection (cont’)
• Reuters Newswire (from Yahoo Finance;
full text with timestamp)
– From 1/1/2008
– About 300 articles per day
• Associated Press (from Yahoo Finance; full
text with timestamp)
– From 1/1/2008
– About 500 articles per day
181 © 2005
Financial Data Sources for
Companies in Taiwan
182 © 2005
Types of Financial Data
• Company Financials:
– Balance sheets
– Income statements
– Company management and ownership
– Earnings forecasts and analysts’ recommendations
– Mergers and acquisitions
– Audit information
– Banks and insurance companies
– Major company events
• Financial Markets and Prices:
– Stock prices
– Market indices and factors
– Mutual funds
– Bonds
– Derivatives
183 © 2005
Types of Financial Data
• Macroeconomics:
– GDP, production indices, consumer price
index, wages, unemployment rate
• Financial News, Blogs and Forums
– Newspapers: Commercial Times (工商時報), Wealth Magazine (財訊), Wall Street Journal Chinese edition (中文版) …
– Newswire: N/A
184 © 2005
Data Service Providers
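• Government
– Central Bank (中央銀行)
– Executive Yuan (行政院): Financial Supervisory Commission (金管會); Securities and Futures Bureau (證期局); Directorate-General of Budget, Accounting and Statistics (主計處)
• Institutions
– Taiwan Stock Exchange (證券交易所)
– Securities and Futures Institute (證券暨期貨市場發展基金會)
– Taiwan Ratings Corporation (中華信用評等公司)
– Chung-Hua Institution for Economic Research (中華經濟研究院)
– Taiwan Economic Data Center (財團法人經濟資訊推廣中心, AREMOS)
• Commercial Data Services
– Taiwan Economic Journal (台灣經濟新報)
– Central Daily News full-text image database (中央日報全文影像資料庫)
– Central News Agency Chinese/English news database (中央通訊社中英文新聞資料庫)
– Knowledge Winner (知識贏家)
– United Daily News group knowledge base (聯合知識庫)
– Taiwan News Smart Web (臺灣新聞智慧網)
• Financial Web Sites
– Yahoo Finance Taiwan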
185 © 2005
Data Types and Data Providers
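Company Financial | Financial Markets | News | Macroeconomics
Taiwan Stock Exchange (證券交易所) X
Securities and Futures Institute (證券暨期貨市場發展基金會) X X
Central Bank (中央銀行) X
Financial Supervisory Commission, Executive Yuan (行政院金管會) X X
Securities and Futures Bureau, Executive Yuan (行政院證期局) X X
Directorate-General of Budget, Accounting and Statistics, Executive Yuan (行政院主計處) X
Chung-Hua Institution for Economic Research (中華經濟研究院) X
Taiwan Economic Data Center (財團法人經濟資訊推廣中心) X X X
Taiwan Economic Journal (台灣經濟新報) X X X
Yahoo Finance Taiwan X X X X
Taiwan News Smart Web (臺灣新聞智慧網), Central Daily News full-text image database (中央日報全文影像資料庫), Central News Agency Chinese/English news database (中央通訊社中英文新聞資料庫), Knowledge Winner (知識贏家), United Daily News group knowledge base (聯合知識庫) X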
186 © 2005
User Interface
• Most data service providers use a web interface
• Taiwan Economic Journal (台灣經濟新報) and Taiwan Economic Data Center (財團法人經濟資訊推廣中心) use proprietary systems
187 © 2005
Data Standards
• Government
– Data provided on government websites can usually be downloaded in CSV or HTML format
• Institutions
– Taiwan Stock Exchange (證券交易所): fixed-format text for submitting company financial data
• Commercial Data Services
– HTML, text, CSV
• Financial Websites, Blogs and Forums
– HTML, text, CSV
188 © 2005
Company Financial Data
189 © 2005
Company Financial Data
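• Market Observation Post System (公開資訊觀測站), Taiwan Stock Exchange (證交所)
– Company financial statements (公司財務報表)
– Company profile (公司概況)
– Changes in directors’ and supervisors’ shareholdings (董監股權異動)
– Operating overview (營運概況)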
Data Example:
190 © 2005
Company Financial Data
• Yahoo Finance Taiwan
– Basic company information (基本公司資料)
– Revenue and earnings (營收盈餘)
– Dividend policy (股利政策)
Data Example:
191 © 2005
Company Financial Data
• Company financial data is also available from the following data providers
– Securities and Futures Institute (證券暨期貨市場發展基金會)
– Taiwan Economic Data Center (財團法人經濟資訊推廣中心, AREMOS)
– Taiwan Economic Journal (台灣經濟新報)
192 © 2005
Financial Markets and Prices
193 © 2005
Financial Markets and Prices
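• 資訊王 (Securities and Futures Institute, 證券暨期貨市場發展基金會)
– Daily exchange-listed market trading summary: individual stocks, sectors, overall market (每日集中市場交易概況)
– Over-the-counter market trading summary (店頭市場交易概況)
– Emerging stock market company trading summary (興櫃公司交易概況)
• Similar information is available from Taiwan Economic Journal (台灣經濟新報) and AREMOS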
Data Example:
194 © 2005
Macroeconomics
195 © 2005
Macroeconomics
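• Government and other providers:
– Central Bank (中央銀行): exchange rates, interest rates
– Directorate-General of Budget, Accounting and Statistics, Executive Yuan (行政院主計處): consumer price index, unemployment rate, GDP
– Chung-Hua Institution for Economic Research (中華經濟研究院): economic growth predictions
– Taiwan Economic Data Center (財團法人經濟資訊推廣中心, AREMOS): GDP, interest rates, consumer price index, unemployment rate
– Taiwan Economic Journal (台灣經濟新報): GDP, interest rates, consumer price index, unemployment rate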
196 © 2005
Individual Trading Data
• “We have acquired the complete transaction history
of all traders on the TSE from January 1, 1995,
through December 31, 1999. The trade data include
the date and time of the transaction, a stock identifier,
order type (buy or sell -- cash or margin), transaction
price, number of shares, a broker code, and the
identity of the trader.”
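– Prof. Yu-Jane Liu (劉玉珍; Finance, National Chengchi University) and Prof. Yi-Tsung Lee (李怡宗; Accounting, National Chengchi University) in “Who Loses from Trade? Evidence from Taiwan” (under review)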
197 © 2005
Financial News
198 © 2005
Financial News
• Financial newspaper:
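– United Daily News group knowledge base (聯合知識庫): United Daily News (聯合報), Economic Daily News (經濟日報), United Evening News (聯合晚報), Business Weekly (商業週刊), Global Views Monthly (遠見雜誌), CommonWealth Magazine (天下雜誌)
– Knowledge Winner (知識贏家), China Times group: China Times (中國時報), Commercial Times (工商時報), China Times Express (中時晚報)
– Taiwan News Smart Web (臺灣新聞智慧網): China Times (中國時報), United Daily News (聯合報), Economic Daily News (經濟日報), Min Sheng Daily (民生報), United Evening News (聯合晚報), Star News (星報), Commercial Times (工商時報), Central Daily News (中央日報), Liberty Times (自由時報)
– Central Daily News full-text image database (中央日報全文影像資料庫)
– Central News Agency Chinese/English news database (中央通訊社中英文新聞資料庫)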
• Newswire :
– N/A
• Data format: HTML
199 © 2005
Financial Blogs and Forums:
US and Taiwan
200 © 2005
Integrated Web Platforms for Financial
Information
• Several platforms have been established to integrate various sources of financial information in one page.
– ValueWiki
• Created in March 2007
• Organized by company; each page shows all available information on a company and provides outward links to forums and websites
– BoardCentral
• Provides the most recent 10-20 messages from 13 major forums, including Yahoo! Finance, RagingBull, and InvestorVillage
– pfblogs.org
• Integrates 1,137 financial blogs
• Provides a single page to view all the latest postings from those blogs
201 © 2005
ValueWiki
• Each page is a portfolio of a company
• Provides
– Instant stock price
– News feeds
– Background information
– Outward links to
• Relevant websites
• Rumor sites
• Forums
202 © 2005
ValueWiki (Cont’d)
• ValueWiki also provides other types of services
– Blogs
– Instant chat
– Message boards
• The founders of ValueWiki have their own blog, which posts some valuable information, such as
– Top 100 financial blogs by Alexa and Technorati
– Top 50 Web 2.0 financial blogs by Alexa and Technorati
– Personal Top 60 financial blogs
203 © 2005
BoardCentral
• Organized by companies or stocks
• Provides the most recent 10-20 messages from 13 popular forums, including
– Yahoo! Finance
– RagingBull
– Google Finance
– InvestorVillage
– The Motley Fool
– StockHouse
– ClearStation
– TheLion
– FreeRealTime.com
– msn.money
– SiliconInvestor
204 © 2005
BoardCentral (Cont’d)
• It also provides other types of information
– Stock summary
– Stock news from
• Yahoo! Finance
• Google Finance
• MarketWatch
– StockCharts
– Competitors and related companies
205 © 2005
pfblogs.org
• Provides postings from 1,137 financial blogs
• Currently has 110,966 entries
• Provides a search function and sorting by
– Personal Finance
– Real estate
– Investing
• When viewing articles, it links back to the original blogs
206 © 2005
Combination of Forums and Blogs
• Some financial forum sites also provide space for personal blogs
– InvestorVillage
– Stockhouse
– TheLion
– msn.money
– Yahoo! Finance
207 © 2005
Case study: Microsoft-Yahoo Bid
• We use the recent Microsoft-Yahoo
acquisition to see how information was
passed around in forums and blogs before
the official announcement.
• Timeline
– Microsoft officially announced a $44.6 billion bid to buy Yahoo on 01/31/2008
– The press reported it on 02/01/2008
208 © 2005
Discussion in Forums
• Major discussion about the acquisition in forums started after
02/01/08 except in Yahoo! Finance
Date | Main Message | Trigger
01/01/08 | Yahoo! Being Prepped For Sale | * Blog Posting of Seeking Alpha
01/08/08 | More takeover talk @Seeking Alpha | * Blog Posting of Seeking Alpha
01/10/08 | NY Post says MSFT may bid for YHOO | New York Post
01/10/08 | MICROSOFT WILL BUY YAHOO!! |
01/10/08 | A Microsoft/Yahoo Rumor Right Before Yahoo Earnings |
01/10/08 | This rumor could be "real" this time |
01/11/08 | Yahoo Spokesperson Kara Swisher Denies Microsoft Rumor |
01/12/08 | Will Microsoft Pay $50 Billion For Yahoo – Maybe | #
01/13/08 | Motely: 5 reasons why MSFT will buy Yahoo | % Report from TheMotleyFool
01/13/08 | Yahoo Might Not Want To Be Bought |
01/24/08 | Bidding war over Yahoo! after earning | New York Post
01/25/08 | Turns out those rumors are true!! |
01/29/08 | MICROSOFT will buy YAHOO | By Stock Price
01/29/08 | take over happening soon… |
01/29/08 | MSFT to Buy YHOO |
01/30/08 | MICROSOFT WILLING TO BID FOR YAHOOOOOO |
01/30/08 | If MSFT doesn't buy YHOO at $30 |
01/31/08 | Has MSFT denied buyout interest? |
01/31/08 | Just sell the damn company to Bill Gates!! |
209 © 2005
Triggers for Forum Discussion
• In the Yahoo! Finance forum, we can see five trigger points for discussion before the official announcement.
– 01/01/08
• Blog posting by Ashkan Karbasfrooshan on Seeking Alpha: “Is Yahoo! Being Prepped for a Sale”
– 01/10/08
• News article in the New York Post: “Microsoft Deal King to Launch Own Firm”
– 01/13/08
• Report by Rick Aristotle Munarriz at The Motley Fool: “5 Reasons Why Microsoft Will Buy Yahoo”
– 01/24/08
• News article in the New York Post: “Sharks Circle Yahoo! – LBO, Media Bigs Attracted to Battered Stock”
– 01/29/08
• Consistently decreasing stock price
210 © 2005
Different Reactions from the Sources
• When the trigger source is news, members have a more positive attitude and more active discussion.
• When the trigger is a reaction to the stock price or some member’s prophecy, the response is relatively skeptical.
[Figure panels: By News; By Stock Price]
211 © 2005
Discussion in Blogs
• Seven blog postings were found on pfblogs.org.
• Except for the first posting on 01/01/08, blog postings were triggered by
– Other blog postings
– News articles
Date | Blog Posting | Blog | Trigger
01/01/08 | Is Yahoo! Being Prepped for a Sale? | Seeking Alpha |
01/04/08 | Response to Ashkan Karbasfrooshan's 'Is Yahoo Is Being Prepped for Sale?' | Seeking Alpha | By the first blog post
01/08/08 | Google's Search Share and Microsoft's Fast Acquisition | Seeking Alpha | By the news of Microsoft’s buying of FAST Search
01/10/08 | Yahoo Jumps On Renewed Rumors Of MSFT Bid | Barron’s | By the news article of New York Post
01/10/08 | Microsoft/Yahoo Takeover Talk: Here We Go Again | Seeking Alpha | By the news article of New York Post
01/10/08 | 5 Reasons Why Microsoft Will Buy Yahoo! | MotleyFool | By the news article of New York Post
01/11/08 | Yahoo: Bernstein Cuts Target; MSFT Deal Unlikely; Equity Stakes, Cash Now Valued More Highly Than Core Biz | Barron’s | By the news article of New York Post
212 © 2005
Layers of Influence
• We can see a clear flow among stock price, news articles, blogs, and forums. The direction of the flow is related to the credibility of the media.
[Figure: timeline across Stock Price, News, Blogs, and Forum; dates shown: 01/01/08, 01/02/08, 01/08/08, 01/10/08, 01/24/08, 01/29/08]
213 © 2005
Financial Forums in Taiwan
• Twelve active forums were found for the Taiwan stock market.
• Compared to the U.S. financial forums, Taiwan forums have four characteristics:
– Established as user services to support the main business
– Organized by buying strategies rather than by companies or stocks
– Focus on the sharing of star stocks
– Adopt a subscription and membership system
Forum | URL | Access
智富論壇 | smartnet.com.tw | Free
聲動討論 | 168.com.tw | Credit/Free
發財網 e-stock | estock.marbo.com.tw | Free
基智網 FundDJ | funddj.com | Credit/Free
理財網 MoneyDJ | moneydj.com | Credit/Free
Yahoo!奇摩股市 | tw.mb.yahoo.com | Free
聚財網 | www.wearn.com | Credit
理財經算網 RickMall | | Free
DigitalTimes 科技網 | digitimes.com.tw | Free
中時理財網 理財心經 | tb.chinatimes.com | Free
永不老論壇 | yongbulao.com | Credit/Free
23XX電子論壇 | 23xx.com.tw | Free
214 © 2005
Main Business of Forum Hosts
• Financial/General Publication
– 智富論壇
– 聚財網
– 中時理財網
– DigitalTimes
• Financial Software and Services
– 聲動討論
– 發財網
– 基智網
– 理財網
215 © 2005
Organization of Forums
• Taiwan financial forums, except 23xx.com.tw, discuss the overall stock market rather than specific companies or stocks.
216 © 2005
Organization of Forums (Cont’d)
• 23xx.com.tw is the only one focusing on individual companies in the Taiwan high technology industry.
217 © 2005
Sharing of Star Stocks
• The discussion in Taiwan is mostly triggered by stock prices and overall market trend rather than financial news.
– Focus on the sharing of star stocks.
218 © 2005
Subscription and Membership System
• Some forums claim to have domain experts and insider information (明牌, stock tips) and require a subscription to view those advanced postings.
– 聚財網
– 永不老論壇
– 聲動討論
– 基智網
– 理財網
219 © 2005
Financial Blogs in Taiwan
• Financial bloggers in Taiwan can be found on popular blog or forum hosting sites.
– 無名小站 (Wretch) wretch.cc
– 痞客幫 (PIXNET) pixnet.net
– 聚財網
– 基智網
– 中時理財網
• Some popular financial blogs may also require subscription or membership.
220 © 2005
Future Directions and Research Opportunities
221 © 2005
Possible Directions: Accounting and Risk Assessment
• Cumulative Abnormal Return (CAR) based on accounting indicators (e.g., Unexpected Earnings) and qualitative financial and corporate news (e.g., news announcements, events, positive/negative sentiments)
• Enterprise Risk Management (ERM) based on financial/corporate news; Strategic Risk and Operational Risk; news categories (e.g., merger, new product announcement, management change, lawsuit) and sentiments
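For readers unfamiliar with CAR, the sketch below shows one common way to compute it: fit a market model (alpha, beta) on an estimation window, then sum the abnormal returns over the event window around a news item. The market model is only one of several benchmark choices, and every number here is made up for illustration.

```python
def fit_market_model(stock_returns, market_returns):
    """Ordinary least squares fit of stock = alpha + beta * market."""
    n = len(stock_returns)
    mean_s = sum(stock_returns) / n
    mean_m = sum(market_returns) / n
    cov = sum(
        (m - mean_m) * (s - mean_s) for s, m in zip(stock_returns, market_returns)
    )
    var = sum((m - mean_m) ** 2 for m in market_returns)
    beta = cov / var
    alpha = mean_s - beta * mean_m
    return alpha, beta


def cumulative_abnormal_return(stock_event, market_event, alpha, beta):
    """Sum of (actual return - expected return) over the event window."""
    return sum(s - (alpha + beta * m) for s, m in zip(stock_event, market_event))


# Hypothetical estimation window: the stock follows 0.001 + 1.5 * market exactly.
MARKET_EST = [0.01, -0.02, 0.03, 0.00]
STOCK_EST = [0.001 + 1.5 * m for m in MARKET_EST]

# Hypothetical two-day event window around a news announcement.
MARKET_EVENT = [0.01, 0.02]
STOCK_EVENT = [0.05, 0.03]

if __name__ == "__main__":
    alpha, beta = fit_market_model(STOCK_EST, MARKET_EST)
    print(cumulative_abnormal_return(STOCK_EVENT, MARKET_EVENT, alpha, beta))
```

In the research direction above, the event window would be anchored on a news timestamp and the resulting CAR related to the news category or sentiment.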
222 © 2005
Possible Directions: Corporate Governance
• Social network analysis based on corporate board members and their affiliations; executive compensation and corporate governance
• Identify possible illegal insider trading activities; isolate news-sensitive traders and news-neutral traders
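As a sketch of the social network idea, the code below builds a board interlock network: two companies are linked when they share at least one director. The company and director names are hypothetical placeholders, and the representation (plain dictionaries rather than a graph library) is an assumption made to keep the example self-contained.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical board memberships: company -> set of directors.
BOARDS = {
    "CompanyA": {"Director1", "Director2", "Director3"},
    "CompanyB": {"Director2", "Director4"},
    "CompanyC": {"Director3", "Director4", "Director5"},
    "CompanyD": {"Director6"},
}


def interlocks(boards):
    """Map each pair of companies to the directors they share (pairs with none are omitted)."""
    edges = {}
    for a, b in combinations(sorted(boards), 2):
        shared = boards[a] & boards[b]
        if shared:
            edges[(a, b)] = shared
    return edges


def degree(edges):
    """Number of other companies each company is linked to."""
    counts = defaultdict(int)
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return dict(counts)


if __name__ == "__main__":
    links = interlocks(BOARDS)
    for pair, shared in links.items():
        print(pair, sorted(shared))
    print(degree(links))
```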
223 © 2005
Possible Directions: Corporate Sentiments
• Customer sentiment tracking (forums and blogs) for corporate going concerns
• News blogs and forums vs. newswire: How do discussions in forums and blogs interact with breaking news and company performance?
• Use web data to study “infectious behaviors” in forums and blogs; identify web opinion leaders and their impacts
224 © 2005
Possible Directions: Stock Advice
• Stock advisory system: Recommend stock trading
strategy based on breaking news (news category and
sentiment) and corporate assessment
• Stock co-movement analysis: Linking co-occurrence of
news articles to the co-movement of stock prices
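The first step of a co-movement analysis is counting how often two companies are mentioned in the same article. A minimal sketch, assuming each article has already been tagged with the set of tickers it mentions (the article list here is hypothetical):

```python
from collections import Counter
from itertools import combinations

# Hypothetical articles, each reduced to the set of tickers it mentions.
ARTICLES = [
    {"MSFT", "YHOO"},
    {"MSFT", "YHOO", "GOOG"},
    {"GOOG"},
    {"IBM", "MSFT"},
]


def cooccurrence(articles):
    """Count, for every pair of tickers, the number of articles mentioning both."""
    counts = Counter()
    for tickers in articles:
        for pair in combinations(sorted(tickers), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    for pair, n in cooccurrence(ARTICLES).most_common():
        print(pair, n)
```

The pair counts could then be compared with the correlation of the two stocks' returns over the same period.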
225 © 2005 225
Web Marketing Research
Hsinchun Chen & Bob Lusch
University of Arizona
226 © 2005 226
Overview
• Sentiment index: Michigan Consumer Sentiment Survey, BrandIndex.com
• Marketing tools: MarketTools, TrendIQ, Passenger
• Web sentiment and opinion: Blogspot, eBlogger, Technorati, ProgrammableWeb.com, Epinions.com
• Web marketing research opportunities
227 © 2005 227
Michigan CSI
• University of Michigan Consumer Sentiment Survey, since 1952 (monthly)
• 500 telephone interviews in the US per month
• Five questions:
– Q1…financial…better off/worse off than a year ago…?
– Q2…a year from now…better off/worse off financially…?
– Q3…business conditions in the country…good times/bad times financially…?
– Q4…country as a whole…next five years…good times or unemployment…?
– Q5…big things people buy for their homes…now is a good time to buy…?
228 © 2005 228
Michigan CSI (cont’d)
• Index of Consumer Sentiment (ICS): Q1-Q5
• Index of Current Economic Conditions (ICC): Q1 and Q5
• Index of Consumer Expectations (ICE): Q2-Q4
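A sketch of how the three indices are derived from the five questions. It assumes the scoring convention documented by the University of Michigan Surveys of Consumers: each question gets a relative score (percent favorable minus percent unfavorable, plus 100), and the sums are divided by 1966 base-period constants. The constants below are quoted from memory of that documentation and should be verified before use; the sample scores are hypothetical.

```python
def relative_score(pct_favorable, pct_unfavorable):
    """Relative score for one question: % favorable - % unfavorable + 100."""
    return pct_favorable - pct_unfavorable + 100.0


def ics(x1, x2, x3, x4, x5):
    """Index of Consumer Sentiment from the relative scores of Q1-Q5."""
    return (x1 + x2 + x3 + x4 + x5) / 6.7558 + 2.0


def icc(x1, x5):
    """Index of Current Economic Conditions from Q1 and Q5."""
    return (x1 + x5) / 2.6424 + 2.0


def ice(x2, x3, x4):
    """Index of Consumer Expectations from Q2-Q4."""
    return (x2 + x3 + x4) / 4.1134 + 2.0


if __name__ == "__main__":
    # Hypothetical relative scores for one month.
    scores = [110.0, 120.0, 95.0, 90.0, 140.0]
    print(round(ics(*scores), 1))
    print(round(icc(scores[0], scores[4]), 1))
    print(round(ice(scores[1], scores[2], scores[3]), 1))
```

A Web-based sentiment index could follow the same structure, replacing survey percentages with the share of favorable and unfavorable posts.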
229 © 2005 229
Michigan CSI: Research Opportunities
• Automated Web collection and sentiment analysis of consumer confidence
• What forums, blogs, etc. to collect, and where?
• Experimental validation (correlation) of historical Michigan CSI vs. past sentiment of Web blogs and forums
• Experiment on world events-Web sentiment correlation
• Company Web sentiment index (based on news, blogs, forums) vs. stock performance? Contrarian Sentiment Index for stock prediction?
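The validation experiment proposed above comes down to correlating two time series, possibly with a lead or lag. A minimal sketch using Pearson correlation; the function names and the two series are hypothetical, not real Michigan CSI or Web sentiment data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def lagged_correlation(web_sentiment, survey_index, lag):
    """Correlate web sentiment at month t with the survey index at month t + lag (lag >= 0)."""
    if lag == 0:
        return pearson(web_sentiment, survey_index)
    return pearson(web_sentiment[:-lag], survey_index[lag:])


# Hypothetical monthly series in which the survey index trails web sentiment by one month.
WEB_SENTIMENT = [1.0, 3.0, 2.0, 5.0, 4.0, 6.0]
SURVEY_INDEX = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0]

if __name__ == "__main__":
    for lag in (0, 1, 2):
        print(lag, round(lagged_correlation(WEB_SENTIMENT, SURVEY_INDEX, lag), 3))
```

A higher correlation at a positive lag would suggest that Web sentiment leads the survey, which is what makes it interesting as an early indicator.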
230 © 2005 230
BrandIndex.com
• A UK-based company; tracking over 1,100 consumer brands across 32 sectors on a 7-point scale
• Based on 2,000 online interviews/surveys per day from a panel of 200,000 (polling research)
• Seven points: buzz, general impression, quality, value, satisfaction, recommend, corporate reputation.
231 © 2005 231
BrandIndex.com: Research Opportunities
• Web-based collection and sentiment analysis of product comments (news, forums, blogs)
• Correlating with breaking news and events on products and companies
• Correlating with Epinions.com consumer sentiment evaluations
• Automating analysis of specific critiques of products and reasons
232 © 2005 232
Marketing Tools
• Most vendors have developed online survey and marketing analysis tools for companies
• MarketTools: Online survey tools and communities; claim to have Internet text analysis ability for 50M sites (no evidence)
• TrendIQ: Analyze market shares, buzz trends, sentiment scoring, relationship ID, Internet share analysis, etc. Some graphing tools, but little evidence of capabilities or success (web site and results unimpressive)
• PeopleTrend.com: Powered by TrendIQ; Presidential Election Heat Map, CEOs of large companies (unimpressive)
• Passenger: builds online branded communities
233 © 2005 233
Marketing Tools: Research Opportunities
• Need to focus on convincing business cases and scenarios
• Need to provide good (understandable, insightful) visualizations for results
234 © 2005 234
Web Sentiment and Opinions
• Many blog creation and hosting sites
• eBlogger: blog creation
• WordPress, Blogspot
• Where to find top bloggers in selected topics?
• Where to identify major forums for product, company, event, etc. opinions?
235 © 2005 235
ProgrammableWeb.com
• A major hub for Web Mashups; More than 650 Web APIs and 2700 Mashups; API Directory, Mashup Directory, Market Trends, Major Players
• Top Mashup Tags grouped by category
• Web 2.0 API Directory grouped by category, e.g., advertising, news, sports, health, maps, etc.
• Some major APIs: Financial APIs (25), News (10), Government (13), Medical (5), Shopping (32), Sports (5), etc. Each API site has detailed API information and examples for implementation.
• Most popular Mashups; Searching tag cloud
236 © 2005 236
ProgrammableWeb.com: Research
Opportunities
• Excellent site for identifying data sources for various applications, e.g., business, sports, medical, etc.
• Good integration of data sources and visualization for web
• What data/web mining opportunities?
237 © 2005 237
Technorati.com
• Many useful blog resources and data: Top 100 blogs, Top Tags, Popular, Ping, Widgets, Watchlist, Photos, Videos, etc.
• Blog directory grouped by topics, including: business, economy, stocks, sports, consumer products, health, politics, etc.
• Top news, videos, movies, etc.
• Top Tags for each blog (tag cloud)
• Automatic Ping support
• Mentions of tagged topic by day
• Widgets: blog searching and info, pinging
238 © 2005 238
Technorati.com: Research Opportunities
• Automatic spidering of top blogs and contents by topics
• Pinging of new content
• Promising for products, companies, politics, health topics, environmental issues
• Trackback of popular blogs to develop social networks of communities
• How about international, multilingual blogs, e.g., Taiwan, Japan, Arabic, etc.?
• How about analysis of popular videos and tags for specific products?
239 © 2005 239
EPinions.com
• A service of Shopping.com (an eBay company); members are paid to provide quality, meaningful web reviews/comments for various product categories; uses a Web of Trust (of trusted people)
• Reviews are grouped by category, e.g., computers, cars, cameras, personal finance, sports, etc.
• Most reviews contain Rating (overall 1-5 and sub-categories), Pros, Cons, and free-text comments (specific to product)
• Reviews also link to product information, e.g., specs, pricing, vendors, etc.
240 © 2005 240
EPinions.com: Research Opportunities
• Excellent source for training English sentiment polarity analysis algorithms – correlating free-text comments with rating scores
• Immediate experiment on English product-specific sentiment analysis algorithms
• Generic polarity analysis engine or product-specific polarity analysis engine?
• How to identify product feature like/dislike and reasoning based on product specs information (what do you like about this)?
• What about other languages, e.g., Chinese (Taiwan), Arabic, etc.?
• What about sentiment visualization?
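A sketch of the training idea above: use the numeric rating to label each free-text review, then train a polarity classifier on the text. The classifier here is a small multinomial Naive Bayes with add-one smoothing, chosen for brevity rather than because the slides prescribe it; the reviews and the rating thresholds are hypothetical.

```python
import math
from collections import Counter

# Hypothetical reviews: (overall rating 1-5, free-text comment).
REVIEWS = [
    (5, "great camera sharp pictures great battery"),
    (4, "good value easy to use"),
    (1, "terrible battery broke quickly"),
    (2, "poor pictures bad value"),
    (3, "average camera"),
]


def label_from_rating(rating):
    """Assumed thresholds: 4-5 is positive, 1-2 is negative, 3 is left unlabeled."""
    if rating >= 4:
        return "pos"
    if rating <= 2:
        return "neg"
    return None


def train(reviews):
    """Count words and documents per polarity label."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for rating, text in reviews:
        label = label_from_rating(rating)
        if label is None:
            continue
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the label with the higher Naive Bayes log-probability."""
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("pos", "neg"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    model = train(REVIEWS)
    print(classify("great pictures", *model))
    print(classify("terrible battery", *model))
```

Training separately per product category versus pooling all reviews is one way to explore the generic versus product-specific engine question raised above.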
241 © 2005 241
Future Directions
• How/where to identify Web data sources for various topics (business, company, product, health, politics, environment)?
• What are the major news sources, forums, and bloggers?
• Need to develop and test sentiment analysis algorithms for various topics
• Need to focus on selected topics: company/product, environment, politics, health
• How about Taiwanese and Arabic contents?
• How about visualization?
242 © 2005 242
Future Directions
• Web Sentiment Index for companies and products
• WalMart Corporate Sentiment Tracking (“Save Money Live Better”; Go Green Expo)
• Green 100 Index based on environmental concerns and activities
• Online branding community tracking
243 © 2005 243
Forming Research Partnership
• Great research opportunities for data, text and web mining for finance, accounting, and marketing applications: Business Intelligence 2.0
• Need for domain expertise and problem framing, e.g., ERM, Corporate Governance, Consumer Sentiment, etc.
• Much progress in computational techniques, e.g., web site/forum/blog spidering, text indexing, sentiment analysis, classification techniques, visualizations and integrated systems
• Publications and industry opportunities!!!
244 © 2005
Hsinchun Chen …
Artificial Intelligence Lab, Dark Web
Project …
http://ai.arizona.edu …