Page 1: Data, Text and Web Mining for Finance, - Department Business Intelligence Mining in Web 2.0: Data, Text and Web Mining for Finance, ... Web mining to be “the discovery and analysis

1 © 2005

“Business Intelligence Mining in Web 2.0:

Data, Text and Web Mining for Finance,

Accounting and Marketing Applications”

Hsinchun Chen, Ph.D.

Director, Artificial Intelligence Lab

Director, NSF COPLINK and Dark Web Research Centers

University of Arizona

Acknowledgements: NSF, LOC, ITIC/KDD, DHS, DOJ


My Background

• NCTU, SUNY Buffalo, NYU, U Arizona (MIS #4)

• MS, MIS, Design Science, AI, Search Engine, Digital Library, Medical Informatics, Intelligence & Security Informatics, Business Intelligence

• AI Lab, 25+ researchers; $25M funding ($1.5M/year), 180 top SCI papers (20+ papers/year); DL (#1), MIS (#8); Scientific Advisor: NLC, NLM, Academia Sinica; Chair, ICADL, IEEE ISI

• AE in ten top SCI journals, IEEE and AAAS Fellow

• DL/SE; GeneScene & BioPortal; COPLINK & Dark Web (NYT, USA Today, Associated Press, etc.); Knowledge Computing Corporation ($100M)

• Business Intelligence Mining???


The Peta Age

The End of Theory


Outline

• Web 2.0 + Data Mining, Text mining, Web mining

• Intelligence and Security Informatics

• Case Studies, Examples, and Lessons Learned: Business Intelligence Data, Text and Web mining

• Opportunities and Future Directions: Finance, Accounting, and Marketing Applications


Web 2.0, Data Mining, Text Mining, and Web Mining


Web 2.0, by O’Reilly

• http://www.oreilly.com, “What is Web 2.0? Design Patterns and Business Models for the Next Generation of Software,” by Tim O’Reilly, 9/30/2005 (O’Reilly Media Web 2.0 Conference, 2004)

• Examples of Web 2.0: Google AdSense, Flickr, Napster, Wikipedia, blogging, search engine optimization, web services, participation, tagging (folksonomy), syndication, etc.


Web 2.0, by O’Reilly

• Strategic positioning: “The Web as Platform”

• User positioning: “You control your own data”

• Core competencies:

– Services, not packaged software

– Architecture of participation

– Cost-effective scalability

– Remixable data sources and data transformations

– Software above the level of a single device

– Harnessing collective intelligence


Web 2.0 Lessons

• The value of the software is proportional to the scale and dynamism of the data it helps to manage.

• Leverage customer self-service and algorithmic data management to reach out to the entire web, to the edges and not just the center, to the long tail and not just the head.

• The service automatically gets better the more people use it.

• Blogging and the wisdom of the crowds.

• Network effects from user participation are the key to market dominance in the Web 2.0 era.

• We, the media.

• Data is the next Intel inside.


Web 2.0 Lessons (cont’d)

• Operations must become a core competency.

• The perpetual beta.

• Support lightweight programming models that allow for loosely coupled systems. (SOAP, REST, AJAX, etc.)

• Think syndication, not coordination.

• Innovation in assembly. The Mashups.

• Design for “hackability” and remixability.

• Some rights reserved.


Web 2.0, Wikipedia

• “Web 2.0 is a trend in the use of the WWW technology and web design that aims to facilitate creativity, information sharing, and collaboration among users.”

• “Web 2.0 is the business revolution in the computer industry caused by the move to the Internet as platform, and an attempt to understand the rules for success on that new platform.”


Web 2.0 Characteristics

• Rich user experience

• User participation

• Dynamic content

• Metadata

• Web standards and scalability

• Openness

• Freedom

• Collective intelligence


Web 2.0 Features/Technologies

• Technological infrastructure: server software, content syndication, messaging protocols, browsers with plug-ins and extensions, various client applications.

• Cascading Style Sheets (CSS) to separate presentation from content

• Folksonomy (collective tagging)

• Microformats extending pages with semantics

• REST, XML, JSON based APIs

• Rich Internet application techniques based on AJAX

• RSS or Atom feeds for syndication and notification of data

• Mashups of content from different sources

• Weblog publishing, and wikis


Web 2.0 Criticism

• “Web 2.0 as a piece of jargon,” by Tim Berners-Lee

• “A second bubble”

• “Bubble 2.0”

• “A mere augmentation of current cultural information exchanges that are bound by existing political and societal structures.”


Web Programming with Amazon, Google, and eBay APIs


What are Web Services?

• Web Services:

– A new way to reuse and integrate third-party software or legacy systems

– Works no matter where the software is, which platform it resides on, or which language it was written in

– Based on XML and Internet protocols (HTTP, SMTP…)

• Benefits:

– Ease of integration

– Develop applications faster


Web Services Architecture

• Simple Object Access Protocol (SOAP)

• Web Services Description Language (WSDL)

• Universal Description, Discovery and Integration (UDDI)


New Breeds of Web Services

• Representational State Transfer (REST)

– Use the HTTP GET method to invoke remote services (the request is a plain URL, not an XML message)

– The response of the remote service can be in XML or any textual format

– Benefits:

• Easy to develop

• Easy to debug (with a standard browser)

• Leverage existing web application infrastructure
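
As a rough illustration of the REST style described above, the sketch below encodes a service call in the URL of an HTTP GET request and reads an XML reply. The endpoint `https://api.example.com/search`, the parameter names `q` and `max`, and the sample response are hypothetical placeholders rather than any provider's real API, and no network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.com/search"

def build_request_url(query, max_results=10):
    """Encode the whole service call in the URL of an HTTP GET request."""
    return BASE_URL + "?" + urlencode({"q": query, "max": max_results})

def parse_titles(xml_text):
    """Pull the <title> of every <item> out of an XML response body."""
    root = ET.fromstring(xml_text)
    return [item.findtext("title", default="") for item in root.iter("item")]

# Made-up response standing in for what a remote service might return.
SAMPLE_RESPONSE = """<results>
  <item><title>Web mining survey</title></item>
  <item><title>Text mining primer</title></item>
</results>"""

url = build_request_url("web mining", max_results=2)
titles = parse_titles(SAMPLE_RESPONSE)
print(url)
print(titles)
```

Because the request is just a URL, it can also be pasted into a browser, which is the "easy to debug" benefit listed above.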


Server Responses in REST

• Really Simple Syndication (RSS, Atom)

– XML-based standard

– Designed for news-oriented websites to “Push” content to readers

– Excellent to monitor new content from websites

• JavaScript Object Notation (JSON)

– Lightweight data-interchange format

– Human readable and writable and also machine friendly

– Wide support from most languages (Java, C, C#, PHP, Ruby, Python…)
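
To make the two response formats concrete, here is a minimal sketch that reads the same two items from an RSS 2.0 feed and from a JSON document using only the Python standard library. The feed contents and the JSON layout (an `items` array with `title` and `link` fields) are invented for illustration.

```python
import json
import xml.etree.ElementTree as ET

# Made-up RSS 2.0 feed, for illustration only.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item><title>First story</title><link>http://example.com/1</link></item>
    <item><title>Second story</title><link>http://example.com/2</link></item>
  </channel>
</rss>"""

# The same items expressed as JSON (hypothetical layout).
SAMPLE_JSON = (
    '{"items": ['
    '{"title": "First story", "link": "http://example.com/1"}, '
    '{"title": "Second story", "link": "http://example.com/2"}]}'
)

def rss_items(rss_text):
    """Return (title, link) pairs for every <item> in an RSS feed."""
    channel = ET.fromstring(rss_text).find("channel")
    return [(i.findtext("title"), i.findtext("link")) for i in channel.findall("item")]

def json_items(json_text):
    """Return (title, link) pairs from the JSON document."""
    return [(i["title"], i["link"]) for i in json.loads(json_text)["items"]]

rss_result = rss_items(SAMPLE_RSS)
json_result = json_items(SAMPLE_JSON)
print(rss_result == json_result)
```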


Rich Interactivity Web - AJAX

• AJAX: Asynchronous JavaScript + XML

• AJAX incorporates:

– standards-based presentation using XHTML and CSS;

– dynamic display and interaction using the Document Object Model;

– data interchange and manipulation using XML and XSLT;

– asynchronous data retrieval using XMLHttpRequest;

– and JavaScript binding everything together.

• Examples:

– http://www.gmail.com

– http://www.kiko.com

• More info: http://www.adaptivepath.com/publications/essays/archives/000385.php


AJAX Application Model


Amazon Web Services (AWS)

• Amazon E-Commerce Service

– Search catalog, retrieve product information, images and customer reviews

– Retrieve wish list, wedding registry…

– Search seller and offer

• Alexa Services

– Retrieve information such as site rank, traffic rank, thumbnail, and related sites, among others, given a target URL

• Amazon Historical Pricing

– Programmatic access to over three years of actual sales data

• Amazon Simple Queue and Storage Service

– A distributed resource manager to store web services results

• Amazon Elastic Compute Cloud

– Sell computing capacity by the amount you use


Google Web APIs

• Google has a long list of APIs

– http://code.google.com/apis/

• Google Search

– AJAX Search API

– SOAP Search API (deprecated)

– Custom search engine with Google Co-op

• Google Map API

• Google Data API (GData)

– Blogger, Google Base, Calendar, Gmail, Spreadsheets, and a lot more

• Google Talk XMPP for communication and IM

• Google Translation (http://www.oreillynet.com/pub/h/4807)

• Many more undocumented/unlisted APIs to be discovered in Google Blog


eBay API

• Buyers:

– Get the current list of eBay categories

– View information about items listed on eBay

– Display eBay listings on other sites

– Leave feedback about other users at the conclusion of a commerce transaction

• Sellers:

– Submit items for listing on eBay

– Get high bidder information for items you are selling

– Retrieve lists of items a particular user is currently selling through eBay

– Retrieve lists of items a particular user has bid on


Other Services/APIs Providers

• Yahoo! http://developer.yahoo.com/

– Search (web, news, video, audio, image…)

– Flickr, del.icio.us, MyWeb, Answers API

• Windows Live http://msdn2.microsoft.com/en-us/live/default.aspx

– Search (SOAP, REST)

– Spaces (blog), Virtual Earth, Live ID

• Wikipedia

– Downloadable database http://en.wikipedia.org/wiki/Wikipedia:Technical_FAQ#Is_it_possible_to_download_the_contents_of_Wikipedia.3F

• Many more at Programmableweb.com

– http://www.programmableweb.com/apis


Services by Category

• Search

– Google, MSN, Yahoo

• E-Commerce

– Amazon, eBay, Google Checkout

– TechBargain, DealSea, FatWallet

• Mapping

– Google, Yahoo!, Microsoft

• Community

– Blogger, MySpace, MyWeb

– del.icio.us, StumbleUpon

• Photo/ Video

– YouTube, Google Video, Flickr

• Identity/ Authentication

– Microsoft, Google, Yahoo

• News

– Various news feed websites including Reuters, Yahoo! and many more.


Mashup: A Novel Form of Web Reuse

• “A mashup is a website or application that combines content from more than one source into an integrated experience.” – Wikipedia

• API X + API Y = mashup Z

• Business model: Advertisement
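
The "API X + API Y = mashup Z" formula can be sketched as a simple join of two data sources. Both payloads below are hypothetical stand-ins for what a listings API and a geocoding API might return; the field names are assumptions made for this example.

```python
# Hypothetical payloads standing in for the responses of two separate APIs.
listings_from_api_x = [
    {"item": "Used textbook", "price": 25.0, "city": "Tucson"},
    {"item": "Bicycle", "price": 120.0, "city": "Phoenix"},
    {"item": "Desk lamp", "price": 15.0, "city": "Flagstaff"},
]
coordinates_from_api_y = {
    "Tucson": (32.22, -110.97),
    "Phoenix": (33.45, -112.07),
}

def mashup(listings, coordinates):
    """Attach map coordinates (API Y) to each listing (API X); skip unmatched cities."""
    combined = []
    for listing in listings:
        location = coordinates.get(listing["city"])
        if location is not None:
            combined.append({**listing, "lat": location[0], "lon": location[1]})
    return combined

map_markers = mashup(listings_from_api_x, coordinates_from_api_y)
for marker in map_markers:
    print(marker)
```

The combined records are what a map-based mashup page would plot; neither source alone contains both the listing and its position.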


Web Mining: Machine Learning for Web Applications

Hsinchun Chen and Michael Chau

ARIST, 38, 2004


What is Web Mining?

• The term Web Mining was coined by Etzioni (1996) to denote the use of Data Mining techniques to automatically discover Web documents and services, extract information from Web resources, and uncover general patterns on the Web.

• In this article, we have adopted a broad definition that considers Web mining to be “the discovery and analysis of useful information from the World Wide Web” (Cooley et al., 1997).

• Also, web mining research overlaps substantially with other areas, including data mining, text mining, information retrieval, and web retrieval.


Machine Learning Paradigms

• Machine learning algorithms can be classified as

– Supervised learning: Training examples contain input/output pair patterns. Learn how to predict the output values of new examples.

– Unsupervised learning: Training examples contain only the input patterns and no explicit target output. The learning algorithm needs to generalize from the input patterns to discover the output values.

• We have identified the following five major Machine Learning paradigms:

– Probabilistic models

– Symbolic learning and rule induction

– Neural networks

– Analytic learning and fuzzy logic

– Evolution-based models

• Hybrid approaches
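
A small sketch may help contrast the supervised and unsupervised settings described on this slide. It uses a 1-nearest-neighbor rule for the supervised case and a two-cluster k-means for the unsupervised case; both algorithms and the toy numbers are choices made for this illustration and are not taken from the slides.

```python
def nearest_neighbor_predict(train, x):
    """Supervised: train holds (input, label) pairs; return the label of the closest input."""
    _, closest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return closest_label

def two_means(values, iterations=10):
    """Unsupervised: split unlabeled 1-D values into two clusters (k-means with k=2)."""
    low, high = min(values), max(values)  # initial centroids
    for _ in range(iterations):
        cluster_low = [v for v in values if abs(v - low) <= abs(v - high)]
        cluster_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(cluster_low) / len(cluster_low)
        high = sum(cluster_high) / len(cluster_high)
    return sorted(cluster_low), sorted(cluster_high)

# Supervised: each training example carries its target output.
labeled = [(1.0, "short"), (1.2, "short"), (8.0, "long"), (9.1, "long")]
prediction = nearest_neighbor_predict(labeled, 7.5)

# Unsupervised: only inputs are given; the grouping has to be discovered.
unlabeled = [1.0, 1.2, 0.8, 8.0, 9.1, 8.5]
clusters = two_means(unlabeled)
print(prediction, clusters)
```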


Machine Learning for Information Retrieval: Pre-Web

• Learning techniques had been applied in Information Retrieval (IR) applications long before the recent advances of the Web.

• In this section, we will briefly survey some of the research in this area, covering the use of Machine Learning in

– Information extraction

– Relevance feedback

– Information filtering

– Text classification and text clustering


Web Mining

• Web Mining research can be classified into three categories:

– Web content mining refers to the discovery of useful information from Web contents, including text, images, audio, video, etc.

– Web structure mining studies the model underlying the link structures of the Web. It has been used for search engine result ranking and other Web applications (e.g., Brin & Page, 1998; Kleinberg, 1998).

– Web usage mining focuses on using data mining techniques to analyze search logs to find interesting patterns. One of the main applications of Web usage mining is its use to learn user profiles (e.g., Armstrong et al., 1995; Wasfi et al., 1999).
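
As one concrete instance of Web structure mining, the sketch below runs a simplified PageRank-style power iteration (the idea associated with the Brin & Page citation above) over a four-page toy graph. The damping factor of 0.85, the iteration count, and the graph itself are assumptions for illustration; this is not a description of any production ranking system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Each page passes an equal share of its rank to its targets.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy link graph: C is linked to by A, B and D; D has no in-links.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Pages that receive links from many or highly ranked pages end up with higher scores, which is why link structure is useful for result ranking.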


Intelligence and Security Informatics: COPLINK and Dark Web


• Intelligence and Security Informatics (ISI): “Development of advanced information technologies, systems, algorithms, and databases for national security related applications, through an integrated technological, organizational, and policy-based approach” (Chen et al., 2003a)

• Data, text, and web mining

• From COPLINK to Dark Web

H. Chen, computer scientist, artificial intelligence, U. of Arizona (2006)


A knowledge discovery research

framework for ISI



ISI Research: KDD Techniques

• Information Sharing and Collaboration

• Crime Association Mining

• Crime Classification and Clustering

• Intelligence Text Mining

• Crime Spatial and Temporal Mining

• Criminal Network Analysis


COPLINK

• 1996-, DOJ, NIJ, NSF, ITIC, DHS

• Connect

• Detect

• Agent

• STV (Spatio-Temporal Visualization)

• CAN (Criminal Activity Network)

• BorderSafe (Mutual Information)

• AI Lab Knowledge Computing Corporation

• Tucson, Phoenix AZ 1600 agencies, 20 states


• Newsweek Magazine, March 3, 2003

• ABC News, April 15, 2003

• The New York Times, November 2, 2002


Dark Web

• 2002-, ITIC, NSF, LOC

• Discussions: FBI, DOD/Dept of Army, NSA, DHS

• Connection:

– Web site spidering

– Forum spidering

– Video spidering

• Analysis and Visualization:

– Link and content analysis (web sites)

– Web metrics analysis (web site sophistication)

– Authorship analysis (forums; CyberGate)

– Sentiment analysis (forums; CyberGate)

– Video coding and analysis (videos; MCT)
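
The spidering and link-analysis steps listed above both start from extracting hyperlinks out of fetched pages. The following is a minimal sketch of that step using Python's standard `html.parser`; the sample page and the `example.com` and `example.org` hosts are placeholders, and this is not the Dark Web project's own crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html_text, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    return parser.links

def split_by_host(links, base_url):
    """Separate links that stay on the same host from links to other sites."""
    host = urlsplit(base_url).netloc
    internal = [u for u in links if urlsplit(u).netloc == host]
    external = [u for u in links if urlsplit(u).netloc != host]
    return internal, external

# Placeholder page content and hosts, for illustration only.
PAGE_URL = "http://site.example.com/home.html"
SAMPLE_PAGE = '<a href="/forum/index.html">Forum</a> <a href="http://other.example.org/">Partner</a>'

links = extract_links(SAMPLE_PAGE, PAGE_URL)
internal, external = split_by_host(links, PAGE_URL)
print(internal, external)
```

In a spider, internal links would typically feed the crawl queue, while external links provide the site-to-site edges used in link analysis.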


The Dark Web project in the Press

• Project Seeks to Track Terror Web Posts, 11/11/2007

• Researchers say tool could trace online posts to terrorists, 11/11/2007

• Mathematicians Work to Help Track Terrorist Activity, 9/14/2007

• Team from the University of Arizona identifies and tracks terrorists on the Web, 9/10/2007


COPLINK Connect

Consolidating & Sharing Information promotes problem solving and collaboration

Data sources shown in the figure: Records Management Systems (RMS); Mugshots Database; Gang Database


COPLINK Detect

Consolidated information enables targeted problem solving via powerful investigative criminal association analysis


COPLINK Detect 2.0/2.5


Association Retrieval and Visualization


Spatio-temporal Analysis and Visualization


Border Crossing

• An aerial photograph of a typical U.S. port of entry (southern border).

• Vehicle lanes are backed up with dozens of vehicles during peak times.

• Criminal vehicles operate in groups.

– If one is caught others turn back into Mexico.

• They may join the lines one at a time or use turn-out points.

Labels on the photograph: Vehicle lanes; Turn-out points; Port of Entry (Check points)

© 2006 Google – Imagery © 2006 DigitalGlobe, Map data ©2006 NAVTEQTM


A Vehicle to Watch (via SNA)?

Shape Indicates Object Type: circles are people, rectangles are vehicles

Color Denotes Activity History: Violent crimes, Narcotics crimes, Violent & Narcotics, Gang related

Larger Size Indicates higher levels of activity

Border Crossing Plates are outlined in Red


Dark Web Collection

Where/how to find them?


Web Site Example: Links to Multimedia and Manuals

Source: http://www.al-ghazawat.110mb.com/, French and Arabic Web Site

• Link to “The General of Islam” Radio Station

• Azzam Speeches

• Berg beheading

• Videos of Zarqawi

• Others

• Complete 65-page manual of a 50 caliber rifle in PDF


Web Site Example: Links to Web Sites and Forums

• Links to Several Iraqi Jihadist Web Sites and Forums

• Source: http://almaaber.jeeran.com/, Arabic Web Site


Web 2.0 Example: Blog

On a personal blog, http://salafnews.wordpress.com/, the blogger provides links to many Islamic Jihadi video clips he posted on YouTube.

By using selected Arabic lexicons, we also find quite a few terrorism-related videos on YouTube.

The blogger keeps posting new videos on YouTube even after his previous videos were removed by YouTube.


Web 2.0 Example: Second Life

• Recently, public media such as The Economist and The Australian reported that Jihadists have set up ‘residents’ in Second Life, a famous online 3D game.

• We do find several extremist groups in the game.

“AS SL's premier terrorist roleplay group, our assumed identity, is that of the Al-Quaeda style terrorist group, fighting a just and holy crusade against the government of a distant tyrannical, imperialistic and overbearing superpower…”

Group: Terrorist of SL


System Design


Middle East Terrorist Web Collection File Type Breakdown

• Dynamic files (e.g., PHP, ASP, JSP, etc.) are widely used in terrorist Web sites, indicating a high level of technical sophistication.

• Multimedia is also heavily used in terrorist Web sites.

Terrorist Collection      # of Files     Volume (Bytes)
Total                        222,687     12,362,050,865
Indexable Files              179,223      4,854,971,043
  HTML Files                  44,334      1,137,725,685
  Word Files                     278         16,371,586
  PDF Files                    3,145        542,061,545
  Dynamic Files              130,972      3,106,537,495
  Text Files                     390         45,982,886
  Powerpoint Files                 6          6,087,168
  XML Files                       98            204,678
Multimedia Files              35,164      5,915,442,276
  Image Files                 31,691        525,986,847
  Audio Files                  2,554      3,750,390,404
  Video Files                    919      1,230,046,468
Archive Files                  1,281        483,138,149
Non-Standard Files             7,019      1,108,499,397

Pie chart, Number of Files Distribution (Arabic) (Terrorist): Indexable Files 80%, Multimedia Files 16%, Archive Files 0%, Non-Standard Files 4%

Pie chart, Volume Distribution (Arabic) (Terrorist): Indexable Files 39%, Multimedia Files 48%, Archive Files 4%, Non-Standard Files 9%
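
The shares behind the file-type breakdown can be recomputed directly from the counts in the table above. The short sketch below does that arithmetic; the figures are copied from the table, while the variable names are just for this example.

```python
# File counts and byte volumes copied from the table above.
collection = {
    "Indexable": (179_223, 4_854_971_043),
    "Multimedia": (35_164, 5_915_442_276),
    "Archive": (1_281, 483_138_149),
    "Non-Standard": (7_019, 1_108_499_397),
}

total_files = sum(files for files, _ in collection.values())
total_bytes = sum(volume for _, volume in collection.values())

for name, (files, volume) in collection.items():
    print(f"{name:<13}{files / total_files:7.1%} of files {volume / total_bytes:7.1%} of bytes")

# Dynamic files (PHP, ASP, JSP, ...) as a fraction of all indexable files.
dynamic_share = 130_972 / collection["Indexable"][0]
print(f"Dynamic files: {dynamic_share:.1%} of indexable files")
```

The four categories add up to the reported totals, and dynamic pages make up roughly 73% of the indexable files, which is the basis for the first bullet on this slide.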


Dark Web Forums Identification

Bar chart: Forum Identification -- Overall Distribution by ISP Providers (horizontal axis: # of Forums, 0 to 120)

                 Middle-Eastern   Latin-American   US Domestic
Websites               48                4              18
Yahoo! Groups          20               11              31
Google Groups           0               32              47
MSN                     0                5               9
AOL                     0                0               5
Local ISP               0                8               0


Dark Web Analysis and Visualization


System Design: CyberGate System Design


Social Network and Content Analysis

Who links to whom and who influences whom?

How are the sites used?

Which sites are more sophisticated?


MDS Visualization of Arab Group Web Sites

Clusters labeled in the figure: Hizb-Ut-Tahrir; Jihad Supporters; Palestinian supporters; Hizballah Cluster; Palestinian terrorists


Comparison - Content Analysis

Figure: U.S. Domestic Terrorist Web sites. Vertical axis: Normalized Content Levels (0 to 0.9). Groups: Black Separatists, Christian Identity, Militia, Neo-confederates, Neo-Nazis/White Supremacists, Eco-Terrorism. Series: Communications, Fundraising, Ideology, Propaganda (insiders), Propaganda (outsiders), Virtual Community, Command and Control, Recruitment and Training.

Figure: Middle Eastern Terrorist Web sites. Vertical axis: Normalized Content Levels (0 to 1). Groups: Hizb-ut-Tahrir, Hizbollah, Al-Qaeda Linked Websites, Jihad Sympathizers, Palestinian terrorist groups. Series: Communications, Fundraising, Sharing Ideology, Propaganda (insiders), Propaganda (outsiders), Virtual Community, Command and Control, Recruitment and Training.


Sentiment Analysis

Which forums and who are more violent and radical?


Model Building – Training Data Annotation

Each annotated example has a Coding (Sentiment, Racism, Hate, Anger, Violence), an English Translation, and the Arabic text.

Example 1
Coding: Sentiment 0.4 (positive to God); Racism 1.0; Hate 0.6; Anger 0.2; Violence 0.3
English Translation: In the name of God the most merciful, leading the faithful to victory and defeating the unbelievers and polytheists
Arabic: بسم الله الرحمن الرحيم الحمد لله ناصر المؤمنين وهازم الكفرة والمشركين

Example 2
Coding: Sentiment -0.5 (negative to America and its collaborators); Racism 0.6; Hate 0.8; Anger 0.4; Violence 0
English Translation: We say to America and its collaborators: live in horror
Arabic: نقول لأمريكا وعملائها عيشوا على الرعب

Example 3
Coding: Sentiment -0.4 (negative to the enemies); Racism 0; Hate 0; Anger 0.4; Violence 1.0
English Translation: Oh God, destroy your enemies and the enemies of Muslims
Arabic: اللهم دمر أعداءك وأعداء المسلمين

Example 4
Coding: Sentiment 0.4 (positive to Jihad); Racism 0; Hate 0; Anger 0; Violence 0.2
English Translation: Jihad is fighting God’s enemies
Arabic: الجهاد هو قتال أعداء الله

63 © 2005

7. Results: Intensity Scores

U.S. and Middle Eastern Intensity Scores

U.S. Forum | Racism | Violence | Middle Eastern Forum | Racism | Violence
Angelic Adolf | 5.513 | 0.962 | Azzamy | 30.182 | 19.833
Aryan Nation | 9.921 | 5.683 | Friends | 2.076 | 6.238
CCNU | 3.712 | 14.546 | Islamic Union | 2.657 | 9.198
Neo-Nazi | 5.458 | 5.614 | Kataeb | 2.610 | 6.605
NSM | 10.740 | 10.740 | Kataeb Qassam | 25.203 | 18.670
Smash Nazi | 12.424 | 10.591 | Taybah | 14.989 | 15.348
White Knights | 19.313 | 6.353 | Osama Lover | 14.369 | 14.584
World Knights | 2.468 | 2.234 | Wa Islamah | 4.075 | 9.193
All Forums | 10.988 | 6.902 | All Forums | 11.892 | 12.644

64 © 2005

7. Results: Intensity Relationship

[Figure: U.S. Forum Scores. Scatter plot of Violence Scores against Hate Scores, both axes 0 to 400.]

[Figure: Middle Eastern Forum Scores. Scatter plot of Violence Scores against Hate Scores, both axes 0 to 400.]

Affect Regression Analysis: Message Level

Statistic | U.S. | Middle Eastern
N | 4676 | 3349
beta (slope) | 0.079 | 0.682
t-Stat | 21.354 | 48.265
P-Value | 0.000 | 0.000
R-Square | 0.076 | 0.486

Strong hate and violence correlation, especially for the Middle Eastern group.

65 © 2005

7. Results: Intensity Relationship

Affect Regression Analysis: Forum Level

Statistic | U.S. | Middle Eastern
N | 8 | 8
beta (slope) | 0.347 | 0.471
t-Stat | 1.760 | 10.306
P-Value | 0.139 | 0.000
R-Square | 0.383 | 0.947

[Figure: two forum-level scatter plots, U.S. and Middle Eastern. Axes: Violence (0 to 20) and Racism (0 to 35).]
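The forum-level fit can be sketched directly from the eight Middle Eastern forum rows of the intensity score table. This is a minimal illustration, not the authors' code: it assumes an ordinary least squares fit with racism as the predictor and violence as the response (the slides do not state the orientation; this one yields a slope and R-Square close to the reported Middle Eastern values), and the function name `simple_regression` is made up for the example.

```python
from math import sqrt


def simple_regression(x, y):
    """Least squares fit of y = a + b*x.

    Returns (slope, intercept, r_square, t_stat of the slope).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_square = (sxy * sxy) / (sxx * syy)
    sse = syy - slope * sxy                 # residual sum of squares
    std_err = sqrt(sse / (n - 2) / sxx)     # standard error of the slope
    return slope, intercept, r_square, slope / std_err


# Middle Eastern forum scores copied from the intensity score table
# (the "All Forums" summary row is left out).
me_racism = [30.182, 2.076, 2.657, 2.610, 25.203, 14.989, 14.369, 4.075]
me_violence = [19.833, 6.238, 9.198, 6.605, 18.670, 15.348, 14.584, 9.193]

slope, intercept, r_square, t_stat = simple_regression(me_racism, me_violence)
print(f"slope={slope:.3f}  R-Square={r_square:.3f}  t={t_stat:.3f}")
```

With only eight forums per group, a single outlying forum can move the slope noticeably, which is one reason the U.S. fit is not significant at this level.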

66 © 2005

Number of Posts By Month

• Al-Firdaws consistently has between 2,500-3,000 posts per month since the second half of 2006.

• Montada very active in 2002 and 2005.

[Figure: Al-Firdaws Posts By Month. Y-axis: # posts (0 to 3500). X-axis: months from Jan-05 to Jul-07.]

[Figure: Montada Posts By Month. Y-axis: # posts (0 to 25000). X-axis: months from Sep-00 to May-07.]

67 © 2005

Affect Intensities – Temporal View

[Figure: four temporal panels. Al-Firdaws - Anger; Montada - Anger; Al-Firdaws - Violence; Montada - Violence.]

Al-Firdaws has considerably higher violence and also greater anger intensity.

68 © 2005

Authorship/Writeprint Analysis

Who are the opinion leaders and

where are they?

69 © 2005

Arabic Feature Set

Feature Set (418)

• Lexical (79)

– Char-Based (48): Char-Level (4), Letter Frequency (35), Special Char. (9)

– Word-Based (31): Word-Level (6), Vocab. Richness (8), Word Length Dist. (15), Elongation (2)

• Syntactic (262)

– Punctuation (12)

– Function Words (200)

– Word Roots (50)

• Structural (62)

– Word Structure (14): Message Level (5), Paragraph Level (6), Contact Information (3)

– Technical Structure (48): Font Color (29), Font Size (8), Embedded Images (4), Hyperlinks (7)

• Content Specific (15)

– Race/Nationality (11)

– Violence (4)

70 © 2005

Sliding Window Algorithm Illustration

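1. Compute eigenvectors for 2 principal components of feature group

2. Extract feature usage vectors (Z) from the message text

3. Transform into 2-dimensional space: x = Zx, y = Zy (the feature usage vector Z multiplied by the first and second eigenvector)

Repeat steps 2 and 3

[Figure: illustration with example values. Feature usage vectors Z: (1,0,0,2,1,2) and (0,1,3,0,1,0). Eigenvectors, one row per feature, columns x and y: (0.533, 0.956), (-0.541, 0.445), (0.034, 0.089), (0.653, 0.456), (0.975, -0.085), (0.143, -0.381).]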
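The transform step can be sketched with the example numbers from this slide. This is only an illustration under stated assumptions: each pair of numbers on the slide is read as one feature's loading on the first and second principal component, the eigenvectors are taken as given rather than computed, and the names `project` and `EIGENVECTORS` are made up for the example.

```python
def project(feature_vector, eigenvectors):
    """Map one feature usage vector Z to a point (x, y) in 2-D space.

    eigenvectors holds one (first component, second component) pair per feature.
    """
    x = sum(z * e[0] for z, e in zip(feature_vector, eigenvectors))
    y = sum(z * e[1] for z, e in zip(feature_vector, eigenvectors))
    return x, y


# Loadings copied from the slide (assumed layout: one row per feature).
EIGENVECTORS = [
    (0.533, 0.956),
    (-0.541, 0.445),
    (0.034, 0.089),
    (0.653, 0.456),
    (0.975, -0.085),
    (0.143, -0.381),
]

# Feature usage vectors for two successive windows of the message text.
windows = [(1, 0, 0, 2, 1, 2), (0, 1, 3, 0, 1, 0)]

# Steps 2 and 3 repeated per window give the sequence of plotted points.
points = [project(z, EIGENVECTORS) for z in windows]
for z, (x, y) in zip(windows, points):
    print(z, "->", round(x, 3), round(y, 3))
```

Plotting the resulting points in order, one per window position, would give the kind of pattern the following slides call a writeprint.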

71 © 2005

Anonymous Messages Author Writeprints

[Figure: writeprints for Author A (10 messages) and Author B (10 messages).]

72 © 2005

ClearGuidance.com

• Toronto plot forum Member Interaction Network – Blue nodes indicate members with the greatest number of in-links.

– These members are the core set of forum “experts” and propagandists
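Ranking members by in-links can be sketched as follows. The reply pairs and member names are invented for illustration, and treating a reply from one member to another as an in-link for the recipient is an assumption about how the interaction network was built.

```python
from collections import Counter


def in_link_counts(replies):
    """Count in-links per member from (source, target) reply pairs."""
    return Counter(target for _source, target in replies)


# Hypothetical interactions: (member who replied, member replied to).
replies = [
    ("m2", "m1"), ("m3", "m1"), ("m4", "m1"),
    ("m1", "m2"), ("m3", "m2"),
    ("m4", "m3"),
]

counts = in_link_counts(replies)
# The members with the most in-links form the candidate "core" set.
core = [member for member, _ in counts.most_common(2)]
print(counts, core)
```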

73 © 2005

Forum “Experts”

The series of overlapping circular patterns for bag-of-word

features indicates that the author’s discussion revolves around a

related set of topics.

Bag-of-words are predominantly

related to religious topics.

Many large red blots indicative

of the presence of features

unique to this author.

This author attempts to use his

religious “expertise”.

74 © 2005

This author was later arrested as a major culprit in

the Toronto terror plot (“Soldier of God”). He uses

many violent affect terms.

Radar chart showing violent

affect feature usages.

Comparison to mean shows

several high occurrence terms

(e.g., jihad, martyrdom).

Selected feature is use of the term

“jihad”, which is the highest in

the forum.

Text annotation view showing

key bag-of-words highlighted.

Selected feature (i.e., “jihad”) is

shown in red.

This author constantly attempts

to justify acts of violence and

terrorism. “…there are so many paid sheikhs

stuck in this life….no point going to

them for fatwas…personally

speaking…cuz they don’t even

agree with jihad in the first place”

75 © 2005

From Cyberspace to Virtual Worlds

Where are they heading?

How do they attract young audiences

(20 and younger)?

76 © 2005

Terrorists of SL

77 © 2005

Terrorism in SL

Group Name in SL No. of Members

terrorists of SL 228

Elite terrorist combat unit 9

Sl terrorist (S. L. T) 5

Second life terrorist association 5

Terrorists 4

The alkida terrorists 4

Shadows terrorists 4

Jihad terrorists 3

Elite jihad terrorist group 2

Automation jihad 2

78 © 2005

Terrorism in SL

79 © 2005

Terrorism in SL

80 © 2005

Case Studies, Examples, and

Lessons Learned: Business

Intelligence Data, Text and Web

Mining

81 © 2005

Data Mining for Credit Rating

82 © 2005

Credit Rating Analysis with Support

Vector Machines and Neural

Networks:

A Market Comparative Study

Zan Huang, Hsinchun Chen,

Chia-jung Hsu, Andy W. Chen,

Soushan Wu

Decision Support Systems, 37(4),

2004

83 © 2005

Our Study

• Apply a relatively new machine learning

technique, Support Vector Machines, with a

classic technique, Neural Networks

• Interpretation of the model

– Variable contribution analysis

• Cross market analysis

– United States and Taiwan market

84 © 2005

Statistical Methods

• Ordinary Least Squares (OLS) – Fisher 1959, Horrigan 1966, Pogue 1969, West 1970

• Multiple Discriminant Analysis (MDA) – Pinches and Mingo 1973,1975

• Logistic Regression Analysis – Ederington 1985

• Probit Analysis – Gentry 1988, Jackson

• Prediction Accuracy: 50 – 70%

• Frequently used financial variables – measures of size, financial leverage, long-term capital intensiveness,

return on investment, short-term capital intensiveness, earnings stability and debt coverage stability

85 © 2005

Artificial Intelligence Methods (cont.)

Study | Bond rating categories | Method | Accuracy | Data | Sample size | Benchmark statistical methods
Dutta and Shekhar 1988 | 2 (AA vs. non-AA) | BP | 83.30% | US | 30/17 | LinR (64.7%)
Singleton and Surkan 1990 | 2 (Aaa vs. A1, A2 or A3) | BP | 88% | US (Bell companies) | 126 | MDA (39%)
Garavaglia 1991 | 3 | BP | 84.90% | US S&P | 797 | N/A
Kim 1993 | 6 | BP, RBS | 55.17% (BP), 31.03% (RBS) | US S&P | 110/58/60 | LinR (36.21%), MDA (36.20%), LogR (43.10%)
Moody and Utans 1995 | 16 | BP | 36.2%, 63.8% (5 classes), 85.2% (3 classes) | US S&P | N/A | N/A

86 © 2005

Artificial Intelligence Methods (cont.)

Study | Bond rating categories | Method | Accuracy | Data | Sample size | Benchmark statistical methods
Kwon et al. 1997 | 5 | BP (with OPP) | 71-73% (with OPP), 66-67% (without OPP) | Korean | 126 | MDA (58-62%)
Maher and Sen 1997 | 6 | BP | 70% (7), 66.67% (5) | US Moody's | 299 | LogR (61.66%), MDA (58-61%)
Kwon and Lim 1998 | 5 | ACLS, BP | 59.9% (ACLS), 72.5% (BP) | Korean | 126 | MDA (61.6%)
Chaveesuk et al. 1999 | 6 | BP, RBF, LVQ | 56.7% (BP), 38.3% (RBF), 36.7% (LVQ) | US S&P | 60/60 (10 for each category) | LogR (53.3%)
Shin and Han 2001 | 5 | CBR, GA | 75.5% (CBR, GA combined), 62.0% (CBR), 53-54% (ID3) | Korean | 3886 | MDA (58.4-61.6%)

BP: Backpropagation Neural Networks, RBS: Rule-based System, ACLS: Analog Concept Learning System,

RBF: Radial Basis Function, LVQ: Learning Vector Quantization, CBR: Case-based Reasoning, GA: Genetic

Algorithm, MDA: Multiple Discriminant Analysis, LinR: Linear Regression, LogR: Logistic Regression, OPP:

Ordinary Pairwise Partitioning. Sample size: Training/tuning/testing.

87 © 2005

Taiwan Data Set

• Taiwan Ratings Corporation

– Established in 1997, partnering with Standard & Poor’s.

• Securities and Futures Institute

– Quarterly financial statements and financial ratios of publicly traded companies

• Data Preparation

– Used the credit rating and the company’s financial variables from 2 quarters before the rating release date

– 74 data points, 21 financial variables, 25 financial institutions, 1998-2002

88 © 2005

United States Data Set

• A comparable US data set from Standard & Poor’s Compustat

– Comparable financial variables

– S&P senior debt rating for all commercial banks (DNUM 6021)

– 36 commercial banks, 265 data points, 1991-2000.

TW rating | TW data points | US rating | US data points
twAAA | 8 | AA | 20
twAA | 11 | A | 181
twA | 31 | BBB | 56
twBBB | 23 | BB | 7
twBB | 1 | B | 1
Total | 74 | Total | 265

89 © 2005

Variable Selection

• ANOVA test

– Whether the differences of each financial variable

among different rating classes were significant.

– 5 uninformative variables removed from the data set

• Final data sets

– Taiwan: 14 financial ratios and 2 balance measures

– United States: 12 financial ratios and 2 balance

measures
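The screening idea can be sketched as a one-way ANOVA F statistic computed per variable across the rating classes. The numbers below are invented, the function name `anova_f` is hypothetical, and the sketch stops at the F statistic: turning it into the between-group p-value used on the slides needs the F distribution, which a statistics library would supply.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(g) / len(g) for g in groups]
    # Variation of the class means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own class mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical values of one ratio in three rating classes.
informative = [[0.90, 0.92, 0.91], [0.85, 0.86, 0.84], [0.70, 0.72, 0.71]]
# Hypothetical ratio whose class means are identical.
uninformative = [[1.0, 2.0, 3.0], [1.5, 2.5, 2.0], [2.0, 1.0, 3.0]]

f_informative = anova_f(informative)
f_uninformative = anova_f(uninformative)
print(f_informative, f_uninformative)
```

A variable like the second one, whose class means barely differ, gets a small F (large p-value) and would be a candidate for removal.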

90 © 2005

Financial Variables

Variable | Financial Ratio Name/ Description | ANOVA Between-Group P-Value
X1 | Total assets | 0
X2 | Total liabilities | 0
X3 | Long-term debts/ total invested capital | 0.12
X4 | Debt ratio | 0
X5 | Current ratio | 0.36
X6 | Times interest earned (EBIT/interest) | 0
X7 | Operating profit margin | 0
X8 | (Shareholders’ equity + long-term debt)/ fixed assets | 0
X9 | Quick ratio | 0.37
X10 | Return on total assets | 0.01
X11 | Return on equity | 0.04
X12 | Operating income/ received capitals | 0
X13 | Net income before tax/ received capitals | 0
X14 | Net profit margin | 0
X15 | Earnings per share | 0
X16 | Gross profit margin | 0.02
X17 | Non-operating income/ sales | 0.81
X18 | Net income before tax/ sales | 0
X19 | Cash flow from operating activities/ current liabilities | 0.84
X20 | (Cash flow from operating activities / (capital expenditures + increased in inventory + cash dividends)) in last 5 years | 0.64
X21 | (Cash flow from operating activities – cash dividends)/ (fixed assets + other assets + working capitals) | 0.08

91 © 2005

Experiment Results

• 4 Models (Frequently used variables, full set of variables)

– TW I: Rating = f(X1, X2, X3, X4, X6, X7)

– TW II: Rating = f(X1, X2, X3, X4, X6, X7, X8, X10, X11, X12, X13, X14, X15, X16, X18, X21)

– US I: Rating = f(X1, X2, X3, X6, X7)

– US II: Rating = f(X1, X2, X3, X6, X7, X8, X10, X11, X12, X13, X14, X15, X16, X21)

92 © 2005

Experiment Results (cont.)

• Results

– SVM did not consistently outperform neural networks (differences range from -1.13% to 4.05%).

– The small set of frequently used financial variables contained most of the relevant information.

Experiment Results

Model | SVM Results | NN Results | Difference
TW I | 79.73% | 75.68% | 4.05%
TW II | 77.03% | 75.68% | 1.35%
US I | 78.87% | 80.00% | -1.13%
US II | 80.00% | 79.25% | 0.75%

[Figure: Experiment Results. SVM Results and NN Results for TW I, TW II, US I and US II; accuracy axis from 73.00% to 81.00%.]

93 © 2005

Measure of Relative Importance

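• First order derivatives of the network parameters

– Neural network model

<y1, y2, …, yn> = f(<x1, x2, …, xm>)

– Derivative of output i with respect to input j: $\partial y_i / \partial x_j$

– Contribution measure: $Con_{ik}$, the relative contribution of input i on output k

Connection strengths between input, hidden and output layers are denoted as $w_{ji}$ and $v_{jk}$.

• Garson 1991

– Without direction

$$Con_{ik} = \frac{\sum_{j=1}^{J} \dfrac{|w_{ji}|\,|v_{jk}|}{\sum_{i'=1}^{I} |w_{ji'}|}}{\sum_{i'=1}^{I} \sum_{j=1}^{J} \dfrac{|w_{ji'}|\,|v_{jk}|}{\sum_{i''=1}^{I} |w_{ji''}|}}$$

• Yoon 1994

– With direction

$$Con_{ik} = \frac{\sum_{j=1}^{J} w_{ji}\, v_{jk}}{\sum_{i'=1}^{I} \left| \sum_{j=1}^{J} w_{ji'}\, v_{jk} \right|}$$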
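Both contribution measures can be sketched for a small single-hidden-layer network. The weights below are invented, the function names are hypothetical, and the Yoon variant is written in its commonly cited form, normalizing by the sum of absolute values; treat it as an illustration rather than the exact computation used in the study.

```python
def garson(w, v, k):
    """Garson-style contribution of each input to output k (no direction).

    w[j][i]: weight from input i to hidden unit j.
    v[j][k]: weight from hidden unit j to output k.
    """
    n_hidden, n_inputs = len(w), len(w[0])
    totals = []
    for i in range(n_inputs):
        total = 0.0
        for j in range(n_hidden):
            # Share of hidden unit j's incoming weight that belongs to input i.
            row_sum = sum(abs(w[j][m]) for m in range(n_inputs))
            total += abs(w[j][i]) * abs(v[j][k]) / row_sum
        totals.append(total)
    grand = sum(totals)
    return [t / grand for t in totals]


def yoon(w, v, k):
    """Yoon-style contribution of each input to output k (keeps the sign)."""
    n_hidden, n_inputs = len(w), len(w[0])
    raw = [sum(w[j][i] * v[j][k] for j in range(n_hidden)) for i in range(n_inputs)]
    norm = sum(abs(r) for r in raw)
    return [r / norm for r in raw]


# Hypothetical network: 3 inputs, 2 hidden units, 1 output.
w = [[0.8, -0.2, 0.1],
     [-0.5, 0.4, 0.3]]
v = [[1.2],
     [-0.7]]

garson_scores = garson(w, v, 0)
yoon_scores = yoon(w, v, 0)
print(garson_scores)
print(yoon_scores)
```

Garson's scores are non-negative and sum to 1, so they rank inputs by importance only; Yoon's scores sum to 1 in absolute value and also show whether an input pushes the output up or down.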

94 © 2005

Variable Contribution Analysis

• Garson’s measure

• Optimal set of variables for the two markets

– TW III: Rating = f(X1, X2, X3, X4, X6, X7, X8)

– US III: Rating = f(X1, X2, X3, X4, X7, X11)

Financial Variable Name/ Description

X1 Total assets

X2 Total liabilities

X3 Long-term debts/ total invested capital

X4 Debt ratio

X6 Times interest earned (EBIT/interest)

X7 Operating profit margin

X8 (Shareholders’ equity + long-term debt)/ fixed assets

X11 Return on equity

95 © 2005

Contribution Analysis Results

[Figure: Variable Contribution (United States). Contribution Measure (0 to 0.35) by Financial Variable (X1, X2, X3, X4, X7, X11), one series per rating class: AA, A, BBB, BB, B.]

[Figure: Variable Contribution (Taiwan). Contribution Measure (0 to 0.3) by Financial Variable (X1, X2, X3, X4, X6, X7, X8), one series per rating class: twAAA, twAA, twA, twBBB, twBB.]

Financial Variable Name/ Description

X1 Total assets

X2 Total liabilities

X3 Long-term debts/ total invested capital

X4 Debt ratio

X6 Times interest earned (EBIT/interest)

X7 Operating profit margin

X8 (Shareholders’ equity + long-term debt)/ fixed assets

X11 Return on equity

96 © 2005

Cross Market Analysis

• US Model

– X1, X2, X3, X7 | X4, X11

– Most important: total assets, total liabilities, long-

term debts/total invested capital

• TW Model

– X4, X7, X8 | X1, X2, X3, X6

– Most important: operating profit margin, debt ratio

97 © 2005

Future Directions

• Data mining + text mining

– Add important financial variables from the text

format annual report

• Larger scale cross market analysis

– Mainland China, Taiwan, Hong Kong and

United States markets

• Multidimensional financial data visualization

and exploration

98 © 2005

Stock Prediction Based on

Breaking News

99 © 2005

Textual Analysis of Stock Market Prediction

Using Breaking Financial News

*The Effect of Momentum and Contrarian

Selection Strategies

*The Effect of Industry Classification

Robert P Schumaker and Hsinchun Chen

*JASIST, 59(2), 2008

With special thanks to Zan Huang and Daniel McDonald

100 © 2005

Introduction

• Stock Market Prediction

– Appealing

• Numerous attempts have been made

– Difficult to accurately predict human behavior

– Two Common Philosophies (Technical Analysis, 2005)

• Fundamental Analysis

– Stock Market activity can be predicted from the security’s relative data, statistics, earnings and management

• Technical Analysis

– Stock Market price trends are identified using charts and modeling techniques

– This philosophy is a form of market analysis that studies the supply and demand for securities based on historical trading volume and market price

101 © 2005

Introduction

• The Use of Textual Data in Prediction

– Text Classification Techniques

• Determine stock price direction

• Promising directional results on aggregate indices

– Limitations of Prior Studies

• Discrete Stock Price prediction from textual data

has not been performed

• Comparisons of regression-based machine learning

methods have not been performed

• Most prior studies on textual data limit themselves

to a ‘Bag of Words’ approach

102 © 2005

Literature Review

• Financial News Articles

– Large amounts of news articles exist for

securities

• Required reports, governmental compliance

• Unexpected reports lead to share price changes

– Can be capitalized on by NLP and text-processing

techniques

– Automated techniques can capitalize on information

quicker than human counterparts

» Cuts the lag time between information release and the

effect on stock price

103 © 2005

Literature Review

• Linguistic Techniques

– Bag of Words – all words from a document can be potentially used for machine learning

• Usually strip stop words and perform stemming

• Prior financial news article research used this method (Lavrenko et al. 2000a and Gidofalvi, 2001)

– De facto method of financial article research

– Noun Phrases – only noun phrases from a document are used for machine learning

– Named Entities – Entities such as people, places, and organizations are used for machine learning
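A Bag of Words representation with stop word removal and stemming can be sketched as follows. The stop word list is a tiny illustrative subset and `crude_stem` is a deliberately rough stand-in for a real stemmer such as Porter's; neither is taken from the study.

```python
import re
from collections import Counter

# Tiny illustrative stop word list (an assumption, not the study's list).
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "to", "for", "as", "by", "it"}


def crude_stem(word):
    """Strip a few common suffixes; a rough stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Lowercase, tokenize, drop stop words, stem, and count the terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens if t not in STOP_WORDS)


bag = bag_of_words("Schwab shares fell in morning trading on the New York Stock Exchange")
print(bag)
```

Noun Phrases and Named Entities would replace the tokenizer with a phrase or entity extractor, yielding fewer and more abstract terms per article.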

104 © 2005

Literature Review

• Linguistic Techniques continued…

– Building on Bag of Words

• Noun Phrases – only noun phrases from a

document are used for machine learning

• Still encompasses important article concepts (Tolle

and Chen 2000)

• Handles article scaling better

• Syntactic rules and lexicons are used in

identification

105 © 2005

Literature Review

• Linguistic Techniques continued…

– Building on Noun Phrases

• Named Entities – Entities such as people, places,

and organizations are used for machine learning

• Uses a semantic lexical hierarchy (McDonald et al.

2005)

– Nouns and Noun Phrases are classified as person,

organization, or location (Sekine and Nobata 2003)

– Still encompasses the important article concepts

– Provides a more abstract representation than Bag of

Words or Noun Phrases

106 © 2005

Literature Review

• Textual Stock Market Prediction Taxonomy

Algorithm | Classification | Source Material | Examples
Genetic Algorithm | 2 tier | Undisclosed number of chatroom postings | Thomas & Sycara, 2002
Naïve Bayesian | 3 tier | Over 5,000 articles borrowed from Lavrenko | Gidofalvi et al. 2001
Naïve Bayesian | 5 tier | 38,469 articles | Lavrenko et al. 2000a
Naïve Bayesian | 5 tier | 6,239 articles | Seo et al. 2002
SVM | 3 tier | About 350,000 articles | Fung et al. 2002
SVM | 3 tier | 6,602 articles | Mittermayer, 2004

3 classes – Typically consists of the classes: Up, Down, Unchanged

5 classes – Good, Good uncertain, Neutral, Bad uncertain, Bad

107 © 2005

Literature Review

• Evaluation Methods

– Prior studies:

• Measures of Closeness (Cho et al. 1998)

– How close the predicted price is to the actual price

– Measured using Mean Squared Error (MSE)

• Directional Accuracy (Gidofalvi, 2001)

– Did the predicted stock price follow the same direction of movement as the actual stock price

– Measured using classification bins (Up, Down, Unchanged)

• Simulated Trading (Lavrenko et al. 2000a)

– If we were to invest money in the system, what percentage gain/loss would we expect
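The three evaluation methods can be sketched as small functions. The prices are invented, and the trading rule (a fixed stake per trade, buying on a predicted rise and shorting on a predicted fall) is an assumption made for illustration, not the rule used in the cited studies.

```python
def mean_squared_error(predicted, actual):
    """Closeness: average squared gap between predicted and actual prices."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def direction(start, end, threshold=0.0):
    """Classification bin for a price move from start to end."""
    change = end - start
    if change > threshold:
        return "Up"
    if change < -threshold:
        return "Down"
    return "Unchanged"


def directional_accuracy(start_prices, predicted, actual):
    """Share of cases where predicted and actual moves fall in the same bin."""
    hits = sum(
        direction(s, p) == direction(s, a)
        for s, p, a in zip(start_prices, predicted, actual)
    )
    return hits / len(actual)


def simulated_return(start_prices, predicted, actual, stake=1000.0):
    """Percentage gain/loss from trading a fixed stake on each prediction."""
    invested = 0.0
    profit = 0.0
    for s, p, a in zip(start_prices, predicted, actual):
        move = direction(s, p)
        if move == "Unchanged":
            continue  # no trade when no movement is predicted
        shares = stake / s
        invested += stake
        profit += shares * (a - s) if move == "Up" else shares * (s - a)
    return 100.0 * profit / invested if invested else 0.0


# Hypothetical prices at article release, predicted and actual later prices.
start = [10.0, 20.0, 50.0]
predicted = [10.5, 19.0, 50.0]
actual = [10.3, 20.4, 49.5]

mse = mean_squared_error(predicted, actual)
accuracy = directional_accuracy(start, predicted, actual)
gain = simulated_return(start, predicted, actual)
print(mse, accuracy, gain)
```

The three numbers can disagree: a model may be close in price yet wrong on direction, which is why the study reports all three.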

108 © 2005

Literature Review

• Textual Financial Information Taxonomy

Textual Financial Source | Types | Examples | Description
Company Generated Sources | Quarterly & Annual Reports | 8K | SEC-mandated report on significant company changes
Company Generated Sources | Quarterly & Annual Reports | 10K | SEC-mandated Annual reports
Independently Generated Sources | Analyst Created | Recommendations | Buy/Hold/Sell based on expert assessment
Independently Generated Sources | Analyst Created | Stock Alerts | Alerts triggered by barriers such as support/resistance levels
Independently Generated Sources | News Outlets | Financial Times | Provides news stories on company activities
Independently Generated Sources | News Outlets | Wall Street Journal | Provides news stories on company activities
Independently Generated Sources | News Wire Services | PRNewsWire | Provides breaking financial news articles
Independently Generated Sources | News Wire Services | Yahoo Finance | Compilation of 45 independent financial news wire sources
Independently Generated Sources | Financial Discussion Boards | The Motley Fool | A forum for investors to share stock-related information

109 © 2005

Literature Review

• Company Generated Sources

– Quarterly & Annual Reports (Kloptchenko et al. 2004)

• Provides a linguistic structure to indicate how the

company may perform in the future

• Textual information may contain important

information not shown in the financial ratios

• Independently Generated Sources

– Analyst Created

• Neutral professional recommendations on

performance

110 © 2005

Literature Review

• Independently Generated Sources continued…

– News Outlets

• Centers that publish available financial information at specific

intervals

– Bloomberg, Dow Jones, Financial Times, Reuters, Wall Street Journal

(Cho, 1999)

– CNN Financial News, Business Wire, Forbes (Seo et al. 2002)

– News Wire Services

• Centers that publish available financial information as soon as it is

publicly released or discovered

– Financial Discussion Boards

• Financial Nuggets may be contained in Web Bulletin Boards (Thomas

& Sycara, 2002)

• Susceptible to Noise

111 © 2005

Literature Review

• News Wire Services

– Several sources release news articles to the market

• Comtex – real-time but subscription-based

• PRNewsWire – free real-time financial news service

– Has free XML/RSS feeds

– Has a free breaking news component

– One of the avenues that Market Makers receive their news

• Yahoo Finance – free real-time financial news service from a

compilation of sources (45 total)

– Associated Press

– Financial Times

– PRNewsWire

112 © 2005

Literature Review

• Intraday Stock Quote Gathering

– Most financial services provide end of day

quotes or intraday charts

– Historical intraday quotes can be gathered in

increments of 1, 5, 15 or 60 min

• One minute increments provide the most

information and are of sufficient granularity

for data analysis

113 © 2005

Research Questions

• How effective is the prediction of discrete

stock price values using textual financial

news articles?

• Which combination of textual analysis

techniques is the most valuable in stock

price prediction?

114 © 2005

System Design

[Figure: system design diagram. Inputs: News Articles, Stock Quotes. Textual Analysis: Bag of Words, Noun Phrases, Named Entities. Stock Quotation: Regression Analysis. Storage: DB. Model Building with a Machine Learning Algorithm (MLA): SVR. Error Analysis: Closeness, Directional Accuracy, Simulated Trading.]

115 © 2005

System Design • System Training:

• For each news article:

– Determine the stock price trend for prior 60 minutes [-60,

0]

» Use linear regression to obtain trend slope

– Determine actual stock price 20 minutes after article

release

[Figure: stock price (dollars) plotted against time (minutes), marking -60, 0 (news article release), and +20.]
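The training step for a single article can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the quote series is made up, `linear_trend` is a hypothetical helper, and extrapolating the fitted line to +20 minutes is an assumption about how a regressed price estimate might be obtained.

```python
def linear_trend(minutes, prices):
    """Ordinary least-squares fit of price against time; returns (slope, intercept)."""
    n = len(minutes)
    mean_t = sum(minutes) / n
    mean_p = sum(prices) / n
    sxx = sum((t - mean_t) ** 2 for t in minutes)
    sxy = sum((t - mean_t) * (p - mean_p) for t, p in zip(minutes, prices))
    slope = sxy / sxx
    return slope, mean_p - slope * mean_t


# Hypothetical one-minute quotes for the window [-60, 0] before article release:
# a steady rise of one cent per minute, ending at 15.60 at release time.
minutes = list(range(-60, 1))
prices = [15.00 + 0.01 * (t + 60) for t in minutes]

slope, intercept = linear_trend(minutes, prices)

# Assumption: the regressed estimate is the fitted line extended to +20 minutes.
projected_plus_20 = intercept + slope * 20
```

The actual +20 minute quote would serve as the training target; the projected value is only a trend-based estimate to compare it against.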


System Design

• Model Parameters

– Several parameters can be tested and included in our models

• Most based on prior research

– Model Building

• M1: uses only extracted article terms to predict price

• M2: uses terms and stock price at article release

• M3: uses terms and regressed stock price estimate
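The three models differ only in what is appended to the article terms. A minimal sketch of that difference follows; the function name `build_features`, the vector layout, and the sample values are hypothetical and chosen only for illustration.

```python
def build_features(term_vector, model, price_at_release=None, regressed_estimate=None):
    """Return the input vector for model M1, M2, or M3 (hypothetical layout)."""
    features = list(term_vector)
    if model == "M1":
        return features  # article terms only
    if model == "M2":
        return features + [price_at_release]  # terms plus price at release
    if model == "M3":
        return features + [regressed_estimate]  # terms plus regressed estimate
    raise ValueError(f"unknown model: {model}")


terms = [1, 0, 1]  # made-up binary term indicators for one article
m1 = build_features(terms, "M1")
m2 = build_features(terms, "M2", price_at_release=15.64)
m3 = build_features(terms, "M3", regressed_estimate=15.70)
```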


System Design

• News Article through the three Representations

Schwab shares fell as much as 5.3 percent in morning trading on the New York Stock Exchange but later recouped some of the loss. San Francisco-based Schwab expects fourth-quarter profit of about 14 cents per share two cents below what it reported for the third quarter citing the impact of fee waivers a new national advertising campaign and severance charges. Analysts polled by Reuters Estimates on average had forecast profit of 16 cents per share for the fourth quarter. In September Schwab said it would drop account service fees and order handling charges its seventh price cut since May 2004. Chris Dodds the company s chief financial officer in a statement said the fee waivers and ad campaign will reduce fourth-quarter pre-tax profit by $40 million while severance charges at Schwab s U.S. Trust unit for wealthy clients will cut profit by $10 million. The NYSE fined Schwab for not adequately protecting clients from investment advisers who misappropriated assets using such methods as the forging of checks and authorization letters. The improper activity took place from 1998 through the first quarter of 2003 the NYSE said. This case is a stern reminder that firms must have adequate procedures to supervise and control transfers of assets from customer accounts said Susan Merrill the Big Board s enforcement chief. It goes to the heart of customers expectations that their money is safe. Schwab also agreed to hire an outside consultant to review policies and procedures for the disbursement of customer assets and detection of possible misappropriations the NYSE said. Company spokeswoman Alison Wertheim said neither Schwab nor its employees were involved in the wrongdoing which she said was largely the fault of one party. She said Schwab has implemented a state-of-the-art surveillance system and improved its controls to monitor independent investment advisers. According to the NYSE Schwab serves about 5 000 independent advisers who handle about 1.3 million accounts. Separately Schwab said October client daily average trades a closely watched indicator of customer activity rose 10 percent from September to 258 900 though total client assets fell 1 percent to $1.152 trillion. Schwab shares fell 36 cents to $15.64 in morning trading on the Big Board after earlier falling to $15.16. (Additional reporting by Dan Burns and Karey Wutkowski)

Bag of Words Noun Phrases Named Entities

fourth Reuters Reuters

fined NYSE fourth quarter

Schwab fourth quarter Schwab

profit profit

fell Schwab

NYSE

quarter

Experimental Findings

• How effective is the prediction of discrete stock price values using textual financial news articles?

– Closeness measures for the different models (MSE)

• Model M2 (using article terms and the stock price at article release) had consistently lower MSE scores than linear regression (Regress) counterparts for each textual representation (p-values < 0.05)

• Named Entities had consistently lower MSE scores for each model compared against the other textual representations (p-values < 0.05)

MSE               Regress    M1        M2         M3
Bag of Words      0.07279    930.87    0.04422    0.12605
Noun Phrases      0.07279    863.50    0.04887    0.17944
Named Entities    0.07065    741.83    0.03407    0.07711
Average           0.07212    848.15    0.04261    0.12893
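Closeness is reported as mean squared error (MSE) between predicted and actual prices. A minimal sketch of the metric is given here; the prices are made up and are not taken from the study.

```python
def mean_squared_error(predicted, actual):
    """Average of squared differences between predicted and actual prices."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


# Hypothetical +20 minute prices for two articles (dollars).
predicted_prices = [15.70, 15.50]
actual_prices = [15.64, 15.60]
mse = mean_squared_error(predicted_prices, actual_prices)  # about 0.0068
```

Because the errors are squared, a model that predicts prices on the wrong scale is penalized heavily, which is one plausible reading of why M1 (terms only, no price input) scores so far from the other models.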

Experimental Findings

• Directional Accuracy of the models

• Measures the percentage of predictions whose direction of price movement matches the actual direction of the stock price at +20 minutes

• Model M2 performed better on average (49.9%) than the other models in predicting stock price direction

Directional Accuracy    Regress    M1       M2       M3
Bag of Words            47.8%      45.4%    49.9%    50.0%
Noun Phrases            47.7%      49.4%    50.8%    49.8%
Named Entities          46.9%      47.7%    48.8%    49.4%
Totals                  47.5%      47.5%    49.9%    49.7%
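Directional accuracy can be sketched as follows. The slides do not state the reference point for "direction"; this example assumes it is measured relative to the price at article release, and all prices are made up.

```python
def direction(change):
    """Return 1 for a rise, -1 for a fall, 0 for no change."""
    return (change > 0) - (change < 0)


def directional_accuracy(release_prices, predicted, actual):
    """Fraction of cases where predicted and actual moves share a direction."""
    hits = sum(
        direction(p - r) == direction(a - r)
        for r, p, a in zip(release_prices, predicted, actual)
    )
    return hits / len(release_prices)


# Hypothetical prices for four articles.
release = [10.0, 10.0, 10.0, 10.0]
predicted = [10.2, 9.9, 10.1, 9.8]
actual = [10.1, 10.3, 10.4, 9.9]
accuracy = directional_accuracy(release, predicted, actual)  # 3 of 4 match
```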

Experimental Findings

• Simulated Trading results of the models

– Percentage return on money invested

• Model M2 had the best average return (2.09%) of the models

Trading Engine    Regress    M1        M2       M3
Bag of Words      -1.81%     -0.34%    1.59%    0.98%
Noun Phrases      -1.81%     0.62%     2.57%    1.17%
Named Entities    -2.26%     -0.47%    2.02%    2.97%
Totals            -1.94%     -0.05%    2.09%    1.43%

Conclusions

• Model M2 performed the best of the models

– Consistently performed better than the other models

• Closeness (0.04261 to Regress at 0.07228)

• Directional Accuracy (49.9% to Model M3 at 49.7%)

• Simulated Trading (2.09% to Model M3 at 1.43%)

– This is the result of capitalizing on the article terms and stock price at the time of article release for prediction


Conclusions

• Proper Nouns performed the best of the Textual Representations in Model M2

– Performed better in 2 of the 3 evaluation metrics

• Directional Accuracy (50.9% to Noun Phrases at 50.8%)

• Simulated Trading (2.84% to Noun Phrases at 2.57%)


Future Directions

• Explore other stocks, e.g., NASDAQ

– Look at stocks outside of the S&P 500

• Look at the effect of “breaking” news articles on different industries

• Explore news categories and news sentiments

Product Opinion Classification in Multilingual Web Forums

Sentiment Analysis in Multiple Languages: Feature Selection for Opinion Classification in Web Forums

Ahmed Abbasi, Hsinchun Chen, and Arab Salem

JCDL, 2007; ACM TOIS 26(3), 2008; MISQ Forthcoming, 2008

Sentiment Analysis

• Sentiment analysis attempts to identify and analyze opinions and emotions.

• Hearst (1992) originally proposed the idea of mining direction-based text.

• In recent years it has been applied to various forms of web-based discourse (Agrawal et al., 2003; Efron, 2004).

• Application to web group forums can provide insight into important discussions and trends.

Sentiment Analysis

• Traditional forms of content analysis, such as topical analysis, may not be effective for forums.

• Nigam and Hurst (2004) found that only 3% of USENET sentences contained topical information.

• In contrast, web discourse is rich in sentiment-related information (Subasic & Huettner, 2001).

Sentiment Analysis Characteristics

• Tasks – Classification or trend analysis.

• Features – Attributes that are the most effective discriminators of sentiment polarity.

• Techniques – Analytical methods used to discriminate between sentiments.

• Domain – Reviews (movies, products, etc.), Web Discourse (forums, blogs, web pages), and news articles.

Sentiment Analysis Domains

• Reviews

– Movie, product, and music reviews

• (Morinaga et al., 2002; Pang et al., 2002; Turney, 2002)

• Discourse

– Includes web forums, newsgroups, and blogs.

– Sentiments about specific issues/topics

• Abortion, Gun Control, Politics (Agrawal et al., 2003; Efron, 2004)

– General sentiments

• Donath et al. (1999) evaluated the USENET forum alt.soc.greek for sentiments relating to anger and aggression.

• News Articles/Documents

– (Yi et al., 2003; Wilson et al., 2005)

Taxonomy of Sentiment Analysis Research

Category            Description                                                                 Label
Tasks
  Classification    Classifying sentiment polarity                                              C1
  Trend Analysis    Evaluating sentiment balance and temporal trends                            C2
Features
  Syntactic         Word N-grams, POS tags, punctuation                                         F1
  Semantic          Polarity tags, appraisal groups, semantic orientation                       F2
  Link Based        Web links, send/reply patterns, and document citations                      F3
  Stylometric       Features such as average sentence length, special character frequencies    F4
Techniques
  Machine Learning  Techniques such as SVM, Naïve Bayes, etc.                                   T1
  Link Analysis     Techniques such as citation analysis and message send/reply patterns       T2
  Similarity Score  Phrase pattern matching, semantic orientation, etc.                         T3
  Visualization     Loom, radar charts, etc.                                                    T4
Domains
  Reviews           Product and movie reviews                                                   D1
  Discourse         Web forums and blogs                                                        D2
  News Articles     Online news articles and documents                                          D3

Previous Sentiment Analysis Studies

Study | Task (C1 C2) | Features (F1 F2 F3 F4) | Feature Reduction (Yes/No) | Techniques (T1 T2 T3 T4) | Data Type (D1 D2 D3) | Multilingual Data (Yes/No)

Donath et al., 1999 √ √ No √ √ No
Subasic & Huettner, 2001 √ √ No √ √ √ No
Tong, 2001 √ √ √ √ No √ √ √ No
Morinaga et al., 2002 √ √ √ Yes √ √ √ No
Pang et al., 2002 √ √ No √ √ No
Turney, 2002 √ √ No √ √ No
Agrawal et al., 2003 √ √ √ No √ √ √ No
Dave et al., 2003 √ √ No √ √ √ No
Nasukawa & Yi, 2003 √ √ √ No √ √ No
Yi et al., 2003 √ √ Yes √ √ √ No
Yu & Hatzivassiloglou, 2003 √ √ √ No √ √ √ No
Beineke et al., 2004 √ √ No √ √ √ No
Efron, 2004 √ √ √ No √ √ √ No

Previous Sentiment Analysis Studies

Study | Task (C1 C2) | Features (F1 F2 F3 F4) | Feature Reduction (Yes/No) | Techniques (T1 T2 T3 T4) | Data Type (D1 D2 D3) | Multilingual Data (Yes/No)

Fei et al., 2004 √ √ No √ √ No
Gamon, 2004 √ √ √ Yes √ √ No
Grefenstette et al., 2004 √ √ No √ √ No
Hu & Liu, 2004 √ √ √ No √ √ No
Kanayama et al., 2004 √ √ √ No √ √ Yes
Kim & Hovy, 2004 √ √ No √ √ No
Pang & Lee, 2004 √ √ No √ √ √ No
Mullen & Collier, 2004 √ √ √ No √ √ No
Nigam & Hurst, 2004 √ √ √ No √ √ No
Liu et al., 2005 √ √ √ √ No √ √ √ No
Mishne, 2005 √ √ √ √ No √ √ No
Whitelaw et al., 2005 √ √ √ No √ √ No
Wilson et al., 2005 √ √ √ No √ √ No

System Design - Overview

The system design has two major components:

• A feature extractor that derives the extended feature set

• The Ink Blot technique, which can be used for text classification and analysis

Authorship Feature Set (Abbasi & Chen, 2005)

[Figure: feature set hierarchy. Root: Feature Set (418). Categories: Lexical, Syntactic, Structural, Content-Specific. Node labels: Char-Based, Word-Based, Char-Level, Letter Frequency, Special Char., Word-Level, Vocab. Richness, Word Length Dist., Punctuation, Function Words, Word Structure, Word Roots, Message Level, Paragraph Level, Technical Structure, Contact Information, Font Color, Font Size, Embedded Images, Hyperlinks, Elongation, Race/Nationality, Violence. Per-node feature counts appear in the figure.]

System Design – Feature Set

• The feature set consists of stylistic, topical, and sentiment features.

• A minimum frequency threshold of 10 is used to select the n-gram features.

Group      Category                 Quantity    Description
Style      Word-Level Lexical       5           total words, % char. per word
           Character-Level Lexical  5           total char., % char. per message
           Character N-Grams        < 18,278    count of letters, bigrams, trigrams
           Digits N-Grams           < 1,110     count of digits, bigrams, trigrams
           Word Length Dist.        20          frequency of 1-20 letter words
           Vocabulary Richness      8           richness (e.g., hapax legomena)
           Special Characters       21          occurrences of char. (e.g., @#$%^&*+)
           Function Words           300         frequency of function words (e.g., of, for)
           Punctuation              8           occurrence of punctuation (e.g., !;:,.?)
           POS Tag N-Grams          varies      frequency of tag n-grams (e.g., NP VB)
           Message Structure        6           e.g., has greeting, has url
           Paragraph Structure      8           e.g., no. of paragraphs, sentences per paragraph
           Technical Structure      50          e.g., file extensions, fonts, use of images
           Misspelled Words         < 5,513     common misspellings (e.g., “beleive”)
Topic      Words N-Grams            varies      bag-of-word n-grams (e.g., “senior editor”)
           Noun Phrases             varies      e.g., “New York, United States”
           Named Entities           varies      e.g., “McDonalds”, “KFC”, “AOL”
Sentiment  Polar Adjectives         3,000       positive and negative sentiment terms
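The upper bounds on character and digit n-grams follow from the alphabet sizes: 26 + 26² + 26³ = 18,278 letter n-grams and 10 + 10² + 10³ = 1,110 digit n-grams. A minimal sketch of the counting and the frequency threshold is given here. The helper names are hypothetical, and two details are assumptions: n-grams are taken within lower-cased runs of letters, and an n-gram is kept when its count is at least 10.

```python
import re
from collections import Counter


def ngram_upper_bound(alphabet_size, max_n=3):
    """Number of distinct n-grams of length 1..max_n over an alphabet."""
    return sum(alphabet_size ** n for n in range(1, max_n + 1))


def frequent_char_ngrams(messages, max_n=3, min_count=10):
    """Count letter n-grams across messages and keep those meeting the threshold."""
    counts = Counter()
    for message in messages:
        # Assumption: n-grams do not span spaces, digits, or punctuation.
        for token in re.findall(r"[a-z]+", message.lower()):
            for n in range(1, max_n + 1):
                for i in range(len(token) - n + 1):
                    counts[token[i:i + n]] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


letter_bound = ngram_upper_bound(26)  # 26 + 676 + 17,576
digit_bound = ngram_upper_bound(10)   # 10 + 100 + 1,000
kept = frequent_char_ngrams(["aaa"] * 5)
```

The threshold explains why the quantities in the table are stated as upper bounds: rare n-grams are dropped, so the selected set is smaller than the theoretical maximum.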

System Design: Ink Blots

Ink Blot Technique Steps

1) Separate input text into two classes (one for the class of interest, one containing all remaining texts).

2) Extract feature vectors for messages.

3) Input vectors into DTM as a binary class problem.

4) For each feature in the computed decision tree, determine blot size and color based on DTM weight and feature usage.

5) Overlay feature blots onto their respective occurrences in text.

6) Repeat steps 1-5 for each class.
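Steps 1, 3, and 4 can be illustrated with a small sketch. The slides do not define DTM or its weights; this example assumes DTM denotes a decision tree model and uses information gain of a binary feature as a stand-in for the tree-derived weight. All names and data are hypothetical.

```python
import math


def one_vs_rest(classes, target):
    """Step 1: label the class of interest 1 and all remaining texts 0."""
    return [1 if c == target else 0 for c in classes]


def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return -sum(q * math.log2(q) for q in (p, 1 - p) if q > 0)


def information_gain(feature_present, labels):
    """Stand-in for the DTM weight: entropy reduction from splitting on a feature."""
    with_f = [y for x, y in zip(feature_present, labels) if x]
    without_f = [y for x, y in zip(feature_present, labels) if not x]
    n = len(labels)
    return (entropy(labels)
            - len(with_f) / n * entropy(with_f)
            - len(without_f) / n * entropy(without_f))


# Four made-up messages: author labels and whether a feature occurs in each.
authors = ["A", "A", "B", "C"]
feature_present = [1, 1, 0, 0]

# Step 6: repeat the binary problem once per class.
weights = {
    target: information_gain(feature_present, one_vs_rest(authors, target))
    for target in sorted(set(authors))
}
```

In this toy data the feature separates author A perfectly, so its weight for A is 1 bit, while it is far less informative for B and C.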

Ink Blots: Pirated Software Sales Data

[Figure: ink blot visualizations; labels: Author A, Author B, Author C, Author D, Message 1, Message 2.]

Ink Blot Categorization on Shorter Messages

[Figure: ink blot categorization example; labels: Author A, B, C, D.]


Evaluation - Hypotheses

• We propose the following research hypotheses relating to web forum text analysis:

• H1: There is no performance difference between the Ink Blot technique and the benchmark SVM technique for the categorization of topical/sentiment/genre information – SVM vs. Ink Blots

• H2: There is no performance difference between the use of bag-of-word features (baseline) and the extended feature set – Extended feature set vs. Baseline

Evaluation – Experiment 1: Topics

• Topic Categorization

– Objective: to test the effectiveness of features and techniques for capturing topical information.

– Test bed = 10 topics taken from the Enron email corpus (100 emails per topic).

– Two experiment settings were run, one using 5 topics and the other using all 10 topics.

• Both techniques were run using 10-fold cross-validation.

• For Ink Blots, the anonymous message was assigned to the class with the highest ratio of red to blue blot area.

– Extended feature set = bag-of-words and noun phrases

• Both effective in prior research (Dumais et al., 1998; Chen et al., 2003).
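The Ink Blot assignment rule (highest ratio of red to blue blot area) can be sketched as follows. The area values are made up, `assign_class` is a hypothetical helper, and the handling of a zero blue area is an assumption, since the slides do not address that case.

```python
def assign_class(blot_areas):
    """Pick the class whose red-to-blue blot area ratio is highest.

    blot_areas maps a class name to a (red_area, blue_area) pair.
    """
    def ratio(areas):
        red, blue = areas
        if blue == 0:
            # Assumption: only red blots means an unbounded ratio; no blots means zero.
            return float("inf") if red > 0 else 0.0
        return red / blue

    return max(blot_areas, key=lambda name: ratio(blot_areas[name]))


# Hypothetical blot areas for one anonymous message overlaid per topic.
areas = {"Topic 1": (3.0, 1.0), "Topic 2": (2.0, 4.0), "Topic 3": (0.0, 0.0)}
assigned = assign_class(areas)
```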

Evaluation – Experiment 1: Topics

• The extended feature set significantly outperformed the bag-of-words baseline.

• Both techniques coupled with the extended features achieved accuracy over 90% in all instances.

• However, SVM outperformed the Ink Blot technique in both the 5-topic and 10-topic experiment settings.

• In both cases, SVM's advantage was statistically significant based on the p-values of the pairwise t-tests.

Accuracy (%) by number of topics
Techniques    5 Topics    10 Topics
SVM           95.70       93.25
Ink Blots     92.25       90.10
Baseline      88.75       86.55

Evaluation – Experiment 2: Sentiments

• Sentiment Classification

– Objective: to test the effectiveness of features and techniques for capturing opinions.

– Test bed of 2000 digital camera product reviews taken from www.epinions.com.

• 1000 positive (4-5 star) and 1000 negative (1-2 star) reviews

• 500 for each star level (i.e., 1, 2, 4, 5)

– Two experimental settings were tested

• Classifying 1 star versus 5 star (extreme polarity)

• Classifying 1+2 star versus 4+5 star (milder polarity)

– Extended feature set encompassed a lexicon of 3000 positively or negatively oriented adjectives and word n-grams (Pang et al., 2002; Turney & Littman, 2003).

Evaluation – Experiment 2: Sentiments

• SVM marginally outperformed Ink Blots

– However, the enhanced performance was not statistically significant (p-values on pairwise t-tests > 0.05).

• The extended feature set significantly outperformed the bag-of-words baseline.

• The overall accuracies for both SVM and Ink Blots were consistent with previous work (i.e., in the 85%-95% range).

Accuracy (%) by sentiment setting
Techniques    Extreme Polarity    Mild Polarity
SVM           93.00               89.40
Ink Blots     92.20               86.80
Baseline      83.00               77.10

Evaluation – Experiment 3: Genres

• Genre Classification

– Objective: to test the effectiveness of features and techniques for capturing genres.

– Test bed of 3000 forum postings from the Sun Technology Forum (forum.java.sun.com)

• Genres included questions, informative messages, and general messages (no information, just comments).

• 1000 messages used for each genre.

– Two experimental settings were run:

• Questions (1000 messages) versus non-questions (500 informative, 500 comments)

• All three genres (1000 messages each)

– The extended feature set consisted of lexical, syntactic, structural, content-specific, and n-gram features.

Evaluation – Experiment 3: Genres

• Ink Blots marginally outperformed SVM

– However, the enhanced performance was not statistically significant based on pairwise t-tests (p-values > 0.05).

• The extended feature set significantly outperformed the bag-of-words baseline.

• The overall accuracies for both SVM and Ink Blots were consistent with previous results dealing with 2-3 genres.

Accuracy (%) by genre setting
Techniques    Questions vs. Non-Questions    All Three Genres
SVM           98.10                          96.40
Ink Blots     98.55                          96.50
Baseline      90.10                          86.00

Conclusions

• In this work we presented a CMC archive visualization system consisting of:

– An extended feature set for various CMC text mining tasks (e.g., topics, sentiments, affects, genres)

– The Ink Blot technique.

• We used the system to provide DL exploration services:

– Categorization

– Analysis

– Visualization: To help identify pertinent and significant text features (suitable for human inspection and validation)

• Several analysis illustrations were presented and experiments were used to evaluate the categorization capabilities of the system.

Conclusions and Future Directions

• Our research contributions are twofold:

– Firstly, we are unaware of any prior research using such an extensive set of features for representing CMC text.

– Secondly, we presented the Ink Blot technique for visualizing these features.

• We are expanding our feature sets and exploring other feature reduction and visualization techniques for CMC text analysis.

• We are testing selected techniques for opinion mining, internet fraud, and security informatics applications.

Opportunities and Future Directions

Finance & Accounting Data Sources in US and Taiwan: From Data to Text and Web 2.0

Hsinmin Lu, Yida Chen & Hsinchun Chen

Artificial Intelligence Lab

University of Arizona


US Financial Databases



Types of Financial Data

• Company Financials:

– Balance sheets

– Income statement

– Company manager and ownership

– Earnings forecasts and analysts’ recommendations

– Mergers and acquisitions

– Audit information

– Banks and insurance companies

– Major company events

• Financial Markets and Prices:

– Stock prices

– Market indices and factors

– Mutual funds

– Bonds

– Derivatives



Types of Financial Data

• Macroeconomics:

– GDP, production indices, consumer price index, wages, unemployment rate

• Financial News:

– Newspapers: Wall Street Journal, Financial Times

– Newswire: Reuters, PR Newswire

• Financial Blogs and Forums


Data Providers

• Government

– Securities and Exchange Commission (EDGAR)

– US Census Bureau

– Bureau of Labor Statistics

– Federal Reserve Banks

• Commercial Data Services

– Wharton Research Data Services (WRDS)

– Bloomberg

– Reuters

– Lexis Nexis

• Financial Web Sites

– Yahoo Finance

– Google Finance

– Market Watch

– CNN Money

Data Types and Data Providers

Columns: Company Financial, Financial Markets, News, Macroeconomics

GOV: SEC (EDGAR) X
GOV: Census X
GOV: Bureau of Labor Stat. X
GOV: US Treasury X
WRDS X X X
Bloomberg X X X X
Reuters X X X X
Yahoo Finance X X X X
Google Finance X X X X
Market Watch X X X X
CNN Money X X X X
Lexis Nexis X X

User Interface

• Web interface for all government websites

– EDGAR provides FTP service

• WRDS: SSH and Web Interface

• Bloomberg: proprietary software and web interface

• Reuters: proprietary software

• Lexis Nexis: web interface

Data Standards

• Government

– EDGAR: plain text, HTML, and XBRL (an XML-based standard for business reporting)

– Other government websites usually provide data download in CSV or text format

• Commercial Data Services

– WRDS: CSV, text, SAS data file, STATA data file

– Bloomberg and Reuters: text, CSV and Excel

– Lexis Nexis: HTML

• Financial Websites, Blogs and Forums

– HTML, CSV, text (spidering needed)


Company Financial Data



Company Financial Data

• EDGAR

– 10-K and 10-Q: yearly and quarterly reports

– 8-K: reports major corporate events such as mergers or changes in registrant's certifying accountant

– Comment and response letters to company filings

– Forms 3, 4, 5: insider trading reports

Data Example:

<SEC-DOCUMENT>0001193125-06-001869.txt : 20060105

<SEC-HEADER>0001193125-06-001869.hdr.sgml : 20060105

<ACCEPTANCE-DATETIME>20060105170941

ACCESSION NUMBER: 0001193125-06-001869

CONFORMED SUBMISSION TYPE: 10-Q

PUBLIC DOCUMENT COUNT: 7

CONFORMED PERIOD OF REPORT: 20051130

[Data Truncated]
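The "KEY: value" lines in an EDGAR submission header such as the one in the data example can be read with a few lines of code. This is a minimal sketch that handles only the fields shown on the slide; `parse_sec_header` is a hypothetical helper, not part of any SEC tool, and real headers contain more fields and nesting than this handles.

```python
SAMPLE_HEADER = """\
<SEC-DOCUMENT>0001193125-06-001869.txt : 20060105
<SEC-HEADER>0001193125-06-001869.hdr.sgml : 20060105
<ACCEPTANCE-DATETIME>20060105170941
ACCESSION NUMBER: 0001193125-06-001869
CONFORMED SUBMISSION TYPE: 10-Q
PUBLIC DOCUMENT COUNT: 7
CONFORMED PERIOD OF REPORT: 20051130
"""


def parse_sec_header(text):
    """Collect 'KEY: value' pairs, skipping SGML tag lines and lines without a colon."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("<") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


header = parse_sec_header(SAMPLE_HEADER)
form_type = header["CONFORMED SUBMISSION TYPE"]
```

Filtering on the submission type is one simple way to separate 10-K, 10-Q, and 8-K filings before any text mining.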

Company Financial Data (cont’)

• COMPUSTAT (in WRDS)

– Provided by Standard & Poor's

– More than 24,000 active and inactive publicly held companies.

– Annual and quarterly income statement, balance sheet, statement of cash flows, and supplemental data items

– Also contains information on aggregates, industry segments, banks, market prices, dividends, and earnings.

– Available in various data formats (CSV, XLS, SAS data file, STATA data file)

Company Financial Data (cont’)

• COMPUSTAT Data Example (Annual company financial information):

Field       Value                          Description
datadate    20051231
fyear       2005
tic         IBM
cusip       459200101
conm        INTL BUSINESS MACHINES CORP
curncd      USD
ni          7934                           Net Income
nopi        1434                           Nonoperating Income
np          4228                           Note payable
oancf       14874                          Operating Activities - Net Cash Flow
pi          12226                          Pretax Income
pidom       7450                           PI domestic

Company Financial Data (cont’)

• Yahoo Finance

– Provides an easy-to-use interface to company profiles, key statistics, SEC filings, and competitors

– Usually does not provide longitudinal data

[Figure: Yahoo Finance data example.]

Company Financial Data (cont’)

• IBES

– Provides analysts' earnings estimates, recommendations (buy-hold-sell), and actual reported earnings

– Data Example:

OFTIC  MEASURE  ANALYS  FPI  FPEDATS    ESTDATS    VALUE   actual  REPDATS
INTC   EPS      48357   6    31-Mar-98  19-Feb-98  0.235   0.2025  14-Apr-98
INTC   EPS      48357   6    31-Mar-98  5-Mar-98   0.1775  0.2025  14-Apr-98
INTC   EPS      938     6    31-Mar-98  5-Mar-98   0.18    0.2025  14-Apr-98
INTC   EPS      40196   6    31-Mar-98  3-Feb-98   0.2225  0.2025  14-Apr-98
INTC   EPS      40196   6    31-Mar-98  5-Mar-98   0.1675  0.2025  14-Apr-98
INTC   EPS      10635   6    31-Mar-98  14-Jan-98  0.2275  0.2025  14-Apr-98
INTC   EPS      10635   6    31-Mar-98  5-Mar-98   0.1825  0.2025  14-Apr-98
INTC   EPS      9077    6    31-Mar-98  28-Jan-98  0.24    0.2025  14-Apr-98
INTC   EPS      9077    6    31-Mar-98  5-Mar-98   0.1925  0.2025  14-Apr-98
INTC   EPS      9077    6    31-Mar-98  12-Mar-98  0.1875  0.2025  14-Apr-98
INTC   EPS      9236    6    31-Mar-98  15-Jan-98  0.2275  0.2025  14-Apr-98
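One way such analyst-level rows might be used is to form a consensus estimate and compare it with the reported figure. The sketch below uses the INTC rows from the data example; taking each analyst's most recent estimate and averaging them is a common convention assumed here, not something the slides specify, and `latest_estimates` is a hypothetical helper.

```python
from datetime import datetime

# (ANALYS, ESTDATS, VALUE) taken from the IBES data example for INTC.
ROWS = [
    ("48357", "19-Feb-98", 0.235), ("48357", "5-Mar-98", 0.1775),
    ("938", "5-Mar-98", 0.18),
    ("40196", "3-Feb-98", 0.2225), ("40196", "5-Mar-98", 0.1675),
    ("10635", "14-Jan-98", 0.2275), ("10635", "5-Mar-98", 0.1825),
    ("9077", "28-Jan-98", 0.24), ("9077", "5-Mar-98", 0.1925),
    ("9077", "12-Mar-98", 0.1875),
    ("9236", "15-Jan-98", 0.2275),
]
ACTUAL_EPS = 0.2025


def latest_estimates(rows):
    """Keep each analyst's most recent estimate by estimate date."""
    latest = {}
    for analyst, date_text, value in rows:
        when = datetime.strptime(date_text, "%d-%b-%y")
        if analyst not in latest or when > latest[analyst][0]:
            latest[analyst] = (when, value)
    return {analyst: value for analyst, (_, value) in latest.items()}


estimates = latest_estimates(ROWS)
consensus = sum(estimates.values()) / len(estimates)  # mean of latest estimates
surprise = ACTUAL_EPS - consensus  # positive when actual beats the consensus
```

For these rows the six latest estimates average roughly 0.187, so the reported 0.2025 exceeds the consensus by about 0.015 per share.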


Company Financial Data (cont’)

• AuditAnalytics: Audit information on over 1,200 accounting firms and 15,000 publicly registered companies
– Who is auditing whom
– How much they are paying for what services
– Create reports by auditor, fees, location, industry

COMPANY_FKEY  AUDITOR_NAME                FISCAL_YEAR  FISCAL_YEAR_ENDED  AUDIT_FEES  NON_AUDIT_FEES  TOTAL_FEES
51143         PricewaterhouseCoopers LLP  2003         31-Dec-03          11300000    40900000        52200000
51143         Ernst & Young LLP           2003         31-Dec-03          2500000     8700000         11200000
51143         PricewaterhouseCoopers LLP  2004         31-Dec-04          21600000    55100000        76700000
51143         Ernst & Young LLP           2004         31-Dec-04          3300000     1400000         4700000
51143         PricewaterhouseCoopers LLP  2005         31-Dec-05          25300000    32000000        57300000
789019        Deloitte & Touche LLP       2003         30-Jun-03          10700000    16800000        27500000
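As a hedged illustration of "how much they are paying for what services", the sketch below computes the non-audit share of total fees for a few of the rows above. The function name `non_audit_share` is an assumption made for this example; it is not an AuditAnalytics field.

```python
# (COMPANY_FKEY, AUDITOR_NAME, FISCAL_YEAR, AUDIT_FEES, NON_AUDIT_FEES) from the rows above.
ROWS = [
    (51143, "PricewaterhouseCoopers LLP", 2003, 11_300_000, 40_900_000),
    (51143, "Ernst & Young LLP", 2003, 2_500_000, 8_700_000),
    (51143, "PricewaterhouseCoopers LLP", 2004, 21_600_000, 55_100_000),
    (51143, "Ernst & Young LLP", 2004, 3_300_000, 1_400_000),
]


def non_audit_share(audit_fees, non_audit_fees):
    """Fraction of total fees paid for non-audit services."""
    return non_audit_fees / (audit_fees + non_audit_fees)


shares = {
    (auditor, year): non_audit_share(audit, non_audit)
    for _, auditor, year, audit, non_audit in ROWS
}
for key, share in shares.items():
    print(key, f"{share:.1%}")
```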


Financial Markets and Prices

• Yahoo Finance

– Provides daily prices and trading volumes for securities and indices
– Provides charting tools and data download functionality

Data Example:


Financial Markets and Prices (cont’)

• The Center for Research in Security Prices (CRSP)
– Maintained by CRSP at the Graduate School of Business of the University of Chicago
– Comprehensive collection of security price, return, and volume data for the NYSE, AMEX and NASDAQ stock markets
• Various stock indices and mutual funds are also included
– Data frequency: daily, monthly
– Often merged with Compustat for research purposes


Financial Markets and Prices (cont’)

• CRSP Data Example:

DATE      TICKER  DIVAMT  SHROUT   BIDLO  ASKHI   PRC     VOL      RET       BID     ASK
20070126  IBM             1506352  96.84  97.83   97.45   5771100  -0.00062  97.45   97.46
20070129  IBM             1506352  97.45  98.66   98.54   7294800  0.011185  98.56   98.59
20070130  IBM             1506352  98.5   99.45   99.37   7178000  0.008423  99.37   99.4
20070131  IBM             1506352  98.35  99.48   99.15   6446400  -0.00221  99.1    99.15
20070201  IBM             1506352  97.96  99.18   99      6612400  -0.00151  99      99.02
20070202  IBM             1506352  98.88  99.73   99.17   6657000  0.001717  99.18   99.23
20070205  IBM             1506352  98.9   100.44  100.38  8184800  0.012201  100.25  100.33
20070206  IBM             1506352  99.54  100.4   99.85   6532800  -0.00528  99.85   99.86
20070207  IBM     0.3     1506352  99.12  100.36  99.54   7698200  -0.0001   99.54   99.58
20070208  IBM             1506352  98.65  99.74   99.62   6152300  0.000804  99.66   99.67
20070209  IBM             1506352  97.81  99.7    98.55   6101100  -0.01074  98.56   98.58
20070212  IBM             1506352  98.22  99.2    98.58   5331043  0.000304  98.6    98.64
20070213  IBM             1506352  97.8   98.74   98.29   5702815  -0.00294  98.27   98.34
20070214  IBM             1506352  98.25  99.43   99.2    5644733  0.009258  99.22   99.26
20070215  IBM             1506352  98.48  99.52   98.92   5568600  -0.00282  98.95   99.04
20070216  IBM             1506352  98.63  99.25   98.99   4800700  0.000708  98.97   98.99
20070220  IBM             1506352  98.55  99.46   99.35   4124200  0.003637  99.35   99.39
20070221  IBM             1506352  98.7   99.37   99.09   4302400  -0.00262  99.09   99.16
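The RET column can be cross-checked against PRC and DIVAMT. The sketch below assumes RET is the holding-period return, (PRC_t + DIVAMT_t) / PRC_{t-1} - 1, and that no split adjustment is needed over this window (SHROUT is constant in the rows above). The function name `holding_period_returns` is introduced here for illustration.

```python
# (DATE, PRC, DIVAMT, RET) copied from the first IBM rows above; a blank DIVAMT is treated as 0.
ROWS = [
    ("20070126", 97.45, 0.0, -0.00062),
    ("20070129", 98.54, 0.0, 0.011185),
    ("20070130", 99.37, 0.0, 0.008423),
    ("20070131", 99.15, 0.0, -0.00221),
    ("20070201", 99.00, 0.0, -0.00151),
    ("20070202", 99.17, 0.0, 0.001717),
    ("20070205", 100.38, 0.0, 0.012201),
    ("20070206", 99.85, 0.0, -0.00528),
    ("20070207", 99.54, 0.3, -0.0001),
]


def holding_period_returns(rows):
    """Return {date: (price + dividend) / previous price - 1} for every row after the first."""
    out = {}
    for (_, prev_prc, _, _), (date, prc, div, _) in zip(rows, rows[1:]):
        out[date] = (prc + div) / prev_prc - 1.0
    return out


computed = holding_period_returns(ROWS)
reported = {date: ret for date, _, _, ret in ROWS[1:]}
max_gap = max(abs(computed[d] - reported[d]) for d in computed)
print(f"largest gap between computed and reported RET: {max_gap:.2e}")
```

Note the 20070207 row: the price fell by 0.31, but adding back the 0.30 dividend brings the return close to zero, which matches the reported value.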


Financial Markets and Prices (cont’)

• Trade and Quote (TAQ): Intraday transactions data (trades and quotes) for all securities listed on the New York Stock Exchange (NYSE) and American Stock Exchange (AMEX), as well as Nasdaq National Market System (NMS) and SmallCap issues.
• Data Example:

symbol  DATE      TIME     PRICE    SIZE
BRKB    7-Nov-02  9:37:48  2458     320
BRKB    7-Nov-02  9:37:49  2458     100
BRKB    7-Nov-02  9:37:51  2458     90
BRKB    7-Nov-02  9:37:51  2458     10
BRKB    7-Nov-02  9:37:56  2458     30
BRKB    7-Nov-02  9:39:13  2455.1   40
BRKB    7-Nov-02  9:39:21  2455.11  500
BRKB    7-Nov-02  9:41:10  2456     30
BRKB    7-Nov-02  9:41:35  2460     110
BRKB    7-Nov-02  9:42:24  2460     100
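Tick data of this kind is usually aggregated before analysis. As a minimal sketch, the trades above can be rolled up into one-minute bars with a volume-weighted average price (VWAP). The one-minute bucket and the function name `vwap_by_minute` are assumptions chosen for this example.

```python
from collections import defaultdict

# (TIME, PRICE, SIZE) for the BRKB trades of 7-Nov-02 shown above.
TRADES = [
    ("9:37:48", 2458.0, 320),
    ("9:37:49", 2458.0, 100),
    ("9:37:51", 2458.0, 90),
    ("9:37:51", 2458.0, 10),
    ("9:37:56", 2458.0, 30),
    ("9:39:13", 2455.1, 40),
    ("9:39:21", 2455.11, 500),
    ("9:41:10", 2456.0, 30),
    ("9:41:35", 2460.0, 110),
    ("9:42:24", 2460.0, 100),
]


def vwap_by_minute(trades):
    """Group trades by minute; return ({minute: VWAP}, {minute: total size})."""
    notional = defaultdict(float)
    volume = defaultdict(int)
    for time, price, size in trades:
        minute = time.rsplit(":", 1)[0]  # "9:37:48" -> "9:37"
        notional[minute] += price * size
        volume[minute] += size
    bars = {minute: notional[minute] / volume[minute] for minute in notional}
    return bars, dict(volume)


bars, volumes = vwap_by_minute(TRADES)
for minute in bars:
    print(minute, round(bars[minute], 4), volumes[minute])
```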


Financial Markets and Prices (cont’)

• Federal Reserve Banks: federal funds rate, interest rates, foreign exchange rates


Macroeconomics

• Government:

– US Census Bureau: income, economic census

– Bureau of Labor Statistics: wages, earnings,

labor productivity, consumer price index


Financial News

• Financial newspapers:
– Financial Times: archived by Academic OneFile and Lexis Nexis
– Wall Street Journal: archived by ABI/Inform
– Some current news articles are available from Yahoo News, Google News, and the newspapers' own websites
• Newswires:
– PR Newswire: archived by Lexis Nexis (timestamps precise to the minute) and General OneFile (full text and date)
– Reuters: to the best of my knowledge, no third-party archive is available
– Some current newswire articles are available from Yahoo Finance and the newswires' own websites (mostly with timestamps)
• Data format: HTML (can be spidered)


Financial Newspaper: WSJ

• The WSJ provides abstracts, classification codes, and the institutions involved
– Only for selected articles
– May be a good source of training data for entity extraction
• Data Example:

Title: Automotive Brief: Proton Holdings Bhd. (Eastern edition).
Date: Jun 1, 2006. pg. A13
Classification Codes: 9175 Western Europe, 9179 Asia & the Pacific, 8680 Transportation equipment industry
Companies: Proton Holdings Bhd, PSA Peugeot Citroen SA (NAICS: 336111 )
Column Name: Business Brief Publication
Abstract: (Document Summary) Malaysia's national car maker Proton Holdings Bhd. said it is pursuing alliance talks with France-based PSA Peugeot Citroen SA and plans to introduce six new models by 2008 after reporting sharply lower profit for its most recent fiscal year.
Full Text (140 words) Malaysia's national car maker Proton Holdings Bhd. said it is pursuing alliance talks with [document truncated]
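To use such records as training data for entity extraction, the labeled fields first have to be separated from the text. The sketch below parses the "Field: value" lines of the record above. The helper names (`parse_record`, `classification_codes`, `companies`) are hypothetical, and the parsing rules are assumptions based on this single example rather than a documented format.

```python
import re

RECORD = """\
Title: Automotive Brief: Proton Holdings Bhd. (Eastern edition).
Date: Jun 1, 2006. pg. A13
Classification Codes: 9175 Western Europe, 9179 Asia & the Pacific, 8680 Transportation equipment industry
Companies: Proton Holdings Bhd, PSA Peugeot Citroen SA (NAICS: 336111 )
"""


def parse_record(text):
    """Split each 'Field: value' line on the first ': ' only."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def classification_codes(fields):
    """Return (code, label) pairs from the comma-separated code list."""
    pairs = []
    for item in fields["Classification Codes"].split(", "):
        code, _, label = item.partition(" ")
        pairs.append((code, label))
    return pairs


def companies(fields):
    """Return company names with the trailing '(NAICS: ...)' annotation removed."""
    cleaned = re.sub(r"\s*\(NAICS:[^)]*\)", "", fields["Companies"])
    return [name.strip() for name in cleaned.split(",")]


fields = parse_record(RECORD)
print(classification_codes(fields))
print(companies(fields))
```

The company list could then serve as the entity labels for the abstract and full text of the same article.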


Social Network in the Wall Street Journal


Newswire: PR Newswire

• Lexis Nexis provides a complete historical collection for the past 7 years (timestamps displayed in New York time)
• A subset of news articles can be downloaded from Yahoo Finance (timestamps displayed in GMT)
• Comparing the number of articles from the two sources (year 2008):
– Yahoo Finance contains most of the articles
– Missing articles are mostly related to politics

               Jan. 21 (Monday)  Jan. 22  Jan. 23  Jan. 24  Jan. 25  Jan. 26  Jan. 27
Lexis Nexis    331               944      860      836      439      49       14
Yahoo Finance  348               899      810      770      379      10       13
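The claim that Yahoo Finance contains most of the articles can be quantified from the counts above. The sketch below computes a per-day and a weekly coverage ratio (Yahoo Finance count divided by Lexis Nexis count). The ratio is a rough measure introduced here, since the slide does not say how articles were matched between the two sources.

```python
DAYS = ["Jan. 21", "Jan. 22", "Jan. 23", "Jan. 24", "Jan. 25", "Jan. 26", "Jan. 27"]
LEXIS_NEXIS = [331, 944, 860, 836, 439, 49, 14]
YAHOO_FINANCE = [348, 899, 810, 770, 379, 10, 13]

# Per-day ratio of Yahoo Finance articles to Lexis Nexis articles.
coverage = {
    day: yahoo / lexis
    for day, lexis, yahoo in zip(DAYS, LEXIS_NEXIS, YAHOO_FINANCE)
}
weekly_coverage = sum(YAHOO_FINANCE) / sum(LEXIS_NEXIS)

for day, ratio in coverage.items():
    print(day, f"{ratio:.1%}")
print("week", f"{weekly_coverage:.1%}")
```

The ratio exceeds 100% on Jan. 21. One possible explanation, not confirmed by the slides, is that the two sources display timestamps in different time zones, so some articles fall on different calendar days.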


Newswire: PR Newswire

• Data Example (from Lexis Nexis)

Date: January 21, 2008 Monday 11:01 PM GMT
Title: Oil States Announces Fourth Quarter 2007 Earnings Conference Call; Wednesday, February 20, 2008 at 11:00 am Eastern Time
Length: 399 words
DATELINE: HOUSTON Jan. 21
Fulltext: HOUSTON, Jan. 21 /PRNewswire-FirstCall/ -- Oil States International (NYSE:OIS) announced today that it has scheduled its fourth quarter 2007 earnings conference call for Wednesday, February 20, 2008 at 11:00 am Eastern time. During the call, the company will discuss the results for the quarter ended December 31, 2007, which are expected to be released on February 19, after markets close.
[document truncated]
Web site: http://www.oilstatesintl.com/


Current News Collection

• Wall Street Journal (full text)
– 283,280 articles; 24,994 institutions
– 8/4/1999 to 3/2/2007
• New York Times
– 673,142 articles
– 1/1/2000 to 3/1/2007
• Washington Post
– 440,500 articles
– 1/1/2000 to 3/1/2007
• Financial Times
– 476,000 articles
– 1/1/2000 to 3/1/2007


Current Collection (cont’)

• PR Newswire (full text with date)

– 1,315,000 articles

– 1/1/2000 to 5/31/2007

• PR Newswire (title and timestamp)

– Collecting: 1/1/2000 to 5/31/2007

• PR Newswire (from Yahoo Finance; full text with timestamp)

– From 1/1/2008

– About 700 articles per day


Current Collection (cont’)

• Reuters Newswire (from Yahoo Finance;

full text with timestamp)

– From 1/1/2008

– About 300 articles per day

• Associated Press (from Yahoo Finance; full

text with timestamp)

– From 1/1/2008

– About 500 articles per day


Financial Data Sources for

Companies in Taiwan


Types of Financial Data

• Company Financials:

– Balance sheets

– Income statement

– Company management and ownership

– Earnings forecasts and analysts’ recommendations

– Mergers and acquisitions

– Audit information

– Banks and insurance companies

– Major company events

• Financial Markets and Prices:

– Stock prices

– Market indices and factors

– Mutual funds

– Bonds

– Derivatives


Types of Financial Data

• Macroeconomics:

– GDP, production indices, consumer price

index, wages, unemployment rate

• Financial News, Blogs and Forums

– Newspapers: Commercial Times (工商時報), Wealth Magazine (財訊), Wall Street Journal Chinese edition (中文版), …

– Newswire: N/A


Data Service Providers

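• Government
– Central Bank of the Republic of China (中央銀行)
– Executive Yuan agencies: Financial Supervisory Commission (金管會), Securities and Futures Bureau (證期局), and Directorate-General of Budget, Accounting and Statistics (主計處)
• Institutions
– Taiwan Stock Exchange (證券交易所)
– Securities and Futures Institute (證券暨期貨市場發展基金會)
– Taiwan Ratings Corporation (中華信用評等公司)
– Chung-Hua Institution for Economic Research (中華經濟研究院)
– Taiwan Economic Data Center (財團法人經濟資訊推廣中心, AREMOS)
• Commercial Data Services
– Taiwan Economic Journal (台灣經濟新報)
– Central Daily News full-text image database (中央日報全文影像資料庫)
– Central News Agency Chinese-English news database (中央通訊社中英文新聞資料庫)
– Knowledge Winner (知識贏家)
– UDN Data (聯合知識庫)
– Taiwan News Smart Web (臺灣新聞智慧網)
• Financial Web Sites
– Yahoo Finance Taiwan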


Data Types and Data Providers

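Columns: Company Financial, Financial Markets, News, Macroeconomics (one X per data type covered; the column position of each X is not recoverable from this transcript)

Taiwan Stock Exchange (證券交易所): X
Securities and Futures Institute (證券暨期貨市場發展基金會): X X
Central Bank (中央銀行): X
Financial Supervisory Commission (行政院金管會): X X
Securities and Futures Bureau (行政院證期局): X X
Directorate-General of Budget, Accounting and Statistics (行政院主計處): X
Chung-Hua Institution for Economic Research (中華經濟研究院): X
Taiwan Economic Data Center (財團法人經濟資訊推廣中心): X X X
Taiwan Economic Journal (台灣經濟新報): X X X
Yahoo Finance Taiwan: X X X X
Taiwan News Smart Web (臺灣新聞智慧網), Central Daily News full-text image database (中央日報全文影像資料庫), Central News Agency Chinese-English news database (中央通訊社中英文新聞資料庫), Knowledge Winner (知識贏家), UDN Data (聯合知識庫): X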


User Interface

• Most data service providers use a web interface
• Taiwan Economic Journal (經濟新報) and Taiwan Economic Data Center (財團法人經濟資訊推廣中心) use proprietary systems


Data Standards

• Government
– Data provided on government websites can usually be downloaded in CSV or HTML format
• Institutions
– Taiwan Stock Exchange (證券交易所): fixed-format text for submitting company financial data
• Commercial Data Services
– HTML, text, CSV
• Financial Websites, Blogs and Forums
– HTML, text, CSV


Company Financial Data

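• Market Observation Post System (公開資訊觀測站, operated by the Taiwan Stock Exchange)
– Company financial statements (公司財務報表)
– Company profile (公司概況)
– Changes in directors' and supervisors' shareholdings (董監股權異動)
– Operating overview (營運概況)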

Data Example:


Company Financial Data

• Yahoo Finance Taiwan

– Basic company information (基本公司資料)
– Revenue and earnings (營收盈餘)
– Dividend policy (股利政策)

Data Example:


Company Financial Data

• Company financial data are also available from the following data providers
– Securities and Futures Institute (證券暨期貨市場發展基金會)
– Taiwan Economic Data Center (財團法人經濟資訊推廣中心, AREMOS)
– Taiwan Economic Journal (台灣經濟新報)


Financial Markets and Prices

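• 資訊王 (Securities and Futures Institute, 證券暨期貨市場發展基金會)
– Daily trading summary for the centralized exchange market (individual stocks, sectors, overall market)
– OTC market trading summary
– Emerging Stock Board trading summary
• Similar information is available from Taiwan Economic Journal (台灣經濟新報) and AREMOS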

Data Example:


Macroeconomics

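• Government:
– Central Bank (中央銀行): exchange rate, interest rate
– Directorate-General of Budget, Accounting and Statistics (行政院主計處): consumer price index, unemployment rate, GDP
– Chung-Hua Institution for Economic Research (中華經濟研究院): economic growth predictions
– Taiwan Economic Data Center (財團法人經濟資訊推廣中心, AREMOS): GDP, interest rate, consumer price index, unemployment rate
– Taiwan Economic Journal (台灣經濟新報): GDP, interest rate, consumer price index, unemployment rate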


Individual Trading Data

• “We have acquired the complete transaction history of all traders on the TSE from January 1, 1995, through December 31, 1999. The trade data include the date and time of the transaction, a stock identifier, order type (buy or sell -- cash or margin), transaction price, number of shares, a broker code, and the identity of the trader.”
– Prof. Yu-Jane Liu (劉玉珍, NCCU Finance) and Prof. Yi-Tsung Lee (李怡宗, NCCU Accounting) in “Who Loses from Trade? Evidence from Taiwan,” under review


Financial News

• Financial newspaper:

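– UDN Data (聯合知識庫; United Daily News group: United Daily News, Economic Daily News, United Evening News, Business Weekly, Global Views Monthly, CommonWealth Magazine)
– Knowledge Winner (知識贏家; China Times group: China Times, Commercial Times, China Times Express)
– Taiwan News Smart Web (臺灣新聞智慧網; China Times, United Daily News, Economic Daily News, Min Sheng Daily, United Evening News, Star News, Commercial Times, Central Daily News, Liberty Times)
– Central Daily News full-text image database (中央日報全文影像資料庫)
– Central News Agency Chinese-English news database (中央通訊社中英文新聞資料庫)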

• Newswire :

– N/A

• Data format: HTML


Financial Blogs and Forums:

US and Taiwan


Integrated Web Platforms for Financial

Information

• Several platforms have been established to integrate various sources of financial information on one page.
– ValueWiki
• Created in March 2007
• Organized by company; each page shows all available information about a company and provides outward links to forums and websites
– BoardCentral
• Provides the most recent 10-20 messages from 13 major forums, including Yahoo! Finance, RagingBull, and InvestorVillage
– pfblogs.org
• Integrates 1,137 financial blogs
• Provides a single page for viewing all the latest postings from those blogs


ValueWiki

• Each page is a portfolio of a company
• Provides
– Instant stock price
– News feeds
– Background information
– Outward links to
• Relevant websites
• Rumor sites
• Forums


ValueWiki (Cont’d)

• ValueWiki also provides other types of services
– Blogs
– Instant chat
– Message boards
• The founders of ValueWiki have their own blog, which posts some valuable information, such as
– Top 100 financial blogs by Alexa and Technorati
– Top 50 Web 2.0 financial blogs by Alexa and Technorati
– Personal Top 60 financial blogs


BoardCentral

• Organized by companies or stocks
• Provides the most recent 10-20 messages from 13 popular forums, including
– Yahoo! Finance

– RagingBull

– Google Finance

– InvestorVillage

– The Motley Fool

– StockHouse

– ClearStation

– TheLion

– FreeRealTime.com

– msn.money

– SiliconInvestor


BoardCentral (Cont’d)

• It also provides other types of information
– Stock summary
– Stock news from
• Yahoo! Finance
• Google Finance
• MarketWatch
– StockCharts
– Competitors and related companies


pfblogs.org

• Provides postings from 1,137 financial blogs
• Currently has 110,966 entries
• Provides a search function and sorting by
– Personal Finance
– Real estate
– Investing
• When viewing articles, it links back to the original blogs


Combination of Forums and Blogs

• Some financial forum sites also provide space for personal blogs

– InvestorVillage

– Stockhouse

– TheLion

– msn.money

– Yahoo! Finance


Case study: Microsoft-Yahoo Bid

• We use the recent Microsoft-Yahoo acquisition bid to see how information was passed around in forums and blogs before the official announcement.
• Timeline
– Microsoft officially announced a $44.6 billion bid to buy Yahoo on 01/31/2008
– The news press reported it on 02/01/2008


Discussion in Forums

• Major discussion of the acquisition in forums started after 02/01/08, except on Yahoo! Finance

Date | Main Message | Trigger
01/01/08 | Yahoo! Being Prepped For Sale * | Blog Posting of Seeking Alpha
01/08/08 | More takeover talk @Seeking Alpha * | Blog Posting of Seeking Alpha
01/10/08 | NY Post says MSFT may bid for YHOO | New York Post
01/10/08 | MICROSOFT WILL BUY YAHOO!! |
01/10/08 | A Microsoft/Yahoo Rumor Right Before Yahoo Earnings |
01/10/08 | This rumor could be "real" this time |
01/11/08 | Yahoo Spokesperson Kara Swisher Denies Microsoft Rumor |
01/12/08 | Will Microsoft Pay $50 Billion For Yahoo – Maybe # |
01/13/08 | Motely: 5 reasons why MSFT will buy Yahoo % | Report from TheMotleyFool
01/13/08 | Yahoo Might Not Want To Be Bought |
01/24/08 | Bidding war over Yahoo! after earning | New York Post
01/25/08 | Turns out those rumors are true!! |
01/29/08 | MICROSOFT will buy YAHOO | By Stock Price
01/29/08 | take over happening soon… |
01/29/08 | MSFT to Buy YHOO |
01/30/08 | MICROSOFT WILLING TO BID FOR YAHOOOOOO |
01/30/08 | If MSFT doesn't buy YHOO at $30 |
01/31/08 | Has MSFT denied buyout interest? |
01/31/08 | Just sell the damn company to Bill Gates!! |


Triggers for Forum Discussion

• From the Yahoo! Finance forum, we can see five trigger points for discussion before the official announcement.
– 01/01/08
• Blog posting by Ashkan Karbasfrooshan on Seeking Alpha: “Is Yahoo! Being Prepped for a Sale”
– 01/10/08
• News article in the New York Post: “Microsoft Deal King to Launch Own Firm”
– 01/13/08
• Report by Rick Aristotle Munarriz on The Motley Fool: “5 Reasons Why Microsoft Will Buy Yahoo”
– 01/24/08
• News article in the New York Post: “Sharks Circle Yahoo! – LBO, Media Bigs Attracted to Battered Stock”
– 01/29/08
• Consistently decreasing stock price


Different Reactions from the Sources

• When the trigger is a news article, members show a more positive attitude and more active discussion.
• When the trigger is a reaction to the stock price or some member’s prophecy, the response is relatively skeptical.

[Figure panels: By News; By Stock Price]


Discussion in Blogs

• Seven blog postings were found on pfblogs.org.
• Except for the first posting on 01/01/08, blog postings were triggered by
– Other blog postings
– News articles

Date | Blog Posting | Blog | Trigger
01/01/08 | Is Yahoo! Being Prepped for a Sale? | Seeking Alpha |
01/04/08 | Response to Ashkan Karbasfrooshan's 'Is Yahoo Is Being Prepped for Sale?' | Seeking Alpha | By the first blog post
01/08/08 | Google's Search Share and Microsoft's Fast Acquisition | Seeking Alpha | By the news of Microsoft’s buying of FAST Search
01/10/08 | Yahoo Jumps On Renewed Rumors Of MSFT Bid | Barron’s | By the news article of New York Post
01/10/08 | Microsoft/Yahoo Takeover Talk: Here We Go Again | Seeking Alpha | By the news article of New York Post
01/10/08 | 5 Reasons Why Microsoft Will Buy Yahoo! | MotleyFool | By the news article of New York Post
01/11/08 | Yahoo: Bernstein Cuts Target; MSFT Deal Unlikely; Equity Stakes, Cash Now Valued More Highly Than Core Biz | Barron’s | By the news article of New York Post


Layers of Influence

• We can see a clear flow among stock price, news articles, blogs, and forums. The direction of the flow is related to the credibility of the media.

[Figure: timeline with columns Stock Price, News, Blogs, Forum; dates 01/01/08, 01/02/08, 01/08/08, 01/10/08, 01/24/08, 01/29/08]


Financial Forums in Taiwan

• Twelve active forums were found for the Taiwan stock market.
• Compared to the U.S. financial forums, Taiwan forums have four characteristics:
– Established as user services to support the host's main business
– Organized by buying strategies rather than by companies or stocks
– Focus on the sharing of star stocks
– Adopt a subscription and membership system

Forum | URL | Access
智富論壇 | smartnet.com.tw | Free
聲動討論 | 168.com.tw | Credit/Free
發財網 e-stock | estock.marbo.com.tw | Free
基智網 FundDJ | funddj.com | Credit/Free
理財網 MoneyDJ | moneydj.com | Credit/Free
Yahoo!奇摩股市 | tw.mb.yahoo.com | Free
聚財網 | www.wearn.com | Credit
理財經算網 RickMall | (none given) | Free
DigitalTimes 科技網 | digitimes.com.tw | Free
中時理財網 理財心經 | tb.chinatimes.com | Free
永不老論壇 | yongbulao.com | Credit/Free
23XX電子論壇 | 23xx.com.tw | Free


Main Business of Forum Hosts

• Financial/General Publication
– 智富論壇
– 聚財網
– 中時理財網
– DigitalTimes
• Financial Software and Services
– 聲動討論
– 發財網
– 基智網
– 理財網


Organization of Forums

• Taiwan financial forums, with the exception of 23xx.com.tw, discuss the overall stock market rather than specific companies or stocks.


Organization of Forums (Cont’d)

• 23xx.com.tw is the only forum focusing on individual companies in the Taiwan high-technology industry.


Sharing of Star Stocks

• Discussion in Taiwan is mostly triggered by stock prices and the overall market trend rather than by financial news.
– Focus on the sharing of star stocks.


Subscription and Membership System

• Some forums claim to have domain experts and insider information (明牌, "hot tips") and require a subscription to view those advanced postings.

– 聚財網

– 永不老論壇

– 聲動討論

– 基智網

– 理財網


Financial Blogs in Taiwan

• Financial bloggers in Taiwan can be found on popular blog or forum hosting sites.

– 無名小站 wretch.cc

– 痞客幫 pixnet.net

– 聚財網

– 基智網

– 中時理財網

• Some popular financial blogs may also require subscription or membership.


Future Directions and Research Opportunities


Possible Directions: Accounting and Risk Assessment

• Cumulative Abnormal Return (CAR) based on accounting indicators (e.g., unexpected earnings) and qualitative financial and corporate news (e.g., news announcements, events, positive/negative sentiments)
• Enterprise Risk Management (ERM) based on financial/corporate news; strategic risk and operational risk; news categories (e.g., merger, new product announcement, management change, lawsuit) and sentiments
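For reference, CAR over an event window is the sum of daily abnormal returns. The sketch below uses the simple market-adjusted model (abnormal return = stock return minus market return), which is one common choice and an assumption here; the slides do not say which benchmark model is intended. The return figures are made-up illustrative numbers.

```python
def abnormal_returns(stock_returns, market_returns):
    """Market-adjusted abnormal return for each day: R_t - R_m,t."""
    if len(stock_returns) != len(market_returns):
        raise ValueError("return series must have the same length")
    return [r - m for r, m in zip(stock_returns, market_returns)]


def cumulative_abnormal_return(stock_returns, market_returns):
    """Sum of abnormal returns over the event window."""
    return sum(abnormal_returns(stock_returns, market_returns))


# Hypothetical daily returns for a 3-day event window (illustrative numbers only).
stock = [0.010, 0.025, -0.005]
market = [0.004, 0.006, -0.002]
car = cumulative_abnormal_return(stock, market)
print(round(car, 4))
```

In a news-based study, the event window would be centered on the announcement timestamp, and CAR could then be related to news category or sentiment.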


Possible Directions: Corporate Governance

• Social network analysis based on corporate board members and their affiliations; executive compensation and corporate governance
• Identify possible illegal insider trading activities; isolate news-sensitive traders and news-neutral traders
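A board-member network of this kind can be built by linking two firms whenever a director sits on both boards. The sketch below is a minimal illustration with placeholder directors and firms (not real data); the function names `interlock_network` and `degree` are introduced here for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical director -> boards mapping (placeholder names, not real people or firms).
BOARD_SEATS = {
    "Director A": {"Firm X", "Firm Y"},
    "Director B": {"Firm Y", "Firm Z"},
    "Director C": {"Firm X", "Firm Y"},
    "Director D": {"Firm Z"},
}


def interlock_network(board_seats):
    """Edge weight for (firm1, firm2) = number of directors sitting on both boards."""
    edges = defaultdict(int)
    for firms in board_seats.values():
        for pair in combinations(sorted(firms), 2):
            edges[pair] += 1
    return dict(edges)


def degree(edges):
    """Number of distinct firms each firm is linked to."""
    deg = defaultdict(int)
    for firm1, firm2 in edges:
        deg[firm1] += 1
        deg[firm2] += 1
    return dict(deg)


edges = interlock_network(BOARD_SEATS)
degrees = degree(edges)
print(edges)
print(degrees)
```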


Possible Directions: Corporate Sentiments

• Customer sentiment tracking (forums and blogs) for corporate going concerns
• News blogs and forums vs. newswire: How do discussions in forums and blogs interact with breaking news and company performance?
• Use web data to study “infectious behaviors” in forums and blogs; identify web opinion leaders and their impacts


Possible Directions: Stock Advice

• Stock advisory system: Recommend stock trading strategy based on breaking news (news category and sentiment) and corporate assessment

• Stock co-movement analysis: Linking co-occurrence of news articles to the co-movement of stock prices
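
The co-movement idea can be sketched by counting how often two tickers appear in the same article and comparing that count with the correlation of their returns. The tickers, articles, and returns below are invented, and the choice of Pearson correlation as the co-movement measure is an assumption.

```python
from collections import Counter
from itertools import combinations
from statistics import mean

def news_cooccurrence(articles):
    """Count, per ticker pair, the articles that mention both tickers."""
    counts = Counter()
    for tickers in articles:
        for pair in combinations(sorted(set(tickers)), 2):
            counts[pair] += 1
    return counts

def pearson(xs, ys):
    """Pearson correlation of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: tickers mentioned per article, and daily returns.
articles = [["AAA", "BBB"], ["AAA", "BBB", "CCC"], ["CCC"], ["BBB", "AAA"]]
returns = {
    "AAA": [0.01, -0.02, 0.03, 0.00],
    "BBB": [0.02, -0.04, 0.06, 0.00],
    "CCC": [-0.01, 0.02, -0.03, 0.00],
}

cooc = news_cooccurrence(articles)
for (a, b), n in sorted(cooc.items()):
    print(a, b, n, round(pearson(returns[a], returns[b]), 2))
```

With real data, the interesting question is whether pairs with high co-occurrence counts also show systematically stronger return correlation.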


Web Marketing Research

Hsinchun Chen & Bob Lusch

University of Arizona


Overview

• Sentiment index: Michigan Consumer Sentiment Survey, BrandIndex.com

• Marketing tools: MarketTools, TrendIQ, Passenger

• Web sentiment and opinion: Blogspot, eBlogger, Technorati, ProgrammableWeb.com, Epinions.com

• Web marketing research opportunities


Michigan CSI

• University of Michigan Consumer Sentiment Survey, since 1952 (monthly)

• 500 telephone interviews in the US per month

• Five questions:

– Q1…financial…better off/worse off than a year ago…?

– Q2…a year from now…better off/worse off financially…?

– Q3…business conditions in the country…good times/bad times financially…?

– Q4…country as a whole…next five years…good times or unemployment…?

– Q5…big things people buy for their homes…now is a good time to buy…?


Michigan CSI (cont’d)

• Index of Consumer Sentiment (ICS): Q1-Q5

• Index of Current Economic Conditions (ICC): Q1 and Q5

• Index of Consumer Expectations (ICE): Q2-Q4
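
The three indices can be computed from the five question scores roughly as sketched below. Each question is first reduced to a relative score (percent favorable minus percent unfavorable, plus 100). The slides do not give the formulas; the base-period divisors and the added constant 2.0 used here follow the survey's published index calculation as commonly cited and should be checked against the official documentation. The function names and sample percentages are assumptions.

```python
def relative_score(pct_favorable, pct_unfavorable):
    """Relative score for one question: favorable minus unfavorable, plus 100."""
    return pct_favorable - pct_unfavorable + 100.0

def michigan_indices(scores):
    """scores: relative scores for Q1..Q5, in order. Returns (ICS, ICC, ICE)."""
    x1, x2, x3, x4, x5 = scores
    # Divisors are base-period totals; 2.0 is a sample-design adjustment.
    # Both are assumptions to verify against the survey's own documentation.
    ics = (x1 + x2 + x3 + x4 + x5) / 6.7558 + 2.0
    icc = (x1 + x5) / 2.6424 + 2.0
    ice = (x2 + x3 + x4) / 4.1134 + 2.0
    return ics, icc, ice

# Made-up monthly percentages (favorable, unfavorable) for Q1..Q5.
answers = [(40, 30), (35, 20), (45, 40), (38, 42), (60, 25)]
scores = [relative_score(f, u) for f, u in answers]
ics, icc, ice = michigan_indices(scores)
print(round(ics, 1), round(icc, 1), round(ice, 1))
```

A Web-derived index would need an analogous mapping from favorable and unfavorable post counts to a relative score before it could be compared with the survey series.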


Michigan CSI: Research Opportunities

• Automated Web collection and sentiment analysis of consumer confidence

• What forums, blogs, etc. to collect, and where?

• Experimental validation (correlation) of historical Michigan CSI vs. past sentiment of Web blogs and forums

• Experiment on world events-Web sentiment correlation

• Company Web sentiment index (based on news, blogs, forums) vs. stock performance? Contrarian Sentiment Index for stock prediction?
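
The validation experiment proposed above could begin with a lagged correlation between a monthly Web sentiment series and the survey index, to see whether Web sentiment leads the survey. The sketch below uses invented series and assumes both are already aligned by month; the function names are hypothetical.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def lagged_correlations(web, survey, max_lag):
    """Correlate web[t] with survey[t + lag] for lag = 0..max_lag."""
    result = {}
    for lag in range(max_lag + 1):
        xs = web[:len(web) - lag]   # drop the last `lag` web observations
        ys = survey[lag:]           # drop the first `lag` survey observations
        result[lag] = pearson(xs, ys)
    return result

# Made-up monthly series; the survey is built to follow the web score by one month.
web = [1, 3, 2, 5, 4, 6, 5, 8]
survey = [60, 60, 80, 70, 100, 90, 110, 100]

corr = lagged_correlations(web, survey, max_lag=2)
best_lag = max(corr, key=corr.get)
print(best_lag, {k: round(v, 2) for k, v in corr.items()})
```

A high correlation at a positive lag would suggest that Web sentiment anticipates the survey, which is the case worth testing on historical data.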


BrandIndex.com

• A UK-based company; tracking over 1,100 consumer brands across 32 sectors on a 7-point scale

• Based on 2,000 online interviews/surveys per day from a panel of 200,000 (polling research)

• Seven points: buzz, general impression, quality, value, satisfaction, recommend, corporate reputation.


BrandIndex.com: Research Opportunities

• Web-based collection and sentiment analysis of product comments (news forums, blogs)

• Correlating with breaking news and events on products and companies

• Correlating with Epinions.com consumer sentiment evaluations

• Automating analysis of specific critiques of products and reasons


Marketing Tools

• Most of these companies develop online survey and marketing analysis tools for corporate clients

• MarketTools: Online survey tools and communities; claims to have Internet text analysis ability for 50M sites (no evidence)

• TrendIQ: Analyzes market shares, buzz trends, sentiment scoring, relationship ID, Internet share analysis, etc. Some graphing tools, but little evidence of capabilities or success (web site and results unimpressive)

• PeopleTrend.com: Powered by TrendIQ; Presidential Election Heat Map, CEOs of large companies (unimpressive)

• Passenger: builds online branded communities


Marketing Tools: Research Opportunities

• Need to focus on convincing business cases and scenarios

• Need to provide good (understandable, insightful) visualizations for results


Web Sentiment and Opinions

• Many blog creation and hosting sites

• eBlogger: blog creation

• WordPress, Blogspot

• Where to find top bloggers in selected topics?

• Where to identify major forums for product, company, event, etc. opinions?


ProgrammableWeb.com

• A major hub for Web Mashups; More than 650 Web APIs and 2700 Mashups; API Directory, Mashup Directory, Market Trends, Major Players

• Top Mashup Tags grouped by category

• Web 2.0 API Directory grouped by category, e.g., advertising, news, sports, health, maps, etc.

• Some major APIs: Financial APIs (25), News (10), Government (13), Medical (5), Shopping (32), Sports (5), etc. Each API site has detailed API information and examples for implementation.

• Most popular Mashups; Searching tag cloud


ProgrammableWeb.com: Research Opportunities

• Excellent site for identifying data sources for various applications, e.g., business, sports, medical, etc.

• Good integration of data sources and visualization for web

• What data/web mining opportunities?


Technorati.com

• Many useful blog resources and data: Top 100 blogs, Top Tags, Popular, Ping, Widgets, Watchlist, Photos, Videos, etc.

• Blog directory grouped by topics, including: business, economy, stocks, sports, consumer products, health, politics, etc.

• Top news, videos, movies, etc.

• Top Tags for each blog (tag cloud)

• Automatic Ping support

• Mentions of tagged topic by day

• Widgets: blog searching and info, pinging


Technorati.com: Research Opportunities

• Automatic spidering of top blogs and contents by topics

• Pinging of new content

• Promising for products, companies, politics, health topics, environmental issues

• Trackback of popular blogs to develop social networks of communities

• How about international, multilingual blogs, e.g., Taiwan, Japan, Arabic, etc.?

• How about analysis of popular videos and tags for specific products?


EPinions.com

• A service of Shopping.com (an eBay company); members are paid to provide quality, meaningful web reviews/comments for various product categories; uses a Web of Trust (of trusted people)

• Reviews are grouped by category, e.g., computers, cars, cameras, personal finance, sports, etc.

• Most reviews contain Rating (overall 1-5 and sub-categories), Pros, Cons, and free-text comments (specific to product)

• Reviews also link to product information, e.g., specs, pricing, vendors, etc.


EPinions.com: Research Opportunities

• Excellent source for training English sentiment polarity analysis algorithms – correlating free-text comments with rating scores

• Immediate experiment on English product-specific sentiment analysis algorithms

• Generic polarity analysis engine or product-specific polarity analysis engine?

• How to identify product feature like/dislike and reasoning based on product specs information (what do you like about this)?

• What about other languages, e.g., Taiwanese, Arabic, etc.?

• What about sentiment visualization?
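
To make the first opportunity concrete, here is a minimal sketch of training a polarity classifier from reviews labeled by their own ratings. Naive Bayes with add-one smoothing is used purely as an illustration; the slides do not name an algorithm. The rating thresholds (4-5 positive, 1-2 negative, 3 discarded) and the sample reviews are assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def label_from_rating(rating):
    """Assumed mapping: 4-5 stars positive, 1-2 negative, 3 ignored."""
    if rating >= 4:
        return "pos"
    if rating <= 2:
        return "neg"
    return None

def train(reviews):
    """reviews: iterable of (free_text, overall_rating)."""
    word_counts = {"pos": Counter(), "neg": Counter()}
    doc_counts = Counter()
    for text, rating in reviews:
        label = label_from_rating(rating)
        if label is None:
            continue
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts

def classify(text, model):
    """Naive Bayes with add-one smoothing; needs both classes in training."""
    word_counts, doc_counts = model
    vocab = set(word_counts["pos"]) | set(word_counts["neg"])
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in ("pos", "neg"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up reviews in the (comment, rating) form described on the previous slide.
reviews = [
    ("great camera sharp pictures", 5),
    ("love the battery life", 4),
    ("average nothing special", 3),
    ("terrible battery broke quickly", 1),
    ("poor pictures and bad support", 2),
]
model = train(reviews)
print(classify("great pictures love it", model))
print(classify("terrible support broke", model))
```

Training one model per product category instead of a single model over all reviews would be a direct way to test the generic versus product-specific question raised above.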


Future Directions

• How/where to identify Web data sources for various topics (business, company, product, health, politics, environment)?

• What are the major news sources, forums, and bloggers?

• Need to develop and test sentiment analysis algorithms for various topics

• Need to focus on selected topics: company/product, environment, politics, health

• How about Taiwanese and Arabic contents?

• How about visualization?


Future Directions

• Web Sentiment Index for companies and products

• WalMart Corporate Sentiment Tracking (“Save Money Live Better”; Go Green Expo)

• Green 100 Index based on environmental concerns and activities

• Online branding community tracking


Forming Research Partnership

• Great research opportunities for data, text and web mining in finance, accounting, and marketing applications: Business Intelligence 2.0

• Need for domain expertise and problem framing, e.g., ERM, Corporate Governance, Consumer Sentiment, etc.

• Much progress in computational techniques, e.g., web site/forum/blog spidering, text indexing, sentiment analysis, classification techniques, visualizations and integrated systems

• Publications and industry opportunities!!!


Hsinchun Chen …

Artificial Intelligence Lab, Dark Web Project …

[email protected]

http://ai.arizona.edu …

