Core Information Processing Technologies Technical Presentation & Demos

Miha Grčar (Department of Knowledge Technologies, Jožef Stefan Institute)

Achim Klein (University of Hohenheim)


Luxembourg, November 2011

Technical WPs


[Diagram: work package overview]
Management (WP10)
Dissemination & Exploitation (WP9)
WP2 & WP7: Architecture, Integration & Scaling Strategy
WP3, WP4, WP6: Ontology Infrastructure; Information Extraction; Sentiment Analysis; Decision Support Infrastructure; Domain-independent GUI (Open Source)
WP5: Information Integration; Data, Information & Knowledge Base
WP1 & WP8: UC#1 Market Surveillance; UC#2 Reputational Risk Management; UC#3 Online Retail Brokerage
Data Acquisition (we are here)

FIRST Y1 Review Meeting

Data acquisition pipeline (Dacq)


[Diagram: data acquisition pipeline]
RSS readers (one reader per site)
Load balancing across parallel processing pipelines
Each processing pipeline: Boilerplate remover, Language detector, Duplicate detector, Natural language preprocessing, Semantic annotator, ZeroMQ emitter
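The stage order and the load balancing shown in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not the Dacq implementation: every function name here is hypothetical, the stages are placeholders, round-robin dispatch is an assumed load-balancing policy, and a plain list stands in for the ZeroMQ emitter.

```python
from itertools import cycle

def remove_boilerplate(doc):
    # Placeholder heuristic (assumption): keep lines with at least four tokens.
    lines = [ln for ln in doc["text"].splitlines() if len(ln.split()) >= 4]
    return {**doc, "text": "\n".join(lines)}

def detect_language(doc):
    # Placeholder: a real detector would inspect the text.
    return {**doc, "lang": "en"}

def make_duplicate_detector(seen):
    # Exact-match check only; near-duplicate detection is not modeled here.
    def detect_duplicates(doc):
        if doc["text"] in seen:
            return None  # drop the document
        seen.add(doc["text"])
        return doc
    return detect_duplicates

def preprocess(doc):
    # Stand-in for natural language preprocessing: whitespace tokenization.
    return {**doc, "tokens": doc["text"].split()}

def annotate(doc):
    # Stand-in for the semantic annotator: no annotations are produced.
    return {**doc, "annotations": []}

def make_pipeline(seen, emitted):
    stages = [remove_boilerplate, detect_language,
              make_duplicate_detector(seen), preprocess, annotate]
    def run(doc):
        for stage in stages:
            doc = stage(doc)
            if doc is None:
                return
        emitted.append(doc)  # stand-in for the ZeroMQ emitter
    return run

def dispatch(docs, pipelines):
    # Assumed policy: hand documents to the pipelines in round-robin order.
    for doc, pipeline in zip(docs, cycle(pipelines)):
        pipeline(doc)
```

Sharing one `seen` set across pipelines is a simplification that lets duplicates be caught regardless of which pipeline receives them; the slides do not say how this is handled in the running system.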




Demo video (3:20)



Boilerplate removal

Demo video (1:30)





Language detection

Motivation: language-specific text analysis components
Relatively simple problem
Solutions based on word or character sequences (language models)
Side effects: removes "garbage" and can be used to identify the code page
Our implementation is based on frequencies of character sequences
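A detector based on character-sequence frequencies can be sketched as follows. The slides do not give the sequence length or the comparison measure, so character trigrams, cosine similarity, and the `min_similarity` threshold are all assumptions, and the function names are hypothetical. Returning `None` for low-similarity input illustrates the "garbage" removal side effect mentioned above.

```python
import math
from collections import Counter

def ngram_profile(text, n=3):
    # Relative frequencies of character n-grams over whitespace-normalized text.
    text = " " + " ".join(text.lower().split()) + " "
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()} if total else {}

def cosine(p, q):
    dot = sum(v * q.get(gram, 0.0) for gram, v in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def detect_language(text, profiles, min_similarity=0.1):
    # Return the best-matching language, or None if nothing is similar enough.
    profile = ngram_profile(text)
    scores = {lang: cosine(profile, ref) for lang, ref in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_similarity else None

# Toy reference profiles; real ones would be built from large corpora.
PROFILES = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog and "
                        "the market is expected to rise this week"),
    "de": ngram_profile("der schnelle braune fuchs springt über den faulen "
                        "hund und der markt wird diese woche steigen"),
}
```

With these toy profiles, `detect_language("the market is rising", PROFILES)` selects English, while a string of digits and symbols shares no trigrams with either profile and is rejected.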


Demo video (0:45)






Near-duplicate detection

Why is this a difficult problem?
We are dealing with millions of documents: we cannot afford to compare every document with every other document
We are also looking for near-duplicates, not only exact matches
Overlooked boilerplate "produces" false near-duplicates


Demo video (1:00)

Near-duplicate detection

Existing approaches like SimHash, shingling and sketching, SpotSigs…
Apart from SpotSigs, they require "clean" documents
Hard to interpret similarity value (how many characters, words, sentences?)

Developing a novel solution to remove boilerplate and detect duplicates [with clear interpretation] in the same framework
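To make the interpretation problem concrete, here is a minimal sketch of one of the existing approaches named above, word shingling with Jaccard similarity. It is not the novel solution under development; the function names and the shingle size `k=3` are illustrative choices.

```python
def shingles(text, k=3):
    # Set of overlapping word k-grams ("shingles") of a document.
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    # Size of the intersection divided by size of the union.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc_a = "stocks fell sharply on monday as investors fled risk"
doc_b = "stocks fell sharply on tuesday as investors fled risk"
doc_c = "the central bank kept interest rates unchanged"

print(jaccard(shingles(doc_a), shingles(doc_b)))  # 0.4
print(jaccard(shingles(doc_a), shingles(doc_c)))  # 0.0
```

Changing a single word out of nine drops the similarity to 0.4, because that word takes part in three of the seven shingles. This is the kind of value that is hard to map back to "how many words differ". Comparing all pairs this way is also quadratic in the number of documents, which is why sketching techniques are used at scale.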



FIRST ontology

SentimentObject FinancialInstrument Index Stock_Index Stock Company Country


Seed indices

Constituents (stocks)

Companies

Countries

FIRST ontology

:NASDAQ_100 a :Stock_Index ;
    rdfs:label "NASDAQ-100" .

:MICROSOFT a :Stock ;
    rdfs:label "MICROSOFT CORP COM USD0.00000625" ;
    :memberOf :NASDAQ_100 .

:MICROSOFT_CORP a :Company ;
    rdfs:label "Microsoft Corp." ;
    :issues :MICROSOFT .

:USA a :Country ;
    rdfs:label "USA" .

:MICROSOFT_CORP :locatedIn :USA .

:MICROSOFT_CORP :hasGazetteer :MICROSOFT_CORP_Gazetteer .

:MICROSOFT_CORP_Gazetteer a :Gazetteer ;
    :hasTerm "Microsoft Corp" ;
    :hasTerm "Microsoft Corporation" ;
    :hasStopWord "CORP" ;
    :hasStopWord "CORPORATION" .


Microsoft Corporation is engaged in developing, licensing and supporting a range of software products and services. Microsoft also designs and sells hardware, and delivers online advertising to the customers.

Microsoft Corp
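The way gazetteer terms link a text span to an ontology instance can be sketched as a simple lookup. This is an assumed, simplified matching scheme rather than the annotator used in the project: the `GAZETTEERS` dictionary and `annotate` function are hypothetical, matching is case-insensitive on whole words, and the role of `:hasStopWord` is not modeled because the slides do not describe it.

```python
import re

# Terms copied from the :MICROSOFT_CORP_Gazetteer example above.
GAZETTEERS = {
    ":MICROSOFT_CORP": ["Microsoft Corp", "Microsoft Corporation"],
}

def annotate(text, gazetteers):
    # Return (start, end, instance) for every whole-word term match.
    annotations = []
    for instance, terms in gazetteers.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for match in re.finditer(pattern, text, flags=re.IGNORECASE):
                annotations.append((match.start(), match.end(), instance))
    return sorted(annotations)

text = "Microsoft Corporation is engaged in developing software."
print(annotate(text, GAZETTEERS))  # [(0, 21, ':MICROSOFT_CORP')]
```

The word-boundary anchors keep the shorter term "Microsoft Corp" from matching inside "Microsoft Corporation", so each mention yields one annotation.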

FIRST ontology

[Diagram: ontology classes and properties]
Classes: Sentiment Object, Company, Financial Instrument, Indicator, MacroIndicator, MicroIndicator, Technical, Fundamental, Feature, Volatility, Price, Reputation, CorrelationDefinition, OrientationPhrase
Properties: objectHasCorrelationDefinition, correlationDefinitionInfluencesObject, featureHasCorrelationDefinition, correlationDefinitionInfluencesFeature, indicatorHasCorrelationDefinition, correlationDefinitionInfluencesIndicator

Annotation pipeline



Ontology-based semantic annotation


Demo video (3:00)



Sentiment Analysis

Object: Sentiment in financial web texts

Problem: Classification of sentiment orientation with respect to expected future …
  price change of financial instruments
  volatility change of financial instruments
  reputation change of companies

Approach: Knowledge-based sentiment classification
  Starting at the sentence level
  Specific to features of objects (e.g., reputation of a company)

Example

Ambiguity: "The low clarity of messages implies that quite often people would be likely to disagree on the classification" [Das and Chen 2007].

Identification and differentiation of objects (and features)

Relationships of indicators (e.g., earnings) and objects

Short term: uptrend
Support for the SPX remains at 848 and then 789, with resistance at 912 and then 935. Short term momentum was overbought during the rally early in the week and is now displaying a positive divergence at friday's lows. Should the market fail to hold this pivot (SPX 840) in the days and weeks ahead the uptrend is likely over.

Long term: bear market
The Cycle wave bear market of October 2007 continues. Thus far, equity markets worldwide have declined on average about 50%. The opportunity still remains for the US and World economies to avoid a devastating Supercycle bear market like that of 1929-1932.

http://caldaroew.spaces.live.com/Blog/cns!D2CB8C5EBA2ADE86!27847.entry

Manual Sentiment Annotation


Topic

FIRST Knowledge-based Sentiment Analysis Approach

1. Identify all sentiment objects and features (rules, ontology)

2. Extract all sentence-level sentiments (rules, ontology)

3. Classify sentiment orientation {positive, negative} for all sentence-level sentiments

4. Aggregate
  All sentiments in one document: scoring gives the document-level sentiment score
  All sentiment scores for a given day: averaging gives the Sentiment Index [-1,1]

Example sentence: "Support for the SPX remains at 848 and then 789, with resistance at 912 …"
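The aggregation step can be illustrated with a small sketch. The slides name the two operations (scoring per document, averaging per day) but not the scoring formula, so the formula below, (positive - negative) / (positive + negative), is an assumption chosen because it keeps scores in [-1, 1]; both function names are hypothetical.

```python
from statistics import mean

def document_score(orientations):
    # Assumed scoring: balance of positive and negative sentence-level sentiments.
    pos = orientations.count("positive")
    neg = orientations.count("negative")
    total = pos + neg
    return (pos - neg) / total if total else 0.0

def sentiment_index(document_scores):
    # Average of all document-level scores for a given day.
    return mean(document_scores) if document_scores else 0.0

day = [
    ["positive", "positive", "negative"],  # one document's sentence-level sentiments
    ["negative", "negative"],
    ["positive"],
]
scores = [document_score(doc) for doc in day]
index = sentiment_index(scores)
```

Because every document score lies in [-1, 1], their average does too, which matches the range given for the Sentiment Index.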

Sentence-level Sentiment Classification

a) directly
  Example: "I expect the S&P 500 to rise"
  positive sentiment
  Addressed by rules

b) indirectly, via an indicator
  Example: "I think U.S. interest rates will rise"
  negative sentiment
  Addressed by ontology
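The two cases can be sketched with a toy lexicon. Everything in this block is a hypothetical stand-in: the real rules and ontology content are not shown on the slides, the phrase list is illustrative, and the negative sign for interest rates is taken from the example above. Substring matching is a simplification that a real system would replace with proper tokenization.

```python
# Hypothetical orientation phrases: +1 for an expected increase, -1 for a decrease.
ORIENTATION_PHRASES = {"rise": 1, "fall": -1}

# Hypothetical sentiment objects handled directly by rules.
OBJECTS = ["s&p 500"]

# Assumed correlation sign between an indicator and the object's price,
# standing in for a CorrelationDefinition in the ontology.
INDICATOR_CORRELATION = {"interest rates": -1}

def classify(sentence):
    text = sentence.lower()
    direction = next(
        (sign for phrase, sign in ORIENTATION_PHRASES.items() if phrase in text),
        None,
    )
    if direction is None:
        return None  # no orientation phrase found
    if any(obj in text for obj in OBJECTS):
        polarity = direction  # case a) direct statement about the object
    else:
        correlation = next(
            (c for ind, c in INDICATOR_CORRELATION.items() if ind in text),
            None,
        )
        if correlation is None:
            return None  # neither an object nor a known indicator
        polarity = direction * correlation  # case b) indirect, via an indicator
    return "positive" if polarity > 0 else "negative"

print(classify("I expect the S&P 500 to rise"))           # positive
print(classify("I think U.S. interest rates will rise"))  # negative
```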

Example Text: S&P 500

http://business.financialpost.com/2011/10/04/economic-uncertainty-could-fan-volatility/
Oct 4, 2011 – 3:24 PM ET

The fourth quarter began on Monday with the broad S&P 500 on the precipice of a bear market and investors lacking confidence in either European or U.S. policymakers being able to stem the disquiet surrounding the debt crisis.

Wall Street typically defines a bear market as a drop of 20 percent or more from a recent high.

Volatility is at its most persistently elevated since the financial crisis of 2008, as measured by the popular VIX, or CBOE Volatility Index. Barring a knock-out U.S. earnings period in the next month, it could remain high, and investors should brace for wild swings and more down days.

Sentiment Sentences on Price Change of S&P 500


Negative sentiment about the future price change of the S&P 500

Sentiment Sentences on Volatility Change of S&P 500


Positive sentiment about the future volatility change of the S&P 500

Document-level Sentiment with respect to multiple objects/features

Initial Experiment Results

Accuracy of knowledge-based sentiment classification vs. standard machine learning methods
  Small manually classified corpus
  Result: 7% more accurate

Portfolio selection experiment
  Use sentiment to select Dow Jones stocks
  Result: Excess returns seem possible

More information in paper: "Extracting Investor Sentiment from Weblog Texts: A Knowledge-based Approach", published in IEEE CEC 2011 conference proceedings


Main Y1 Achievements

Data acquisition software running

Sentiment analysis for web texts
  Sentence-level, and specific to features of objects
  Initial experiment results are promising

Ontology available (~4000 instances)

Sentence-level annotated corpus available (900 documents and growing)

Delivered as D3.1 and D4.1

Book chapter on data acquisition in preparation

Paper on sentiment extraction (best paper award at CEC 2011 conference)


Next Steps

Improve ontology and gazetteers

Use corpus to improve sentiment classification

Increase throughput of sentiment extraction


Thank you
