Core Information Processing Technologies Technical Presentation & Demos

Page 1: Core Information Processing Technologies Technical  Presentation  & Demos

Miha Grčar (Department of Knowledge Technologies, Jožef Stefan Institute)

Achim Klein (University of Hohenheim)

Core Information Processing Technologies

Technical Presentation & Demos

Luxembourg, November 2011

Page 2: Core Information Processing Technologies Technical  Presentation  & Demos

Technical WPs

Luxembourg, Nov 2011

[Diagram: technical work packages. Labels: Architecture, Integration & Scaling Strategy (WP2 & WP7); Management (WP10); Dissemination & Exploitation (WP9); WP3, WP4, WP6 (Data Acquisition, Ontology Infrastructure, Information Extraction, Sentiment Analysis, Decision Support Infrastructure, Domain-independent GUI (Open Source)); WP5 (Information Integration; Data, Information & Knowledge Base); WP1 & WP8 (UC#1 Market Surveillance, UC#2 Reputational Risk Management, UC#3 Online Retail Brokerage). "We are here": Data Acquisition.]

FIRST Y1 Review Meeting

Page 3: Core Information Processing Technologies Technical  Presentation  & Demos

Data acquisition pipeline (Dacq)

[Diagram: RSS readers (one reader per site) feed, via load balancing, several parallel processing pipelines. Each pipeline consists of the stages Boilerplate remover, Language detector, Duplicate detector, Natural language preproc., Semantic annotator, ZeroMQ emitter.]
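The layout on this slide (readers feeding load-balanced pipelines of sequential stages) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Dacq implementation: the stage bodies are placeholders, the natural language preprocessing and semantic annotation stages are omitted, round-robin is an assumed load-balancing policy, the duplicate detector shares one `seen` set across pipelines, and a plain list stands in for the ZeroMQ emitter. All function names are hypothetical.

```python
from itertools import cycle

def remove_boilerplate(doc):
    # Placeholder: a real stage would strip navigation, ads and similar page furniture.
    doc["text"] = doc["raw"].strip()
    return doc

def detect_language(doc):
    # Placeholder: tags every document as English.
    doc["lang"] = "en"
    return doc

def make_duplicate_detector(seen):
    def detect_duplicate(doc):
        if doc["text"] in seen:
            return None              # drop exact repeats from the stream
        seen.add(doc["text"])
        return doc
    return detect_duplicate

def make_emitter(outbox):
    def emit(doc):
        outbox.append(doc)           # stand-in for the ZeroMQ emitter
        return doc
    return emit

def run_pipeline(stages, doc):
    # Pass the document through each stage; a stage returning None drops it.
    for stage in stages:
        doc = stage(doc)
        if doc is None:
            return None
    return doc

def dispatch(docs, pipelines):
    # Assumed policy: round-robin assignment of documents to pipelines.
    for doc, stages in zip(docs, cycle(pipelines)):
        run_pipeline(stages, doc)

outbox, seen = [], set()
pipelines = [
    [remove_boilerplate, detect_language,
     make_duplicate_detector(seen), make_emitter(outbox)]
    for _ in range(3)
]
feed = [{"raw": " Stocks rally "}, {"raw": "Stocks rally"}, {"raw": "Bonds slip"}]
dispatch(feed, pipelines)
```

Sharing the `seen` set is a simplification: with fully independent pipelines, a duplicate routed to a different pipeline than its original would go unnoticed.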

Page 4: Core Information Processing Technologies Technical  Presentation  & Demos

Data acquisition pipeline (Dacq)

Demo video (3:20)

Page 5: Core Information Processing Technologies Technical  Presentation  & Demos

Data acquisition pipeline

[Diagram: the data acquisition pipeline, as on Page 3.]

Page 6: Core Information Processing Technologies Technical  Presentation  & Demos

Boilerplate removal

Demo video (1:30)

Page 7: Core Information Processing Technologies Technical  Presentation  & Demos

Data acquisition pipeline

[Diagram: the data acquisition pipeline, as on Page 3.]

Page 8: Core Information Processing Technologies Technical  Presentation  & Demos

Language detection

Motivation: language-specific text analysis components

Relatively simple problem

Solutions based on word or character sequences (language models)

Side effects: removes “garbage” and can be used to identify code page

Our implementation is based on frequencies of character sequences

Demo video (0:45)
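A minimal sketch of language detection from character-sequence frequencies, assuming character trigrams compared by cosine similarity. The slide does not give the sequence length, the similarity measure, or the training data, so those are all assumptions here; the three one-sentence `SAMPLES` are toy data and the threshold of 0.1 is arbitrary. Inputs that resemble no profile return `None`, which mirrors the "removes garbage" side effect.

```python
from collections import Counter
from math import sqrt

def trigram_profile(text):
    # Count overlapping character trigrams, padding so word edges are captured.
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    # Cosine similarity between two trigram count vectors.
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

# Toy training samples (assumption); a real detector would use large corpora.
SAMPLES = {
    "en": "the market is expected to rise and the investors are watching the price of the stock",
    "de": "der markt wird voraussichtlich steigen und die anleger beobachten den kurs der aktie",
    "sl": "trg naj bi se dvignil in vlagatelji opazujejo ceno delnice na borzi",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}

def detect_language(text, threshold=0.1):
    # Return the best-matching language, or None when nothing is similar enough.
    profile = trigram_profile(text)
    lang, score = max(
        ((candidate, cosine(profile, ref)) for candidate, ref in PROFILES.items()),
        key=lambda pair: pair[1],
    )
    return lang if score >= threshold else None
```

With these toy profiles, `detect_language("the price of the stock is rising")` picks `"en"`, and a string with no letters, such as `"#### 0101 ////"`, is rejected.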

Page 9: Core Information Processing Technologies Technical  Presentation  & Demos

Data acquisition pipeline

[Diagram: the data acquisition pipeline, as on Page 3.]

Page 10: Core Information Processing Technologies Technical  Presentation  & Demos

Near-duplicate detection

Why is this a difficult problem?

We are dealing with millions of documents; we cannot afford to compare every document with every other document

We are also looking for near-duplicates, not only exact matches

Overlooked boilerplate “produces” false near-duplicates

Demo video (1:00)

Page 11: Core Information Processing Technologies Technical  Presentation  & Demos

Near-duplicate detection

Existing approaches: SimHash, shingling and sketching, SpotSigs, …

Apart from SpotSigs, they require “clean” documents

Hard to interpret similarity value (how many characters, words, sentences?)

Developing a novel solution to remove boilerplate and detect duplicates [with clear interpretation] in the same framework
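To make the shingling idea concrete, here is a small sketch of one of the existing approaches named above: word shingles compared by Jaccard similarity. It is not the novel solution mentioned on the slide. The shingle size `k=3` and the threshold `0.5` are arbitrary assumptions, and the function names are hypothetical.

```python
import re

def shingles(text, k=3):
    # Set of overlapping k-word sequences ("shingles") in the text.
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    # Size of the intersection divided by size of the union.
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(x, y, threshold=0.5, k=3):
    return jaccard(shingles(x, k), shingles(y, k)) >= threshold
```

Two points from the slides show up directly. Inserting a single word into a ten-word sentence already lowers the similarity to 6/11, which is hard to read as "how many words differ". And comparing all pairs this way does not scale to millions of documents, which is why sketching techniques are used to approximate the comparison.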

Page 12: Core Information Processing Technologies Technical  Presentation  & Demos

Technical WPs

[Diagram: technical work packages, as on Page 2. "We are here": Ontology Infrastructure, Information Extraction.]

Page 13: Core Information Processing Technologies Technical  Presentation  & Demos

FIRST ontology

[Diagram: class hierarchy with the classes SentimentObject, FinancialInstrument, Index, Stock_Index, Stock, Company, Country. Labels: Seed indices; Constituents (stocks); Companies; Countries.]

Page 14: Core Information Processing Technologies Technical  Presentation  & Demos

FIRST ontology

:NASDAQ_100 a :Stock_Index ;
    rdfs:label "NASDAQ-100" .

:MICROSOFT a :Stock ;
    rdfs:label "MICROSOFT CORP COM USD0.00000625" ;
    :memberOf :NASDAQ_100 .

:MICROSOFT_CORP a :Company ;
    rdfs:label "Microsoft Corp." ;
    :issues :MICROSOFT .

:USA a :Country ;
    rdfs:label "USA" .

:MICROSOFT_CORP :locatedIn :USA .

:MICROSOFT_CORP :hasGazetteer :MICROSOFT_CORP_Gazetteer .

:MICROSOFT_CORP_Gazetteer
    :hasTerm "Microsoft Corp" ;
    :hasTerm "Microsoft Corporation" ;
    :hasStopWord "CORP" ;
    :hasStopWord "CORPORATION" ;
    a :Gazetteer .

Microsoft Corporation is engaged in developing, licensing and supporting a range of software products and services. Microsoft also designs and sells hardware, and delivers online advertising to the customers.

Microsoft Corp
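The gazetteer above could drive entity spotting roughly as follows. The slides do not define how stop words are used, so this sketch assumes they are tokens that may be omitted from a term when matching, which lets "Microsoft" alone match the term "Microsoft Corp". The `Gazetteer` class and `annotate` function are hypothetical names for illustration, not the project's API.

```python
import re
from dataclasses import dataclass, field

@dataclass
class Gazetteer:
    entity: str
    terms: list
    stop_words: set = field(default_factory=set)

    def core_tokens(self, term):
        # Upper-cased tokens of a term with stop words removed (assumed semantics).
        tokens = re.findall(r"\w+", term.upper())
        return [t for t in tokens if t not in self.stop_words]

def annotate(text, gazetteers):
    # Return sorted (entity, start, end) character spans where a term's core tokens occur.
    tokens = [(m.group().upper(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    found = set()
    for gaz in gazetteers:
        patterns = {tuple(gaz.core_tokens(term)) for term in gaz.terms}
        patterns.discard(())         # ignore terms made up of stop words only
        for pattern in patterns:
            n = len(pattern)
            for i in range(len(tokens) - n + 1):
                window = tuple(tok for tok, _, _ in tokens[i:i + n])
                if window == pattern:
                    found.add((gaz.entity, tokens[i][1], tokens[i + n - 1][2]))
    return sorted(found)

microsoft = Gazetteer(
    entity=":MICROSOFT_CORP",
    terms=["Microsoft Corp", "Microsoft Corporation"],
    stop_words={"CORP", "CORPORATION"},
)
```

Under this assumption, both occurrences of "Microsoft" in the example text on this slide would be annotated with `:MICROSOFT_CORP`, including the second one, which is not followed by "Corp" or "Corporation".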

Page 15: Core Information Processing Technologies Technical  Presentation  & Demos

FIRST ontology

[Diagram: ontology classes and relations. Classes: Sentiment Object, Company, Financial Instrument, Indicator, MacroIndicator, MicroIndicator, Technical, Fundamental, Feature, CorrelationDefinition, Volatility, Price, Reputation, OrientationPhrase. Relations: featureHasCorrelationDefinition, indicatorHasCorrelationDefinition, objectHasCorrelationDefinition, correlationDefinitionInfluencesFeature, correlationDefinitionInfluencesIndicator, correlationDefinitionInfluencesObject.]

Page 16: Core Information Processing Technologies Technical  Presentation  & Demos

Annotation pipeline

[Diagram: the data acquisition pipeline, as on Page 3, with the callout "Ontology-based semantic annotation".]

Demo video (3:00)

Page 17: Core Information Processing Technologies Technical  Presentation  & Demos

Technical WPs

[Diagram: technical work packages, as on Page 2. "We are here": Sentiment Analysis.]

Page 18: Core Information Processing Technologies Technical  Presentation  & Demos

Sentiment Analysis

Object: Sentiment in financial web texts

Problem: Classification of sentiment orientation with respect to expected future …
  price change of financial instruments
  volatility change of financial instruments
  reputation change of companies

Approach: Knowledge-based sentiment classification
  Starting at the sentence level
  Specific to features of objects (e.g., reputation of a company)

Page 19: Core Information Processing Technologies Technical  Presentation  & Demos

Example

Ambiguity: “The low clarity of messages implies that quite often people would be likely to disagree on the classification” [Das and Chen 2007].

Identification and differentiation of objects (and features)

Relationships of indicators (e.g., earnings) and objects

Short term: uptrend

Support for the SPX remains at 848 and then 789, with resistance at 912 and then 935. Short term momentum was overbought during the rally early in the week and is now displaying a positive divergence at Friday's lows. Should the market fail to hold this pivot (SPX 840) in the days and weeks ahead the uptrend is likely over.

Long term: bear market

The Cycle wave bear market of October 2007 continues. Thus far, equity markets worldwide have declined on average about 50%. The opportunity still remains for the US and World economies to avoid a devastating Supercycle bear market like that of 1929-1932.

http://caldaroew.spaces.live.com/Blog/cns!D2CB8C5EBA2ADE86!27847.entry

Page 20: Core Information Processing Technologies Technical  Presentation  & Demos

Manual Sentiment Annotation

[Screenshot: manual sentiment annotation; visible label: "Topic".]

Page 21: Core Information Processing Technologies Technical  Presentation  & Demos

FIRST Knowledge-based Sentiment Analysis Approach

1. Identify all sentiment objects and features

2. Extract all sentence-level sentiments

3. Classify sentiment orientation {positive, negative} for all sentence-level sentiments

4. Aggregate
  All sentiments in one document: scoring yields the document-level sentiment score
  All sentiment scores for a given day: averaging yields the Sentiment Index [-1, 1]

Resources: Rules, Ontology

Example sentence: Support for the SPX remains at 848 and then 789, with resistance at 912 …
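The aggregation step can be sketched as below. The slide names the two operations (scoring per document, averaging per day) but not the scoring formula, so the document score used here, (positive - negative) / (positive + negative), is an assumption chosen because it stays within [-1, 1]. Function names and the input layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def document_score(orientations):
    # Assumed scoring rule: net share of positive sentence-level sentiments.
    pos = orientations.count("positive")
    neg = orientations.count("negative")
    total = pos + neg
    return (pos - neg) / total if total else 0.0

def sentiment_index(documents):
    # documents: iterable of (day, list of sentence-level orientations).
    # Returns {day: mean document score}, each value within [-1, 1].
    by_day = defaultdict(list)
    for day, orientations in documents:
        by_day[day].append(document_score(orientations))
    return {day: mean(scores) for day, scores in by_day.items()}
```

For example, two documents on the same day scoring -1/3 and -1 give an index of -2/3 for that day.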

Page 22: Core Information Processing Technologies Technical  Presentation  & Demos

Sentence-level Sentiment Classification

a) directly. Example: “I expect the S&P 500 to rise”
  positive sentiment
  Addressed by rules

b) indirectly, via an indicator. Example: “I think U.S. interest rates will rise”
  negative sentiment
  Addressed by ontology
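The two cases can be illustrated with a toy classifier. Everything in it is an assumption made for the example: the word lists, the single object term, and the single indicator with a negative correlation to the object's price (in the spirit of the ontology's CorrelationDefinition). It is not the project's rule set.

```python
import re

# Hypothetical direction lexicon: +1 for expected increase, -1 for expected decrease.
DIRECTION_WORDS = {"rise": 1, "increase": 1, "climb": 1,
                   "fall": -1, "drop": -1, "decline": -1}

# Hypothetical object terms (case a: sentiment expressed directly about the object).
OBJECT_TERMS = {"s&p 500"}

# Hypothetical indicator terms with the assumed sign of their correlation
# with the object's price (case b: sentiment expressed via an indicator).
INDICATOR_CORRELATION = {"interest rates": -1}

def direction(sentence):
    # Sign of the first direction word found, or 0 if there is none.
    for word in re.findall(r"[a-z]+", sentence.lower()):
        if word in DIRECTION_WORDS:
            return DIRECTION_WORDS[word]
    return 0

def classify(sentence):
    # Return "positive", "negative", or None when no sentiment is recognised.
    lowered = sentence.lower()
    sign = direction(lowered)
    if sign == 0:
        return None
    if any(term in lowered for term in OBJECT_TERMS):
        return "positive" if sign > 0 else "negative"
    for term, correlation in INDICATOR_CORRELATION.items():
        if term in lowered:
            return "positive" if sign * correlation > 0 else "negative"
    return None
```

In case b) the expected direction of the indicator is multiplied by the correlation sign, so rising interest rates map to negative sentiment about the object's price.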

Page 23: Core Information Processing Technologies Technical  Presentation  & Demos

Example Text: S&P 500

http://business.financialpost.com/2011/10/04/economic-uncertainty-could-fan-volatility/

Oct 4, 2011 – 3:24 PM ET

The fourth quarter began on Monday with the broad S&P 500 on the precipice of a bear market and investors lacking confidence in either European or U.S. policymakers being able to stem the disquiet surrounding the debt crisis.

Wall Street typically defines a bear market as a drop of 20 percent or more from a recent high.

Volatility is at its most persistently elevated since the financial crisis of 2008, as measured by the popular VIX, or CBOE Volatility Index. Barring a knock-out U.S. earnings period in the next month, it could remain high, and investors should brace for wild swings and more down days.

Page 24: Core Information Processing Technologies Technical  Presentation  & Demos

Sentiment Sentences on Price Change of S&P 500

Negative sentiment about the future price change of the S&P 500

Page 25: Core Information Processing Technologies Technical  Presentation  & Demos

Sentiment Sentences on Volatility Change of S&P 500

Positive sentiment about the future volatility change of the S&P 500

Page 26: Core Information Processing Technologies Technical  Presentation  & Demos

Document-level Sentiment with respect to multiple objects/features

Page 27: Core Information Processing Technologies Technical  Presentation  & Demos

Initial Experiment Results

Accuracy of knowledge-based sentiment classification vs. standard machine learning methods
  Small manually classified corpus
  Result: 7% more accurate

Portfolio selection experiment
  Use sentiment to select Dow Jones stocks
  Result: Excess returns seem possible

More information in paper: “Extracting Investor Sentiment from Weblog Texts: A Knowledge-based Approach”, published in IEEE CEC 2011 conference proceedings

Page 28: Core Information Processing Technologies Technical  Presentation  & Demos

Main Y1 Achievements

Data acquisition software running

Sentiment analysis for web texts
  Sentence-level, and specific to features of objects
  Initial experiment results are promising

Ontology available (~4000 instances)

Sentence-level annotated corpus available (900 documents and growing)

Delivered as D3.1 and D4.1

Book chapter on data acquisition in preparation

Paper on sentiment extraction (best paper award at CEC 2011 conference)

Page 29: Core Information Processing Technologies Technical  Presentation  & Demos

Next Steps

Improve ontology and gazetteers

Use corpus to improve sentiment classification

Increase throughput of sentiment extraction

Page 30: Core Information Processing Technologies Technical  Presentation  & Demos

Thank you
