Date post: 26-Aug-2018

New Media Analysis Using Focused Crawl and Natural Language Processing: Case of Lithuanian News Websites

Tomas Krilavičius, Žygimantas Medelis, Jurgita Kapočiūtė-Dzikienė, Tomas Žalandauskas

Krilavičius, Medelis, Kapočiūtė-Dzikienė, Žalandauskas News Media Analysis using Focused Crawl and NLP 2

Problem

How to monitor Lithuanian news media, identifying main topics and facts?

Potential users

Business intelligence, political campaigns, military intelligence, police

Purpose of the System

Collect relevant information from the Internet

Prepare text for analysis

Extract relevant information

Store it

Provide tools to
  Search it (e.g., faceted search)
  Analyse it (e.g., visualisation, word frequency)

Novelty and Scientific Problems

Problem is not new

There exist solutions

Reference architecture

For selected languages (e.g., English, French, Russian)

Problem: text analysis is language-dependent

Our results: case-based analysis of media in Lithuanian

General Architecture

Focused Crawl

Web Crawler that downloads only relevant pages

Removes ads and topics of no interest

Uses several types of filters
  Link filters
    URL filters (analysis of href attributes)
    Anchor text filters
  Content filters
    Regular-expression-based term filters
    Classifiers employing word lists, topic maps, ontologies

Highly configurable tools exist, e.g. Apache Nutch

Mostly language-agnostic
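
The filter types listed above can be illustrated with a minimal sketch. It is not taken from the presented system or from Apache Nutch: the host name, the blocked paths, and the term list are hypothetical, and a real deployment would load them from configuration.

```python
import re
from urllib.parse import urlparse

# Hypothetical configuration: none of these values come from the slides.
ALLOWED_HOSTS = {"news.example.lt"}
BLOCKED_PATH = re.compile(r"/(ads?|sport|horoscope)(/|$)", re.IGNORECASE)
BLOCKED_ANCHOR = re.compile(r"\b(advert\w*|subscribe)\b", re.IGNORECASE)
RELEVANT_TERMS = re.compile(r"\b(election\w*|parliament\w*|budget\w*)\b", re.IGNORECASE)


def url_passes(url: str) -> bool:
    """URL filter: keep links on allowed hosts whose path is not blocked."""
    parts = urlparse(url)
    return parts.hostname in ALLOWED_HOSTS and not BLOCKED_PATH.search(parts.path)


def anchor_passes(anchor_text: str) -> bool:
    """Anchor text filter: drop links whose visible text looks like an ad."""
    return not BLOCKED_ANCHOR.search(anchor_text)


def should_follow(url: str, anchor_text: str) -> bool:
    """Link filters run before a page is downloaded."""
    return url_passes(url) and anchor_passes(anchor_text)


def content_passes(text: str, min_hits: int = 2) -> bool:
    """Content filter: keep pages with at least min_hits relevant term matches."""
    return len(RELEVANT_TERMS.findall(text)) >= min_hits
```

Link filters are cheap because they run before download; the content filter needs the page text and therefore runs afterwards. A classifier-based content filter would replace `content_passes` with a trained model.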

Text Preprocessing

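[Diagram: text preprocessing pipeline. Labels recovered from the slide: document; structure recognition; accents, spacing, etc.; stopwords; noun groups; stemming; indexing. Intermediate and output artefacts: text + structure; structure; full text; terms index.]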
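
Some of the stages named on this slide (accent and spacing normalisation, stop-word removal, stemming, indexing) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' pipeline: the stop-word list and the suffix list are hypothetical, `toy_stem` is a stand-in for a real stemmer, and structure recognition and noun-group detection are left out.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical stop-word and suffix lists, for illustration only.
STOPWORDS = {"ir", "kad", "bet", "the", "and", "of"}
SUFFIXES = ("ing", "es", "s")


def normalize(text: str) -> str:
    """Lower-case, strip accents, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"\s+", " ", no_accents).strip()


def toy_stem(token: str) -> str:
    """Strip the first matching suffix; a stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def terms(text: str) -> list[str]:
    """Tokenize normalized text, drop stop-words, and stem what remains."""
    tokens = re.findall(r"[a-z0-9]+", normalize(text))
    return [toy_stem(t) for t in tokens if t not in STOPWORDS]


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Build a terms index mapping each stemmed term to the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in terms(text):
            index[term].add(doc_id)
    return dict(index)
```

For example, `build_index({"a": "Crawling news", "b": "news filters"})` maps the stem `new` to both documents. Stripping accents loses information in Lithuanian, so whether to do it at all is a design decision rather than a given.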

Linguistic Infrastructure: Corpora

Stop-words

Easy to build

Available as a part of TokenMill's language pack

General corpus

Hunspell

Domain vocabulary

Synonyms

sinonimai.lt

Taxonomy

It is not clear whether corpora owned by universities or institutes may be used commercially

Linguistic Infrastructure: Small Tools

Language identifiers

LT language identifier (TokenMill language pack)

Sentence splitters (TokenMill language pack)

Structure recognition

Tilde, Fotonija, Leksinova, etc. probably have some in-house tools

Linguistic Infrastructure: Word Level Analyzers

Stemmers
  Porter-based stemmer: a prototype for Lithuanian is available (Krilavičius, Medelis, 2010)
Phonetic algorithms
  Lithuanian Soundex (Paliulionis, 2009; Krilavičius, Kuliešienė, 2010)
Morphological analysis/lemmatization/POS
  Morfolema, Zinkevičius
  Morfologinis anotatorius (VMU, Computer Linguistics Center)
  Tilde?
  Petkevičius (in progress, VMU, Faculty of Informatics)
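
To show what a phonetic algorithm of this kind does, here is a sketch of classic English (American) Soundex. It is not the Lithuanian Soundex from the cited work, whose letter mapping is not given in the slides; non-ASCII letters such as "ž" are simply dropped here, and handling them properly is what a Lithuanian variant would have to add.

```python
# Classic American Soundex consonant codes; vowels, h, w and y have no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex key of an ASCII name ('' if no letters)."""
    # Keep ASCII letters only; accented letters are dropped in this sketch.
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    previous = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets the previous code to ""
    return (result + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))  # both map to R163
```

Names that sound alike share a key, so the key can be indexed next to the original spelling to make search tolerant of spelling variants.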

Linguistic Infrastructure: Text Analytics

Classification (categorization)
  Some EuroVoc-based results by Daudaravičius, 2012
  Report and publications in progress (Kapočiūtė-Dzikienė, Krilavičius)
Clustering
  No published results for Lithuanian language
  Report and publications in progress
  Some preliminary results in Zuokas, Medelis, Kaušas, Krilavičius, 2010

Linguistic Infrastructure: Text Analytics

NER tools
  RegExp-based: person names, dates, locations, etc. (Kapočiūtė, Raškinis, 2005)
  GATE (JAPE)-based: citations, person names, dates (Zuokas, Medelis, Kaušas, Krilavičius, 2010; Krilavičius, Medelis, Balčas, Širvinskas, 2012)
  AI/ML-based methods: just starting (Zuokas, Medelis, Kaušas, Krilavičius, 2010)
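
A regular-expression approach to NER can be sketched as follows. The patterns are illustrative assumptions and are not the rules from the cited publications: dates are matched only in ISO form, and a "person" is any two adjacent capitalised words (optionally hyphenated), which over-generates on real text.

```python
import re

# Illustrative patterns only; not the rules used in the cited work.
DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
CAP = r"[A-ZĄČĘĖĮŠŲŪŽ][a-ząčęėįšųūž]+"  # capitalised word, Lithuanian letters included
PERSON = re.compile(rf"\b{CAP}(?:-{CAP})? {CAP}(?:-{CAP})?\b")


def extract_entities(text: str) -> dict[str, list[str]]:
    """Return ISO dates and candidate person names found in text."""
    return {"dates": DATE.findall(text), "persons": PERSON.findall(text)}


sample = "On 2012-10-04 the talk was given by Tomas Krilavičius in Vilnius."
print(extract_entities(sample))
```

Rules like these are quick to write and easy to inspect, but Lithuanian inflects names by case, so the same person appears in several surface forms that a pattern alone does not unify.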

Experimental Implementation

Crawler: Apache Nutch

Text preprocessing and NER
  GATE (General Architecture for Text Engineering)
  Hunspell
  Lithuanian (Porter-based) stemmer

Classification and clustering: Apache Mahout

Faceted search: Apache Solr (Apache Lucene)
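
As an illustration of faceted search, the sketch below builds a Solr select URL using Solr's standard `q`, `rows`, `wt`, `facet` and `facet.field` parameters. The core name `news` and the field names `source` and `category` are assumptions; the slides do not describe the actual schema.

```python
from urllib.parse import urlencode


def facet_query_url(base_url: str, query: str, facet_fields: list[str], rows: int = 10) -> str:
    """Build a Solr select URL that requests facet counts for the given fields."""
    params = [("q", query), ("rows", str(rows)), ("facet", "true"), ("wt", "json")]
    # facet.field is repeated once per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and field names.
url = facet_query_url("http://localhost:8983/solr/news", "text:rinkimai", ["source", "category"])
print(url)
```

The response to such a query carries, next to the matching documents, a count of matches per value of each facet field, which is what a search interface uses to offer drill-down filters.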

Experimental Results

Results

Overview of existing/missing NLP/IR tools for the Lithuanian language
Experimental implementation
  Running in production
... and, based on that, future research plans
  Classification
  Clustering
  Corpora, e.g. TREC-annotated
  Stemmer
  NER
  Ontologies

Conclusions

Some tools exist, but more are missing; in some cases they are just well hidden

Plenty remains to be done, but not all of it is very interesting research-wise (e.g., English Soundex is over 100 years old; the Porter stemmer dates from 1979)

Enough tools to build media monitoring systems, but a lot of improvements are possible

THANKS

