News Media Analysis Using Focused Crawl and
Natural Language Processing: Case of Lithuanian News Websites
Tomas Krilavičius
Žygimantas Medelis
Jurgita Kapočiūtė-Dzikienė
Tomas Žalandauskas
Krilavičius, Medelis, Kapočiūtė-Dzikienė, Žalandauskas News Media Analysis using Focuse Crawl and NLP 2
Problem
How to monitor Lithuanian news media,
identifying main topics and facts?
Potential users
business intelligence, political campaigns, military intelligence, police
Purpose of the System
Collect relevant information from the Internet
Prepare text for analysis
Extract relevant information
Store it
Provide tools to
Search it (e.g., faceted search)
Analyse it (e.g., visualisation, word frequency)
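The word-frequency analysis mentioned above can be sketched in a few lines. This is a minimal illustration, not the system's implementation: the function name `word_frequencies` and the simple `\w+` tokenisation are assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: a token is any run of Unicode word characters.
TOKEN_RE = re.compile(r"\w+")

def word_frequencies(texts, top_n=10):
    """Return the top_n most frequent lower-cased tokens across all texts."""
    counts = Counter()
    for text in texts:
        counts.update(tok.lower() for tok in TOKEN_RE.findall(text))
    return counts.most_common(top_n)
```

A call such as `word_frequencies(articles, top_n=20)` would give the raw material for a frequency chart; a real system would normally remove stop-words and stem first.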
Novelty and Scientific Problems
Problem is not new
There exist solutions
Reference architecture
For selected languages (e.g., English, French, Russian)
Problem: text analysis is language-dependent
Our results: case-based analysis of media in Lithuanian
General Architecture
Focused Crawl
Web Crawler that downloads only relevant pages
Removes ads and topics of no interest
Uses several types of filters
Link filters
URL filters [.href elements analysis]
Anchor text filters
Content filters
Regexp-based term filters
Classifiers employing word lists, topic maps, ontologies
Highly configurable tools exist, e.g. Apache Nutch
Mostly language-agnostic
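The filter types listed above (URL, anchor text, content) can be sketched as plain predicates. This is an illustration only, not Apache Nutch configuration: the host name, the blocked path segments and the topic terms are all hypothetical values chosen for the example.

```python
import re
from urllib.parse import urlparse

# All values below are hypothetical and exist only for illustration.
ALLOWED_HOSTS = {"news.example.lt"}
BLOCKED_PATH_RE = re.compile(r"/(ads?|sport|horoscope)/", re.IGNORECASE)
BLOCKED_ANCHOR_RE = re.compile(r"\b(advert|subscribe|login)\w*", re.IGNORECASE)
TOPIC_TERMS_RE = re.compile(r"\b(election|parliament|budget)\w*", re.IGNORECASE)

def url_passes(url):
    """URL filter: keep only allowed hosts and skip uninteresting sections."""
    parts = urlparse(url)
    return parts.hostname in ALLOWED_HOSTS and not BLOCKED_PATH_RE.search(parts.path)

def anchor_passes(anchor_text):
    """Anchor text filter: drop links whose label looks like navigation or ads."""
    return not BLOCKED_ANCHOR_RE.search(anchor_text)

def content_passes(text, min_hits=2):
    """Content filter: require a minimum number of topic-term matches."""
    return len(TOPIC_TERMS_RE.findall(text)) >= min_hits

def should_follow(url, anchor_text):
    """Link filters are applied before download, content filters after."""
    return url_passes(url) and anchor_passes(anchor_text)
```

Link filters save bandwidth because they reject a page before it is fetched, while the content filter needs the downloaded text; the regular expressions are the only language-dependent part of this sketch.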
Text Preprocessing
[Figure: text preprocessing pipeline. Stages applied to a document: structure recognition; accents, spacing, etc.; stopwords; noun groups; stemming; indexing. Outputs: text + structure, text, structure, full text, index terms.]
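The preprocessing stages named on this slide (accent handling, stop-word removal, stemming, indexing) can be chained as in the sketch below. The stop-word set and suffix list are toy English placeholders, not the Lithuanian resources discussed in the talk, and the suffix stripper is far simpler than a Porter-style stemmer.

```python
import re
import unicodedata
from collections import defaultdict

# Toy placeholders; a real system would load language-specific resources.
STOPWORDS = {"the", "and", "of"}
SUFFIXES = ("ing", "ed", "s")

def fold_accents(text):
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def tokenize(text):
    """Lower-case, accent-fold and split into word tokens."""
    return re.findall(r"\w+", fold_accents(text).lower())

def stem(token):
    """Remove the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[:-len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop-words, then stem."""
    return [stem(tok) for tok in tokenize(text) if tok not in STOPWORDS]

def build_index(docs):
    """Build an inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index
```

Whether accents should be folded at all is a design choice: for Lithuanian it helps with text typed without diacritics but can merge distinct words.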
Linguistic Infrastructure: Corpora
Stop-words
Easy to build
Available as a part of TokenMill's language pack
General corpus
Hunspell
Domain vocabulary
Synonyms
sinonimai.lt
Taxonomy
Terms for commercial use of university- or institute-owned corpora are not clear
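One common way to bootstrap a stop-word list, in line with the "easy to build" remark above, is to take the terms that occur in almost every document. The sketch below assumes this document-frequency heuristic; the function name and the 0.8 threshold are illustrative, and this is not how the TokenMill language pack was built.

```python
import re
from collections import Counter

def candidate_stopwords(docs, min_doc_ratio=0.8):
    """Return terms that appear in at least min_doc_ratio of the documents."""
    doc_freq = Counter()
    for text in docs:
        # Count each term once per document (document frequency).
        doc_freq.update(set(re.findall(r"\w+", text.lower())))
    threshold = min_doc_ratio * len(docs)
    return {term for term, df in doc_freq.items() if df >= threshold}
```

The output is only a candidate list; it would still need manual review before use.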
Linguistic Infrastructure: Small Tools
Language identifiers
LT language identifier (TokenMill language pack)
Sentence splitters (TokenMill lang. pack)
Structure recognition
Probably Tilde, Fotonija, Leksinova, etc. have some in-house tools
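A rule-based sentence splitter of the kind listed above can be sketched as follows. The abbreviation list is a hypothetical placeholder, and the sketch does not reproduce the TokenMill splitter.

```python
# Hypothetical abbreviation list; a real splitter needs a language-specific one.
ABBREVIATIONS = {"e.g.", "etc.", "dr.", "prof."}

def split_sentences(text):
    """Split on whitespace tokens ending in . ! or ? unless they are abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without final punctuation
        sentences.append(" ".join(current))
    return sentences
```

The abbreviation list is where language dependence enters: Lithuanian abbreviations and date formats would need their own entries.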
Linguistic Infrastructure: Word Level Analyzers
Stemmers
Porter-style stemmer: a prototype for Lithuanian is available (Krilavičius, Medelis, 2010)
Phonetic algorithms
Lithuanian Soundex (Paliulionis, 2009; Krilavičius, Kuliešienė, 2010)
Morphological analysis/lemmatization/POS
Morfolema, Zinkevičius
Morfologinis anotatorius (VMU, Computer Ling. Center)
Tilde?
Petkevičius (in progress, VMU, Informatics fac.)
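To show what a phonetic algorithm does, here is a sketch of the classic American (English) Soundex. It is not the Lithuanian Soundex cited above; in particular it simply drops letters outside a-z, which is one reason language-specific variants are needed.

```python
# Consonant groups of the classic American Soundex, coded 1..6.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for letter in letters:
        CODES[letter] = str(digit)

def soundex(name):
    """Return the 4-character Soundex code (letter + three digits)."""
    letters = [ch for ch in name.lower() if "a" <= ch <= "z"]  # non-ASCII letters are dropped
    if not letters:
        return ""
    first = letters[0]
    result = first.upper()
    prev_code = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev_code:
            result += code
        prev_code = code  # vowels reset prev_code, so repeated codes are kept
    return (result + "000")[:4]
```

Names that sound alike in English map to the same code, for example `soundex("Robert")` and `soundex("Rupert")`.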
Linguistic Infrastructure: Text Analytics
Classification (categorization)
Some EuroVoc-based results by Daudaravičius, 2012
Report and publications in progress (Kapočiūtė-Dzikienė, Krilavičius)
Clustering
No published results for Lithuanian language
Report and publications in progress
Some preliminary results in Zuokas, Medelis, Kaušas, Krilavičius, 2010
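As an illustration of what text classification involves, the sketch below implements a small multinomial Naive Bayes classifier with add-one smoothing over pre-tokenised documents. It is a generic textbook technique chosen for the example; the slides do not state which classifier was used, and the class name and labels are assumptions.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # label -> number of documents
        self.term_counts = defaultdict(Counter)  # label -> term -> count
        self.vocab = set()

    def fit(self, docs, labels):
        """docs is a list of token lists; labels is a parallel list of labels."""
        for tokens, label in zip(docs, labels):
            self.class_docs[label] += 1
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        """Return the label with the highest log-probability, or None if untrained."""
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

The tokens fed to such a classifier would normally come out of the preprocessing steps (stop-word removal, stemming), which is where the language-specific tools matter.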
Linguistic Infrastructure: Text Analytics
NER tools
RegExp-based: person names, dates, locations, etc. (Kapočiūtė, Raškinis, 2005)
GATE (JAPE)-based: citations, person names, dates (Zuokas, Medelis, Kaušas, Krilavičius, 2010; Krilavičius, Medelis, Balčas, Širvinskas, 2012)
AI/ML-based methods: just starting (Zuokas, Medelis, Kaušas, Krilavičius, 2010)
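The regexp-based approach to NER can be illustrated with two simple patterns. These patterns are assumptions written for this example (ISO-style dates and pairs of capitalised words), not the rules from the cited work, and they will produce false positives, for instance on a capitalised word at the start of a sentence.

```python
import re

# A capitalised word, including Lithuanian letters with diacritics.
NAME = r"[A-ZĄČĘĖĮŠŲŪŽ][a-ząčęėįšųūž]+"

DATE_RE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")
# First name + surname, with an optional hyphenated second surname.
PERSON_RE = re.compile(rf"\b{NAME} {NAME}(?:-{NAME})?\b")

def extract_entities(text):
    """Return (type, text, offset) tuples ordered by position in the text."""
    entities = [("DATE", m.group(0), m.start()) for m in DATE_RE.finditer(text)]
    entities += [("PERSON", m.group(0), m.start()) for m in PERSON_RE.finditer(text)]
    return sorted(entities, key=lambda ent: ent[2])
```

Lithuanian inflects personal names by case, so a production rule set would also need to normalise the matched forms, which simple patterns like these do not attempt.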
Experimental Implementation
Crawler: Apache Nutch
Text preprocessing and NER
GATE (General Architecture for Text Engineering)
Hunspell
Lithuanian (Porter-based) stemmer
Classification and clustering: Apache Mahout
Faceted search: Apache Solr (Apache Lucene)
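A faceted search against Solr boils down to a select request with facet parameters. The sketch below only builds such a URL using Solr's standard `q`, `rows`, `wt`, `facet` and `facet.field` parameters; the base URL, the core name `news` and the field names are hypothetical and not taken from the described system.

```python
from urllib.parse import urlencode

def faceted_query_url(base_url, query, facet_fields, rows=10):
    """Build a Solr select URL that requests counts for the given facet fields."""
    params = [("q", query), ("rows", rows), ("wt", "json"), ("facet", "true")]
    # facet.field may be repeated, once per field to facet on.
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical core and field names, for illustration only.
example_url = faceted_query_url(
    "http://localhost:8983/solr/news", "seimas", ["source", "person"]
)
```

The response would contain, next to the matching documents, a count per value of each facet field, which is what drives drill-down navigation by source, person, and so on.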
Experimental Results
Results
Overview of existing/missing NLP/IR tools for Lithuanian language
Experimental implementation
Running in production
… and based on that – future research plans
Classification
Clustering
Corpora, e.g. TREC-annotated
Stemmer
NER
Ontologies
Conclusions
Some tools exist, but more are missing; in some cases they are just well hidden
Plenty remains to be done, but not all of it is very interesting research-wise (e.g., the English Soundex is over 100 years old; the Porter stemmer dates from 1979)
Enough tools to build media monitoring systems, but a lot of improvements are possible