Applications of Natural Language Processing
Course 3 - 8 March 2012
„Al. I. Cuza” University of Iasi, Romania
Faculty of Computer Science
Data Mining
◦ Definition
◦ Examples
◦ Data, Information, Knowledge
◦ Elements, Levels of Analysis
◦ Notable Uses
◦ Resources
Text Mining
◦ Definition
◦ Domains
◦ Applications
◦ TerMine, AcroMine, FACTA+, KLEIO, MEDIE
Data mining is the process of analyzing data from different perspectives and summarizing it into useful information (information that can be used to increase revenue, cut costs, etc.)
The process of finding correlations or patterns among dozens of fields in large relational databases
Data, Information, and Knowledge
A Midwest grocery chain used the data mining capacity of Oracle software to analyze local buying patterns
They discovered that when men bought diapers on Thursdays and Saturdays, they also tended to buy beer
To increase revenue:
◦ They could move the beer display closer to the diaper display
◦ They could make sure beer and diapers were sold at full price on Thursdays
Data: any facts, numbers, or text that can be processed by a computer
Organizations are accumulating vast and growing amounts of data in different formats and different databases:
◦ operational or transactional data, such as sales, cost, inventory, payroll, and accounting
◦ nonoperational data, such as industry sales, forecast data, and macroeconomic data
◦ metadata: data about the data itself, such as logical database design or data dictionary definitions
The patterns, associations, or relationships among all this data can provide information
For example, analysis of retail point of sale transaction data can yield information on which products are selling and when
Information can be converted into knowledge about historical patterns and future trends
For example, summary information on retail supermarket sales can be analyzed in light of promotional efforts to provide knowledge of consumer buying behavior
Thus, a manufacturer or retailer could determine which items are most susceptible to promotional efforts
Data warehousing represents an ideal vision of maintaining a central repository of all organizational data
Centralization of data is needed to maximize user access and analysis
Data mining enables companies to determine relationships among "internal" factors (price, product positioning, or staff skills) and "external" factors (economic indicators, competition, and customer demographics)
It also enables them to determine the impact on sales, customer satisfaction, and corporate profits
Finally, it enables them to "drill down" into summary information to view detailed transactional data
NBA - The Advanced Scout software analyzes the movements of players to help coaches orchestrate plays and strategies
For example, an analysis of the play-by-play sheet of the game played between the New York Knicks and the Cleveland Cavaliers on January 6, 1995 reveals that when Mark Price played the Guard position, John Williams attempted four jump shots and made each one!
Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:
Classes – stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order
Clusters – group existing data according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities
Associations – data is mined to identify associations. The beer-diaper example is an instance of associative mining
Sequential patterns – data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer's purchase of sleeping bags and hiking shoes
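The backpack example can be made concrete with a small sketch. It estimates how often a backpack is bought after a customer has already bought both a sleeping bag and hiking shoes. The purchase histories and the function name `follow_up_rate` are invented for illustration and do not come from any particular tool.

```python
def follow_up_rate(histories, earlier_items, later_item):
    """Share of customers who bought `later_item` after buying all `earlier_items`.

    Each history is a list of items in purchase order.
    """
    eligible = 0
    followed = 0
    for history in histories:
        try:
            # Position at which the customer has bought every earlier item.
            last_needed = max(history.index(item) for item in earlier_items)
        except ValueError:
            continue  # at least one earlier item was never bought
        eligible += 1
        if later_item in history[last_needed + 1:]:
            followed += 1
    return followed / eligible if eligible else 0.0


# Hypothetical purchase histories (illustration only).
histories = [
    ["sleeping bag", "hiking shoes", "backpack"],
    ["hiking shoes", "sleeping bag", "tent", "backpack"],
    ["sleeping bag", "hiking shoes", "stove"],
    ["backpack", "sleeping bag"],
    ["hiking shoes", "sleeping bag"],
]
rate = follow_up_rate(histories, ["sleeping bag", "hiking shoes"], "backpack")
print(rate)  # 0.5
```

Only customers who bought both earlier items count as eligible, so the fourth history (no hiking shoes) is ignored rather than counted as a miss.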
Extract, transform, and load transaction data onto the data warehouse system
Store and manage the data in a multidimensional database system
Provide data access to business analysts and information technology professionals
Analyze the data with application software
Present the data in a useful format, such as a graph or table
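The five steps above can be sketched end to end on a toy scale. This is a minimal illustration under stated assumptions: the sales lines are invented, and an in-memory SQLite table stands in for the data warehouse (it is a relational store, not a multidimensional database system).

```python
import sqlite3

# Hypothetical raw export: day, product, quantity, unit price.
RAW_LINES = [
    "2012-03-01,beer,2,3.50",
    "2012-03-01,diapers,1,9.00",
    "2012-03-03,beer,1,3.50",
]


def extract_transform(lines):
    """Extract and transform: parse text lines into typed tuples."""
    rows = []
    for line in lines:
        day, product, qty, price = line.split(",")
        rows.append((day, product, int(qty), float(price)))
    return rows


def load(rows):
    """Load and store: put the rows in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sales (day TEXT, product TEXT, qty INTEGER, price REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    return conn


def analyze(conn):
    """Analyze: total revenue per product."""
    query = (
        "SELECT product, SUM(qty * price) FROM sales "
        "GROUP BY product ORDER BY product"
    )
    return conn.execute(query).fetchall()


revenue = analyze(load(extract_transform(RAW_LINES)))

# Present: a plain text table.
for product, total in revenue:
    print(f"{product:<10}{total:>8.2f}")
```

The "provide data access" step is implicit here: anyone holding the connection object can run further queries against the `sales` table.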
Artificial neural networks: learn through training and resemble biological neural networks in structure
Genetic algorithms: a design based on the concepts of natural evolution
Decision trees: tree-shaped structures that represent sets of decisions
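To illustrate what "tree-shaped structures that represent sets of decisions" means, here is a minimal sketch of a decision tree stored as nested dictionaries. The tree is written by hand and its questions and outcomes are hypothetical; in practice the tree would be learned from data.

```python
# A hand-written tree (illustration only): inner nodes ask a yes/no
# question about a record, leaves are plain strings holding the decision.
tree = {
    "question": "bought_diapers",
    "yes": {
        "question": "is_thursday",
        "yes": "show beer offer",
        "no": "no offer",
    },
    "no": "no offer",
}


def decide(node, record):
    """Walk from the root to a leaf, following the answer to each question."""
    while isinstance(node, dict):
        branch = "yes" if record[node["question"]] else "no"
        node = node[branch]
    return node


print(decide(tree, {"bought_diapers": True, "is_thursday": True}))
# show beer offer
```

Each path from the root to a leaf reads as one if-then rule, which is why decision trees are easy to inspect.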
Nearest neighbor method: classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset
Rule induction: The extraction of useful if-then rules from data based on statistical significance
Data visualization: The visual interpretation of complex relationships in multidimensional data
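The nearest neighbor method described above can be sketched in a few lines. This is a minimal version assuming numeric features, Euclidean distance, and a simple majority vote among the k closest historical records. The customer data and labels are invented for illustration.

```python
import math
from collections import Counter


def knn_classify(history, record, k=3):
    """Classify `record` by majority vote of its k nearest historical records.

    `history` is a list of (features, label) pairs; `record` is a feature tuple.
    """
    nearest = sorted(history, key=lambda item: math.dist(item[0], record))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical historical dataset: (visits per month, items per visit) -> class.
history = [
    ((1.0, 1.0), "occasional"),
    ((1.5, 2.0), "occasional"),
    ((2.0, 1.0), "occasional"),
    ((8.0, 9.0), "regular"),
    ((9.0, 8.0), "regular"),
    ((8.5, 9.5), "regular"),
]
print(knn_classify(history, (8.0, 8.0)))  # regular
```

An odd k avoids ties in a two-class problem; with more classes or an even k, a tie-breaking rule would be needed.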
Games: dots-and-boxes and chess
Business: customer relationship management, human resources (identifying the characteristics of the most successful employees), market basket analysis
Science and engineering: genetics, bioinformatics, medicine, education and electrical power engineering
Spatial data mining: geography, GIS
Organizations possessing huge databases with thematic and geographically referenced data include:
◦ offices requiring analysis or dissemination of geo-referenced statistical data
◦ public health services searching for explanations of disease clusters
◦ environmental agencies assessing the impact of changing land-use patterns on climate change
◦ geo-marketing companies doing customer segmentation based on spatial location
Sensor data mining: wireless sensor networks (air pollution monitoring)
Visual data mining: discovering patterns and trends hidden in the large data sets generated by the transition from analog to digital
Music data mining: discover relevant similarities among music corpora
Surveillance: government programs that use data mining to stop terrorism
◦ Pattern mining: For example, an association rule "beer ⇒ potato chips (80%)" states that four out of five customers that bought beer also bought potato chips
◦ Subject-based data mining: associations between individuals in data
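The 80% in the rule "beer ⇒ potato chips (80%)" is the rule's confidence: among the baskets that contain beer, the share that also contain potato chips. A minimal sketch of that calculation follows; the baskets are invented so that the numbers match the slide's example.

```python
def rule_confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    with_both = [b for b in with_antecedent if consequent in b]
    return len(with_both) / len(with_antecedent)


# Hypothetical baskets: five contain beer, four of those also contain chips.
baskets = [
    {"beer", "potato chips"},
    {"beer", "potato chips", "salsa"},
    {"beer", "potato chips"},
    {"beer", "potato chips", "bread"},
    {"beer", "diapers"},
    {"milk", "bread"},
]
print(rule_confidence(baskets, "beer", "potato chips"))  # 0.8
```

Confidence is directional: the reverse rule "potato chips ⇒ beer" is computed over the baskets that contain chips and can have a different value.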
Data mining in meteorology: changes in temperature, air pressure, moisture and wind direction - Self-Organizing Map (SOM)
Educational data mining: methods to better understand students (feedback, recommendations, predicting performance, planning and scheduling)
The term "data mining" in itself has no ethical implications; concerns arise from the way it is applied
Data mining requires data preparation, which can uncover information or patterns that may compromise confidentiality and privacy obligations (in particular through data aggregation)
The threat to an individual's privacy comes into play when the data, once compiled, cause the data miner, or anyone who has access to the newly compiled data set, to be able to identify specific individuals, especially when originally the data were anonymous
http://www.anderson.ucla.edu/faculty/jason.frand/teacher/technologies/hall/resources.htm
The process of discovering and extracting previously unknown knowledge from unstructured data
Text mining (sometimes text data mining) comprises three main activities:
◦ Information retrieval to gather relevant texts
◦ Information extraction to identify and extract entities, facts and relationships between them
◦ Data mining to find associations among the pieces of information extracted from many different texts
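The three activities above can be sketched as a toy pipeline. This is only an illustration under simple assumptions: retrieval is a keyword filter, extraction is a dictionary lookup standing in for a real entity recognizer, and mining is a co-occurrence count. The sentences and the vocabulary are invented.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical mini-collection (illustration only).
documents = [
    "Aspirin reduces the risk of Stroke in older patients.",
    "The match was postponed because of rain.",
    "Aspirin may cause Bleeding; Stroke risk was lower.",
    "Bleeding was rare in patients given Aspirin.",
]
vocabulary = {"Aspirin", "Stroke", "Bleeding"}


def retrieve(docs, keyword):
    """Information retrieval: keep documents that mention the keyword."""
    return [d for d in docs if keyword.lower() in d.lower()]


def extract_entities(doc, known_entities):
    """Information extraction: look up known entity names in the text."""
    words = set(re.findall(r"[A-Za-z]+", doc))
    return sorted(words & known_entities)


def mine_pairs(entity_lists):
    """Data mining: count how often two entities share a document."""
    counts = Counter()
    for entities in entity_lists:
        counts.update(combinations(entities, 2))
    return counts


relevant = retrieve(documents, "aspirin")
pairs = mine_pairs(extract_entities(d, vocabulary) for d in relevant)
for pair, count in pairs.items():
    print(pair, count)
```

Sorting the entities before pairing keeps each pair in one canonical order, so ("Aspirin", "Stroke") and ("Stroke", "Aspirin") are not counted separately.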
Data Mining
◦ In Text Mining, patterns are extracted from natural language text rather than databases
Web Mining
◦ In Text Mining, the input is free unstructured text, whilst web sources are structured
Information Retrieval (Information Access)
◦ No genuinely new information is found
◦ The desired information merely coexists with other valid pieces of information
Computational Linguistics (CPL) & Natural Language Processing (NLP)
◦ An extrapolation from Data Mining on numerical data to Data Mining from textual collections [Hearst 1999]
◦ CPL computes statistics over large text collections in order to discover useful patterns, which are used to inform algorithms for various sub-problems within NLP, e.g. part-of-speech tagging and word sense disambiguation [Armstrong 1994]
Information retrieval
Data mining
Machine learning
Statistics
Computational linguistics
Multilingual data mining: the ability to gain information across languages and cluster similar items from different linguistic sources according to their meaning
Text categorization
Text clustering
Concept/entity extraction
Production of granular taxonomies
Sentiment analysis
Document summarization
Entity relation modeling (i.e., learning relations between named entities)
Security applications: analysis of plain text sources such as Internet news and the study of text encryption
Biomedical applications: GoPubMed – the first semantic search engine on the Web (in biomedical literature), PubGene – combines biomedical text mining with network visualization
Software and applications: major firms such as IBM and Microsoft develop text mining software; in the public sector, the focus is on software for tracking and monitoring terrorist activities
Online media applications: editors are benefiting by being able to share, associate and package news across properties, significantly increasing opportunities to monetize content
Marketing applications: more specifically in analytical customer relationship management
Sentiment analysis: analysis of movie reviews, student evaluations, children's stories and news stories
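As a rough illustration of sentiment analysis on movie reviews, the sketch below scores a text by counting words from two small word lists. The lists are invented for this example; real systems rely on much larger lexicons or trained classifiers, and this sketch ignores negation entirely.

```python
import re

# Tiny hand-made lexicons (assumptions for illustration).
POSITIVE = {"good", "great", "wonderful", "enjoyable"}
NEGATIVE = {"bad", "boring", "awful", "dull"}


def sentiment_score(text):
    """Positive minus negative word count: above 0 leans positive, below 0 negative."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(w in POSITIVE for w in words)
    negatives = sum(w in NEGATIVE for w in words)
    return positives - negatives


print(sentiment_score("A wonderful cast and a great story."))       # 2
print(sentiment_score("Boring plot, awful dialogue, good music."))  # -1
```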
National Centre for Text Mining (NaCTeM) (University of Manchester + Tsujii Lab, University of Tokyo) http://www.nactem.ac.uk/index.php
School of Information at University of California, Berkeley http://www.ischool.berkeley.edu/
TerMine: automatically detects and extracts multi-word technical terms from text
http://www.nactem.ac.uk/software/termine/
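To give a feel for multi-word term extraction, here is a deliberately naive sketch. It is not the method used by the tool above. It treats every run of two or more consecutive non-stopwords as a candidate term and counts how often each run occurs. The stopword list and the sample text are assumptions made for this example.

```python
import re
from collections import Counter

# Small hand-made stopword list (assumption for illustration).
STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to",
             "for", "on", "by", "with", "from"}


def candidate_terms(text, min_words=2):
    """Count runs of at least `min_words` consecutive non-stopwords."""
    counts = Counter()
    # Punctuation ends a candidate, so split the text into fragments first.
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        run = []
        for word in fragment.split() + [None]:  # None flushes the last run
            if word is not None and word not in STOPWORDS:
                run.append(word)
            else:
                if len(run) >= min_words:
                    counts[" ".join(run)] += 1
                run = []
    return counts


text = ("Text mining is applied to biomedical literature. "
        "Results of text mining are stored with the biomedical literature, "
        "and named entity recognition is a step in text mining.")
terms = candidate_terms(text)
for term, count in terms.items():
    print(term, count)
```

A stopword filter cannot tell nouns from verbs, so a more realistic extractor would add part-of-speech filtering and a better ranking measure than raw frequency.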
http://eurosport.yahoo.com/07032012/58/upbeat-federer-beloved-indian-wells.html
AcroMine: finds expanded forms of acronyms from a database of those previously used by authors
http://www.nactem.ac.uk/software/acromine/
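One simple way to collect acronym expansions from text is to look for the pattern "long form (ACRONYM)" and check that the initials of the preceding words spell the acronym. The sketch below does only that. It is an illustration of the idea, not a description of how the tool above works, and the function name and sample text are made up.

```python
import re


def find_acronyms(text):
    """Return {acronym: expansion} for 'long form (ACR)' patterns whose initials match."""
    found = {}
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = match.group(1)
        # Take as many words before the parenthesis as the acronym has letters.
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(acronym):]
        initials = "".join(w[0].upper() for w in candidate)
        if initials == acronym:
            found[acronym] = " ".join(candidate)
    return found


text = ("Natural Language Processing (NLP) relies on tools such as "
        "Word Sense Disambiguation (WSD). See the appendix (TBD).")
print(find_acronyms(text))
# {'NLP': 'Natural Language Processing', 'WSD': 'Word Sense Disambiguation'}
```

The initials check rejects "(TBD)" because the three words before it do not spell the acronym. Expansions that contain stopwords or use inner letters would need a looser matching rule.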
FACTA+: a tool that helps discover associations between biomedical concepts contained in MEDLINE articles
http://refine1-nactem.mc.man.ac.uk/facta/
KLEIO: an advanced information retrieval system providing knowledge-enriched searching for biomedicine
http://www.nactem.ac.uk/Kleio/
MEDIE: uses semantic search to retrieve biomedical correlations from MEDLINE
http://www.nactem.ac.uk/medie/
1) A miniMEDIE application for Romanian that, given a subject and a verb, identifies possible objects from a previously built corpus.
Use the Romanian POS tagging service at: http://instrumente.infoiasi.ro/WebPosTagger/
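One possible starting point for the exercise is sketched below. It assumes the corpus has already been POS-tagged and is available as lists of (word, tag) pairs. The tag names "NOUN" and "VERB", the sample sentences, and the function name `build_index` are all assumptions; the actual output format of the POS service is not shown here and may differ. The heuristic takes the closest noun before a verb as its subject and the closest noun after it as its object.

```python
from collections import defaultdict


def build_index(tagged_sentences):
    """Map (subject, verb) to the set of objects seen with them in the corpus."""
    index = defaultdict(set)
    for sentence in tagged_sentences:
        for i, (word, tag) in enumerate(sentence):
            if tag != "VERB":
                continue
            nouns_before = [w for w, t in sentence[:i] if t == "NOUN"]
            nouns_after = [w for w, t in sentence[i + 1:] if t == "NOUN"]
            if nouns_before and nouns_after:
                key = (nouns_before[-1].lower(), word.lower())
                index[key].add(nouns_after[0].lower())
    return index


# Hypothetical pre-tagged corpus (illustration only).
corpus = [
    [("Maria", "NOUN"), ("citește", "VERB"), ("cartea", "NOUN")],
    [("Maria", "NOUN"), ("citește", "VERB"), ("ziarul", "NOUN")],
    [("Ion", "NOUN"), ("scrie", "VERB"), ("scrisoarea", "NOUN")],
]
index = build_index(corpus)
print(sorted(index[("maria", "citește")]))  # ['cartea', 'ziarul']
```

Word order alone is a weak cue for Romanian, where the subject can follow the verb or be omitted, so a fuller solution would probably need lemmatization and some syntactic analysis.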