Text classification with Lucene/Solr and LibSVM
By Majirus FANSI, PhD, Agile Software Developer (@majirus)
Motivation: Guiding user search
● Search engines are basically keyword-oriented
  – What about the meaning?
● Synonym search requires listing the synonyms
● The More-Like-This component is about more like THIS
● Category search makes for a better user experience
  – It deals with the cases where the user's keywords are not in the collection
  – The user searches for « emarketing »; you return documents on « webmarketing »
Outline
● Text Categorization
● Introducing Machine Learning
● Why SVM?
● How can Solr help?
Putting it all Together is our aim
Text classification or Categorization
● Aims
  – Classify documents into a fixed number of predefined categories
  – Each document can be in multiple categories, exactly one, or none at all
● Applications
  – Classifying emails (spam / not spam)
  – Guiding user search
● Challenges
  – Building text classifiers by hand is difficult and time-consuming
  – It is advantageous to learn classifiers from examples
Machine Learning
● Definition (by Tom Mitchell, 1998)
  "A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E"
● Experience E: observing the label of a document
● Task T: classifying a document
● Performance P: the probability that a document is correctly classified
Machine Learning Algorithms
● Unsupervised learning
  – Let the program learn by itself
  – Examples: market segmentation, social network analysis...
● Supervised learning
  – Teach the computer program how to do something
  – We give the algorithm the "right answers" for some examples
Supervised learning problems
● Regression
  – Predict a continuous-valued output
  – Ex: the price of houses in Corbeil-Essonnes
● Classification
  – Predict a discrete-valued output (+1, -1)
Supervised learning: Working
[Diagram] The training set (X, Y) of m training examples, where (x(i), y(i)) is the ith training example, feeds the training algorithm. It is the job of the learning algorithm to produce the model (hypothesis) h: given a feature vector x, h(x) outputs the predicted value y.
● X's: input variables or features
● Y's: output/target variable
Classifier/Decision Boundary
● Carves up the feature space into volumes
● Feature vectors in the same volume are assigned to the same class
● Decision regions are separated by surfaces
● The decision boundary is linear if it is a straight line in the dimensional space
  – A line in 2D, a plane in 3D, a hyperplane in 4+D
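As a minimal illustration of a linear decision boundary, the sketch below classifies 2D points by the sign of w.x + b; the weights and bias are made-up values, not learned ones.

```python
# A linear decision boundary in 2D: classify points by the sign of w.x + b.
# The weights w and bias b are illustrative, not the result of training.
def classify(x, w=(1.0, -1.0), b=0.0):
    """Return +1 if the point lies on the positive side of the hyperplane, else -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

print(classify((2.0, 1.0)))   # above the line x1 = x2 -> 1
print(classify((1.0, 3.0)))   # below the line -> -1
```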
Which algorithm for a text classifier?
Properties of text
● High-dimensional input space
  – More than 10,000 features
● Very few irrelevant features
● Document vectors are sparse
  – Few entries are non-zero
● Most text categorization problems are linearly separable
  – No need to map the input features to a higher-dimensional space
Classification algorithm: choosing the method
● Thorsten Joachims compares SVM to Naive Bayes, Rocchio, k-nearest neighbors, and the C4.5 decision tree
● SVM consistently achieves good performance on categorization tasks
  – It outperforms the other methods
  – It eliminates the need for feature selection
  – It is more robust than the others
Thorsten Joachims, 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features
SVM? Yes, but...
« The research community should direct efforts towards increasing the size of annotated training collections, while deemphasizing the focus on comparing different learning techniques trained only on small training corpora »
Banko & Brill, in « Scaling to Very Very Large Corpora for Natural Language Disambiguation »
What is SVM - Support Vector Machine?
● « Support-Vector Networks », Cortes & Vapnik, 1995
● SVM implements the following idea:
  – Map the input vectors into some high-dimensional feature space Z
    ● Through some non-linear mapping chosen a priori
  – In this feature space, a linear decision surface is constructed
  – Special properties of the decision surface ensure the high generalization ability of the learning machine
SVM - Classification of an unknown pattern
[Diagram] A non-linear transformation maps the input vector x into the feature space, where the support vectors sv1, sv2, ..., svk (the zi) and the weights w1, w2, ..., wN determine the classification.
SVM - decision boundary
● Optimal hyperplane
  – Used when the training data can be separated without errors
  – It is the linear decision function with maximal margin between the vectors of the two classes
● Soft margin hyperplane
  – Used when the training data cannot be separated without errors
Optimal hyperplane
Optimal hyperplane - figure
[Figure] The optimal hyperplane separates the two classes in the (x1, x2) plane with the optimal (maximal) margin.
SVM - optimal hyperplane
● Given the training set X of (x1, y1), (x2, y2), ..., (xm, ym); yi ∈ {-1, 1}
● X is linearly separable if there exists a vector w and a scalar b s.t.
  w.xi + b ≥ 1 if yi = 1    (1)
  w.xi + b ≤ −1 if yi = −1    (2)
  (1), (2) ⇒ yi(w.xi + b) ≥ 1    (3)
● Vectors xi for which yi(w.xi + b) = 1 are termed support vectors
  – They are used to construct the hyperplane
  – If the training vectors are separated without errors by an optimal hyperplane, the expected error probability is bounded:
  E[Pr(error)] ≤ E[number of support vectors] / number of training vectors    (4)
● The optimal hyperplane w0.z + b0 = 0 (5) is the unique one which separates the training data with a maximal margin
SVM - optimal hyperplane – decision function
● Let us consider the optimal hyperplane
  w0.z + b0 = 0    (5)
● The weight vector w0 can be written as some linear combination of the support vectors
  w0 = Σsupport vectors αi zi    (6)
● The linear decision function I(z) is of the form
  I(z) = sign(Σsupport vectors αi zi.z + b0)    (7)
● zi.z is the dot product between the support vectors zi and the vector z
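The decision function (7) can be sketched directly in Python; the support vectors, coefficients αi, and b0 below are illustrative values, not the output of a real training run.

```python
# Sketch of the SVM decision function I(z) = sign(sum_i alpha_i * (z_i . z) + b0),
# where z_i are the support vectors in feature space. All values are made up.
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def decide(z, support_vectors, alphas, b0):
    s = sum(a * dot(sv, z) for sv, a in zip(support_vectors, alphas)) + b0
    return 1 if s >= 0 else -1

svs = [(1.0, 0.0), (0.0, 1.0)]
alphas = [0.8, -0.8]          # each alpha_i carries the sign of its label y_i
print(decide((2.0, 0.0), svs, alphas, 0.0))   # 1
print(decide((0.0, 2.0), svs, alphas, 0.0))   # -1
```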
Soft margin hyperplane
Soft margin Classification
● We want to separate the training set with a minimal number of errors, i.e. minimize
  Φ(ξ) = Σi=1..m ξiσ ; ξi ≥ 0 ; for small σ > 0    (5)
  s.t. yi(w.xi + b) ≥ 1 − ξi ; i = 1, ..., m    (6)
● For small σ, the functional (5) describes the number of training errors
● Removing the subset of training errors from the training set, the remaining part can be separated without errors by constructing an optimal hyperplane
SVM - soft margin Idea
● The soft margin SVM can be expressed as
  minw,b,ξ (1/2)||w||² + C Σi=1..m ξi    (7)
  s.t. yi(w.xi + b) ≥ 1 − ξi ; ξi ≥ 0    (8)
● For sufficiently large C, the vector w0 and constant b0 that minimize (7) under (8) determine the hyperplane that
  – minimizes the sum of deviations, ξ, of training errors
  – maximizes the margin for correctly classified vectors
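The objective (7) under constraints (8) can be evaluated directly for a candidate (w, b); the tiny data set below is illustrative.

```python
# Evaluating the soft-margin objective (7): (1/2)||w||^2 + C * sum(xi_i),
# where xi_i = max(0, 1 - y_i (w . x_i + b)) is the hinge slack of example i.
def soft_margin_objective(w, b, C, examples):
    norm_sq = sum(wi * wi for wi in w)
    slacks = [max(0.0, 1.0 - y * (sum(wi * xi for wi, xi in zip(w, x)) + b))
              for x, y in examples]
    return 0.5 * norm_sq + C * sum(slacks)

data = [((2.0, 0.0), 1), ((-2.0, 0.0), -1), ((0.5, 0.0), 1)]
# Only the third point falls inside the margin (slack 0.5):
print(soft_margin_objective((1.0, 0.0), 0.0, 1.0, data))   # 1.0
```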
SVM - soft margin figure
[Figure] The separator and its soft margin in the (x1, x2) plane: correctly classified points outside the margin have ξ = 0, points inside the margin have 0 < ξ < 1, and misclassified points have ξ > 1.
Constructing text classifier with SVM
Constructing and using the text classifier
● Which library?
  – Efficient optimization packages are available: SVMlight, LibSVM
● From text to feature vectors
  – Lucene/Solr helps here
● Multi-class classification vs one-vs-the-rest
● Using the categories for semantic search
  – A dedicated Solr index with the most predictive terms
SVM library
● SVMlight
  – By Thorsten Joachims
● LibSVM
  – By Chang & Lin from National Taiwan University
  – Under heavy development and testing
  – Library for Java, C, Python, ...; package for the R language
● LibLinear
  – By Fan, Lin & al.
  – Brother of LibSVM
  – Recommended by the LibSVM authors for large-scale linear classification
LibLinear
● A Library for Large Linear Classification
  – Binary and multi-class
  – Implements logistic regression and linear SVM
● The format of the training and testing data file is:
  – <label> <index1>:<value1> <index2>:<value2> ...
  – Each line contains an instance and is ended by a '\n'
  – <label> is an integer indicating the class label
  – Each pair <index>:<value> gives a feature value
    ● <index> is an integer starting from 1
    ● <value> is a real number
  – Indices must be in ascending order
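This line format can be produced with a few lines of code; the helper below is a sketch, not part of LibLinear itself.

```python
# Writing one LibLinear training line: "<label> <index1>:<value1> <index2>:<value2> ...",
# with feature indices emitted in ascending order (zeros are simply omitted).
def liblinear_line(label, features):
    """features: dict mapping 1-based feature index (int) to its value."""
    pairs = " ".join(f"{i}:{v}" for i, v in sorted(features.items()))
    return f"{label} {pairs}"

print(liblinear_line(1, {123: 5, 101: 1, 234: 2}))
# -> "1 101:1 123:5 234:2"
```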
LibLinear input and dictionary
● Example input file for training:
  1 101:1 123:5 234:2
  -1 54:2 64:1 453:3
  – Zeros do not have to be represented
● We need a dictionary of terms in lexicographical order:
  1 .net
  2 aa
  ...
  6000 jav
  ...
  7565 solr
Building the dictionary
● Divide the overall training data into a number of portions
  – Using knowledge of your domain
    ● A software development portion, a marketing portion, ...
  – This avoids a very large dictionary
    ● A Java dev position and a marketing position share few common terms
● Use expert boolean queries to load a dedicated Solr core per domain
  – description:python AND title:python
Building the dictionary with Solr
● What do we need in the dictionary?
  – Terms properly analyzed
    ● LowerCaseFilterFactory, StopFilterFactory, ASCIIFoldingFilterFactory, SnowballPorterFilterFactory
  – Terms that occur in a minimum number of documents (df > min)
    ● Rare terms may cause the model to overfit
● Terms are retrieved from Solr
  – Using the Solr TermVectorComponent
Solr TermVectorComponent
● A SearchComponent designed to return information about terms in documents
  – tv.df returns the document frequency per term in the document
  – tv.tf returns the term frequency info per term in the document
    ● Used as the feature value
  – tv.fl provides the list of fields to get term vectors for
    ● Only the catch-all field we use for classification
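Parsing the term info returned by the TermVectorComponent into {term: tf} counts might look like the sketch below. Solr serializes its NamedLists as flat [key, value, ...] arrays; the exact response shape varies by Solr version, so the sample data here is an assumption to check against your deployment.

```python
# Turning the per-field term info from Solr's TermVectorComponent into
# {term: tf} feature counts. The flat [term, [stat, value, ...], ...] layout
# below mimics Solr's NamedList serialization (assumed; verify per version).
def term_freqs(field_terms):
    freqs = {}
    for term, stats in zip(field_terms[0::2], field_terms[1::2]):
        stats = dict(zip(stats[0::2], stats[1::2]))  # e.g. ["tf", 3, "df", 120]
        freqs[term] = stats.get("tf", 0)
    return freqs

sample = ["solr", ["tf", 3, "df", 120], "lucene", ["tf", 1, "df", 45]]
print(term_freqs(sample))   # {'solr': 3, 'lucene': 1}
```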
Solr Core configuration
● Set the termVectors attribute on the fields you will use
  <field name="title_and_description" type="texte_analyse" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true"/>
  – Normalize your text and use stemming during the analysis
● Enable the TermVectorComponent in solrconfig.xml
  <searchComponent name="tvComponent" class="org.apache.solr.handler.component.TermVectorComponent"/>
  – Configure a RequestHandler to use this component
  <lst name="defaults"> <bool name="tv">true</bool> </lst>
  <arr name="last-components"> <str>tvComponent</str> </arr>
Constructing Training and Test sets per model
Feature extraction
● A domain-expert query is used to extract the documents of each category
  – The TermVectorComponent returns the term info for each document
  – Each term is replaced by its index from the dictionary
    ● This is the attribute
  – Its tf info is used as the value
    ● Some use presence/absence (1/0), others tf-idf
● term_index_from_dico:term_freq is an input feature
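Mapping a document's term frequencies onto dictionary indices, as described above, can be sketched as follows; the dictionary and helper name are illustrative.

```python
# Replacing each term by its 1-based index from the lexicographically ordered
# dictionary, producing the "term_index:term_freq" features in ascending order.
def to_features(term_freqs, dictionary):
    """term_freqs: {term: tf}; dictionary: {term: index}; unknown terms are dropped."""
    feats = sorted((dictionary[t], tf) for t, tf in term_freqs.items() if t in dictionary)
    return " ".join(f"{i}:{tf}" for i, tf in feats)

dico = {".net": 1, "java": 6000, "solr": 7565}
print(to_features({"solr": 3, "java": 1, "ruby": 2}, dico))
# -> "6000:1 7565:3"
```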
Training and Test sets partition
● We shuffle the document set so that high-scoring docs do not all go to the same bucket
● We split the resulting list so that
  – 60 % goes to the training set (TS)
    ● These are the positive examples (the +1s)
  – 20 % goes to the validation set (VS)
    ● Positive in this model, negative in the others
  – 20 % is used for the other classes' training sets (OTS)
    ● These are negative examples for the others
● Balanced training set (≈50 % of +1s and ≈50 % of -1s)
  – The negatives come from the other models' 20 % OTS
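A minimal sketch of the shuffle-then-split step, assuming a fixed seed for reproducibility:

```python
import random

# Shuffle the labelled documents, then split them 60/20/20 into the training,
# validation, and "other training" buckets described on the slide.
def split_60_20_20(docs, seed=42):
    docs = list(docs)
    random.Random(seed).shuffle(docs)   # spread high-scoring docs across buckets
    n = len(docs)
    a, b = int(n * 0.6), int(n * 0.8)
    return docs[:a], docs[a:b], docs[b:]

train, valid, other = split_60_20_20(range(100))
print(len(train), len(valid), len(other))   # 60 20 20
```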
Model file
● The model file is saved after training
  – One model per category
  – It outlines the following:
    solver_type L2R_L2LOSS_SVC
    nr_class 2
    label 1 -1
    nr_feature 8920
    bias 1.000000000000000
    w
    -0.1626437446641374
    0
    7.152404908494515e-05
● Recall constraint (1): w.xi + b ≥ 1 if yi = 1
Most predictive terms
● The model file contains the weight vector w
● Use w to compute the most predictive terms of the model
  – They give an indication as to whether the model is good or not
    ● You are the domain expert
  – Useful to extend basic keyword search to semantic search
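Ranking terms by the magnitude of their weight in w takes only a few lines; the weights and dictionary below are made up for illustration.

```python
# Rank dictionary terms by |w_i|, the magnitude of their weight in the vector w
# stored in the LibLinear model file; a large |w_i| marks a predictive term.
def top_predictive_terms(w, dictionary, k=3):
    """w: weights, w[i] belongs to feature index i+1; dictionary: {term: 1-based index}."""
    by_index = {idx: term for term, idx in dictionary.items()}
    ranked = sorted(range(1, len(w) + 1), key=lambda i: abs(w[i - 1]), reverse=True)
    return [(by_index[i], w[i - 1]) for i in ranked[:k] if i in by_index]

dico = {"java": 1, "solr": 2, "the": 3}
print(top_predictive_terms([0.9, 1.4, 0.01], dico, k=2))
# -> [('solr', 1.4), ('java', 0.9)]
```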
Toward semantic search - Indexing
● Create a category core in Solr
  – Each document represents a category
    ● One field for the category ID
    ● One multi-valued field holds its top predictive terms
● At indexing time
  – Each document is sent to the classification service
  – The service returns the categories of the document
  – The categories are saved in a multi-valued field along with the other domain-pertinent document fields
Toward semantic search - searching
● At search time
  – The user query is run on the category core
  – The returned categories are used to extend the initial query
    ● A boost < 1 is assigned to the categories
● What about libShortText?
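Extending the user query with the matched categories at a boost below 1 could look like the following sketch; the category field name and boost value are assumptions, not prescribed by the talk.

```python
# Extend the user's query with the categories matched on the category core,
# each boosted below 1 so the original keywords keep priority.
def extend_query(user_query, categories, boost=0.5):
    extra = " OR ".join(f'category:"{c}"^{boost}' for c in categories)
    return f"({user_query}) OR {extra}" if extra else user_query

print(extend_query("emarketing", ["webmarketing"]))
# -> '(emarketing) OR category:"webmarketing"^0.5'
```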
References
● Cortes and Vapnik, 1995. Support-Vector Networks
● Chang and Lin, 2012. LibSVM: A Library for Support Vector Machines
● Fan, Lin, et al., 2012. LibLinear: A Library for Large Linear Classification
● Thorsten Joachims, 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features
● Rifkin and Klautau, 2004. In Defense of One-Vs-All Classification
A big thank you
● Lucene/Solr Revolution EU 2013 organizers
● To Valtech Management
● To Michels, Maj-Daniels, and Marie-Audrey Fansi
● To all of you for your presence and attention
Questions ?
To my wife, Marie-Audrey, for all the attention she pays to our family