Post on 26-May-2015
transcript
FUNCTION OF RIVAL SIMILARITY IN COGNITIVE DATA ANALYSIS
Nikolay Zagoruiko, Irina Borisova, Vladimir Dyubanov, Olga Kutnenko
Institute of Mathematics of the Siberian Division of the Russian Academy of Sciences,
Pr. Koptyug 4, 630090 Novosibirsk, Russia,
zag@math.nsc.ru
Data Analysis, Pattern Recognition, Empirical Prediction, Discovery of Regularities, Data Mining, Machine Learning,
Knowledge Discovery, Intelligent Data Analysis, Cognitive Computations
Special attention is drawn to the human abilities:
- to estimate similarities and distinctions between objects;
- to classify objects;
- to recognize whether new objects belong to the available classes;
- to discover natural dependences between characteristics;
- to use these dependences (knowledge) for forecasting.
Specificity of Data Mining tasks:
• Polytypic attributes
• Number of attributes >> number of objects
• Presence of noise, outliers ("spikes") and missing values (blanks)
• Absence of information on the distributions
Situation in Data Mining
Thousands of algorithms. The reasons: types of scales, dependences between features, laws of
distribution, linear vs. nonlinear decision rules, small or large training sets, ...
How to build algorithms that are invariant to these features?
Which function is common to all DM algorithms?
The basic function used by a person in clustering, recognition, feature selection, etc. is the estimation of similarity between objects.
Measures of Similarity

S1(a, b) = 1 − ( Σ_{i=1..n} (x_i^a − x_i^b)² )^{1/2}
S2(a, b) = 1 − Σ_{i=1..n} |x_i^a − x_i^b|
S3(a, b) = 1 − max_i |x_i^a − x_i^b|
S4(a, b) = Σ_{i=1..n} min(x_i^a, x_i^b) / max(x_i^a, x_i^b)
S5(a, b) = e^{−Σ_{i=1..n} |x_i^a − x_i^b|}
....
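As an illustration, the first three measures are easy to sketch in Python (a minimal sketch, assuming objects are plain tuples of numeric features, ideally pre-scaled so the "1 −" forms stay near the [−1, +1] range):

```python
import math

def s_euclidean(a, b):
    # S1: one minus the Euclidean distance between feature vectors
    return 1 - math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def s_manhattan(a, b):
    # S2: one minus the city-block (Manhattan) distance
    return 1 - sum(abs(x - y) for x, y in zip(a, b))

def s_chebyshev(a, b):
    # S3: one minus the Chebyshev (maximum-coordinate) distance
    return 1 - max(abs(x - y) for x, y in zip(a, b))
```

Identical objects give similarity 1 under all three measures; the measures differ in how they aggregate per-coordinate differences.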
Similarity is not absolute, but a relative category
Is an object b similar to a, or is it not? Do the objects b and a belong to one class?
(Illustration: the pair a, b judged alone versus in the presence of a third object c.)
We should know the answer to the question: similar in competition with what?
Measure F(z,a|b) of similarity of the object z to the object a in competition with the object b.
Locality: F depends only on the distances (z,a) and (z,b).
Normality: if z = a, F(z,a|b) = +1; if z = b, F(z,a|b) = -1.
If (z,a) = (z,b), F(z,a|b) = F(z,b|a) = 0.
Invariance to translation and rotation of the coordinates.
Antisymmetry: F(z,a|b) = -F(z,b|a)
======================================
Symmetry: F(z,a|b) = F(z,b|a)
Triangle inequality: F(z,a|b) + F(a,b|z) ≥ F(b,z|a)
======================================
Competitive Space
Function of Concurrent (Rival) Similarity (FRiS)
(Figure: an object z lies between the competing standards A and B; r1 is the distance from z to A, r2 the distance from z to B; F runs from +1 at A to -1 at B.)

F(z, 1|2) = (r2 − r1) / (r2 + r1)
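The rival similarity defined above is straightforward to compute; a minimal sketch in Python, assuming Euclidean distance and tuples as objects:

```python
import math

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fris(z, a, b):
    """Rival similarity of z to a in competition with b:
    F(z, a|b) = (r2 - r1) / (r2 + r1), r1 = d(z, a), r2 = d(z, b).
    Equals +1 when z coincides with a, -1 when z coincides with b,
    and 0 when z is equidistant from the two rivals."""
    r1, r2 = dist(z, a), dist(z, b)
    if r1 + r2 == 0:  # all three points coincide
        return 0.0
    return (r2 - r1) / (r2 + r1)
```

Note that the antisymmetry property F(z,a|b) = -F(z,b|a) holds by construction.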
DM methods that use the FRiS function allow improving old algorithms and solving some new tasks:
• Quantitative estimation of compactness
• Choice of informative attributes
• Construction of decision rules
• Censoring of the training set
• Generalized classification
• Filling of blanks (imputation)
• Forecasting
• Ordering of objects
All pattern recognition methods are based on the hypothesis of compactness (Braverman E.M., 1962). The patterns are compact if:
- the number of boundary points is small in comparison with their total number;
- the patterns are separated from each other by not too elaborate borders.

Compactness
For high compactness it is necessary to have:
- maximum similarity between objects of one pattern;
- minimum similarity between objects of different patterns.
Compact patterns should satisfy two conditions.

1. Maximal similarity between objects of the same pattern:

   D_i = (1/M_A) Σ_{j=1..M_A} F(j, i | b),   F(j, i | b) = (r2 − r1) / (r2 + r1) → max

2. Maximal difference of these objects from the objects of other patterns:

   T_i = (1/(M_A · M_B)) Σ_{i=1..M_A} Σ_{q=1..M_B} F(q, s | i),   F(q, s | i) = (r2 − r1) / (r2 + r1) → max

Each object i receives the compactness estimate C_i = (D_i + T_i) / 2, and for the patterns as a whole

   C_A = (1/M_A) Σ_{i=1..M_A} C_i,   C_B = (1/M_B) Σ_{q=1..M_B} C_q,   C* = C_A · C_B.
Algorithm FRiS-Stolp for selection of the standards ("stolps"): the standard is the object i with the maximal value of C_i = (D_i + T_i) / 2.
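A minimal sketch of this selection in Python. It is a simplification of FRiS-Stolp, not the authors' exact procedure: the rival distance is taken to the nearest object of the other class rather than to its stolps, and a single standard is chosen:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def fris(r1, r2):
    # rival similarity expressed through the two competing distances
    return (r2 - r1) / (r2 + r1) if r1 + r2 else 0.0

def select_stolp(own, rival):
    """Return the object of `own` maximizing C_i = (D_i + T_i) / 2.
    D_i: mean similarity of own-class objects to candidate i,
         in competition with their nearest rival-class object.
    T_i: mean similarity of rival-class objects to their nearest
         same-class neighbour, in competition with candidate i."""
    best, best_c = None, -2.0
    for i in own:
        d = sum(fris(dist(j, i), min(dist(j, q) for q in rival))
                for j in own) / len(own)
        t = sum(fris(min(dist(q, s) for s in rival if s is not q), dist(q, i))
                for q in rival) / len(rival)
        c = (d + t) / 2
        if c > best_c:
            best, best_c = i, c
    return best, best_c
```

For well-separated classes the winning C is close to +1; values near 0 signal overlapping, non-compact patterns.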
Decision rules. Recognition.

(Figures: decision rules and recognition results for different numbers of stolps: k = K, K+2, K+11, K+29.)
Censoring of the training set

Step | Compactness C | Recognized (of 90) | Stolps
1 | 0.8689 | 90 (90) | 20
2 | 0.8902 | 90 (90) | 20
3 | 0.9084 | 90 (90) | 20
4 | 0.9167 | 90 (90) | 20
5 | 0.8903 | 90 (90) | 20
6 | 0.7309 | 88 (90) | 9
7 | 0.2324 | 86 (90) | 7

The characteristics H_k = (C_k, m'/m, ..., M'/M) are compared with the control quality P; the censoring step is chosen as k* = arg max |r|(H, P) over k = 1, 2, ..., 7; here k* = 4 or 5.
Informativeness by Fisher (for the normal distribution):

   FI = |m_1 − m_2| / (σ_1² + σ_2²)^{1/2}

Compactness has the same sense and can be used as a criterion of informativeness that is invariant to the law of distribution and to the ratio N:M. Comparative studies have shown an appreciable advantage of this criterion over the number of errors in cross-validation.
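For reference, the Fisher criterion for a single attribute is a one-liner over the two class samples (a sketch; m and σ² are taken as the sample means and population variances):

```python
import math
import statistics

def fisher_informativeness(x1, x2):
    """Fisher criterion FI = |m1 - m2| / sqrt(s1^2 + s2^2) for one
    attribute measured on two samples x1 and x2."""
    m1, m2 = statistics.fmean(x1), statistics.fmean(x2)
    s1, s2 = statistics.pvariance(x1), statistics.pvariance(x2)
    return abs(m1 - m2) / math.sqrt(s1 + s2)
```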
Criteria of informativeness
Comparison of the criteria (CV - FRiS)
Order of attributes by informativeness: the two orderings give C = 0.661 and C = 0.883.

(Plot: criterion value vs. noise level 0.05-0.3 for the criteria Fs and U; N = 100, M = 2×100, mt = 2×35, mC = 2×65 + noise.)
Algorithm GRAD is based on a combination of two greedy approaches: forward and backward search. At the forward stage the algorithm Addition is used; at the backward stage the algorithm Deletion is used.

Algorithm AdDel. To ease the influence of accumulating errors, a relaxation method is applied:
n1 - number of the most informative attributes added to the subsystem (Addition);
n2 < n1 - number of the least informative attributes eliminated from the subsystem (Deletion).
Relaxation method: n steps forward, n/2 steps back.

Algorithm AdDel: reliability (R) of recognition in spaces of different dimension:
R(AdDel) > R(DelAd) > R(Ad) > R(Del)
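The Add/Delete relaxation can be sketched as follows. This is a simplified illustration, not the authors' exact procedure: `score` stands for any subset-quality criterion (e.g. FRiS compactness), and `n1`, `n2`, `target_size` are hypothetical parameters:

```python
def ad_del(features, score, n1=3, n2=1, target_size=4):
    """Grow a feature subset by n1 greedy additions, then drop the n2
    features whose removal hurts the criterion least (n2 < n1)."""
    selected = set()
    limit = min(target_size, len(features))
    while True:
        for _ in range(n1):  # Addition stage
            rest = [f for f in features if f not in selected]
            if not rest or len(selected) >= limit:
                break
            selected.add(max(rest, key=lambda f: score(selected | {f})))
        if len(selected) >= limit:
            return selected
        for _ in range(n2):  # Deletion stage: keep the best remainder
            selected.remove(max(selected, key=lambda f: score(selected - {f})))
```

With a purely additive score the procedure simply keeps the top-weighted features; its value shows with criteria where features interact, which is exactly where plain forward selection accumulates errors.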
Algorithm GRAD
• AdDel can work not only with single attributes, but also with groups of attributes (granules) of different capacity m = 1, 2, 3, ...
• The granules can be formed by exhaustive search.
• But: the problem of combinatorial explosion!
Solution: orientation on the individual informativeness of attributes.

(Plot: frequency f of an attribute entering an informative subsystem vs. its serial number L by individual informativeness.)

This allows granulating only the most informative part of the attributes.
Algorithm GRAD (Granulated AdDel)

1. Independent testing of N attributes; selection of the m1 << N first best (m1 granules of power 1).
2. Forming C(m1, 2) combinations; selection of the m2 << C(m1, 2) first best (m2 granules of power 2).
3. Forming C(m1, 3) combinations; selection of the m3 << C(m1, 3) first best (m3 granules of power 3).

M = ⟨m1, m2, m3⟩ - the set of secondary attributes (granules). AdDel(M) selects the m* << |M| best granules, which include n* attributes, e.g. X = ⟨x2, x6, x9, x25, ...⟩.
Value of FRiS for points on a plane

Classification (Algorithm FRiS-Class)

FRiS-Cluster divides the objects into clusters; FRiS-Tax unites the clusters into classes (taxons). Using the FRiS function allows:
- making taxons of any form;
- searching for the optimal number of taxons.

Examples of taxonomy by the FRiS-Class algorithm
Comparison the FRiS-Class with other algorithms of taxonomy
(Plot: quality of taxonomy vs. number of taxons K = 2...15 for FRiS-Cluster, FRiS-Tax, K-means, Forel and Scat.)
Taxonomic Decision Rule
Universal classification

Labeled → Semilabeled → Unlabeled
(Pattern Recognition) (partially labeled recognition) (Clustering)

Unlabeled → Semilabeled → Labeled
(Clustering) (partially labeled recognition) (Pattern Recognition)
=================================
FRiS-TDR
Some real DM tasks

Task | K | M | N
Medicine: Diagnostics of Diabetes II type | 3 | 43 | 5520
Medicine: Diagnostics of Prostate Cancer | 4 | 322 | 17153
Medicine: Recognition of type of Leukemia | 2 | 38 | 7129
Physics: Complex analysis of spectra | 7 | 20-400 | 1024
Commerce: Forecasting of book selling (Data Mining Cup 2009) | - | 4812 | 1862
Data Mining Cup 2009, http://www.prudsys.de/Service/Downloads/bin
Prognosis of data on an absolute scale

TRAINING: rows 1...2300, attribute columns 1...1856 plus 8 target columns; 84% of the values = 0.
CONTROL: rows 1...2418; 19344 cells (2418 × 8) are to be predicted.
DMC 2009
618 teams from 164 universities in 42 countries participated; 231 sent solutions, of which 49 were selected for the rating.
NN | Team | Errors | NN | Team | Errors
1 | Uni Karlsruhe TH_II | 17260 | 16 | TU Graz | 23626
2 | TU Dortmund | 17912 | 18 | Uni Weimar_I | 23796
3 | TU Dresden | 18163 | 19 | Zhejiang University of Sc. and Tech. | 23952
4 | Novosibirsk State University | 18353 | 20 | University Laval | 24884
5 | Uni Karlsruhe TH_I | 18763 | 24 | University of Southampton | 25694
6 | FH Brandenburg_I | 19814 | 25 | Telkom Institute of Technology | 25829
7 | FH Brandenburg_II | 20140 | 26 | University of Central Florida | 26254
8 | Hochschule Anhalt | 20767 | 32 | Indian Institute of Technology | 28517
9 | Uni Hamburg | 21064 | 34 | Anna University Coimbatore | 28670
10 | KTH Royal Institute of Technology | 21195 | 38 | Technical University of Kosice | 32841
11 | RWTH Aachen_I | 21780 | 39 | University of Edinburgh | 45096
14 | Budapest University of Technology | 23277 | 48 | Warsaw School of Economics | 77551
15 | Isfahan University of Technology | 23488 | 49 | FH Hannover | 1938612
Comparison with 10 methods of feature selection

Jeffery I., Higgins D., Culhane A. Comparison and evaluation of methods for generating differentially expressed gene lists from microarray data. BMC Bioinformatics, 2006, 7:359. http://www.biomedcentral.com/1471-2105/7/359

9 tasks on microarray data; 10 methods of feature selection; independent attributes; selection of the n first (best) attributes. Criterion: minimum of errors on cross-validation, 10 times by 50%.

4 decision rules: Support Vector Machine (SVM), Between Group Analysis (BGA), Naive Bayes Classification (NBC), K-Nearest Neighbors (KNN).

40 decisions for each of the 9 tasks.
Methods of selection

Method | Results
Significance analysis of microarrays (SAM) | 42
Analysis of variance (ANOVA) | 43
Empirical Bayes t-statistic | 32
Template matching | 38
maxT | 37
Between group analysis (BGA) | 43
Area under the receiver operating characteristic curve (ROC) | 37
Welch t-statistic | 39
Fold change | 47
Rank products | 42
FRiS-GRAD | 12

Empirical Bayes t-statistic is best for a middle-sized set of objects; area under a ROC curve for small noise and a large set; rank products for large noise and a small set.
Results on tasks

Task | N | m1/m2 | max of 4 | GRAD
ALL1 | 12625 | 95/33 | 100.0 | 100.0
ALL2 | 12625 | 24/101 | 78.2 | 80.8
ALL3 | 12625 | 65/35 | 59.1 | 73.8
ALL4 | 12625 | 26/67 | 82.1 | 83.9
Prostate | 12625 | 50/53 | 90.2 | 93.1
Myeloma | 12625 | 36/137 | 82.9 | 81.4
ALL/AML | 7129 | 47/25 | 95.9 | 100.0
DLBCL | 7129 | 58/19 | 94.3 | 89.8
Colon | 2000 | 22/40 | 88.6 | 89.5
Recognition of two types of Leukemia - ALL and AML
ALL AMLTraining set 38 27 11 N = 7129Control set 34 20 14
I. Guyon, J. Weston, S. Barnhill, V. Vapnik. Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 2002, 46(1-3): 389-422.

Training set 38, test set 34:

Ng | Vsuc | Vext | Vmed | Tsuc | Text | Tmed | P
7129 | 0.95 | 0.01 | 0.42 | 0.85 | -0.05 | 0.42 | 29
4096 | 0.82 | -0.67 | 0.30 | 0.71 | -0.77 | 0.34 | 24
2048 | 0.97 | 0.00 | 0.51 | 0.85 | -0.21 | 0.41 | 29
1024 | 1.00 | 0.41 | 0.66 | 0.94 | -0.02 | 0.47 | 32
512 | 0.97 | 0.20 | 0.79 | 0.88 | 0.01 | 0.51 | 30
256 | 1.00 | 0.59 | 0.79 | 0.94 | 0.07 | 0.62 | 32
128 | 1.00 | 0.56 | 0.80 | 0.97 | -0.03 | 0.46 | 33
64 | 1.00 | 0.45 | 0.76 | 0.94 | 0.11 | 0.51 | 32
32 | 1.00 | 0.45 | 0.65 | 0.97 | 0.00 | 0.39 | 33
16 | 1.00 | 0.25 | 0.66 | 1.00 | 0.03 | 0.38 | 34
8 | 1.00 | 0.21 | 0.66 | 1.00 | 0.05 | 0.49 | 34
4 | 0.97 | 0.01 | 0.49 | 0.91 | -0.08 | 0.45 | 31
2 | 0.97 | -0.02 | 0.42 | 0.88 | -0.23 | 0.44 | 30
1 | 0.92 | -0.19 | 0.45 | 0.79 | -0.27 | 0.23 | 27

Pentium, T = 3 hours.
FRiS | Decision rule (gene/weight) | P
0.72656 | 537/1, 1833/1, 2641/2, 4049/2 | 34
0.71373 | 1454/1, 2641/1, 4049/1 | 34
0.71208 | 2641/1, 3264/1, 4049/1 | 34
0.71077 | 435/1, 2641/2, 4049/2, 6800/1 | 34
0.70993 | 2266/1, 2641/2, 4049/2 | 34
0.70973 | 2266/1, 2641/2, 2724/1, 4049/2 | 34
0.70711 | 2266/1, 2641/2, 3264/1, 4049/2 | 34
0.70574 | 2641/2, 3264/1, 4049/2, 4446/1 | 34
0.70532 | 435/1, 2641/2, 2895/1, 4049/2 | 34
0.70243 | 2641/2, 2724/1, 3862/1, 4049/2 | 34

Shorter rules (gene name/weight): 2641/1, 4049/1 | 33; 2641/1 | 32.

In the first 27 subspaces P = 34/34.

Pentium, T = 15 sec.
Comparison: I. Guyon, J. Weston, S. Barnhill, V. Vapnik (SVM) vs. Zagoruiko N., Borisova I., Dyubanov V., Kutnenko O. (FRiS)

Best features | SVM | FRiS
FRE 803, 4846 | 30 (88%) | 33 (97%)
4846 | 27 (79%) | 30 (88%)
Projection of the training set onto features 2641 and 4049 (the AML and ALL classes separate).
Diabetes of type II: ordering of patients

M = 43 (17 + 8 + 18), N = 5520

• The average similarity Fav of each person to the healthy people is computed; on the scale from F = +1 (healthy) to F = -1 (ill), the objects order themselves as Healthy, Group of risk, Patients.
• The group of risk did not participate in training.
• This is useful for early diagnostics of diseases and for monitoring the process of treatment.
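A minimal sketch of such an ordering score. The exact construction used on the slide is not given, so this is an assumption: Fav is taken as the mean rival similarity of an object to the healthy group, with the object's nearest ill neighbour as the competitor:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_rival_similarity(z, healthy, ill):
    """Fav of object z: mean FRiS of z to each healthy object, each in
    competition with z's nearest ill object. Values near +1 place z
    among the healthy, near -1 among the patients; intermediate values
    suggest a risk group."""
    r2 = min(dist(z, q) for q in ill)  # distance to the nearest rival
    vals = [(r2 - dist(z, h)) / (r2 + dist(z, h)) if r2 + dist(z, h) else 0.0
            for h in healthy]
    return sum(vals) / len(vals)
```

Sorting all examined persons by this score yields exactly the kind of ordering shown on the slide, with the risk group falling between the two ends of the scale.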
DM methods that use the FRiS function allow improving old algorithms and solving some new tasks:
• Quantitative estimation of compactness
• Choice of informative attributes
• Construction of decision rules
• Censoring of the training set
• Generalized classification
• Filling of blanks (imputation)
• Forecasting
• Ordering of objects
Unsettled problems
• Stolp + corridor (FRiS + LDR)
• Imputation of polytypical tables
• Uniting tasks of different types (UC + X)
• Optimization of algorithms
• Realization of the program system (OTEX 2)
• Applications (medicine, genetics, ...)
• ...
Conclusion
The FRiS function:
1. Provides an effective measure of similarity, informativeness and compactness.
2. Provides unification of methods and invariance to the parameters of tasks, the law of distribution, and the ratio M:N.
3. Provides high enough quality of decisions.
Publications:
http://math.nsc.ru/~wwwzag
Thank you!
• Questions, please?