4/28/15, 7:02 PM SAS Programming for Data Mining: AUC calculation using Wilcoxon Rank Sum Test
Page 1 of 8 http://www.sas-programming.com/2009/10/auc-calculation-using-wilcoxon-rank-sum.html
Copyright 2006-2014 / SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
SAS Programming for Data Mining
Friday, October 23, 2009
AUC calculation using Wilcoxon Rank Sum Test
Accurately calculate AUC (Area Under the Curve) in SAS for data rank ordered by a binary classifier
To calculate AUC for a SAS data set that has already been rank ordered by a binary classifier (such as linear logistic regression), where we have the binary outcome Y and a rank-order measurement P_0 or P_1 (for class 0 and class 1, respectively), we can use PROC NPAR1WAY to obtain the Wilcoxon rank sum statistics, and from those we can obtain an accurate measurement of AUC for the data.
The relationship between AUC and the Wilcoxon rank sum statistic is: AUC = (W - W0)/(N1*N0) + 0.5, where N1 and N0 are the frequencies of class 1 and class 0, W is the Wilcoxon rank sum of the scores for class 1, and W0 is the expected sum of ranks for that class under H0 (random ordering).
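As a cross-check on this relationship, here is a minimal sketch in Python rather than SAS. It is not part of the original post, and the function names and toy data are assumptions made for illustration. It computes AUC once from the rank sum formula (using midranks for ties) and once from the direct definition, the share of (class 1, class 0) pairs that are ordered correctly with ties counted as one half.

```python
# Sketch (not from the original post): AUC from the Wilcoxon rank sum.
# Function names and the toy data are illustrative assumptions.

def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def auc_rank_sum(y, score):
    """AUC = (W - W0) / (N1 * N0) + 0.5, with W the rank sum of class 1."""
    ranks = midranks(score)
    n1 = sum(1 for t in y if t == 1)
    n0 = len(y) - n1
    w = sum(r for r, t in zip(ranks, y) if t == 1)  # Wilcoxon rank sum, class 1
    w0 = n1 * (n1 + n0 + 1) / 2.0                   # expected rank sum under H0
    return (w - w0) / (n1 * n0) + 0.5


def auc_pairs(y, score):
    """Direct definition: correctly ordered (class 1, class 0) pairs, ties = 1/2."""
    pos = [s for s, t in zip(score, y) if t == 1]
    neg = [s for s, t in zip(score, y) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


y = [0, 0, 1, 0, 1, 1, 0, 1]
score = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.4, 0.4]
print(auc_rank_sum(y, score), auc_pairs(y, score))  # prints 0.625 0.625
```

Here W = 20 and W0 = 18 with N1 = N0 = 4, so the formula gives 2/16 + 0.5 = 0.625, matching the 10 correctly ordered pairs (counting two ties as one) out of 16.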
In one application example shown below, PROC LOGISTIC reports c=0.911960, while this method calculates AUC=0.9119491555.
%macro AUC(dsn, Target, score);
  /* Capture the WilcoxonScores table without printing any output */
  ods select none;
  ods output WilcoxonScores=WilcoxonScore;
  proc npar1way wilcoxon data=&dsn;
    where &Target^=.;
    class &Target;
    var &score;
  run;
  ods select all;

  data AUC;
    set WilcoxonScore end=eof;
    retain v1 v2 1;
    /* v1 = |W0 - W|, taken from the first class row */
    if _n_=1 then v1=abs(ExpectedSum - SumOfScores);
    /* v2 accumulates N1*N0 over the two class rows */
    v2=N*v2;
    if eof then do;
      d=v1/v2;
      Gini=d*2;
      AUC=d+0.5;
      put AUC= GINI=;
      keep AUC Gini;
      output;
    end;
  run;
%mend;

data test;
  do i = 1 to 10000;
    x = ranuni(1);
    y = (x + rannor(2315)*0.2 > 0.35);
    output;
  end;
run;

ods select none;
ods output Association=Asso;
proc logistic data=test desc;
  model y = x;
  score data=test out=predicted;
run;
ods select all;

data _null_;
  set Asso;
  if Label2='c' then put 'c-stat=' nValue2;
run;

%AUC(predicted, y, p_0);
NPAR1WAY gets AUC = 0.91766634744; LOGISTIC reports c-statistic = 0.917659.
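One detail of the call above is worth a note: the macro is given p_0, the score for class 0, yet it still returns an AUC above 0.5. That follows from the abs() in the macro's DATA step, because the two classes' rank sums deviate from their expected values by equal and opposite amounts. The sketch below is in Python, is not part of the original post, and uses hypothetical function names and toy data; it mirrors the macro's arithmetic and assumes untied scores for simplicity.

```python
# Sketch (not from the original post): why the macro's abs() makes the result
# independent of score direction. Names and data are illustrative assumptions.

def rank_sum_gap(y, score, cls):
    """Return W - W0 for class `cls`: its rank sum minus the H0 expectation.

    Assumes no tied scores, so plain 1-based ranks are sufficient.
    """
    order = sorted(range(len(score)), key=lambda i: score[i])
    rank = {idx: pos + 1 for pos, idx in enumerate(order)}
    n_cls = sum(1 for t in y if t == cls)
    w = sum(rank[i] for i, t in enumerate(y) if t == cls)
    w0 = n_cls * (len(y) + 1) / 2.0
    return w - w0


def macro_style_auc(y, score):
    """Mirror the macro's arithmetic: |W0 - W| / (N1 * N0) + 0.5."""
    n1 = sum(y)
    n0 = len(y) - n1
    return abs(rank_sum_gap(y, score, 1)) / (n1 * n0) + 0.5


y = [0, 0, 1, 0, 1, 1]
p_1 = [0.10, 0.30, 0.25, 0.60, 0.70, 0.90]  # score for class 1
p_0 = [1.0 - p for p in p_1]                # score for class 0

# The two classes deviate from expectation by equal and opposite amounts.
print(rank_sum_gap(y, p_1, 1), rank_sum_gap(y, p_1, 0))  # prints 2.5 -2.5
# So the macro-style AUC is the same whichever score column is passed.
print(macro_style_auc(y, p_1) == macro_style_auc(y, p_0))  # prints True
```

The flip side is that a classifier that ranks worse than random would also be reported with an AUC above 0.5, so the direction of the score should be checked separately if that is a possibility.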
So, which one is more accurate? I would say NPAR1WAY. The reason is that we can use yet another procedure, PROC FREQ, to verify the Gini value, which is 2*(AUC-0.5). The Gini index is called Somers' D in PROC FREQ. Here, the Gini value calculated from NPAR1WAY is 0.8353269487, the same as the Somers' D C|R reported by PROC FREQ (C|R because the column variable is the predictor):
proc freq data=test noprint;
  tables y*x / measures;
  output out=_measures measures;
run;

data _null_;
  set _measures;
  put _SMDCR_=;
run;
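To see why this Somers' D equals 2*(AUC-0.5), the following sketch (Python, not part of the original post, with hypothetical function names and toy data) counts pairs by brute force. It treats the binary outcome as the independent variable, so only pairs with different outcomes enter the denominator; pairs tied on the score stay in the denominator but count as neither concordant nor discordant.

```python
# Sketch (not from the original post): Gini = 2*(AUC - 0.5) as a Somers' D.
# Names and data are illustrative assumptions; this is brute-force pair counting.

def somers_d_given_y(y, x):
    """(concordant - discordant) / (number of pairs with different y)."""
    conc = disc = untied_y = 0
    n = len(y)
    for i in range(n):
        for j in range(i + 1, n):
            if y[i] == y[j]:
                continue  # pairs tied on the outcome are excluded
            untied_y += 1
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                conc += 1
            elif s < 0:
                disc += 1
            # s == 0: tied on the score, counted in the denominator only
    return (conc - disc) / untied_y


def auc_pairs(y, x):
    """Share of (class 1, class 0) pairs ordered correctly, ties counted as 1/2."""
    pos = [v for v, t in zip(x, y) if t == 1]
    neg = [v for v, t in zip(x, y) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


y = [0, 0, 1, 0, 1, 1, 0, 1]
x = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9, 0.4, 0.4]
print(somers_d_given_y(y, x), 2 * (auc_pairs(y, x) - 0.5))  # prints 0.25 0.25
```

In this toy data there are 9 concordant, 5 discordant, and 2 score-tied pairs among the 16 pairs with different outcomes, giving D = 4/16 = 0.25 and AUC = 0.625.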
Then why not just use PROC FREQ, since the coding is so simple? The answer is speed. Check the log below for a data set with only 100,000 observations: 37.63 sec vs. 0.15 sec in real time:
3546
3547  data one;
3548     call streaminit(98676876);
3549     do id=1 to 1e5;
3550        score=ranuni(0)*1000;
3551        if score+rannor(0)>0 then y=1;
3552        else y=0;
3553        output;
3554        drop id;
3555     end;
3556  run;

NOTE: The data set WORK.ONE has 100000 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.04 seconds
      cpu time            0.04 seconds

3557
3558  proc freq data=one noprint;
3559     tables score*y/measures noprint;
3560     output out=_freq_out measures;
3561  run;

NOTE: There were 100000 observations read from the data set WORK.ONE.
NOTE: The data set WORK._FREQ_OUT has 1 observations and 27 variables.
NOTE: PROCEDURE FREQ used (Total process time):
      real time           37.63 seconds
      cpu time            37.56 seconds

3562
3563  data _null_;
3564     set _freq_out;
3565     AUC=_smdrc_/2 + 0.5;
3566     put "AUC = " AUC " SOMER'S D R|C = " _smdrc_;
3567  run;

AUC = 0.9995285252 SOMER'S D R|C = 0.9990570504
NOTE: There were 1 observations read from the data set WORK._FREQ_OUT.
NOTE: DATA statement used (Total process time):
      real time           0.00 seconds
      cpu time            0.00 seconds

3568
3569  %AUC(one, y, score);

NOTE: The data set WORK.WILCOXONSCORE has 2 observations and 7 variables.
NOTE: There were 100000 observations read from the data set WORK.ONE.
      WHERE y not = .;
NOTE: PROCEDURE NPAR1WAY used (Total process time):
      real time           0.10 seconds
      cpu time            0.09 seconds

AUC=0.9995285252 Gini=0.9990570504
NOTE: There were 2 observations read from the data set WORK.WILCOXONSCORE.
NOTE: The data set WORK.AUC has 1 observations and 2 variables.
NOTE: DATA statement used (Total process time):
      real time           0.01 seconds
      cpu time            0.00 seconds

Posted by Liang Xie at 10/23/2009 02:09:00 PM
Labels: AUC, Macro Programming, predictive modeling, PROC NPAR1WAY
6 comments:
Charlie Shipp Family said...
Looks great .!.
Thanks for your work in sasCommunity.
Charlie Shipp
11:25 PM, February 27, 2010
eskimokitty said...
It is awesome. Thank you so much!!
4:37 PM, June 08, 2011
raspcompote said...
Thank you for this useful post.
3:47 PM, March 13, 2012
Luis Gustavo said...
I'd like to know where did you find this relationship between AUC and Wilcoxon Rank Sum Test. I'm trying to study more about it and it would really help!
Thanks
10:32 AM, February 06, 2014
Liang Xie said...
The relationship is well explained at the Wikipedia page below:
http://en.wikipedia.org/wiki/Mann%E2%80%93Whitney_U
7:37 PM, February 13, 2014
Jon Dickens said...
You are confusing the Gini with the accuracy ratio but you are not alone several people at SAS make the same mistake.
if you are interested in discussing this issue, then contact me via linkedin.
Jon Dickens
3:11 PM, July 20, 2014
Copyright (c) Liang Xie.