Chapter 3 Local Regression - Biostatistics - Departments -...

Date post: 19-Apr-2018
Page 1: Chapter 3 Local Regression - Biostatistics - Departments - …ririzarr/Teaching/754/section-03… ·  · 2001-03-28Chapter 3 Local Regression Local regression is used to model a

Chapter 3

Local Regression

Local regression is used to model a relation between a predictor variable and a response variable. To keep things simple we will consider the fixed design model. We assume a model of the form $Y_i = f(x_i) + \varepsilon_i$, where $f(x)$ is an unknown function and $\varepsilon_i$ is an error term, representing random errors in the observations or variability from sources not included in the $x_i$. We assume the errors $\varepsilon_i$ are IID with mean 0 and finite variance $\mathrm{var}(\varepsilon_i) = \sigma^2$.

We make no global assumptions about the function $f$ but assume that locally it can be well approximated with a member of a simple class of parametric functions, e.g. a constant or a straight line. Taylor's theorem says that any sufficiently smooth function can be approximated locally by a polynomial.

3.1 Taylor's theorem

We are going to show three forms of Taylor's theorem.


• This is the original. Suppose $f$ is a real function on $[a, b]$, $f^{(K-1)}$ is continuous on $[a, b]$, and $f^{(K)}(t)$ is bounded for $t \in (a, b)$. Then for any distinct points $x_0 < x_1$ in $[a, b]$ there exists a point $x$ with $x_0 < x < x_1$ such that

$$f(x_1) = f(x_0) + \sum_{k=1}^{K-1} \frac{f^{(k)}(x_0)}{k!} (x_1 - x_0)^k + \frac{f^{(K)}(x)}{K!} (x_1 - x_0)^K.$$

Notice: if we view $\sum_{k=1}^{K-1} \frac{f^{(k)}(x_0)}{k!} (x_1 - x_0)^k$ as a function of $x_1$, it is a polynomial in the family of polynomials

$$\mathcal{P}_{K+1} = \{ f(x) = a_0 + a_1 x + \dots + a_K x^K, \ (a_0, \dots, a_K)' \in \mathbb{R}^{K+1} \}.$$

• Statisticians sometimes use what is called Young's form of Taylor's theorem: Let $f$ be such that $f^{(K)}(x_0)$ is bounded for $x_0$. Then

$$f(x) = f(x_0) + \sum_{k=1}^{K} \frac{f^{(k)}(x_0)}{k!} (x - x_0)^k + o(|x - x_0|^K), \quad \text{as } |x - x_0| \to 0.$$

Notice: again the polynomial part of the right hand side is in $\mathcal{P}_{K+1}$.

• In some of the asymptotic theory presented in this class we are going to use another refinement of Taylor's theorem called Jackson's Inequality: Suppose $f$ is a real function on $[a, b]$ with $K$ continuous derivatives. Then

$$\min_{g \in \mathcal{P}_k} \sup_{x \in [a, b]} |g(x) - f(x)| \le C \left( \frac{b - a}{2k} \right)^K$$

with $\mathcal{P}_k$ the linear space of polynomials of degree $k$.
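As a small numerical illustration of Young's form, the sketch below evaluates the degree-$K$ Taylor polynomial of the exponential function (chosen here only because all of its derivatives are known in closed form) and checks that the approximation error divided by $|x - x_0|^K$ shrinks as $x$ approaches $x_0$. The function names are hypothetical and not part of the notes.

```python
import math

def taylor_poly_exp(x, x0, K):
    """Degree-K Taylor polynomial of exp around x0, evaluated at x."""
    # Every derivative of exp at x0 equals exp(x0).
    return sum(math.exp(x0) * (x - x0) ** k / math.factorial(k)
               for k in range(K + 1))

def young_ratio(d, x0=0.0, K=2):
    """Approximation error at distance d from x0, divided by d**K."""
    err = abs(math.exp(x0 + d) - taylor_poly_exp(x0 + d, x0, K))
    return err / d ** K

# Young's form says this ratio tends to 0 as d -> 0.
for d in (0.4, 0.2, 0.1):
    print(f"|x - x0| = {d:.1f}   error / |x - x0|^K = {young_ratio(d):.4f}")
```

The ratio decreasing with $d$ is exactly the $o(|x - x_0|^K)$ statement: the remainder vanishes faster than the last term kept in the polynomial.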

3.2 Fitting local polynomials

We will now define the recipe to obtain a loess smooth for a target covariate $x_0$.


The first step in loess is to define a weight function (similar to the kernel $K$ we defined for kernel smoothers). For computational and theoretical purposes we will define this weight function so that only values within a smoothing window $[x_0 - h(x_0), x_0 + h(x_0)]$ will be considered in the estimate of $f(x_0)$.

Notice: In local regression $h(x_0)$ is called the span or bandwidth. It is like the kernel smoother scale parameter $h$. As will be seen a bit later, in local regression, the span may depend on the target covariate $x_0$.

This is easily achieved by considering weight functions that are 0 outside of $[-1, 1]$. For example Tukey's tri-weight function

$$W(u) = \begin{cases} (1 - |u|^3)^3 & |u| \le 1 \\ 0 & |u| > 1. \end{cases}$$

The weight sequence is then easily defined by

$$w_i(x_0) = W\left( \frac{x_i - x_0}{h(x_0)} \right).$$

We define a window by a procedure similar to the $k$ nearest points. We want to include $\alpha \times 100\%$ of the data.
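The weight sequence can be sketched in a few lines of Python. This is only an illustration under the assumption that the weight function is $W(u) = (1 - |u|^3)^3$ on $[-1, 1]$ (often called the tricube function); the names `tricube` and `local_weights` and the data are hypothetical.

```python
def tricube(u):
    """W(u) = (1 - |u|^3)^3 for |u| <= 1, and 0 otherwise."""
    a = abs(u)
    return (1.0 - a ** 3) ** 3 if a <= 1.0 else 0.0

def local_weights(x, x0, h):
    """Weight sequence w_i(x0) = W((x_i - x0) / h)."""
    return [tricube((xi - x0) / h) for xi in x]

x = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0]   # hypothetical covariate values
w = local_weights(x, x0=1.0, h=1.0)  # only points inside [0, 2] can get weight
print(w)
```

Points at or beyond distance $h$ from the target receive weight zero, so they drop out of the local fit entirely.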

Within the smoothing window, $f(x)$ is approximated by a polynomial. For example, a quadratic approximation

$$f(x) \approx \beta_0 + \beta_1 (x - x_0) + \tfrac{1}{2} \beta_2 (x - x_0)^2 \quad \text{for } x \in [x_0 - h(x_0), x_0 + h(x_0)].$$

For smooth functions, Taylor's theorem tells us something about how good an approximation this is.

To obtain the local regression estimate $\hat{f}(x_0)$ we simply find the $\beta = (\beta_0, \beta_1, \beta_2)'$ that minimizes

$$\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^3} \sum_{i=1}^{n} w_i(x_0) \left[ Y_i - \left\{ \beta_0 + \beta_1 (x_i - x_0) + \tfrac{1}{2} \beta_2 (x_i - x_0)^2 \right\} \right]^2$$

and define $\hat{f}(x_0) = \hat{\beta}_0$.

Notice that the kernel smoother is a special case of local regression. Proving this is a homework problem.
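The minimization above is a weighted least squares problem, so it can be solved through the weighted normal equations. The following is a minimal sketch, assuming the tricube weight function and a fixed span `h`; it uses only the standard library (a small Gaussian elimination routine stands in for a linear algebra package), and all function names are hypothetical. It is not the implementation used by `loess()`.

```python
def tricube(u):
    a = abs(u)
    return (1.0 - a ** 3) ** 3 if a <= 1.0 else 0.0

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def local_quadratic(x, y, x0, h):
    """Return beta-hat = (b0, b1, b2); the fitted value at x0 is b0."""
    # Each row: weight, design vector (1, x - x0, (x - x0)^2 / 2), response.
    rows = [(tricube((xi - x0) / h),
             [1.0, xi - x0, 0.5 * (xi - x0) ** 2],
             yi)
            for xi, yi in zip(x, y)]
    # Weighted normal equations: (X' W X) beta = X' W y.
    A = [[sum(w * r[j] * r[k] for w, r, _ in rows) for k in range(3)]
         for j in range(3)]
    b = [sum(w * r[j] * yi for w, r, yi in rows) for j in range(3)]
    return solve(A, b)

# Hypothetical noise-free data from f(x) = 1 + 2x + 3x^2.
x = [i / 10 for i in range(21)]
y = [1 + 2 * xi + 3 * xi ** 2 for xi in x]
beta_hat = local_quadratic(x, y, x0=1.0, h=0.55)
print(beta_hat)
```

Because of the centering at $x_0$ and the factor $\tfrac{1}{2}$, the three coefficients estimate $f(x_0)$, $f'(x_0)$ and $f''(x_0)$; for the noise-free quadratic used here the fit recovers them up to rounding error.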

3.3 Defining the span

In practice, it is quite common to have the $x_i$ irregularly spaced. If we have a fixed span $h$ then some local estimates may be based on many points and others on very few. For this reason we may want to consider a nearest neighbor strategy to define a span for each target covariate $x_0$.

Define $\Delta_i(x_0) = |x_0 - x_i|$, and let $\Delta_{(i)}(x_0)$ be the ordered values of such distances. One of the arguments in the local regression function loess() (available in the modreg library) is the span. A span of $\alpha$ means that for each local fit we want to use $\alpha \times 100\%$ of the data.

Let $q$ be equal to $\alpha n$ truncated to an integer. Then we define the span $h(x_0) = \Delta_{(q)}(x_0)$. As $\alpha$ increases the estimate becomes smoother.
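The nearest neighbor span can be computed directly from this definition. The sketch below is an illustration with hypothetical names and data; it shows that the same $\alpha$ produces a narrow window where the covariates are dense and a wide one where they are sparse.

```python
def nn_span(x, x0, alpha):
    """h(x0) = Delta_(q)(x0), with q = alpha * n truncated to an integer."""
    dists = sorted(abs(x0 - xi) for xi in x)
    q = int(alpha * len(x))
    if q < 1:
        raise ValueError("alpha * n must be at least 1")
    return dists[q - 1]   # q-th smallest distance (1-based order statistic)

x = [0.0, 0.1, 0.2, 0.3, 1.0, 2.0, 4.0, 8.0]   # hypothetical, irregularly spaced
h_dense = nn_span(x, x0=0.1, alpha=0.5)    # target in a dense region
h_sparse = nn_span(x, x0=4.0, alpha=0.5)   # target in a sparse region
print(h_dense, h_sparse)
```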

In Figures 3.1 – 3.3 we see loess smooths for the CD4 cell count data using spans of 0.05, 0.25, 0.75, and 0.95. The smooths presented in the figures fit a line, a parabola, and a constant, respectively.


Figure 3.1: CD4 cell count since seroconversion for HIV infected men. [Four panels, span = 0.05, 0.25, 0.75 and 0.95; x-axis: time since seroconversion; y-axis: CD4. Degree = 1.]


Figure 3.2: CD4 cell count since seroconversion for HIV infected men. [Four panels, span = 0.05, 0.25, 0.75 and 0.95; x-axis: time since seroconversion; y-axis: CD4. Degree = 2, the default.]


Figure 3.3: CD4 cell count since seroconversion for HIV infected men. [Four panels, span = 0.05, 0.25, 0.75 and 0.95; x-axis: time since seroconversion; y-axis: CD4. Degree = 0.]


3.4 Symmetric errors and Robust fitting

If the errors have a symmetric distribution (with long tails), or if there appear to be outliers, we can use robust loess.

We begin with the estimate described above, $\hat{f}(x)$. The residuals $\hat{\varepsilon}_i = y_i - \hat{f}(x_i)$ are computed.

Let

$$B(u; b) = \begin{cases} \{1 - (u/b)^2\}^2 & |u| < b \\ 0 & |u| \ge b \end{cases}$$

be the bisquare weight function. Let $m = \mathrm{median}(|\hat{\varepsilon}_i|)$. The robust weights are

$$r_i = B(\hat{\varepsilon}_i; 6m).$$

The local regression is repeated but with new weights $r_i w_i(x)$. The robust estimate is the result of repeating the procedure several times.
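The robustness weights can be sketched as follows. This is a minimal illustration of one reweighting step, with hypothetical function names and residuals; the handling of the degenerate case $m = 0$ is a choice made for this sketch and is not taken from the notes.

```python
import statistics

def bisquare(u, b):
    """B(u; b) = (1 - (u/b)^2)^2 for |u| < b, and 0 otherwise."""
    return (1.0 - (u / b) ** 2) ** 2 if abs(u) < b else 0.0

def robust_weights(residuals):
    """r_i = B(e_i; 6m), with m the median absolute residual."""
    m = statistics.median([abs(e) for e in residuals])
    if m == 0:
        # Degenerate case: leave every weight at 1 (assumption of this sketch).
        return [1.0] * len(residuals)
    return [bisquare(e, 6.0 * m) for e in residuals]

residuals = [0.1, -0.2, 0.15, -0.1, 5.0]   # hypothetical; the last is an outlier
r = robust_weights(residuals)
print(r)
```

In the next pass each observation would enter the local fit with weight $r_i w_i(x)$, so an observation whose residual exceeds six times the median absolute residual is ignored altogether.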

If we believe the variance $\mathrm{var}(\varepsilon_i) = a_i \sigma^2$ we could also use this double-weight procedure with $r_i = 1/a_i$.

3.4.1 Example

Radiolabeling based gene expression measurements are useful for cancer research because they can be carried out using small amounts of biological materials. Statistical issues are different from fluorescence expression data, because radiolabeling gives absolute intensities that reflect gene expression and there is no internal control.

The data set described here was obtained to identify genes that may be associated with lung cancer. Lung cancer tissue was obtained from various subjects. Normal tissue from the same type of cells was obtained from those same subjects. From each of these tissues 2 samples were prepared using 2 different isotopic batches. Each of these 4 samples was hybridized with a filter spotted with cDNA from many genes in a grid. We refer to these spotted filters as arrays. Each of these arrays was scanned to produce an image file which was then analyzed with specialized software that produced an intensity level for each grid point or spot on the array.

Not all the values read from the arrays are associated with genes. There were 207 spots where no cDNA was spotted. They were left empty. Because there is non-specific binding between the samples and the filters, positive values are obtained from these empty spots. The intensities read from these empty spots provide direct evidence about measurement error associated with the system. Spots associated with genes that are not expressed will also have intensities due to non-specific binding.

Can we rank genes by differential expression between cancer and normal tissues in each subject?

If we denote with $\mathbf{x}$ and $\mathbf{y}$ the log intensities of each spot we could say a gene is differentially expressed if $\mathbf{y} - \mathbf{x}$ is significantly bigger than 0 for the spot related to that gene. One problem with this is that there is a filter effect, so $\mathbf{y}$ can be systematically smaller than $\mathbf{x}$.

A common procedure in microarray data analysis is to simply normalize the filters by subtracting the mean of each filter from each value, i.e. consider $y_i^{(\mathrm{normalized})} = y_i - \bar{y}$ and similarly for the $x$s. The danger with doing this is that many of the genes spotted on the arrays are usually selected because researchers consider them likely to be over-expressed. This means that the mean of the $y$s should be larger than that of the $x$s, and this difference in mean is confounded with the difference in filter effect. By subtracting means we would be subtracting out some of the differential expression between cancer and normal tissues.
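A toy calculation makes the confounding concrete. In the sketch below, with entirely hypothetical numbers, every gene is over-expressed by the same amount and the second filter has a constant offset; after mean-centering each filter, the differences are all zero, so both the filter effect and the real differential expression have been subtracted out.

```python
def center(values):
    """Subtract the filter mean from every value on that filter."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

x = [8.0, 9.0, 10.0, 11.0]          # hypothetical log intensities, normal tissue
true_diff = [1.0, 1.0, 1.0, 1.0]    # every gene over-expressed in cancer
filter_effect = -0.5                # constant offset of the second filter
y = [xi + di + filter_effect for xi, di in zip(x, true_diff)]

raw_diff = [yi - xi for xi, yi in zip(x, y)]
normalized_diff = [yi - xi for xi, yi in zip(center(x), center(y))]
print(raw_diff)          # true difference and filter effect mixed together
print(normalized_diff)   # the shared over-expression has disappeared
```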

In Figure 3.4 we plot the ratio of the intensities vs. the product of the intensities in a log scale, i.e. $y - x$ vs. $x + y$, for the two replicates of subject 1. Notice that the filter effect seems to change with the total intensity of a particular spot. For this reason using medians or trimmed means to remove the filter effect is not a good solution. If we model $x$ and $y$ as random variables then the expected filter effect depends on the total intensity, i.e. $E(y - x \mid x + y)$ is not constant. This arises because specific binding and non-specific binding are two different natural processes. Because we have no way of knowing which points represent non-specific binding and which represent specific binding, we cannot normalize by just estimating two means. Rather, we estimate $E(y - x \mid y + x)$ using loess.

It is critical to use a robust loess, so that large differences do not affect the fit too much. Notice in Figure 3.4 the difference between the robust and non-robust estimates.

Figure 3.4: Total intensity plotted against ratio with a loess prediction using Gaussian and symmetric kernel. [x-axis: X * Y, log scale; y-axis: Y/X; curves labeled gaussian and symmetric.]

3.5 Multivariate Local Regression

Because Taylor's theorem also applies to multidimensional functions, it is relatively straightforward to extend local regression to cases where we have more than one covariate. For example, suppose we have a regression model for two covariates

$$Y_i = f(x_{i1}, x_{i2}) + \varepsilon_i$$

with $f(x, y)$ unknown. Around a target point $\mathbf{x}_0 = (x_{01}, x_{02})$ a local quadratic approximation is now

$$f(x_1, x_2) = \beta_0 + \beta_1 (x_1 - x_{01}) + \beta_2 (x_2 - x_{02}) + \beta_{12} (x_1 - x_{01})(x_2 - x_{02}) + \beta_{11} (x_1 - x_{01})^2 + \beta_{22} (x_2 - x_{02})^2.$$

Once we define a distance between a point $\mathbf{x}$ and $\mathbf{x}_0$, and a span $h$, we can define weights as in the previous sections:

$$w_i(\mathbf{x}_0) = W\left( \frac{\lVert \mathbf{x}_i - \mathbf{x}_0 \rVert}{h} \right).$$

It makes sense to re-scale $x_1$ and $x_2$ so we smooth the same way in both directions. This can be done through the distance function, for example by defining a distance for the space $\mathbb{R}^d$ with

$$\lVert \mathbf{x} \rVert^2 = \sum_{j=1}^{d} (x_j / v_j)^2$$

with $v_j$ a scale for dimension $j$. A natural choice for these $v_j$ is the standard deviation of the covariates.

Notice: We have not talked about k-nearest neighbors. As we will see in Chapter VII the curse of dimensionality will make this hard.
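The scaled distance can be sketched as below, taking the sample standard deviation of each covariate as its scale. The function name, the variable names and the data are hypothetical and chosen only to resemble two covariates measured on very different scales.

```python
import math
import statistics

def scaled_distance(x, x0, scales):
    """||x - x0|| with ||z||^2 = sum_j (z_j / v_j)^2."""
    return math.sqrt(sum(((a - b) / v) ** 2
                         for a, b, v in zip(x, x0, scales)))

# Hypothetical covariates on very different scales.
ages = [5.0, 9.0, 11.0, 15.0]
deficits = [-20.0, -5.0, -12.0, -1.0]
scales = [statistics.stdev(ages), statistics.stdev(deficits)]

d = scaled_distance((9.0, -5.0), (11.0, -12.0), scales)
print(d)
```

A useful property of this choice is that the distance does not change if a covariate is re-expressed in other units (for instance age in months instead of years), since the scale $v_j$ changes by the same factor.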

3.5.1 Example

We look at part of the data obtained from a study by Socket et al. (1987) on the factors affecting patterns of insulin-dependent diabetes mellitus in children. The objective was to investigate the dependence of the level of serum C-peptide on various other factors in order to understand the patterns of residual insulin secretion. The response measurement is the logarithm of C-peptide concentration (pmol/ml) at diagnosis, and the predictors are age and base deficit, a measure of acidity. In Figure 3.5 we show a loess two dimensional smooth. Notice that the effect of age is clearly non-linear.


Figure 3.5: Loess fit for predicting C.Peptide from Base.deficit and Age. [Surface plot; axes: Age, Base Deficit, Predicted.]


Bibliography

[1] Cleveland, R. B., Cleveland, W. S., McRae, J. E., and Terpenning, I. (1990). STL: A seasonal-trend decomposition procedure based on loess. Journal of Official Statistics, 6:3–33.

[2] Cleveland, W. S. and Devlin, S. J. (1988). Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83:596–610.

[3] Cleveland, W. S., Grosse, E., and Shyu, W. M. (1993). Local regression models. In Chambers, J. M. and Hastie, T. J., editors, Statistical Models in S, chapter 8, pages 309–376. Chapman & Hall, New York.

[4] Loader, C. R. (1999). Local Regression and Likelihood. New York: Springer.

[5] Socket, E. B., Daneman, D., Clarson, C., and Ehrich, R. M. (1987). Factors affecting and patterns of residual insulin secretion during the first year of type I (insulin dependent) diabetes mellitus in children. Diabetes 30, 453–459.
