Math 385/585 Applied Regression Analysis Fall 2017
Section 001 1:50 to 2:50 MWF
Instructor: Dr. Chris Edwards Phone: 948-3969 Office: Swart 123
Classroom: Swart 3
Text: Applied Linear Statistical Models, 5th edition, by Kutner, Nachtsheim, Neter, and Li. Earlier editions of the text will likely be adequate, but you will have to allow for different page numbers and homework problem numbers.
Catalog Description: A practical introduction to regression emphasizing applications rather than theory. Simple and multiple regression analysis, basic components of experimental design, and elementary model building. Both conventional and computer techniques will be used in performing the analyses. Prerequisite: Math 201 or Math 301 and Math 256 each with a grade of C or better.
Course Objectives: Linear models in statistics are the backbone of many applications, including regression and ANOVA techniques. Math 385 focuses students on the regression aspect of modeling while Math 386 focuses students on the ANOVA aspect. In Math 385, students will learn how to calculate and interpret regression estimates, including parameter estimates, fits, and residuals, and will be able to perform statistical inference. In addition to simple linear regression, successful students will understand the issues introduced in multiple linear regression, including polynomial regression and non-linear regression. Finally, the student will be able to assess model adequacy and know methods to update and improve the model.
Upon successful completion of the course, students are expected to have the ability to complete the following:
• Identify and understand the components and assumptions for the standard linear regression model
• Use statistical inference on regression model coefficients, including confidence intervals and hypothesis tests
• Construct and interpret the ANOVA table for describing a linear regression model
• Calculate and analyze residuals from a regression model
• Perform diagnostics on a regression model, including assessing lack of fit
• Perform remedial measures such as transformations to improve a regression model
• Understand how linear algebra can be used to describe a multiple regression model
• Perform inference in multiple regression and understand how the increased number of dimensions adds complexity to the interpretations due to collinearity
• Understand how to fit polynomial regression models
• Know how to use indicator variables in regression models
• Be able to build a model from a pool of variables, using techniques such as Best Subsets and Stepwise Regression
• Identify outliers, in both the X and Y dimensions, in multiple regression models
• Understand the basics of non-linear regression, including Logistic Regression
Grading: Final grades are based on these 300 points:
            Topic                      Points    Tentative Date   Chapters
Exam 1      Simple Linear Regression   70 pts.   October 6        1 to 4
Exam 2      Multiple Regression I      70 pts.   November 13      5 to 8
Exam 3      Multiple Regression II     70 pts.   December 15      9 to 11, 13 and 14
Homework    15 Points Each             90 pts.
Homework: I will collect (around) 5 homework problems approximately once every other week. The due dates are listed on the course outline below. I suggest that you work together in small groups on the homework if you like; don’t forget that I am a resource for you to use. Often we will use computer software to perform our analyses; include printouts where appropriate, but please make your papers readable. In other words, I don’t want 25 pages of printout handed in if you can summarize it in two pages.
Office Hours: Office hours are times when I will be in my office to help you. There are many other times when I am in my office. If I am in and not busy, I will be happy to help. My office hours for the Fall 2017 semester are 3:00 to 3:45 Monday and Wednesday, and 9:00 to 11:00 Tuesday.
Philosophy: I strongly believe that you, the student, are the only person who can make yourself learn. Therefore, whenever it is appropriate, I expect you to discover the mathematics we will be exploring. I do not feel that lecturing to you will teach you how to do mathematics. I hope to be your guide while we learn some mathematics, but you will need to do the learning. I expect each of you to come to class prepared to digest the day’s material. That means you will benefit most by having read each section of the text and the Day By Day notes before class.
My personal belief is that one learns best by doing. I believe that you must be truly engaged in the learning process to learn well. Therefore, I do not think that my role as your teacher is to tell you the answers to the problems we will encounter; rather I believe I should point you in a direction that will allow you to see the solutions yourselves. To accomplish that goal, I will find different interactive activities for us to work on. Your job is to use me, your text, your friends, and any other resources to become adept at the material. The Day By Day notes also include Skills that I expect you to attain.
Math 585 Expectations: Expectations for the graduate students are understandably more rigorous than for the undergraduate student. Students taking Math 585 will have an extra theoretical problem added to each homework, to be assigned during the semester. In addition, a final project worth 50 points will be due at the end of the semester. This project will involve a complete analysis of a data set, including model estimation, development, and validation.
Final grades are assigned as follows:
270 pts.           A   (90%)
260 pts.           A-  (87%)
250 pts.           B+  (83%)
240 pts.           B   (80%)
230 pts.           B-  (77%)
220 pts.           C+  (73%)
210 pts.           C   (70%)
200 pts.           C-  (67%)
190 pts.           D+  (63%)
180 pts.           D   (60%)
179 pts. or less   F
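As a quick way to check where a point total falls on this scale, here is a minimal Python sketch. The cutoffs are copied from the table above; the function name `letter_grade`, the sample scores, and the treatment of any total below 180 as an F are assumptions made for illustration. It covers only the 300-point Math 385 scale, since the syllabus does not state a separate scale that includes the 50-point Math 585 project.

```python
# Minimum point totals for each grade, copied from the grade table (300-point scale).
GRADE_CUTOFFS = [
    (270, "A"), (260, "A-"), (250, "B+"), (240, "B"), (230, "B-"),
    (220, "C+"), (210, "C"), (200, "C-"), (190, "D+"), (180, "D"),
]


def letter_grade(points):
    """Return the letter grade for a point total out of 300.

    Assumption: any total below 180 is an F.
    """
    for cutoff, grade in GRADE_CUTOFFS:
        if points >= cutoff:
            return grade
    return "F"


# Hypothetical student: three exams (70 pts. each) and six homeworks (15 pts. each).
exams = [62, 58, 65]
homework = [14, 13, 15, 12, 14, 15]
total = sum(exams) + sum(homework)
print(total, letter_grade(total))
```

With these made-up scores the total is 268 points, which falls in the A- band.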
Date                Day      Topic, Sections
Mon September 4              No Class
Wed September 6     Day 1    Introduction, Least Squares
Fri September 8     Day 2    Models, Sections 1.1 to 1.5
Mon September 11    Day 3    Estimation, Sections 1.6 to 1.8
Wed September 13    Day 4    Inference, Sections 2.1 to 2.3
Fri September 15    Day 5    Interval Estimates, Sections 2.4 to 2.6
Mon September 18    Day 6    ANOVA, Section 2.7 (Homework 1 Due)
Wed September 20    Day 7    GLM, Section 2.8
Fri September 22    Day 8    Residuals I, Sections 3.1 to 3.6
Mon September 25    Day 9    Residuals II, Sections 3.1 to 3.6
Wed September 27    Day 10   Lack of Fit, Section 3.7
Fri September 29    Day 11   Transformations, Sections 3.8 to 3.9
Mon October 2       Day 12   Simultaneous Inference, Sections 4.1 to 4.3 (Homework 2 Due)
Wed October 4       Day 13   Review
Fri October 6       Day 14   Exam 1
Mon October 9       Day 15   Intro to Matrices, Sections 5.1 to 5.7
Wed October 11      Day 16   Regression Matrices, Sections 5.8 to 5.13
Fri October 13      Day 17   Mult. Reg. Models, Sections 6.1 to 6.2
Mon October 16      Day 18   Inference, Sections 6.3 to 6.6
Wed October 18      Day 19   Intervals, Section 6.7
Fri October 20      Day 20   Diagnostics, Section 6.8
Mon October 23      Day 21   Extra SS, Section 7.1 (Homework 3 Due)
Wed October 25      Day 22   GLM Tests, Sections 7.2 to 7.3
Fri October 27      Day 23   Computational Problems and Multicollinearity, Sections 7.5 to 7.6
Mon October 30      Day 24   Polynomial Models, Section 8.1
Wed November 1      Day 25   Interactions I, Section 8.1
Fri November 3      Day 26   Interactions II, Section 8.2
Mon November 6      Day 27   Dummy Variables I, Sections 8.3 to 8.7
Wed November 8      Day 28   Dummy Variables II, Sections 8.3 to 8.7
Fri November 10     Day 29   Review (Homework 4 Due)
Mon November 13     Day 30   Exam 2
Wed November 15     Day 31   Model Building, Sections 9.1 to 9.3
Fri November 17     Day 32   Best Subsets, Sections 9.4 to 9.6
Mon November 20     Day 33   Diagnostics, Sections 10.1 to 10.2
Wed November 22              No Class
Fri November 24              No Class
Mon November 27     Day 34   X Outliers, Section 10.3
Wed November 29     Day 35   Y Outliers, Section 10.4 (Homework 5 Due)
Fri December 1      Day 36   Trees, Section 11.4
Mon December 4      Day 37   Non-Linear Regression I, Sections 13.1 to 13.2
Wed December 6      Day 38   Non-Linear Regression II, Sections 13.3 to 13.4
Fri December 8      Day 39   Logistic Regression, Sections 14.2 to 14.3
Mon December 11     Day 40   Logistic Inference, Section 14.5 (Homework 6 Due)
Wed December 13     Day 41   Review
Fri December 15     Day 42   Exam 3
Homework Assignments: (subject to change if we discover difficulties as we go)
Homework 1 Due September 18
1.19, p. 35
Grade Point Average. The director of admissions of a small college selected 120 students at random from the new freshman class in a study to determine whether a student’s grade point average (GPA) at the end of the freshman year ($Y$) can be predicted from the ACT test score ($X$). The results of the study follow. Assume that first-order regression model (1.1) is appropriate.
a.) Obtain the least squares estimates of $\beta_0$ and $\beta_1$, and state the estimated regression function.
b.) Plot the estimated regression function and the data. Does the estimated regression function appear to fit the data well?
c.) Obtain a point estimate of the mean freshman GPA for students with ACT test score $X = 30$.
d.) What is the point estimate of the change in the mean response when the entrance test score increases by one point?
1.23, p. 36
Refer to Grade Point Average Problem 1.19.
a.) Obtain the residuals $e_i$. Do they sum to zero in accord with (1.17)?
b.) Estimate $\sigma^2$ and $\sigma$. In what units is $\sigma$ expressed?
1.33, p. 37
Refer to the regression model $Y_i = \beta_0 + \varepsilon_i$ in Exercise 1.30. Derive the least squares estimator of $\beta_0$ for this model.
2.4, p. 90
Refer to Grade Point Average Problem 1.19.
$i$:     1      2      3      …   118    119    120
$X_i$:   21     14     28     …   28     16     28
$Y_i$:   3.897  3.885  3.778  …   3.914  1.860  2.948
a.) Obtain a 99 percent confidence interval for $\beta_1$. Interpret your confidence interval. Does it include zero? Why might the director of admissions be interested in whether the confidence interval includes zero?
b.) Test, using the test statistic $t^*$, whether or not a linear association exists between student’s ACT score ($X$) and GPA at the end of the freshman year ($Y$). Use a level of significance of 0.01. State the alternatives, decision rule, and conclusion.
c.) What is the P-value of our test in part (b)? How does it support the conclusion reached in part (b)?
2.55, p. 97
Derive the expression for SSR in (2.51):
$$SSR = b_1^2 \sum (X_i - \bar{X})^2$$
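Before attempting the derivation, it can help to confirm the identity numerically. The sketch below is a sanity check, not a derivation: it fits a simple linear regression by least squares and compares SSR computed from its definition, the sum of squared deviations of the fitted values from the mean response, with the expression in (2.51). The data values and the helper name `fit_line` are hypothetical and chosen only for illustration.

```python
def fit_line(x, y):
    """Least squares estimates (b0, b1) for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Hypothetical toy data (not from the textbook).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(x, y)
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)
fitted = [b0 + b1 * xi for xi in x]

# SSR from its definition: squared deviations of fitted values from the mean response.
ssr_definition = sum((f - y_bar) ** 2 for f in fitted)
# SSR from the expression in (2.51).
ssr_formula = b1 ** 2 * sum((xi - x_bar) ** 2 for xi in x)

print(ssr_definition, ssr_formula)
```

The two printed values agree up to floating-point rounding, which is what the algebraic derivation should establish in general.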
Homework 2 Due October 2
2.23, p. 93
Refer to Grade Point Average Problem 1.19.
a.) Set up the ANOVA table.
b.) What is estimated by MSR in your ANOVA table? By MSE? Under what condition do MSR and MSE estimate the same quantity?
c.) Conduct an $F$ test of whether or not $\beta_1 = 0$. Control the $\alpha$ risk at 0.01. State the alternatives, decision rule, and conclusion.
d.) What is the absolute magnitude of the reduction in the variation of $Y$ when $X$ is introduced into the regression model? What is the relative reduction? What is the name of the latter measure?
e.) Obtain $r$ and attach the appropriate sign.
f.) Which measure, $R^2$ or $r$, has the more clear-cut operational interpretation? Explain.
2.67, p. 99
Refer to Grade Point Average Problem 1.19.
a.) Plot the data, with the least squares regression line for ACT scores between 20 and 30 superimposed.
b.) On the plot from part (a), superimpose a plot of the 95 percent confidence band for the true regression line for ACT scores between 20 and 30. Does the confidence band suggest that the true regression relation has been precisely estimated? Discuss.
3.3, p. 146-147
Refer to Grade Point Average Problem 1.19.
a.) Prepare a box plot for the ACT scores $X_i$. Are there any noteworthy features in this plot?
b.) Prepare a dot plot of the residuals. What information does this plot provide?
c.) Plot the residuals $e_i$ against the fitted values $\hat{Y}_i$. What departures from regression model (2.1) can be studied from this plot? What are your findings?
d.) Prepare a normal probability plot of the residuals. Also obtain the coefficient of correlation between the ordered residuals and their expected values under normality. Test the reasonableness of the normality assumption here using Table B.6 and $\alpha = 0.05$. What do you conclude?
e.) Conduct the Brown-Forsythe test to determine whether or not the error variance varies with the level of $X$. Divide the data into the two groups, $X < 26$ and $X \ge 26$, and use $\alpha = 0.01$. State the decision rule and conclusion. Does your conclusion support your preliminary findings in part (c)?
f.) Information is given below for each student on two variables not included in the model, namely, intelligence test score ($X_2$).
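3.21, p. 151
Derive the result in (3.29):
$$\sum_{j=1}^{c}\sum_{i=1}^{n_j} (Y_{ij} - \hat{Y}_{ij})^2 = \sum_{j=1}^{c}\sum_{i=1}^{n_j} (Y_{ij} - \bar{Y}_j)^2 + \sum_{j=1}^{c}\sum_{i=1}^{n_j} (\bar{Y}_j - \hat{Y}_{ij})^2$$
SSE = SSPE + SSLF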
Homework 3 Due October 23
3.17, p. 150-151
Sales growth. A marketing researcher studied annual sales of a product that had been introduced 10 years ago. The data are as follows, where $X$ is the year (coded) and $Y$ is sales in thousands of units:
$i$:     1    2    3    4    5    6    7    8    9    10
$X_i$:   0    1    2    3    4    5    6    7    8    9
$Y_i$:   98   135  162  178  221  232  283  300  374  395
a.) Prepare a scatter plot of the data. Does a linear relation appear adequate here?
b.) Use the Box-Cox procedure and standardization (3.36) to find an appropriate power transformation of $Y$. Evaluate SSE for $\lambda = 0.3, 0.4, 0.5, 0.6, 0.7$. What transformation of $Y$ is suggested?
c.) Use the transformation $Y' = \sqrt{Y}$ and obtain the estimated linear regression function for the transformed data.
d.) Plot the estimated regression line and the transformed data. Does the regression line appear to be a good fit to the transformed data?
e.) Obtain the residuals and plot them against the fitted values. Also prepare a normal probability plot. What do your plots show?
f.) Express the estimated regression function in the original units.
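The mechanics of the Box-Cox search in part (b) can be sketched in a few lines of Python using only the standard library. This is an illustration under a stated assumption: it takes standardization (3.36) to have the commonly used form $W = K_1(Y^\lambda - 1)$ for $\lambda \ne 0$ and $W = K_2 \ln Y$ for $\lambda = 0$, with $K_2$ the geometric mean of the $Y_i$ and $K_1 = 1/(\lambda K_2^{\lambda - 1})$. Check that form against the text before relying on it. The function names are hypothetical, and in class you would more likely use statistical software.

```python
import math

# Sales growth data from Problem 3.17.
X = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
Y = [98, 135, 162, 178, 221, 232, 283, 300, 374, 395]


def least_squares(x, y):
    """Least squares estimates (b0, b1) for simple linear regression."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def sse(x, y):
    """Error sum of squares for the least squares line of y on x."""
    b0, b1 = least_squares(x, y)
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


def box_cox_standardized(y, lam):
    """Standardized Box-Cox transform (assumed form of (3.36))."""
    k2 = math.exp(sum(math.log(v) for v in y) / len(y))  # geometric mean of y
    if lam == 0:
        return [k2 * math.log(v) for v in y]
    k1 = 1.0 / (lam * k2 ** (lam - 1))
    return [k1 * (v ** lam - 1) for v in y]


# SSE of the regression of the standardized response on X for each candidate lambda.
sse_by_lambda = {
    lam: sse(X, box_cox_standardized(Y, lam)) for lam in (0.3, 0.4, 0.5, 0.6, 0.7)
}
for lam, value in sse_by_lambda.items():
    print(f"lambda = {lam:.1f}  SSE = {value:.1f}")
```

Because the standardization puts every transformed response on a comparable scale, the SSE values can be compared directly across values of $\lambda$, and the smallest one points to the suggested transformation.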
4.21, p. 175
When the predictor variable is so coded that $\bar{X} = 0$ and the normal error regression model (2.1) applies, are $b_0$ and $b_1$ independent? Are the joint confidence intervals for $\beta_0$ and $\beta_1$ then independent?
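5.7, p. 210
Refer to Plastic hardness Problem 1.22. Using matrix methods, find:
1) $\mathbf{Y}'\mathbf{Y}$
2) $\mathbf{X}'\mathbf{X}$
3) $\mathbf{X}'\mathbf{Y}$
5.20, p. 211
Find the matrix $\mathbf{A}$ of the quadratic form: $7Y_1^2 - 8Y_1Y_2 + 8Y_2^2$.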
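5.26, p. 212
Refer to Plastic hardness Problems 1.22 and 5.7.
a) Using matrix methods, obtain the following:
1) $(\mathbf{X}'\mathbf{X})^{-1}$
2) $\mathbf{b}$
3) $\hat{\mathbf{Y}}$
4) $\mathbf{H}$
5) SSE
6) $\mathbf{s}^2\{\mathbf{b}\}$
7) $s^2\{\text{pred}\}$ when $X_h = 30$.
b) From part (a6), obtain the following:
1) $s^2\{b_0\}$
2) $s\{b_0, b_1\}$
3) $s\{b_1\}$
c) Obtain the matrix of the quadratic form for SSE.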
Homework 4 Due November 10
6.10, p. 249
Refer to Grocery retailer Problem 6.9.
a) Fit regression model (6.5) to the data for three predictor variables. State the estimated regression function. How are $b_1$, $b_2$, and $b_3$ interpreted here?
b) Obtain the residuals and prepare a box plot of the residuals. What information does this plot provide?
c) Plot the residuals against $\hat{Y}$, $X_1$, $X_2$, $X_3$, and $X_1X_2$ on separate graphs. Also prepare a normal probability plot. Interpret the plots and summarize your findings.
d) Prepare a time plot of the residuals. Is there any indication that the error terms are correlated? Discuss.
e) Divide the 52 cases into two groups, placing the 26 cases with the smallest fitted values $\hat{Y}_i$ into group 1 and the other 26 cases into group 2. Conduct the Brown-Forsythe test for constancy of the error variance, using $\alpha = 0.01$. State the decision rule and conclusion.
7.4, p. 289
Refer to Grocery retailer Problem 6.9.
a) Obtain the analysis of variance table that decomposes the regression sum of squares into extra sums of squares associated with $X_1$; with $X_3$, given $X_1$; and with $X_2$, given $X_1$ and $X_3$.
b) Test whether $X_2$ can be dropped from the regression model given that $X_1$ and $X_3$ are retained. Use the $F^*$ test statistic and $\alpha = 0.05$. State the alternatives, decision rule, and conclusion. What is the P-value of the test?
c) Does SSR($X_1$) + SSR($X_2|X_1$) equal SSR($X_2$) + SSR($X_1|X_2$) here? Must this always be the case?
7.17, p. 290
Refer to Grocery retailer Problem 6.9.
a) Transform the variables by means of the correlation transformation (7.44) and fit the standardized regression model (7.45).
b) Calculate the coefficients of determination between all pairs of predictor variables. Is it meaningful here to consider the standardized regression coefficients to reflect the effect of one predictor variable when the others are held constant?
c) Transform the estimated standardized regression coefficients by means of (7.53) back to the ones for the fitted regression model in the original variables. Verify that they are the same as the ones obtained in Problem 6.10a.
8.16, p. 337-338
Refer to Grade point average Problem 1.19. An assistant to the director of admissions conjectured that the predictive power of the model could be improved by adding information on whether the student had chosen a major field of concentration at the time the application was submitted. Assume that regression model (8.33) is appropriate, where $X_1$ is entrance test score and $X_2 = 1$ if student had indicated a major field of concentration at the time of application and 0 if the major field was undecided. Data for $X_2$ were as follows:
$i$:        1    2    3    …   118   119   120
$X_{i2}$:   0    1    0    …   1     1     0
a) Explain how each regression coefficient in model (8.33) is interpreted here.
b) Fit the regression model and state the estimated regression function.
c) Test whether the $X_2$ variable can be dropped from the regression model; use $\alpha = 0.01$. State the alternatives, decision rule, and conclusion.
d) Obtain the residuals for regression model (8.33) and plot them against $X_1X_2$. Is there any evidence in your plot that it would be helpful to include an interaction term in the model?
8.34, p. 340
In a regression study, three types of banks were involved, namely, commercial, mutual savings, and savings and loan. Consider the following system of indicator variables for type of bank:
Type of bank        $X_2$   $X_3$
Commercial           1       0
Mutual savings       0       1
Savings and loan    −1      −1
a) Develop a first-order linear regression model for relating last year’s profit or loss ($Y$) to size of bank ($X_1$) and type of bank ($X_2$, $X_3$).
b) State the response functions for the three types of banks.
c) Interpret each of the following quantities:
1) $\beta_2$
2) $\beta_3$
3) $-\beta_2 - \beta_3$
Homework 5 Due November 29
9.15, p. 378-379
Kidney function. Creatinine clearance ($Y$) is an important measure of kidney function, but is difficult to obtain in a clinical office setting because it requires 24-hour urine collection. To determine whether this measure can be predicted from some data that are easily available, a kidney specialist obtained the data that follow for 33 male subjects. The predictor variables are serum creatinine concentration ($X_1$), age ($X_2$), and weight ($X_3$).
a) Prepare separate dot plots for each of the three predictor variables. Are there any noteworthy features in these plots? Comment.
b) Obtain the scatter plot matrix. Also obtain the correlation matrix of the $X$ variables. What do the scatter plots suggest about the nature of the functional relationship between the response variable $Y$ and each predictor variable? Discuss. Are any serious multicollinearity problems evident? Explain.
c) Fit the multiple regression function containing the three predictor variables as first-order terms. Does it appear that all predictor variables should be retained?
9.16, p. 379
Refer to Kidney function Problem 9.15.
a) Using first-order and second-order terms for each of the three predictor variables (centered around the mean) in the pool of potential $X$ variables (including cross products of the first-order terms), find the three best hierarchical subset regression models according to the $C_p$ criterion.
b) Is there much difference in $C_p$ for the three best subset models?
9.19, p. 379
Refer to Kidney function Problem 9.15.
a) Using the same pool of potential $X$ variables as in Problem 9.16a, find the best subset of variables according to forward stepwise regression with $\alpha$ limits of 0.10 and 0.15 to add or delete a variable, respectively.
b) How does the best subset according to forward stepwise regression compare with the best subset according to the $R^2_{a,p}$ criterion obtained in Problem 9.16a?
10.10a, p. 415
Refer to Grocery retailer Problems 6.9 and 6.10.
a) Obtain the studentized deleted residuals and identify any outlying $Y$ observations. Use the Bonferroni outlier test procedure with $\alpha = 0.05$. State the decision rule and conclusion.
Homework 6 Due December 11
10.10b-f, p. 415
Refer to Grocery retailer Problems 6.9 and 6.10.
b) Obtain the diagonal elements of the hat matrix. Identify any outlying $X$ observations using the rule of thumb presented in the chapter.
c) Management wishes to predict the total labor hours required to handle the next shipment containing $X_1 = 300,000$ cases whose indirect costs of the total hours is $X_2 = 7.2$ and $X_3 = 0$ (no holiday in week). Construct a scatter plot of $X_2$ against $X_1$ and determine visually whether this prediction involves an extrapolation beyond the range of the data. Also, use (10.29) to determine whether an extrapolation is involved. Do your conclusions from the two methods agree?
d) Cases 16, 22, 43, and 48 appear to be outlying $X$ observations, and cases 10, 32, 38, and 40 appear to be outlying $Y$ observations. Obtain the DFFITS, DFBETAS, and Cook’s distance values for each of these cases to assess their influence. What do you conclude?
e) Calculate the average absolute percent difference in the fitted values with and without each of these cases. What does this measure indicate about the influence of each of the cases?
f) Calculate Cook’s distance $D_i$ for each case and prepare an index plot. Are any cases influential according to this measure?
11.29, p. 479
Refer to Muscle Mass Problem 1.27.
a) Fit a two-region regression tree. What is the first split point based on age? What is SSE for this two-region tree?
b) Find the second split point given the two-region tree in part (a). What is SSE for the resulting three-region tree?
c) Find the third split point given the three-region tree in part (b). What is SSE for the resulting four-region tree?
d) Prepare a scatter plot of the data with the four-region tree in part (c) superimposed. How well does the tree fit the data? What does the tree suggest about the change in muscle mass with age?
e) Prepare a residual plot of $e_i$ versus $\hat{Y}_i$ for the four-region tree in part (d). State your findings.
13.10, p. 550
Enzyme kinetics. In an enzyme kinetics study the velocity of a reaction ($Y$) is expected to be related to the concentration ($X$) as follows:
$$Y_i = \frac{\gamma_0 X_i}{\gamma_1 + X_i} + \varepsilon_i$$
Eighteen concentrations have been studied and the results follow:
$i$:     1     2     3     …   16     17     18
$X_i$:   1     1.5   2     …   30     35     40
$Y_i$:   2.1   2.5   4.9   …   19.7   21.3   21.6
a) To obtain starting values for $\gamma_0$ and $\gamma_1$, observe that when the error term is ignored we have $Y_i' = \beta_0 + \beta_1 X_i'$, where $Y_i' = 1/Y_i$, $\beta_0 = 1/\gamma_0$, $\beta_1 = \gamma_1/\gamma_0$ and $X_i' = 1/X_i$. Therefore fit a linear regression function to the transformed data to obtain initial estimates $g_0^{(0)} = 1/b_0$ and $g_1^{(0)} = b_1/b_0$.
b) Using the starting values obtained in part (a), find the least squares estimates of the parameters $\gamma_0$ and $\gamma_1$.
13.12, p. 550
Refer to Enzyme kinetics Problem 13.10. Assume that the fitted model is appropriate and that large-sample inferences can be employed here.
1) Obtain an approximate 95 percent confidence interval for $\gamma_0$.
2) Test whether or not $\gamma_1 = 20$; use $\alpha = 0.05$. State the alternatives, decision rule, and conclusion.