Regression Analysis of University Giving Data

by

Yi Jin

A Project Report

Submitted to the Faculty

of

WORCESTER POLYTECHNIC INSTITUTE

in partial fulfillment of the requirements for the

Degree of Master of Science

in

Applied Statistics

by

_______________________________________

December 2006

APPROVED:

____________________________________

Joseph D. Petruccelli, Advisor

____________________________________

Bogdan M. Vernescu, Department Head

To My Parents

Abstract

This project analyzed the giving data of Worcester Polytechnic Institute’s alumni and other constituents (parents, friends, neighbors, etc.) from fiscal year 1983 to 2007 using a two-stage modeling approach. Logistic regression analysis was conducted in the first stage to predict the likelihood of giving for each constituent, followed by linear regression in the second stage to predict the amount of contribution to be expected from each contributor. A Box-Cox transformation was performed in the linear regression phase to ensure that the assumptions underlying the model hold.

Due to the nature of the data, multiple imputation was performed on the missing information to validate generalization of the models to a broader population.

Concepts from the field of direct and database marketing, such as “score” and “lift”, are also introduced in this report.

Acknowledgments

First and foremost, I would like to thank Dr. Joseph D. Petruccelli, my academic and project advisor, for his guidance and patience throughout this project. The illuminating ideas from each meeting, although at certain stages they made the project seem endless, turned out to be the most fun and rewarding part of this exploration.

I also want to thank the people at the Office of Development and Alumni Relations, especially Mr. Dexter A. Bailey Jr. (Vice President for Development and Alumni Relations) and Ms. Lisa Corinne Maizite (Assistant Vice President for Development), for granting me access to the data and for their faith in recruiting me to examine the file.

Thanks also go to Dr. Jason D. Wilbur and Dr. Balgobin Nandram, whose excellent teaching helped me step into a wonderful world of advanced statistics and be better prepared for this capstone project and things beyond.

The internship I had at Epsilon Data Management over the summer, also under the supervision of Dr. Petruccelli, opened my eyes to statistical applications in the business world and in practice led to the Development Office’s sponsorship of this project. I enjoyed the three months spent with the Epsilon team of industrial statisticians and look forward to joining them after graduation.

Contents

Chapter 1 Introduction
1.1 Project Overview
1.1.1 Background
1.1.2 Expectations
1.2 Data Description
1.2.1 Data Overview
1.2.2 Data Dictionary
1.2.3 Quality Concerns
1.2.4 Modeling Data
1.3 Statistical Methodologies/Models
1.4 Software Package

Chapter 2 Data Preparation
2.1 Quality Control and Data Cleaning
2.2 Univariate Summarization
2.3 Modeling Universe Creation
2.3.1 Initial Variable Selection
2.3.2 Response Variable Creation
2.3.3 Variable Recoding and Transformation
2.3.4 Learning/Validation File Split
2.4 Variable Removal

Chapter 3 Model Fitting
3.1 Logistic Regression Model
3.1.1 Initial Logistic Fit
3.1.2 Reality Check
3.1.3 Collinearity
3.1.4 Model Selection and Validation
3.1.5 Odds and Odds Ratio
3.2 Linear Regression Model
3.2.1 Box-Cox Transformation
3.2.2 Model Fitting and Validation
3.2.3 Model Diagnostics
3.3 Multiple Imputation for Missing Values

Chapter 4 Conclusions
4.1 Summary
4.2 Future Work

Appendix A: Table of Major Codes
Appendix B: Logistic Modeling Results
Appendix C: Logistic Modeling Detail
Appendix D: Box-Cox Transformation

Bibliography

List of Figures

Figure 3.1 Side-by-side Boxplot for “Age”
Figure 3.2 Side-by-side Boxplot for “B.S. Recency”
Figure 3.3 Scatterplot of “Age” and “B.S. Recency”
Figure 3.4 Histogram of Transformed Contribution Amount
Figure 3.5 Normal Probability Plot of the Residuals
Figure D.1 Histogram of the Contribution Amount of Contributors
Figure D.2 Plot of Box-Cox Result

List of Tables

Table 1.1 Original Data Extract Key
Table 1.2 Constituent Category and Distribution
Table 1.3 Pre-analysis Grouping of the Original Variable
Table 1.4 Completeness of Information for the Subgroups
Table 2.1 Descriptive Statistics of Contribution Amount
Table 2.2 Detail of "B.S. MAJOR" and "HOME/BUSINESS STATE"
Table 3.1 Initial Logistic Fit Result
Table 3.2 Reunion Indicator Cross-Tab
Table 3.3 Award Counts Cross-Tab
Table 3.4 Performance of Logistic Models
Table 3.5 Performance of Linear Models
Table 3.6 Linear Model Results
Table 3.7 Modeling Results after Multiple Imputation
Table A.1 WPI Major Code
Table B.1 Logistic Fit Results for Model 3
Table B.2 Odds Ratio Estimates for Model 3
Table B.3 Logistic Fit Results for Model 5
Table B.4 Odds Ratio Estimates for Model 5
Table C.1 Class Variable Recoding Detail
Table C.2 Summary of Stepwise Selection
Table C.3 Association of Predicted Probabilities and Observed Responses

Chapter 1 Introduction

1.1 Project Overview

1.1.1 Background

As a private institution, Worcester Polytechnic Institute (WPI) has relied on

the generosity of its alumni, parents and many friends to help provide the

fundamental support that enhances the school’s overall operations since its

very founding in 1865.

The Office of Development and Alumni Relations (Development Office) is the

university administrative unit that has as one of its missions reaching out to

the community to secure financial support for the institution.

Since WPI had its database system computerized in 1983, information has

been collected on the giving history plus other aspects of the university’s

alumni and broader constituents (parents, neighbors, foundations, etc.).

With the accumulation of data and the recognition of statistical analysis

techniques, the Development Office initiated a project to examine the giving

patterns quantitatively in an effort to achieve deeper understanding of the

constituents and better results in its solicitation efforts. The Center for

Industrial Mathematics and Statistics (CIMS) at WPI’s Mathematical Sciences

Department was invited to partner in the project.

1.1.2 Expectations

The records include constituents who have given to the school, whom we will

call contributors, as well as those who have not given, whom we will call

prospects. The two main questions for which the Development Office is

seeking answers are:

1) What are the characteristics that distinguish contributors from prospects? and

2) What are the key factors that drive the contributors’ amount of contribution?

By answering the first question, the office is hoping to obtain a clearer image

of a “typical” contributor and prospect, along with a set of predictors effective

in identifying prospective contributors. The answer to the second question

will lead to more effective allocation of resources and increased magnitude of

support.

1.2 Data Description

The original data file was extracted by WPI’s Computing and Communications Center (CCC) from the “Banner” system and delivered as a Microsoft Excel spreadsheet. After a quick initial browse of the data, meetings were held with Ms. Lisa Maizite of the Development Office, and Ms. Paula Delaney and Mr. Kevin Sheehan of CCC, to discuss quality issues and place further requests. Based on these meetings, an updated version of the data was prepared and used for this project.

1.2.1 Data Overview

The data set consists of 48,604 observations (constituents) and 102 variables.

A data dictionary was also supplied. The file includes all living WPI

constituents and their gifts recorded in the computerized “Banner” system

beginning in 1983. The values for 1983 represent the cumulative giving up to

the end of that fiscal year. After 1983, the yearly gift data and giving club

membership are listed by fiscal years.

1.2.2 Data Dictionary

Explanations for the 102 original variables are presented in Table 1.1.

Table 1.1 Original Data Extract Key

No.     Variable           Description
1       PERSON_NUM         Person number for data extract
2       CATEGORY           See Table 1.2
3       GENDER             M/F/NA
4       BIRTH_YEAR         4-digit year of birth
5       MARRIED            Married/Single/etc.
6       LEGACY             Yes: the person’s admission record indicated a legacy relationship (no details available)
7       GPA[1]             Numbers for those available, spaces for those unavailable, "N/A" for those not applicable
8       BS_YEAR            WPI B.S. year
9       BS_MAJOR           WPI B.S. major
10      MS_YEAR            WPI M.S. year
11      MS_MAJOR           WPI M.S. major
12      PHD_YEAR           WPI Ph.D. year
13      PHD_MAJOR          WPI Ph.D. major
14      CERT_YEAR          WPI certificate year
15      CERT_MAJOR         WPI certificate major
16      HONOR_YEAR         WPI honorary degree year
17      HONOR_DEG          WPI honorary degree
18      NON_WPI_DEG        Value if known (formatted as institution : degree code : year : major)
19      WPI_SPS            Yes: the spouse is a constituent
20      NUM_OF_CHILD       Count of children
21      PREF_CLAS          Preferred class year
22      HAD_SCHOLARSHIP    Yes: had scholarship while at WPI
23      PRES_FND           Yes: a Presidential Founder
24      LIFETIME_PAC       Yes: a lifetime PAC[2] member
25      TRUSTEE            Yes: a trustee of WPI
26      ADM_VOL            Yes: involved in alumni/admissions
27      CLS_AGENT          Yes: involved in a solicitation structure
28      REUNION            Yes: constituent attended reunion(s)
29      ALUM_VOLUNTEER     Count of distinct number of activities (involved in/as department advisory board, gold council, …, 42 possibilities)
30      ALUM_CLUB          Count of distinct number of activities (Tech Old Timers, Polyclub, …)
31      ALUM_LEADER        Count of distinct number of activities (involved in/as class officer, trustee search committee, fund board, …, 30 possibilities)
32      FRAT               Name of fraternity/sorority, blank otherwise
33      SPORT_COUNT        Count of varsity sports listed
34      VARSITY_SPRTS      Concatenated list of varsity sports
35      WPI_AWD            Yes: constituent received this award at WPI
36      TAYLOR_AWD         Yes: constituent received this award at WPI
37      SCHWIEGER_AWD      Yes: constituent received this award at WPI
38      GODDARD_AWD        Yes: constituent received this award at WPI
39      GROGAN_AWD         Yes: constituent received this award at WPI
40      BOYNTON_AWD        Yes: constituent received this award at WPI
41      WASHBURN_AWD       Yes: constituent received this award at WPI
42      RES_CITY           Home city (permanent address)
43      RES_STATE          Home state code
44      RES_ZIP            Home zip code (5 or 9-digit format)
45      RES_COUNTRY        Home country
46      TITLE              Job title if known, blank if unknown
47      WORK_CITY          Work city (business address)
48      WORK_STATE         Work state code
49      WORK_ZIP           Work zip code (5 or 9-digit format)
50      WORK_COUNTRY       Work country
51      STU_CLUB           Count of clubs (Outing Club, Science Fiction, Sport Parachute, …)
52      STU_ARTS           Count of arts and literature organizations (Masque, Pathways, Peddler, …)
53      STU_INTL_CLUB      Count of international clubs (Indian Students Association, …)
54      STU_CLUB_SPORT     Count of club sports (scuba, bowling, autocross, …)
55      STU_PROF_SOC       Count of undergrad professional societies
56      STU_MUSIC          Count of music band: glee club, baker’s dozen …
57      STU_CLS_OFF        Count of class officer (freshman, sophomore, …)
58      STU_SCH_INVOLVE    Count of school involvement (student activities board, resident advisor)
59      STU_SPEC_PROG      Count of special programs (undergraduate employment program, exchange, …)
60      STU_INTRAMURAL     Count of intramural sports (basketball, softball, table tennis, …)
61      STU_HONR_SOC       Count of honor societies (Pershing Rifles, Sigma Mu Epsilon, Skull, …)
62      STU_PROJECT_CTR    Project center info (from the student courses)
63      ALU_PROJECT_CTR    Project center info (from alumni activities)
64      GRAD_DISTINCTION   H: graduated with high distinction, D: graduated with distinction, and blank
65      ALUM_CONTACTS      Contacts made as an alumnus (phone calls, personal visits, …)
66-90   FISCAL_YEAR_X      Total gift and memo for the specific fiscal year[3] (X: 1983~2007)
91-102  GIFT_CLUB_X        Gift club designation for the specific fiscal year (X: 1996~2007)

[1] WPI undergraduates do not have a "true" GPA. Standard "numerical

equivalent for passed courses" approved by the faculty was used.

[2] PAC stands for President’s Advisory Council.

[3] Note the 1983 number is a cumulative amount given up through 1983 as

the values were loaded into "Banner".

Each constituent is assigned a best (primary) category. The supplied dictionary lists 37 distinct categories, but only 18 of them are present in the data. The four-letter codes of these 18 categories and their definitions are given in Table 1.2, along with their frequencies and percentages, in descending order of size.

Table 1.2 Constituent Category and Distribution

Code Category Count Percentage

ALUM Alumna/Alumnus 24,027 49.43%

PRNT Parent 10,601 21.81%

GRAD Graduate Alumnus 4,782 9.84%

FRND Friend 3,435 7.07%

WIDO Widow/Widower 1,867 3.84%

CERT WPI Certificate Recipients 1,207 2.48%

GPAR Grandparent 770 1.58%

ALND Non-degreed Alumna/us 646 1.33%

FACT Faculty/Staff 445 0.92%

NEIG Neighbor 319 0.66%

MPAR Mass Academy Parent 311 0.64%

HOND Honorary Degree Recipient 85 0.17%

STDT Student 44 0.09%

HONA Honorary Alumna/us 32 0.07%

TRUS Trustee 19 0.04%

OTHR Other Organizations 12 0.02%

FFOU Family Foundation 1 0.00%

TRNS Pre-Banner Class Transfer 1 0.00%

1.2.3 Quality Concerns

One concern regarding data quality comes from the high percentage of missing (blank) values across the file. As an example, the job title variable has 68.7% null cells. Most of these cases arise because these types of information were collected on a self-report basis: the constituents are under no obligation to respond to such inquiries. Another issue arises from the confounding of responses, primarily seen in those variables with values extracted from the database as either yes or null (blank). While yes assures us of an affirmative response, blank in many cases does not necessarily mean no: it simply means no answer was given.

These problems, along with the messy (i.e. practically impossible to categorize) values in variables like “Job Title” and “Non-WPI Degree”, posed a challenge for variable recoding.

1.2.4 Modeling Data

For analysis and modeling purposes, the data were divided into two groups:

current plus former WPI students, and all others. Furthermore, in the

“student” group, undergraduate, graduate and non-degree alumni (of

categories ALUM, GRAD and ALND) form an especially desirable subgroup

characterized by the most complete information across variables, which leads

to the expectation of highest predictive power. The remaining categories in

this group, certificate recipients and current students, appear to be less

attractive in terms of modeling since they lack certain information due to the

nature of the categories. Table 1.3 shows a pre-analysis grouping of the 102

original variables based on the type of information they contain. Table 1.4

then displays the completeness of information for the subgroups.

Table 1.3 Pre-analysis Grouping of the Original Variable

Variable Group              Count   Original Variables
Identifier                  1       PERSON_NUM
Biographical Information    16      CATEGORY, GENDER, BIRTH_YEAR, MARRIED, NUM_OF_CHILD, RES_CITY, RES_STATE, RES_ZIP, RES_COUNTRY, WORK_CITY, WORK_STATE, WORK_ZIP, WORK_COUNTRY, TITLE, LEGACY, WPI_SPS, TRUSTEE
Education History           24      GPA, GRAD_DISTINCTION, PREF_CLAS, BS_YEAR, BS_MAJOR, MS_YEAR, MS_MAJOR, PHD_YEAR, PHD_MAJOR, CERT_YEAR, CERT_MAJOR, HONOR_YEAR, HONOR_DEG, NON_WPI_DEG, WPI_AWD, TAYLOR_AWD, SCHWIEGER_AWD, GODDARD_AWD, GROGAN_AWD, BOYNTON_AWD, WASHBURN_AWD, HAD_SCHOLARSHIP, STU_PROJECT_CTR, ALU_PROJECT_CTR
Extracurricular Activities  17      ADM_VOL, CLS_AGENT, FRAT, SPORT_COUNT, VARSITY_SPRTS, STU_CLUB, STU_ARTS, STU_INTL_CLUB, STU_CLUB_SPORT, STU_PROF_SOC, STU_MUSIC, STU_CLS_OFF, STU_SCH_INVOLVE, STU_SPEC_PROG, STU_INTRAMURAL, STU_HONR_SOC
Alumni Activities           4       REUNION, ALUM_VOLUNTEER, ALUM_CLUB, ALUM_LEADER
Giving Records              41      ALUM_CONTACTS, FISCAL_YEAR_X, GIFT_CLUB_X, PRES_FND, LIFETIME_PAC

Table 1.4 Completeness of Information for the Subgroups

Variable Group              "Student":            "Student":     "Non-student"
                            ALUM + GRAD + ALND    CERT + STDT
Identifier                  complete              complete       complete
Biographical Information    complete              complete       complete
Education History           complete              incomplete     none
Extracurricular Activities  complete              incomplete     none
Alumni Activities           complete              incomplete     none
Giving Records              complete              complete       complete

Overall, 29,455 (60.6%) of the constituents fall in the “best” subgroup of ALUM + GRAD + ALND, which makes a sufficiently large sample for analysis. For this reason, we decided to start the analysis with these three categories combined in the hope of getting the “best possible” model.

1.3 Statistical Methodologies/Models

A two-stage modeling approach was used in the analysis. For the first stage,

the goal was to estimate the probability (likelihood) that a constituent is a

contributor, and to assess the ability of this estimation in predicting

constituents as either contributors or prospects. A logistic regression

approach was chosen to model the relation between predictor variables and

giving behavior. The goal of the second stage was to locate factors that have

a statistically significant impact on the amount of contribution for the

contributors. Note the response here has values on a continuous scale and

thus a linear regression model was a natural choice.

After the models were built on the “best” subgroup, multiple imputation was

done on the entire “student” group in an effort to deal with the missing

values and also evaluate the stability of the imputation.

1.4 Software Package

The statistical computing package SAS® was used throughout this project. The choice was partially due to the extensive availability of documentation and technical support for the software, in addition to its analysis capability and programming flexibility. The version of the package used was 9.1 TM Level 1M2 on the Microsoft Windows XP Professional platform.

Chapter 2 Data Preparation

2.1 Quality Control and Data Cleaning

Quality control of the data started with detecting duplicated observations on the identifier variable, followed by de-duplication where necessary. Extreme values and ranges of individual variables were examined to identify problematic cells. Natural associations among variables (columns) within an individual observation (row) were then used as a reference for data cleaning [10]. A good example is the constituent with identifier 762250336. The value under “B.S. Year” appears to be 19 (which translates to 1919). But after printing out the entire row, we see the person was born in 1971 and obtained her bachelor’s degree from MIT, so the cell should be empty rather than 19. For the same person, however, the value 95 under “M.S. Year” (which will be converted into 1995 later) can now be trusted with more confidence.

Variables in the file with dates containing years were presented in both

two-digit and four-digit formats. For the purpose of new variable creation

and recoding at a later phase, two-digit years were converted into four digits

by identifying a cut-off value based on the variable’s distribution.
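The conversion can be sketched as follows. This is a minimal illustration rather than the procedure actually run in SAS: the function name `expand_two_digit_year` and the cut-off value of 10 are assumptions made for the example, since the cut-offs chosen for each variable are not listed here.

```python
def expand_two_digit_year(year: int, cutoff: int, base_century: int = 1900) -> int:
    """Expand a two-digit year to four digits; four-digit years pass through.

    Two-digit values above the (hypothetical) cut-off are placed in the base
    century, values at or below it in the following century.
    """
    if year >= 100:  # already a four-digit year
        return year
    if year > cutoff:
        return base_century + year
    return base_century + 100 + year


# With an assumed cut-off of 10: 11..99 -> 1911..1999 and 00..10 -> 2000..2010.
for raw in (95, 19, 3, 1971):
    print(raw, "->", expand_two_digit_year(raw, cutoff=10))
```

In practice the cut-off would be read off each variable's distribution, as described above, so that no value lands in an implausible century.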

2.2 Univariate Summarization

Univariate statistical analysis was conducted on each variable. Histograms and boxplots were constructed to display the distributions (location, spread, symmetry, etc.) of numeric variables and to perform a quick graphical check for outliers. Then descriptive statistics were calculated and examined. For categorical variables, frequency tables were obtained and checked.

Out of the 48,604 constituents, 24,204 (49.8%) turned out to be contributors.

Table 2.1 gives a basic summary of the contribution amount for the whole

population as well as the contributor group.

Table 2.1 Descriptive Statistics of Contribution Amount

                        All Constituents    Contributor Group
Count                   48,604              24,204
Minimum                 $0.00               $0.02
Maximum                 $5,979,538.69       $5,979,538.69
Mean                    $2,044.85           $4,106.25
Standard Deviation      $44,824.35          $63,453.40
25th Percentile         $0.00               $50.00
Median                  $0.00               $170.00
75th Percentile         $170.00             $695.00
Inter-Quartile Range    $170.00             $645.00
Total                   $99,387,742.10      $99,387,742.10

Note that, due to the skewness of the contribution amount’s distribution, median and inter-quartile range (IQR) are more appropriate than mean and standard deviation here as measures of location and spread for the variable.
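The effect of skewness on these summaries can be illustrated with a small sketch. The gift amounts below are made up for illustration and are not taken from the WPI file; the helper name `location_and_spread` is likewise hypothetical.

```python
import statistics


def location_and_spread(amounts):
    """Return mean/stdev alongside median/IQR for a list of amounts."""
    q1, median, q3 = statistics.quantiles(amounts, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(amounts),
        "stdev": statistics.stdev(amounts),
        "median": median,
        "iqr": q3 - q1,
    }


# Made-up, right-skewed gift totals: one very large gift dominates.
gifts = [25, 50, 50, 100, 170, 250, 500, 695, 1000, 250000]
summary = location_and_spread(gifts)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")
```

A single large gift pulls the mean and standard deviation far above the typical gift, while the median and IQR barely move.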

2.3 Modeling Universe Creation

2.3.1 Initial Variable Selection

Some of the 102 original variables were not included in the modeling universe for various reasons. The 12 gift club designation variables for fiscal years 1996 to 2007 were dropped because the club entry standards changed over the years. “Preferred Class Year” was also excluded because of its large overlap with “B.S. Year”. The latter variable was retained because it was believed to be more accurate and objective, since the preferred class year was picked by constituents themselves and thus bears a fair amount of subjectivity. For the geographical location variables, “State” was chosen for its advantage of having standard abbreviations and fewer categories (which means easier cleaning and recoding and a much more consistent format compared with the “City” and “Zip Code” variables). Note, though, that these dropped variables were still valuable references when new problematic cells were uncovered [10].

2.3.2 Response Variable Creation

The 25 variables carrying information of constituents’ yearly contribution

amount were used to create the response variables for the two models.

Summing values across rows gave the total amount contributed by each

constituent and in turn led to the definition of contributor as those with

positive values. The remaining constituents were then designated to the

prospect group.
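A minimal sketch of this step, assuming each constituent's 25 yearly amounts arrive as a list in which a blank cell is represented by `None` or an empty string (the function name `build_responses` and that representation are assumptions for the example):

```python
def build_responses(yearly_gifts):
    """Return (total_amount, is_contributor) for one constituent.

    yearly_gifts: iterable of yearly gift amounts; None or "" marks a blank cell.
    """
    total = sum((float(g) for g in yearly_gifts if g not in (None, "")), 0.0)
    # A contributor is any constituent with a positive total; others are prospects.
    return total, int(total > 0)


print(build_responses([None, "", 50.0, 120.5]))  # a contributor
print(build_responses([None, None]))             # a prospect
```

The binary flag serves as the response for the logistic model, and the total amount (for contributors only) as the response for the linear model.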

2.3.3 Variable Recoding and Transformation

Many variables in the data take values of either “yes” or blank. For the

purpose of maximizing the final model’s predictive power in light of the

limited number of candidate predictors available, we decided to keep as

many variables as possible in this stage and thus coded them to indicators

with “yes” as one and blank as zero. Care had to be taken when making

interpretations about these indicators as zero here means no information

available rather than simply “no”.

The recoding produced 59 variables, all given the suffix “_MOD” to distinguish them from their original versions. They include 28 binary indicators and 7 class variables (CATEGORY_MOD, GENDER_MOD, MARRIAGE_MOD, BSMAJOR_MOD, HOME_MOD, BIZSTATE_MOD and DISTINCTION_MOD). Table 2.2 gives the categorization detail for “B.S. Major” as well as the two geographical region variables (which shared the same recoding scheme).

Table 2.2 Detail of "B.S. MAJOR" and "HOME/BUSINESS STATE"

Variable            Class             Contents
Home & Biz State    Mass              MA
                    Rest_NewEng       CT, NH, RI, ME, VT
                    Northeast         NY, NJ, PA, DE, MD, WV, DC
                    West              CA, AK, AZ, CO, HI, ID, MT, NV, NM, OR, UT, WA, WY
                    South             FL, AL, AR, GA, KY, LA, MS, NC, OK, SC, TN, TX, VA
                    Midwest           IL, IN, IA, KS, MI, MN, MO, NE, ND, OH, SD, WI
                    Other             AE, AP, GU, PR, VI
                    NA                QC, ZZ, ON, M, other, blank
B.S. Major [1]      MechanicalEngr    ME, MEA, MEB, MEN, MFE, MTE, IE, AE
                    Elec./Comp.Engr   EE, ECE, EEB, EEC, EEN
                    CivilEngr         CE, CEI
                    ComputerSci       CS, CA, CSB, CSC, CSM
                    ChemicalEngr      CM, CMB, CMN
                    Chemistry         CH, CHI
                    Physics           PH, PHE
                    Math              MA, MAC
                    BizEconomcs       MGE, BU, MG, MGC, MGS, MGT, MIS, EC, ET
                    Bio./LifeSci      BBT, BBI, BC, BE, BIO, BM, BS, BB, LS, LSI
                    HumanitiesArts    HT, HTE, HTH, HU, SS, SST, ST, TC, TW, IN
                    OtherEngr         EP, EV, PL, FPE, NE
                    Other             GS, ID, ND, SD
                    NA                blank

[1] See "Appendix A" for the major codes.

Two original variables were recoded to enhance their interpretability: values

of “B.S. Year” were subtracted from 2006 to produce a new “B.S. Recency”

variable (which turned out later to have very strong predictive power for both

models) and “Year of Birth” was translated into “Age” in a similar way.
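The recoding amounts to a subtraction from a reference year. The sketch below uses 2006 as stated for “B.S. Recency”; using the same reference year for “Age” is an assumption based on the phrase “in a similar way”, and the helper name `years_since` is hypothetical.

```python
REFERENCE_YEAR = 2006  # reference year given in the text for "B.S. Recency"


def years_since(year, reference_year=REFERENCE_YEAR):
    """Return reference_year - year, or None when the year is missing."""
    if year is None:
        return None
    return reference_year - year


print(years_since(1995))  # B.S. Recency for a 1995 graduate
print(years_since(1971))  # Age for someone born in 1971 (same reference year assumed)
print(years_since(None))  # missing values stay missing
```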

Some new variables were created by consolidating original variables that

deliver the same type of information and whose values are fairly sparse.

Two approaches were used:

1) Taking maximum of indicators.

“M.S. Major” and “M.S. Year” are two original variables with information

about the field of the master’s program and the year the degree was awarded.

They were first coded to binary indicators of value zero (if the original cell

was blank) and one (if the original cell was not blank). These two new

variables indicate the availability of such information in the data set.

Secondly, a new binary variable indicating enrollment in WPI’s master’s

program at some time point was created by taking maximum of the two

aforementioned indicators. As a result, as long as one of the two original

columns had something recorded, “MASTER_MOD” will be one. Only if

both original columns were blank will it be zero. New variables created in

the same fashion include: “PHD_MOD”, “CERT_MOD”, “HONOR_MOD”

and “VIP_MOD” (based on “PRES_FND”, “LIFETIME_PAC” and

“TRUSTEE”), “INTL_MOD” (based on “RES_COUNTRY” and

“WORK_COUNTRY”), “PROJECT_MOD” (based on “STU_PROJECT_CTR”

and “ALU_PROJECT_CTR”).

2) Summing up indicators/counts.

An example is the new variable “AWARD_MOD”, which counts how many of a certain set of awards the constituent received. The file comes with seven original variables corresponding to various types of awards (“WPI_AWD”, “TAYLOR_AWD”, “SCHWIEGER_AWD”, “GODDARD_AWD”, “GROGAN_AWD”, “WASHBURN_AWD” and “BOYNTON_AWD”) with values of either “yes” or blank. As before, “yes” became one and blank became zero. “AWARD_MOD” was then constructed by summing the seven binary indicators. The new variable “ALUM_MOD” was created in the same way and counts the number of a set of alumni activities the constituent participated in.
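Both consolidation approaches can be sketched in a few lines. The helper names (`to_indicator`, `max_indicator`, `count_indicator`) and the dictionary representation of a record are assumptions for illustration; the original work was done in SAS.

```python
def to_indicator(cell):
    """1 if the cell holds anything (e.g. "yes", a year, a major code), 0 if blank."""
    return int(cell is not None and str(cell).strip() != "")


def max_indicator(record, columns):
    """Approach 1: 1 if any of the columns is non-blank, else 0."""
    return max(to_indicator(record.get(c)) for c in columns)


def count_indicator(record, columns):
    """Approach 2: number of non-blank columns."""
    return sum(to_indicator(record.get(c)) for c in columns)


AWARD_COLUMNS = ["WPI_AWD", "TAYLOR_AWD", "SCHWIEGER_AWD", "GODDARD_AWD",
                 "GROGAN_AWD", "BOYNTON_AWD", "WASHBURN_AWD"]

# Made-up record: M.S. year recorded but major blank, two awards received.
record = {"MS_YEAR": "95", "MS_MAJOR": "", "WPI_AWD": "yes", "TAYLOR_AWD": "yes"}
master_mod = max_indicator(record, ["MS_MAJOR", "MS_YEAR"])
award_mod = count_indicator(record, AWARD_COLUMNS)
print(master_mod, award_mod)
```

As with the simple indicators, a zero here means only that nothing was recorded, not that the answer is known to be “no”.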

Two variables, “Job Title” and “Non-WPI Degree” (the “messy” ones

mentioned in section 1.2.3), were infeasible to categorize. In such cases,

indicators of whether or not the constituent reported this information were

created instead.

Transformations were done on some variables. The variable “Number of

Children”, highly skewed right with maximum value 12, has 4 as its 99th

percentile. So it was regrouped into five categories of 0, 1, 2, 3, and 4 or

more children.

After the recoding and transformation, “GPA”, “Age” and “B.S. Recency”

were the three variables left with large numbers of missing values. The

14,047 observations having non-missing values for all these three predictors

were then flagged as the “complete” set out of the “best” subgroup of 29,455

alumni, graduate alumni and non-degree alumni and became the base for

initial modeling.

2.3.4 Learning/Validation File Split

The modeling set was split into approximately equal-sized learning and validation files. In order to make the two sets more comparable, the split was conducted using stratified random sampling [6] with 20 equally-sized strata based on contribution amount. The choice of 20, rather than the more commonly used 10 [16], was due to the fact that approximately half of the constituents made no contributions. Comparison of univariate statistics of the two files assured us that they were similar with respect to the number of contributors and amount of contribution.
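One way such a split could be carried out is sketched below. It is an illustration under stated assumptions, not the SAS procedure used in the project: strata are formed by sorting on contribution amount and cutting into 20 equal-sized blocks, each block is shuffled and halved, and the function name, seed and toy data are all hypothetical.

```python
import random


def stratified_split(records, amount_of, n_strata=20, seed=0):
    """Split records into learning/validation halves within amount-ordered strata."""
    rng = random.Random(seed)
    ordered = sorted(records, key=amount_of)
    n = len(ordered)
    learning, validation = [], []
    for s in range(n_strata):
        # Equal-sized consecutive blocks of the amount-ordered records.
        stratum = ordered[s * n // n_strata:(s + 1) * n // n_strata]
        rng.shuffle(stratum)
        half = len(stratum) // 2
        learning.extend(stratum[:half])
        validation.extend(stratum[half:])
    return learning, validation


# Made-up amounts: half of the records are zero, mimicking the share of prospects.
amounts = [0.0] * 500 + [float(10 * i) for i in range(1, 501)]
learn, valid = stratified_split(amounts, amount_of=lambda a: a)
print(len(learn), len(valid))
```

With about half of the amounts equal to zero, 10 strata would leave only 5 strata to separate the contributors by size; 20 strata leave about 10.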

2.4 Variable Removal

The file splitting and subsetting up to this point rendered three variables no

longer suitable for modeling. Indicators for legacy and honorary degree

holder both became constants (all zero) and VIP Indicator had only one

non-zero cell.

Chapter 3 Model Fitting

3.1 Logistic Regression Model

A logistic model is useful for modeling binary responses as a function of a set

of predictors, and the fitted response can be used to estimate the probability

(likelihood) of a certain event of interest [2]. For a logistic model with n

predictors, the model equation is:

$$\log\left(\frac{P}{1-P}\right) = \beta_0 + \sum_{i=1}^{n} \beta_i X_i \qquad (3.1)$$

in which $P$ is the probability of the event of interest, $\beta_0$ is the intercept and $\beta_i$ is the coefficient for the $i$th predictor $X_i$ ($i = 1, \dots, n$). Here, we can utilize this model to predict the tendency of giving for each constituent.
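Solving equation (3.1) for P gives P = 1 / (1 + exp(-(β0 + Σ βi Xi))), which is how a fitted model turns a constituent's predictor values into an estimated probability of giving. A minimal sketch follows; the coefficient values are made up for illustration and are not estimates from this project.

```python
import math


def giving_probability(intercept, coefficients, predictors):
    """Invert the logit in equation (3.1) to obtain P."""
    logit = intercept + sum(b * x for b, x in zip(coefficients, predictors))
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical model: intercept -1.0, coefficients 0.05 and 0.8,
# evaluated for a constituent with predictor values 20 and 1.
p = giving_probability(-1.0, [0.05, 0.8], [20, 1])
print(round(p, 4))
```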

3.1.1 Initial Logistic Fit

Using the logistic procedure from SAS [3] with stepwise selection and

variable entry and stay significance parameters both set at 0.05, an initial

model was built on the complete records of the “best” subgroup

(ALUM+GRAD+ALND). The resulting significant predictors, their p-values

and the estimated signs for numeric predictors are shown in Table 3.1. The

set is presented in descending order of statistical significance.

Table 3.1 Initial Logistic Fit Result

Predictor Estimated Sign p-value

Years since B.S. awarded + <.0001

Biz geographical region Class variable <.0001

Alumni activities count + <.0001

Number of children + <.0001

School activities indicator + <.0001

Home geographical region Class variable <.0001

GPA + <.0001

Reunion indicator + <.0001

Gender Class variable <.0001

Indicator, non-WPI degree reported + <.0001

WPI spouse indicator + 0.0011

Honor society count + 0.0041

International club activities count - 0.0044

Professional society count + 0.0136

Area of B.S. major Class variable 0.0145

Awards Count - 0.0316

Age - 0.0327

3.1.2 Reality Check

Some of the signs for the parameter estimates in Table 3.1 seem

counterintuitive. For example, the model has a negative sign for “Awards

Count”, which counts the types of award the constituent has received. One

would think that award recipients should be more, not less, likely to give back

to the school. To investigate the consistency of the estimated coefficient

signs with the data, we performed “reality checks” by looking more closely at

the data. For numeric variables like indicators and counts whose values are

on a discrete scale, a simple cross tabulation will help reveal what the

estimated sign should be. This is illustrated in Tables 3.2 and 3.3. We can

easily tell that both variables should end up with positive signs.

Table 3.2 Reunion Indicator Cross-Tab

Reunion      Contributor
Indicator    No               Yes
0            7618 (58.48%)    5409 (41.52%)
1             249 (24.41%)     771 (75.59%)

Table 3.3 Award Counts Cross-Tab

Award        Contributor
Count        No               Yes
0            7846 (56.09%)    6141 (43.91%)
1              21 (35.59%)      38 (64.41%)
2               0 (0.00%)        1 (100.00%)
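The reality check on Tables 3.2 and 3.3 can be reproduced with a few lines of Python. The sketch below is only an illustration: the counts are copied from the two tables, while the function names and the rule of comparing the lowest and highest levels are our own assumptions and not part of the original SAS analysis.

```python
# Counts copied from Tables 3.2 and 3.3: level -> (non-contributors, contributors)
reunion_counts = {0: (7618, 5409), 1: (249, 771)}
award_counts = {0: (7846, 6141), 1: (21, 38), 2: (0, 1)}


def giving_rates(crosstab):
    """Return the share of contributors at each level of a discrete predictor."""
    return {level: yes / (no + yes) for level, (no, yes) in crosstab.items()}


def marginal_sign(crosstab):
    """Compare the giving rate at the lowest and highest levels (illustrative rule)."""
    rates = giving_rates(crosstab)
    lowest, highest = min(rates), max(rates)
    return "+" if rates[highest] > rates[lowest] else "-"


reunion_rates = giving_rates(reunion_counts)
award_rates = giving_rates(award_counts)

for name, table in (("Reunion indicator", reunion_counts), ("Award count", award_counts)):
    rates = giving_rates(table)
    summary = ", ".join(f"{lvl}: {rate:.2%}" for lvl, rate in sorted(rates.items()))
    print(f"{name}: {summary} -> marginal sign {marginal_sign(table)}")
```

The giving rate rises with the level of each variable, which is why a positive marginal sign is expected for both.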

For numeric variables with values on a continuous scale, side-by-side box plots grouped by contributor/prospect accomplish the same task. Two examples are given below in Figures 3.1 and 3.2 for the “Age” and “B.S. Recency” variables.

Figure 3.1 Side-by-side Boxplot for "Age"


Figure 3.2 Side-by-side Boxplot for “B.S. Recency”

The two plots reveal that constituents who graduated (with a B.S. degree) earlier, and who are therefore older, are more likely to give. We therefore conclude that the estimated sign for the age variable in the initial fit does not correspond to the marginal relationship between that variable and the response. A likely cause is collinearity: two highly correlated variables bring in redundant information, and the compensation of each coefficient for the presence of the other variable might reverse the sign of a coefficient estimate [1].

3.1.3 Collinearity

Scatterplot matrices and correlation matrices constructed for the identified set

of predictors were helpful in graphically displaying the existence of pairwise

collinearity [1]. A simple scatterplot of “Age” and “B.S. Recency” along with

a fitted linear regression line is shown in Figure 3.3.


Figure 3.3 Scatterplot of "Age" and "B.S. Recency"

At first glance the scatter of points might mask the strong linear association. The Pearson correlation, however, is 0.9583, which is very high: the great majority of data points lie close to the fitted line, which corresponds to the following equation:

B.S. Recency = -20.3408 + 0.9288 * Age

The estimated intercept and slope show the interesting fact that “B.S. Recency” is essentially “Age” shifted by about 20 years.
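As an illustration of how the correlation and the fitted line are obtained, the following sketch computes the Pearson correlation and the least-squares line from first principles. The ten (age, recency) pairs are hypothetical values invented for the example, not the WPI data; only the coefficients in report_line are taken from the equation above.

```python
import math


def pearson_and_line(x, y):
    """Return (r, intercept, slope) for the least-squares line y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return r, intercept, slope


def report_line(age_years):
    """Fitted line quoted in the report: B.S. Recency = -20.3408 + 0.9288 * Age."""
    return -20.3408 + 0.9288 * age_years


# Hypothetical constituents (NOT the WPI data): most graduated at about age 22,
# and two earned the B.S. later in life, which pulls their points off the line.
age = [25, 31, 38, 44, 52, 60, 67, 73, 45, 58]
bs_recency = [3, 9, 16, 22, 30, 38, 45, 51, 12, 20]

r, intercept, slope = pearson_and_line(age, bs_recency)
print(f"r = {r:.4f}; B.S. Recency = {intercept:.4f} + {slope:.4f} * Age")
print(f"Report's line at age 22: {report_line(22):.4f} years since B.S.")
```

Evaluating the report's line at age 22 gives a recency close to zero, which agrees with the reading that “B.S. Recency” is roughly “Age” minus the typical age at graduation.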

3.1.4 Model Selection and Validation

The reality check and collinearity detection suggested trying models with and without the “Age” and “Award Counts” variables. Also, “Home Region” and “Working Region” both stayed in the initial model, but the values of these two variables could coincide for many observations. A quick comparison showed a match rate of 52.86%. Since over half of the pairs share the same value, it was also worth trying model fits with one of the two excluded.

Table 3.4 gives the validation results of models with different candidate pools. Three measures are shown for comparison:

1) Contributor prediction rate. This is the percentage of contributors in the

validation sample who have been correctly identified by the model as

contributors.

2) Prospect prediction rate. Similarly, this is the percentage of prospects in the

validation sample who have been correctly identified by the model as

prospects.

3) Prediction match rate. This is the percentage of constituents in the

validation sample who were correctly classified by the model.
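The three measures can be computed from actual and predicted classes as sketched below. The report does not state the probability cutoff used to turn scores into predicted classes, so the example starts from labels that are already classified; the ten-record sample and all names are hypothetical.

```python
def validation_rates(actual, predicted):
    """actual / predicted: sequences of 1 (contributor) and 0 (prospect)."""
    pairs = list(zip(actual, predicted))
    n_contrib = sum(1 for a, _ in pairs if a == 1)
    n_prospect = sum(1 for a, _ in pairs if a == 0)
    hit_contrib = sum(1 for a, p in pairs if a == 1 and p == 1)
    hit_prospect = sum(1 for a, p in pairs if a == 0 and p == 0)
    return {
        # Share of true contributors that the model labels as contributors
        "contributor_rate": hit_contrib / n_contrib,
        # Share of true prospects that the model labels as prospects
        "prospect_rate": hit_prospect / n_prospect,
        # Share of all constituents classified correctly
        "match_rate": (hit_contrib + hit_prospect) / len(pairs),
    }


# Hypothetical validation sample of 10 constituents (not the report's data).
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

rates = validation_rates(actual, predicted)
print(rates)
```

The match rate is a weighted average of the other two rates, with weights equal to the shares of contributors and prospects in the validation sample, so it always lies between them.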

Table 3.4 Performance of Logistic Models

Model No.   Model Detail           Contributor    Prospect       Prediction
                                   Pred. Rate     Pred. Rate     Match Rate
1           Initial Model          61.70%         79.64%         71.72%
2           No Age                 60.11%         80.95%         71.74%
3           No Age & Award         60.24%         81.05%         71.86%
4           No Age, Award, Home    59.47%         80.80%         71.37%
5           No Age, Award, Biz     61.80%         79.76%         71.83%

We observe that all five models are better at identifying prospects than contributors, and that the models do not differ considerably in performance. For the purpose of identifying contributors, model 5 slightly outperforms the others. If the goal is instead to identify prospects or to achieve the highest overall classification accuracy, model 3 produces the most desirable result.

The sets of significant predictors for models 3 and 5, along with their p-values and point estimates obtained using the maximum likelihood method, are shown in Tables B.1 and B.3 of Appendix B. The predictors are presented in descending order of statistical significance. Given the inputs, applying the model gives each constituent a predicted response, which is an estimate of the probability of giving (also known as a “score” [16]).

An excerpt of the fitting and statistical details for model 3 can be found in

Appendix C.

3.1.5 Odds and Odds Ratio

For a logistic model, the odds ratio is often of interest as well.

The odds of an event are calculated by dividing the probability of the event (P) by the probability of its complement, that is, P/(1-P) [2]. For instance, if the probability that a constituent is a contributor is 0.51, then the odds that a constituent is a contributor are 0.51/0.49 = 1.04. Odds greater than one imply that the event is more likely to happen than not (the odds of an event that is certain to happen are infinite); odds less than one imply that the event is less likely to happen than not (the odds of an impossible event are zero). An event equally likely to happen or not has odds of one.
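The conversion between probabilities and odds described above is simple enough to sketch directly. The function names below are ours and serve only to illustrate the definition P/(1-P) and its inverse.

```python
def odds(p):
    """Odds of an event with probability p; infinite when the event is certain."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability in [0, 1]")
    return float("inf") if p == 1.0 else p / (1.0 - p)


def probability(o):
    """Invert the odds back to a probability."""
    return 1.0 if o == float("inf") else o / (1.0 + o)


for p in (0.0, 0.25, 0.5, 0.51, 1.0):
    print(f"P = {p:.2f} -> odds = {odds(p):.4f}")
```

The loop covers the cases mentioned in the text: an impossible event, an event with even chances, the 0.51 example, and a certain event.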

An odds ratio is the ratio of the odds of one event to the odds of another event and is used to compare the two. In a logistic model, odds ratios are used to assess the effect of a predictor on the odds of the event being modeled (here, the event that a constituent is a contributor). Specifically, the odds ratio of a numeric predictor is the factor by which the odds are multiplied for each one-unit increase in that predictor; it equals the exponential of the predictor's coefficient. An odds ratio greater than one means that the odds of the event increase when the predictor goes up one unit, given that all other predictors remain unchanged [2].

In the logistic model equation (3.1), $P$ is a function of $X_1, \ldots, X_n$:

$$\log\left(\frac{P}{1-P}\right) = \beta_0 + \sum_{i=1}^{n} \beta_i X_i \tag{3.1}$$

Thus the value of the odds $\frac{P}{1-P}$, denoted by $O(X_1, \ldots, X_n)$, is also determined by the levels of the predictors. The log odds of the event for a set of given predictor levels $x_1, \ldots, x_n$, written as $\log[O(x_1, \ldots, x_n)]$, is just

$$\log[O(x_1, \ldots, x_n)] = \beta_0 + \sum_{i=1}^{n} \beta_i x_i \tag{3.2}$$

Suppose the $j$th predictor has a one-unit increase in its level (from $x_j$ to $x_j + 1$); then the log odds will correspondingly change to

$$\log[O(x_1, \ldots, x_j + 1, \ldots, x_n)] = \beta_0 + \sum_{i=1}^{n} \beta_i x_i + \beta_j \tag{3.3}$$

Subtracting (3.2) from (3.3) gives the difference between the two log odds

$$\log[O(x_1, \ldots, x_j + 1, \ldots, x_n)] - \log[O(x_1, \ldots, x_j, \ldots, x_n)] = \beta_j \tag{3.4}$$

and this equals

$$\log\left(\frac{O(x_1, \ldots, x_j + 1, \ldots, x_n)}{O(x_1, \ldots, x_j, \ldots, x_n)}\right) = \beta_j \tag{3.5}$$

which tells us the ratio between these two odds is

$$\frac{O(x_1, \ldots, x_j + 1, \ldots, x_n)}{O(x_1, \ldots, x_j, \ldots, x_n)} = e^{\beta_j} \tag{3.6}$$

and this is just the odds ratio for the $j$th predictor.
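Equation (3.6) can be checked numerically against the appendix tables. The sketch below exponentiates three coefficient estimates copied from Table B.1 and, assuming Wald intervals of the form exp(estimate ± 1.96 × standard error), also reproduces the interval estimates; the results agree with Table B.2 up to rounding of the published estimates. The interval formula is our assumption about how the reported intervals were formed, not something the report states.

```python
import math

# (estimate, standard error) for three numeric predictors, copied from Table B.1
model3 = {
    "Years since B.S. awarded": (0.0934, 0.00502),
    "Alumni activities count": (0.5564, 0.0752),
    "GPA": (0.6524, 0.1231),
}


def odds_ratio(beta, se, z=1.96):
    """Return exp(beta) and an assumed Wald interval exp(beta -/+ z * se)."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


odds_ratios = {name: odds_ratio(b, se) for name, (b, se) in model3.items()}
for name, (point, low, high) in odds_ratios.items():
    print(f"{name}: OR = {point:.3f} (95% CI {low:.3f} to {high:.3f})")
```

For example, each additional year since the B.S. was awarded multiplies the odds of giving by about 1.098, all other predictors held fixed.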


For a categorical (class) predictor, the odds ratio is the multiplicative change in the odds when the predictor changes from the baseline category (chosen in recoding) to the category in question [2]. Appendix C gives details about the categorical variable recoding for model 3.

Tables B.2 and B.4 of Appendix B show both point and interval estimates of the odds ratios for the significant variables identified in models 3 and 5.

3.2 Linear Regression Model

A linear regression model is appropriate for modeling a continuous numeric response, with one of the underlying assumptions being that the response, given the predictors, comes from a normal distribution [1]. For a linear regression model with $n$ predictors, the model equation is:

$$Y = \beta_0 + \sum_{i=1}^{n} \beta_i X_i + \varepsilon \tag{3.7}$$

in which $Y$ is the observed response, $\beta_0$ is the intercept, $\beta_i$ is the coefficient for the $i$th predictor $X_i$ ($i = 1, \ldots, n$) and $\varepsilon$ is the random error term, independently and identically distributed as $N(0, \sigma^2)$. Here, we use this method to predict the amount of contribution for each of the known contributors.

3.2.1 Box-Cox Transformation

The response was highly skewed, so we applied a Box-Cox transformation [1] (see Appendix D for more information); the transformation selected turned out to be the natural log. Figure 3.4 shows a histogram of the transformed response with a fitted normal curve.

Figure 3.4 Histogram of the Transformed Contribution Amount
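For reference, the Box-Cox family and its natural-log member can be sketched as follows. The standard form (y^λ - 1)/λ with the log as the λ = 0 limit is used; the gift amounts are hypothetical and the function name is ours.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a positive value y; lam = 0 is the natural-log limit."""
    if y <= 0:
        raise ValueError("Box-Cox requires a positive response")
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam


# Hypothetical gift amounts in dollars (not the report's data), strongly right-skewed.
amounts = [10, 25, 25, 50, 100, 250, 1000, 5000]
log_amounts = [box_cox(a, 0) for a in amounts]

# As lam shrinks toward 0, the power form approaches log(y).
near_zero = box_cox(100.0, 1e-8)
print([round(v, 3) for v in log_amounts])
print(near_zero, math.log(100.0))
```

The log compresses the long right tail: the largest hypothetical gift is 500 times the smallest, but on the log scale the two differ by only about 6.2 units.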

3.2.2 Model Fitting and Validation

Several linear regression models with slightly different groups of candidate predictors and significance levels for stepwise variable selection were tried; the two models in Table 3.5 performed best. As with the logistic fit, performance on the validation file was used as the criterion for comparison. The validation was done by first applying the respective model equation to the validation file and then grouping the constituents in the validation file into deciles based on their predicted giving amount. The percentage of the total actual contribution amount accounted for by each decile was then calculated. The results are shown in Table 3.5.

Table 3.5 Performance of Linear Models

            Model 1 (SLE=.01, SLS=.01)     Model 2 (SLE=.05, SLS=.01)
Decile      Amount          Percentage     Amount          Percentage
1st         $443,515.89     32.16%         $433,850.53     31.46%
2nd         $224,542.17     16.28%         $225,325.17     16.34%
3rd         $121,517.23      8.81%         $122,212.23      8.86%
Top 20%     $668,058.06     48.44%         $659,175.70     47.80%
Top 30%     $789,575.29     57.25%         $781,387.93     56.66%

In a hypothetical case where the constituents are randomly divided into deciles, each decile is expected to account for roughly 10% of the contributions. Here, however, the model-identified top 20% give almost half of the contribution amount within the validation file. A direct marketing professional would thus credit the model with over 300% lift [16] on the first decile and over 160% lift on the second. Comparing the two models, model 1 performed better, although the difference is relatively small.
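The lift figures quoted above follow from dividing each decile's share of actual giving by the 10% expected under random assignment. The sketch below applies that definition to the model 1 shares from Table 3.5 and also shows one way the decile shares themselves could be computed; the helper names are ours, and decile_shares assumes the number of records is divisible by the number of groups.

```python
# Decile shares of actual giving for model 1, copied from Table 3.5 (in percent).
model1_share = {"1st": 32.16, "2nd": 16.28, "3rd": 8.81}
RANDOM_SHARE = 10.0  # expected share of any decile under random assignment


def lift(share_pct, baseline_pct=RANDOM_SHARE):
    """Lift as a percentage of the random baseline (100 means no better than random)."""
    return 100.0 * share_pct / baseline_pct


def decile_shares(predicted, actual, n_groups=10):
    """Rank records by predicted amount (descending) and return each group's
    share of the actual total; assumes len(predicted) is divisible by n_groups."""
    order = sorted(range(len(predicted)), key=lambda i: predicted[i], reverse=True)
    size = len(order) // n_groups
    total = sum(actual)
    return [
        sum(actual[i] for i in order[g * size:(g + 1) * size]) / total
        for g in range(n_groups)
    ]


lifts = {decile: lift(share) for decile, share in model1_share.items()}
for decile, value in lifts.items():
    print(f"{decile} decile: lift = {value:.1f}%")
```

Under this definition the third decile of model 1 already falls below 100%, that is, it contributes slightly less than a randomly chosen tenth of the validation file would.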

The linear model based on the “complete” observations from the “student”

contributors yields the following set of significant predictors, sorted in

descending order of the magnitudes of their standardized coefficient

estimates.

Table 3.6 Linear Model Results

Predictor                        Coefficient Estimate    Standardized Estimate
Years since B.S. awarded          0.10327                 0.43306
Alumni activities count           0.30861                 0.16025
Reunion indicator                 0.47378                 0.10083
GPA                               0.43371                 0.08514
School activities indicator       0.08125                 0.05764
WPI spouse indicator              0.27620                 0.05708
Count of intramural sports        0.07689                 0.05672
Count of varsity sports          -0.10287                -0.04217
Contacts made as an alumnus       1.81278                 0.04169
PhD Indicator                    -1.75430                -0.04034
Biz Geographical Region
  Mass                            0.00049236              0.00024552
  Rest_NewEng                    -0.03521                -0.01409
  Midwest                         0.16820                 0.05330
  Northeast                      -0.03291                -0.01179
  South                           0.10224                 0.03518
  West                            0.09509                 0.03219
  Other                          -0.06383                -0.01829

3.2.3 Model Diagnostics

Although predictive capability was the principal feature of interest in these

models, residual plots were evaluated to check the usual assumptions of

normality and homoscedasticity and appropriateness of fit [1]. The normal

probability plot is given in Figure 3.5 as an example. No substantial

deviations from these assumptions were detected.


Figure 3.5 Normal Probability Plot of the Residuals

3.3 Multiple Imputation for Missing Values

Missing values are an issue in a substantial number of statistical analyses.

While analyzing only complete observations has its simplicity, the

information contained in the incomplete ones is lost. Sometimes there are

also systematic differences between the complete set and the incomplete set

and this can make the resulting inference inapplicable to the population of all

these observations, especially when the size of the complete set is relatively

small.

In our case, the highest missing rate occurred for the variable “GPA” (38.14%), followed by “B.S. Recency” (18.91%), so the complete set is still relatively large. Checking the data further, we found that the graduate and non-degree alumni categories have the “B.S. Recency” cells all blank, which is to be expected. Excluding these two categories reduced the missing rate to 0.90% for the single category of ALUM. This indicates that it is not appropriate to impute values for all three categories combined, since doing so would violate the important “missing at random” assumption underlying imputation. We therefore decided to do the imputation by individual category.

The MI procedure from SAS is capable of creating multiply imputed data sets

for incomplete data. It uses methods that incorporate appropriate variability

across the imputations. Available methods include a parametric method

(with multivariate normality assumption) like regression, a nonparametric

method like propensity score and a Markov Chain Monte Carlo (MCMC)

method [15].

Five imputations were run on the “student” group using the MCMC method. The multiply imputed data sets were then subjected to the same procedures for model selection, fitting, and analysis used for the complete data. The five logistic models all produced the same set of 24 significant predictors, with only the order of entry into the model differing slightly. Table 3.7 lists the coefficient estimates from these five analyses, with the predictors identified on the “complete” set bolded. The set includes all 17 variables from the model fitted on the “complete” fraction, and the estimated coefficients are fairly close across the five models. This reassures us of the stability and reliability of the imputation process.

Table 3.7 Modeling Results after Multiple Imputation

                              Coefficient Estimates for 5 Models
Predictor                     1          2          3          4          5
Class agent                   1.5638     1.5643     1.5630     1.5634     1.5642
Alumni activity indicator     0.6589     0.6591     0.6592     0.6594     0.6592
GPA                           0.0608     0.0601     0.0613     0.0607     0.0600
B.S. Recency                  0.0659     0.0658     0.0660     0.0659     0.0658
Non-WPI Degree                0.3153     0.3157     0.3154     0.3154     0.3155
Spouse Indicator              0.1783     0.1778     0.1784     0.1782     0.1780
Number of children            0.1503     0.1505     0.1503     0.1504     0.1505
Scholarship indicator         0.0925     0.0920     0.0928     0.0925     0.0921
Reunion indicator             0.7308     0.7311     0.7307     0.7308     0.7311
Greek house indicator         0.1322     0.1325     0.1322     0.1323     0.1325
Varsity sports               -0.1931    -0.1932    -0.1932    -0.1932    -0.1932
International Club           -0.2593    -0.2596    -0.2593    -0.2595    -0.2597
Club sport                    0.0650     0.0651     0.0650     0.0651     0.0651
Professional Society          0.1531     0.1534     0.1530     0.1532     0.1535
Music indicator               0.1369     0.1370     0.1369     0.1369     0.1370
School Involvement            0.1963     0.1962     0.1962     0.1962     0.1962
Honor Society                 0.1631     0.1628     0.1633     0.1631     0.1628
Project Center                0.1491     0.1487     0.1493     0.1490     0.1486
Marital Status
  Divorced                    0.2809     0.2827     0.2819     0.2820     0.2815
  Married                     0.2949     0.2965     0.2961     0.2954     0.2956
  NA                         -0.5442    -0.5536    -0.5507    -0.5475    -0.5480
  Other/Partner               0.1674     0.1689     0.1684     0.1678     0.1680
  Separated                  -0.2425    -0.2420    -0.2410    -0.2425    -0.2430
  Single                     -0.0785    -0.0775    -0.0773    -0.0781    -0.0783
B.S. Major
  Biological/LifeSci          0.0904     0.0872     0.0868     0.0869     0.0844
  BizEconomcs                 0.1913     0.1880     0.1875     0.1876     0.1853
  ChemicalEngr                0.1386     0.1357     0.1346     0.1351     0.1330
  Chemistry                  -0.1025    -0.1051    -0.1065    -0.1059    -0.1078
  CivilEngr                   0.3118     0.3089     0.3079     0.3083     0.3062
  ComputerSci                 0.3307     0.3276     0.3269     0.3272     0.3249
  Electr./Comp.Engr           0.3361     0.3334     0.3320     0.3326     0.3307
  HumanitiesArts              0.3840     0.3808     0.3803     0.3804     0.3780
  Math                        0.0872     0.0845     0.0832     0.0837     0.0818
  MechanicalEngr              0.2708     0.2678     0.2667     0.2672     0.2651
  NA                         -2.4217    -2.3838    -2.3708    -2.3759    -2.3483
  Other                       0.3389     0.3361     0.3349     0.3354     0.3334
  OtherEngr                   0.00773    0.00460    0.00382    0.00413    0.00189
Biz region
  Mass                        0.0275     0.0273     0.0272     0.0274     0.0274
  Midwest                     0.0546     0.0558     0.0558     0.0554     0.0553
  NA                         -0.5169    -0.5170    -0.5171    -0.5170    -0.5170
  Northeast                   0.0370     0.0371     0.0371     0.0371     0.0372
  Other                       0.2301     0.2298     0.2299     0.2299     0.2299
  Rest_NewEng                 0.1034     0.1031     0.1031     0.1031     0.1032
  South                       0.1563     0.1562     0.1562     0.1563     0.1562
Home region
  Mass                        0.1693     0.1692     0.1696     0.1693     0.1691
  Midwest                     0.2688     0.2675     0.2678     0.2681     0.2678
  NA                         -0.9271    -0.9266    -0.9273    -0.9268    -0.9262
  Northeast                   0.2461     0.2463     0.2463     0.2462     0.2462
  Other                       0.2081     0.2082     0.2083     0.2082     0.2082
  Rest_NewEng                 0.0489     0.0491     0.0493     0.0490     0.0488
  South                      -0.0653    -0.0653    -0.0652    -0.0653    -0.0653
Gender
  F                          -1.8857    -1.9988    -1.6320    -1.9068    -2.1126
  M                          -2.0699    -2.1827    -1.8159    -2.0907    -2.2964
  N                           1.9281     2.5259     2.1341     2.2005     2.1573
Distinction
  D                           0.000270   0.000200   0.000351   0.000278   0.000187
  H                           0.1050     0.1051     0.1049     0.1050     0.1051
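One simple way to summarize the five sets of estimates in Table 3.7 is to average them and to look at how much they vary across imputations. The sketch below does this for three predictors whose values are copied from the table. Averaging is the point-estimate step of the usual multiple-imputation combining rules; a full pooled standard error would also need the within-imputation standard errors, which Table 3.7 does not report, so that step is omitted here.

```python
from statistics import mean, variance

# Coefficient estimates from the five imputed-data fits, copied from Table 3.7.
estimates = {
    "GPA": [0.0608, 0.0601, 0.0613, 0.0607, 0.0600],
    "B.S. Recency": [0.0659, 0.0658, 0.0660, 0.0659, 0.0658],
    "Reunion indicator": [0.7308, 0.7311, 0.7307, 0.7308, 0.7311],
}

pooled = {}
for name, values in estimates.items():
    # Pooled point estimate: the mean of the per-imputation estimates.
    # Between-imputation variance: sample variance of those estimates.
    pooled[name] = (mean(values), variance(values))
    print(f"{name}: pooled = {pooled[name][0]:.5f}, "
          f"between-imputation variance = {pooled[name][1]:.2e}")
```

The very small between-imputation variances are a numerical restatement of the observation that the estimates are fairly close across the five models.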


Chapter 4 Conclusions

4.1 Summary

The logistic models identified sets of variables with statistically significant effects on the likelihood of giving for constituents in the student group. They also enabled us to assign a score [16] (i.e., a predicted value for the response) to current and future individuals in the group so that efforts can be focused on the higher-scored fraction. To score the constituents with “complete” records inside the “student” group, the models built on these observations should be used. If scoring the remaining individuals is also desired, the average predicted value from the models built after multiple imputation is an option. Overall, though, the “complete” models are the ones we recommend for scoring future “student” constituents, as we expect that incoming observations will all have complete information as a result of improved record keeping. The specific choice of model depends on what is to be achieved in a campaign and on the performance of the respective models.

The linear model gave a set of variables that are statistically significant in driving the magnitude of giving for contributors. The relative importance of the predictors can be judged by comparing the absolute values of the standardized coefficient estimates (shown in Table 3.6). The larger the absolute value, the larger the change in the expected (log-transformed) contribution amount for an increase of one standard deviation in the predictor (a unit that is comparable across the predictors after standardization).

Comparing the sets of significant predictors identified by the two models, there are seven in common. So, regardless of the objective, whether to predict the likelihood or the amount of giving, constituents who graduated earlier, work in particular geographical areas, participated in alumni activities and reunions in the past, had better academic performance and were involved in school activities while attending WPI, and whose spouse is also a constituent are more likely to give and to give larger amounts on average.

4.2 Future Work

The modeling so far has focused primarily on the “student” group. Profiles of the remaining constituent categories (parents, neighbors, friends, etc.) can also be investigated to see whether an effective predictive model can still be obtained with less information.

Major contributors flagged by the VIP indicator (generated by consolidating “PRES_FND”, “LIFETIME_PAC” and “TRUSTEE”) were excluded from the modeling base. Although they form a fairly small group, they tend to account for a large portion of the total gifts and display distinctive behaviors, which makes examination of the group worthwhile.

Other approaches to analysis, such as classification and neural network

methods, might be appropriate for analyzing this data set and could reveal

other interesting findings as well.


Appendix A: Table of Major Codes

Table A.1 WPI Major Codes

Code Description Dept

AE Aerospace Engineering ME

AL American Literature HU

AM Applied Mathematics MA

AS American Studies ND

ASC Assumption College ND

ASD Actuarial Science ND

B1 Cellular and Molecular Biology BB

B2 Biomaterials BE

BB Biology/Biotechnology BB

BBI Biology BB

BBT Biotechnology BB

BC Biochemistry CH

BE Biomedical Engineering BE

BIO Biology and Biotechnology BB

BIOC Computational Biology BB

BIOE Ecology & Environmental Bio BB

BIOG Cell & Molecular Bio/Genetics BB

BIOM Biomedical Interests BE

BIOO Organismal Biology BB

BIOP Bioprocess BB

BIS Biological Information Systems BB

BM Biomedical BE

BMP Biomedical Eng/Medical Physics BE

BS Biomedical Sciences BB

BSMB BS/MBA PROGRAM ND

BSMS BS/MS PROGRAM ND

BU Business ND

BUSA Business Administration ND

CA Computers with Applications CS

CC Customized Certificate ND

CCN Computers & Comm. Networks ND

CE Civil Engineering CE

CEEV Environmental CE

CEI Civil Engineering-Interdiscipl CE


CET Civil Engineering-Traffic CE

CH Chemistry CH

CHB Chemistry:Bio-organic Emphasis CH

CHI Chemistry-Interdisciplinary CH

CHMC Medicinal Chemistry CH

CL Clinical Engineering BE

CM Chemical Engineering CM

CMB Chem. Eng w/Biomedical Int. CM

CMBC Biochemical CM

CMBM Biomedical CM

CMEV Environmental CM

CMMT Materials CM

CMN Chem. Engr. w/Nuclear Int. CM

CNE Central New England College ND

COMM Commerce ND

CPM Construction Project Mgmt. CE

CS Computer Science CS

CSB Computer Sci w/Biomedical Int. CS

CSC Computers w/Commercial Appl. CS

CSM Computers w/Mathematical Appl. CS

CV Client / Server DCS

DE Differential Equations MA

DENT Dentistry ND

DT Drama/Theatre HU

EC Economics SST

ECE Electrical & Computer Eng. EE

ECO Ecology BB

ED Engineering - To Be Declared ND

EE Electrical Engineering EE

EEB Elect. Eng w/Biomedical Int. EE

EEC Elec. Eng. w/Comp. Eng. Spec. EE

EECO Computer Engineering EE

EEN Elec Engr w/ Nuclear Int EE

EIT Engineer in Training ND

EL English Literature HU

EM E-Commerce DCS

EN English HU

EP Environmental Policy & Develop SST

ER Entrepreneurship MG

ES Environmental Studies ND

ET Economics & Technology SST


EV Environmental Engineering ID

EVS Environmental Science ND

FORS Forestry ND

FPE Fire Protection Engineering FPE

FPIN Fire Protection Interests FPE

FR French HU

GD Geometric Dimens & Tolerance DCS

GH Global History HU

GN German HU

GS General Science (OldTimer) ND

GWEP Greater Worc Exec Prog ND

HCC Holy Cross College (32) ND

HI History ND

HS Hispanic Studies HU

HT Humanities Studies/Sci & Tech HU

HTE Humanities/Technology-English HU

HTH Humanities/Technology-History HU

HTT Humanities/Technology HU

HU Humanities and Arts HU

HUAH Art History HU

HUAS American Studies HU

HUCW Creative Writing HU

HUDT Drama/Theatre HU

HUEV Environmental Studies HU

HUGN German Studies HU

HUHI History HU

HUHS Hispanic Studies HU

HULI Literature HU

HUMU Music HU

HUPY Philosophy HU

HURE Religion HU

HUST HU Studies of Science & Tech HU

HUWR Writing and Rhetoric HU

ID Interdisciplinary ID

IDM Individually-Designed Minor ND

IE Industrial Engineering MG

IME Impact Engineering ID

IMGD Interactive Media & Game Dev ID

IN International Studies ID

IS Intersession ND

ISCH International Scholar ND


ISCP International Scholar Program ND

ISM Information Security - Mgmt ND

IST Information Security - Technic DCS

IT Information Technology MG

LIT Literature HU

LS Life Sciences ND

LSI Life Sciences-Interdisciplin ND

LT Law and Technology ID

MA Mathematical Sciences MA

MAC Actuarial Mathematics MA

MAF Financial Mathematics MA

MAI Industrial Mathematics MA

MAS Applied Statistics MA

MAT Mathematics MA

MBA Master of Business Admin. MG

ME Mechanical Engineering ME

MEA Mech. Eng. w/ Aerospace Int. ME

MEAE Aerospace ME

MEB Mech. Eng. w/ Biomedical Int. ME

MEBM Biomedical ME

MEEM Engineering Mechanics ME

MEEV Environmental ME

MEMB Biomechanical ME

MEMD Mechanical Design ME

MEMF Manufacturing ME

MEMS Materials Science ME

MEN Mech. Eng. w/ Nuclear Int. ME

MENE Nuclear ME

METF Thermal-Fluids ME

MF Manufacturing Systems Eng. ME

MFA Advanced Manufacturing Eng. ME

MFE Manufacturing Engineering ME

MFM Manufacturing Management MG

MFS Manufacturing Eng Mgmt ID

MG Management MG

MGC Management with Computer Appl. MG

MGE Management Engineering MG

MGS Management Science & Engr. MG

MGT Management MG

MH Mathematics MA

MHS Statistics MA


MIS Management Information Systems MG

MM Master of Mathematics MA

MME Master of Mathematics for Educ MA

MN Management Development DCS

MNS Master of Natural Sciences BB

MPE Materials Processing Eng ME

MSM Master of Science in Mgmt. MG

MT Management of Technology MG

MTE Materials Science and Eng. ME

MTI Marketing & Tech. Innovation MG

MTL Materials ME

MU Music HU

MUSC Music HU

N1 Nanoscience CM

NC Non-Certificate ( DCS/CPE ) DCS

ND To Be Declared ND

NE Nuclear Engineering ME

NURS Nursing ND

ODL Operations Design & Leadership MG

OIT Operations & Information Tech. MG

OL Organizational Leadership MG

OT Special Topics DCS

PDEN Pre-Dental ND

PH Physics PH

PHE Engineering Physics PH

PHL Philosophy HU

PHL1 Philosophy of Social Problems HU

PHRM Pharmacy ND

PI Process Improvement DCS

PL Urban & Environmental Planning CE

PLE Plant Eng. Certificate ND

PM Pre-Med ND

PMED Pre-Medical ND

PO Political Science & Law SST

PR Project Management DCS

PS Psychology SST

PSM Power Systems Management ID

PSS Psychological Science SS

PVET Pre-Veterinary ND

PW Professional Writing HU

QI Quality Improvement DCS


RE Religion HU

RH Rhetoric HU

SC Science (Freshmen Only) ND

SD System Dynamics SST

SE Structural Engineering CE

SIM School of Industrial Management MG

SM Systems Modeling ID

SO Sociology SST

SP Spanish HU

SS Social Science SST

SST Social Science & Technology SST

ST Society, Technology & Policy SST

STA Statistics MA

TC Tech, Sci & Prof Communication ID

TEAC Teaching ND

TM Technology Marketing MG

TW Technical Writing ID

URB Urban Planning ND

URBN Urban Studies ND

WC World Class Manufacturing DCS

WD Windows 2000 DCS

WH World History HU

WR Writing and Rhetoric HU

WT Web Technologies DCS


Appendix B: Logistic Modeling Results

Table B.1 Logistic Fit Results for Model 3

Predictor                               Estimate    Standard Error    p-value
Years since B.S. awarded                 0.0934      0.00502          <.0001
Biz geographical region                                               <.0001
  Mass                                   0.0900      0.1080
  Midwest                                0.00161     0.2098
  NA                                    -0.4991      0.0988
  Northeast                              0.1569      0.1499
  Other                                 -0.2025      0.5470
  Rest_NewEng                            0.1265      0.1257
  South                                  0.2800      0.1580
Alumni activities count                  0.5564      0.0752           <.0001
Number of children                       0.2573      0.0417           <.0001
School activities indicator              0.2142      0.0413           <.0001
Home geographical region                                              <.0001
  Mass                                   0.2199      0.0874
  Midwest                               -0.0531      0.1745
  NA                                    -0.6728      0.1280
  Northeast                              0.2502      0.1236
  Other                                  0.3195      0.4169
  South                                 -0.0598      0.1252
GPA                                      0.6524      0.1231           <.0001
Reunion indicator                        0.6021      0.1207           <.0001
Gender                                                                <.0001
  F                                      2.9413     55.3900
  M                                      2.6106     55.3900
Indicator, non-WPI degree reported       0.3449      0.0789           <.0001
WPI spouse indicator                     0.3413      0.1057           0.0011
International club activities count     -0.2896      0.0918           0.0044
Honor society count                      0.1960      0.0792           0.0041
Professional society count               0.1626      0.0586           0.0136
Area of B.S. major                                                    0.0145
  Biological/LifeSci                    -0.0489      0.1158
  BizEconomcs                           -0.1438      0.1251
  ChemicalEngr                          -0.1830      0.1259
  Chemistry                             -0.2228      0.2456
  CivilEngr                             -0.0230      0.1068
  ComputerSci                            0.2337      0.1080
  Electrical/ComputerEngr                0.2372      0.0894
  HumanitiesArts                         0.2935      0.2593
  Math                                   0.0697      0.1843
  MechanicalEngr                         0.1094      0.0854
  Other                                 -0.0625      0.5205
  OtherEngr                              0.0785      0.4296
Greek house indicator                    0.1412      0.0650           0.0354
Graduate with distinction                                             0.0457
  D                                      0.0162      0.0454
  H                                     -0.1450      0.0670

Table B.2 Odds Ratio Estimates for Model 3

Predictor                               Point Estimate    95% Confidence Interval
Alumni activities count                 1.744             1.505      2.022
GPA                                     1.920             1.509      2.444
Years since B.S. awarded                1.098             1.087      1.109
Indicator, non-WPI degree reported      1.412             1.210      1.648
Number of children                      1.293             1.192      1.404
Reunion indicator                       1.826             1.441      2.313
International club activities count     0.749             0.625      0.896
School activities indicator             1.239             1.142      1.343
Gender
  F vs N                                >999.999          <0.001     >999.999
  M vs N                                >999.999          <0.001     >999.999
Area of B.S. major
  Biological/LifeSci vs Physics         1.335             0.801      2.225
  BizEconomcs vs Physics                1.214             0.721      2.044
  ChemicalEngr vs Physics               1.168             0.694      1.964
  Chemistry vs Physics                  1.122             0.564      2.231
  CivilEngr vs Physics                  1.370             0.830      2.262
  ComputerSci vs Physics                1.771             1.073      2.925
  Electr./Comp.Engr vs Physics          1.777             1.095      2.885
  HumanitiesArts vs Physics             1.880             0.924      3.828
  Math vs Physics                       1.503             0.830      2.723
  MechanicalEngr vs Physics             1.564             0.967      2.531
  Other vs Physics                      1.317             0.399      4.352
  OtherEngr vs Physics                  1.517             0.549      4.190
Biz geographical region
  Mass vs West                          1.044             0.732      1.490
  Midwest vs West                       0.956             0.562      1.627
  NA vs West                            0.579             0.413      0.813
  Northeast vs West                     1.117             0.728      1.713
  Other vs West                         0.780             0.221      2.749
  Rest_NewEng vs West                   1.083             0.737      1.591
  South vs West                         1.263             0.813      1.963
Home geographical region
  Mass vs West                          1.311             0.981      1.752
  Midwest vs West                       0.998             0.642      1.552
  NA vs West                            0.537             0.375      0.769
  Northeast vs West                     1.352             0.950      1.922
  Other vs West                         1.449             0.553      3.792
  Rest_NewEng vs West                   1.103             0.809      1.505
  South vs West                         0.991             0.696      1.413
Graduate with distinction
  D vs NA                               0.894             0.774      1.031
  H vs NA                               0.760             0.610      0.947

Table B.3 Logistic Fit Results for Model 5

Predictor                               Estimate    Standard Error    p-value
Years since B.S. awarded                 0.0929       0.00512         <.0001
Alumni activities count                  0.5599       0.0754          <.0001
Indicator, job title reported            0.4325       0.0587          <.0001
Home geographical region                                              <.0001
  Mass                                   0.2354       0.0776
  Midwest                               -0.1079       0.1424
  NA                                    -0.5634       0.1433
  Northeast                              0.2786       0.1047
  Other                                  0.1304       0.4045
  South                                  0.0140       0.1090
School activities indicator              0.2055       0.0412          <.0001
Number of children                       0.2072       0.0442          <.0001
GPA                                      0.4522       0.0914          <.0001
Reunion indicator                        0.6336       0.1207          <.0001
Indicator, non-WPI degree reported       0.3311       0.0791          <.0001
WPI spouse indicator                     0.2323       0.1118          <.0001
International student indicator         -0.6172       0.2221          0.0014
Gender                                                                0.0022
  F                                      3.2056      91.3223
  M                                      2.9052      91.3223
International club activities count     -0.2815       0.0922          0.0043
Area of B.S. major                                                    0.0096
  Biological/LifeSci                    -0.0732       0.1153
  BizEconomcs                           -0.1607       0.1249
  ChemicalEngr                          -0.1584       0.1253
  Chemistry                             -0.2089       0.2456
  CivilEngr                             -0.0197       0.1062
  ComputerSci                            0.2204       0.1074
  Electrical/ComputerEngr                0.2432       0.0891
  HumanitiesArts                         0.2724       0.2596
  Math                                   0.0732       0.1831
  MechanicalEngr                         0.1162       0.0849
  Other                                 -0.0267       0.5126
  OtherEngr                              0.0969       0.4295
Marriage                                                              0.0374
  Divorced                              -0.7714      39.2215
  Married                               -1.0593      39.2208
  NA                                    -1.5833      39.2225
  Other/Partner                         -1.3808      39.2240
  Separated                              7.4173     235.3
  Single                                -1.2968      39.2208


Table B.4 Odds Ratio Estimates for Model 5

Effect                                  Estimate     95% C.I.
Alumni activities count                 1.750        1.510      2.029
GPA                                     1.572        1.314      1.880
Years since B.S. awarded                1.097        1.086      1.108
Indicator, non-WPI degree reported      1.392        1.192      1.626
Number of children                      1.230        1.128      1.342
Reunion indicator                       1.884        1.487      2.387
Indicator, job title reported           1.541        1.374      1.729
International student indicator         0.539        0.349      0.834
International club activities count     0.755        0.630      0.904
School activities indicator             1.228        1.133      1.331
Home geographical region
  Mass vs West                          1.340        1.080      1.662
  Midwest vs West                       0.950        0.673      1.342
  NA vs West                            0.603        0.424      0.856
  Northeast vs West                     1.399        1.070      1.829
  Other vs West                         1.206        0.478      3.042
  Rest_NewEng vs West                   1.135        0.898      1.435
  South vs West                         1.074        0.814      1.416
Marriage
  Divorced vs Widowed                   1.741        0.102      29.789
  Married vs Widowed                    1.305        0.080      21.297
  NA vs Widowed                         0.773        0.042      14.269
  Other/Partner vs Widowed              0.946        0.046      19.429
  Separated vs Widowed                  >999.999     <0.001     >999.999
  Single vs Widowed                     1.029        0.063      16.750
Gender
  F vs N                                >999.999     <0.001     >999.999
  M vs N                                >999.999     <0.001     >999.999
Area of B.S. major
  Biological/LifeSci vs Physics         1.352        0.814      2.246
  BizEconomcs vs Physics                1.238        0.738      2.079
  ChemicalEngr vs Physics               1.241        0.741      2.081
  Chemistry vs Physics                  1.180        0.595      2.341
  CivilEngr vs Physics                  1.426        0.867      2.345
  ComputerSci vs Physics                1.813        1.102      2.983
  Electrical/ComputerEngr vs Physics    1.855        1.147      3.000
  HumanitiesArts vs Physics             1.910        0.939      3.883
  Math vs Physics                       1.565        0.867      2.823
  MechanicalEngr vs Physics             1.634        1.013      2.634
  Other vs Physics                      1.416        0.436      4.600
  OtherEngr vs Physics                  1.602        0.580      4.423


Appendix C: Logistic Modeling Detail

Table C.1 Class Variable Recoding Detail

Class Var.    Categories          Design Variables

Category      ALND                 1
              ALUM                -1

Gender        F                    1  0
              M                    0  1
              N                   -1 -1

Marriage      Divorced             1  0  0  0  0  0
              Married              0  1  0  0  0  0
              NA                   0  0  1  0  0  0
              Other/Partner        0  0  0  1  0  0
              Separated            0  0  0  0  1  0
              Single               0  0  0  0  0  1
              Widowed             -1 -1 -1 -1 -1 -1

B.S. Major    Biological/LifeSci   1  0  0  0  0  0  0  0  0  0  0  0
              BizEconomcs          0  1  0  0  0  0  0  0  0  0  0  0
              ChemicalEngr         0  0  1  0  0  0  0  0  0  0  0  0
              Chemistry            0  0  0  1  0  0  0  0  0  0  0  0
              CivilEngr            0  0  0  0  1  0  0  0  0  0  0  0
              ComputerSci          0  0  0  0  0  1  0  0  0  0  0  0
              Elect./Comp.Engr     0  0  0  0  0  0  1  0  0  0  0  0
              HumanitiesArts       0  0  0  0  0  0  0  1  0  0  0  0
              Math                 0  0  0  0  0  0  0  0  1  0  0  0
              MechanicalEngr       0  0  0  0  0  0  0  0  0  1  0  0
              Other                0  0  0  0  0  0  0  0  0  0  1  0
              OtherEngr            0  0  0  0  0  0  0  0  0  0  0  1
              Physics             -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1

Bizstate      Mass                 1  0  0  0  0  0  0
              Midwest              0  1  0  0  0  0  0
              NA                   0  0  1  0  0  0  0
              Northeast            0  0  0  1  0  0  0
              Other                0  0  0  0  1  0  0
              Rest_NewEng          0  0  0  0  0  1  0
              South                0  0  0  0  0  0  1
              West                -1 -1 -1 -1 -1 -1 -1

Home          Mass                 1  0  0  0  0  0  0
              Midwest              0  1  0  0  0  0  0
              NA                   0  0  1  0  0  0  0
              Northeast            0  0  0  1  0  0  0
              Other                0  0  0  0  1  0  0
              Rest_NewEng          0  0  0  0  0  1  0
              South                0  0  0  0  0  0  1
              West                -1 -1 -1 -1 -1 -1 -1

Distinction   D                    1  0
              H                    0  1
              NA                  -1 -1
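The pattern in Table C.1, where each non-reference category gets its own indicator column and the last category is coded -1 in every column, matches what is usually called effect (deviation-from-mean) coding. The following Python sketch reproduces that pattern for the Gender variable. The function names and the coefficient values are hypothetical and are not taken from the report; the odds-ratio helper relies on the standard property of effect coding that the implied coefficient of the reference level is minus the sum of the other coefficients.

```python
import math

def effect_code(levels, reference):
    """Map each level to its design row under effect coding.

    Non-reference levels get a 1 in their own column and 0 elsewhere;
    the reference level gets -1 in every column.
    """
    non_ref = [lv for lv in levels if lv != reference]
    coding = {}
    for lv in levels:
        if lv == reference:
            coding[lv] = [-1] * len(non_ref)
        else:
            coding[lv] = [1 if lv == other else 0 for other in non_ref]
    return coding

def odds_ratio_vs_reference(betas, level):
    """Odds ratio of `level` versus the reference level.

    Under effect coding the reference coefficient is -sum(betas), so the
    log odds ratio is betas[level] - (-sum(betas)).
    """
    return math.exp(betas[level] + sum(betas.values()))

gender_design = effect_code(["F", "M", "N"], reference="N")

# Hypothetical coefficients, for illustration only.
example_betas = {"F": 0.2, "M": 0.1}
example_or = odds_ratio_vs_reference(example_betas, "F")
```

Here `gender_design` holds the three Gender rows of Table C.1, and `example_or` shows how an "F vs N" style odds ratio would be recovered from effect-coded coefficients.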

Table C.2 Summary of Stepwise Selection

Step  Effect Entered    Effect Removed  DF  Number In  Score Chi-Square  p-value

 1    bsrecency_mod                      1      1         1049.2959      <.0001
 2    bizstate_mod                       7      2          249.6479      <.0001
 3    alum_mod                           1      3          149.7490      <.0001
 4    child_mod                          1      4           77.3250      <.0001
 5    schinvolve_mod                     1      5           60.0473      <.0001
 6    home_mod                           7      6           66.3851      <.0001
 7    gpa_mod                            1      7           41.0654      <.0001
 8    reunion_mod                        1      8           32.4448      <.0001
 9    gender_mod                         2      9           21.4961      <.0001
10    nonwpideg_mod                      1     10           17.3076      <.0001
11    sps_mod                            1     11           10.5696      0.0011
12    intlclub_mod                       1     12            8.1032      0.0044
13    honorsoc_mod                       1     13            8.2241      0.0041
14    profsoc_mod                        1     14            6.0898      0.0136
15    bsmajor_mod                       12     15           25.0615      0.0145
16    frat_mod                           1     16            4.4279      0.0354
17    distinction_mod                    2     17            6.1715      0.0457
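The p-values in Table C.2 can be checked against the score chi-square statistics and their degrees of freedom, assuming each p-value is the upper tail of a chi-square distribution with the listed DF. The sketch below uses only the Python standard library; the function name `chi2_sf` is an illustrative choice. It builds the tail probability from the recurrence Q(a+1, z) = Q(a, z) + z^a e^(-z) / Γ(a+1) for the regularized upper incomplete gamma function, starting from Q(1/2, z) = erfc(√z) for odd DF and Q(1, z) = e^(-z) for even DF.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    z = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)               # Q(1, z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))    # Q(1/2, z)
    # Step a up by 1 until it reaches df / 2.
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Rows taken from Table C.2: (effect, DF, score chi-square).
rows = [
    ("sps_mod", 1, 10.5696),
    ("bsmajor_mod", 12, 25.0615),
    ("distinction_mod", 2, 6.1715),
]
p_values = {name: chi2_sf(stat, df) for name, df, stat in rows}
```

The three computed values agree with the tabled p-values for those rows to the printed precision.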

Table C.3 Association of Predicted Probabilities and Observed Responses

Percent Concordant      79.0    Somers' D   0.582
Percent Discordant      20.8    Gamma       0.583
Percent Tied             0.2    Tau-a       0.286
Pairs               11883776    c           0.791
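Three of the rank-correlation measures in Table C.3 follow directly from the concordant, discordant, and tied percentages. The sketch below uses the usual definitions: Somers' D is the concordant share minus the discordant share, Gamma is the same difference divided by the non-tied share, and c is the concordant share plus half the tied share. The function name is an illustrative choice. Tau-a is left out because it depends on the total number of observations, which the table does not report.

```python
def association_measures(pct_concordant, pct_discordant, pct_tied):
    """Return (Somers' D, Gamma, c) from pair percentages."""
    total = pct_concordant + pct_discordant + pct_tied
    nc = pct_concordant / total   # share of concordant pairs
    nd = pct_discordant / total   # share of discordant pairs
    nt = pct_tied / total         # share of tied pairs
    somers_d = nc - nd
    gamma = (nc - nd) / (nc + nd)
    c = nc + 0.5 * nt
    return somers_d, gamma, c

# Percentages taken from Table C.3.
somers_d, gamma, c = association_measures(79.0, 20.8, 0.2)
```

Rounded to three decimals, the results match the tabled Somers' D, Gamma, and c.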

Appendix D: Box-Cox Transformation

The second phase of the analysis (the linear regression model) starts with a check of whether the response variable needs to be transformed. Figure D.1 shows the histogram of the response variable with a fitted normal curve. The distribution is clearly far from normal, so a transformation is necessary. The Box-Cox procedure [1] is then used to choose the transformation. Figure D.2 illustrates how the sum of squared errors changes with λ, the power of the transformation. Both the software printout and the line plot led to the choice of λ = 0, which corresponds to a natural log transformation of the contribution amount. Figure 3.4 shows the histogram of the transformed responses along with a fitted normal curve, which has a much more plausible shape.
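The search over λ described above can be sketched in Python as follows. This is a minimal illustration, not the software run used in the project: the data are made up (a noisy log-linear response with a single predictor), the grid of λ values is arbitrary, and the function names are hypothetical. The transform uses a common standardization based on the geometric mean of the response, so that the sums of squared errors are comparable across different values of λ.

```python
import math

def ols_sse(x, w):
    """Sum of squared errors from a simple least-squares line of w on x."""
    n = len(x)
    xbar = sum(x) / n
    wbar = sum(w) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (wi - wbar) for xi, wi in zip(x, w))
    slope = sxy / sxx
    intercept = wbar - slope * xbar
    return sum((wi - intercept - slope * xi) ** 2 for xi, wi in zip(x, w))

def boxcox_standardized(y, lam):
    """Standardized Box-Cox transform of a positive response y."""
    n = len(y)
    k2 = math.exp(sum(math.log(yi) for yi in y) / n)  # geometric mean of y
    if lam == 0:
        return [k2 * math.log(yi) for yi in y]
    k1 = 1.0 / (lam * k2 ** (lam - 1.0))
    return [k1 * (yi ** lam - 1.0) for yi in y]

# Made-up data (not the giving data): log(y) is linear in x plus small noise.
x = [0.5 * i for i in range(20)]
y = [math.exp(1.0 + 0.5 * xi + 0.05 * (-1) ** i) for i, xi in enumerate(x)]

# Evaluate the SSE on a grid of lambda values from -2 to 2 in steps of 0.25.
grid = [k / 4 for k in range(-8, 9)]
sse_by_lambda = {lam: ols_sse(x, boxcox_standardized(y, lam)) for lam in grid}
best_lambda = min(sse_by_lambda, key=sse_by_lambda.get)
```

For this artificial response the smallest SSE occurs at λ = 0, the log transformation, which mirrors the kind of curve summarized in Figure D.2.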

Figure D.1 Histogram of the Contribution Amount of Contributors

Figure D.2 Plot of Box-Cox Result

Bibliography

[1] Michael H. Kutner, Christopher J. Nachtsheim, John Neter, William Li. Applied Linear Statistical Models, fifth edition. McGraw-Hill, 2005

[2] David W. Hosmer Jr., Stanley Lemeshow. Applied Logistic Regression, second edition. Wiley-Interscience, 2000

[3] Paul D. Allison. Logistic Regression Using the SAS System: Theory and Application, first edition. SAS Publishing, 1999

[4] Alan Agresti. Categorical Data Analysis, second edition. Wiley-Interscience, 2002

[5] Stokes. Categorical Data Analysis Using the SAS System, second edition. WA (Wiley-SAS), 2006

[6] Sharon L. Lohr. Sampling: Design and Analysis, first edition. Duxbury Press, 1998

[7] Joseph D. Petruccelli, Balgobin Nandram, Minghui Chen. Applied Statistics for Engineers and Scientists, first edition. Prentice Hall, 1999

[8] Ron P. Cody, Jeffrey K. Smith. Applied Statistics and the SAS Programming Language, fifth edition. Prentice Hall, 2005

[9] Lora D. Delwiche, Susan J. Slaughter. The Little SAS® Book: A Primer, third edition. SAS Publishing, 2003

[10] Ronald P. Cody. Cody's Data Cleaning Techniques Using SAS Software, first

[11] Katherine Prairie. The Essential PROC SQL Handbook for SAS Users, first

[12] Kirk Paul Lafler. Proc SQL: Beyond the Basics Using SAS, first edition. SAS Publishing, 2004

[13] Ronald P. Cody, Ray Pass, SAS Institute. SAS Programming by Example, first edition. SAS Publishing, 1995

[14] Ronald P. Cody. SAS Functions by Example, first edition. SAS Publishing, 2004

[15] SAS Institute Inc. SAS OnlineDoc 9.1.2, http://support.sas.com/onlinedoc/912/

[16] David Shepard Associates, Inc. The New Direct Marketing: How to Implement A Profit-Driven Database Marketing Strategy, third edition. McGraw-Hill, 1999
