Database Management

12/4/2015

“The Regions of Virginia and Educational Outcomes”

By Corydon W. Baylor

Prepared for:

Database Management in the Social Sciences

Database Management Briefing

Working with the Raw Data

For this project, I wanted to perform an analysis on the effect living in different regions has on

educational outcomes in the Commonwealth of Virginia. I utilized six databases from the

Virginia Department of Education (VDOE) for my analysis. I used 2008 to 2009 and 2009 to

2010 graduation rate and dropout rate cohort data. Additionally, I used 2008 to 2009 and 2009 to

2010 postsecondary enrollment and cohort size data. Finally, I used a database that classified

each school division as city, rural, suburban, or town. Using information stored in a table on the

VDOE website, I added regional information to each school division as well. VDOE breaks

Virginia into eight regions: Central Virginia, Tidewater, Northern Neck, Northern Virginia,

Valley, Western Virginia, Southwest, and Southside.

Challenges and Opportunities in the Data

The files were relatively easy to use but suffered from poor documentation and a lack of

standardized sources across datasets. Fortunately, each dataset contained a common variable, the

school division, which served as a common unit of observation across datasets. The college

enrollment and graduation rate datasets also contained school and state level data, but for the

purposes of my analysis, I dropped these observations.

I found the lack of comprehensive documentation a true challenge throughout the entire project.

While documentation exists describing each variable and how the data was collected, there was

no easily accessible information explaining how the data was arranged. The division level data

contained thousands of duplicates but no clear directions explaining which observations were

duplicates. For example, if the gender variable contained an “M” then it was displaying

information for males. If the gender variable was blank then it was displaying information for

both males and females. There were ten binary variables. Each variable had to be “empty” to

describe both populations. As such, each school division had dozens of observations, but only

one contained information for the entire division because only one observation had all blank binary

variables. The college enrollment data had similar issues. Proper documentation would have

proved invaluable in overcoming these challenges.

An additional challenge was the underlying quality of the data and the lack of a standardized

source across datasets. I found it a little bit concerning that the graduation rate and the dropout

rate did not sum to 100%, though I suppose that the remaining percentage were those who

were repeating grades. Additionally, the college enrollment data and the graduation and dropout data

come from two different sources. The college enrollment data is gathered from National Student

Clearinghouse, which uses estimates instead of actual counts. This results in some hard to

reconcile figures when comparing graduation rates, cohort size, and college enrollment rates. For

example, Fairfax County has a higher college enrollment rate than a graduation rate. National

Student Clearinghouse also reported more students enrolling in college than graduated in a

cohort for Highland County. Clearly, using two different data sources results in distortions when

combining datasets. Personally, I believe that National Student Clearinghouse makes some

errors in their estimation, and these errors become highly visible when looking at small school

divisions. This is a major limitation.

Cleaning the Datasets

The first major challenge I faced was cleaning the six datasets. Cleaning the regional description

database, henceforth called the VDOE database, was relatively easy. I needed to add labels to my

region variable and combine different variables into a simplified division description variable.

Finally, I needed to drop Lexington City as it was not a division in the other datasets.

Cleaning the graduation and dropout datasets, henceforth called the cohort datasets, proved more

difficult. As previously mentioned, each school division contained only one observation that was

relevant to my analysis, a fact missing from the documentation. A decent amount of guesswork

therefore went into cleaning this data. Eventually, I discovered the problem and dropped all

observations that were not division level and that did not contain empty binary variables. I used a

loop to clean both the 2008-2009 and 2009-2010 data at the same time. I additionally dropped

the Department of Corrections and the School for the Deaf and Blind as these are not geographic

regions. I also had to drop these observations in the college enrollment dataset.
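
The filtering rule described above can be sketched as follows. This is a minimal illustration in Python rather than the Stata used for the project; the column names (level, div_name, gender, race, disability), the three-flag list, and the sample rows are all hypothetical stand-ins for the real VDOE layout, which has ten subgroup variables.

```python
# Hypothetical column names; the real VDOE files use their own headers
# and, per the text above, ten subgroup variables rather than three.
SUBGROUP_FLAGS = ["gender", "race", "disability"]
NON_GEOGRAPHIC = {"Department of Corrections", "School for the Deaf and Blind"}


def keep_division_totals(rows):
    """Keep only all-students rows for geographic school divisions."""
    kept = []
    for row in rows:
        if row["level"] != "Division":
            continue  # school- and state-level rows are out of scope
        if row["div_name"] in NON_GEOGRAPHIC:
            continue  # these are not geographic regions
        if any(row[flag].strip() for flag in SUBGROUP_FLAGS):
            continue  # a filled-in flag marks a subgroup, not the whole division
        kept.append(row)
    return kept


def make_row(level, div_name, gender="", race="", disability=""):
    """Build one made-up observation for the demonstration below."""
    return {"level": level, "div_name": div_name,
            "gender": gender, "race": race, "disability": disability}


# Made-up rows standing in for the two cohort files.
raw_by_year = {
    "2008-2009": [
        make_row("Division", "Division A"),
        make_row("Division", "Division A", gender="M"),
        make_row("School", "Division A"),
        make_row("Division", "Department of Corrections"),
    ],
    "2009-2010": [
        make_row("Division", "Division A"),
        make_row("Division", "Division A", race="1"),
        make_row("State", "Virginia"),
    ],
}

# One loop cleans both years, mirroring the loop described above.
clean_by_year = {year: keep_division_totals(rows)
                 for year, rows in raw_by_year.items()}
```

With this rule, each division keeps exactly one row per year: the one whose subgroup flags are all blank.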

Cleaning the college enrollment dataset proved to be the most difficult. Again, I used a loop to

clean across the two years in question. Due to its small size, data on gender, race and other

variables was restricted so I had to replace them with missing values. Next, I dropped all

observations at the school and state level and observations that provided counts for

non-four year universities, as I wanted to restrict my analysis to students who went to

four year universities. After cleaning the data, I was left with each division containing two

observations. One observation had a count of students enrolling in a public four year college. The

other had a count of students enrolling in a private four year college. I used the collapse

command to sum the counts of public and private enrollment into one observation per school division.

The Collapse Command:

collapse (sum) ps_enrollment_cnt (mean) year, by(div_name)
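
For readers less familiar with collapse, the following is a rough Python equivalent of the sum-by-division step. It is only a sketch: the institution_type column and the counts are invented for illustration, while div_name and ps_enrollment_cnt follow the command above.

```python
from collections import defaultdict


def collapse_sum(rows, by, value):
    """Sum `value` within each distinct value of `by`, one total per group."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[by]] += row[value]
    return dict(totals)


# Made-up counts: two rows per division, one public and one private.
rows = [
    {"div_name": "Division A", "institution_type": "public", "ps_enrollment_cnt": 120},
    {"div_name": "Division A", "institution_type": "private", "ps_enrollment_cnt": 30},
    {"div_name": "Division B", "institution_type": "public", "ps_enrollment_cnt": 40},
    {"div_name": "Division B", "institution_type": "private", "ps_enrollment_cnt": 15},
]

enrollment = collapse_sum(rows, by="div_name", value="ps_enrollment_cnt")
```

Each division ends up with a single four-year enrollment total, which is what the later merge needs.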

Creating One Clean Dataset

After cleaning each individual dataset, all that was left was to merge everything together. At first,

I had a lot of trouble merging on the division variable as it was a string variable. I decided to use

the encode command to remedy this problem. The encode command assigns a unique integer for

each unique string it finds in a variable. It then keeps the string as the integer label. Because

across all six datasets the school divisions were arranged alphabetically, the encode command

assigned the same integer to the same string for each database. Of course, this would not work if

a single school division was out of order. However, after looking at each

division variable next to each other, I found that I did not have this problem. Because each

division now had a unique and consistent integer, I could easily merge across all six datasets. I

consider coming up with a simple way to merge string variables together using encode to be my

proudest data management achievement.
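
A small Python sketch can illustrate when this approach is safe. It assumes, in line with Stata's documented behavior, that encode numbers the distinct strings in sorted order; the encode function and the division names below are hypothetical stand-ins, not the Stata command itself. Under that assumption the codes agree across datasets whenever each dataset contains exactly the same set of names, and they stop agreeing as soon as one dataset has an extra or differently spelled name.

```python
def encode(names):
    """Map each distinct string to 1, 2, 3, ... in sorted order."""
    return {name: code for code, name in enumerate(sorted(set(names)), start=1)}


# Made-up division names; row order differs, but the set of names is the same.
cohort_names = ["Bravo", "Alpha", "Delta"]
enroll_names = ["Alpha", "Bravo", "Delta"]

# A third dataset with one extra name that sorts into the middle.
extra_names = ["Alpha", "Bravo", "Charlie", "Delta"]

same_codes = encode(cohort_names) == encode(enroll_names)
shifted_code = encode(extra_names)["Delta"]  # pushed from 3 to 4 by "Charlie"
```

This is why a division that appears in only one dataset has to be dropped before encoding, as was done with Lexington City.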

First, I merged the cohort dataset to the VDOE dataset. Next, I merged the college enrollment dataset

to the VDOE dataset. Finally, I merged these two datasets together. This left me with two

datasets: one containing VDOE, cohort, and college enrollment data for 2008-2009, and

one containing the same data for 2009-2010. I appended the two datasets together and was left

with one big dataset that contained everything I needed.
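
The merge-then-append sequence can be sketched in Python as below. This is an illustration under assumptions: it merges the three sources in one step rather than pairwise, and the column names other than ps_enrollment_cnt, along with all values, are made up.

```python
def merge_on(key, *tables):
    """One-to-one merge of lists of dicts, keeping keys present in every table."""
    indexed = [{row[key]: row for row in table} for table in tables]
    common = set(indexed[0]).intersection(*indexed[1:])
    merged = []
    for value in sorted(common):
        combined = {}
        for index in indexed:
            combined.update(index[value])
        merged.append(combined)
    return merged


# Made-up data keyed on the encoded division id.
vdoe = [{"div": 1, "region": "Region X"}, {"div": 2, "region": "Region Y"}]
cohort = {
    "2008-2009": [{"div": 1, "grad_rate": 80.0}, {"div": 2, "grad_rate": 85.0}],
    "2009-2010": [{"div": 1, "grad_rate": 81.0}, {"div": 2, "grad_rate": 86.0}],
}
enrollment = {
    "2008-2009": [{"div": 1, "ps_enrollment_cnt": 150}, {"div": 2, "ps_enrollment_cnt": 55}],
    "2009-2010": [{"div": 1, "ps_enrollment_cnt": 160}, {"div": 2, "ps_enrollment_cnt": 60}],
}

# Merge within each year, then append the yearly results into one panel.
panel = []
for year in cohort:
    for row in merge_on("div", vdoe, cohort[year], enrollment[year]):
        panel.append({**row, "year": year})
```

The result has one row per division per year, which matches the structure of the final dataset described above.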

The Encode Command:

encode div_name, gen(div)

drop div_name

However, I still needed to put some finishing touches on this final dataset. I renamed and labeled

the variables. I created dummy variables for both regions and descriptions for later use in

regressions, and I used a loop to label these dummy variables. I used the egen group command to

create a unique id for each observation, so that I could more easily sort my observations. Due to

extreme irregularities caused by the different data sources, I replaced the postsecondary count of

Highland County with a missing value. Finally, I created a rate of college enrollment variable

for each division. Now that my data had been properly cleaned and compiled, I was ready to

perform analyses.
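
The last two finishing touches can be sketched in Python as follows. The text does not state the formula for the enrollment rate, so this sketch assumes it is the enrollment count divided by cohort size, times 100; cohort_size, college_rate, the dummy naming scheme, and the values are all hypothetical.

```python
def add_college_rate(row):
    """Add a college enrollment rate; stays missing when the count is missing."""
    count, cohort = row["ps_enrollment_cnt"], row["cohort_size"]
    if count is None or not cohort:
        row["college_rate"] = None  # e.g. a count deliberately set to missing
    else:
        row["college_rate"] = 100 * count / cohort  # assumed formula
    return row


def add_dummies(row, column, categories):
    """Add one 0/1 indicator per category of `column`."""
    for category in categories:
        row[f"{column}_{category}"] = int(row[column] == category)
    return row


# Made-up rows; the second has its count replaced with a missing value.
rows = [
    {"description": "city", "ps_enrollment_cnt": 50, "cohort_size": 200},
    {"description": "rural", "ps_enrollment_cnt": None, "cohort_size": 20},
]
descriptions = ["city", "town", "suburban", "rural"]
rows = [add_dummies(add_college_rate(r), "description", descriptions) for r in rows]
```

Setting an implausible count to missing, rather than dropping the division, keeps its graduation and dropout rates available for the other regressions.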

Analysis

Executive Summary

This analysis tests the differences in educational outcomes across the regions of the Commonwealth

of Virginia. School divisions were divided by description (city, town, suburb, or rural)

and by region (Central, Tidewater, Northern Neck, Northern, Valley, Western,

Southwest, and Southside). The dependent variables were graduation rate, dropout rate, and

college enrollment rate. According to regressions, Tidewater Virginia faces the worst outcomes

in the Commonwealth while Northern Virginia has the best outcomes. Cities face the worst

outcomes and suburbs enjoy the best outcomes.

Difference in Means (T-Tests)

As a first measure, I decided to use a difference in means test to see if the average graduation

rate differed across regions and division descriptions. I decided to test the Central, Tidewater, and

Northern regions as these are the largest in the Commonwealth. I found that the mean graduation

rate of Central Virginia was in fact different from the rest of the Commonwealth (81% compared

to 84%) with a p-value of .034. Tidewater was also different (80% compared to

84%) with a p-value below .001. Northern Virginia was the only region tested with a higher mean

(84.5% compared to 81.2% for the rest of the state) with a p-value of .001. Interestingly but not

surprisingly, cities showed a very strong negative effect for graduation rates. Those who live in

a city have a 77.4% graduation rate compared to an 84.2% graduation rate for those who do not.
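
As a reminder of what a difference in means test computes, here is a minimal Python sketch of the pooled-variance (equal-variance) two-sample t statistic. The text does not say which variant of the test was run, so the equal-variance form is an assumption, and the graduation rates below are invented.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(group_a, group_b):
    """Return the equal-variance two-sample t statistic and its degrees of freedom."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    std_err = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / std_err, df


# Made-up graduation rates for a few city and non-city divisions.
city = [76.0, 78.0, 80.0]
other = [83.0, 85.0, 87.0]

t_stat, df = pooled_t(city, other)
```

A large negative t statistic here would correspond to the lower city mean reported above; the p-value then comes from the t distribution with the returned degrees of freedom.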

The Effects of Descriptions

With this in mind, I wanted to test the effects of living in different types of locations or, as I had

labeled them, descriptions of school divisions. How did living in a city, town, or suburb affect

one’s educational outcomes? In order to examine this, I used regressions. Previously, I had

created dummy variables for each school division based on their description. I decided to omit

“rural” and have it serve as the comparison (the constant) because I was most interested in the

effects of living in the other types of locations, specifically cities. Table 1 in the appendix houses the

regression results, and Table 2 houses the summary statistics for region descriptions.
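
Because the only regressors are dummies for one categorical variable, each coefficient is simply that group's mean minus the mean of the omitted group, and the constant is the rural mean. The short Python check below confirms this with the graduation and dropout figures copied from Tables 1 and 2 in the appendix; only the variable names are invented.

```python
# Group means copied from Table 2: (graduation rate, dropout rate).
means = {
    "City": (77.39, 13.14),
    "Town": (82.24, 9.02),
    "Suburbs": (85.04, 8.95),  # labeled "Suburban" in Table 1
    "Rural": (84.52, 8.76),    # omitted category, equal to _cons in Table 1
}

# Coefficients copied from Table 1: (graduation rate, dropout rate).
table1 = {
    "City": (-7.13, 4.38),
    "Town": (-2.28, 0.26),
    "Suburbs": (0.52, 0.19),
}

# Implied coefficient = group mean minus the rural mean.
implied = {
    group: tuple(round(m - r, 2) for m, r in zip(means[group], means["Rural"]))
    for group in table1
}

# True when every implied value matches Table 1 to within rounding.
matches = all(
    abs(a - b) < 0.005
    for group in table1
    for a, b in zip(implied[group], table1[group])
)
```

For example, the city coefficient on graduation rate is 77.39 - 84.52 = -7.13, which is the figure reported in Table 1.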

To make the regression table, I used the outreg command, which automatically exports the

regression table to a Word document. I then edited the table in an Excel document. To make the

summary statistics table, I first tried to use the outsum command. Unfortunately, outsum does

not allow for the level of precision and detail that I wanted. Instead, I used tabstat and copied

the table into an Excel document. From there, making changes was quick and easy. I used this

same process for my regression and summary statistic tables for the effects of regions.

Across graduation rate, dropout rate, and college entrance rate, living in a city had the most

substantial and often the most significant effects on outcomes. City dwellers were 7 percentage

points less likely to graduate and 4 percentage points more likely to drop out but were 7

percentage points more likely to enroll at colleges. The only other statistically significant finding

was that suburbanites were 22.6 percentage points more likely to attend college than their rural

peers. I found it odd that city dwellers were both more likely to drop out or not graduate on time

and more likely to go to college compared to their rural peers. To me, this appears to be evidence

that students in the city either face very poor conditions or somewhat positive conditions.

The Effects of Regions

Next I wanted to test the effect of living in different regions of the Commonwealth. I created

binary variables for each region, and I omitted Northern Neck for the purpose of this regression

as its summary statistics were very average across all three outcomes. Tables 3 and 4 show the

regression outputs and summary statistics for regions respectively. By simply examining the

summary statistics, I would have guessed that Central Virginia, Tidewater, Northern Virginia,

and Southside would have yielded statistically significant results, especially when considering

graduation and college enrollment rates.

However, after running the regressions, only Northern Virginia and Tidewater produced

statistically significant results. Those in the Tidewater area were 3 percentage points less likely

to graduate but also 8 percentage points more likely to enroll in college compared to those in the

Northern Neck. Northern Virginia had the most robust results of all. Those in Northern Virginia

were 4 percentage points more likely to graduate, 2 percentage points less likely to drop out, and

14 percentage points more likely to enroll in a four year college.

Deeper Dives

I wanted to compare Northern Virginia and Tidewater in more detail. In order to do this, I

created a line graph (Graph 3) that plotted graduation and dropout rate and grouped them by

region. This graph really highlights the difference between these two regions. While Northern

Virginia has an 87.4% graduation rate and a 6.9% dropout rate, Tidewater has a 79.9% graduation rate

and an 11.7% dropout rate. Next, I wanted to see the difference between these two regions on a

descriptive level; that is, how do Northern Virginia cities compare to Tidewater cities? Using a

box graph (Graph 4), I found that Northern Virginia cities and suburbs have similar dropout rates

to Tidewater. The biggest difference is in the towns. Northern Virginia towns have about a 5%

dropout rate while Tidewater towns have nearly a 20% dropout rate. I then checked how

many cities, towns, and suburbs are in Tidewater and Northern Virginia. I think what is really

driving the difference between the two regions is that Tidewater has more cities than Northern

Virginia (5 to 3) and that Northern Virginia has more counties than Tidewater (6 to 4).

Next, I wanted to see how individual divisions compared to their respective regions. In order to

do this, I used the egen command to create a regional average variable. I then created a variable that

measured the difference between a division’s dropout rate and the region’s average rate. Using

the command “xi: reg difference i.division, nocons”, I created temporary dummy variables for

each division and identified which divisions had a substantial and statistically significant

difference in dropout rates compared to their region’s

average. Petersburg City with an 11.7 point difference, Roanoke City with a 10.3 point

difference, and Lunenburg County with an 8.4 point difference had the biggest differences

between their dropout rate and their region’s average. Each division was in a different region and

had a different description. Petersburg City is a county in Central Virginia. Roanoke City is a city

in Western Virginia, and Lunenburg is a rural district in Southside. Hanover County (-8.08),

Halifax County (-6.7), and Bland County (-6.6) had the greatest negative statistically

significant differences compared to their region’s average. Hanover County is a rural county in

Central Virginia. Halifax is a town in Southside, and Bland County is a rural county in Southwest

Virginia. Again, it should be noted that this does not measure which counties have the greatest

difference compared to their region’s mean.

Regressing Each District:

xi: reg difference i.division, nocons

drop _I*
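
The two steps behind this regression can be sketched in Python as follows. The sketch assumes the regional average pools both years, which the text does not specify, and it computes each division's mean difference directly instead of running a regression. In a regression on division dummies with no constant, each included coefficient equals that division's mean of the outcome, although the handling of the omitted category by xi is not reproduced here. Column names and values are made up.

```python
from collections import defaultdict
from statistics import mean


def add_region_difference(rows, outcome="dropout_rate"):
    """Attach the regional average and each row's difference from it."""
    by_region = defaultdict(list)
    for row in rows:
        by_region[row["region"]].append(row[outcome])
    region_avg = {region: mean(values) for region, values in by_region.items()}
    for row in rows:
        row["region_avg"] = region_avg[row["region"]]
        row["difference"] = row[outcome] - row["region_avg"]
    return rows


def mean_difference_by_division(rows):
    """Average the difference over each division's yearly observations."""
    by_division = defaultdict(list)
    for row in rows:
        by_division[row["division"]].append(row["difference"])
    return {division: mean(values) for division, values in by_division.items()}


# Made-up panel: two divisions in one region, two years each.
rows = [
    {"division": "Division A", "region": "Region X", "dropout_rate": 10.0},
    {"division": "Division A", "region": "Region X", "dropout_rate": 12.0},
    {"division": "Division B", "region": "Region X", "dropout_rate": 6.0},
    {"division": "Division B", "region": "Region X", "dropout_rate": 8.0},
]

gaps = mean_difference_by_division(add_region_difference(rows))
```

The regression adds what this sketch lacks: a standard error for each division's gap, which is what allows a statement about statistical significance.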

Discussion, Caveats, and Concerns

Through my analysis, I found that cities and Tidewater face the worst outcomes in

education and that suburbs and Northern Virginia enjoy the best outcomes. While it

was interesting to measure the differences in educational outcomes across regions, this analysis

does not make an attempt to find the underlying causal mechanism behind these outcomes.

Living in Tidewater does not inherently result in lower outcomes. School funding and other

socio-economic factors likely are the cause of these outcomes. Another cause for concern is the

sample size of some of the groups. There are simply not enough towns in each

region to make statistical inferences.

Appendix:

Graph 1:

Graph 2:

Graph 3:

Graph 4:

Table 1:

The Effect of Region Description

            Grad Rate     Dropout Rate   College Rate
City        -7.13         4.38           7.18
            (5.56)**      (4.63)**       (2.35)*
Town        -2.28         0.26           0.30
            (1.91)        (0.30)         (0.10)
Suburban    0.52          0.19           22.64
            (0.42)        (0.21)         (7.77)**
_cons       84.52         8.76           41.38
            (159.81)**    (22.43)**      (32.62)**
R2          0.12          0.08           0.20
N           262           262            260

* p<0.05; ** p<0.01

Table 2:

Description Summary Statistics

Description   Grad Rate   Dropout Rate   College Rate
City          77.39       13.14          48.56
Town          82.24       9.02           41.67
Suburbs       85.04       8.95           64.01
Rural         84.52       8.76           41.38
Total         83.39       9.36           45.44

Table 3:

The Effect of Regions

            Grad Rate     Dropout Rate   College Rate
central     -2.44         1.60           4.72
            (1.45)        (1.31)         (1.16)
tidewater   -3.34         2.33           8.89
            (1.99)*       (1.91)         (2.18)*
nova        4.09          -2.45          14.43
            (2.58)*       (2.14)*        (3.76)**
valley      0.90          -1.61          5.39
            (0.57)        (1.40)         (1.38)
west        0.80          0.00           1.66
            (0.48)        (0.00)         (0.41)
swva        0.78          -0.48          -7.17
            (0.49)        (0.42)         (1.87)
south       -1.87         1.86           -6.12
            (1.04)        (1.43)         (1.41)
_cons       83.29         9.40           42.43
            (72.33)**     (11.27)**      (15.21)**
R2          0.10          0.10           0.16
N           262           262            260

* p<0.05; ** p<0.01

Table 4:

Region Summary Statistics

Region              Grad Rate   Dropout Rate   College Rate
Central Virginia    80.85       11.00          47.16
Tidewater           79.95       11.73          51.33
Northern Neck       83.29       9.40           42.43
Northern Virginia   87.38       6.95           56.86
Valley              84.19       7.79           47.82
Western Virginia    84.09       9.39           44.10
Southwest           84.07       8.91           35.26
Southside           81.43       11.26          36.31
Total               83.39       9.36           45.44
