12/4/2015
“The Regions of Virginia and Educational Outcomes”
By Corydon W. Baylor
Prepared for:
Database Management in the Social Sciences
Database Management Briefing
Working with the Raw Data
For this project, I wanted to perform an analysis on the effect living in different regions has on
educational outcomes in the Commonwealth of Virginia. I utilized six databases from the
Virginia Department of Education (VDOE) for my analysis. I used 2008 to 2009 and 2009 to
2010 graduation rate and dropout rate cohort data. Additionally, I used 2008 to 2009 and 2009 to
2010 postsecondary enrollment and cohort size data. Finally, I used a database that classified
each school division as city, rural, suburban, or town. Using information stored in a table on the
VDOE website, I added regional information to each school division as well. VDOE breaks
Virginia into eight regions: Central Virginia, Tidewater, Northern Neck, Northern Virginia,
Valley, Western Virginia, Southwest, and Southside.
Challenges and Opportunities in the Data
The files were relatively easy to use but suffered from poor documentation and a lack of
standardized sources across datasets. Fortunately, each dataset contained a common variable, the
school division, which served as the unit of observation across datasets. The college
enrollment and graduation rate datasets also contained school and state level data, but for the
purposes of my analysis, I dropped these observations.
I found the lack of comprehensive documentation a true challenge throughout the entire project.
While documentation exists describing each variable and how the data was collected, there was
no easily accessible information explaining how the data was arranged. The division level data
contained thousands of duplicates but no clear directions explaining which observations were
duplicates. For example, if the gender variable contained an “M”, the observation displayed
information for males. If the gender variable was blank, the observation displayed information for
both males and females. There were ten such binary variables, and each one had to be blank for an
observation to describe the whole population. As such, each school division had dozens of observations, but only
one contained information for the entire division, because only one observation had all of its binary
variables blank. The college enrollment data had similar issues. Proper documentation would have
proved invaluable in overcoming these challenges.
An additional challenge was the underlying quality of the data and the lack of a standardized
source across datasets. I found it somewhat concerning that the graduation rate and the dropout
rate did not sum to 100%, though I suppose that the remaining percentage were students who were
repeating grades. Additionally, the college enrollment data and the graduation and dropout data
come from two different sources. The college enrollment data are gathered from the National Student
Clearinghouse, which uses estimates instead of actual counts. This results in some hard-to-reconcile
figures when comparing graduation rates, cohort size, and college enrollment rates. For
example, Fairfax County has a higher college enrollment rate than graduation rate. The National
Student Clearinghouse also reported more students enrolling in college than graduated in the
cohort for Highland County. Clearly, using two different data sources results in distortions when
combining datasets. Personally, I believe that the National Student Clearinghouse makes some
errors in its estimation, and these errors become highly visible when looking at small school
divisions. This is a major limitation.
Cleaning the Datasets
The first major challenge I faced was cleaning the six datasets. Cleaning the regional description
database, henceforth called the VDOE database, was relatively easy. I needed to add labels to my
region variable and combine different variables into a simplified division description variable.
Finally, I needed to drop Lexington City, as it was not a division in the other datasets.
Cleaning the graduation and dropout datasets, henceforth called the cohort datasets, proved more
difficult. As previously mentioned, each school division contained only one observation that was
relevant to my analysis, a fact missing from the documentation. A decent amount of guesswork
therefore went into cleaning this data. Eventually, I discovered the problem and dropped all
observations that were not division level and that did not contain empty binary variables. I used a
loop to clean both the 2008-2009 and 2009-2010 data at the same time. I additionally dropped
the Department of Corrections and the School for the Deaf and Blind as these are not geographic
regions. I also had to drop these observations in the college enrollment dataset.
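The cleaning rule described above can be sketched in Stata roughly as follows. This is only an illustration: the file names, the level variable, and the three subgroup variables (standing in for the ten binary variables) are assumptions, not the names used in the project files.

```stata
* Sketch only: file names and variable names below are hypothetical.
foreach yr in 0809 0910 {
    use "cohort_`yr'.dta", clear

    * Keep division-level rows only (assumes a string variable named level).
    keep if level == "Division"

    * Keep the all-students row: every subgroup variable must be blank.
    * gender, race, and disabled stand in for the ten binary variables.
    foreach v of varlist gender race disabled {
        keep if missing(`v')
    }

    * Drop entities that are not geographic school divisions
    * (the exact strings in div_name are assumed).
    drop if inlist(div_name, "Department of Corrections", "School for the Deaf and Blind")

    save "cohort_`yr'_clean.dta", replace
}
```

After this loop, each division should appear exactly once per year, which can be confirmed with the isid command on the division variable.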
Cleaning the college enrollment dataset proved to be the most difficult. Again, I used a loop to
clean across the two years in question. Due to small subgroup sizes, data on gender, race, and other
variables were restricted, so I had to replace them with
missing values. Next, I dropped all observations at the
school and state level and observations that provided
counts for non-four-year institutions, as I wanted to restrict my analysis to students who went to
four-year universities. After cleaning the data, I was left with each division containing two
observations. One observation had a count of students enrolling in a public four-year college. The
other had a count of students enrolling in a private four-year college. I used the collapse
command to sum the counts of public and private enrollment into one observation per school division.
The Collapse Command:
collapse (sum) ps_enrollment_cnt (mean) year, by(div_name)
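To make the effect of collapse concrete, here is a small worked example on toy data. The division names, the sector variable, and the counts are made up for illustration; only the collapse line mirrors the command above.

```stata
* Toy data: two divisions, one public row and one private row each.
clear
input str10 div_name str7 sector ps_enrollment_cnt year
"Alpha" "public"  120 2009
"Alpha" "private"  30 2009
"Beta"  "public"   45 2009
"Beta"  "private"  15 2009
end

* One row per division: counts are summed, year is carried through
* (the mean of identical years is that year).
collapse (sum) ps_enrollment_cnt (mean) year, by(div_name)
list
```

The result has two rows: Alpha with a count of 150 and Beta with a count of 60, both with year 2009. The sector variable is dropped because it is named in neither the statistics list nor by().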
Creating One Clean Dataset
After cleaning each individual dataset, all that was left was to merge everything together. At first,
I had a lot of trouble merging on the division variable as it was a string variable. I decided to use
the encode command to remedy this problem. The encode command assigns a unique integer for
each unique string it finds in a variable. It then keeps the string as the integer's label. Because
encode numbers the distinct strings in alphabetical order, and all six datasets contained the same
division names, the encode command assigned the same integer to the same string in each
database. Of course, this would not work if a single school division were missing or spelled
differently in one of the datasets. However, after comparing the
division variables side by side, I found that I did not have this problem. Because each
division now had a unique and consistent integer, I could easily merge across all six datasets. I
consider coming up with a simple way to merge string variables together using encode to be my
proudest data management achievement.
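For comparison, Stata's merge command also accepts a string key, so the same match can be sketched without relying on encode producing identical codes in every file. The file names below are hypothetical, and the sketch assumes div_name is still present as a string in both files.

```stata
* Sketch only: file names are hypothetical; div_name is the string key.
use "vdoe_clean.dta", clear

* merge matches on string keys directly, one row per division in each file.
merge 1:1 div_name using "cohort_0809_clean.dta"

* Show any division that failed to match, then stop if there are any.
list div_name _merge if _merge != 3
assert _merge == 3
drop _merge
```

The assert line makes a mismatch (for example, a division spelled differently in one file) fail loudly instead of silently producing unmatched rows.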
First, I merged the cohort dataset to the VDOE dataset. Next, I merged the college enrollment dataset
to the VDOE dataset. Finally, I merged these two datasets together. This left me with two
datasets: one containing VDOE, cohort, and college enrollment data for 2008-2009, and
one containing the same data for 2009-2010. I appended the two datasets together and was left
with one big dataset that contained everything I needed.
However, I still needed to put some finishing touches on this final dataset. I renamed and labeled
the variables. I created dummy variables for both regions and descriptions for later use in
regressions, and I used a loop to label these dummy variables. I used the egen group command to
create a unique id for each observation, so that I could more easily sort my observations. Due to
extreme irregularities caused by the different data sources, I replaced the postsecondary count of
Highland County with a missing value. Finally, I created a rate of college enrollment variable
for each division. Now that my data had been properly cleaned and compiled, I was ready to
perform analyses.
The Encode Command:
encode div_name, gen(div)
drop div_name
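The last two finishing touches, the region dummies and the college enrollment rate, might look like the following sketch. The variable names region, cohort_size, and college_rate are assumptions, and so is the choice of cohort size as the denominator of the rate.

```stata
* Sketch only: region, cohort_size, and college_rate are assumed names.

* One 0/1 dummy per region: region_1, region_2, and so on.
tabulate region, generate(region_)

* College enrollment rate as a percentage of the cohort (assumed definition).
* Divisions with a missing count, such as Highland County, stay missing.
generate college_rate = 100 * ps_enrollment_cnt / cohort_size
```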
Analysis
Executive Summary
This analysis tests the differences in educational outcomes across the regions of the Commonwealth
of Virginia. School divisions were classified by description (city, town, suburb, or rural)
and by region (Central, Tidewater, Northern Neck, Northern, Valley, Western,
Southwest, and Southside). The dependent variables were graduation rate, dropout rate, and
college enrollment rate. According to the regressions, Tidewater Virginia faces the worst outcomes
in the Commonwealth while Northern Virginia has the best outcomes. Cities face the worst
outcomes and suburbs enjoy the best outcomes.
Difference in Means (T-Tests)
As a first measure, I used a difference in means test to see whether the average graduation
rate differed across regions and across division descriptions. I decided to test the Central, Tidewater, and
Northern regions, as these are the largest in the Commonwealth. I found that the mean graduation
rate of Central Virginia was in fact different from the rest of the Commonwealth (81% compared
to 84%) with a p-value of .034. Tidewater was also different (80% compared to
84%) with a p-value below .001. Northern Virginia was the only region tested with a higher mean
(84.5% compared to 81.2% for the rest of the state) with a p-value of .001. Interestingly but not
surprisingly, cities showed a very strong negative effect on graduation rates. City divisions
have a 77.4% graduation rate compared to an 84.2% graduation rate for all other divisions.
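A difference in means test of this kind can be sketched with Stata's ttest command. The variable names grad_rate, nova, and city are assumptions; nova and city are taken to be 0/1 dummies marking Northern Virginia divisions and city divisions.

```stata
* Sketch only: grad_rate, nova, and city are assumed variable names.

* Mean graduation rate: Northern Virginia versus the rest of the state.
ttest grad_rate, by(nova)

* Mean graduation rate: city divisions versus all other divisions.
ttest grad_rate, by(city)
```

The by() variable must take exactly two values, which is why each region or description is tested through its own dummy rather than through the multi-category region variable.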
The Effects of Descriptions
With this in mind, I wanted to test the effects of living in different types of locations, or, as I had
labeled them, descriptions of school divisions. How did living in a city, town, or suburb affect
one’s educational outcomes? In order to examine this, I used regressions. Previously, I had
created dummy variables for each school division based on its description. I decided to omit
“rural” and have it serve as the comparison (the constant) because I was most interested in the
effects of living in the other types of locations, specifically cities. Table 1 in the appendix houses the
regression results, and Table 2 houses the summary statistics for division descriptions.
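The regression setup described here can be sketched as a short loop over the three outcomes. All variable names are assumptions made for illustration.

```stata
* Sketch only: all variable names here are assumed.
* city, town, and suburban are 0/1 dummies; rural is left out as the baseline.
foreach y of varlist grad_rate dropout_rate college_rate {
    regress `y' city town suburban
}
```

With only these dummies on the right-hand side, the constant is the rural mean and each coefficient is that group's gap from it. The appendix tables agree: the graduation rate constant of 84.52 in Table 1 equals the rural graduation rate in Table 2, and 84.52 - 7.13 = 77.39 equals the city graduation rate.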
To make the regression table, I used the outreg command, which automatically exports the
regression table to a Word document. I then edited the table in an Excel document. To make the
summary statistics table, I first tried to use the outsum command. Unfortunately, outsum does
not allow for the level of precision and detail that I wanted. Instead, I used tabstat and copied
the table into an Excel document. From there, making changes was quick and easy. I used this
same process for my regression and summary statistics tables for the effects of regions.
Across graduation rate, dropout rate, and college enrollment rate, living in a city had the most
substantial and often the most significant effects on outcomes. City dwellers were 7 percentage
points less likely to graduate and 4 percentage points more likely to drop out but were 7
percentage points more likely to enroll in college. The only other statistically significant finding
was that suburbanites were 22.6 percentage points more likely to attend college than their rural
peers. I found it odd that city dwellers were both more likely to drop out or not graduate on time
and more likely to go to college compared to their rural peers. To me, this appears to be evidence
that students in the city face either very poor conditions or somewhat positive conditions.
The Effects of Regions
Next I wanted to test the effect of living in different regions of the Commonwealth. I created
binary variables for each region, and I omitted Northern Neck for the purpose of this regression
as its summary statistics were very average across all three outcomes. Tables 3 and 4 show the
regression outputs and summary statistics for regions, respectively. By simply examining the
summary statistics, I would have guessed that Central Virginia, Tidewater, Northern Virginia,
and Southside would have yielded statistically significant results, especially when considering
graduation and college enrollment rates.
However, after running the regressions, only Northern Virginia and Tidewater produced
statistically significant results. Those in the Tidewater area were 3 percentage points less likely
to graduate but also 8 percentage points more likely to enroll in college compared to those in the
Northern Neck. Northern Virginia had the most robust results of all. Those in Northern Virginia
were 4 percentage points more likely to graduate, 2 percentage points less likely to drop out, and
14 percentage points more likely to enroll in a four-year college.
Deeper Dives
I wanted to compare Northern Virginia and Tidewater in more detail. In order to do this, I
created a line graph (Graph 3) that plotted graduation and dropout rates and grouped them by
region. This graph really highlights the difference between these two regions. While Northern
Virginia has an 87.4% graduation rate and a 6.9% dropout rate, Tidewater has a 79.9% graduation rate
and an 11.7% dropout rate. Next, I wanted to see the difference between these two regions on a
descriptive level; that is, how do Northern Virginia cities compare to Tidewater cities? Using a
box graph (Graph 4), I found that Northern Virginia cities and suburbs have similar dropout rates
to those in Tidewater. The biggest difference is in the towns. Northern Virginia towns have about a 5%
dropout rate while Tidewater towns have a dropout rate near 20%. I then checked how
many cities, towns, and suburbs are in Tidewater and Northern Virginia. I think what is really
driving the difference between the two regions is that Tidewater has more cities than Northern
Virginia (5 to 3) and that Northern Virginia has more counties than Tidewater (6 to 4).
Next, I wanted to see how individual divisions compared to their respective regions. In order to
do this, I used the egen command to create a regional average variable. I then created a variable that
measured the difference between a division’s dropout rate and the region’s average rate. Using
the command “xi: reg difference i.division, nocons”, I created temporary dummy variables for each
division and identified which divisions had a
substantial and statistically significant difference in dropout rates compared to their region’s
average. Petersburg City with an 11.7 point difference, Roanoke City with a 10.3 point
difference, and Lunenburg County with an 8.4 point difference had the biggest differences
between their dropout rate and their region’s average. Each division was in a different region and
had a different description. Petersburg City is a county in Central Virginia. Roanoke City is a city
in Western Virginia, and Lunenburg is a rural district in Southside. Hanover County (-8.08),
Halifax County (-6.7), and Bland County (-6.6) had the greatest negative statistically
significant differences compared to their region’s average. Hanover County is a rural county in
Central Virginia. Halifax is a town in Southside, and Bland County is a rural county in Southwest
Virginia. Again, it should be noted that this does not measure which counties have the greatest
difference compared to their region’s mean.
Regressing Each District:
xi: reg difference i.division, nocons
drop _I*
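The steps leading up to this regression can be sketched as follows. The variable names region, dropout_rate, and region_avg are assumptions. The last line is an alternative written with factor-variable notation rather than the xi prefix used in the project: ibn. requests no base category, so that together with noconstant every division gets its own coefficient.

```stata
* Sketch only: region, dropout_rate, and region_avg are assumed names.

* Regional average dropout rate, repeated on every row in the region.
bysort region: egen region_avg = mean(dropout_rate)

* Positive values mean the division's dropout rate is above its region's.
generate difference = dropout_rate - region_avg

* One coefficient per division: each is that division's mean difference.
regress difference ibn.division, noconstant
```

Because each division has one observation per year, each coefficient is the average of that division's differences across the two years, and its t-statistic tests whether that average differs from zero.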
Discussion, Caveats, and Concerns
Through my analysis, I found that cities and Tidewater face the worst outcomes in
education and that suburbs and Northern Virginia enjoy the best outcomes. While it
was interesting to measure the differences in educational outcomes across regions, this analysis
does not attempt to find the underlying causal mechanism behind these outcomes.
Living in Tidewater does not inherently result in lower outcomes. School funding and other
socio-economic factors are likely the cause of these outcomes. Another cause for concern is the
sample size of some of the groups. There are simply not enough towns in each
region to make statistical inferences.
Table 1:
The Effect of Region Description
            Grad Rate     Dropout Rate   College Rate
City        -7.13         4.38           7.18
            (5.56)**      (4.63)**       (2.35)*
Town        -2.28         0.26           0.30
            (1.91)        (0.3)          (0.1)
Suburban    0.52          0.19           22.64
            (0.42)        (0.21)         (7.77)**
_cons       84.52         8.76           41.38
            (159.81)**    (22.43)**      (32.62)**
R2          0.12          0.08           0.2
N           262           262            260
* p<0.05; ** p<0.01
Table 2:
Description Summary Statistics
Description   Grad Rate   Dropout Rate   College Rate
City          77.39       13.14          48.56
Town          82.24       9.02           41.67
Suburbs       85.04       8.95           64.01
Rural         84.52       8.76           41.38
Total         83.39       9.36           45.44
Table 3:
The Effect of Regions
            Grad Rate     Dropout Rate   College Rate
central     -2.44         1.6            4.72
            (1.45)        (1.31)         (1.16)
tidewater   -3.34         2.33           8.89
            (1.99)*       (1.91)         (2.18)*
nova        4.09          -2.45          14.43
            (2.58)*       (2.14)*        (3.76)**
valley      0.9           -1.61          5.39
            (0.57)        (1.4)          (1.38)
west        0.8           0              1.66
            (0.48)        (0)            (0.41)
swva        0.78          -0.48          -7.17
            (0.49)        (0.42)         (1.87)
south       -1.87         1.86           -6.12
            (1.04)        (1.43)         (1.41)
_cons       83.29         9.4            42.43
            (72.33)**     (11.27)**      (15.21)**
R2          0.1           0.1            0.16
N           262           262            260
* p<0.05; ** p<0.01
Table 4:
Region Summary Statistics
Regions             Grad Rate   Dropout Rate   College Rate
Central Virginia    80.85       11             47.16
Tidewater           79.95       11.73          51.33
Northern Neck       83.29       9.4            42.43
Northern Virginia   87.38       6.95           56.86
Valley              84.19       7.79           47.82
Western Virginia    84.09       9.39           44.1
Southwest           84.07       8.91           35.26
Southside           81.43       11.26          36.31
Total               83.39       9.36           45.44