Database Design & Implementation Report

Date post: 13-Feb-2017
Upload: ian-morris
Page 1: Database Design & Implementation Report

GIS Data & Databases Coursework: University Student Applicants and their Demographics

Produced by: Ian Morris

Introduction

The aim of this project is to design and implement a relational spatial database in order to carry out analyses of university applicants across the UK, including Scotland and Northern Ireland.

Using the data provided, the specific objectives are as follows:

1. Create an ER schema, use it to identify relationships between entities and then normalise it to produce the design of the geodatabase.

2. Implement the normalised geodatabase using ESRI.

3. Assess the functionality of the geodatabase by running spatial and non-spatial SQL queries against it.

4. Critique the design and implementation of the relational spatial database in the form of a report.

The visualisation of any spatial patterns created by student application demographic data enables universities and

educational establishments to gain a better understanding of how to market themselves and recruit students.

Therefore the purpose of this work is to create a spatial database, or “geodatabase”, by completing the aforementioned objectives. Used effectively, it will positively influence the success of the organisations that adopt it by enabling them to undertake both spatial and non-spatial analysis, and thereby to better understand, and successfully recruit, their target student demographic.

Database Design

Initial Data

The initial data was provided in the form of tables, shown below as columns and their relevant attribute data.

This data was then manipulated into the “Final ER Schema” and “Final Attributes” tables as described and shown below:

- New and separate lookup tables/entities were created from the “Fees”, “Sex”, “Main Cycle_Clearing” and “Mode of Study” fields in the “University Applicants Data” table initially provided.

o The “Ethnic Group” and “Post-confirmation actual qual on entry” fields were also turned into their own separate lookup tables by sorting and transposing the data they contained into two new tables named “Ethnicity Code” and “Qualification Codes” respectively.


o A “Local Authorities” lookup table/entity was also created from the “GEO_CODE” and “GEO_LABEL”

fields in the same table.

- Redundant data such as “CDU_ID” and “GEO_TYPE” was removed from all tables, as “GEO_CODE” could act as the primary key in a new “Applicants” table and, subsequently, as the foreign key in related tables. “CDU_ID” and “GEO_TYPE” only defined attributes as belonging to a local authority, which became redundant with the creation of a “Local Authorities” table.

- The two sets of spatial data given for each applicant (“PCODE”, and “Easting” and “Northing”) in the “University Applicants Data” table were aggregated to Local Authority level by the methods described later under the heading “Geodatabase implementation, Stage 1: Creation of tables”. After this had been done, the “Easting” and “Northing” data became redundant: it only served to give the location of the postcode (“PCODE”) data, and was therefore repeating data that does not define the primary key (the “ID” field in this instance) in the new “Applicants” table.

- The above was all done with the intent of normalising the database to 3NF by:

o Ensuring all data was atomic, no rows or columns were repeated and each field was uniquely named (all of which held for the data as initially provided). Thus the database is in first normal form (1NF).

o To put the database in second normal form (2NF) it was ascertained that all attributes in each row were dependent on their primary key, i.e. “GEO_CODE”, “PCODE”, “Fees”, “SEX”, “Ethnic Group”, “Post-confirmation actual qual on entry”, “Main Cycle_Clearing” and “Mode of Study” all depend on “ID” in the new “Applicant” table.

o Data that was repeating (not mutually independent) and did not depend on the primary key of its table was then removed, i.e. “Easting” and “Northing” in the old “University Applicant Data” table.
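The lookup-table decomposition described above can be sketched in SQL. The following is an illustrative sketch using SQLite with a simplified, hypothetical two-column schema, not the actual coursework tables:

```python
import sqlite3

# In-memory database to illustrate the lookup-table decomposition.
con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")

# Lookup table created from the repeating "Sex" field.
con.execute("CREATE TABLE Sex (SexCode TEXT PRIMARY KEY, Description TEXT)")
con.executemany("INSERT INTO Sex VALUES (?, ?)", [("M", "Male"), ("F", "Female")])

# The "Applicant" table stores only the code; the repeating description
# lives once in the lookup table, which is what removes the redundancy.
con.execute("""CREATE TABLE Applicant (
    ID INTEGER PRIMARY KEY,
    GEO_CODE TEXT,
    SexCode TEXT REFERENCES Sex(SexCode))""")
con.executemany("INSERT INTO Applicant VALUES (?, ?, ?)",
                [(1, "S12000028", "F"), (2, "E07000202", "M")])

# Joining restores the full description without storing it per applicant.
rows = con.execute("""SELECT a.ID, s.Description
                      FROM Applicant a JOIN Sex s ON a.SexCode = s.SexCode""").fetchall()
print(rows)  # -> [(1, 'Female'), (2, 'Male')]
```

The same pattern applies to each of the other lookup tables (“Fees”, “Mode of Study”, “Ethnicity Code”, and so on).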

Final ER Schema

Note: Attributes of the entities shown in the above “Final ER Schema” table are shown below in the “Final

Attributes” table. To clarify any unclear relationships above – Many “Applicants” can have 1 “Ethnicity Code” and 1

“Ethnicity Code” can have Many “Applicants”; Many “Applicants” can have 1 “Qualification Code” and 1

“Qualification Code” can have Many “Applicants”; Many “UK_LAD.shp” (Local Authorities shapefile) can have 1

“English Region” (English Regions shapefile) and 1 “English Region” can have Many “UK_LAD.shp”;


Final Attributes

Note: Primary Key in applicant table shown in red. This also acts as the Foreign Key in other tables, shown in green,

to relate the tables.

Geodatabase Implementation

The implementation of the Geodatabase is then broken down into stages as follows:

Stage 1: Creation of tables

o Individual .csv files were created for each entity and its attributes from the data as displayed in the “Final Attributes” table above.

o In order to aggregate the spatial data (“PCODE”, Eastings and Northings) and relate each individual applicant from the “University Applicant” table to a Local Authority, the following two steps were undertaken:

Selecting the correct “University Applicant” .csv tab in the ArcCatalog window and then choosing “Create Feature Class > From XY Table”, setting the X Field to “Easting” and the Y Field to “Northing”, before setting the co-ordinate system to match the shapefile (British National Grid) and displaying it on the map (figure 1). This is important later on so that the “Select By Location” tool can be used for objective three.

 This also has the desired effect of creating a shapefile, entitled “Join_Output_2” in this case, that displays the location of each applicant as a point feature (based on their easting and northing) and which can be added to the database for use later on.

 Next, the “University Applicant” .csv and the “UK_Lad” shapefile had to be joined so that applicants could be further categorised by the local authority they hail from (i.e. South Ayrshire). This was done by selecting the .csv file, using the join tool from “Joins and Relates”, “joining data from another layer (the .shapefile in this case) based on spatial data” and giving “each point the attributes of the polygon that it falls inside” (figure 2). The newly created table now shows every applicant as being related to a local authority.


Figure 1 – Feature class creation from the provided Easting and Northing data in the “University Applicant Data”

Figure 2 – Joining the “University Applicants” .csv file and “UK_LAD” shapefile to create a new table containing the

data from both tables.
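At its core, the spatial join above (giving each applicant point the attributes of the polygon it falls inside) is a point-in-polygon test. A minimal pure-Python sketch of that logic, using hypothetical coordinates rather than the real British National Grid data:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: returns True if (x, y) falls inside the polygon,
    given as a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Count crossings of a horizontal ray cast to the right of the point.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

# Hypothetical local authority polygon (a simple square) and applicant points.
local_authority = [(0, 0), (10, 0), (10, 10), (0, 10)]
applicants = {"A1": (3, 4), "A2": (15, 5)}

# "Join": attach the polygon's identity to each point that falls inside it.
joined = {aid: "LA_1" if point_in_polygon(x, y, local_authority) else None
          for aid, (x, y) in applicants.items()}
print(joined)  # -> {'A1': 'LA_1', 'A2': None}
```

ArcGIS performs the same containment test against every local authority polygon at once; the sketch shows only the single-polygon case.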


Stage 2: Importing data into the geodatabase

o Created a new File Geodatabase, naming it “Database DB” (figure 3), for the purposes of the assignment.

o Imported the created .csv files via “Import > Table (multiple)” and shapefiles via “Import > Feature

Class (multiple)”

o The creation of a separate geodatabase for this (and any) project, and the importing of any relevant files (.csv, shapefiles, etc.) into it, needs to be done so that the data contained within it is readily accessible and locatable when working on the project. This helps to improve project efficiency.

Figure 3 – Imported tables and shapefiles in the newly created “Database DB” File Geodatabase

Stage 3: Creation of subtypes and attribute domains

o Attribute domains were then created by selecting the File Geodatabase and navigating to the “Domains” tab, where domain names, properties and coded values could be defined (figure 4) according to the entities shown in the Final ER Schema. The only field type used was “Text” as all coded values entered were either text or alphanumeric (i.e. sexes included “M”, for male, modes of study included “PartTime”, etc.), and the only fields to include alphanumeric characters were those of the Local Authorities (i.e. S12000028).


Figure 4 – Attribute domain creation showing the defined domain names, properties and values

o Once the legal parameters of each attribute domain had been defined they were associated with the corresponding fields in the main “Applicant”, “Qualification Stats”, “Occupation Stats” and “Ethnicity Stats” tables, and default values were applied to them by creating a subtype for each table (figure 5). For example, the “Fees” attribute domain was associated with the “Fees” field name in the “Applicant” table and a default value of “Unknown” applied to the field. This was done by opening the properties of each table, going to the “Subtypes” tab, creating a new subtype and updating the “Default Values and Domains” section (figure 5).

o This is done to preserve the integrity of the data by ensuring that only valid values can be entered into the database, and to help normalise it to at least 2NF by removing repeating data, putting it into a separate lookup table and relating the tables via a primary key. It also helps to minimise the data stored, freeing up space on hard drives and across networks.


Figure 5 – The creation of subtypes and the associating of field names to attribute domains and default value

creation to preserve database integrity.

o Certain values, such as “PCODE” (representing postcodes), were not assigned an attribute domain as this particular attribute defines each applicant very precisely (although not quite uniquely, as a postcode defines an area rather than an individual address). It was therefore decided that creating attribute domains for almost-unique values was not worthwhile, as this would be overly time-consuming and of no real benefit to the database design. However, it was determined that they be left in the “Applicant” table as they further define its primary key (the student ID) and help normalise the database overall to 3NF.
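Conceptually, a coded-value attribute domain is just a constraint on the legal values of a field, with a subtype supplying the default. A rough Python analogue of the stage above, using illustrative codes drawn from the examples already given (the exact domain contents are assumptions):

```python
# Hypothetical coded-value domains mirroring the "Text"-typed domains
# defined in ArcCatalog (contents illustrative, not the full coursework lists).
DOMAINS = {
    "Sex": {"M", "F", "Unknown"},
    "Mode of Study": {"FullTime", "PartTime", "Unknown"},
}

def validate(field, value, default="Unknown"):
    """Reject values outside the field's domain, mimicking how the geodatabase
    preserves integrity; fall back to the subtype's default value."""
    domain = DOMAINS.get(field)
    if domain is None:
        return value          # no domain associated with this field (e.g. PCODE)
    return value if value in domain else default

print(validate("Sex", "M"))                  # -> M
print(validate("Mode of Study", "Evening"))  # -> Unknown (not a legal coded value)
print(validate("PCODE", "AB1 2CD"))          # -> AB1 2CD (no domain assigned)
```

The geodatabase enforces this at data-entry time; the sketch only shows the validation rule itself.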


Stage 4: Defining relationships

o Relationships between entities/tables need to be created in order to describe the relationships between the attributes within them. In the following example, the local authority of each applicant is defined from the “Local_Authorities” and “Applicants” tables respectively.

o The first stage in this process is to identify the relationships that need creating. This can be done by

viewing the “Final ER Schema” table in the “Database Design” section above.

o In ArcCatalog a new relationship class is then created by selecting the File Geodatabase and

navigating through “New > Relationship Class” where the name of the relationship class and the

origin (“Local_Authorities”) and destination (“Applicants”) tables are defined (figure 6).

Figure 6 – The first stage in defining a New Relationship Class in ArcCatalog. Naming the relationship and defining

origin and destination tables

o Next the type of relationship is defined. In this case the “Local_Authorities” and “Applicants” tables can exist independently of one another in the database, so “peer to peer” is selected (figure 7).

Figure 7 – The second stage in defining a New Relationship Class in ArcCatalog, the type of relationship between two

tables.


o Next the labels for the forwards and backwards relationship between the origin table and

destination table are to be defined. In this case the forwards relationship message is “LA of

Applicant” as you move from “Local_Authorities” to “Applicant” and the backwards relationship

message is “Applicant LA” as you move from “Applicant” to “Local_Authorities” (figure 8).

Figure 8 – The third stage in defining a New Relationship Class in ArcCatalog, label specification for the relationships

to and from the origin and destination tables

o Next, the cardinality of the relationship (1-1, 1-M or M-N) between the origin and destination is defined. In this example “1 Local Authority can have Many Applicants” and “Many Applicants can have 1 Local Authority”, so the relationship is “One to Many”, or 1-M (figure 9).

Figure 9 – The fourth stage in defining a New Relationship Class in ArcCatalog, cardinality creation


o The next stage involves the creation of a new table, if necessary. In this case the relationship between “Applicant” and “Local_Authorities” is already fully defined and there is no need to create an additional table to define it further (figure 10).

Figure 10 – The fifth stage in defining a New Relationship Class in ArcCatalog, creating additional tables.

o The next stage involves relating the two tables via a key (the primary key in the origin table and the corresponding foreign key in the destination table). In this example the primary key is the “GEO_CODE” field in the origin table and its corresponding foreign key in the destination table is also “GEO_CODE”, as both fields hold the code that identifies a particular Local Authority (i.e. E07000202 identifies the local authority of Ipswich) (figure 11).

Figure 11 – The sixth stage in defining a New Relationship Class in ArcCatalog, selecting the primary and foreign key fields.

o The final stage is simply reviewing the data before creating the relationship (figure 12).


Figure 12 – The final stage in defining a New Relationship Class in ArcCatalog, reviewing the relationship

o The preceding steps were then repeated to define the remaining relationships between other

entities as shown in the “Final ER Schema” table.
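In relational terms, the relationship class built in figures 6 to 12 corresponds to a one-to-many foreign-key link on “GEO_CODE”. A sketch of the equivalent constraint in plain SQL (SQLite, with hypothetical minimal columns rather than the full coursework schema):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")

# Origin table: one row per local authority, GEO_CODE as primary key.
con.execute("CREATE TABLE Local_Authorities (GEO_CODE TEXT PRIMARY KEY, GEO_LABEL TEXT)")
con.execute("INSERT INTO Local_Authorities VALUES ('E07000202', 'Ipswich')")

# Destination table: many applicants may carry the same GEO_CODE (1-M cardinality).
con.execute("""CREATE TABLE Applicants (
    ID INTEGER PRIMARY KEY,
    GEO_CODE TEXT REFERENCES Local_Authorities(GEO_CODE))""")
con.executemany("INSERT INTO Applicants VALUES (?, ?)",
                [(1, "E07000202"), (2, "E07000202")])

# The backwards relationship ("Applicant LA"): find each applicant's local authority.
rows = con.execute("""SELECT a.ID, la.GEO_LABEL
                      FROM Applicants a JOIN Local_Authorities la USING (GEO_CODE)""").fetchall()
print(rows)  # -> [(1, 'Ipswich'), (2, 'Ipswich')]

# The constraint rejects an applicant referencing a non-existent local authority.
try:
    con.execute("INSERT INTO Applicants VALUES (3, 'X99999999')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```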

Stage 5: Creating topology

o Lastly, topological rules were created in order to define the spatial relationships between data in

the “UK_LAD”, “English Regions” and “Applicant” shapefiles and ensure they retained topological

features (containment, adjacency and connectivity) where applicable.

o The first stage in this process was to import the aforementioned shapefiles into a “Feature Dataset”

so that these rules could be applied.

o Next a name for the topological relationship class needed to be defined as per the following figure.

Figure 13 – Naming and cluster tolerance specification

o Next the shapefiles to which the rules were being applied, the three aforementioned “UK_LAD”, “English Regions” and “Applicant” shapefiles, needed to be selected (figure 14).


Figure 14 – Selecting the feature classes that will have topological rules applied to them

o After this, ranks were applied to each feature class in order to set a preference for which data would be snapped to the position of other data based on accuracy; i.e. less accurately collected data is snapped to the position of more accurately collected data during validation of the topology. In this case rank 1 was applied to all feature classes because the data in each feature class is assumed to have been collected to the same accuracy. This is based on the assumption that applicants entered their correct postcode and that the correct easting and northing data attached to that postcode was therefore applied (figure 15).

Figure 15 – Feature class rank assignment

o Next the individual topological rules were set for each feature class (figures 16 - 20).

Figure 16 – Rule defining that point features from applicant data must be properly inside the local authority

polygons


Figure 17 – Rule defining that polygon area features from the “UK_LAD” shapefile must not overlap one another

Figure 18 – Rule defining that polygon area features from the “EnglishRegions” shapefile must not overlap one

another

Figure 19 – Rule defining that polygon features from the “UK_LAD” shapefile must not have gaps in them

Figure 20 – Rule defining that polygon features from the “EnglishRegions” shapefile must not have gaps in them

o Lastly a summary page was presented, checked and verified.


Figure 21 – Topology rule summary

Queries

Non-Spatial Queries

1. List the number of applicants that belong to the White ethnic group. Also calculate the percentage of the result over the full population of applicants.

7248 of the 39240 applicants belong to the “White” ethnic group, representing 18.47% (7248 / 39240 * 100) of the

full applicant population as shown by the following SQL statement and result screenshot.
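The actual statements appear only in the result screenshots, but a query of this shape can be sketched in SQLite. The toy data here (2 of 5 rows) stands in for the real 7248 of 39240; the column name is an assumption based on the schema described earlier:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Applicant (ID INTEGER PRIMARY KEY, EthnicGroup TEXT)")
# Toy rows: 2 White applicants out of 5 total.
con.executemany("INSERT INTO Applicant VALUES (?, ?)",
                [(1, "White"), (2, "White"), (3, "AoAB"), (4, "BoBB"), (5, "CoOEB")])

# Count the group and express it as a percentage of all applicants in one statement,
# mirroring the count / total * 100 calculation used throughout these queries.
count, pct = con.execute("""
    SELECT COUNT(*),
           ROUND(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM Applicant), 2)
    FROM Applicant
    WHERE EthnicGroup = 'White'""").fetchone()
print(count, pct)  # -> 2 40.0
```

The remaining non-spatial queries follow the same pattern, varying only the WHERE clause (e.g. `WHERE Sex = 'F' AND ModeOfStudy = 'PartTime'` for query 2, under the same assumed column names).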


2. How many part-time applicants are females?

1869 applicants are both female and part-time, as shown by the following SQL statement and result screenshot.

3. What is the number of male applicants with the highest qualification as A/AS Level?

4090 of the applicants whose highest qualification is A/AS Level are male as shown by the following SQL statement

and result screenshot.


4. List all female white applicants who have applied for a full-time degree.

3414 of the applicants are white females who have applied for a full-time degree, as shown by the following SQL

statement and result screenshot.

5. What is the proportion of applicants that applied through clearing?

1132 of the 39240 applicants applied through clearing, representing 2.88% (1132 / 39240 * 100) of the full applicant

population as shown by the following SQL statement and result screenshot.


6. List all mixed race female applicants who have a UK Master’s Degree.

9 of the applicants are mixed race, female and have a UK Master’s Degree as shown by the following SQL statement

and result screenshot.

7. List all mixed race male applicants who have a UK Master’s Degree.

2 of the applicants are mixed race, male and have a UK Master’s Degree as shown by the following SQL statement

and result screenshot.


8. What is the percentage of AoAB applicants?

3059 of the 39240 applicants are “AoAB”, representing 7.79% (3059 / 39240 * 100) of the full applicant population,

as shown by the following SQL statement and result screenshot.

9. What is the proportion of applications from mature students admitted on the basis of previous experience or an admissions test?

67 of the 39240 applicants are “Mature students admitted on the basis of previous experience or admissions test”,

representing 0.17% (67 / 39240 * 100) of the full applicant population, as shown by the following SQL statement and

result screenshot.


10. List all full-time BoBB applicants who DO NOT have any formal qualifications.

1 applicant is full-time, “BoBB” and doesn’t have any formal qualifications, as shown by the following SQL statement

and result screenshot.


Spatial Queries

1. Select all the applicants from the London region.

The London region is part of the larger “English Regions” shapefile and needed to be separated from the other 8 regions within the shapefile so that a “Select By Location” query could be run to select all applicants from the London region. This was done by:

Selecting the London region within the “English Regions” shapefile.

Exporting it as a separately defined polygon.


This then allows selection of the London region as a source layer in a “Select By Location” query where all of the

applicants within the source layer can be selected.

The end of this process shows 25064 of the 39240 applicants as coming from the London region as highlighted in the

table screenshot below.


2. Which Local Authority has the highest number of applicants?

This was done by:

Joining the “UK_Lad” shapefile, containing the data on UK Local Authorities, to the “Join_Output_2” (aka “Applicants”) shapefile and summing the number of applicants (points) in each local authority (polygon).

This adds the “summed” data to a newly created layer, which, when the attribute table is opened and the “Count_” column sorted in descending order, displays Greenwich as the Local Authority with the highest number of applicants at 2225, as per the screenshot below.


3. Which English Region has the highest number of applicants?

This was done by:

Joining the “EnglishRegions” shapefile, containing the data on English Regions (i.e. South East England), to the “Join_Output_2” (aka “Applicants”) shapefile and summing the number of applicants (points) in each English Region (polygon).

This adds the “summed” data to a newly created layer, which, when the attribute table is opened and the “Count_” column sorted in descending order, displays London as the English Region with the highest number of applicants at 25064, as per the screenshot below.


4. List all the local authorities with CoOEB greater than 15000 and select all the applicants within the

resulting local authorities.

This was done by:

Joining the “UK_Lad” shapefile with the “Ethnicity Stats” table, using the “GEO_CODE” field as the primary and

foreign key respectively.

This then allows the execution of the following SQL statement to select all Local Authorities with “CoOEB” greater

than 15000.


As confirmed by opening the “UK_LAD” attribute table.

Next these selected polygons were exported to their own separate shapefile.

The following “Select By Location” query can then be run to select all of the applicants from the “Join_Output_2” layer that fall within the source layer, i.e. the newly created layer from the previous step.


This shows 2848 of the total 39240 applicants as being within these selected Local Authorities, as per the screenshot below.

5. From the applicants selected in query 4, choose the applicants that are of CoOEB background. List the number.

This was done by:

Running the following SQL statement to select all applicants that are within the “CoOEB” ethnic group in the

“Join_Output_2” layer.

And exporting them to their own separate layer.


A “Select By Location” query can then be run to select all of the exported “CoOEB” applicants that fall within the

Local Authorities as determined in the previous query.

This gives a total of 57 “CoOEB” applicants out of the total 620 applicants from the aforementioned Local Authorities

as shown in the screenshot below.


6. Which local authority has the highest number of people with ‘no qualification’, and how many applicants are in that local authority?

This was done by:

Opening the “Qualification Stats” attribute table and sorting the “NoQual” column (standing for “No Qualification”) in descending order, from highest value to lowest.

From this it was determined that Birmingham had the highest number of people with ‘no qualification’ at 233,835 as

indicated by the attribute table above. The Birmingham polygon was then exported to its own individual shapefile.


A “Select By Location” query could then be run to identify all of the applicants that are completely within the source layer of Birmingham.

158 out of the 39240 applicants are from the local authority of Birmingham as shown in the screenshot below.


7. How many applicants are in the South East Region of England, and how many of these are full-time female applicants?

This was done by:

Selecting the “South East” polygon from the “English Regions” shapefile and exporting it to its own separate feature so that a “Select By Location” query could be run on that region only.

A “Select By Location” query was then run to identify all of the applicants within the South East Region polygon.


This identified 7110 out of the total 39240 applicants.


This data was then exported to its own individual shapefile so that a “Select By Attributes” SQL query could be run on it.

The following “Select By Attributes” SQL statement was run to identify all of the South East applicants that are full-time and female.

This determined that 3889 of the 7110 South East applicants are full-time and female, as shown in the screenshot below.


8. Select all the local authorities with applicant numbers greater than 500.

This was done by:

Taking the “UK_Lad” shapefile and “Join_Output_2” (aka “Applicants”) shapefile previously joined in spatial query 2, and running the SQL statement shown below to identify all Local Authorities with a “Count_” column value of more than 500.

This selects all Local Authorities with applicant numbers greater than 500 as shown in the screenshot below.


9. Select all the local authorities with no applicants; to which English regions do these local authorities belong?

This was done by:

Taking the previously joined “UK_Lad” shapefile and “Join_Output_2” (aka “Applicants”) shapefile and running the SQL statement shown below to identify all Local Authorities with a “Count_” column value of 0.


This selects all Local Authorities with applicant numbers of 0 as shown in the screenshot below.


These selected data are then exported to their own separate shapefile.

Next a “Select By Location” query is run to select all of the “0 Applicant” local authorities within English Regions.


The screenshot below shows this to be the local authority of Bolsover.

By viewing the map and using the “Identify” tool, the local authority of Bolsover is seen to fall within the English Region “East Midlands”, as shown in the screenshot below.


10. Calculate the density of all applicants in all the London region’s local authorities; which London authority has the highest density of applicants?

This was done by:

Running a “Select By Location” query to select all local authorities, with applicant numbers included from previous queries, within the London region.


These results are then displayed in the attribute table below


Next a new column is added to the table so that the total population (White + Mixed + AoAB + BoBB + CoOEB) can be

summed for each of the selected local authorities using the field calculator and the expression below.

Next, the number-of-applicants-per-local-authority data from spatial query 2 (“Which Local Authority has the highest number of applicants?”) was joined to the above table, a new field entitled “App_Density” was added, and its values were calculated using the following expression in Field Calculator.


When the resultant values in the attribute table are sorted in descending order, Greenwich is shown to be the local authority in London with the highest density of applicants across its entire population, as shown by the screenshot below.
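The density calculation performed in Field Calculator can be sketched in plain Python. The figures below are illustrative stand-ins, not the coursework values (only Greenwich's applicant count of 2225 is taken from the text):

```python
# Hypothetical per-local-authority figures: "population" stands for the total
# summed from the ethnic-group columns (White + Mixed + AoAB + BoBB + CoOEB).
local_authorities = {
    "Greenwich": {"population": 250000, "applicants": 2225},
    "Camden":    {"population": 220000, "applicants": 1100},
}

# App_Density as computed in Field Calculator: applicants per head of population.
for name, la in local_authorities.items():
    la["App_Density"] = la["applicants"] / la["population"]

# Sorting descending on density identifies the highest-density authority.
highest = max(local_authorities, key=lambda n: local_authorities[n]["App_Density"])
print(highest)  # -> Greenwich
```

Normalising by population, rather than ranking raw applicant counts, is what distinguishes this query from spatial query 2.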

BONUS QUERIES

11. Select the applicants from the 5 most ethnically diverse local authorities.

This was done by:

Adding a new column entitled “OtherEthnicities” to the “Ethnicity Stats” table and adding together all ethnicities except for “White” using the following expression in the Field Calculator.

Next, another field entitled “Percentage_of_OE” was added to the “Ethnicity Stats” table and the percentage of the so-called “other ethnicities” versus “White” was calculated using the following expression in Field Calculator.


When the results in the attribute table were sorted in descending order, the following five Local Authorities were given as having the largest percentages of “other ethnicities” making up their respective populations: Newham, Brent, Redbridge, Harrow and Tower Hamlets, as per the screenshot below.

These local authorities were then exported to a new layer.

A “Select By Location” query was then run to select all of the applicants within these five most diverse local authorities, as shown in the screenshot below.


In total there are 5704 applicants within the aforementioned “5 most ethnically diverse local authorities” as per the

results screenshot below.
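The two Field Calculator steps in this bonus query (summing the non-White ethnicities, then expressing them as a percentage of the total) can be sketched as follows. The counts are made up; only the column names follow the “Ethnicity Stats” table described above:

```python
# Hypothetical ethnicity counts per local authority, columns as in "Ethnicity Stats".
stats = {
    "Newham": {"White": 1000, "Mixed": 500, "AoAB": 2000, "BoBB": 1500, "CoOEB": 1000},
    "Sutton": {"White": 4000, "Mixed": 200, "AoAB": 300,  "BoBB": 250,  "CoOEB": 250},
}

for la, row in stats.items():
    # "OtherEthnicities": every group except White added together.
    row["OtherEthnicities"] = sum(v for k, v in row.items() if k != "White")
    # "Percentage_of_OE": other ethnicities as a share of the total population.
    total = row["White"] + row["OtherEthnicities"]
    row["Percentage_of_OE"] = round(row["OtherEthnicities"] * 100.0 / total, 2)

# Sorting descending on the percentage ranks authorities by diversity.
ranked = sorted(stats, key=lambda la: stats[la]["Percentage_of_OE"], reverse=True)
print(ranked)  # -> ['Newham', 'Sutton']
```

With the full table, taking the first five entries of `ranked` would give the five most ethnically diverse local authorities.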

Discussion

During the creation of the ER Schema a relationship between the Applicants and English Regions could have been created, as many applicants could relate to one region and vice versa. Apart from that I can see no other obvious relation between entities, or at least none worth defining. The fees, sex, main cycle or clearing, mode of study, ethnicities and qualifications needed to be moved to their own separate tables in the interests of preserving data integrity (ensuring incorrect values cannot be entered into the database) and saving space by reducing the amount of redundant data in the database. In this respect the database is normalised to 3NF as far as I can tell; however, it may not be, due to the postcode data being left in the applicant table. The reason for leaving this in was that it may be useful for further categorisation, on a smaller scale than local authority level, at some point in the future; however, for the purposes of this exercise it may be considered redundant data once the local authorities into which each applicant falls had been determined from it.

Not all of the steps in the ESRI “Building a Geodatabase” tutorial were followed, as it was determined that some stages were either not relevant (creating a geometric network) or not essential (creating annotation feature classes). Geometric networks model flow and connectivity between objects (i.e. utility networks such as water and electricity); it was determined that there are no such networks in this database and therefore a geometric network wasn’t created. Annotation feature classes were also not created as they’re not strictly essential for the creation of a relational spatial geodatabase. This stage might have been more worthwhile had there been more initial shapefiles to add to the database, as opposed to just .csv files, because, for example, there would have been more maps requiring quick pre-defined annotation (i.e. Local Authority names to be attached to the Local Authority polygons).

The ranking of shapefiles during the creation of topological rules in stage 5 of the implementation could have been stricter, in that the applicant data perhaps should have been ranked lower than the Local Authority and English Regions shapefiles, as it relies on accurate input of postcodes by the applicant and then also on accurate conversion to the Easting and Northing data. Although perhaps beyond the scope of this project, it would also have been good to be able to check the “correctness” of the attribute domains and whether the input of new data to the database was genuinely restricted to the values defined during stage 3 of the implementation.

It would also have been good to have tested the final database against Codd’s twelve rules for a relational database. This may well also have been beyond the scope of this project, but there is definite value in doing so as it would show whether the DBMS is truly relational. For what it’s worth, I believe the database created satisfies rules 1 – Information Representation, 2 – Guaranteed Access, 3 – Representation of Null Values (by use of “Unknown” for null values rather than “0”), 6 – Updating Views and 7 – Insert, Update and Delete Operations. Being unclear on the remaining rules gives rise to the belief that the database may not be truly relational according to Codd.

