ACS DataMart_ppt
Uploaded by jeremy-searls, 12-Apr-2017

Transcript
Page 1: ACS DataMart_ppt

ACS DATAMART

JEREMY SEARLS

Page 2: ACS DataMart_ppt

“MAKE THE CENSUS DATA EASIER TO USE”

PROJECT OVERVIEW

INITIAL OBJECTIVES

• Create a repeatable process to build a “Data-Mart” of all census data

• Utilize Hive and Hadoop to write and store data

• Make census data more accessible and understandable by organizing data into categories with logical column headers

Page 3: ACS DataMart_ppt

QUICK REVIEW OF ACS DATA

• “American Community Survey”

• Sent to approximately 295,000 addresses monthly (or 3.5 million per year)

• The ACS only includes approximately 2 million final interviews per year

• The survey was fully implemented in 2005

• Data comes in 3 forms:

• 1 year, 3 year and 5 year

Page 4: ACS DataMart_ppt

5 YEAR ACS

The 2014 ACS 5-year estimates were released in 2015 and summarize responses received in 2010, 2011, 2012, 2013 and 2014 for all geographies.

This is most suitable for data users interested in longer-term changes at small geographic scales.

FYI - “Places” refer to the statistical counterparts of incorporated places, and are delineated to provide data for settled concentrations of population that are identifiable by name but are not legally incorporated under the laws of the state in which they are located. e.g. Boroughs in New York

Red Circles denote available data

Page 5: ACS DataMart_ppt

ACS DATA STRUCTURE

2014 5 YR ACS

121 “SEQUENCES” FOR EACH OF 52 STATES/TERRITORIES

EACH SEQUENCE IS A TABLE CONTAINING THE SAME GEOGRAPHIC DATA WITH DIFFERENT CENSUS DATA

Page 6: ACS DataMart_ppt

PRIOR AFTER-ACTION ITEMS MOTIVATING PROJECT DEV

WHY?

• During my starter project with census data, it was noted that the topics of information were scattered across several tables, or “sequences”

• The column headers were also coded, requiring the use of a lookup table to decipher the headers

• Titled headers then had to be manually created and entered as headers or metadata to replace coded headers

Page 7: ACS DataMart_ppt
Page 8: ACS DataMart_ppt

THAT’S A LOTTA BYTES

STEP 1:

• The first step taken was downloading the census data

• Issues with the previously used census data were its lack of granular data and that it only represented estimates based on one year

• The 5 year census was chosen to get more accurate data, and a more granular version of that data was selected

• Once selected and downloaded, the data must be converted to SAS tables

Page 9: ACS DataMart_ppt

SAS TABLE CONVERSION

• After downloading SAS and running it on a VM, the census sequence files are converted to SAS tables using a macro that directs the conversion. 176 GB of SAS tables were created.

• The macro is supplied by the Census, but must be manipulated to output desired data

Page 10: ACS DataMart_ppt

THE JOURNEY CONTINUES

SAS TABLE TO CSV

• Ruby script calls python scripts to convert SAS tables to CSVs

• CSVs reduce overall data size from 176 GB to 53 GB

Page 11: ACS DataMart_ppt

I’M HADOOP AND SO CAN YOU!

STEP 2

• Once finished with obtaining the data, the next step was learning Hadoop

• The Hadoop Hortonworks VM sandbox was installed and I began training in Hive

• Once I had created training tables and felt comfortable working in HDFS, I began deciding on data structure

Page 12: ACS DataMart_ppt

THE WALMART THEOREM

STEP 3

• It made the most sense to organize data into sensible categories, each on a single table. While the tables would be large, this would reduce the encumbrance of having only half the needed data on any one table.

• Having logical titles would eliminate the need for a lookup table and manually entering titles when desired data was selected.

• Would reduce the data redundancy of repeating geographical information on every sequence.

Page 13: ACS DataMart_ppt

CATEGORIES

• Manually went through entire census, categorizing tables into topic and subtopics

• Requested changing scope of objectives

• Focused on Marriage Data for a proof of concept model

Page 14: ACS DataMart_ppt

WE’VE GOT SOME WORK TO DO.

• Began looking at how table metadata was organized and could be automated by a script to create logical names.

Page 15: ACS DataMart_ppt

• Created a hierarchy and row number to build logical titles

• Problems with this method:
  - Can’t be easily applied to other subjects; no repeatability
  - Would be just as effective to write out the names manually

Page 16: ACS DataMart_ppt

IN COMES PETER.

• While discussing my dilemma with Peter, he showed me the Census Reporter, a group that “helps journalists navigate and understand information from the U.S. Census bureau.”

• They had already organized the entire census, giving the indents and most importantly, the parent column ID

Page 17: ACS DataMart_ppt

TIME TO CODESTEP 4

• Throughout this process, I had continued training in Ruby and was convinced it would be the best method for me to create the logical titles. This was based on several factors:
  - Ruby has a great CSV library built in
  - I was simultaneously training in Ruby and had this project in mind during training sessions
  - A Ruby algorithm would be easily repeatable and shared/improved

Page 18: ACS DataMart_ppt

LOADING THE CSV OF COLUMN HEADERS

• Created an array of hashes from each row

• Value of parent column id is the predecessor containing the next portion of a logical title

• Realized all that is needed is the title, the col_id, and the parent column id

• Learned how to use a hash lookup from Thon in previous code challenge
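The loading step above can be sketched as follows; the column ids, titles, and parent ids are hypothetical stand-ins for the Census Reporter metadata, but the shape (key each row by its col_id so parent lookups are constant time) matches the approach described:

```ruby
require 'csv'

# Hypothetical sample of column metadata rows as they might be parsed
# from the headers CSV (column_id, column_title, parent_column_id).
rows = [
  { col_id: 'B12001001', title: 'Total',         parent_id: nil },
  { col_id: 'B12001002', title: 'Male',          parent_id: 'B12001001' },
  { col_id: 'B12001003', title: 'Never married', parent_id: 'B12001002' }
]

# Key each row by its column_id so a parent lookup is a single hash access.
lookup = rows.each_with_object({}) do |row, h|
  h[row[:col_id]] = { title: row[:title], parent_id: row[:parent_id] }
end

puts lookup['B12001003'][:title]  # => Never married
```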

Page 19: ACS DataMart_ppt

PUTTING TITLES TOGETHER

• By using the column_id as the key to the hash, when it is called, its value is returned, containing the column title and its parent column id.

• Performing a hash lookup on the parent column id returns the corresponding column title and its parent id, which are interpolated into the growing title. This process continues until there are no more parents (nil).
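A minimal sketch of that parent-chain walk, assuming a lookup hash keyed by column_id whose values hold the title and parent id (the ids and titles here are illustrative, not the project's actual data):

```ruby
# Hypothetical lookup hash built from the column metadata CSV.
lookup = {
  'B12001001' => { title: 'Total',         parent_id: nil },
  'B12001002' => { title: 'Male',          parent_id: 'B12001001' },
  'B12001003' => { title: 'Never married', parent_id: 'B12001002' }
}

# Walk from a column up through its parents, collecting each title,
# until a nil parent id ends the chain.
def title_chain(lookup, col_id)
  parts = []
  while col_id
    entry = lookup[col_id]
    parts << entry[:title]
    col_id = entry[:parent_id]
  end
  parts
end

p title_chain(lookup, 'B12001003')  # => ["Never married", "Male", "Total"]
```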

Page 20: ACS DataMart_ppt

LET’S MAKE IT LEGIBLE

• Use flatten to remove array nesting

• Compact removes the nil values

• Reverse puts the string in the correct order

• Join creates a single string from the arrays
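The four steps above chain naturally in Ruby. The nested array below and the ": " separator are illustrative assumptions, since the slide does not show the actual code:

```ruby
# A chain result as it might come back: nested, leaf-to-root, with a
# nil terminator where the root's parent would be (hypothetical data).
parts = [['Never married', ['Male', ['Total', nil]]]]

title = parts.flatten    # remove array nesting
             .compact    # drop the nil values
             .reverse    # put the string in root-to-leaf order
             .join(': ') # make a single string

puts title  # => Total: Male: Never married
```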

Page 21: ACS DataMart_ppt

COMBINE WITH TABLE NAME

Page 22: ACS DataMart_ppt

CONCATENATE AND FORMAT

Page 23: ACS DataMart_ppt

DONE? NOT SO MUCH.

Each column id represents metadata for the estimate (e) and margin of error (m):

B01001003 = B01001e3 & B01001m3

Page 24: ACS DataMart_ppt

TRANSFORM COLUMN TITLE & ID FOR E & M

Original Code:

ex. B01001003 = B01001e3 & B01001m3 | B01001103 = B01001e103 & B01001m103

Issue: codes with ‘e’ or ‘m’ numbers greater than 99 were not being replaced

Page 25: ACS DataMart_ppt

CODE FIX
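The slide does not show the actual fix, so this is a hedged reconstruction: the helper name `em_ids` and the regex are assumptions, but the pattern captures the full three-digit column number, so values above 99 (e.g. 103) are handled, which was the reported bug:

```ruby
# Split a coded column id into its table id and three-digit column
# number, then build the estimate ('e') and margin-of-error ('m') ids.
# to_i drops the leading zeros, so 003 -> 3 and 103 -> 103.
def em_ids(col_id)
  table, num = col_id.match(/\A(\w+?)(\d{3})\z/).captures
  n = num.to_i
  ["#{table}e#{n}", "#{table}m#{n}"]
end

p em_ids('B01001003')  # => ["B01001e3", "B01001m3"]
p em_ids('B01001103')  # => ["B01001e103", "B01001m103"]
```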

Page 26: ACS DataMart_ppt

FINALLY! READY TO CHANGE A .CSV

• Need to load the CSVs created from the SAS tables, find the coded headers, replace them with the full formatted titles, and write them out to a new CSV
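A minimal sketch of that header-replacement pass. The in-memory CSV string and the `titles` hash are hypothetical; the real script walks a directory of sequence-file CSVs and writes each back out with titled headers:

```ruby
require 'csv'

# Hypothetical coded-header -> titled-header mapping.
titles = { 'B12001e3' => 'Total: Male: Never married' }

# A tiny stand-in for one sequence-file CSV.
input = "GEOID,B12001e3\n0400000US02,12345\n"

table = CSV.parse(input, headers: true)
# Swap any coded header that has a titled replacement; leave others as-is.
new_headers = table.headers.map { |h| titles.fetch(h, h) }

output = CSV.generate do |csv|
  csv << new_headers
  table.each { |row| csv << row.fields }
end

puts output
```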

Page 27: ACS DataMart_ppt

LOOKING FOR A HASH

‘ROW’ RETURNS A CSV::ROW OBJECT (A HARD-LEARNED LESSON). IT USES AN INDEX WITH A NESTED ARRAY; [I] IS NEEDED TO ACCESS FIELDS.
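A tiny demonstration of that lesson: parsing with headers yields CSV::Row objects, not hashes, so fields are accessed by header name or index, and to_h converts when a plain hash is needed:

```ruby
require 'csv'

# Illustrative two-column CSV; parsing with headers: true yields
# CSV::Row objects rather than plain hashes.
row = CSV.parse("a,b\n1,2\n", headers: true).first

p row.class  # => CSV::Row
p row['a']   # => "1"  (access by header name)
p row[1]     # => "2"  (access by index)
p row.to_h   # => {"a"=>"1", "b"=>"2"}
```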

Page 28: ACS DataMart_ppt

SMALL ISSUES

• When loading files, only load CSVs, other files were being picked up

• How to retitle new CSV files

• Combining two arrays

Page 29: ACS DataMart_ppt

COOL!! ALL DONE RIGHT??

WELLLLLL….

• Still have 52 versions of each sequence
• The total number of sequences is 121 (ex: sf0037ak.csv - sequence file 37 of 121 for the state of Alaska)
• Topics are spread across multiple sequences
• Implement the WalMart Theorem.

Page 30: ACS DataMart_ppt

ONE. BIG. TABLE.

• Use integrator to combine the sequences and their respective columns that relate to marriage data

• Needed to create ‘recspec’, using a concatenation of LOGRECNO and STATE

• Very large memory load, sorted data and used extMerge

• 535,345 rows, 1011 columns, 541,233,795 cells

Page 31: ACS DataMart_ppt

UPLOADING TO HDFS

• Originally uploaded the file to the Hortonworks Sandbox, but the sandbox was ill-equipped for such a large table

• Switched to production Hadoop cluster

Page 32: ACS DataMart_ppt

CREATE HIVE/IMPALA TABLE

• Exported metadata created by the integrator join

• Created an external table pointing at the loaded census CSV

Page 33: ACS DataMart_ppt

QUERY IMPALA

ENSURE DATA SET HAS BEEN LOADED WITH CORRECT METADATA

Page 34: ACS DataMart_ppt

CRADLE TO GRAVE PICTOGRAPH

SEQUENCE FILES → SAS TABLE → CSV TABLE WITH CODED HEADERS → CSV TABLE WITH TITLED HEADERS → SEQUENCE TABLE CSVS WITH COMMON CATEGORY JOINED INTO ONE CSV → CSV LOADED ONTO HDFS → EXTERNAL HIVE TABLE CREATED → IMPORT INTO BDD

Next Steps: REPEAT FOR ADDITIONAL CATEGORIES (easily repeatable for additional census categories)

Tools along the pipeline: SAS, Python Script, Ruby Script, Integrator, Hortonworks Hive/Impala

Page 35: ACS DataMart_ppt

USE CASE

• David’s Bridal becomes a client

• Trying to decide between 10 counties where to place their next store

• Their research says first-time marriages use more expensive dresses

• Add census marriage data with a choropleth “heat map” of areas with the highest concentration of unmarried women aged 22-35, for a greater probability of a first-time marriage

• Give clients a “menu” of applicable census categories to be included in their build

Page 36: ACS DataMart_ppt

NEXT STEPS?

• Refactor code with classes

• Automate subject area code selection process

• Utilize integrator’s Hadoop writer tool to eliminate the need to write to .csv, import into HDFS, and build a Hive table

• Create more common abbreviations in titles to reduce overall length and redundancy. Use something more sophisticated than multiple .gsubs
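One possible replacement for chained .gsub calls, sketched here with an abbreviation table and Regexp.union so all substitutions happen in a single pass (the abbreviations shown are illustrative, not the project's actual list):

```ruby
# Hypothetical abbreviation table: long phrase -> short form.
ABBREV = {
  'Margin of Error' => 'MOE',
  'Never married'   => 'NM'
}.freeze

# One gsub over a union of all keys, using the hash form of gsub
# to pick the matching replacement, instead of one .gsub per phrase.
def abbreviate(title)
  title.gsub(Regexp.union(ABBREV.keys), ABBREV)
end

puts abbreviate('Total: Male: Never married')  # => Total: Male: NM
```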

Page 37: ACS DataMart_ppt

FORESEEN ISSUES

• Uploading the entire census would take up a lot of cluster space; may not be practical

• Limitations to giant tables? Hive Limitations? Especially with header character limits

• A possible solution would be uploading categories PRN (as needed) when applicable to projects

Page 38: ACS DataMart_ppt

LESSONS LEARNED

• Booting and running a VM

• Hive Syntax, commands

• Loading files into HDFS

• Using Hive and HDFS through the command line

• File permissions

• Using SAS

• Manipulating Macros

• Maintaining a git repo/ using GitX

• How to communicate technical problems to my superiors and solve them

• Basic Ruby coding and program design

• Regular Expressions

• ETL concepts and practices

• Working well outside comfort zone

• How to easily manipulate JSON and CSVs

• Using all resources at my disposal

• Different areas of staff expertise

• Staying within a project timeline, keeping supervisor informed of status

• Working with new and different filetypes (.gz, .fmt)

• Using integrator for large amounts of data and designing graphs that optimize memory load

• Becoming more self-sufficient, researching and implementing new concepts on my own

• When to dig in, and when it is appropriate to ask for help

• Repurposing lessons taught from prior projects into current ones

• Reading and applying the code of others for my purposes

