Ryan A. Meier
Central Michigan University
2015 IMAGIN Conference
The Use of Affinity Propagation to Cluster U.S. Socio-economic Census Data
WHAT COMMUNITIES BEST REPRESENT THE UNITED STATES?
• Socioeconomic characteristics
• Data that represents the people
• Population
• Age
• Gender
• Race Identity
• Education
• Family Size
• Household Size
• Employment Sector
• Income
• Marital status
• Nativity
• Language Spoken
PREVIOUS STUDIES
• Middletown – Lynd and Lynd (1929)
-- Introduction of Data Mining --
• PRIZM – Jonathan Robbin (1978)
• The Clustering of America – Michael Weiss (1988)
• Our Patchwork Nation – Chinni and Gimpel (2010)
OBJECTIVE
• Map U.S. census socio-demographic data using affinity propagation to group zip codes into meaningful clusters.
• Identify exemplar locations in the U.S. to be used as ideal sample sites in future research.
• Combine GIS techniques with a novel statistical analysis.
• Demonstrate an objective method for the generalization and analysis of large data sets.
AFFINITY PROPAGATION
• Frey Labs, University of Toronto
• Clustering algorithm that identifies exemplars
− Most representative data point in the cluster
• Initially considers all data points as potential exemplars
• Parameters:
− Dissimilarity Matrix
− Preference Value
• Previous Studies
A visualization of AP cluster classification with exemplar data points (Bodenhofer, 2013 p. 3).
FROM THE REDWOOD FOREST TO THE GULF STREAM WATERS: HUMAN SIGNATURE NEARLY UBIQUITOUS IN REPRESENTATIVE US LANDSCAPES
• Cardille and Lambois (2010)
• Objectively identify signature landscapes of the U.S.
• Aid in ecosystem management
• Land use/land cover satellite imagery
• 17 distinct landscapes
• Interesting insight into exemplar landscapes
− Human signature in almost every exemplar
METHODS
• 2010 U.S. Census data and 2008-2012 American Community Survey (ACS) five-year estimates
• ZIP Code Tabulation Areas (ZCTAs™)
• 40 different attributes:
Population density, age, gender, race identity, educational attainment, family size, household size, employment sector, income, marital status, nativity and place of birth, and language spoken at home.
Download Data from U.S. Census Bureau
METHODS
• Null data values replaced or removed.
• The z-score for each variable was calculated to standardize the dataset.
Format Data into Spreadsheet and Calculate Z-scores
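The cleaning and standardization step can be sketched in a few lines of Python. This is only an illustration, not the spreadsheet workflow used for the study: the function names and the toy table are hypothetical, the sketch only removes incomplete records (the slides say nulls were "replaced or removed"), and it assumes the population form of the standard deviation, which the slides do not specify.

```python
from statistics import mean, pstdev

def drop_incomplete(rows):
    """Remove records that contain a null (None) value."""
    return [row for row in rows if all(value is not None for value in row)]

def z_scores(column):
    """Standardize one variable: subtract its mean, divide by its standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)  # assumption: population standard deviation
    if sigma == 0:
        # A constant variable has no spread; map it to zeros instead of dividing by 0.
        return [0.0 for _ in column]
    return [(value - mu) / sigma for value in column]

def standardize(rows):
    """Standardize every column of a row-major table (one row per area)."""
    columns = zip(*rows)
    scaled_columns = [z_scores(list(column)) for column in columns]
    return [list(row) for row in zip(*scaled_columns)]

# Hypothetical toy table: median income (USD) and median age for four areas,
# one of which has a missing income value.
toy = [[40000.0, 30.0], [50000.0, 40.0], [None, 35.0], [60000.0, 50.0]]
standardized = standardize(drop_incomplete(toy))
```

After this step every variable has mean 0 and standard deviation 1, so income in dollars and age in years contribute on the same scale to any distance computed later.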
METHODS
• Reduced data size by 35% while maintaining 95% of information.
• Eliminated correlated data.
• Increased RAM efficiency.
• Decreased overall running time.
Run PCA on Entire Dataset
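One common way to decide how many principal components to keep is to retain the leading components until a target share of the total variance is reached. The sketch below assumes that "95% of information" means 95% of explained variance, which the slides do not state; the function name and the eigenvalues are hypothetical.

```python
def components_to_keep(eigenvalues, target=0.95):
    """Return the smallest number of leading components whose cumulative
    share of the total variance reaches `target`."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, value in enumerate(ordered, start=1):
        running += value
        if running / total >= target:
            return count
    return len(ordered)

# Hypothetical eigenvalues for a 5-variable standardized dataset.
toy_eigenvalues = [2.5, 1.5, 0.8, 0.15, 0.05]
kept = components_to_keep(toy_eigenvalues)
```

If the 35% reduction refers to the number of columns, 40 attributes would correspond to 26 retained components; that reading is an inference from the figures on the slide, not something the slide states.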
METHODS
• Matrix of how ‘different’ each point is from every other point.
• Size of the matrix grows quadratically with the number of data points.
• 1,052,418,481 pairwise dissimilarities for 32,441 ZCTAs (32,441 × 32,441).
• Negative weighted Euclidean distance between z-scores in n-dimensional space.
Create a Dissimilarity Matrix from PCA Results
$$-\sqrt{\sum_{j=1}^{J} w_j \left(x_j - y_j\right)^2}$$
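A minimal Python sketch of the negative weighted Euclidean distance and the resulting square matrix is given here. The function names, the toy points, and the unit weights are assumptions for illustration; the slides do not say how the weights were chosen.

```python
import math

def negative_weighted_distance(x, y, weights):
    """Negative weighted Euclidean distance between two points."""
    return -math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def dissimilarity_matrix(points, weights):
    """Fill the full n x n matrix; it is symmetric with zeros on the diagonal."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            value = negative_weighted_distance(points[i], points[j], weights)
            matrix[i][j] = value
            matrix[j][i] = value
    return matrix

# Hypothetical toy input: three points in two dimensions, unit weights.
points = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
weights = [1.0, 1.0]
S = dissimilarity_matrix(points, weights)

# The matrix has n * n entries, so 32,441 ZCTAs give the count on the slide.
n_entries = 32441 ** 2
```

Because the entry count is the square of the number of areas, doubling the number of ZCTAs would roughly quadruple the memory needed for the matrix.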
METHODS
• R package APCluster (Bodenhofer et al., 2011).
• Approximate run time was 20 hours using 50 GB of RAM on a current 3.5 GHz Xeon processor.
• AP runs on only one processor.
Run AP Using Dissimilarity Matrix and Preference Value
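To make the roles of the two inputs concrete, here is a small plain-Python sketch of the message-passing scheme described by Frey and Dueck (2007). It is not the APCluster R package used for the study: the function name, the damping factor of 0.9, the fixed 200 iterations, and the six-point toy data are all assumptions chosen for illustration, and the sketch is far too slow for 32,441 ZCTAs.

```python
def affinity_propagation(similarity, preference, damping=0.9, iterations=200):
    """Return (exemplars, labels) for a square matrix of negative distances.

    `preference` is placed on the diagonal: higher values make every point
    more willing to act as an exemplar. Needs at least two points.
    """
    n = len(similarity)
    s = [[preference if i == k else similarity[i][k] for k in range(n)]
         for i in range(n)]
    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iterations):
        # Responsibility: how strongly point i favours k over its best alternative.
        for i in range(n):
            scores = [a[i][k] + s[i][k] for k in range(n)]
            best = max(range(n), key=scores.__getitem__)
            runner_up = max(scores[k] for k in range(n) if k != best)
            for k in range(n):
                rival = runner_up if k == best else scores[best]
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - rival)
        # Availability: accumulated support for k from the other points.
        for k in range(n):
            support = [max(0.0, r[i][k]) for i in range(n)]
            others = sum(support) - support[k]  # leave out k's self-responsibility
            for i in range(n):
                if i == k:
                    target = others
                else:
                    target = min(0.0, r[k][k] + others - support[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * target
    exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
    labels = []
    for i in range(n):
        if i in exemplars:
            labels.append(i)  # an exemplar represents itself
        elif exemplars:
            labels.append(max(exemplars, key=lambda k: s[i][k]))
        else:
            labels.append(None)  # no exemplar emerged
    return exemplars, labels

# Hypothetical toy data: two well-separated groups on a line.
values = [0.0, 1.0, 2.0, 20.0, 21.0, 22.0]
toy_similarity = [[-abs(x - y) for y in values] for x in values]
exemplars, labels = affinity_propagation(toy_similarity, preference=-5.0)
many_exemplars, _ = affinity_propagation(toy_similarity, preference=-0.5)
```

With the lower preference the middle point of each group becomes its exemplar, while the higher preference lets every point stand as its own exemplar. This is the same lever the preference value provides in the study: it controls how many clusters emerge.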
METHODS
• Mapped using ArcMap 10.2.
Map Results
RESULTS
• 22 unique clusters and exemplars
• Appearance of regions and spatial patterns
WHY 22 CLUSTERS?
A graph showing the number of resulting clusters as a function of the preference value, starting at -11.705, the minimum value in the dissimilarity matrix. The red dot indicates the chosen preference value of -234.1, which is 20 times that minimum.
22 Clusters of America
1. Worcester, MA
2. Newtonville, MA
3. Easton, ME
4. Lodi, NJ
5. Greenwich, NJ
6. Edison, NJ
7. Penn Yan, NY
8. Savannah, GA
9. Hagerhill, KY
10. Columbus, OH
11. Fort Wayne, IN
12. Wabash, IN
13. Northome, MN
14. Aurora, IL
15. Downers Grove, IL
16. Pierce City, MO
17. Hardy, NE
18. Lafayette, LA
19. Sanger, TX
20. Lockhart, TX
21. Desert Hot Springs, CA
22. Lower Kalskag, AK
CLUSTER REGIONALITY AND PATTERNS
Clusters 10 & 18 - The South
10. Columbus, OH
18. Lafayette, LA
Cluster 9 - Native Born Caucasian
9. Hagerhill, KY
Cluster 22 - American Indian and Alaska Native
22. Lower Kalskag, AK
Clusters 14 & 21 - Hispanic and Latino
14. Aurora, IL
21. Desert Hot Springs, CA
Clusters 1, 2, 6, & 15 - Cities and Suburban Areas
1. Worcester, MA
2. Newtonville, MA
6. Edison, NJ
15. Downers Grove, IL
Clusters 5, 7, 12, & 13 - Rural Areas
5. Greenwich, NJ
7. Penn Yan, NY
12. Wabash, IN
13. Northome, MN
DEGREE OF CLUSTERING
Degree of Clustering Spatial Analysis
DISCUSSION
COMPARED TO PREVIOUS STUDIES
• Easier to see regions
• Looks cleaner
• Shows more heterogeneity
• Greater detail of urban areas
Chinni and Gimpel (2010)
CONCLUSION
Questions?
REFERENCES
Anderson, M. J. 1988. The American Census: A Social History. New Haven, CT: Yale University Press. pp.
Bodenhofer, U., Kothmeier, A., and Hochreiter, S., 2011. APCluster: An R Package for Affinity Propagation Clustering. In:
Bioinformatics, 27(17): pp. 2463-2464.
Cardille, J. A., and Lambois, M., 2010. From the Redwood Forest to the Gulf Stream Waters: Human Signature Nearly
Ubiquitous in Representative US Landscapes. In: Frontiers in Ecology and the Environment, 8(3): pp. 130-134.
Chang, C-J., and Shyue, S-W., 2009. A Study on the Application of Data Mining to Disadvantaged Social Classes in
Taiwan’s Population Census. In: Expert Systems with Applications, 36(1): pp. 510-518.
Chinni, D., and Gimpel, J., 2010. Our Patchwork Nation. New York, USA: Penguin Group Inc.
Dueck, D., and Frey, B. J., 2007. Non-Metric Affinity Propagation for Unsupervised Image Categorization. In:
Proceedings of the 11th IEEE International Conference on Computer Vision, Rio de Janeiro, Brazil, pp. 1-8.
Fan, B., 2009. A Hybrid Spatial Data Clustering Method for Site Selection: The Data Driven Approach of GIS Mining. In:
Expert Systems with Applications, 36(2 part II): pp. 3923-3936.
Fligstein, N., 1981. Going North: Migration of Blacks and Whites from the South, 1900-1950. New York, USA: Academic
Press, Inc.
Frey, B. J., and Dueck, D., 2007. Clustering by Passing Messages Between Data Points. In: Science, 315(5814): pp. 972-976.
Furse, D. H., Punj, G. N., and Stewart, D. W., 1984. A Typology of Individual Search Strategies Among Purchasers of New
Automobiles. In: Journal of Consumer Research, 10(4): pp. 417-431.
Goss, J., 1995. “We Know Who You Are and We Know Where You Live”: The Instrumental Rationality of
Geodemographic Systems. In: Economic Geography, 71(2): pp. 171-198.
Green, P. E., Frank, R. E., and Robinson, P. J., 1967. Cluster Analysis in Test Market Selection. In: Management Science
(pre-1986), 13(8): pp. B387 (14).
Greenacre, M., 2005. Weighted Metric Multidimensional Scaling. In: Studies in Classification, Data Analysis, and
Knowledge Organization: pp. 141-149.
Hanson, S. L., 2004. Classic Book Reviews: The Past Revived. In: Journal of Marriage and Family, 62(3): pp. 847-849.
Helsen, K., and Green, P. E., 1991. A Computational Study of Replicated Clustering with an Application to Market
Segmentation. In: Decision Sciences, 22(5): pp. 1124-1141.
Karimipour, F., Delavar, M. R., and Kinaie, M., 2005. Water Quality Management Using GIS Data Mining. In: Journal of
Environmental Informatics, 5(2): pp. 61-72.
Keim, D. A., Panse, C., Sips, M., and North, S. C., 2004. Pixel Based Visual Data Mining of Geo-spatial Data. In:
Computers & Graphics, 28: pp. 327-344.
Kopanakis, I., and Theodoulidis, B., 2003. Visual Data Mining Modeling Techniques for the Visualization of Mining
Outcomes. In: Journal of Visual Languages & Computing, 14(6): pp. 543-589.
Lê, S., Josse, J., and Husson, F., 2008. FactoMineR: An R Package for Multivariate Analysis. In: Journal of Statistical Software,
25(1): pp. 1-18.
Lynd, R. S., and Lynd, H. M., 1929. Middletown. New York, USA: Harcourt, Brace & World, Inc. pp. 3-9.
Mennis, J., and Guo, D., 2009. Spatial Data Mining and Geographic Knowledge Discovery—An introduction. In:
Computers, Environment and Urban Systems, 33(6): pp. 403-408.
Mines, R., 1981. Developing a Community Tradition of Migration: A Field Study in Rural Zacatecas, Mexico, and
California Settlement Areas. In: Center for U.S.-Mexican Studies. UC San Diego.
Murray, C., Kulkarni, S., Michaud, C., Tomijima, N., Bulzacchelli, M., Iandiorio, T., and Ezzati, M., 2006. Eight Americas:
Investigating Mortality Disparities Across Counties, and Race-Counties in the United States. In: PLoS
Medicine, 3(9): pp. 1513-1524.
Paasi, A., 2004. Place and Region: Looking Through the Prism of Scale. In: Progress in Human Geography 28(4): pp.
536-546.
Punj, G., and Stewart, D. W., 1983. Cluster Analysis in Marketing Research: Review and Suggestions for Application. In:
Journal of Marketing Research, 20(2): pp. 134-148.
Rouse, R., 1991. Mexican Migration and the Social Space of Postmodernism. In: Diaspora: A Journal of Transnational
Studies, 1(1): pp. 8-23.
Slocum, T., McMaster, R., Kessler, F., and Howard, H., 2009. Data Classification. In: Thematic Cartography and
Geovisualization, 3rd ed., Upper Saddle River, NJ: Pearson Education Inc. pp. 57-75.
Spielman, S. E., and Thill, J-C., 2008. Social Area Analysis, Data Mining, and GIS. In: Computers, Environment and Urban
Systems, 32(2): pp. 110-122.
Weiss, M. J., 1988. The Clustering of America. New York, USA: Harper & Row, Publishers.
Winkle, K., 1991. The U.S. Census as a Source in Political History. In: Social Science History, 15(4): pp. 565-57.