Ryan A. Meier

Central Michigan University

2015 IMAGIN Conference

The Use of Affinity Propagation to Cluster U.S. Socio-economic Census Data

• Socioeconomic characteristics

• Data that represents the people

• Population

• Age

• Gender

• Race Identity

• Education

• Family Size

• Household Size

• Employment Sector

• Income

• Marital status

• Nativity

• Language Spoken

WHAT COMMUNITIES BEST REPRESENT THE UNITED STATES?

PREVIOUS STUDIES

• Middletown – Lynd and Lynd (1929)

– Introduction of Data Mining –

• PRIZM – John Robbin (1978)

• The Clustering of America – Michael Weiss (1988)

• Our Patchwork Nation – Chinni and Gimpel (2010)

OBJECTIVE

• Map U.S. census socio-demographic data using affinity propagation to group zip codes into meaningful clusters.

• Identify exemplar locations of the U.S. to be used as ideal sample sites in future research.

• Combine GIS techniques with a novel statistical analysis.

• Demonstrate an objective method for the generalization and analysis of large data sets.

AFFINITY PROPAGATION

• Frey Labs, University of Toronto

• Clustering algorithm that identifies exemplars

− The most representative data point in each cluster

• Initially considers all data points as potential exemplars

• Parameters:

− Dissimilarity Matrix

− Preference Value

• Previous Studies

Figure: A visualization of AP cluster classification with exemplar data points (Bodenhofer, 2013, p. 3).

FROM THE REDWOOD FOREST TO THE GULF STREAM WATERS: HUMAN SIGNATURE NEARLY UBIQUITOUS IN REPRESENTATIVE US LANDSCAPES

• Cardille and Lambois (2010)

• Objectively identify signature landscapes of the U.S.

• Aid in ecosystem management

• Land use / land cover satellite imagery

• 17 distinct landscapes

• Interesting insight into exemplar landscapes

− Human signature in almost every exemplar

METHODS

• 2010 U.S. Census data and 2008-2012 American Community Survey (ACS) five-year estimates

• ZIP Code Tabulation Areas (ZCTAs™)

• 40 different attributes:

Population density, age, gender, race identity, educational attainment, family size, household size, employment sector, income, marital status, nativity and place of birth, and language spoken at home.

Workflow step 1: Download data from the U.S. Census Bureau

METHODS

• Null data values replaced or removed.

• The z-score for each variable was calculated to standardize the dataset.
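The standardization step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `z_scores` and the income values are hypothetical, and the population standard deviation is used here even though the slides do not say which variant was applied.

```python
import math

def z_scores(column):
    """Standardize one variable: subtract the mean, divide by the standard deviation."""
    n = len(column)
    mean = sum(column) / n
    # Assumption: population standard deviation (divide by n); the sample
    # version (divide by n - 1) is an equally common choice.
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    if std == 0:
        return [0.0] * n  # a constant column has no spread to standardize
    return [(v - mean) / std for v in column]

# Hypothetical values for one attribute across five ZCTAs
median_income = [42000.0, 55000.0, 61000.0, 38000.0, 74000.0]
standardized = z_scores(median_income)
```

After this step every attribute has mean 0 and unit variance, so attributes measured in dollars, years, or percentages contribute on a comparable scale to any later distance calculation.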

Workflow step 2: Format data into a spreadsheet and calculate z-scores

METHODS

• Principal component analysis (PCA) reduced data size by 35% while retaining 95% of the information.

• Eliminated correlated data.

• Increased RAM efficiency.

• Decreased overall running time.
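The idea of keeping only enough principal components to retain a target share of the variance can be illustrated as follows. This is a minimal sketch, not the tool used in the study: power iteration with deflation stands in for a proper PCA routine, and the function names, the 0.95 threshold argument, and the toy rows are all assumptions made for the example.

```python
import math
import random

def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    return cov, centered

def top_eigenpair(matrix, iterations=500, seed=0):
    """Approximate the largest eigenvalue and its eigenvector by power iteration."""
    d = len(matrix)
    rng = random.Random(seed)
    vec = [rng.random() + 0.1 for _ in range(d)]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm < 1e-12:
            return 0.0, vec
        vec = [v / norm for v in nxt]
    value = sum(vec[i] * sum(matrix[i][j] * vec[j] for j in range(d))
                for i in range(d))
    return value, vec

def pca_keep_variance(rows, keep=0.95):
    """Project rows onto the fewest components whose variance share reaches `keep`."""
    cov, centered = covariance_matrix(rows)
    d = len(cov)
    total = sum(cov[i][i] for i in range(d))
    components, explained = [], 0.0
    while explained / total < keep and len(components) < d:
        value, vec = top_eigenpair(cov)
        components.append(vec)
        explained += value
        # Deflation: remove the component just found from the covariance matrix.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(d)]
               for i in range(d)]
    scores = [[sum(c[j] * comp[j] for j in range(d)) for comp in components]
              for c in centered]
    return scores, explained / total

# Toy data: the third column is almost exactly twice the first (highly correlated).
rows = [[1.0, 0.5, 2.1], [2.0, 0.1, 3.9], [3.0, 0.9, 6.2],
        [4.0, 0.3, 7.8], [5.0, 0.7, 10.1], [6.0, 0.2, 12.0]]
scores, ratio = pca_keep_variance(rows)
```

Because two of the three toy columns are nearly redundant, fewer than three components are needed to pass the threshold, which mirrors how PCA removes correlated attributes and shrinks the input to the later pairwise step.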

Workflow step 3: Run PCA on the entire dataset

METHODS

• Matrix of how ‘different’ each point is from every other point.

• Size of the matrix grows with the square of the number of data points.

• 1,052,418,481 pairwise dissimilarities for 32,441 ZCTAs.

• Negative weighted Euclidean distance between z-scores in n-dimensional space.
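The pairwise count quoted above is simply the number of ZCTAs squared. The short calculation below reproduces it and adds a rough memory estimate; the 8 bytes per value is an assumption (64-bit floating point), not a figure from the slides.

```python
n_zctas = 32_441
n_pairs = n_zctas ** 2      # one entry per ordered pair, diagonal included
bytes_per_value = 8         # assumption: 64-bit floating-point storage
gib = n_pairs * bytes_per_value / 2 ** 30

print(f"{n_pairs:,} entries, about {gib:.1f} GiB for one dense matrix")
# prints "1,052,418,481 entries, about 7.8 GiB for one dense matrix"
```

Doubling the number of points would quadruple the matrix, and dense affinity propagation typically keeps further message matrices of the same shape alongside the input, so the working set is a multiple of this single-matrix estimate.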

Workflow step 4: Create a dissimilarity matrix from PCA results

$$ s(x, y) = -\sqrt{\sum_{j=1}^{J} w_j \,(x_j - y_j)^2} $$
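A small sketch of how such a matrix could be filled in, assuming the negative weighted Euclidean distance named on this slide takes the usual square-root form. The function names, the two-dimensional points, and the unit weights are placeholders; the slides do not state how the weights were chosen.

```python
import math

def negative_weighted_distance(x, y, weights):
    """Negative weighted Euclidean distance between two points (0 means identical)."""
    return -math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))

def pairwise_matrix(points, weights):
    """Fill a symmetric n x n matrix; only the upper triangle is computed."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i + 1, n):
            value = negative_weighted_distance(points[i], points[k], weights)
            matrix[i][k] = matrix[k][i] = value
    return matrix

# Hypothetical component scores for three locations, with placeholder unit weights
points = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
weights = [1.0, 1.0]
S = pairwise_matrix(points, weights)
```

Values closer to zero mean more alike, and more negative values mean more different, which is the orientation affinity propagation expects for its input.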

METHODS

• Run with the R package APCluster (Bodenhofer et al., 2011).

• Approximate run time was 20 hours using 50 GB of RAM on a then-current 3.5 GHz Xeon processor.

• AP runs on only one processor.
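To make the two inputs concrete, here is a compact pure-Python sketch of the affinity propagation message-passing updates described by Frey and Dueck (2007). It is not the APCluster API and not the study's code: the function name, the damping factor, the fixed iteration count, the six toy points, and the preference of -5 are all assumptions chosen for illustration.

```python
def affinity_propagation(S, preference, damping=0.5, iterations=200):
    """Return (exemplar indices, label per point) for a pairwise matrix S."""
    n = len(S)
    # The preference goes on the diagonal: how willing each point is to be an exemplar.
    s = [[preference if i == k else S[i][k] for k in range(n)] for i in range(n)]
    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iterations):
        # Responsibility: how much better k suits i than i's best alternative.
        for i in range(n):
            totals = [a[i][k] + s[i][k] for k in range(n)]
            best = max(range(n), key=totals.__getitem__)
            first = totals[best]
            second = max((totals[k] for k in range(n) if k != best),
                         default=float("-inf"))
            for k in range(n):
                rival = second if k == best else first
                r[i][k] = damping * r[i][k] + (1 - damping) * (s[i][k] - rival)
        # Availability: accumulated support for k serving as an exemplar.
        for k in range(n):
            support = [max(0.0, r[i][k]) for i in range(n)]
            support[k] = r[k][k]  # self-responsibility enters unclipped
            total = sum(support)
            for i in range(n):
                if i == k:
                    new = total - r[k][k]
                else:
                    new = min(0.0, total - support[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
    if not exemplars:
        return [], [None] * n
    labels = [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
              for i in range(n)]
    return exemplars, labels

# Toy data: two well-separated groups on a line (hypothetical values).
xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
S = [[-abs(p - q) for q in xs] for p in xs]
exemplars, labels = affinity_propagation(S, preference=-5.0)
```

On this toy input the middle point of each group ends up as its exemplar. Lowering the preference generally yields fewer clusters and raising it yields more, which is why the preference value has to be chosen alongside the matrix.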

Workflow step 5: Run AP using the dissimilarity matrix and preference value

METHODS

• Mapped using ArcMap 10.2.

Workflow step 6: Map results

RESULTS

• 22 unique clusters and exemplars

• Appearance of regions and spatial patterns

WHY 22 CLUSTERS?

A graph showing the number of resulting clusters based on the preference value, starting with -11.705, the minimum value in the dissimilarity matrix. The red dot indicates the chosen preference value of -234.1.

22 Clusters of America

1. Worcester, MA
2. Newtonville, MA
3. Easton, ME
4. Lodi, NJ
5. Greenwich, NJ
6. Edison, NJ
7. Penn Yan, NY
8. Savannah, GA
9. Hagerhill, KY
10. Columbus, OH
11. Fort Wayne, IN
12. Wabash, IN
13. Northome, MN
14. Aurora, IL
15. Downers Grove, IL
16. Pierce City, MO
17. Hardy, NE
18. Lafayette, LA
19. Sanger, TX
20. Lockhart, TX
21. Desert Hot Springs, CA
22. Lower Kalskag, AK

CLUSTER REGIONALITY AND PATTERNS

Clusters 10 & 18 - The South

10. Columbus, OH
18. Lafayette, LA

Cluster 9 - Native-Born Caucasian

9. Hagerhill, KY

Cluster 22 - American Indian and Alaska Native

22. Lower Kalskag, AK

Clusters 14 & 21 - Hispanic and Latino

14. Aurora, IL
21. Desert Hot Springs, CA

Clusters 1, 2, 6, & 15 - Cities and Suburban Areas

1. Worcester, MA
2. Newtonville, MA
6. Edison, NJ
15. Downers Grove, IL

Clusters 5, 7, 12, & 13 - Rural Areas

5. Greenwich, NJ
7. Penn Yan, NY
12. Wabash, IN
13. Northome, MN

DEGREE OF CLUSTERING

Degree of Clustering Spatial Analysis

DISCUSSION

COMPARED TO PREVIOUS STUDIES

• Easier to see regions

• Looks cleaner

• Shows more heterogeneity

• Greater detail of urban areas

Chinni and Gimpel (2010)

CONCLUSION

Questions?

REFERENCES

Anderson, M. J., 1988. The American Census: A Social History. New Haven, CT: Yale University Press. pp.

Bodenhofer, U., and Kothmeier, A., 2011. APCluster: An R Package for Affinity Propagation Clustering. In: Bioinformatics, 27: pp. 2463-2464.

Cardille, J. A., and Lambois, M., 2010. From The Redwood Forest to the Gulf Stream Waters: Human Signature Nearly Ubiquitous in Representative US Landscapes. In: Frontiers in Ecology and the Environment, 8(3): pp. 130-134.

Chang, C-J., and Shyue, S-W., 2009. A Study on the Application of Data Mining to Disadvantaged Social Classes in Taiwan’s Population Census. In: Expert Systems with Applications, 36(1): pp. 510-518.

Chinni, D., and Gimpel, J., 2010. Our Patchwork Nation. New York, USA: Penguin Group Inc.

Dueck, D., and Frey, B. J., 2007. Non-Metric Affinity Propagation for Un-Supervised Image Categorization. In: Proceedings, 11th IEEE International Conference, Rio de Janeiro, Brazil, Computer Vision, pp. 1-8.

Fan, B., 2009. A Hybrid Spatial Data Clustering Method for Site Selection: The Data Driven Approach of GIS Mining. In: Expert Systems with Applications, 36(2 part II): pp. 3923-3936.

Fligstein, N., 1981. Going North: Migration of Blacks and Whites from the South, 1900-1950. New York, USA: Academic Press, Inc.

Frey, B. J., and Dueck, D., 2007. Clustering by Passing Messages Between Data Points. In: Science, 315: pp. 972-76.

Furse, D. H., Punj, G. N., and Stewart, D. W., 1984. A Typology of Individual Search Strategies Among Purchasers of New Automobiles. In: Journal of Consumer Research, 10(4): pp. 417-431.

Goss, J., 1995. “We Know Who You Are and We Know Where You Live”: The Instrumental Rationality of Geodemographic Systems. In: Economic Geography, 71(2): pp. 171-198.

Green, P. E., Frank, R. E., and Robinson, P. J., 1967. Cluster Analysis in Test Market Selection. In: Management Science (pre-1986), 13(8): pp. B387 (14).

Greenacre, M., 2005. Weighted Metric Multidimensional Scaling. In: Studies in Classification, Data Analysis, and Knowledge Organization: pp. 141-149.

Hanson, S. L., 2004. Classic Book Reviews: The Past Revived. In: Journal of Marriage and Family, 62(3): pp. 847-49.

Helsen, K., and Green, P. E., 1991. A Computational Study of Replicated Clustering with an Application to Market Segmentation. In: Decision Sciences, 22(5): pp. 1124-1141.

Karimipour, F., Delavar, M. R., and Kinaie, M., 2005. Water Quality Management Using GIS Data Mining. In: Journal of Environmental Informatics, 5(2): pp. 61-72.

Keim, D. A., Panse, C., Sips, M., and North, S. C., 2004. Pixel Based Visual Data Mining of Geo-spatial Data. In: Computers & Graphics, 28: pp. 327-344.

Kopanakis, I., and Theodoulidis, B., 2003. Visual Data Mining Modeling Techniques for the Visualization of Mining Outcomes. In: Journal of Visual Languages & Computing, 14(6): pp. 543-589.

Lê, J. S., and Husson, F., 2008. FactoMineR: An R Package for Multivariate Analysis. In: Journal of Statistical Software, 25(1): pp. 1-18.

Lynd, R. S., and Lynd, H. M., 1929. Middletown. New York, USA: Harcourt, Brace & World, Inc. pp. 3-9.

Mennis, J., and Guo, D., 2009. Spatial Data Mining and Geographic Knowledge Discovery: An Introduction. In: Computers, Environment and Urban Systems, 33(6): pp. 403-408.

Mines, R., 1981. Developing a Community Tradition of Migration: A Field Study in Rural Zacatecas, Mexico, and California Settlement Areas. In: Center for U.S.-Mexican Studies. UC San Diego.

Murray, C., Kulkarni, S., Michaud, C., Tomijima, N., Bulzacchelli, M., Iandiorio, T., and Ezzati, M., 2006. Eight Americas: Investigating Mortality Disparities Across Counties, and Race-Counties in the United States. In: PLoS Medicine, 3(9): pp. 1513-1524.

Paasi, A., 2004. Place and Region: Looking Through the Prism of Scale. In: Progress in Human Geography, 28(4): pp. 536-546.

Punj, G., and Stewart, D. W., 1983. Cluster Analysis in Marketing Research: Review and Suggestions for Application. In: Journal of Marketing Research, 20(2): pp. 134-148.

Rouse, R., 1991. Mexican Migration and the Social Space of Postmodernism. In: Diaspora: A Journal of Transnational Studies, 1(1): pp. 8-23.

Slocum, T., McMaster, R., Kessler, F., and Howard, H., 2009. Data Classification. In: Thematic Cartography and Geovisualization, 3rd ed., Upper Saddle River, NJ: Pearson Education Inc. pp. 57-75.

Spielman, S. E., and Thill, J-C., 2008. Social Area Analysis, Data Mining, and GIS. In: Computers, Environment and Urban Systems, 32(2): pp. 110-122.

Weiss, M. J., 1988. The Clustering of America. New York, USA: Harper & Row, Publishers.

Winkle, K., 1991. The U.S. Census as a Source in Political History. In: Social Science History, 15(4): pp. 565-57.