https://eshanley.github.io/

https://github.com/eshanley

Final Tutorial: Food Access Across America: How Food Deserts Affect Community Health

Analysis by Erin Shanley

Data Source

USDA Food Access Research Atlas: https://www.ers.usda.gov/data-products/food-access-research-atlas/download-the-data/

  • Contains 2010 Census Data regarding population demographics, food access, poverty rate, median household incomes, etc.

University of Wisconsin Population Health Institute: 2010 County Health Rankings National Data https://www.countyhealthrankings.org/sites/default/files/2010%20County%20Health%20Rankings%20National%20Data_v2.xls

  • Contains county health data such as obesity rates, average unhealthy days reported by individuals, health quality reported by individuals, etc.

Project Explanation and Importance

Since I am a graduate student, to fulfill the extra requirement I will be working alone on this project. The data sets I plan on working with come from the US Department of Agriculture and The University of Wisconsin Population Health Institute. The particular sets of data I will look at focus on food access across the US and contains demographic information on different counties in the US and their ability to access supermarkets, other healthy and affordable food sources, overall mental health, physical health, and obsesity rates.

While there are many ways to define which areas are considered "food deserts" and many ways to measure food store access for individuals and for neighborhoods, I have chosen a few initial parameters to begin analyzing. The USDA defines a food desert as 'an area that has limited access to affordable and nutritious food, in terms of income and distance to a grocery store. Food deserts are a national crisis in America, as nearly 23.5 million Americans currently live in food deserts. In addition, the USDA estimates that people living in food deserts are 30% more likely to become obese.

For my analysis, I will focus on median family income, poverty rate, percent of populations considered low access at 1/2 mile, 1 mile, 10 miles, and 20 miles distance between housing units and grocery stores, frequency of grocery stores, access to transportation, racial demographics, and the percentage of children living in poverty. We will measure community health by obesity rates per county, reported unhealthy days by individuals, and percent of individuals reporting fair to poor health.

It is important to note a few definitions provided by the data. 'Low Access' refers to accessibility to sources of healthy food, as measured by distance to a store (at ½ mile, 1 mile, 10 miles, 20 miles). 'Low Income' refers to an area where the poverty rate is greater than 20 percent or the median family income is less than or equal to 80 percent of the State-wide median family income. Then a food desert is determined by the combination of both a low access and low income community. The data provides a spatial overview of food access indicators for low income vs other census tracts using measures of 1-mile, 10-mile and 20-mile demarcations to the nearest supermarket, and vehicle availability for all census tracts.

Questions to Address

  1. Does the distance between housing units and supermarkets and the frequency of supermarkets in a community impact community health? Is there higher or lower rates of obesity in areas with less access to healthy affordable food options?
  2. Do average income rates of a particular county impact overall health of the community members?
  3. Does access to transportation have an impact on a community’s health?
  4. Is there a relationship between median family income and areas with less access to healthy affordable food options?

Hypothesis

Communities with lower median family incomes, a higher percentage of low access population, less access to transportation, and higher poverty rates are going to experience the poorest health conditions.

Reading in Data

USDA Food Access Research Atlas Data

Note that this data refers to the 2010 census.

In [139]:
# Read in neccessary libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from scipy import stats
from matplotlib import cm
In [140]:
# Read in USDA Food Access Research Atlas Data
df_food = pd.read_csv("USDA_data.csv")
df_food.head()
Out[140]:
CensusTract State County Urban POP2010 OHU2010 GroupQuartersFlag NUMGQTRS PCTGQTRS LILATracts_1And10 ... TractSeniors TractWhite TractBlack TractAsian TractNHOPI TractAIAN TractOMultir TractHispanic TractHUNV TractSNAP
0 1001020100 Alabama Autauga 1 1912 693 0 0 0.000000 0 ... 221 1622 217 14 0 14 45 44 26 112
1 1001020200 Alabama Autauga 1 2170 743 0 181 0.083410 0 ... 214 888 1217 5 0 5 55 75 87 202
2 1001020300 Alabama Autauga 1 3373 1256 0 0 0.000000 0 ... 439 2576 647 17 5 11 117 87 108 120
3 1001020400 Alabama Autauga 1 4386 1722 0 0 0.000000 0 ... 904 4086 193 18 4 11 74 85 19 82
4 1001020500 Alabama Autauga 1 10766 4082 0 181 0.016812 0 ... 1126 8666 1437 296 9 48 310 355 198 488

5 rows × 147 columns

Tidy USDA Food Access Research Atlas Data

Next, we need to tidy the data imported from the USDA Food Access Research Atlas. Below is a tidied dataframe with columns named more appropriately and unneccessary columns of data removed. This new dataframe displays data surrounding county population, total number of housing units, number of houses without vehicle access, median household income, poverty rate, and low access population percentages when measured at 1 mile, 10 miles, and 20 miles from a grocery store, county populations based on race, and the percent of the population considered to be low income.

In [141]:
# Tidy data by creating new dataframe containing only the relevant parameters
# Create a list of all columns to keep 
food_cols = ["State","County","POP2010","OHU2010","PovertyRate","MedianFamilyIncome",
             "lapophalf","lapop1","lapop10","lapop20","TractHUNV","TractLOWI","TractWhite",
            "TractBlack","TractAsian","TractHispanic","TractOMultir"]
df_food = df_food[food_cols]

# Rename columns appropriately
df_food.rename(columns={"POP2010": "Population", "OHU2010": "TotalHouseUnits","TractHUNV":"HouseUnitsNoVehicle",
                        "TractWhite":"PopWhite","TractBlack":"PopBlack","TractAsian":"PopAsian","TractHispanic":"PopHispanic",
                        "TractOMultir":"PopOther","TractLOWI":"PopLowIncome"}, inplace=True)

display(df_food.head())

# Check for correct data types
df_food.dtypes
State County Population TotalHouseUnits PovertyRate MedianFamilyIncome lapophalf lapop1 lapop10 lapop20 HouseUnitsNoVehicle PopLowIncome PopWhite PopBlack PopAsian PopHispanic PopOther
0 Alabama Autauga 1912 693 10.0 74750 1732.225468 1357.480940 0.0 0.0 26 448 1622 217 14 44 45
1 Alabama Autauga 2170 743 18.2 51875 1410.374828 483.429683 0.0 0.0 87 763 888 1217 5 75 55
2 Alabama Autauga 3373 1256 19.1 52905 2764.604126 1417.874893 0.0 0.0 108 1578 2576 647 17 87 117
3 Alabama Autauga 4386 1722 3.3 68079 3651.061015 1363.466885 0.0 0.0 19 1241 4086 193 18 85 74
4 Alabama Autauga 10766 4082 8.5 77819 7778.396188 2643.095161 0.0 0.0 198 2692 8666 1437 296 355 310
Out[141]:
State                   object
County                  object
Population               int64
TotalHouseUnits          int64
PovertyRate            float64
MedianFamilyIncome       int64
lapophalf              float64
lapop1                 float64
lapop10                float64
lapop20                float64
HouseUnitsNoVehicle      int64
PopLowIncome             int64
PopWhite                 int64
PopBlack                 int64
PopAsian                 int64
PopHispanic              int64
PopOther                 int64
dtype: object

Below I have created a pivot table indexed on state and county to organize the above dataframe into a more manageable format. Then I edited the low access and race columns to represent percentage of the total population, rather than a sum of individuals. This was achieved by creating new columns and dividing by the overall population, then multiplying by 100.

In [142]:
# Create a pivot table to organize data by State and County
df_food_pivot = df_food.pivot_table(values=['Population','TotalHouseUnits','PovertyRate','MedianFamilyIncome','lapophalf',
                                            'lapop1','lapop10','lapop20','HouseUnitsNoVehicle','PopLowIncome','PopWhite',
                                           'PopBlack','PopAsian','PopHispanic','PopOther'],
                                   index=['State','County'],
                                   aggfunc={'Population':np.sum,'TotalHouseUnits':np.sum,'HouseUnitsNoVehicle':np.sum, 
                                            'PovertyRate':np.mean,'MedianFamilyIncome':np.mean,'lapophalf':np.sum,
                                            'lapop1':np.sum,'lapop10':np.sum,'lapop20':np.sum,'PopWhite':np.sum,
                                            'PopBlack':np.sum,'PopAsian':np.sum,'PopHispanic':np.sum,'PopOther':np.sum,
                                            'PopLowIncome':np.sum})

# Edit columns to show % of total population for each identifier
df_food_pivot['% House Units No Vehicle']=(df_food_pivot['HouseUnitsNoVehicle']/df_food_pivot['TotalHouseUnits'])*100
df_food_pivot['% Pop Low Access (1/2 mile)'] = (df_food_pivot['lapophalf']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Low Access (1 mile)'] = (df_food_pivot['lapop1']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Low Access (10 miles)'] = (df_food_pivot['lapop10']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Low Access (20 miles)'] = (df_food_pivot['lapop20']/df_food_pivot['Population'])*100
df_food_pivot['% Pop White'] = (df_food_pivot['PopWhite']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Black'] = (df_food_pivot['PopBlack']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Asian'] = (df_food_pivot['PopAsian']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Hispanic'] = (df_food_pivot['PopHispanic']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Other'] = (df_food_pivot['PopOther']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Low Income'] = (df_food_pivot['PopLowIncome']/df_food_pivot['Population'])*100

# Keep only columns specified below
df_food_pivot = df_food_pivot[['Population','PovertyRate','MedianFamilyIncome','% House Units No Vehicle',
                               '% Pop Low Access (1/2 mile)','% Pop Low Access (1 mile)','% Pop Low Access (10 miles)',
                               '% Pop Low Access (20 miles)','% Pop White','% Pop Black','% Pop Asian','% Pop Hispanic',
                               '% Pop Other','% Pop Low Income']]

# Display new dataframe and round values to two decimal places
df_food_pivot.round(2).head()
Out[142]:
Population PovertyRate MedianFamilyIncome % House Units No Vehicle % Pop Low Access (1/2 mile) % Pop Low Access (1 mile) % Pop Low Access (10 miles) % Pop Low Access (20 miles) % Pop White % Pop Black % Pop Asian % Pop Hispanic % Pop Other % Pop Low Income
State County
Alabama Autauga 54571 13.86 61082.92 5.35 88.91 66.83 12.88 0.0 78.53 17.67 0.87 2.40 2.45 34.25
Baldwin 182265 14.36 60664.94 3.06 90.54 72.23 1.60 0.0 85.67 9.38 0.74 4.38 3.49 34.72
Barbour 27457 24.53 43123.00 8.17 86.11 69.92 20.04 0.0 48.00 46.89 0.39 5.05 4.20 45.47
Bibb 22915 16.02 43362.50 3.76 95.49 82.56 1.29 0.0 75.85 22.02 0.10 1.77 1.69 44.18
Blount 57322 17.91 52136.00 3.81 96.57 91.14 3.41 0.0 92.58 1.33 0.20 8.07 5.29 39.89

Next, I standardized the USDSA dataframe for easy comparison to other dataframes in the future.

In [143]:
# Create a standardized version of dataframe 
df_food_standardized = (df_food_pivot-df_food_pivot.mean())/df_food_pivot.std()
df_food_standardized.head()
Out[143]:
Population PovertyRate MedianFamilyIncome % House Units No Vehicle % Pop Low Access (1/2 mile) % Pop Low Access (1 mile) % Pop Low Access (10 miles) % Pop Low Access (20 miles) % Pop White % Pop Black % Pop Asian % Pop Hispanic % Pop Other % Pop Low Income
State County
Alabama Autauga -0.139698 -0.476881 0.294447 -0.254612 0.444634 0.100300 0.023888 -0.241483 -0.257603 0.605499 -0.115364 -0.446136 -0.505646 -0.308021
Baldwin 0.268281 -0.400962 0.264806 -0.778672 0.584754 0.373792 -0.514367 -0.241483 0.165218 0.034185 -0.166833 -0.295747 -0.296299 -0.259494
Barbour -0.226327 1.144868 -0.979200 0.393177 0.203356 0.256685 0.365537 -0.241483 -2.064724 2.620298 -0.306421 -0.245217 -0.153446 0.863842
Bibb -0.240838 -0.147721 -0.962215 -0.618874 1.011454 0.897000 -0.529087 -0.241483 -0.416295 0.905732 -0.423591 -0.493791 -0.657281 0.728927
Blount -0.130909 0.138818 -0.340033 -0.606364 1.105210 1.331558 -0.427856 -0.241483 0.573968 -0.521356 -0.380463 -0.016433 0.066420 0.280905

County Health Data

This data was gathered at University of Wisconsin Population Health Institute. It represents 2010 County Health Rankings National Data.

In [144]:
# Read in data from University of Wisconsin Population Health Institute 
df_health = pd.read_csv("County_Health.csv", header=[1])
df_health.head()
Out[144]:
FIPS State County Unreliable Deaths Aggregate Population YPLL Rate 95% CI - Low 95% CI - High Quartile ... Quartile.26 Unreliable.5 Zip Codes with Healthy Food # Zip Codes % Healthy Food Quartile.27 Population.2 Liquor Stores Liquor Store Rate Quartile.28
0 1001.0 Alabama Autauga NaN 670.0 137881.0 9778.0 8786.0 10770.0 2 ... 1 NaN 2.0 8.0 25.0 4 49109.0 2.0 0.4 2
1 1003.0 Alabama Baldwin NaN 2148.0 449589.0 8222.0 7716.0 8727.0 1 ... 4 NaN 11.0 26.0 42.0 2 168233.0 13.0 0.8 3
2 1005.0 Alabama Barbour NaN 424.0 79450.0 10686.0 9263.0 12109.0 2 ... 1 NaN 2.0 5.0 40.0 2 28125.0 3.0 1.1 4
3 1007.0 Alabama Bibb NaN 373.0 60430.0 13070.0 11271.0 14868.0 4 ... 1 NaN 4.0 8.0 50.0 1 21341.0 1.0 0.5 2
4 1009.0 Alabama Blount NaN 787.0 155580.0 8930.0 8040.0 9819.0 1 ... 4 NaN 4.0 7.0 57.0 1 55811.0 NaN 0.0 1

5 rows × 133 columns

Tidy the County Health Rankings Data

I decided to only use a fraction of the parameters available in this dataset to start and make the data more manageable. I decided that the State and County were important for indexing purposes. In addition, percent of population reporting fair/poor health, number of physically unhealthy days, number of mentally unhealthy days, percent of population that smokes, obesity rate, and percent of children in poverty will be important parameters to consider.

In [145]:
# Select columns to keep in dataframe
health_cols = ['State','County','% Fair/Poor','Physically Unhealthy Days',
               'Mentally Unhealthy Days','% Smokers','% Obese','% Children in Poverty']

df_health = df_health[health_cols]

# Rename columns for easy interpretation
df_health.rename(columns={'% Fair/Poor':'% Fair/Poor Health'}, inplace= True)

# Remove columns missing county data
df_health = df_health[df_health['County'].notna()]

# Display dataframe and dtypes
display(df_health.head())
df_health.dtypes
State County % Fair/Poor Health Physically Unhealthy Days Mentally Unhealthy Days % Smokers % Obese % Children in Poverty
0 Alabama Autauga 26.0 5.5 4.1 28.0 30.0 14.0
1 Alabama Baldwin 13.0 3.6 4.1 23.0 25.0 15.0
2 Alabama Barbour 24.0 6.1 3.8 23.0 36.0 34.0
3 Alabama Bibb 18.0 4.2 5.3 NaN 32.0 24.0
4 Alabama Blount 25.0 5.6 4.5 23.0 32.0 19.0
Out[145]:
State                         object
County                        object
% Fair/Poor Health           float64
Physically Unhealthy Days    float64
Mentally Unhealthy Days      float64
% Smokers                    float64
% Obese                      float64
% Children in Poverty        float64
dtype: object

Next, I formatted the dataframe to be grouped by state and county. Then I created a pivot table to make the data more organized.

In [146]:
# Create pivot table to index by State and County
df_health_pivot = df_health.pivot_table(values=['% Fair/Poor Health','Physically Unhealthy Days',
                                       'Mentally Unhealthy Days','% Smokers','% Obese',
                                        '% Children in Poverty'],
                                       index=['State','County'])
df_health_pivot['% unhealthy days per month'] = (df_health_pivot['Physically Unhealthy Days']+df_health_pivot['Mentally Unhealthy Days'])/30

display(df_health_pivot.head())
% Children in Poverty % Fair/Poor Health % Obese % Smokers Mentally Unhealthy Days Physically Unhealthy Days % unhealthy days per month
State County
Alabama Autauga 14.0 26.0 30.0 28.0 4.1 5.5 0.320000
Baldwin 15.0 13.0 25.0 23.0 4.1 3.6 0.256667
Barbour 34.0 24.0 36.0 23.0 3.8 6.1 0.330000
Bibb 24.0 18.0 32.0 NaN 5.3 4.2 0.316667
Blount 19.0 25.0 32.0 23.0 4.5 5.6 0.336667

Then we create a standardized version of the County Health data for comparison with other dataframes later on.

In [147]:
# Create standardized version of df_health_pivot
df_health_standardized = (df_health_pivot-df_health_pivot.mean())/df_health_pivot.std()
df_health_standardized.head()
Out[147]:
% Children in Poverty % Fair/Poor Health % Obese % Smokers Mentally Unhealthy Days Physically Unhealthy Days % unhealthy days per month
State County
Alabama Autauga -0.794430 1.546098 0.457595 0.937950 0.625340 1.520199 1.211401
Baldwin -0.683192 -0.659851 -0.915977 0.096658 0.625340 -0.169571 0.231568
Barbour 1.430327 1.206722 2.105882 0.096658 0.333717 2.053811 1.366111
Bibb 0.317949 0.188591 1.007024 NaN 1.791828 0.364041 1.159830
Blount -0.238240 1.376410 1.007024 0.096658 1.014169 1.609134 1.469251

Combining USDA Food Access Data and County Health Data

In [148]:
# Using .merge() function to combine the two dataframes into one 
df_pivot = df_food_pivot.merge(df_health_pivot, left_on=['State','County'], right_on=['State','County'], how='left')

display(df_pivot.round(2))

# Using .concat() create a pivot table of all of the data
df_pivot_standardized = pd.concat([df_food_standardized, df_health_standardized],axis=1)
df_pivot_standardized = df_pivot_standardized[df_pivot_standardized['Population'].notna()]

display(df_pivot_standardized)
Population PovertyRate MedianFamilyIncome % House Units No Vehicle % Pop Low Access (1/2 mile) % Pop Low Access (1 mile) % Pop Low Access (10 miles) % Pop Low Access (20 miles) % Pop White % Pop Black ... % Pop Hispanic % Pop Other % Pop Low Income % Children in Poverty % Fair/Poor Health % Obese % Smokers Mentally Unhealthy Days Physically Unhealthy Days % unhealthy days per month
State County
Alabama Autauga 54571 13.86 61082.92 5.35 88.91 66.83 12.88 0.00 78.53 17.67 ... 2.40 2.45 34.25 14.0 26.0 30.0 28.0 4.1 5.5 0.32
Baldwin 182265 14.36 60664.94 3.06 90.54 72.23 1.60 0.00 85.67 9.38 ... 4.38 3.49 34.72 15.0 13.0 25.0 23.0 4.1 3.6 0.26
Barbour 27457 24.53 43123.00 8.17 86.11 69.92 20.04 0.00 48.00 46.89 ... 5.05 4.20 45.47 34.0 24.0 36.0 23.0 3.8 6.1 0.33
Bibb 22915 16.02 43362.50 3.76 95.49 82.56 1.29 0.00 75.85 22.02 ... 1.77 1.69 44.18 24.0 18.0 32.0 NaN 5.3 4.2 0.32
Blount 57322 17.91 52136.00 3.81 96.57 91.14 3.41 0.00 92.58 1.33 ... 8.07 5.29 39.89 19.0 25.0 32.0 23.0 4.5 5.6 0.34
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Wyoming Sweetwater 43806 11.84 81835.08 3.81 82.28 54.87 5.82 4.71 88.45 1.00 ... 15.27 8.72 25.55 9.0 14.0 29.0 28.0 4.0 3.6 0.25
Teton 21294 8.10 89555.25 1.59 86.60 71.93 7.88 2.69 88.39 0.23 ... 14.99 9.69 25.20 7.0 7.0 13.0 13.0 2.3 2.7 0.17
Uinta 21118 14.17 65312.67 3.44 85.44 58.64 3.59 0.06 92.40 0.26 ... 8.78 6.08 32.93 12.0 16.0 30.0 21.0 3.4 4.1 0.25
Washakie 8533 14.13 62213.67 7.22 58.14 33.63 10.57 8.56 91.35 0.26 ... 13.62 6.73 34.79 16.0 13.0 24.0 18.0 2.4 3.0 0.18
Weston 7208 12.10 75233.00 6.72 72.66 50.76 11.79 2.61 95.52 0.29 ... 3.00 2.61 28.59 12.0 16.0 29.0 23.0 3.1 3.1 0.21

3141 rows × 21 columns

Population PovertyRate MedianFamilyIncome % House Units No Vehicle % Pop Low Access (1/2 mile) % Pop Low Access (1 mile) % Pop Low Access (10 miles) % Pop Low Access (20 miles) % Pop White % Pop Black ... % Pop Hispanic % Pop Other % Pop Low Income % Children in Poverty % Fair/Poor Health % Obese % Smokers Mentally Unhealthy Days Physically Unhealthy Days % unhealthy days per month
State County
Alabama Autauga -0.139698 -0.476881 0.294447 -0.254612 0.444634 0.100300 0.023888 -0.241483 -0.257603 0.605499 ... -0.446136 -0.505646 -0.308021 -0.794430 1.546098 0.457595 0.937950 0.625340 1.520199 1.211401
Baldwin 0.268281 -0.400962 0.264806 -0.778672 0.584754 0.373792 -0.514367 -0.241483 0.165218 0.034185 ... -0.295747 -0.296299 -0.259494 -0.683192 -0.659851 -0.915977 0.096658 0.625340 -0.169571 0.231568
Barbour -0.226327 1.144868 -0.979200 0.393177 0.203356 0.256685 0.365537 -0.241483 -2.064724 2.620298 ... -0.245217 -0.153446 0.863842 1.430327 1.206722 2.105882 0.096658 0.333717 2.053811 1.366111
Bibb -0.240838 -0.147721 -0.962215 -0.618874 1.011454 0.897000 -0.529087 -0.241483 -0.416295 0.905732 ... -0.493791 -0.657281 0.728927 0.317949 0.188591 1.007024 NaN 1.791828 0.364041 1.159830
Blount -0.130909 0.138818 -0.340033 -0.606364 1.105210 1.331558 -0.427856 -0.241483 0.573968 -0.521356 ... -0.016433 0.066420 0.280905 -0.238240 1.376410 1.007024 0.096658 1.014169 1.609134 1.469251
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
Wyoming Sweetwater -0.174092 -0.783254 1.766110 -0.606877 -0.126406 -0.505493 -0.313067 0.079341 0.329782 -0.543953 ... 0.529209 0.757072 -1.217720 -1.350619 -0.490162 0.182881 0.937950 0.528132 -0.169571 0.179998
Teton -0.246018 -1.351689 2.313594 -1.116222 0.245193 0.358784 -0.214542 -0.058466 0.325802 -0.597028 ... 0.507673 0.952408 -1.253503 -1.573095 -1.677981 -4.212552 -1.585926 -1.124393 -0.969989 -1.160826
Uinta -0.246580 -0.430039 0.594405 -0.691598 0.145421 -0.314704 -0.419210 -0.237176 0.563657 -0.594937 ... 0.037664 0.225970 -0.445808 -1.016905 -0.150785 0.457595 -0.239859 -0.055112 0.275105 0.128427
Washakie -0.286789 -0.435103 0.374636 0.174916 -2.207276 -1.581543 -0.086276 0.341399 0.501304 -0.595117 ... 0.404013 0.356176 -0.251418 -0.571954 -0.659851 -1.190692 -0.744634 -1.027186 -0.703183 -0.954546
Weston -0.291022 -0.744008 1.297916 0.060824 -0.955841 -0.713677 -0.028255 -0.063418 0.748006 -0.592806 ... -0.400955 -0.473057 -0.899384 -1.016905 -0.150785 0.182881 0.096658 -0.346734 -0.614248 -0.541984

3141 rows × 21 columns

Exploratory Data Analysis

Correlation Matrices

To get a general understanding of our data and identify which parameters have the largest effect on a communities overall health, I decided to create a correlation matrix of all of the parameters.

In [149]:
# Find correlation between each variable
corr_df = df_pivot_standardized.corr()
display(corr_df)
Population PovertyRate MedianFamilyIncome % House Units No Vehicle % Pop Low Access (1/2 mile) % Pop Low Access (1 mile) % Pop Low Access (10 miles) % Pop Low Access (20 miles) % Pop White % Pop Black ... % Pop Hispanic % Pop Other % Pop Low Income % Children in Poverty % Fair/Poor Health % Obese % Smokers Mentally Unhealthy Days Physically Unhealthy Days % unhealthy days per month
Population 1.000000 -0.030277 0.252402 0.180973 -0.373280 -0.395089 -0.160021 -0.071463 -0.194805 0.080578 ... 0.191958 0.262420 -0.096868 -0.113666 -0.056954 -0.211916 -0.137536 -0.015736 -0.079866 -0.055267
PovertyRate -0.030277 1.000000 -0.693203 0.409257 0.119866 0.086868 -0.054087 -0.057718 -0.497119 0.469473 ... 0.095423 0.058463 0.850294 0.806606 0.599408 0.427828 0.346337 0.360869 0.457638 0.455845
MedianFamilyIncome 0.252402 -0.693203 1.000000 -0.186699 -0.297261 -0.328622 -0.080545 0.006171 0.177033 -0.260074 ... -0.000432 0.079778 -0.813431 -0.752778 -0.600189 -0.493518 -0.427333 -0.353888 -0.474926 -0.463659
% House Units No Vehicle 0.180973 0.409257 -0.186699 1.000000 -0.224198 -0.168975 -0.053422 0.016145 -0.472136 0.331081 ... 0.004916 0.040673 0.338743 0.363947 0.233834 0.186482 0.225384 0.110475 0.150408 0.144083
% Pop Low Access (1/2 mile) -0.373280 0.119866 -0.297261 -0.224198 1.000000 0.883369 0.265725 0.220670 0.080941 0.058566 ... -0.241924 -0.307676 0.178354 0.183363 0.257817 0.300796 0.298722 0.183616 0.260038 0.249765
% Pop Low Access (1 mile) -0.395089 0.086868 -0.328622 -0.168975 0.883369 1.000000 0.428435 0.306561 0.129673 -0.013425 ... -0.283062 -0.337020 0.199507 0.215153 0.267911 0.287381 0.274665 0.149543 0.252419 0.227916
% Pop Low Access (10 miles) -0.160021 -0.054087 -0.080545 -0.053422 0.265725 0.428435 1.000000 0.781635 0.039026 -0.140716 ... 0.004819 -0.030242 0.057510 0.115515 -0.121861 0.000783 -0.123302 -0.220183 -0.147692 -0.199274
% Pop Low Access (20 miles) -0.071463 -0.057718 0.006171 0.016145 0.220670 0.306561 0.781635 1.000000 0.009178 -0.134076 ... 0.051618 0.038280 0.019973 0.056578 -0.118638 -0.064793 -0.067277 -0.163595 -0.131807 -0.160566
% Pop White -0.194805 -0.497119 0.177033 -0.472136 0.080941 0.129673 0.039026 0.009178 1.000000 -0.790956 ... -0.164510 -0.315235 -0.382655 -0.436396 -0.244092 -0.326036 -0.065113 -0.060526 -0.066694 -0.069874
% Pop Black 0.080578 0.469473 -0.260074 0.331081 0.058566 -0.013425 -0.140716 -0.134076 -0.790956 1.000000 ... -0.097863 -0.075351 0.334237 0.420052 0.281557 0.456083 0.058962 0.114480 0.131700 0.136730
% Pop Asian 0.450583 -0.119340 0.424927 0.160344 -0.376314 -0.399709 -0.149212 -0.044451 -0.275437 0.023836 ... 0.150262 0.335506 -0.213186 -0.236856 -0.164437 -0.279357 -0.226901 -0.094769 -0.162722 -0.145473
% Pop Hispanic 0.191958 0.095423 -0.000432 0.004916 -0.241924 -0.283062 0.004819 0.051618 -0.164510 -0.097863 ... 1.000000 0.811778 0.158589 0.140592 0.062593 -0.284173 -0.201185 -0.044732 -0.024562 -0.038327
% Pop Other 0.262420 0.058463 0.079778 0.040673 -0.307676 -0.337020 -0.030242 0.038280 -0.315235 -0.075351 ... 0.811778 1.000000 0.110649 0.043319 0.027048 -0.289351 -0.166715 -0.023941 -0.046479 -0.040463
% Pop Low Income -0.096868 0.850294 -0.813431 0.338743 0.178354 0.199507 0.057510 0.019973 -0.382655 0.334237 ... 0.158589 0.110649 1.000000 0.831318 0.648684 0.418905 0.375860 0.363711 0.485567 0.474295
% Children in Poverty -0.113666 0.806606 -0.752778 0.363947 0.183363 0.215153 0.115515 0.056578 -0.436396 0.420052 ... 0.140592 0.043319 0.831318 1.000000 0.665417 0.455924 0.393383 0.371574 0.506382 0.491136
% Fair/Poor Health -0.056954 0.599408 -0.600189 0.233834 0.257817 0.267911 -0.121861 -0.118638 -0.244092 0.281557 ... 0.062593 0.027048 0.648684 0.665417 1.000000 0.484557 0.551725 0.600930 0.760157 0.752513
% Obese -0.211916 0.427828 -0.493518 0.186482 0.300796 0.287381 0.000783 -0.064793 -0.326036 0.456083 ... -0.284173 -0.289351 0.418905 0.455924 0.484557 1.000000 0.415832 0.250547 0.334227 0.327207
% Smokers -0.137536 0.346337 -0.427333 0.225384 0.298722 0.274665 -0.123302 -0.067277 -0.065113 0.058962 ... -0.201185 -0.166715 0.375860 0.393383 0.551725 0.415832 1.000000 0.480469 0.515111 0.541192
Mentally Unhealthy Days -0.015736 0.360869 -0.353888 0.110475 0.183616 0.149543 -0.220183 -0.163595 -0.060526 0.114480 ... -0.044732 -0.023941 0.363711 0.371574 0.600930 0.250547 0.480469 1.000000 0.624982 0.891750
Physically Unhealthy Days -0.079866 0.457638 -0.474926 0.150408 0.260038 0.252419 -0.147692 -0.131807 -0.066694 0.131700 ... -0.024562 -0.046479 0.485567 0.506382 0.760157 0.334227 0.515111 0.624982 1.000000 0.910589
% unhealthy days per month -0.055267 0.455845 -0.463659 0.144083 0.249765 0.227916 -0.199274 -0.160566 -0.069874 0.136730 ... -0.038327 -0.040463 0.474295 0.491136 0.752513 0.327207 0.541192 0.891750 0.910589 1.000000

21 rows × 21 columns

To display the correlation matrix in an easily interpretable way, I have created a heatmap of the correlation values below. On the right side of the heatmap is a legend explaining the different colors and their respective correlation values. The darker red colors represent a value closer to positive one, or perfectly positive association, while the darker blue colors represent a value closer to negative 1, or a perfectly negative association.

In [150]:
# Create heatmap to visually represent correlation matrix
fig, ax = plt.subplots(figsize=(20,20))
colormap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr_df, cmap=colormap, annot=True, fmt = ".2f")
plt.xticks(range(len(corr_df.columns)), corr_df.columns)
plt.yticks(range(len(corr_df.columns)), corr_df.columns)
plt.title("County Data Correlations")
plt.show()

Then to further investigate different measures of community health, I calculated the correlation values of each parameter when relating to obesity rates, average unhealthy days reported per month, and the percentage of individuals reporting fair to poor health in each county. The average unhealthy days parameter is calculated as the sum of the average physically unhealthy days per month and the average mentally unhealthy days per month of an individual, divided by 30 to get the percentage of the month an individual is mentally and/or physicallly unhealthy.

First, lets look at which parameters have the greatest correlation when using obesity rate as an indicator of community overall health.

Obesity Rate as Indicator of Community Health

In [151]:
# Display the correlation values in relation to obesity in order in a bar graph
obesity_corr_df = corr_df[['% Obese']]["% Obese"]
obesity_corr_df = obesity_corr_df.drop('% Obese')
obesity_corr_df = obesity_corr_df.drop('Physically Unhealthy Days')
obesity_corr_df = obesity_corr_df.drop('Mentally Unhealthy Days')
obesity_corr_df = obesity_corr_df.drop('% Fair/Poor Health')
obesity_corr_df = obesity_corr_df.drop('% unhealthy days per month')

# Sort the correlation values and print 
obesity_corr_df = obesity_corr_df.sort_values()
print(obesity_corr_df)

# Choose the greatest correlation parameters from all parameters considered 
greatest_obesity_corr_df = obesity_corr_df[(obesity_corr_df >= .3) | (obesity_corr_df <= -.3)]
greatest_obesity_corr_df = greatest_obesity_corr_df.sort_values()

# Plot all correlation values and the greatest correlation value parameters for Obesity rate
fig, ax = plt.subplots(1,2, figsize=(15,5))

obesity_corr_df.plot.bar(title = 'Correlation Values for Obesity Rate',ax=ax[0])
greatest_obesity_corr_df.plot.bar(title = 'Greatest Value Correlation Parameters for Obesity Rate',ax=ax[1], color='red')
MedianFamilyIncome            -0.493518
% Pop White                   -0.326036
% Pop Other                   -0.289351
% Pop Hispanic                -0.284173
% Pop Asian                   -0.279357
Population                    -0.211916
% Pop Low Access (20 miles)   -0.064793
% Pop Low Access (10 miles)    0.000783
% House Units No Vehicle       0.186482
% Pop Low Access (1 mile)      0.287381
% Pop Low Access (1/2 mile)    0.300796
% Smokers                      0.415832
% Pop Low Income               0.418905
PovertyRate                    0.427828
% Children in Poverty          0.455924
% Pop Black                    0.456083
Name: % Obese, dtype: float64
Out[151]:
<matplotlib.axes._subplots.AxesSubplot at 0x2324ed08>

The two bar graphs above suggest that median family income, percentage of the population that is White, percent of the population considered low access (at 1/2 mile), percentage of smokers, percent of population considered low income, the poverty rate, the percent of children in poverty, and the percentage of the population that is Black have the greatest correlations with obesity rate.

Next let us consider which parameters have the greatest correlation when using combined physically unhealthy and mentally unhealthy days per month as an indicator of community health.

% Unhealthy Days per Month as Indicator of Community Health

In [152]:
# Find correlation between each variable and the unhealthy days parameter.
corr_df2 = corr_df
corr_df2 = corr_df2.drop('Physically Unhealthy Days')
corr_df2 = corr_df2.drop('Mentally Unhealthy Days')
corr_df2 = corr_df2.drop('% Obese')
corr_df2 = corr_df2.drop('% Fair/Poor Health')

unhealthy_days_corr_df = corr_df2[['% unhealthy days per month']]["% unhealthy days per month"]
unhealthy_days_corr_df = unhealthy_days_corr_df.drop('% unhealthy days per month')

# Sort corrleation values and print
unhealthy_days_corr_df = unhealthy_days_corr_df.sort_values()
print(unhealthy_days_corr_df)

# Choose the greatest correlation parameters from all parameters considered 
greatest_unhealthy_days_corr_df = unhealthy_days_corr_df[(unhealthy_days_corr_df >= .2) | (unhealthy_days_corr_df <= -.2)]
greatest_unhealthy_days_corr_df = greatest_unhealthy_days_corr_df.sort_values()

# Display the correlation values in order in a bar graph
# Plot all correlation values and the greatest correlation value parameters for % unhealthy days per month
fig, ax = plt.subplots(1,2, figsize=(15,5))

unhealthy_days_corr_df.plot.bar(title = 'Correlation Values for % of Unhealthy Days per Month',ax=ax[0])
greatest_unhealthy_days_corr_df.plot.bar(title = 'Greatest Value Correlation Parameters for % of Unhealthy Days per Month',ax=ax[1], color='red')
MedianFamilyIncome            -0.463659
% Pop Low Access (10 miles)   -0.199274
% Pop Low Access (20 miles)   -0.160566
% Pop Asian                   -0.145473
% Pop White                   -0.069874
Population                    -0.055267
% Pop Other                   -0.040463
% Pop Hispanic                -0.038327
% Pop Black                    0.136730
% House Units No Vehicle       0.144083
% Pop Low Access (1 mile)      0.227916
% Pop Low Access (1/2 mile)    0.249765
PovertyRate                    0.455845
% Pop Low Income               0.474295
% Children in Poverty          0.491136
% Smokers                      0.541192
Name: % unhealthy days per month, dtype: float64
Out[152]:
<matplotlib.axes._subplots.AxesSubplot at 0x29d5b248>

The two bar graphs above suggest that median family income, the percent of the population that is low access (at 1 mile), the percent of the population that is low access (at 1/2 mile) the poverty rate, the percentage of the population that is low income, the percentage of children in poverty, and the percentage of smokers are the greatest indicators of that community's overall health.

Finally, lets calculate the correlation values when using the percent of a population reporting fair/poor health as the indicator of community health.

% Reporting Fair/Poor Health as Indicator of Community Health

In [153]:
# Find correlation between each variable and the unhealthy days parameter.
corr_df3 = corr_df
corr_df3 = corr_df3.drop('Physically Unhealthy Days')
corr_df3 = corr_df3.drop('Mentally Unhealthy Days')
corr_df3 = corr_df3.drop('% Obese')
corr_df3 = corr_df3.drop('% unhealthy days per month')

fair_poor_corr_df = corr_df3[['% Fair/Poor Health']]["% Fair/Poor Health"]
fair_poor_corr_df = fair_poor_corr_df.drop('% Fair/Poor Health')
fair_poor_corr_df = fair_poor_corr_df.sort_values()
print(fair_poor_corr_df)

# Choose the greatest correlation parameters from all parameters considered 
greatest_fair_poor_corr_df = fair_poor_corr_df[(fair_poor_corr_df >= .25) | (fair_poor_corr_df <= -.25)]
greatest_fair_poor_corr_df = greatest_fair_poor_corr_df.sort_values()

# Display the correlation values in order in a bar graph
# Plot all correlation values and the greatest correlation value parameters for % unhealthy days per month
fig, ax = plt.subplots(1,2, figsize=(15,5))

fair_poor_corr_df.plot.bar(title = 'Correlation Values for % Reporting Fair/Poor Health',ax=ax[0])
greatest_fair_poor_corr_df.plot.bar(title = 'Greatest Value Correlation Parameters for % Reporting Fair/Poor Health',ax=ax[1], color='red')
MedianFamilyIncome            -0.600189
% Pop White                   -0.244092
% Pop Asian                   -0.164437
% Pop Low Access (10 miles)   -0.121861
% Pop Low Access (20 miles)   -0.118638
Population                    -0.056954
% Pop Other                    0.027048
% Pop Hispanic                 0.062593
% House Units No Vehicle       0.233834
% Pop Low Access (1/2 mile)    0.257817
% Pop Low Access (1 mile)      0.267911
% Pop Black                    0.281557
% Smokers                      0.551725
PovertyRate                    0.599408
% Pop Low Income               0.648684
% Children in Poverty          0.665417
Name: % Fair/Poor Health, dtype: float64
Out[153]:
<matplotlib.axes._subplots.AxesSubplot at 0x29ca8fc8>

The two bar graphs above suggest that median family income, the percent of the population that is low access (at 1/2 mile), the percent of the population that is low access (at 1 mile), percent of population that is Black, the percentage of smokers, the poverty rate, the percent of the population that is considered low income, and the percentage of children in poverty are the greatest indicators of that community's overall health.

The common parameters for all three measures of community health are median family income, low access population percentages, poverty rates, and percent smokers.

Question 1: Does the distance between housing units and supermarkets and the frequency of supermarkets in a community impact community health? Is there higher or lower rates of obesity in areas with less access to healthy affordable food options?

To answer quesiton number 1, I decided to create scatter plots of the obesity rate and low access population percentage for each county. Each point in the scatter plot represents a county, and each subplot represents low access defined at either 1/2 mile, 1 mile, 10 miles, or 20 miles to the grocery store. The x-axis shows population percentage, and the y-axis shows obesity rate.

In [154]:
# Graph low access population vs obesity rate for all counties
fig, ax = plt.subplots(4,1, figsize=(10,18))
df_pivot.plot.scatter(x='% Pop Low Access (1/2 mile)', y='% Obese', ax=ax[0],title='Across America')
df_pivot.plot.scatter(x='% Pop Low Access (1 mile)', y='% Obese', color='orange',ax=ax[1])
df_pivot.plot.scatter(x='% Pop Low Access (10 miles)', y='% Obese', color='green',ax=ax[2])
df_pivot.plot.scatter(x='% Pop Low Access (20 miles)', y='% Obese',color='purple',ax=ax[3])
Out[154]:
<matplotlib.axes._subplots.AxesSubplot at 0x29e80388>

The above graphs show the obesity rate of each county in relation to the percentage of low access population at different mile demarcations. The greatest relationship can be seen between obesity rate and percent of low access population at 1/2 mile. While there is a positive linear association between the percent of low access population and obesity rate, it is not as significant as I would have expected. As the low access mile demarcations get larger, the relationship between obesity rate and percent of low access population gets less significant. This is probably due to the fact that a larger percentage of the population is considered when looking at areas 1/2 mile and 1 mile from a grocery store rather than areas 10 and 20 miles from a grocery store.

Below I chose a to show Florida as an example of this relationship. Each point in the scatter plot represents a county in Florida. As you can see by the regression line, there is a linear association between obesity rate and low access population percentage.

In [155]:
# show just florida 
fig, ax = plt.subplots(figsize=(8,6))
df_pivot.loc['Florida'].plot.scatter(x='% Pop Low Access (1 mile)',y='% Obese',title='Florida',ax=ax)

# plot regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(df_pivot.loc['Florida']['% Pop Low Access (1 mile)'],df_pivot.loc['Florida']['% Obese'])
pts = np.array(ax.get_xlim())
line = slope * pts + intercept
ax.plot(pts, line, lw=1, color='red')
Out[155]:
[<matplotlib.lines.Line2D at 0x1fc90888>]

Question 2: Do average income rates of a particular county impact overall health of the community members?

In [156]:
# Create bar graph showing median household incomes by state
fig, ax = plt.subplots(figsize = (20,10))
ax1 = df_pivot.groupby('State')['MedianFamilyIncome'].mean().sort_values().plot.bar(title = 'Median household income by state')
The bar graph above shows the median family income per state. It appears that Missippi and Arkansas have the lowest median income, while Washington D.C. and New Jersey have the highest median family incomes. Next, I will display obesity rates by state.
In [157]:
# Create bar graph showing obesity rates by state 
fig, ax = plt.subplots(figsize = (20,10))
ax2 = df_pivot.groupby('State')['% Obese'].mean().sort_values().plot.bar(title = '% Obesity by State')

The bar graph above reports the rate of obesity by state. This graph shows that Colorado and Rhode Island have the lowest rate of obesity and Alabama and Missippi have the highest rate of obesity.

Below I combined the previous bar graphs to show both the obesity rate and median family income per state, with two different y-axis ranges. On the left side, obesity rate ranges from 0 to 35% and on the right side, median family income ranges from $0 to $100,000.

In [158]:
# create labels for x axis of all the state abbreviations
labels = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

# create figure and axis objects with subplots()
fig,ax = plt.subplots(figsize=(20,10))

# make a plot
ax.bar(x - width/2, df_pivot.groupby('State')['% Obese'].mean(),width, label='Obesity',color='red')
# set x-axis label
ax.set_xlabel("State",fontsize=16)
# set y-axis label
ax.set_ylabel("Obesity Rate (%)",color="red",fontsize=16)
# twin object for two different y-axis on the sample plot
ax2=ax.twinx()
# make a plot with different y-axis using second axis object
ax2.bar(x + width/2, df_pivot.groupby('State')['MedianFamilyIncome'].mean(),width,label='Income',color='blue')
ax2.set_ylabel("Median Family Income($)",color="blue",fontsize=16)
ax.set_xticks(x)
ax.set_xticklabels(labels)
plt.show()

Next, I wanted to show the relationship in a scatter plot. Each point represents a states median family income and obesity rate.

In [159]:
# create dataframe of just median family income and obesity rate
state_income_obesity = df_pivot[['MedianFamilyIncome','% Obese']].groupby('State').mean()

# create scatter plot point labels
labels = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]

fig, ax = plt.subplots(figsize=(12,8))
income = state_income_obesity['MedianFamilyIncome']
obese = state_income_obesity['% Obese']

plt.scatter(x=income,y=obese)
ax.set_xlabel('Median Family Income ($)')
ax.set_ylabel('Obesity Rate (%)')
ax.set_title('Median Family Income vs Obesity Rate in America')


# Label each point with its correct State
for i, txt in enumerate(labels):
    ax.annotate(txt, (income[i], obese[i]), fontsize=12)

# Plot a linear regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(income,obese)
pts = np.array(ax.get_xlim())
line = slope * pts + intercept
ax.plot(pts, line, lw=1, color='red')
Out[159]:
[<matplotlib.lines.Line2D at 0x19f24748>]

The above scatter plot gives further insight into the relationship between income and obesity rate. As we can see from the linear regression line, there is an obvious negative linear relationship between income and obesity. The graph shows that most states follow this relationship, with Missippi as the lowest median income and highest obesity rate, and DC with the highest income and lowest obesity combination. However, there are some outliers such as Colorado. Colorado shows a very low obesity rate and an average median family income compared to other states.

Question 3: Does access to transportation have an impact on a community’s health?

The next question I wanted to address was whether or not access to transportation has an effect on community health, measured by county obesity rate.

In [160]:
fig, ax = plt.subplots(1,2,figsize=(16,6))
df_pivot.plot.scatter(x='% House Units No Vehicle', y='% Obese', ax=ax[0],title='Across America')

# show just Florida 
df_pivot.loc['Florida'].plot.scatter(x='% House Units No Vehicle',y='% Obese',title='Florida',ax=ax[1])

# show linear regression line for Florida
slope, intercept, r_value, p_value, std_err = stats.linregress(df_pivot.loc['Florida']['% House Units No Vehicle'],df_pivot.loc['Florida']['% Obese'])
pts = np.array(ax[1].get_xlim())
line = slope * pts + intercept
plt.plot(pts, line, lw=1, color='red')
Out[160]:
[<matplotlib.lines.Line2D at 0x23339ac8>]

As we can see there is not a very significant relationship between the percent of houses without a vehicle and the obesity rate in a county. Most counties fall in a 0 to 20% range of houses without access to transportation, while obesity rate ranges from 0 to 45% regardless of house units without vehicle access. I expected transportation access to have a greater effect on community health.

Question 4: Is there a relationship between median family income and areas with less access to healthy affordable food options?

Finally, I wanted to analyze the relationship between income and access to healthy and affordable food for each county in the dataset. To look further into this correlation I plotted low access population versus median family income. Each scatter plot represents low access definitions at 1/2 mile, 1 mile, 10 miles, and 20 miles.

In [161]:
# Graph low access population vs median family income for all counties
fig, ax = plt.subplots(4,1, figsize=(10,18))
df_pivot.plot.scatter(x='% Pop Low Access (1/2 mile)', y='MedianFamilyIncome', ax=ax[0],title='Across America')
df_pivot.plot.scatter(x='% Pop Low Access (1 mile)', y='MedianFamilyIncome', color='orange',ax=ax[1])
df_pivot.plot.scatter(x='% Pop Low Access (10 miles)', y='MedianFamilyIncome', color='green',ax=ax[2])
df_pivot.plot.scatter(x='% Pop Low Access (20 miles)', y='MedianFamilyIncome',color='purple',ax=ax[3])
Out[161]:
<matplotlib.axes._subplots.AxesSubplot at 0x1fe897c8>

When looking at low access defined at 1/2 mile and 1 mile, we can see a linear association between income and access to healthy food. As income decreases, access to healthy food also decreases for counties across America. There is a less significant relationship for low access populations defined at 10 miles and 20 miles.

Next, I grouped the dataframe by state and plotted each state in its own subplot to show the relationship between median family income and low access populations measured at 1 mile from a grocery store.

In [162]:
# Create pivot table of median family income and % population low access at 1 mile mark
df_state = df_pivot[['MedianFamilyIncome','% Pop Low Access (1 mile)']]

# create list of state names to loop through 
state_list = df_food.State.unique()

# create subplot to populate
fig, ax = plt.subplots(nrows=17, ncols=3, figsize=(35,130))
plt.subplots_adjust(hspace=0.3)
plt.subplots_adjust(wspace=0.3)

# create subplot of scatter plots of median family income vs healthy food access per state
for i in range(len(state_list)):
    df_state.loc[state_list[i]].plot.scatter(x='% Pop Low Access (1 mile)',y='MedianFamilyIncome',ax=ax[i // 3][i % 3], title=state_list[i],fontsize=18)

The subplots above show the relationship between median family income and access to healthy food per state. Each dot on the scatter plots represents a county in that given state. These graphs are useful in visualizing whether or not there is a relationship between a household's income and their access to healthy food.

Below I show a good example of this relationship by displaying just Florida. The regression line shows the negative relationship between income and low access population percentage.

In [163]:
# show just florida 
fig, ax = plt.subplots(figsize=(8,6))
df_pivot.loc['Florida'].plot.scatter(x='% Pop Low Access (1 mile)',y='MedianFamilyIncome',title='Florida',ax=ax)
slope, intercept, r_value, p_value, std_err = stats.linregress(df_pivot.loc['Florida']['% Pop Low Access (1 mile)'],df_pivot.loc['Florida']['MedianFamilyIncome'])
pts = np.array(ax.get_xlim())
line = slope * pts + intercept
ax.plot(pts, line, lw=1, color='red')
Out[163]:
[<matplotlib.lines.Line2D at 0x22f3acc8>]

Conclusions

In conclusion, we have found answers to our research questions and identified the greatest predictors of community health. From the data analysis above, we can conclude that counties with greater low access populations also have higher obesity rates, particularly when looking at 1/2 mile and 1 mile distances as a measure of low access communities. Areas with less access to healthy affordable food options report higher rates of obesity. We also found evidence to support the claim that communities with lower median incomes report higher obesity rates, more unhealthy days, and higher populations in fair-poor health. However, the data does not provide evidence to support my hypothesis that access to transportation is a good predictor of community health. The data shows that access to transportation did not have a significant impact on community health. Finally, we can conclude that there is a linear association between income and access to healthy food. This finding was not surprising given that about half of low access communities in America are also below the poverty line.

I believe that the findings in this tutorial prove the importance of the issue of food deserts in America. People all accross America need access to healthy affordable food, regardless of income, or location. It is my hope that the research and work going into closing the 'grocery gap' will eventually lead to greater general community health and greater access to healthy food for all populations.

Additional Sources

If you are interested in further research into food deserts and their effects on America's health, I have listed various websites that I used for statistics and project motivation. Thank you for your interest in helping make America healthier and food more accessible!

Special thanks to Professor Mattei, Sri, and Eli throughout the semester!