Final Tutorial: Food Access Across America: How Food Deserts Affect Community Health¶

Analysis by Erin Shanley¶

Data Source¶

USDA Food Access Research Atlas: https://www.ers.usda.gov/data-products/food-access-research-atlas/download-the-data/

Contains 2010 Census Data regarding population demographics, food access, poverty rate, median household incomes, etc.

University of Wisconsin Population Health Institute: 2010 County Health Rankings National Data https://www.countyhealthrankings.org/sites/default/files/2010%20County%20Health%20Rankings%20National%20Data_v2.xls

Contains county health data such as obesity rates, average unhealthy days reported by individuals, health quality reported by individuals, etc.

Project Explanation and Importance¶

Since I am a graduate student, to fulfill the extra requirement I will be working alone on this project. The data sets I plan on working with come from the US Department of Agriculture and The University of Wisconsin Population Health Institute. The particular sets of data I will look at focus on food access across the US and contains demographic information on different counties in the US and their ability to access supermarkets, other healthy and affordable food sources, overall mental health, physical health, and obsesity rates.

While there are many ways to define which areas are considered "food deserts" and many ways to measure food store access for individuals and for neighborhoods, I have chosen a few initial parameters to begin analyzing. The USDA defines a food desert as 'an area that has limited access to affordable and nutritious food, in terms of income and distance to a grocery store. Food deserts are a national crisis in America, as nearly 23.5 million Americans currently live in food deserts. In addition, the USDA estimates that people living in food deserts are 30% more likely to become obese.

For my analysis, I will focus on median family income, poverty rate, percent of populations considered low access at 1/2 mile, 1 mile, 10 miles, and 20 miles distance between housing units and grocery stores, frequency of grocery stores, access to transportation, racial demographics, and the percentage of children living in poverty. We will measure community health by obesity rates per county, reported unhealthy days by individuals, and percent of individuals reporting fair to poor health.

It is important to note a few definitions provided by the data. 'Low Access' refers to accessibility to sources of healthy food, as measured by distance to a store (at ½ mile, 1 mile, 10 miles, 20 miles). 'Low Income' refers to an area where the poverty rate is greater than 20 percent or the median family income is less than or equal to 80 percent of the State-wide median family income. Then a food desert is determined by the combination of both a low access and low income community. The data provides a spatial overview of food access indicators for low income vs other census tracts using measures of 1-mile, 10-mile and 20-mile demarcations to the nearest supermarket, and vehicle availability for all census tracts.

Questions to Address¶

Does the distance between housing units and supermarkets and the frequency of supermarkets in a community impact community health? Is there higher or lower rates of obesity in areas with less access to healthy affordable food options?
Do average income rates of a particular county impact overall health of the community members?
Does access to transportation have an impact on a community’s health?
Is there a relationship between median family income and areas with less access to healthy affordable food options?

Hypothesis¶

Communities with lower median family incomes, a higher percentage of low access population, less access to transportation, and higher poverty rates are going to experience the poorest health conditions.

Reading in Data¶

USDA Food Access Research Atlas Data¶

Note that this data refers to the 2010 census.

# Read in neccessary libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from scipy import stats
from matplotlib import cm

# Read in USDA Food Access Research Atlas Data
df_food = pd.read_csv("USDA_data.csv")
df_food.head()

Tidy USDA Food Access Research Atlas Data¶

Next, we need to tidy the data imported from the USDA Food Access Research Atlas. Below is a tidied dataframe with columns named more appropriately and unneccessary columns of data removed. This new dataframe displays data surrounding county population, total number of housing units, number of houses without vehicle access, median household income, poverty rate, and low access population percentages when measured at 1 mile, 10 miles, and 20 miles from a grocery store, county populations based on race, and the percent of the population considered to be low income.

# Tidy data by creating new dataframe containing only the relevant parameters
# Create a list of all columns to keep 
food_cols = ["State","County","POP2010","OHU2010","PovertyRate","MedianFamilyIncome",
             "lapophalf","lapop1","lapop10","lapop20","TractHUNV","TractLOWI","TractWhite",
            "TractBlack","TractAsian","TractHispanic","TractOMultir"]
df_food = df_food[food_cols]

# Rename columns appropriately
df_food.rename(columns={"POP2010": "Population", "OHU2010": "TotalHouseUnits","TractHUNV":"HouseUnitsNoVehicle",
                        "TractWhite":"PopWhite","TractBlack":"PopBlack","TractAsian":"PopAsian","TractHispanic":"PopHispanic",
                        "TractOMultir":"PopOther","TractLOWI":"PopLowIncome"}, inplace=True)

display(df_food.head())

# Check for correct data types
df_food.dtypes

State                   object
County                  object
Population               int64
TotalHouseUnits          int64
PovertyRate            float64
MedianFamilyIncome       int64
lapophalf              float64
lapop1                 float64
lapop10                float64
lapop20                float64
HouseUnitsNoVehicle      int64
PopLowIncome             int64
PopWhite                 int64
PopBlack                 int64
PopAsian                 int64
PopHispanic              int64
PopOther                 int64
dtype: object

Below I have created a pivot table indexed on state and county to organize the above dataframe into a more manageable format. Then I edited the low access and race columns to represent percentage of the total population, rather than a sum of individuals. This was achieved by creating new columns and dividing by the overall population, then multiplying by 100.

# Create a pivot table to organize data by State and County
df_food_pivot = df_food.pivot_table(values=['Population','TotalHouseUnits','PovertyRate','MedianFamilyIncome','lapophalf',
                                            'lapop1','lapop10','lapop20','HouseUnitsNoVehicle','PopLowIncome','PopWhite',
                                           'PopBlack','PopAsian','PopHispanic','PopOther'],
                                   index=['State','County'],
                                   aggfunc={'Population':np.sum,'TotalHouseUnits':np.sum,'HouseUnitsNoVehicle':np.sum, 
                                            'PovertyRate':np.mean,'MedianFamilyIncome':np.mean,'lapophalf':np.sum,
                                            'lapop1':np.sum,'lapop10':np.sum,'lapop20':np.sum,'PopWhite':np.sum,
                                            'PopBlack':np.sum,'PopAsian':np.sum,'PopHispanic':np.sum,'PopOther':np.sum,
                                            'PopLowIncome':np.sum})

# Edit columns to show % of total population for each identifier
df_food_pivot['% House Units No Vehicle']=(df_food_pivot['HouseUnitsNoVehicle']/df_food_pivot['TotalHouseUnits'])*100
df_food_pivot['% Pop Low Access (1/2 mile)'] = (df_food_pivot['lapophalf']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Low Access (1 mile)'] = (df_food_pivot['lapop1']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Low Access (10 miles)'] = (df_food_pivot['lapop10']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Low Access (20 miles)'] = (df_food_pivot['lapop20']/df_food_pivot['Population'])*100
df_food_pivot['% Pop White'] = (df_food_pivot['PopWhite']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Black'] = (df_food_pivot['PopBlack']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Asian'] = (df_food_pivot['PopAsian']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Hispanic'] = (df_food_pivot['PopHispanic']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Other'] = (df_food_pivot['PopOther']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Low Income'] = (df_food_pivot['PopLowIncome']/df_food_pivot['Population'])*100

# Keep only columns specified below
df_food_pivot = df_food_pivot[['Population','PovertyRate','MedianFamilyIncome','% House Units No Vehicle',
                               '% Pop Low Access (1/2 mile)','% Pop Low Access (1 mile)','% Pop Low Access (10 miles)',
                               '% Pop Low Access (20 miles)','% Pop White','% Pop Black','% Pop Asian','% Pop Hispanic',
                               '% Pop Other','% Pop Low Income']]

# Display new dataframe and round values to two decimal places
df_food_pivot.round(2).head()

Next, I standardized the USDSA dataframe for easy comparison to other dataframes in the future.

# Create a standardized version of dataframe 
df_food_standardized = (df_food_pivot-df_food_pivot.mean())/df_food_pivot.std()
df_food_standardized.head()

County Health Data¶

This data was gathered at University of Wisconsin Population Health Institute. It represents 2010 County Health Rankings National Data.

# Read in data from University of Wisconsin Population Health Institute 
df_health = pd.read_csv("County_Health.csv", header=[1])
df_health.head()

Tidy the County Health Rankings Data¶

I decided to only use a fraction of the parameters available in this dataset to start and make the data more manageable. I decided that the State and County were important for indexing purposes. In addition, percent of population reporting fair/poor health, number of physically unhealthy days, number of mentally unhealthy days, percent of population that smokes, obesity rate, and percent of children in poverty will be important parameters to consider.

# Select columns to keep in dataframe
health_cols = ['State','County','% Fair/Poor','Physically Unhealthy Days',
               'Mentally Unhealthy Days','% Smokers','% Obese','% Children in Poverty']

df_health = df_health[health_cols]

# Rename columns for easy interpretation
df_health.rename(columns={'% Fair/Poor':'% Fair/Poor Health'}, inplace= True)

# Remove columns missing county data
df_health = df_health[df_health['County'].notna()]

# Display dataframe and dtypes
display(df_health.head())
df_health.dtypes

State                         object
County                        object
% Fair/Poor Health           float64
Physically Unhealthy Days    float64
Mentally Unhealthy Days      float64
% Smokers                    float64
% Obese                      float64
% Children in Poverty        float64
dtype: object

Next, I formatted the dataframe to be grouped by state and county. Then I created a pivot table to make the data more organized.

# Create pivot table to index by State and County
df_health_pivot = df_health.pivot_table(values=['% Fair/Poor Health','Physically Unhealthy Days',
                                       'Mentally Unhealthy Days','% Smokers','% Obese',
                                        '% Children in Poverty'],
                                       index=['State','County'])
df_health_pivot['% unhealthy days per month'] = (df_health_pivot['Physically Unhealthy Days']+df_health_pivot['Mentally Unhealthy Days'])/30

display(df_health_pivot.head())

Then we create a standardized version of the County Health data for comparison with other dataframes later on.

# Create standardized version of df_health_pivot
df_health_standardized = (df_health_pivot-df_health_pivot.mean())/df_health_pivot.std()
df_health_standardized.head()

Combining USDA Food Access Data and County Health Data¶

# Using .merge() function to combine the two dataframes into one 
df_pivot = df_food_pivot.merge(df_health_pivot, left_on=['State','County'], right_on=['State','County'], how='left')

display(df_pivot.round(2))

# Using .concat() create a pivot table of all of the data
df_pivot_standardized = pd.concat([df_food_standardized, df_health_standardized],axis=1)
df_pivot_standardized = df_pivot_standardized[df_pivot_standardized['Population'].notna()]

display(df_pivot_standardized)

Exploratory Data Analysis¶

Correlation Matrices¶

To get a general understanding of our data and identify which parameters have the largest effect on a communities overall health, I decided to create a correlation matrix of all of the parameters.

# Find correlation between each variable
corr_df = df_pivot_standardized.corr()
display(corr_df)

To display the correlation matrix in an easily interpretable way, I have created a heatmap of the correlation values below. On the right side of the heatmap is a legend explaining the different colors and their respective correlation values. The darker red colors represent a value closer to positive one, or perfectly positive association, while the darker blue colors represent a value closer to negative 1, or a perfectly negative association.

# Create heatmap to visually represent correlation matrix
fig, ax = plt.subplots(figsize=(20,20))
colormap = sns.diverging_palette(220, 10, as_cmap=True)
sns.heatmap(corr_df, cmap=colormap, annot=True, fmt = ".2f")
plt.xticks(range(len(corr_df.columns)), corr_df.columns)
plt.yticks(range(len(corr_df.columns)), corr_df.columns)
plt.title("County Data Correlations")
plt.show()

Then to further investigate different measures of community health, I calculated the correlation values of each parameter when relating to obesity rates, average unhealthy days reported per month, and the percentage of individuals reporting fair to poor health in each county. The average unhealthy days parameter is calculated as the sum of the average physically unhealthy days per month and the average mentally unhealthy days per month of an individual, divided by 30 to get the percentage of the month an individual is mentally and/or physicallly unhealthy.

First, lets look at which parameters have the greatest correlation when using obesity rate as an indicator of community overall health.

Obesity Rate as Indicator of Community Health¶

# Display the correlation values in relation to obesity in order in a bar graph
obesity_corr_df = corr_df[['% Obese']]["% Obese"]
obesity_corr_df = obesity_corr_df.drop('% Obese')
obesity_corr_df = obesity_corr_df.drop('Physically Unhealthy Days')
obesity_corr_df = obesity_corr_df.drop('Mentally Unhealthy Days')
obesity_corr_df = obesity_corr_df.drop('% Fair/Poor Health')
obesity_corr_df = obesity_corr_df.drop('% unhealthy days per month')

# Sort the correlation values and print 
obesity_corr_df = obesity_corr_df.sort_values()
print(obesity_corr_df)

# Choose the greatest correlation parameters from all parameters considered 
greatest_obesity_corr_df = obesity_corr_df[(obesity_corr_df >= .3) | (obesity_corr_df <= -.3)]
greatest_obesity_corr_df = greatest_obesity_corr_df.sort_values()

# Plot all correlation values and the greatest correlation value parameters for Obesity rate
fig, ax = plt.subplots(1,2, figsize=(15,5))

obesity_corr_df.plot.bar(title = 'Correlation Values for Obesity Rate',ax=ax[0])
greatest_obesity_corr_df.plot.bar(title = 'Greatest Value Correlation Parameters for Obesity Rate',ax=ax[1], color='red')

MedianFamilyIncome            -0.493518
% Pop White                   -0.326036
% Pop Other                   -0.289351
% Pop Hispanic                -0.284173
% Pop Asian                   -0.279357
Population                    -0.211916
% Pop Low Access (20 miles)   -0.064793
% Pop Low Access (10 miles)    0.000783
% House Units No Vehicle       0.186482
% Pop Low Access (1 mile)      0.287381
% Pop Low Access (1/2 mile)    0.300796
% Smokers                      0.415832
% Pop Low Income               0.418905
PovertyRate                    0.427828
% Children in Poverty          0.455924
% Pop Black                    0.456083
Name: % Obese, dtype: float64

<matplotlib.axes._subplots.AxesSubplot at 0x2324ed08>

The two bar graphs above suggest that median family income, percentage of the population that is White, percent of the population considered low access (at 1/2 mile), percentage of smokers, percent of population considered low income, the poverty rate, the percent of children in poverty, and the percentage of the population that is Black have the greatest correlations with obesity rate.

Next let us consider which parameters have the greatest correlation when using combined physically unhealthy and mentally unhealthy days per month as an indicator of community health.

% Unhealthy Days per Month as Indicator of Community Health¶

# Find correlation between each variable and the unhealthy days parameter.
corr_df2 = corr_df
corr_df2 = corr_df2.drop('Physically Unhealthy Days')
corr_df2 = corr_df2.drop('Mentally Unhealthy Days')
corr_df2 = corr_df2.drop('% Obese')
corr_df2 = corr_df2.drop('% Fair/Poor Health')

unhealthy_days_corr_df = corr_df2[['% unhealthy days per month']]["% unhealthy days per month"]
unhealthy_days_corr_df = unhealthy_days_corr_df.drop('% unhealthy days per month')

# Sort corrleation values and print
unhealthy_days_corr_df = unhealthy_days_corr_df.sort_values()
print(unhealthy_days_corr_df)

# Choose the greatest correlation parameters from all parameters considered 
greatest_unhealthy_days_corr_df = unhealthy_days_corr_df[(unhealthy_days_corr_df >= .2) | (unhealthy_days_corr_df <= -.2)]
greatest_unhealthy_days_corr_df = greatest_unhealthy_days_corr_df.sort_values()

# Display the correlation values in order in a bar graph
# Plot all correlation values and the greatest correlation value parameters for % unhealthy days per month
fig, ax = plt.subplots(1,2, figsize=(15,5))

unhealthy_days_corr_df.plot.bar(title = 'Correlation Values for % of Unhealthy Days per Month',ax=ax[0])
greatest_unhealthy_days_corr_df.plot.bar(title = 'Greatest Value Correlation Parameters for % of Unhealthy Days per Month',ax=ax[1], color='red')

MedianFamilyIncome            -0.463659
% Pop Low Access (10 miles)   -0.199274
% Pop Low Access (20 miles)   -0.160566
% Pop Asian                   -0.145473
% Pop White                   -0.069874
Population                    -0.055267
% Pop Other                   -0.040463
% Pop Hispanic                -0.038327
% Pop Black                    0.136730
% House Units No Vehicle       0.144083
% Pop Low Access (1 mile)      0.227916
% Pop Low Access (1/2 mile)    0.249765
PovertyRate                    0.455845
% Pop Low Income               0.474295
% Children in Poverty          0.491136
% Smokers                      0.541192
Name: % unhealthy days per month, dtype: float64

<matplotlib.axes._subplots.AxesSubplot at 0x29d5b248>

The two bar graphs above suggest that median family income, the percent of the population that is low access (at 1 mile), the percent of the population that is low access (at 1/2 mile) the poverty rate, the percentage of the population that is low income, the percentage of children in poverty, and the percentage of smokers are the greatest indicators of that community's overall health.

Finally, lets calculate the correlation values when using the percent of a population reporting fair/poor health as the indicator of community health.

% Reporting Fair/Poor Health as Indicator of Community Health¶

# Find correlation between each variable and the unhealthy days parameter.
corr_df3 = corr_df
corr_df3 = corr_df3.drop('Physically Unhealthy Days')
corr_df3 = corr_df3.drop('Mentally Unhealthy Days')
corr_df3 = corr_df3.drop('% Obese')
corr_df3 = corr_df3.drop('% unhealthy days per month')

fair_poor_corr_df = corr_df3[['% Fair/Poor Health']]["% Fair/Poor Health"]
fair_poor_corr_df = fair_poor_corr_df.drop('% Fair/Poor Health')
fair_poor_corr_df = fair_poor_corr_df.sort_values()
print(fair_poor_corr_df)

# Choose the greatest correlation parameters from all parameters considered 
greatest_fair_poor_corr_df = fair_poor_corr_df[(fair_poor_corr_df >= .25) | (fair_poor_corr_df <= -.25)]
greatest_fair_poor_corr_df = greatest_fair_poor_corr_df.sort_values()

# Display the correlation values in order in a bar graph
# Plot all correlation values and the greatest correlation value parameters for % unhealthy days per month
fig, ax = plt.subplots(1,2, figsize=(15,5))

fair_poor_corr_df.plot.bar(title = 'Correlation Values for % Reporting Fair/Poor Health',ax=ax[0])
greatest_fair_poor_corr_df.plot.bar(title = 'Greatest Value Correlation Parameters for % Reporting Fair/Poor Health',ax=ax[1], color='red')

MedianFamilyIncome            -0.600189
% Pop White                   -0.244092
% Pop Asian                   -0.164437
% Pop Low Access (10 miles)   -0.121861
% Pop Low Access (20 miles)   -0.118638
Population                    -0.056954
% Pop Other                    0.027048
% Pop Hispanic                 0.062593
% House Units No Vehicle       0.233834
% Pop Low Access (1/2 mile)    0.257817
% Pop Low Access (1 mile)      0.267911
% Pop Black                    0.281557
% Smokers                      0.551725
PovertyRate                    0.599408
% Pop Low Income               0.648684
% Children in Poverty          0.665417
Name: % Fair/Poor Health, dtype: float64

<matplotlib.axes._subplots.AxesSubplot at 0x29ca8fc8>

The two bar graphs above suggest that median family income, the percent of the population that is low access (at 1/2 mile), the percent of the population that is low access (at 1 mile), percent of population that is Black, the percentage of smokers, the poverty rate, the percent of the population that is considered low income, and the percentage of children in poverty are the greatest indicators of that community's overall health.

The common parameters for all three measures of community health are median family income, low access population percentages, poverty rates, and percent smokers.

Question 1: Does the distance between housing units and supermarkets and the frequency of supermarkets in a community impact community health? Is there higher or lower rates of obesity in areas with less access to healthy affordable food options?¶

To answer quesiton number 1, I decided to create scatter plots of the obesity rate and low access population percentage for each county. Each point in the scatter plot represents a county, and each subplot represents low access defined at either 1/2 mile, 1 mile, 10 miles, or 20 miles to the grocery store. The x-axis shows population percentage, and the y-axis shows obesity rate.

# Graph low access population vs obesity rate for all counties
fig, ax = plt.subplots(4,1, figsize=(10,18))
df_pivot.plot.scatter(x='% Pop Low Access (1/2 mile)', y='% Obese', ax=ax[0],title='Across America')
df_pivot.plot.scatter(x='% Pop Low Access (1 mile)', y='% Obese', color='orange',ax=ax[1])
df_pivot.plot.scatter(x='% Pop Low Access (10 miles)', y='% Obese', color='green',ax=ax[2])
df_pivot.plot.scatter(x='% Pop Low Access (20 miles)', y='% Obese',color='purple',ax=ax[3])

<matplotlib.axes._subplots.AxesSubplot at 0x29e80388>

The above graphs show the obesity rate of each county in relation to the percentage of low access population at different mile demarcations. The greatest relationship can be seen between obesity rate and percent of low access population at 1/2 mile. While there is a positive linear association between the percent of low access population and obesity rate, it is not as significant as I would have expected. As the low access mile demarcations get larger, the relationship between obesity rate and percent of low access population gets less significant. This is probably due to the fact that a larger percentage of the population is considered when looking at areas 1/2 mile and 1 mile from a grocery store rather than areas 10 and 20 miles from a grocery store.

Below I chose a to show Florida as an example of this relationship. Each point in the scatter plot represents a county in Florida. As you can see by the regression line, there is a linear association between obesity rate and low access population percentage.

# show just florida 
fig, ax = plt.subplots(figsize=(8,6))
df_pivot.loc['Florida'].plot.scatter(x='% Pop Low Access (1 mile)',y='% Obese',title='Florida',ax=ax)

# plot regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(df_pivot.loc['Florida']['% Pop Low Access (1 mile)'],df_pivot.loc['Florida']['% Obese'])
pts = np.array(ax.get_xlim())
line = slope * pts + intercept
ax.plot(pts, line, lw=1, color='red')

[<matplotlib.lines.Line2D at 0x1fc90888>]

Question 2: Do average income rates of a particular county impact overall health of the community members?¶

# Create bar graph showing median household incomes by state
fig, ax = plt.subplots(figsize = (20,10))
ax1 = df_pivot.groupby('State')['MedianFamilyIncome'].mean().sort_values().plot.bar(title = 'Median household income by state')

# Create bar graph showing obesity rates by state 
fig, ax = plt.subplots(figsize = (20,10))
ax2 = df_pivot.groupby('State')['% Obese'].mean().sort_values().plot.bar(title = '% Obesity by State')

The bar graph above reports the rate of obesity by state. This graph shows that Colorado and Rhode Island have the lowest rate of obesity and Alabama and Missippi have the highest rate of obesity.

Below I combined the previous bar graphs to show both the obesity rate and median family income per state, with two different y-axis ranges. On the left side, obesity rate ranges from 0 to 35% and on the right side, median family income ranges from $0 to $100,000.

# create labels for x axis of all the state abbreviations
labels = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
x = np.arange(len(labels))  # the label locations
width = 0.35  # the width of the bars

# create figure and axis objects with subplots()
fig,ax = plt.subplots(figsize=(20,10))

# make a plot
ax.bar(x - width/2, df_pivot.groupby('State')['% Obese'].mean(),width, label='Obesity',color='red')
# set x-axis label
ax.set_xlabel("State",fontsize=16)
# set y-axis label
ax.set_ylabel("Obesity Rate (%)",color="red",fontsize=16)
# twin object for two different y-axis on the sample plot
ax2=ax.twinx()
# make a plot with different y-axis using second axis object
ax2.bar(x + width/2, df_pivot.groupby('State')['MedianFamilyIncome'].mean(),width,label='Income',color='blue')
ax2.set_ylabel("Median Family Income($)",color="blue",fontsize=16)
ax.set_xticks(x)
ax.set_xticklabels(labels)
plt.show()

Next, I wanted to show the relationship in a scatter plot. Each point represents a states median family income and obesity rate.

# create dataframe of just median family income and obesity rate
state_income_obesity = df_pivot[['MedianFamilyIncome','% Obese']].groupby('State').mean()

# create scatter plot point labels
labels = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA", 
          "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", 
          "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", 
          "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", 
          "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]

fig, ax = plt.subplots(figsize=(12,8))
income = state_income_obesity['MedianFamilyIncome']
obese = state_income_obesity['% Obese']

plt.scatter(x=income,y=obese)
ax.set_xlabel('Median Family Income ($)')
ax.set_ylabel('Obesity Rate (%)')
ax.set_title('Median Family Income vs Obesity Rate in America')


# Label each point with its correct State
for i, txt in enumerate(labels):
    ax.annotate(txt, (income[i], obese[i]), fontsize=12)

# Plot a linear regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(income,obese)
pts = np.array(ax.get_xlim())
line = slope * pts + intercept
ax.plot(pts, line, lw=1, color='red')

[<matplotlib.lines.Line2D at 0x19f24748>]

The above scatter plot gives further insight into the relationship between income and obesity rate. As we can see from the linear regression line, there is an obvious negative linear relationship between income and obesity. The graph shows that most states follow this relationship, with Missippi as the lowest median income and highest obesity rate, and DC with the highest income and lowest obesity combination. However, there are some outliers such as Colorado. Colorado shows a very low obesity rate and an average median family income compared to other states.

Question 3: Does access to transportation have an impact on a community’s health?¶

The next question I wanted to address was whether or not access to transportation has an effect on community health, measured by county obesity rate.

fig, ax = plt.subplots(1,2,figsize=(16,6))
df_pivot.plot.scatter(x='% House Units No Vehicle', y='% Obese', ax=ax[0],title='Across America')

# show just Florida 
df_pivot.loc['Florida'].plot.scatter(x='% House Units No Vehicle',y='% Obese',title='Florida',ax=ax[1])

# show linear regression line for Florida
slope, intercept, r_value, p_value, std_err = stats.linregress(df_pivot.loc['Florida']['% House Units No Vehicle'],df_pivot.loc['Florida']['% Obese'])
pts = np.array(ax[1].get_xlim())
line = slope * pts + intercept
plt.plot(pts, line, lw=1, color='red')

[<matplotlib.lines.Line2D at 0x23339ac8>]

As we can see there is not a very significant relationship between the percent of houses without a vehicle and the obesity rate in a county. Most counties fall in a 0 to 20% range of houses without access to transportation, while obesity rate ranges from 0 to 45% regardless of house units without vehicle access. I expected transportation access to have a greater effect on community health.

Question 4: Is there a relationship between median family income and areas with less access to healthy affordable food options?¶

Finally, I wanted to analyze the relationship between income and access to healthy and affordable food for each county in the dataset. To look further into this correlation I plotted low access population versus median family income. Each scatter plot represents low access definitions at 1/2 mile, 1 mile, 10 miles, and 20 miles.

# Graph low access population vs median family income for all counties
fig, ax = plt.subplots(4,1, figsize=(10,18))
df_pivot.plot.scatter(x='% Pop Low Access (1/2 mile)', y='MedianFamilyIncome', ax=ax[0],title='Across America')
df_pivot.plot.scatter(x='% Pop Low Access (1 mile)', y='MedianFamilyIncome', color='orange',ax=ax[1])
df_pivot.plot.scatter(x='% Pop Low Access (10 miles)', y='MedianFamilyIncome', color='green',ax=ax[2])
df_pivot.plot.scatter(x='% Pop Low Access (20 miles)', y='MedianFamilyIncome',color='purple',ax=ax[3])

<matplotlib.axes._subplots.AxesSubplot at 0x1fe897c8>

When looking at low access defined at 1/2 mile and 1 mile, we can see a linear association between income and access to healthy food. As income decreases, access to healthy food also decreases for counties across America. There is a less significant relationship for low access populations defined at 10 miles and 20 miles.

Next, I grouped the dataframe by state and plotted each state in its own subplot to show the relationship between median family income and low access populations measured at 1 mile from a grocery store.

# Create pivot table of median family income and % population low access at 1 mile mark
df_state = df_pivot[['MedianFamilyIncome','% Pop Low Access (1 mile)']]

# create list of state names to loop through 
state_list = df_food.State.unique()

# create subplot to populate
fig, ax = plt.subplots(nrows=17, ncols=3, figsize=(35,130))
plt.subplots_adjust(hspace=0.3)
plt.subplots_adjust(wspace=0.3)

# create subplot of scatter plots of median family income vs healthy food access per state
for i in range(len(state_list)):
    df_state.loc[state_list[i]].plot.scatter(x='% Pop Low Access (1 mile)',y='MedianFamilyIncome',ax=ax[i // 3][i % 3], title=state_list[i],fontsize=18)

The subplots above show the relationship between median family income and access to healthy food per state. Each dot on the scatter plots represents a county in that given state. These graphs are useful in visualizing whether or not there is a relationship between a household's income and their access to healthy food.

Below I show a good example of this relationship by displaying just Florida. The regression line shows the negative relationship between income and low access population percentage.

# show just florida 
fig, ax = plt.subplots(figsize=(8,6))
df_pivot.loc['Florida'].plot.scatter(x='% Pop Low Access (1 mile)',y='MedianFamilyIncome',title='Florida',ax=ax)
slope, intercept, r_value, p_value, std_err = stats.linregress(df_pivot.loc['Florida']['% Pop Low Access (1 mile)'],df_pivot.loc['Florida']['MedianFamilyIncome'])
pts = np.array(ax.get_xlim())
line = slope * pts + intercept
ax.plot(pts, line, lw=1, color='red')

[<matplotlib.lines.Line2D at 0x22f3acc8>]

Conclusions¶

In conclusion, we have found answers to our research questions and identified the greatest predictors of community health. From the data analysis above, we can conclude that counties with greater low access populations also have higher obesity rates, particularly when looking at 1/2 mile and 1 mile distances as a measure of low access communities. Areas with less access to healthy affordable food options report higher rates of obesity. We also found evidence to support the claim that communities with lower median incomes report higher obesity rates, more unhealthy days, and higher populations in fair-poor health. However, the data does not provide evidence to support my hypothesis that access to transportation is a good predictor of community health. The data shows that access to transportation did not have a significant impact on community health. Finally, we can conclude that there is a linear association between income and access to healthy food. This finding was not surprising given that about half of low access communities in America are also below the poverty line.

I believe that the findings in this tutorial prove the importance of the issue of food deserts in America. People all accross America need access to healthy affordable food, regardless of income, or location. It is my hope that the research and work going into closing the 'grocery gap' will eventually lead to greater general community health and greater access to healthy food for all populations.

Additional Sources¶

If you are interested in further research into food deserts and their effects on America's health, I have listed various websites that I used for statistics and project motivation. Thank you for your interest in helping make America healthier and food more accessible!

	CensusTract	State	County	Urban	POP2010	OHU2010	NUMGQTRS	PCTGQTRS	...	TractSeniors	TractWhite	TractBlack	TractAsian	TractNHOPI	TractAIAN	TractOMultir	TractHispanic	TractHUNV	TractSNAP
0	1001020100	Alabama	Autauga	1	1912	693	0	0.000000	...	221	1622	217	14	0	14	45	44	26	112
1	1001020200	Alabama	Autauga	1	2170	743	181	0.083410	...	214	888	1217	5	0	5	55	75	87	202
2	1001020300	Alabama	Autauga	1	3373	1256	0	0.000000	...	439	2576	647	17	5	11	117	87	108	120
3	1001020400	Alabama	Autauga	1	4386	1722	0	0.000000	...	904	4086	193	18	4	11	74	85	19	82
4	1001020500	Alabama	Autauga	1	10766	4082	181	0.016812	...	1126	8666	1437	296	9	48	310	355	198	488

	State	County	Population	TotalHouseUnits	PovertyRate	MedianFamilyIncome	lapophalf	lapop1	HouseUnitsNoVehicle	PopLowIncome	PopWhite	PopBlack	PopAsian	PopHispanic	PopOther
0	Alabama	Autauga	1912	693	10.0	74750	1732.225468	1357.480940	26	448	1622	217	14	44	45
1	Alabama	Autauga	2170	743	18.2	51875	1410.374828	483.429683	87	763	888	1217	5	75	55
2	Alabama	Autauga	3373	1256	19.1	52905	2764.604126	1417.874893	108	1578	2576	647	17	87	117
3	Alabama	Autauga	4386	1722	3.3	68079	3651.061015	1363.466885	19	1241	4086	193	18	85	74
4	Alabama	Autauga	10766	4082	8.5	77819	7778.396188	2643.095161	198	2692	8666	1437	296	355	310

		Population	PovertyRate	MedianFamilyIncome	% House Units No Vehicle	% Pop Low Access (1/2 mile)	% Pop Low Access (1 mile)	% Pop Low Access (10 miles)	% Pop Low Access (20 miles)	% Pop White	% Pop Black	% Pop Asian	% Pop Hispanic	% Pop Other	% Pop Low Income
State	County
Alabama	Autauga	54571	13.86	61082.92	5.35	88.91	66.83	12.88	0.0	78.53	17.67	0.87	2.40	2.45	34.25
	Baldwin	182265	14.36	60664.94	3.06	90.54	72.23	1.60	0.0	85.67	9.38	0.74	4.38	3.49	34.72
	Barbour	27457	24.53	43123.00	8.17	86.11	69.92	20.04	0.0	48.00	46.89	0.39	5.05	4.20	45.47
	Bibb	22915	16.02	43362.50	3.76	95.49	82.56	1.29	0.0	75.85	22.02	0.10	1.77	1.69	44.18
	Blount	57322	17.91	52136.00	3.81	96.57	91.14	3.41	0.0	92.58	1.33	0.20	8.07	5.29	39.89

		Population	PovertyRate	MedianFamilyIncome	% House Units No Vehicle	% Pop Low Access (1/2 mile)	% Pop Low Access (1 mile)	% Pop Low Access (10 miles)	% Pop Low Access (20 miles)	% Pop White	% Pop Black	% Pop Asian	% Pop Hispanic	% Pop Other	% Pop Low Income
State	County
Alabama	Autauga	-0.139698	-0.476881	0.294447	-0.254612	0.444634	0.100300	0.023888	-0.241483	-0.257603	0.605499	-0.115364	-0.446136	-0.505646	-0.308021
	Baldwin	0.268281	-0.400962	0.264806	-0.778672	0.584754	0.373792	-0.514367	-0.241483	0.165218	0.034185	-0.166833	-0.295747	-0.296299	-0.259494
	Barbour	-0.226327	1.144868	-0.979200	0.393177	0.203356	0.256685	0.365537	-0.241483	-2.064724	2.620298	-0.306421	-0.245217	-0.153446	0.863842
	Bibb	-0.240838	-0.147721	-0.962215	-0.618874	1.011454	0.897000	-0.529087	-0.241483	-0.416295	0.905732	-0.423591	-0.493791	-0.657281	0.728927
	Blount	-0.130909	0.138818	-0.340033	-0.606364	1.105210	1.331558	-0.427856	-0.241483	0.573968	-0.521356	-0.380463	-0.016433	0.066420	0.280905

	FIPS	State	County	Unreliable	Deaths	Aggregate Population	YPLL Rate	95% CI - Low	95% CI - High	Quartile	...	Quartile.26	Unreliable.5	Zip Codes with Healthy Food	# Zip Codes	% Healthy Food	Quartile.27	Population.2	Liquor Stores	Liquor Store Rate	Quartile.28
0	1001.0	Alabama	Autauga	NaN	670.0	137881.0	9778.0	8786.0	10770.0	2	...	1	NaN	2.0	8.0	25.0	4	49109.0	2.0	0.4	2
1	1003.0	Alabama	Baldwin	NaN	2148.0	449589.0	8222.0	7716.0	8727.0	1	...	4	NaN	11.0	26.0	42.0	2	168233.0	13.0	0.8	3
2	1005.0	Alabama	Barbour	NaN	424.0	79450.0	10686.0	9263.0	12109.0	2	...	1	NaN	2.0	5.0	40.0	2	28125.0	3.0	1.1	4
3	1007.0	Alabama	Bibb	NaN	373.0	60430.0	13070.0	11271.0	14868.0	4	...	1	NaN	4.0	8.0	50.0	1	21341.0	1.0	0.5	2
4	1009.0	Alabama	Blount	NaN	787.0	155580.0	8930.0	8040.0	9819.0	1	...	4	NaN	4.0	7.0	57.0	1	55811.0	NaN	0.0	1

	Population	PovertyRate	MedianFamilyIncome	% House Units No Vehicle	% Pop Low Access (1/2 mile)	% Pop Low Access (1 mile)	% Pop Low Access (10 miles)	% Pop Low Access (20 miles)	% Pop White	% Pop Black	...	% Pop Hispanic	% Pop Other	% Pop Low Income	% Children in Poverty	% Fair/Poor Health	% Obese	% Smokers	Mentally Unhealthy Days	Physically Unhealthy Days	% unhealthy days per month
Population	1.000000	-0.030277	0.252402	0.180973	-0.373280	-0.395089	-0.160021	-0.071463	-0.194805	0.080578	...	0.191958	0.262420	-0.096868	-0.113666	-0.056954	-0.211916	-0.137536	-0.015736	-0.079866	-0.055267
PovertyRate	-0.030277	1.000000	-0.693203	0.409257	0.119866	0.086868	-0.054087	-0.057718	-0.497119	0.469473	...	0.095423	0.058463	0.850294	0.806606	0.599408	0.427828	0.346337	0.360869	0.457638	0.455845
MedianFamilyIncome	0.252402	-0.693203	1.000000	-0.186699	-0.297261	-0.328622	-0.080545	0.006171	0.177033	-0.260074	...	-0.000432	0.079778	-0.813431	-0.752778	-0.600189	-0.493518	-0.427333	-0.353888	-0.474926	-0.463659
% House Units No Vehicle	0.180973	0.409257	-0.186699	1.000000	-0.224198	-0.168975	-0.053422	0.016145	-0.472136	0.331081	...	0.004916	0.040673	0.338743	0.363947	0.233834	0.186482	0.225384	0.110475	0.150408	0.144083
% Pop Low Access (1/2 mile)	-0.373280	0.119866	-0.297261	-0.224198	1.000000	0.883369	0.265725	0.220670	0.080941	0.058566	...	-0.241924	-0.307676	0.178354	0.183363	0.257817	0.300796	0.298722	0.183616	0.260038	0.249765
% Pop Low Access (1 mile)	-0.395089	0.086868	-0.328622	-0.168975	0.883369	1.000000	0.428435	0.306561	0.129673	-0.013425	...	-0.283062	-0.337020	0.199507	0.215153	0.267911	0.287381	0.274665	0.149543	0.252419	0.227916
% Pop Low Access (10 miles)	-0.160021	-0.054087	-0.080545	-0.053422	0.265725	0.428435	1.000000	0.781635	0.039026	-0.140716	...	0.004819	-0.030242	0.057510	0.115515	-0.121861	0.000783	-0.123302	-0.220183	-0.147692	-0.199274
% Pop Low Access (20 miles)	-0.071463	-0.057718	0.006171	0.016145	0.220670	0.306561	0.781635	1.000000	0.009178	-0.134076	...	0.051618	0.038280	0.019973	0.056578	-0.118638	-0.064793	-0.067277	-0.163595	-0.131807	-0.160566
% Pop White	-0.194805	-0.497119	0.177033	-0.472136	0.080941	0.129673	0.039026	0.009178	1.000000	-0.790956	...	-0.164510	-0.315235	-0.382655	-0.436396	-0.244092	-0.326036	-0.065113	-0.060526	-0.066694	-0.069874
% Pop Black	0.080578	0.469473	-0.260074	0.331081	0.058566	-0.013425	-0.140716	-0.134076	-0.790956	1.000000	...	-0.097863	-0.075351	0.334237	0.420052	0.281557	0.456083	0.058962	0.114480	0.131700	0.136730
% Pop Asian	0.450583	-0.119340	0.424927	0.160344	-0.376314	-0.399709	-0.149212	-0.044451	-0.275437	0.023836	...	0.150262	0.335506	-0.213186	-0.236856	-0.164437	-0.279357	-0.226901	-0.094769	-0.162722	-0.145473
% Pop Hispanic	0.191958	0.095423	-0.000432	0.004916	-0.241924	-0.283062	0.004819	0.051618	-0.164510	-0.097863	...	1.000000	0.811778	0.158589	0.140592	0.062593	-0.284173	-0.201185	-0.044732	-0.024562	-0.038327
% Pop Other	0.262420	0.058463	0.079778	0.040673	-0.307676	-0.337020	-0.030242	0.038280	-0.315235	-0.075351	...	0.811778	1.000000	0.110649	0.043319	0.027048	-0.289351	-0.166715	-0.023941	-0.046479	-0.040463
% Pop Low Income	-0.096868	0.850294	-0.813431	0.338743	0.178354	0.199507	0.057510	0.019973	-0.382655	0.334237	...	0.158589	0.110649	1.000000	0.831318	0.648684	0.418905	0.375860	0.363711	0.485567	0.474295
% Children in Poverty	-0.113666	0.806606	-0.752778	0.363947	0.183363	0.215153	0.115515	0.056578	-0.436396	0.420052	...	0.140592	0.043319	0.831318	1.000000	0.665417	0.455924	0.393383	0.371574	0.506382	0.491136
% Fair/Poor Health	-0.056954	0.599408	-0.600189	0.233834	0.257817	0.267911	-0.121861	-0.118638	-0.244092	0.281557	...	0.062593	0.027048	0.648684	0.665417	1.000000	0.484557	0.551725	0.600930	0.760157	0.752513
% Obese	-0.211916	0.427828	-0.493518	0.186482	0.300796	0.287381	0.000783	-0.064793	-0.326036	0.456083	...	-0.284173	-0.289351	0.418905	0.455924	0.484557	1.000000	0.415832	0.250547	0.334227	0.327207
% Smokers	-0.137536	0.346337	-0.427333	0.225384	0.298722	0.274665	-0.123302	-0.067277	-0.065113	0.058962	...	-0.201185	-0.166715	0.375860	0.393383	0.551725	0.415832	1.000000	0.480469	0.515111	0.541192
Mentally Unhealthy Days	-0.015736	0.360869	-0.353888	0.110475	0.183616	0.149543	-0.220183	-0.163595	-0.060526	0.114480	...	-0.044732	-0.023941	0.363711	0.371574	0.600930	0.250547	0.480469	1.000000	0.624982	0.891750
Physically Unhealthy Days	-0.079866	0.457638	-0.474926	0.150408	0.260038	0.252419	-0.147692	-0.131807	-0.066694	0.131700	...	-0.024562	-0.046479	0.485567	0.506382	0.760157	0.334227	0.515111	0.624982	1.000000	0.910589
% unhealthy days per month	-0.055267	0.455845	-0.463659	0.144083	0.249765	0.227916	-0.199274	-0.160566	-0.069874	0.136730	...	-0.038327	-0.040463	0.474295	0.491136	0.752513	0.327207	0.541192	0.891750	0.910589	1.000000

		% Children in Poverty	% Fair/Poor Health	% Obese	% Smokers	Mentally Unhealthy Days	Physically Unhealthy Days	% unhealthy days per month
State	County
Alabama	Autauga	14.0	26.0	30.0	28.0	4.1	5.5	0.320000
	Baldwin	15.0	13.0	25.0	23.0	4.1	3.6	0.256667
	Barbour	34.0	24.0	36.0	23.0	3.8	6.1	0.330000
	Bibb	24.0	18.0	32.0	NaN	5.3	4.2	0.316667
	Blount	19.0	25.0	32.0	23.0	4.5	5.6	0.336667

		% Children in Poverty	% Fair/Poor Health	% Obese	% Smokers	Mentally Unhealthy Days	Physically Unhealthy Days	% unhealthy days per month
State	County
Alabama	Autauga	-0.794430	1.546098	0.457595	0.937950	0.625340	1.520199	1.211401
	Baldwin	-0.683192	-0.659851	-0.915977	0.096658	0.625340	-0.169571	0.231568
	Barbour	1.430327	1.206722	2.105882	0.096658	0.333717	2.053811	1.366111
	Bibb	0.317949	0.188591	1.007024	NaN	1.791828	0.364041	1.159830
	Blount	-0.238240	1.376410	1.007024	0.096658	1.014169	1.609134	1.469251