USDA Food Access Research Atlas: https://www.ers.usda.gov/data-products/food-access-research-atlas/download-the-data/
University of Wisconsin Population Health Institute: 2010 County Health Rankings National Data https://www.countyhealthrankings.org/sites/default/files/2010%20County%20Health%20Rankings%20National%20Data_v2.xls
Since I am a graduate student, to fulfill the extra requirement I will be working alone on this project. The data sets I plan on working with come from the US Department of Agriculture and the University of Wisconsin Population Health Institute. The particular data sets I will look at focus on food access across the US and contain demographic information on US counties, their ability to access supermarkets and other healthy and affordable food sources, and their overall mental health, physical health, and obesity rates.
While there are many ways to define which areas are considered "food deserts" and many ways to measure food store access for individuals and for neighborhoods, I have chosen a few initial parameters to begin analyzing. The USDA defines a food desert as 'an area that has limited access to affordable and nutritious food, in terms of income and distance to a grocery store.' Food deserts are a national crisis in America, as nearly 23.5 million Americans currently live in food deserts. In addition, the USDA estimates that people living in food deserts are 30% more likely to become obese.
For my analysis, I will focus on median family income, poverty rate, percent of populations considered low access at 1/2 mile, 1 mile, 10 miles, and 20 miles distance between housing units and grocery stores, frequency of grocery stores, access to transportation, racial demographics, and the percentage of children living in poverty. We will measure community health by obesity rates per county, reported unhealthy days by individuals, and percent of individuals reporting fair to poor health.
It is important to note a few definitions provided by the data. 'Low Access' refers to accessibility to sources of healthy food, as measured by distance to a store (at ½ mile, 1 mile, 10 miles, 20 miles). 'Low Income' refers to an area where the poverty rate is greater than 20 percent or the median family income is less than or equal to 80 percent of the State-wide median family income. A food desert is then an area that is both low access and low income. The data provides a spatial overview of food access indicators for low income vs other census tracts using measures of 1-mile, 10-mile and 20-mile demarcations to the nearest supermarket, and vehicle availability for all census tracts.
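The two definitions above can be written as a small predicate. This is only a sketch: the function names `is_low_income` and `is_food_desert` and the sample numbers are hypothetical, and the Atlas documentation should be consulted for the exact tract-level rules behind its own flags.

```python
def is_low_income(poverty_rate, median_family_income, state_median_family_income):
    """Low income as defined in the text: poverty rate above 20 percent, or
    median family income at or below 80 percent of the state-wide median."""
    return (poverty_rate > 20
            or median_family_income <= 0.8 * state_median_family_income)


def is_food_desert(low_income, low_access):
    """A food desert is an area that is both low income and low access."""
    return low_income and low_access


# Hypothetical area: 15% poverty rate, $40,000 median family income,
# in a state whose median family income is $60,000 (80% of that is $48,000).
low_income = is_low_income(15.0, 40_000, 60_000)
print(low_income)                        # True, because 40,000 <= 48,000
print(is_food_desert(low_income, True))  # True
```

Note that the income test is an "or": an area qualifies on either the poverty rate or the relative income condition.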
Communities with lower median family incomes, a higher percentage of low access population, less access to transportation, and higher poverty rates are going to experience the poorest health conditions.
# Read in necessary libraries
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import scipy
from scipy import stats
from matplotlib import cm
# Read in USDA Food Access Research Atlas Data
df_food = pd.read_csv("USDA_data.csv")
df_food.head()
Next, we need to tidy the data imported from the USDA Food Access Research Atlas. Below is a tidied dataframe with columns named more appropriately and unnecessary columns of data removed. This new dataframe displays data on population, total number of housing units, number of housing units without vehicle access, median family income, poverty rate, low access population counts when measured at 1/2 mile, 1 mile, 10 miles, and 20 miles from a grocery store, population counts by race, and the number of people considered to be low income.
# Tidy data by creating new dataframe containing only the relevant parameters
# Create a list of all columns to keep
food_cols = ["State","County","POP2010","OHU2010","PovertyRate","MedianFamilyIncome",
"lapophalf","lapop1","lapop10","lapop20","TractHUNV","TractLOWI","TractWhite",
"TractBlack","TractAsian","TractHispanic","TractOMultir"]
df_food = df_food[food_cols]
# Rename columns appropriately
df_food.rename(columns={"POP2010": "Population", "OHU2010": "TotalHouseUnits","TractHUNV":"HouseUnitsNoVehicle",
"TractWhite":"PopWhite","TractBlack":"PopBlack","TractAsian":"PopAsian","TractHispanic":"PopHispanic",
"TractOMultir":"PopOther","TractLOWI":"PopLowIncome"}, inplace=True)
display(df_food.head())
# Check for correct data types
df_food.dtypes
Below I have created a pivot table indexed on state and county to organize the above dataframe into a more manageable format. Then I edited the low access and race columns to represent percentage of the total population, rather than a sum of individuals. This was achieved by creating new columns and dividing by the overall population, then multiplying by 100.
# Create a pivot table to organize data by State and County
# Counts are summed across tracts; rates and incomes are averaged across tracts
df_food_pivot = df_food.pivot_table(values=['Population','TotalHouseUnits','PovertyRate','MedianFamilyIncome','lapophalf',
                                            'lapop1','lapop10','lapop20','HouseUnitsNoVehicle','PopLowIncome','PopWhite',
                                            'PopBlack','PopAsian','PopHispanic','PopOther'],
                                    index=['State','County'],
                                    aggfunc={'Population':'sum','TotalHouseUnits':'sum','HouseUnitsNoVehicle':'sum',
                                             'PovertyRate':'mean','MedianFamilyIncome':'mean','lapophalf':'sum',
                                             'lapop1':'sum','lapop10':'sum','lapop20':'sum','PopWhite':'sum',
                                             'PopBlack':'sum','PopAsian':'sum','PopHispanic':'sum','PopOther':'sum',
                                             'PopLowIncome':'sum'})
# Edit columns to show % of total population for each identifier
df_food_pivot['% House Units No Vehicle']=(df_food_pivot['HouseUnitsNoVehicle']/df_food_pivot['TotalHouseUnits'])*100
df_food_pivot['% Pop Low Access (1/2 mile)'] = (df_food_pivot['lapophalf']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Low Access (1 mile)'] = (df_food_pivot['lapop1']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Low Access (10 miles)'] = (df_food_pivot['lapop10']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Low Access (20 miles)'] = (df_food_pivot['lapop20']/df_food_pivot['Population'])*100
df_food_pivot['% Pop White'] = (df_food_pivot['PopWhite']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Black'] = (df_food_pivot['PopBlack']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Asian'] = (df_food_pivot['PopAsian']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Hispanic'] = (df_food_pivot['PopHispanic']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Other'] = (df_food_pivot['PopOther']/df_food_pivot['Population'])*100
df_food_pivot['% Pop Low Income'] = (df_food_pivot['PopLowIncome']/df_food_pivot['Population'])*100
# Keep only columns specified below
df_food_pivot = df_food_pivot[['Population','PovertyRate','MedianFamilyIncome','% House Units No Vehicle',
'% Pop Low Access (1/2 mile)','% Pop Low Access (1 mile)','% Pop Low Access (10 miles)',
'% Pop Low Access (20 miles)','% Pop White','% Pop Black','% Pop Asian','% Pop Hispanic',
'% Pop Other','% Pop Low Income']]
# Display new dataframe and round values to two decimal places
df_food_pivot.round(2).head()
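One caveat on the pivot table above: it averages tract-level poverty rates without regard to how many people live in each tract. The sketch below compares that unweighted mean with a population-weighted mean. The toy frame and its values are made up for illustration and are not Atlas data.

```python
import pandas as pd

# Toy tract-level data (made-up values, not from the Atlas)
tracts = pd.DataFrame({
    "State": ["A", "A", "A"],
    "County": ["X", "X", "Y"],
    "Population": [1000, 3000, 2000],
    "PovertyRate": [10.0, 30.0, 20.0],
})

# Unweighted mean of tract poverty rates, as in the pivot table above
unweighted = tracts.groupby(["State", "County"])["PovertyRate"].mean()

# Population-weighted mean: sum(rate * population) / sum(population)
weighted_sum = (tracts["PovertyRate"] * tracts["Population"]).groupby(
    [tracts["State"], tracts["County"]]).sum()
total_pop = tracts.groupby(["State", "County"])["Population"].sum()
weighted = weighted_sum / total_pop

print(unweighted.loc[("A", "X")])  # 20.0
print(weighted.loc[("A", "X")])    # 25.0
```

In county X the larger tract has the higher poverty rate, so the weighted figure is higher than the simple average. Whether this matters for the real data depends on how uneven tract populations are within each county.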
Next, I standardized the USDA dataframe for easy comparison to other dataframes in the future.
# Create a standardized version of dataframe
df_food_standardized = (df_food_pivot-df_food_pivot.mean())/df_food_pivot.std()
df_food_standardized.head()
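As a quick check of what the standardization does, here is the same formula applied to a single made-up column. Each value becomes its distance from the column mean, measured in standard deviations. Note that `DataFrame.std()` uses the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Toy column (made-up values)
toy = pd.DataFrame({"PovertyRate": [10.0, 20.0, 30.0]})

# Same formula as above: subtract the column mean, divide by the column std
toy_standardized = (toy - toy.mean()) / toy.std()

print(toy_standardized["PovertyRate"].tolist())  # [-1.0, 0.0, 1.0]
```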
This data comes from the University of Wisconsin Population Health Institute. It represents the 2010 County Health Rankings National Data.
# Read in data from University of Wisconsin Population Health Institute
df_health = pd.read_csv("County_Health.csv", header=[1])
df_health.head()
I decided to only use a fraction of the parameters available in this dataset to start and make the data more manageable. I decided that the State and County were important for indexing purposes. In addition, percent of population reporting fair/poor health, number of physically unhealthy days, number of mentally unhealthy days, percent of population that smokes, obesity rate, and percent of children in poverty will be important parameters to consider.
# Select columns to keep in dataframe
health_cols = ['State','County','% Fair/Poor','Physically Unhealthy Days',
'Mentally Unhealthy Days','% Smokers','% Obese','% Children in Poverty']
df_health = df_health[health_cols]
# Rename columns for easy interpretation
df_health.rename(columns={'% Fair/Poor':'% Fair/Poor Health'}, inplace= True)
# Remove rows missing county data
df_health = df_health[df_health['County'].notna()]
# Display dataframe and dtypes
display(df_health.head())
df_health.dtypes
Next, I formatted the dataframe to be grouped by state and county. Then I created a pivot table to make the data more organized.
# Create pivot table to index by State and County
df_health_pivot = df_health.pivot_table(values=['% Fair/Poor Health','Physically Unhealthy Days',
'Mentally Unhealthy Days','% Smokers','% Obese',
'% Children in Poverty'],
index=['State','County'])
# Combined unhealthy days as a percentage of a 30-day month
df_health_pivot['% unhealthy days per month'] = (df_health_pivot['Physically Unhealthy Days']+df_health_pivot['Mentally Unhealthy Days'])/30*100
display(df_health_pivot.head())
Then we create a standardized version of the County Health data for comparison with other dataframes later on.
# Create standardized version of df_health_pivot
df_health_standardized = (df_health_pivot-df_health_pivot.mean())/df_health_pivot.std()
df_health_standardized.head()
# Use .merge() to combine the two dataframes on their shared State and County index levels
df_pivot = df_food_pivot.merge(df_health_pivot, on=['State','County'], how='left')
display(df_pivot.round(2))
# Use pd.concat() to join the two standardized dataframes side by side
df_pivot_standardized = pd.concat([df_food_standardized, df_health_standardized],axis=1)
df_pivot_standardized = df_pivot_standardized[df_pivot_standardized['Population'].notna()]
display(df_pivot_standardized)
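Because the merge above is a left join, any county whose State and County labels do not match exactly between the two sources keeps its food access values but gets NaN health values. Below is a minimal sketch of one way to check for this, using the `indicator` option of `DataFrame.merge`. The toy frames and their values are assumptions for illustration, not the project data.

```python
import pandas as pd

# Toy county tables (made-up values); State and County are ordinary columns here
food = pd.DataFrame({
    "State": ["Florida", "Florida"],
    "County": ["Alachua", "Baker"],
    "PovertyRate": [12.0, 18.0],
})
health = pd.DataFrame({
    "State": ["Florida"],
    "County": ["Alachua"],
    "% Obese": [24.0],
})

# indicator=True adds a "_merge" column recording which table(s) each row came from
checked = food.merge(health, on=["State", "County"], how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]

print(unmatched["County"].tolist())  # ['Baker']
```

On the notebook's pivot tables, calling `.reset_index()` on each one first would expose State and County as columns in the same way.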
To get a general understanding of our data and identify which parameters have the largest effect on a community's overall health, I decided to create a correlation matrix of all of the parameters.
# Find correlation between each variable
corr_df = df_pivot_standardized.corr()
display(corr_df)
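A side note on the correlation matrix above: Pearson correlation does not change when a column is shifted or rescaled, so the standardized and unstandardized dataframes give the same correlations. The sketch below checks this on made-up values.

```python
import numpy as np
import pandas as pd

# Toy data (made-up values)
raw = pd.DataFrame({
    "MedianFamilyIncome": [40000.0, 55000.0, 70000.0, 52000.0],
    "% Obese": [33.0, 29.0, 24.0, 30.0],
})
standardized = (raw - raw.mean()) / raw.std()

# The two correlation matrices agree up to floating-point error
same = np.allclose(raw.corr(), standardized.corr())
print(same)  # True
```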
To display the correlation matrix in an easily interpretable way, I have created a heatmap of the correlation values below. On the right side of the heatmap is a legend explaining the different colors and their respective correlation values. The darker red colors represent a value closer to positive 1, or a perfectly positive association, while the darker blue colors represent a value closer to negative 1, or a perfectly negative association.
# Create heatmap to visually represent correlation matrix
fig, ax = plt.subplots(figsize=(20,20))
colormap = sns.diverging_palette(220, 10, as_cmap=True)
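# Fix the color scale to [-1, 1] so that color intensity maps directly to correlation strength
sns.heatmap(corr_df, cmap=colormap, vmin=-1, vmax=1, annot=True, fmt=".2f", ax=ax)
# seaborn already labels each cell from corr_df's index and columns, so the ticks are left as drawn
ax.set_title("County Data Correlations")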
plt.show()
Then, to further investigate different measures of community health, I calculated the correlation values of each parameter when relating to obesity rates, average unhealthy days reported per month, and the percentage of individuals reporting fair to poor health in each county. The average unhealthy days parameter is calculated as the sum of the average physically unhealthy days per month and the average mentally unhealthy days per month of an individual, divided by 30 to express it as the share of the month an individual is mentally and/or physically unhealthy.
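As a worked example of the unhealthy days calculation, using made-up county averages: one caveat is that a day that was both physically and mentally unhealthy may be counted twice in the sum, so the result is best read as an upper bound.

```python
# Made-up county averages, in days per month
physically_unhealthy_days = 3.6
mentally_unhealthy_days = 3.4

# Share of a 30-day month; multiply by 100 to read it as a percentage
share_of_month = (physically_unhealthy_days + mentally_unhealthy_days) / 30
percent_of_month = share_of_month * 100

print(round(percent_of_month, 1))  # 23.3
```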
First, let's look at which parameters have the greatest correlation when using obesity rate as an indicator of community overall health.
# Display the correlation values in relation to obesity in order in a bar graph
# Take the obesity column and drop the health outcome measures so only candidate predictors remain
obesity_corr_df = corr_df['% Obese'].drop(['% Obese','Physically Unhealthy Days','Mentally Unhealthy Days',
                                           '% Fair/Poor Health','% unhealthy days per month'])
# Sort the correlation values and print
obesity_corr_df = obesity_corr_df.sort_values()
print(obesity_corr_df)
# Choose the greatest correlation parameters from all parameters considered
greatest_obesity_corr_df = obesity_corr_df[(obesity_corr_df >= .3) | (obesity_corr_df <= -.3)]
greatest_obesity_corr_df = greatest_obesity_corr_df.sort_values()
# Plot all correlation values and the greatest correlation value parameters for Obesity rate
fig, ax = plt.subplots(1,2, figsize=(15,5))
obesity_corr_df.plot.bar(title = 'Correlation Values for Obesity Rate',ax=ax[0])
greatest_obesity_corr_df.plot.bar(title = 'Greatest Value Correlation Parameters for Obesity Rate',ax=ax[1], color='red')
The two bar graphs above suggest that median family income, percentage of the population that is White, percent of the population considered low access (at 1/2 mile), percentage of smokers, percent of population considered low income, the poverty rate, the percent of children in poverty, and the percentage of the population that is Black have the greatest correlations with obesity rate.
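The threshold filter used above keeps values at or above 0.3 or at or below -0.3. An equivalent way to write that selection is to compare the absolute value against the threshold, as in this sketch on made-up correlation values:

```python
import pandas as pd

# Toy correlation values (made-up)
corr = pd.Series({"MedianFamilyIncome": -0.45, "% Smokers": 0.38,
                  "% Pop Asian": -0.12, "PovertyRate": 0.31})

# Keep entries whose absolute correlation is at least the threshold
strongest = corr[corr.abs() >= 0.3].sort_values()

print(strongest.index.tolist())
# ['MedianFamilyIncome', 'PovertyRate', '% Smokers']
```

The cutoff itself (0.3 here, 0.2 and 0.25 in later cells) is a judgment call rather than a statistical test.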
Next let us consider which parameters have the greatest correlation when using combined physically unhealthy and mentally unhealthy days per month as an indicator of community health.
# Find correlation between each variable and the unhealthy days parameter.
# Take the unhealthy days column and drop the health outcome measures so only candidate predictors remain
unhealthy_days_corr_df = corr_df['% unhealthy days per month'].drop(['% unhealthy days per month','Physically Unhealthy Days',
                                                                      'Mentally Unhealthy Days','% Obese','% Fair/Poor Health'])
# Sort correlation values and print
unhealthy_days_corr_df = unhealthy_days_corr_df.sort_values()
print(unhealthy_days_corr_df)
# Choose the greatest correlation parameters from all parameters considered
greatest_unhealthy_days_corr_df = unhealthy_days_corr_df[(unhealthy_days_corr_df >= .2) | (unhealthy_days_corr_df <= -.2)]
greatest_unhealthy_days_corr_df = greatest_unhealthy_days_corr_df.sort_values()
# Plot all correlation values and the greatest correlation value parameters for % unhealthy days per month
fig, ax = plt.subplots(1,2, figsize=(15,5))
unhealthy_days_corr_df.plot.bar(title = 'Correlation Values for % of Unhealthy Days per Month',ax=ax[0])
greatest_unhealthy_days_corr_df.plot.bar(title = 'Greatest Value Correlation Parameters for % of Unhealthy Days per Month',ax=ax[1], color='red')
The two bar graphs above suggest that median family income, the percent of the population that is low access (at 1 mile), the percent of the population that is low access (at 1/2 mile), the poverty rate, the percentage of the population that is low income, the percentage of children in poverty, and the percentage of smokers have the strongest correlations with the percentage of unhealthy days per month.
Finally, let's calculate the correlation values when using the percent of a population reporting fair/poor health as the indicator of community health.
# Find correlation between each variable and the fair/poor health parameter.
# Drop the health outcome measures so only candidate predictors remain
fair_poor_corr_df = corr_df['% Fair/Poor Health'].drop(['% Fair/Poor Health','Physically Unhealthy Days',
                                                        'Mentally Unhealthy Days','% Obese','% unhealthy days per month'])
fair_poor_corr_df = fair_poor_corr_df.sort_values()
print(fair_poor_corr_df)
# Choose the greatest correlation parameters from all parameters considered
greatest_fair_poor_corr_df = fair_poor_corr_df[(fair_poor_corr_df >= .25) | (fair_poor_corr_df <= -.25)]
greatest_fair_poor_corr_df = greatest_fair_poor_corr_df.sort_values()
# Plot all correlation values and the greatest correlation value parameters for % reporting fair/poor health
fig, ax = plt.subplots(1,2, figsize=(15,5))
fair_poor_corr_df.plot.bar(title = 'Correlation Values for % Reporting Fair/Poor Health',ax=ax[0])
greatest_fair_poor_corr_df.plot.bar(title = 'Greatest Value Correlation Parameters for % Reporting Fair/Poor Health',ax=ax[1], color='red')
The two bar graphs above suggest that median family income, the percent of the population that is low access (at 1/2 mile), the percent of the population that is low access (at 1 mile), the percent of the population that is Black, the percentage of smokers, the poverty rate, the percent of the population that is considered low income, and the percentage of children in poverty have the strongest correlations with the percent of the population reporting fair or poor health.
The common parameters for all three measures of community health are median family income, low access population percentages, poverty rates, and percent smokers.
To answer question number 1, I decided to create scatter plots of the obesity rate and low access population percentage for each county. Each point in the scatter plot represents a county, and each subplot represents low access defined at either 1/2 mile, 1 mile, 10 miles, or 20 miles to the grocery store. The x-axis shows the low access population percentage, and the y-axis shows obesity rate.
# Graph low access population vs obesity rate for all counties
fig, ax = plt.subplots(4,1, figsize=(10,18))
df_pivot.plot.scatter(x='% Pop Low Access (1/2 mile)', y='% Obese', ax=ax[0],title='Across America')
df_pivot.plot.scatter(x='% Pop Low Access (1 mile)', y='% Obese', color='orange',ax=ax[1])
df_pivot.plot.scatter(x='% Pop Low Access (10 miles)', y='% Obese', color='green',ax=ax[2])
df_pivot.plot.scatter(x='% Pop Low Access (20 miles)', y='% Obese',color='purple',ax=ax[3])
The above graphs show the obesity rate of each county in relation to the percentage of low access population at different mile demarcations. The strongest relationship can be seen between obesity rate and percent of low access population at 1/2 mile. While there is a positive linear association between the percent of low access population and obesity rate, it is not as strong as I would have expected. As the low access mile demarcations get larger, the relationship between obesity rate and percent of low access population gets weaker. This is probably because a much larger share of the population lives more than 1/2 mile or 1 mile from a grocery store than lives more than 10 or 20 miles from one.
Below I chose to show Florida as an example of this relationship. Each point in the scatter plot represents a county in Florida. As the regression line shows, there is a linear association between obesity rate and low access population percentage.
# show just florida
fig, ax = plt.subplots(figsize=(8,6))
df_pivot.loc['Florida'].plot.scatter(x='% Pop Low Access (1 mile)',y='% Obese',title='Florida',ax=ax)
# plot regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(df_pivot.loc['Florida']['% Pop Low Access (1 mile)'],df_pivot.loc['Florida']['% Obese'])
pts = np.array(ax.get_xlim())
line = slope * pts + intercept
ax.plot(pts, line, lw=1, color='red')
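Since the merged dataframe comes from a left join, some counties may be missing an obesity rate, and missing values may propagate into the fitted line. Below is a minimal sketch of a helper that drops incomplete rows before fitting. The helper name `fit_line` and the toy values are hypothetical, and it uses `np.polyfit` and `np.corrcoef` in place of `stats.linregress`.

```python
import numpy as np
import pandas as pd

def fit_line(df, x_col, y_col):
    """Least-squares line through the rows where both columns are present."""
    clean = df[[x_col, y_col]].dropna()
    slope, intercept = np.polyfit(clean[x_col], clean[y_col], deg=1)
    r = np.corrcoef(clean[x_col], clean[y_col])[0, 1]
    return float(slope), float(intercept), float(r)

# Toy counties (made-up values); the last row is missing its obesity rate
toy = pd.DataFrame({
    "% Pop Low Access (1 mile)": [10.0, 20.0, 30.0, 40.0],
    "% Obese": [22.0, 24.0, 26.0, np.nan],
})
slope, intercept, r = fit_line(toy, "% Pop Low Access (1 mile)", "% Obese")
print(round(slope, 2), round(intercept, 2), round(r, 2))  # 0.2 20.0 1.0
```

Returning `r` alongside the slope makes it possible to report how strong the linear association is rather than judging it from the plot alone.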
# Create bar graph showing median family income by state (mean of county values)
fig, ax = plt.subplots(figsize = (20,10))
ax1 = df_pivot.groupby('State')['MedianFamilyIncome'].mean().sort_values().plot.bar(title = 'Median family income by state', ax=ax)
# Create bar graph showing obesity rates by state
fig, ax = plt.subplots(figsize = (20,10))
ax2 = df_pivot.groupby('State')['% Obese'].mean().sort_values().plot.bar(title = '% Obesity by State', ax=ax)
The bar graph above reports the rate of obesity by state. This graph shows that Colorado and Rhode Island have the lowest rates of obesity and Alabama and Mississippi have the highest rates of obesity.
Below I combined the previous bar graphs to show both the obesity rate and median family income per state, with two different y-axis ranges. On the left side, obesity rate ranges from 0 to 35% and on the right side, median family income ranges from $0 to $100,000.
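# create x-axis labels from the state abbreviations, listed in alphabetical order of
# full state name so that they match the order returned by groupby('State')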
labels = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA",
"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
"MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
"NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
"SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
x = np.arange(len(labels)) # the label locations
width = 0.35 # the width of the bars
# create figure and axis objects with subplots()
fig,ax = plt.subplots(figsize=(20,10))
# make a plot
ax.bar(x - width/2, df_pivot.groupby('State')['% Obese'].mean(),width, label='Obesity',color='red')
# set x-axis label
ax.set_xlabel("State",fontsize=16)
# set y-axis label
ax.set_ylabel("Obesity Rate (%)",color="red",fontsize=16)
# twin object for two different y-axes on the same plot
ax2=ax.twinx()
# make a plot with different y-axis using second axis object
ax2.bar(x + width/2, df_pivot.groupby('State')['MedianFamilyIncome'].mean(),width,label='Income',color='blue')
ax2.set_ylabel("Median Family Income($)",color="blue",fontsize=16)
ax.set_xticks(x)
ax.set_xticklabels(labels)
plt.show()
Next, I wanted to show the relationship in a scatter plot. Each point represents a state's median family income and obesity rate.
# create dataframe of just median family income and obesity rate
state_income_obesity = df_pivot[['MedianFamilyIncome','% Obese']].groupby('State').mean()
# create scatter plot point labels
labels = ["AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA",
"HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
"MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
"NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
"SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY"]
fig, ax = plt.subplots(figsize=(12,8))
income = state_income_obesity['MedianFamilyIncome']
obese = state_income_obesity['% Obese']
ax.scatter(x=income, y=obese)
ax.set_xlabel('Median Family Income ($)')
ax.set_ylabel('Obesity Rate (%)')
ax.set_title('Median Family Income vs Obesity Rate in America')
# Label each point with its correct State
for i, txt in enumerate(labels):
    # .iloc selects by position, since both series are indexed by state name
    ax.annotate(txt, (income.iloc[i], obese.iloc[i]), fontsize=12)
# Plot a linear regression line
slope, intercept, r_value, p_value, std_err = stats.linregress(income,obese)
pts = np.array(ax.get_xlim())
line = slope * pts + intercept
ax.plot(pts, line, lw=1, color='red')
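The point labels above depend on the abbreviation list being in exactly the same order as the grouped dataframe. A sketch of a less order-sensitive alternative is to look each abbreviation up from the row's own state name. The mapping below is deliberately partial and the values are made up, so it illustrates the pattern only.

```python
import pandas as pd

# Partial, illustrative mapping; a full version would cover every state and DC
STATE_ABBREVIATIONS = {"Alabama": "AL", "Alaska": "AK", "Colorado": "CO"}

# Toy state-level frame (made-up values), deliberately not in alphabetical order
state_df = pd.DataFrame(
    {"MedianFamilyIncome": [60000.0, 50000.0, 70000.0],
     "% Obese": [20.0, 32.0, 28.0]},
    index=pd.Index(["Colorado", "Alabama", "Alaska"], name="State"),
)

# Look each label up from the row's own state name instead of its position
point_labels = [
    (STATE_ABBREVIATIONS[state], float(row["MedianFamilyIncome"]), float(row["% Obese"]))
    for state, row in state_df.iterrows()
]
print(point_labels[0])  # ('CO', 60000.0, 20.0)
```

Each tuple could then be passed to `ax.annotate` in place of the positional lookup.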
The above scatter plot gives further insight into the relationship between income and obesity rate. As we can see from the linear regression line, there is a clear negative linear relationship between income and obesity. The graph shows that most states follow this relationship, with Mississippi having the lowest median income and highest obesity rate, and DC having the highest income and lowest obesity combination. However, there are some outliers such as Colorado. Colorado shows a very low obesity rate and an average median family income compared to other states.
The next question I wanted to address was whether or not access to transportation has an effect on community health, measured by county obesity rate.
fig, ax = plt.subplots(1,2,figsize=(16,6))
df_pivot.plot.scatter(x='% House Units No Vehicle', y='% Obese', ax=ax[0],title='Across America')
# show just Florida
df_pivot.loc['Florida'].plot.scatter(x='% House Units No Vehicle',y='% Obese',title='Florida',ax=ax[1])
# show linear regression line for Florida
slope, intercept, r_value, p_value, std_err = stats.linregress(df_pivot.loc['Florida']['% House Units No Vehicle'],df_pivot.loc['Florida']['% Obese'])
pts = np.array(ax[1].get_xlim())
line = slope * pts + intercept
ax[1].plot(pts, line, lw=1, color='red')
As we can see, there is not a very strong relationship between the percent of housing units without a vehicle and the obesity rate in a county. Most counties fall in a 0 to 20% range of housing units without access to a vehicle, while obesity rate ranges from 0 to 45% regardless of the share of housing units without vehicle access. I expected transportation access to have a greater effect on community health.
Finally, I wanted to analyze the relationship between income and access to healthy and affordable food for each county in the dataset. To look further into this correlation I plotted low access population versus median family income. Each scatter plot represents low access definitions at 1/2 mile, 1 mile, 10 miles, and 20 miles.
# Graph low access population vs median family income for all counties
fig, ax = plt.subplots(4,1, figsize=(10,18))
df_pivot.plot.scatter(x='% Pop Low Access (1/2 mile)', y='MedianFamilyIncome', ax=ax[0],title='Across America')
df_pivot.plot.scatter(x='% Pop Low Access (1 mile)', y='MedianFamilyIncome', color='orange',ax=ax[1])
df_pivot.plot.scatter(x='% Pop Low Access (10 miles)', y='MedianFamilyIncome', color='green',ax=ax[2])
df_pivot.plot.scatter(x='% Pop Low Access (20 miles)', y='MedianFamilyIncome',color='purple',ax=ax[3])
When looking at low access defined at 1/2 mile and 1 mile, we can see a linear association between income and access to healthy food. As income decreases, access to healthy food also decreases for counties across America. The relationship is weaker for low access populations defined at 10 miles and 20 miles.
Next, I grouped the dataframe by state and plotted each state in its own subplot to show the relationship between median family income and low access populations measured at 1 mile from a grocery store.
# Select median family income and % population low access at the 1 mile mark
df_state = df_pivot[['MedianFamilyIncome','% Pop Low Access (1 mile)']]
# create list of state names to loop through
state_list = df_food.State.unique()
# create subplot to populate
fig, ax = plt.subplots(nrows=17, ncols=3, figsize=(35,130))
plt.subplots_adjust(hspace=0.3)
plt.subplots_adjust(wspace=0.3)
# create subplot of scatter plots of median family income vs healthy food access per state
for i, state in enumerate(state_list):
    # Fill the 17 x 3 grid row by row, one state per subplot
    df_state.loc[state].plot.scatter(x='% Pop Low Access (1 mile)',y='MedianFamilyIncome',ax=ax[i // 3][i % 3], title=state,fontsize=18)
The subplots above show the relationship between median family income and access to healthy food per state. Each dot on the scatter plots represents a county in that given state. These graphs are useful in visualizing whether or not there is a relationship between a household's income and their access to healthy food.
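With 51 subplots it can be hard to compare states by eye. One possible complement, sketched here on made-up values, is to compute the Pearson correlation between the two columns within each state. The toy frame below is an assumption for illustration and only mirrors the State and County index of the real dataframe.

```python
import pandas as pd

# Toy county-level frame indexed by State and County (made-up values)
toy = pd.DataFrame(
    {"MedianFamilyIncome": [40000.0, 50000.0, 60000.0, 45000.0, 55000.0, 65000.0],
     "% Pop Low Access (1 mile)": [60.0, 50.0, 40.0, 30.0, 40.0, 50.0]},
    index=pd.MultiIndex.from_tuples(
        [("A", "c1"), ("A", "c2"), ("A", "c3"),
         ("B", "c1"), ("B", "c2"), ("B", "c3")],
        names=["State", "County"]),
)

# Pearson correlation between the two columns within each state
per_state_r = {
    state: float(group["MedianFamilyIncome"].corr(group["% Pop Low Access (1 mile)"]))
    for state, group in toy.groupby(level="State")
}
print({state: round(r, 2) for state, r in per_state_r.items()})
# {'A': -1.0, 'B': 1.0}
```

Sorting these values would show at a glance which states follow the negative income and low access relationship and which do not.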
Below I show a good example of this relationship by displaying just Florida. The regression line shows the negative relationship between income and low access population percentage.
# show just florida
fig, ax = plt.subplots(figsize=(8,6))
df_pivot.loc['Florida'].plot.scatter(x='% Pop Low Access (1 mile)',y='MedianFamilyIncome',title='Florida',ax=ax)
slope, intercept, r_value, p_value, std_err = stats.linregress(df_pivot.loc['Florida']['% Pop Low Access (1 mile)'],df_pivot.loc['Florida']['MedianFamilyIncome'])
pts = np.array(ax.get_xlim())
line = slope * pts + intercept
ax.plot(pts, line, lw=1, color='red')
In conclusion, we have found answers to our research questions and identified the greatest predictors of community health. From the data analysis above, we can conclude that counties with greater low access populations also have higher obesity rates, particularly when looking at 1/2 mile and 1 mile distances as a measure of low access communities. Areas with less access to healthy affordable food options report higher rates of obesity. We also found evidence to support the claim that communities with lower median incomes report higher obesity rates, more unhealthy days, and higher populations in fair-poor health. However, the data does not provide evidence to support my hypothesis that access to transportation is a good predictor of community health. The data shows that access to transportation did not have a significant impact on community health. Finally, we can conclude that there is a linear association between income and access to healthy food. This finding was not surprising given that about half of low access communities in America are also below the poverty line.
I believe that the findings in this tutorial show the importance of the issue of food deserts in America. People all across America need access to healthy, affordable food, regardless of income or location. It is my hope that the research and work going into closing the 'grocery gap' will eventually lead to greater general community health and greater access to healthy food for all populations.
If you are interested in further research into food deserts and their effects on America's health, I have listed various websites that I used for statistics and project motivation. Thank you for your interest in helping make America healthier and food more accessible!