By Huilin Zhu
Introduction
My project studies the relationship between built environments and health outcomes, specifically obesity prevalence in Pennsylvania. My previous blog post introduced how I use transfer learning methods to extract features of the built environment from Google satellite images. In this post, I focus on the data analysis, investigating the association between the built environment and obesity.
Statistical Analysis
Using VGG as the pre-trained model, I extract 4,096 variables from the satellite images to represent the built environment in each census tract. These variables have no specific individual meaning, but together they can be regarded as indicators of the built environment, capturing properties such as color, gradient, edge, height, and length. Because the data contain a large number of features (p = 4,096), I use the elastic net algorithm in the data analysis stage. Elastic net is a regularized regression method that combines the lasso (L1) and ridge (L2) penalties: it shrinks the coefficients of uninformative variables to zero while retaining groups of informative, correlated variables. It is especially useful when the number of variables is in the thousands or even millions and may exceed the number of observations.
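For reference, the elastic net objective can be written as follows. This uses the parametrization from scikit-learn's ElasticNet, where n is the number of census tracts, w is the coefficient vector, alpha sets the overall penalty strength, and rho (called l1_ratio in scikit-learn) sets the mix between the L1 and L2 penalties. The symbols are notation introduced here for illustration, not taken from the analysis itself.

```latex
\min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2
  + \alpha \rho \lVert w \rVert_1
  + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2
```

With rho = 1 this reduces to the lasso, and with rho = 0 to ridge regression. The L1 term is what drives some coefficients exactly to zero.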
My project investigates how people's body weight is associated with the built environment. Adult obesity prevalence is the dependent variable, and the obesity data come from the 500 Cities project. The independent variables are the built environment features, represented by the 4,096 variables extracted by the CNN. Each census tract is one observation. I combined these variables with the health variables to check the association between the built environment and obesity. The following table is the merged data.
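The merge step could look roughly like the sketch below. The column names (tract_id, f0, f1, obesity_pct), the tract labels, and all of the values are made up for illustration. The real data would have 4,096 feature columns and real census tract identifiers.

```python
import pandas as pd

# Hypothetical CNN features, one row per census tract (only 2 of 4,096 columns shown).
features = pd.DataFrame({
    "tract_id": ["tract_A", "tract_B", "tract_C"],
    "f0": [0.12, 0.40, 0.33],
    "f1": [1.50, 0.00, 0.72],
})

# Hypothetical health data keyed by the same tract identifier.
health = pd.DataFrame({
    "tract_id": ["tract_A", "tract_B", "tract_D"],
    "obesity_pct": [31.2, 28.7, 35.0],
})

# Keep only tracts present in both tables; validate guards against duplicate IDs.
merged = pd.merge(features, health, on="tract_id", how="inner", validate="one_to_one")
print(merged)
```

An inner join drops tracts that lack either images or health data, so it is worth comparing the row counts before and after merging.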
Using the scikit-learn package in Python, I run an elastic net regression and obtain a coefficient for each feature variable. Only 58 coefficients are significant (that is, not shrunk to zero by the penalty), which means that 58 of the variables capture features related to the obesity percentage. The following image shows the coefficient value for each independent variable.
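A minimal sketch of this step with scikit-learn is shown below. It runs on synthetic data rather than the project's data, and the sample sizes, alpha, and l1_ratio values are assumptions chosen only to make the example run. They are not the settings used in the analysis.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the image features: more features than observations.
rng = np.random.default_rng(0)
n_tracts, n_features = 100, 200
X = rng.normal(size=(n_tracts, n_features))

# Only the first 5 features truly influence the outcome in this toy setup.
true_coef = np.zeros(n_features)
true_coef[:5] = 3.0
y = X @ true_coef + rng.normal(scale=0.5, size=n_tracts)

# Standardize so the penalty treats all features on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# alpha and l1_ratio are illustrative values, not tuned.
model = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=10_000)
model.fit(X_scaled, y)

coef = model.coef_
n_nonzero = int(np.count_nonzero(coef))
print(f"{n_nonzero} of {n_features} coefficients are nonzero")
```

In practice the penalty settings would usually be chosen by cross-validation, for example with scikit-learn's ElasticNetCV, rather than fixed by hand.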
Predictions
To evaluate how well the model predicts obesity prevalence across census tracts, I split the data into two sets: a random sample of fifty percent of the data for fitting, and the remaining fifty percent for model evaluation. The model's coefficient of determination, R², is 0.25. R² is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variables, and a value of 1.0 indicates a perfect fit. Here, the value of R² indicates that the built environment variables explain 25 percent of the between-tract variance in obesity prevalence.
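The evaluation described above can be sketched as follows, again on synthetic data. The data, the penalty settings, and the random seed are assumptions for illustration. The example also computes R² by hand, so its definition as one minus the ratio of residual to total sum of squares is explicit.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 observations, 200 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 200))
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=100)

# Fit on a random half of the data, evaluate on the other half.
X_fit, X_eval, y_fit, y_eval = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# Illustrative penalty settings, not tuned.
model = ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10_000).fit(X_fit, y_fit)
y_pred = model.predict(X_eval)

# R^2 = 1 - SS_res / SS_tot, computed on the held-out half.
ss_res = np.sum((y_eval - y_pred) ** 2)
ss_tot = np.sum((y_eval - y_eval.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

r2_sklearn = r2_score(y_eval, y_pred)
print(f"held-out R^2: {r2_sklearn:.3f}")
```

Because R² is computed on data the model did not see during fitting, it can be much lower than the in-sample value, and it can even be negative when the model predicts worse than the mean.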
I generated a choropleth map to compare the predicted obesity prevalence with the actual obesity prevalence:
The first image shows the cross-validated estimates of obesity prevalence based on features of the built environment extracted from satellite images; the second image shows the actual obesity prevalence. The two maps share some patterns, but the overall similarity is not very high. This indicates that the model captures part of the variation in obesity prevalence, but its predictions are not yet very accurate.
Currently, I include only the census tracts in Allentown. Allentown has 26 census tracts, so the sample contains only 26 observations. With so few observations and thousands of features, the model is prone to overfitting, which makes the results unreliable. To prevent overfitting, I need to limit the number of selected features to be less than or equal to the number of census tracts. I will add more Pennsylvania cities to the sample and use the same method to test the relationship between the built environment and body weight. If I obtain a high value of R², as well as a high similarity between the predicted and actual obesity prevalence, then I can conclude that the built environment is correlated with obesity prevalence across neighborhoods.
Future Work
In the future, I will extend the analysis with control variables: gender, race, median household income, and the percentage of households under the poverty line. I will also gather Google places-of-interest data to investigate how the built environment affects body weight through the food access channel.
My broader dissertation research explores how social factors affect health outcomes, particularly how gender, government policy, and urban space affect health and well-being. My first dissertation chapter looks at how maternity leave affects children's health in urban China, and my second chapter discusses the impact of maternity leave on mothers' labor outcomes after childbirth. This digital project will be the third chapter of my dissertation.
While working on this digital project, I encountered many technical problems, such as downloading Google tile images, implementing convolutional neural networks, and displaying results on a map. I have gained a lot of experience in Python coding and become more familiar with image analysis and CNNs. All of this will help me in my future studies. I really appreciate the help I have received from the Scholars Studio.