
Introductory Pandas Analysis
Overview & Data Sources
The purpose of our project was to study how obesity rates vary by income level, sports participation rate and geography across Canada. Our data came from the General Social Survey (GSS).
The data was accessed through Public Use Microdata Files, which are available to the University of Calgary via the Nesstar portal.
Variables were reviewed for potentially useful information, selected and then downloaded as a subset in a .csv file. The .csv files we used are included in the upload with our notebook. The downloads also included documentation describing how the data were coded in the survey results.
Our guiding questions changed a little from our project proposal because of how the data was coded and what was available to us. Ultimately we settled on the following:
Is there a difference in obesity levels among provinces?
Is there a relationship between income levels and obesity?
Is there a relationship between income levels and sport or activity participation?
Is there a relationship between sport or activity participation and obesity?
Is there a relationship between income levels and eating habits?
Is there a difference in obesity levels between rural areas and urban areas?
Working with the GSS Data
We had data from the GSS survey that we thought would help us look at different relationships as well as reexamine some of the other relationships from the CCHS data with a different data source.
Similar to the CCHS data set, the data was compiled as single coded entries, one per respondent, for each survey question. An example of one such variable would be:
The SPA_01 survey variable asked respondents the question:
'Did you participate in any sports during the past 12 months?'
1 in the data corresponded to 'Yes'
2 in the data corresponded to 'No'
6 in the data set corresponded to 'Valid Skip'
7 in the data set corresponded to 'Don't know'
8 in the data set corresponded to 'Refusal'
9 in the data set corresponded to 'Not stated'
For each survey in the GSS data, the questions differed, and hence the numerical values assigned to represent responses varied accordingly. Therefore, in terms of data wrangling, we used different cleaning methods and strategies for each variable to come up with meaningful inferences.
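As a minimal sketch of this kind of cleaning, the SPA_01 codes listed above can be mapped to labels and the non-response codes filtered out. The sample rows are made up for illustration, and treating only codes 1 and 2 as substantive answers is our assumption here rather than a rule taken from the GSS documentation.

```python
import pandas as pd

# Code-to-label mapping for SPA_01, taken from the list above.
SPA_01_LABELS = {
    1: "Yes",
    2: "No",
    6: "Valid skip",
    7: "Don't know",
    8: "Refusal",
    9: "Not stated",
}
# Assumption: only 1 and 2 carry an actual answer; the rest are non-responses.
VALID_CODES = [1, 2]

# Tiny made-up sample standing in for the downloaded .csv subset.
df = pd.DataFrame({"SPA_01": [1, 2, 2, 6, 9, 1, 7, 8]})

# Keep only rows with a substantive answer, then attach readable labels.
answered = df[df["SPA_01"].isin(VALID_CODES)].copy()
answered["SPA_01_label"] = answered["SPA_01"].map(SPA_01_LABELS)

print(answered["SPA_01_label"].value_counts())
```

Other variables would need their own mapping and their own list of valid codes, since the codings differ from question to question.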
Poor Eating Habits by Province
Here are the steps we took to determine the percentage of people that self-reported poor eating habits per province:
Cleaned up null responses from the data so that they would not interfere with our results
Got a count of respondents from each province that responded to the survey
Obtained a count of all the respondents per province that reported poor eating habits in general
Calculated the proportion and hence percentage of people per province that reported poor eating habits
Used the min and max functions to extract the provinces with the lowest and highest proportion of self-reported poor eating habits
Produced a table, bar graph and scatter plot to show the differences in reported poor eating habits per province
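The steps above can be sketched roughly as follows. The column names `province` and `poor_eating`, the 1 = yes / 2 = no coding and the sample rows are all hypothetical, and `idxmax` / `idxmin` are used here as one convenient way to pick out the extreme provinces, which may differ from the exact calls in our notebook.

```python
import pandas as pd

# Made-up sample; column names and the 1 = yes / 2 = no coding are assumptions.
df = pd.DataFrame({
    "province": ["AB", "AB", "AB", "QC", "QC", "QC", "QC", "PE", "PE", "PE"],
    "poor_eating": [1, 2, 9, 2, 2, 1, 2, 1, 1, 2],
})

# Step 1: drop non-responses (anything other than 1 or 2 in this sketch).
valid = df[df["poor_eating"].isin([1, 2])]

# Step 2: respondents per province.
respondents = valid.groupby("province").size()

# Step 3: respondents per province who reported poor eating habits.
poor = valid[valid["poor_eating"] == 1].groupby("province").size()

# Step 4: percentage; provinces with no "yes" answers get a count of 0.
summary = pd.DataFrame({"respondents": respondents, "poor": poor}).fillna(0)
summary["percent_poor"] = 100 * summary["poor"] / summary["respondents"]

# Step 5: provinces with the highest and lowest percentage.
highest = summary["percent_poor"].idxmax()
lowest = summary["percent_poor"].idxmin()
print(summary)
print("highest:", highest, "lowest:", lowest)
```

The `summary` table can then be passed to a plotting call to produce the bar graph and scatter plot.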

This shows that Prince Edward Island and New Brunswick have the highest percentage of people that reported poor eating habits, whereas Quebec has the lowest percentage of self-reported poor eating habits, followed by British Columbia.
Poor Eating Habits by Different Population Centres: Urban, Rural & Prince Edward Island
To follow up on poor eating habits per province, we wanted to test the hypothesis that people living in urban areas have easy access to highly processed foods and a lot of unhealthy take-out options whereas people living in rural areas may not have such access. This led us to ask whether people living in urban areas self-reported poor eating habits at a higher rate than people living in rural areas. To find out, we took the following steps:
Clean up all null responses and only use data with actual information in the responses.
Get a count of each respondent who answered the poor eating habits question per population centre.
Get a count of all respondents that reported poor eating habits per population centre.
Calculate the poor eating habits proportion and percentage per population centre.
Display a table with the information
Display a bar-graph showing the different percentages

Surprisingly, we found that a slightly lower percentage of people living in large urban areas reported poor eating habits than of people living in rural areas, though the difference was not substantial. People living in Prince Edward Island had the highest percentage of self-reported poor eating habits. These results were unexpected.
Family Income and Sports Participation
Based on previous research, we formulated the hypothesis that persons with higher incomes were much more likely to participate in sports than persons with lower incomes. This played well into our proposal that people of lower socio-economic status had fewer resources available to them for healthy eating and sports participation. In this part, we investigate income levels and sports participation levels. To get our results, we did the following:
We cleaned up any null responses and only used data with actual information in the responses.
We averaged the responses to sports participation: a yes to the question "Did you regularly participate in any sports during the past 12 months?" resulted in a '1' in the data whereas a no resulted in a '2'.
We grouped the averaged sports participation by the 6 income levels.
We produced a table and a scatterplot showing the results.
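A minimal sketch of the averaging and grouping steps, assuming the data has already been cleaned. The column names `income_level` and `sports`, the numbering of income levels and the sample rows are hypothetical. Because the answers are coded 1 and 2, a group in which a share p answered yes has mean 1·p + 2·(1 − p) = 2 − p, so the mean can be converted back into a participation percentage.

```python
import pandas as pd

# Made-up sample; "income_level" (1 = lowest ... 6 = highest) is a hypothetical
# column name and coding. Sports participation uses 1 = yes, 2 = no.
df = pd.DataFrame({
    "income_level": [1, 1, 1, 1, 3, 3, 3, 3, 6, 6, 6, 6],
    "sports": [2, 2, 2, 1, 2, 2, 1, 1, 1, 1, 1, 2],
})

# Average the 1/2 codes within each income level.
mean_code = df.groupby("income_level")["sports"].mean()

# With p = share answering yes, the mean is 1*p + 2*(1 - p) = 2 - p,
# so the share of participants is 2 - mean.
result = pd.DataFrame({
    "mean_code": mean_code,
    "percent_participating": 100 * (2 - mean_code),
})
print(result)
```

Reporting the percentage alongside the raw mean makes the direction of the relationship easier to read, since a lower mean code means higher participation.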

Not surprisingly, we found an almost linear relationship between income level and sports participation. A value close to 1 is indicative of higher sports participation whereas a value close to 2 is indicative of lower sports participation (on the survey, no sports participation = 2, yes to sports participation = 1). Therefore, persons with higher incomes reported higher sports participation than persons with lower incomes.
Education Level, Alcohol Consumption & Eating Habits
Next we wanted to examine the relationship between education level, alcohol consumption and eating habits. In order to conduct our analysis, we did the following:
Extracted the three variables into one pandas data frame.
Cleaned up all the null responses from all 3 groups of data.
Calculated the counts for all respondents that responded to poor eating habits, alcohol consumption and education status.
From that data, we calculated the total number of respondents that reported having poor eating habits and drinking alcohol every day, and grouped this information by education level.
Calculated the proportion and percentage of poor eating habits and regular drinking (drinking every day) per education level.
Produced a table and a grouped bar graph to represent this information.
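The steps above might look roughly like this in pandas. The column names, the education labels, the 1 = yes / 2 = no codings and the sample rows are all assumptions made for illustration, and dropping a row when either answer is missing is one possible cleaning rule rather than necessarily the one we applied.

```python
import pandas as pd

# Made-up sample. Column names and codings are assumptions:
# poor_eating and drinks_daily use 1 = yes, 2 = no, 9 = not stated.
df = pd.DataFrame({
    "education": ["Less than secondary"] * 4 + ["Secondary"] * 4 + ["Post-secondary"] * 4,
    "poor_eating": [1, 1, 2, 9, 1, 2, 2, 2, 2, 2, 2, 1],
    "drinks_daily": [2, 2, 2, 1, 2, 1, 2, 2, 1, 1, 2, 9],
})

# Keep rows where both questions have a substantive answer.
valid = df[df["poor_eating"].isin([1, 2]) & df["drinks_daily"].isin([1, 2])]

# Turn the 1/2 codes into booleans; the mean of a boolean column is the share of True.
flags = pd.DataFrame({
    "education": valid["education"],
    "poor_eating": valid["poor_eating"] == 1,
    "drinks_daily": valid["drinks_daily"] == 1,
})

# Percentage of each habit per education level, keeping the original level order.
percent = 100 * flags.groupby("education", sort=False)[["poor_eating", "drinks_daily"]].mean()
print(percent)
```

Calling `percent.plot(kind="bar")` on this table (with matplotlib installed) draws two bars side by side for each education level.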

It took quite a bit of data wrangling, but ultimately we got a grouped bar graph showing that as education level increased, poor eating habits decreased. Surprisingly, however, we also found that as level of education increased, the percentage of people that reported drinking regularly (every day) also increased. There is a marked increase in the percentage of regular drinking for persons with a post-secondary degree as compared to persons with less than a secondary school education.
We chose the grouped bar graph here because it allows us to simultaneously compare the two variables (poor eating habits and regular drinking) in the context of a third variable: level of education.
Sports Participation & General Self-Rated Health
Lastly, we wanted to examine the relationship between sports participation and self-rated health. As for self-rated health, the respondents answered according to the following scale:
"1" for in excellent health,
"2" for very good,
"3" for good,
"4" for fair,
"5" for poor
For sports participation, respondents answered the following question: Did you regularly participate in any sports during the past 12 months?
"1" for Yes
"2" for No
In order to explore the relationship, we did the following:
Extracted the two variables into one pandas data frame.
Cleaned up all the null responses from both variables.
Grouped the data by sports participation and calculated the counts for all respondents that responded in the survey.
Using this data, we again grouped by sports participation and calculated the counts for each of the 5 categories of health (example: 1 for excellent health, 2 for very good health)
Calculated the proportions and percentage of each health category per sports participation.
Produced a table and a grouped bar graph to represent this information.
We expected persons that answered "yes" to sports participation to self-report better health than persons that answered "no".
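One compact way to get the percentage of each health category within each sports participation group is a row-normalised cross-tabulation. This is a sketch with made-up rows and hypothetical column names `sports` and `health`; it uses `pd.crosstab` in place of the separate group-and-count steps listed above, and it treats any code outside the two scales as a non-response.

```python
import pandas as pd

# Codes from the scales above; the column names are hypothetical.
HEALTH = {1: "Excellent", 2: "Very good", 3: "Good", 4: "Fair", 5: "Poor"}
SPORTS = {1: "Yes", 2: "No"}

# Made-up sample; 9 stands for a non-response.
df = pd.DataFrame({
    "sports": [1, 1, 1, 1, 2, 2, 2, 2, 2, 9],
    "health": [1, 2, 2, 3, 3, 3, 4, 5, 9, 1],
})

# Keep rows where both variables have a substantive answer.
valid = df[df["sports"].isin(list(SPORTS)) & df["health"].isin(list(HEALTH))]

# Row-normalised cross-tabulation: each row sums to 100 percent.
table = 100 * pd.crosstab(valid["sports"], valid["health"], normalize="index")

# Replace the numeric codes with readable labels.
table = table.rename(index=SPORTS, columns=HEALTH)
print(table.round(1))
```

Normalising within each sports group, rather than over the whole sample, is what makes the "yes" and "no" groups comparable even when they differ in size.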

The grouped bar graph allows us to compare the different health categories side by side. From the graph, we can see that a higher percentage of persons that reported having participated in sports also reported being in "excellent health" and "very good health" as compared to the no sports participation group. Also, a larger proportion of persons from the no sports participation group reported being in "poor health" as compared to the yes sports participation group. One surprising aspect of our results was that a higher proportion of persons from the no sports group reported being in "good health" as compared to the yes sports group.