import pandas as pd
df = pd.read_csv('causes_of_death.tsv', delimiter='\t')
print(len(df))
df.head()
4115
 | Notes | State | State Code | Ten-Year Age Groups | Ten-Year Age Groups Code | Gender | Gender Code | Race | Race Code | Deaths | Population | Crude Rate |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | Alabama | 1 | < 1 year | 1 | Female | F | American Indian or Alaska Native | 1002-5 | 14 | 3579 | Unreliable |
1 | NaN | Alabama | 1 | < 1 year | 1 | Female | F | Asian or Pacific Islander | A-PI | 24 | 7443 | 322.5 |
2 | NaN | Alabama | 1 | < 1 year | 1 | Female | F | Black or African American | 2054-5 | 2093 | 169339 | 1236.0 |
3 | NaN | Alabama | 1 | < 1 year | 1 | Female | F | White | 2106-3 | 2144 | 347921 | 616.2 |
4 | NaN | Alabama | 1 | < 1 year | 1 | Male | M | Asian or Pacific Islander | A-PI | 33 | 7366 | 448.0 |
# Group by State
grouped = None  # Your code here
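One possible shape for this step, shown as a minimal sketch rather than the lab's reference solution. It uses a small invented `toy` DataFrame in place of `causes_of_death.tsv`, and it assumes the intended aggregate is a sum of `Deaths` and `Population` within each state. On the real file, columns may need converting with `pd.to_numeric(..., errors='coerce')` first if they contain text flags like the `Unreliable` value seen in `Crude Rate`.

```python
import pandas as pd

# Invented stand-in data for illustration only; not taken from the real file.
toy = pd.DataFrame({
    'State': ['A', 'A', 'B', 'B', 'C'],
    'Deaths': [10, 20, 5, 15, 40],
    'Population': [1000, 3000, 500, 1500, 8000],
})

# Sum the two numeric columns of interest within each state.
grouped_toy = toy.groupby('State')[['Deaths', 'Population']].sum()
print(grouped_toy)
```

Selecting the two columns before calling `.sum()` keeps text columns out of the aggregate.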
Calculate the correlation coefficient between the Deaths and Population columns of your grouped DataFrame.
#Your code here
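A minimal sketch of two ways to get this number, again on invented toy data and assuming a sum aggregate per state. `Series.corr` returns a single Pearson coefficient, while `DataFrame.corr` returns the full correlation matrix.

```python
import pandas as pd

# Invented stand-in data for illustration only.
toy = pd.DataFrame({
    'State': ['A', 'A', 'B', 'B', 'C'],
    'Deaths': [10, 20, 5, 15, 40],
    'Population': [1000, 3000, 500, 1500, 8000],
})
grouped_toy = toy.groupby('State')[['Deaths', 'Population']].sum()

# Option 1: a single Pearson coefficient between two Series.
r = grouped_toy['Deaths'].corr(grouped_toy['Population'])

# Option 2: the full correlation matrix; the off-diagonal entry is the same value.
corr_matrix = grouped_toy[['Deaths', 'Population']].corr()

print(r)
print(corr_matrix)
```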
Iterate over the following columns: ['Race', 'Gender', 'Ten-Year Age Groups']. Within your for loop, create a temporary groupby aggregate as we did for the State column above. Then, print any aggregate grouping where the correlation coefficient is less than 0.95.
#Your code here
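The loop described above could be sketched as follows. The helper name `low_correlation_groupings`, the toy data, and the choice of a sum aggregate are all assumptions made for illustration. Note that a grouping with only two groups always yields a correlation of exactly 1 or -1 (or NaN when one column does not vary), so that case says little on its own.

```python
import pandas as pd

def low_correlation_groupings(df, columns, threshold=0.95):
    """Return {column: correlation} for groupings whose correlation is below threshold."""
    flagged = {}
    for col in columns:
        # Temporary aggregate: total Deaths and Population per group.
        temp = df.groupby(col)[['Deaths', 'Population']].sum()
        corr = temp['Deaths'].corr(temp['Population'])
        # NaN compares as False, so undefined correlations are skipped.
        if corr < threshold:
            flagged[col] = corr
    return flagged

# Invented stand-in data for illustration only.
toy = pd.DataFrame({
    'Race': ['X', 'X', 'Y', 'Y', 'Z', 'Z'],
    'Gender': ['F', 'M', 'F', 'M', 'F', 'M'],
    'Deaths': [10, 12, 30, 33, 5, 6],
    'Population': [100, 110, 200, 210, 300, 310],
})

flagged = low_correlation_groupings(toy, ['Race', 'Gender'])
for col, corr in flagged.items():
    print(col, corr)
```

Returning a dictionary instead of printing inside the loop makes the result easier to inspect or reuse with a different threshold.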
We can further expand upon our exploration above by testing multiple features against each other! Complete the code below to print any combination of features where the correlation between Population and Deaths is below 0.95 (or some other appropriate threshold).
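# This could also be accomplished with the combinations() function from the itertools module.
# Assumption: the feature list matches the columns used in the previous step.
features = ['Race', 'Gender', 'Ten-Year Age Groups']
for n, feat1 in enumerate(features):
    # Start at n + 1 so a feature is never paired with itself.
    for feat2 in features[n + 1:]:
        # Your code here
        # groupby feat1 and feat2!!
        # repeat your code above to check if the correlation is below a [high] threshold.
        pass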
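The `itertools.combinations` alternative mentioned in the comment can be sketched like this. The helper name `low_correlation_pairs`, the toy data, and the sum aggregate are assumptions for illustration, not the lab's reference solution.

```python
from itertools import combinations

import pandas as pd

def low_correlation_pairs(df, features, threshold=0.95):
    """Return {(feat1, feat2): correlation} for feature pairs below threshold."""
    flagged = {}
    # combinations(features, 2) yields each unordered pair exactly once.
    for feat1, feat2 in combinations(features, 2):
        temp = df.groupby([feat1, feat2])[['Deaths', 'Population']].sum()
        corr = temp['Deaths'].corr(temp['Population'])
        # NaN compares as False, so undefined correlations are skipped.
        if corr < threshold:
            flagged[(feat1, feat2)] = corr
    return flagged

# Invented stand-in data for illustration only.
toy = pd.DataFrame({
    'Race': ['X', 'X', 'Y', 'Y', 'Z', 'Z'],
    'Gender': ['F', 'M', 'F', 'M', 'F', 'M'],
    'Ten-Year Age Groups': ['a1', 'a2', 'a2', 'a1', 'a1', 'a2'],
    'Deaths': [10, 12, 30, 33, 5, 6],
    'Population': [100, 110, 200, 210, 300, 310],
})

features = ['Race', 'Gender', 'Ten-Year Age Groups']
pairs = low_correlation_pairs(toy, features)
for pair, corr in pairs.items():
    print(pair, corr)
```

Compared with the nested loop, `combinations` removes the index bookkeeping and extends to triples by changing the second argument to 3.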