Analysis and predictions of COVID-19 mortality per US county based on county's population Health Conditions (as % to total population)

Abstract

As entire world continues to fight COVID-19, more and more researches that are intended to better understand such invisible enemy as COVID are in progress or already completed. I also made a small step in this direction and now I present my current study where I try to analyze all possible correlations between existing Health Conditions and COVID mortality in each of US counties, as well as to predict COVID mortality in the same counties in the near future (i.e. next 5 months). I collected data pertaining to each US county population’s medical conditions by scrubbing respective numbers from the city-data site. The Covid-19 statistics per state/county I found on CDC site and created script to read/scrub such data. After numerous data massaging and manipulation I came up with the Excel spreadsheet as my main data source for further analysis which I performed by using Tableau Visualization and ML technics like Regression/Prediction through the Python scripts. In the created dataset I considered all listed Medical Conditions per county as a Features (independent variables) and Covid-19 Total Deaths per county as a Dependent/Target variable.

Data Exploration

While performing Exploratory Data Analysis of my data I found that: 1) there are numbers of highly correlated between each other features which can make an impact on the outcome of regression model (see correlation heatmap on Fig.1); 2) no linear correlation between each feature and target variable; 3) majority of the features have negative correlation with the target variable; 4) there are a few significant outliers in the target variable data set (we know that some US counties have dramatical COVID death toll). To address these anomalies I decided to explore 5 scenarios: 1)run entire Dataset (i.e. all states/counties) with all features; 2) select 5 Important features and run in the entire dataset; 3) select 8 Important features and run in the entire dataset; 4) exclude highly correlated features, then select 5 Important features among remaining and run in the entire dataset; 5) exclude highly correlated features, then select 8 Important features among remaining and run in the entire dataset. Fig.1 Correlation Matrix

Data Model

I tried both Linear Regression and Random Forest models, but saw that Random Forest is performing better – most likely, due to anomalies specified above, so only one (first) scenario was ran twice under the each of these two models, then another 4 scenarios specified above have been ran using Random Forest Regression model only. To address the outlier presence, I tried to do a scale, but it didn’t improve scores, then I tries to use BoxCox, but this method has modified target data significally, so I decided not to use this option because the nature of target data became misrepresented. All rows in dataset was shuffled and then the split was done to obtain train and test data sets. I tried 4 sets of Hyperparameres for the RandomForestRegressor (using GridSearch), but due the nature of RandomForest model I saw no significant differences in the scores, so I set parameter n_estimators (i.e. number of trees) as 200 and max_depth as 10 for all of 5 scenarios. Important Features have been selected using the Recursive Feature Elimination technic.

Output/Results

I saved output, which represent predictions, of the runs of 5 scenarios mentioned above in the corresponding Excel files,as well as scores for each scenario and Features ranking results for all scenarios except the first one. All of these files I used for the Tableau visualization that consists of 3 Dashboards combined in one Story (see below).

Results/Scores Interpretation

Graph 2on Dashboard 1 shows that with all important features the Total Deaths is not changing much except the known outliers. Similar for the predictive numbers on Graph 3 (same Dashboard 1). The Features selection/Ranking results are presented on Dashboar 2: the features with rank 1 are considered as important and have been selected for scenarios 2-5. Features with the biggest rank on Graph 4 on Dashboard 2 are considered as less important. As it can be seen on dGraphs 5 and 6 (Dashboard 2), some scores are negative, while some are relatively low (the Accuracy score is, indeed, the mean of the cross-validation computation with the score parameter as neg_mean_squared_error). The low scores can be explained by non-correlation of the majority of the features with the target variable. This can be proved by changing any of the features filter slider on Graphs 7 and 8 (Dashboards 3): any significant change on those feature values can lead to the little change of the Total Deaths figures.

Conclusion

It seems that there is no straight correlation between common diseases like diabetes, arthritis, coronary heart disease, etc. and COVID-19 mortality. The medication taken by people suffered of such disease may contribute to the COVID mortality prevention. For instance, doctors prescribed Corticosteroids for people suffering of gout, arthritis, emphysema (and some other COPD), psoriasis. People suffering of high blood pressure, coronary heart disease and similar problem are taking Lozartan, anemia disease have a cure with the man-made form of protein, while for the thyroid disease doctors prescribed the Levothyroxine. All of such medications have been named by different researches as a potential reducers of COVID severe symptoms. But much more studies are needed in order to make some definite conclusion about the possible impact of existing common medications on the COVID-19 illness and mortality.