This post presents the report on medical insurance cost prediction using different regression models. Here you can find the file:
Medical Insurance Cost Prediction – Regression Model Comparison Report
Introduction
In this report, we present an analysis of medical insurance cost prediction using regression models.
We explored the performance of three different regression models:
· Linear Regression Model
· Lasso Regression Model
· Ridge Regression Model
Data Preparation
The dataset “Dataset – Medical Insurance Cost Prediction.csv” consists of seven columns (age, sex, bmi, children, smoker, region, charges). The charges column is the target variable; the remaining six are features that influence insurance charges.
The dataset was loaded into medical_data.
The dataset contained no missing values. Plots showing how the features affect charges were also produced.
The bar graph of charges by smoking status illustrates the stark contrast between those who smoke (high) and those who don’t (low). Smoking status also has the most influence on insurance costs.
As the bar graph of charges by region shows, the Southeast has the highest charges and the Southwest the lowest.
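The loading step, the missing-value check, and the group means behind the two bar graphs can be sketched as follows. This is a minimal sketch that assumes pandas was used (the report does not name the library). To stay self-contained it reads a five-row in-memory sample instead of the real CSV; the charges values in the sample are placeholders, not figures from the report.

```python
import io

import pandas as pd

# Illustrative sample with the same seven columns as the report's dataset.
# The charges values are placeholders chosen for this sketch only.
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,16000.0\n"
    "18,male,33.77,1,no,southeast,1700.0\n"
    "28,male,33.0,3,no,southeast,4400.0\n"
    "33,male,22.705,0,no,northwest,22000.0\n"
    "32,male,28.88,0,no,northwest,3900.0\n"
)

# With the real file this would be something like:
#   medical_data = pd.read_csv("Dataset – Medical Insurance Cost Prediction.csv")
medical_data = pd.read_csv(sample_csv)

# Count missing values per column; all zeros means there is nothing to impute.
missing_per_column = medical_data.isnull().sum()
print(missing_per_column)

# Mean charges per group: the quantities a bar graph by smoker or region would show.
print(medical_data.groupby("smoker")["charges"].mean())
print(medical_data.groupby("region")["charges"].mean())
```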
Categorical Variable
Three categorical variables were found in the dataset during data exploration: sex, smoker, and region. All of them carry information that has a significant impact on estimating insurance costs. Ordinal (integer) encoding was chosen over one-hot encoding and over dropping the categories, in order to preserve this information.
Ø For Sex, ‘Male’ was replaced by 1 and ‘Female’ by 0.
Ø For Smoker, ‘Yes’ was replaced by 1 and ‘No’ by 0.
Ø For Region, Southeast, Southwest, Northeast and Northwest were replaced by 0, 1, 2 and 3 respectively.
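The three replacements listed above can be sketched with a dictionary per column. This assumes a pandas DataFrame and lowercase category labels in the CSV; both are assumptions, and the three-row frame is hypothetical.

```python
import pandas as pd

# Hypothetical three-row frame holding only the categorical columns.
medical_data = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "northwest"],
})

# Integer codes as stated in the report (lowercase labels are an assumption).
encoding = {
    "sex": {"male": 1, "female": 0},
    "smoker": {"yes": 1, "no": 0},
    "region": {"southeast": 0, "southwest": 1, "northeast": 2, "northwest": 3},
}

# Replace each label with its integer code, column by column.
for column, mapping in encoding.items():
    medical_data[column] = medical_data[column].map(mapping)

print(medical_data)
```

Note that `map` yields NaN for any label missing from the dictionary, so a quick `isnull().sum()` afterwards is a cheap way to catch misspelled categories.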
Features and Target Selection
The variables age, sex, bmi, children, smoker and region were assigned to x.
The target variable charges was assigned to Y.
The charges column was then removed from medical_data.
First five rows of x:
index | age | sex | bmi | children | smoker | region |
0 | 19 | 0 | 27.900 | 0 | 1 | 1 |
1 | 18 | 1 | 33.770 | 1 | 0 | 0 |
2 | 28 | 1 | 33.000 | 3 | 0 | 0 |
3 | 33 | 1 | 22.705 | 0 | 0 | 3 |
4 | 32 | 1 | 28.880 | 0 | 0 | 3 |
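Separating the features from the target might look like the sketch below. It assumes pandas; the two feature rows copy the first two rows of the table above, while the charges values are placeholders.

```python
import pandas as pd

# Encoded feature values from the first two rows of the table; charges are placeholders.
medical_data = pd.DataFrame({
    "age": [19, 18],
    "sex": [0, 1],
    "bmi": [27.9, 33.77],
    "children": [0, 1],
    "smoker": [1, 0],
    "region": [1, 0],
    "charges": [16000.0, 1700.0],
})

# Take the target first, then keep every other column as the features.
Y = medical_data["charges"]
x = medical_data.drop(columns="charges")

print(x.head())
```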
Data Splitting
The dataset was divided into training, validation and testing sets with a 70/15/15 split.
· Training Set
Ø The training set comprises 70% of the original dataset and is used to train the regression models.
· Validation Set
Ø The validation set comprises 15% of the original dataset. It plays a vital role in assessing model performance and in checking that the models generalize to new, unseen data. All cross-validation and hyperparameter tuning are performed on this set.
· Testing Set
Ø The testing set also comprises 15% of the original dataset and remains completely unseen until the final evaluation. It gives an estimate of the model’s predictive capability on real-world data.
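One common way to obtain a 70/15/15 split is to call scikit-learn's train_test_split twice. The sketch below assumes that approach (the report does not say how the split was made) and uses synthetic data; the row count and random_state values are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows and 6 features (sizes are hypothetical).
rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 6))
Y = rng.normal(size=1000)

# First cut: 70% training, 30% held out.
x_train, x_hold, Y_train, Y_hold = train_test_split(
    x, Y, test_size=0.30, random_state=42
)

# Second cut: halve the held-out 30% into 15% validation and 15% test.
x_val, x_test, Y_val, Y_test = train_test_split(
    x_hold, Y_hold, test_size=0.50, random_state=42
)

print(len(x_train), len(x_val), len(x_test))
```

Fixing random_state makes the split reproducible, which matters when models are compared on the same validation and test rows.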
Model Training and Hyperparameter Selection
All three regression models were trained, each with its own approach. “LassoCV and RidgeCV were used because they have a built-in feature that automatically chooses the best hyperparameter using cross-validation.”
The list of candidate alphas [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0] was supplied for selecting the best alpha.
The best alpha values chosen by LassoCV and RidgeCV were 10.0 for Lasso and 1.0 for Ridge. These values were therefore used for the Lasso and Ridge models.
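The training step described above can be sketched with scikit-learn as follows. The alpha list is the one given in the report; everything else (the synthetic data, its coefficients, and cv=5 for LassoCV) is an assumption, so the alphas selected here will not match the report's 10.0 and 1.0.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV

# Synthetic stand-in for the training set (hypothetical sizes and coefficients).
rng = np.random.default_rng(0)
x_train = rng.normal(size=(200, 6))
true_coef = np.array([3.0, 0.0, 2.0, 0.5, 10.0, 0.0])
Y_train = x_train @ true_coef + rng.normal(scale=1.0, size=200)

# Candidate alphas, as listed in the report.
alphas = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

# Plain least squares has no hyperparameter to tune.
linear = LinearRegression().fit(x_train, Y_train)

# LassoCV and RidgeCV pick the alpha with the best cross-validated score.
lasso = LassoCV(alphas=alphas, cv=5).fit(x_train, Y_train)
ridge = RidgeCV(alphas=alphas).fit(x_train, Y_train)

# alpha_ holds the value each estimator selected.
print("Lasso alpha:", lasso.alpha_)
print("Ridge alpha:", ridge.alpha_)
```

Both estimators run their cross-validation on the data passed to fit, so the alpha is selected without touching the validation or test sets.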
Model Evaluation and Comparison
Model evaluation is essential for knowing how well each model performs. For this, Mean Absolute Error (MAE) and Mean Squared Error (MSE) were calculated for all three models on the validation and testing sets, since both were unseen during training.
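For reference, the two metrics can be written out in a few lines of plain Python. This is a sketch of the definitions only; scikit-learn provides functions with the same names in sklearn.metrics, and the numbers below are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted|, in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error(y_true, y_pred):
    """Average of (actual - predicted)^2, which penalizes large errors more."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical charges and predictions, for illustration only.
actual = [1000.0, 2000.0, 3000.0]
predicted = [1100.0, 1900.0, 3300.0]

mae = mean_absolute_error(actual, predicted)
mse = mean_squared_error(actual, predicted)
print(mae, mse)
```

Because MSE squares each error, a few badly mispredicted charges raise it far more than they raise MAE, which is why the two metrics are reported side by side.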
Model Performance on Validation Data
Model | MAE Validation | MSE Validation |
Linear Regression | 3953.871481 | 3.070072e+07 |
Lasso | 3960.270761 | 3.071415e+07 |
Ridge | 3969.455578 | 3.074930e+07 |
Model Performance on Test Data
Model | MAE Test | MSE Test |
Linear Regression | 4346.254246 | 3.691318e+07 |
Lasso | 4349.066593 | 3.691007e+07 |
Ridge | 4358.010750 | 3.694051e+07 |
As the tables above show, all the models have very similar MSE values on both the validation and test sets, with only slight variations. This suggests that all models provide reasonable predictions of medical charges.
The MAE values also differ only slightly, but the differences are consistent: the Linear Regression model achieved the lowest MAE on both the validation and test sets, ahead of Lasso and Ridge. Lower MAE and MSE indicate that a model produces smaller prediction errors and a more accurate representation of the relationship between the features and medical charges.
Comparing all three models, the Linear Regression model has the lowest MAE on both sets and the lowest MSE on the validation set (Lasso’s test MSE is marginally lower), so it is considered the most accurate and effective for the final prediction.
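To put the size of these differences in perspective, the sketch below takes the MAE and MSE figures from the two tables and prints each model's MAE gap to the best model, along with the RMSE (the square root of MSE, which is in the same units as charges). The variable names are chosen for this illustration.

```python
import math

# (MAE, MSE) pairs copied from the validation and test tables in this report.
validation_scores = {
    "Linear Regression": (3953.871481, 3.070072e+07),
    "Lasso": (3960.270761, 3.071415e+07),
    "Ridge": (3969.455578, 3.074930e+07),
}
test_scores = {
    "Linear Regression": (4346.254246, 3.691318e+07),
    "Lasso": (4349.066593, 3.691007e+07),
    "Ridge": (4358.010750, 3.694051e+07),
}

for split, scores in (("validation", validation_scores), ("test", test_scores)):
    best_mae = min(mae for mae, _ in scores.values())
    for model, (mae, mse) in scores.items():
        gap_pct = 100 * (mae - best_mae) / best_mae  # MAE gap to the best model
        rmse = math.sqrt(mse)  # back in the units of charges
        print(f"{split:10s} {model:18s} MAE gap {gap_pct:.2f}%  RMSE {rmse:.1f}")
```

Every MAE gap comes out below half a percent, so the ranking is consistent but the practical difference between the three models is small.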
Critical Discussion
When used to predict on the entire dataset, the selected Linear Regression model printed,
Several decisions and assumptions were made throughout the research in order to simplify the steps and improve understanding of the models. These include the encoding of categorical variables and the selection of alpha values for the Lasso and Ridge models.
Categorical Variable Encoding
Ordinal (integer) encoding was used for categorical variables to preserve essential information. While this choice simplifies the modeling process, it imposes an arbitrary numeric order on region, which has no natural ordering, and may not capture complex relationships hidden within the data. One-hot encoding could have been a better alternative, since it avoids that artificial order.
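For comparison, a one-hot version of the region column might look like the sketch below. It assumes pandas and lowercase region labels; the four-row frame is hypothetical and is not part of the report's pipeline.

```python
import pandas as pd

# Hypothetical rows covering the four regions named in the report.
regions = pd.DataFrame(
    {"region": ["southeast", "southwest", "northeast", "northwest"]}
)

# One 0/1 indicator column per region. Passing drop_first=True would drop one
# of them, which avoids perfect collinearity with a linear model's intercept.
one_hot = pd.get_dummies(regions, columns=["region"], dtype=int)

print(one_hot)
```

Each row has exactly one indicator set to 1, so no region is treated as numerically larger than another.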
Hyperparameter Selection
LassoCV and RidgeCV were used to select the alpha values. Although the selected values performed well, further hyperparameter tuning could have been done to improve the models’ performance.
Data Limitations
The dataset used in the analysis was limited and may not cover every factor affecting medical insurance costs. If it had also included data on pre-existing medical conditions or lifestyle, the results could have been different.
Conclusion
In conclusion, this report presented an analysis of medical insurance cost prediction using three regression models: Linear Regression, Lasso Regression, and Ridge Regression. Data examination, model training, and evaluation showed that the Linear Regression model performed best overall, with the lowest MAE on both the validation and test sets, although the differences between the three models were small.
The chosen model shows good potential. However, further improvements can be made.
References
The statement that LassoCV and RidgeCV have a built-in feature for automatically choosing the best hyperparameter using cross-validation is based on:
sklearn.linear_model.LassoCV — scikit-learn 1.3.2 documentation
sklearn.linear_model.RidgeCV — scikit-learn 1.3.2 documentation