Medical Insurance Cost Prediction – Regression Model Comparison Report

This post presents a report on medical insurance cost prediction using different regression models.
The file is available here:

Medical Insurance Cost Prediction

Introduction

In this report, we present an analysis of medical insurance cost prediction using regression models.

We explored the performance of three different regression models:

· Linear Regression Model

· Lasso Regression Model

· Ridge Regression Model

Data Preparation

The file “Dataset – Medical Insurance Cost Prediction.csv” consists of seven columns (age, sex, bmi, children, smoker, region, charges). charges is the target variable; the other six are features that influence insurance charges.

The file was loaded into medical_data.

The dataset contained no missing values. Plots showing how individual features relate to charges were also produced.
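The loading step and the missing-value check can be sketched as follows. This is a minimal illustration, not the report's actual code: it assumes pandas was used, and it reads a tiny in-memory stand-in instead of the real CSV. The charges values and the spelling of the category labels in the stand-in are made-up placeholders.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the CSV file (placeholder values, not real data)
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,Female,27.9,0,Yes,Southwest,16000.0\n"
    "18,Male,33.77,1,No,Southeast,1700.0\n"
    "28,Male,33.0,3,No,Southeast,4400.0\n"
)

# With the real file, the path to the CSV would be passed instead of sample_csv
medical_data = pd.read_csv(sample_csv)

# Count missing values per column; all zeros means there is nothing to impute
missing_per_column = medical_data.isnull().sum()
print(missing_per_column)
```

If any count were non-zero, the affected rows would need to be dropped or imputed before modeling.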

The bar graph of charges by smoking status shows a stark contrast between those who smoke (high) and those who don't (low). Smoking status also has the most influence on insurance costs.

The bar graph of charges by sex shows that charges for women are slightly higher than for men, although the difference is small enough that the two can be treated as roughly equal.

The bar graph of charges by region shows that the Southeast has the highest charges and the Southwest the lowest.
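The numbers behind such a bar graph can be sketched with a group-by. This assumes each bar shows the mean charge per category, which the report does not state explicitly. The rows below are made-up placeholders used only to show the mechanics.

```python
import pandas as pd

# Made-up placeholder rows; not taken from the real dataset
medical_data = pd.DataFrame({
    "smoker": ["Yes", "No", "No", "Yes"],
    "charges": [30000.0, 5000.0, 7000.0, 34000.0],
})

# Mean charge per smoking status; one bar per group
mean_by_smoker = medical_data.groupby("smoker")["charges"].mean()
print(mean_by_smoker)

# mean_by_smoker.plot(kind="bar") would draw the chart (requires matplotlib)
```

The same pattern applies to sex and region by changing the column passed to groupby.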

Categorical Variable

Three categorical variables were found in the dataset during data exploration: sex, region, and smoker. All of them carry information that has a significant impact on estimating insurance costs. Ordinal (integer) encoding was chosen over one-hot encoding, and over dropping the categories, in order to preserve this information.

Ø For sex, ‘Male’ was replaced by 1 and ‘Female’ by 0.

Ø For smoker, ‘Yes’ was replaced by 1 and ‘No’ by 0.

Ø For region, Southeast, Southwest, Northeast and Northwest were replaced by 0, 1, 2 and 3 respectively.
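The integer mapping listed above can be sketched with pandas. This is an assumption about how it might be done, not the report's code. The label spellings follow the report and may need adjusting to match the actual file, and the three sample rows are illustrative.

```python
import pandas as pd

# Illustrative rows only; label spellings follow the report
medical_data = pd.DataFrame({
    "sex": ["Female", "Male", "Male"],
    "smoker": ["Yes", "No", "No"],
    "region": ["Southwest", "Southeast", "Northwest"],
})

# One dictionary per categorical column, matching the mapping in the list above
encoding = {
    "sex": {"Male": 1, "Female": 0},
    "smoker": {"Yes": 1, "No": 0},
    "region": {"Southeast": 0, "Southwest": 1, "Northeast": 2, "Northwest": 3},
}

# Replace each label with its integer code
for column, mapping in encoding.items():
    medical_data[column] = medical_data[column].map(mapping)

print(medical_data)
```

Note that map turns any label missing from the dictionary into NaN, so a quick isnull check after encoding catches spelling mismatches.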

Features and Target Selection

The variables age, sex, bmi, children, smoker and region were assigned to x.

The target variable charges was assigned to Y.

The charges column was then removed from medical_data.
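A minimal sketch of this selection step, assuming medical_data is a pandas DataFrame. The feature values are the first two encoded rows from the report; the charges values are made-up placeholders.

```python
import pandas as pd

# First two encoded rows; the charges values are made-up placeholders
medical_data = pd.DataFrame({
    "age": [19, 18],
    "sex": [0, 1],
    "bmi": [27.9, 33.77],
    "children": [0, 1],
    "smoker": [1, 0],
    "region": [1, 0],
    "charges": [16000.0, 1700.0],
})

feature_columns = ["age", "sex", "bmi", "children", "smoker", "region"]

x = medical_data[feature_columns]   # features
Y = medical_data["charges"]         # target

# Drop the target so medical_data holds features only
medical_data = medical_data.drop(columns=["charges"])
```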

First five rows of x:

	age	sex	bmi	children	smoker	region
0	19	0	27.900	0	1	1
1	18	1	33.770	1	0	0
2	28	1	33.000	3	0	0
3	33	1	22.705	0	0	3
4	32	1	28.880	0	0	3

Data Splitting

The dataset was divided into training, validation and testing sets with a 70/15/15 split.

· Training Set

Ø The training set comprises 70% of the original dataset and is used to train the regression models.

· Validation Set

Ø The validation set comprises 15% of the original dataset. It plays a vital role in assessing model performance and checking how well the models generalize to new or unseen data. All cross-validation and hyperparameter tuning are performed on this set.

· Testing Set

Ø The testing set also comprises 15% of the original dataset and remains completely unseen until the final evaluation. It gives an estimate of the model's predictive capability on real-world data.
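One common way to obtain a 70/15/15 split is to call scikit-learn's train_test_split twice. The report does not say how the split was made, so the two-step approach, the random_state value and the synthetic arrays below are all assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 200 rows, 6 features (not the real dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 6))
Y = rng.normal(size=200)

# Step 1: hold out 30% of the rows, leaving 70% for training
x_train, x_temp, Y_train, Y_temp = train_test_split(
    x, Y, test_size=0.30, random_state=42
)

# Step 2: split the held-out 30% in half: 15% validation, 15% test
x_val, x_test, Y_val, Y_test = train_test_split(
    x_temp, Y_temp, test_size=0.50, random_state=42
)

print(len(x_train), len(x_val), len(x_test))
```

Fixing random_state keeps the split reproducible, so all three models are compared on exactly the same rows.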

Model Training and Hyperparameter Selection

All three regression models were trained, each with its own approach. “LassoCV and RidgeCV were used because they have a built-in feature for automatically choosing the best hyperparameter using cross-validation.”

The list of alphas [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0] was supplied as the candidates for the best alpha.

The best alpha values selected by LassoCV and RidgeCV were 10.0 for Lasso and 1.0 for Ridge. These values were therefore used for the Lasso and Ridge models.
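The training step can be sketched as below. The alpha list is the one given in the report; everything else is an assumption: the synthetic data, the cv=5 setting for LassoCV, and fitting on a training array. The alphas selected here will not match the report's 10.0 and 1.0, since the data differ.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV

# Candidate alphas from the report
alphas = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

# Synthetic stand-in for the training data (not the real dataset)
rng = np.random.default_rng(0)
x_train = rng.normal(size=(120, 6))
true_coefficients = np.array([3.0, 0.0, 2.0, 0.5, 10.0, 0.0])
Y_train = x_train @ true_coefficients + rng.normal(scale=0.5, size=120)

# Plain least squares has no hyperparameter to tune
linear_model = LinearRegression().fit(x_train, Y_train)

# LassoCV and RidgeCV pick the best alpha from the list by cross-validation
lasso_model = LassoCV(alphas=alphas, cv=5).fit(x_train, Y_train)
ridge_model = RidgeCV(alphas=alphas).fit(x_train, Y_train)

print("Lasso alpha:", lasso_model.alpha_)
print("Ridge alpha:", ridge_model.alpha_)
```

After fitting, the alpha_ attribute holds the selected value, and the fitted object can be used directly for prediction.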

Model Evaluation and Comparison

Model evaluation is essential to know how well each model performs. For this, Mean Absolute Error (MAE) and Mean Squared Error (MSE) were calculated for all three models on the validation and testing sets, as both are held-out data.
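As a small worked example of the two metrics, the sketch below computes MAE and MSE with scikit-learn and checks them against their definitions. The three true and predicted values are made up for illustration. MAE averages the absolute errors, while MSE averages the squared errors and so penalizes large misses more heavily.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values for illustration only
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)

# The same quantities written out from their definitions
mae_manual = np.mean(np.abs(y_true - y_pred))
mse_manual = np.mean((y_true - y_pred) ** 2)

print(mae, mse)
```

MSE is expressed in squared units of the target, which is why its values are several orders of magnitude larger than the MAE values.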

Model Performance on Validation Data

Model	MAE Validation	MSE Validation
Linear Regression	3953.871481	3.070072e+07
Lasso	3960.270761	3.071415e+07
Ridge	3969.455578	3.074930e+07

Model Performance on Test Data

Model	MAE Test	MSE Test
Linear Regression	4346.254246	3.691318e+07
Lasso	4349.066593	3.691007e+07
Ridge	4358.010750	3.694051e+07

As the tables above show, all models have very similar MSE values on both the validation and test sets, with only slight variations. This suggests that the three models perform comparably when predicting medical charges.

The differences in MAE are also small (well under 1%), but they are consistent: the Linear Regression model achieved the lowest MAE on both the validation and test sets. It also had the lowest validation MSE, while Lasso had a marginally lower MSE on the test set. Lower MAE and MSE indicate smaller prediction errors and a more accurate representation of the relationship between the features and medical charges.

Based on this comparison of MAE and MSE, the Linear Regression model was considered the most accurate and effective for the final prediction.
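The comparison can be checked directly from the reported numbers. The sketch below uses only the MAE and MSE values from the two tables; the variable names are illustrative. For each split it finds the model with the lowest MAE, the model with the lowest MSE, and the relative spread in MAE across the three models.

```python
# (MAE, MSE) pairs copied from the validation and test tables
results = {
    "validation": {
        "Linear Regression": (3953.871481, 3.070072e+07),
        "Lasso": (3960.270761, 3.071415e+07),
        "Ridge": (3969.455578, 3.074930e+07),
    },
    "test": {
        "Linear Regression": (4346.254246, 3.691318e+07),
        "Lasso": (4349.066593, 3.691007e+07),
        "Ridge": (4358.010750, 3.694051e+07),
    },
}

summary = {}
for split, scores in results.items():
    maes = [mae for mae, _ in scores.values()]
    summary[split] = {
        # Model with the lowest MAE and the lowest MSE on this split
        "best_mae": min(scores, key=lambda name: scores[name][0]),
        "best_mse": min(scores, key=lambda name: scores[name][1]),
        # Gap between the best and worst MAE, as a percentage of the best
        "mae_spread_pct": 100 * (max(maes) - min(maes)) / min(maes),
    }
    print(split, summary[split])
```

A spread this small suggests that the ranking could change with a different random split, so the margin between models should be read with caution.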

Critical Discussion

The selected Linear Regression Model when used to predict the entire dataset printed,

Several decisions and assumptions were made throughout the work in order to simplify the steps and make the models easier to interpret. These include the encoding of categorical variables and the selection of alpha values for the Lasso and Ridge models.

Categorical Variable Encoding

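Ordinal (integer) encoding was used for the categorical variables to preserve essential information. While this choice simplifies the modeling process, it may not capture the relationships hidden within the data: the integer codes impose an arbitrary order on region (for example, Northwest = 3 is treated as three times Southwest = 1), which a linear model interprets as a numeric trend. One-hot encoding could have been a better alternative, since it gives each category its own coefficient.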
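A minimal sketch of the one-hot alternative, using pandas get_dummies. This was not part of the original analysis. The drop_first=True setting is an assumption made here to avoid perfectly collinear columns in a linear model, and the four-row frame is illustrative.

```python
import pandas as pd

# One row per region, for illustration only
regions = pd.DataFrame(
    {"region": ["Southeast", "Southwest", "Northeast", "Northwest"]}
)

# One 0/1 column per region; the first category (alphabetically) is dropped
# and acts as the baseline
one_hot = pd.get_dummies(regions, columns=["region"], drop_first=True, dtype=int)

print(one_hot)
```

Each remaining column gets its own coefficient, measured relative to the dropped baseline region, so no ordering between regions is implied.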

Hyperparameter Selection

LassoCV and RidgeCV were used to select the alpha values. Although the selected values performed well, further hyperparameter tuning could have improved the models' performance.

Data Limitations

The dataset used in the analysis is limited and may not include every factor that affects medical insurance costs. If it had also contained data on pre-existing medical conditions or lifestyle, the results could have been different.

Conclusion

In conclusion, this report presented an analysis of medical insurance cost prediction using three regression models: Linear Regression, Lasso Regression, and Ridge Regression. Data examination, model training, and evaluation showed that the Linear Regression model performed best overall, with the lowest MAE on both the validation and test sets, although the margins were small.

The chosen model shows promise, but further improvements can be made.

References

“LassoCV and RidgeCV were used because they have a built-in feature for automatically choosing the best hyperparameter using cross-validation.”

The information about LassoCV and RidgeCV was taken from:

sklearn.linear_model.LassoCV — scikit-learn 1.3.2 documentation

sklearn.linear_model.RidgeCV — scikit-learn 1.3.2 documentation

Journey Of Parvat Bhusal
