
Medical Insurance Cost Prediction – Regression Model Comparison Report

 
This post presents a report on medical insurance cost prediction using several regression models.
Here you can find the file:


  


 

Introduction

 

In this report, we present an analysis of medical insurance cost prediction using regression models.

We explored the performance of three different regression models:

·        Linear Regression Model

·        Lasso Regression Model

·        Ridge Regression Model

Data Preparation

 

The dataset, “Dataset – Medical Insurance Cost Prediction.csv”, consists of seven columns (age, sex, bmi, children, smoker, region, charges). Here charges is the target variable, and the remaining six columns are features that influence insurance charges.

The file was loaded into medical_data.

The dataset contained no missing values. Plots showing how individual features affect charges were also produced.
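
The loading step and the missing-value check can be sketched as follows, assuming pandas. The inline sample rows (including the charges values) are invented placeholders so that the snippet runs on its own; with the real data, `pd.read_csv` would be given the path to the CSV file instead.

```python
import io

import pandas as pd

# Placeholder rows for illustration only (charges are NOT real values).
sample_csv = io.StringIO(
    "age,sex,bmi,children,smoker,region,charges\n"
    "19,female,27.9,0,yes,southwest,1000.0\n"
    "18,male,33.77,1,no,southeast,2000.0\n"
    "28,male,33.0,3,no,southeast,3000.0\n"
)

# With the real file this would be pd.read_csv("<path to the CSV file>").
medical_data = pd.read_csv(sample_csv)

# Count missing values per column; all zeros means nothing is missing.
missing_per_column = medical_data.isnull().sum()
print(missing_per_column)
```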

                                      

The bar graph illustrates the stark contrast in charges between smokers (high) and non-smokers (low). Of the features examined, smoking status has the greatest influence on insurance costs.


                                                 

The bar graph above shows that charges for women are slightly higher than for men, although the difference is small enough that the two can be treated as roughly equal.

                                                


As seen in the bar graph above, the Southeast region has the highest charges and the Southwest the lowest.
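
The numbers behind bar graphs like these can be obtained by grouping. The sketch below assumes the bars show the mean of charges per category, which the report does not state explicitly. The data frame is a toy example with invented charges and lowercase labels, and `mean_charges_by` is a hypothetical helper name.

```python
import pandas as pd

# Toy rows with invented charges, for illustration only.
medical_data = pd.DataFrame(
    {
        "smoker": ["yes", "no", "yes", "no"],
        "sex": ["female", "male", "male", "female"],
        "region": ["southeast", "southwest", "northeast", "northwest"],
        "charges": [30000.0, 5000.0, 28000.0, 7000.0],
    }
)


def mean_charges_by(frame, column):
    """Return the mean of `charges` per category of `column`, highest first."""
    return frame.groupby(column)["charges"].mean().sort_values(ascending=False)


for column in ["smoker", "sex", "region"]:
    print(mean_charges_by(medical_data, column), end="\n\n")
```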

 

 

Categorical Variable

 

Three categorical variables were found in the dataset during data exploration: sex, region, and smoker. Each carries information that significantly affects the estimate of insurance costs. Ordinal (integer) encoding was chosen over one-hot encoding or dropping the columns, so that this information was preserved.

Ø  For sex, ‘Male’ was replaced by 1 and ‘Female’ by 0.

Ø  For smoker, ‘Yes’ was replaced by 1 and ‘No’ by 0.

Ø  For region, Southeast, Southwest, Northeast and Northwest were replaced by 0, 1, 2 and 3 respectively.
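
The replacements listed above can be sketched with one dictionary per column, assuming pandas. The five rows are the categorical values decoded from the first five encoded rows shown later in this report. The lowercase spelling of the labels is an assumption about how they appear in the CSV.

```python
import pandas as pd

# Categorical values for the first five rows (label spelling is assumed).
medical_data = pd.DataFrame(
    {
        "sex": ["female", "male", "male", "male", "male"],
        "smoker": ["yes", "no", "no", "no", "no"],
        "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    }
)

# One mapping per categorical column, matching the list above.
encoding = {
    "sex": {"male": 1, "female": 0},
    "smoker": {"yes": 1, "no": 0},
    "region": {"southeast": 0, "southwest": 1, "northeast": 2, "northwest": 3},
}

# Replace each label with its integer code.
for column, mapping in encoding.items():
    medical_data[column] = medical_data[column].map(mapping)

print(medical_data)
```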

 

Features and Target Selection

 

The variables age, sex, bmi, children, smoker and region were assigned to x.

The target variable, charges, was assigned to Y.

The charges column was then removed from medical_data.

 

Final look of the first five rows of x:

     age   sex   bmi      children   smoker   region
0    19    0     27.900   0          1        1
1    18    1     33.770   1          0        0
2    28    1     33.000   3          0        0
3    33    1     22.705   0          0        3
4    32    1     28.880   0          0        3

 

 

Data Splitting

 

The dataset was divided into training, validation and testing sets in a 70/15/15 ratio.

·        Training Set

Ø  The training set comprises 70% of the original dataset and is used to train the regression models.

·        Validation Set

Ø  The validation set comprises 15% of the original dataset. It is used to assess model performance and to check how well each model generalises to new, unseen data. All cross-validation and hyperparameter tuning were performed on this set.

·        Testing Set

Ø  The testing set also comprises 15% of the original dataset and remains completely unseen until the final evaluation. It provides an estimate of the model's predictive capability on real-world data.
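
One common way to obtain a 70/15/15 split is to call scikit-learn's `train_test_split` twice. This is a sketch on synthetic stand-in data. The report does not show the actual splitting code, and the `random_state` value and the intermediate names (`x_temp`, `Y_temp`) are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 6 features (values are arbitrary).
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 6))
Y = rng.normal(size=100)

# Step 1: hold out 30% of the rows, leaving 70% for training.
x_train, x_temp, Y_train, Y_temp = train_test_split(
    x, Y, test_size=0.30, random_state=42
)

# Step 2: split the held-out 30% in half: 15% validation, 15% test.
x_val, x_test, Y_val, Y_test = train_test_split(
    x_temp, Y_temp, test_size=0.50, random_state=42
)

print(len(x_train), len(x_val), len(x_test))
```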

 

 

Model Training and Hyperparameter Selection

 

All three regression models were trained, each with its own approach. “LassoCV and RidgeCV were used because they have a built-in feature that automatically chooses the best hyperparameter using cross-validation.”

The candidate alphas [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0] were supplied for selecting the best alpha.

The best alpha values chosen by LassoCV and RidgeCV were 10.0 for Lasso and 1.0 for Ridge. These cross-validated values were therefore adopted.
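
The alpha selection can be sketched as below, using the same list of candidate alphas. The data is synthetic (generated with `make_regression`), so the alphas it selects will generally differ from the 10.0 and 1.0 reported above. The `cv=5` and `max_iter` settings are assumptions, not values taken from the report.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, LinearRegression, RidgeCV

# Synthetic stand-in for the data used to fit the models (not the insurance data).
x_train, Y_train = make_regression(
    n_samples=200, n_features=6, noise=10.0, random_state=0
)

alphas = [0.0001, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

# Plain least squares has no regularisation strength to tune.
linear = LinearRegression().fit(x_train, Y_train)

# Each *CV estimator scores every candidate alpha by cross-validation
# on the data passed to fit() and stores the best one in `alpha_`.
lasso = LassoCV(alphas=alphas, cv=5, max_iter=10_000).fit(x_train, Y_train)
ridge = RidgeCV(alphas=alphas).fit(x_train, Y_train)

print("Lasso alpha:", lasso.alpha_)
print("Ridge alpha:", ridge.alpha_)
```

With its default settings, RidgeCV uses an efficient form of leave-one-out cross-validation, whereas LassoCV here uses 5-fold cross-validation.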

 

 

Model Evaluation and Comparison

 

Model evaluation is essential for knowing how well each model performs. For this, Mean Absolute Error (MAE) and Mean Squared Error (MSE) were calculated for all three models on both the validation and testing sets, as both were unseen data.
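
For reference, both metrics are simple averages over the prediction errors. The sketch below writes them out with NumPy on a tiny made-up example. The function names `mae` and `mse` are illustrative, and scikit-learn provides equivalent `mean_absolute_error` and `mean_squared_error` functions in `sklearn.metrics`.

```python
import numpy as np


def mae(y_true, y_pred):
    """Mean Absolute Error: average of |y_true - y_pred| (same units as the target)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))


def mse(y_true, y_pred):
    """Mean Squared Error: average of (y_true - y_pred)^2 (squared units)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))


# Tiny made-up example: the errors are 100, -200 and 300.
y_true = [1000.0, 2000.0, 3000.0]
y_pred = [900.0, 2200.0, 2700.0]

print("MAE:", mae(y_true, y_pred))
print("MSE:", mse(y_true, y_pred))
```

Because MSE squares each error, it is on a much larger scale than MAE and penalises large individual errors more heavily.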

Model Performance on Validation Data

Model                MAE Validation   MSE Validation
Linear Regression    3953.871481      3.070072e+07
Lasso                3960.270761      3.071415e+07
Ridge                3969.455578      3.074930e+07

Model Performance on Test Data

Model                MAE Test         MSE Test
Linear Regression    4346.254246      3.691318e+07
Lasso                4349.066593      3.691007e+07
Ridge                4358.010750      3.694051e+07

As the tables above show, all three models have very similar MSE values on both the validation and test sets, with only slight variations. This suggests that all of them can provide reasonable predictions of medical charges.

The differences in MAE are also small (under 0.5% between the best and worst model on either set), but they are consistent: the Linear Regression model achieves the lowest MAE on both the validation and test sets. It also has the lowest MSE on the validation set, while Lasso has a marginally lower MSE on the test set. Lower MAE and MSE indicate smaller prediction errors and a more accurate representation of the relationship between the selected features and medical charges.

Based on this comparison, the Linear Regression model is considered the most accurate and effective choice for the final prediction.
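
The comparison can be checked directly from the reported figures. The sketch below copies the MAE and MSE values from the two tables and finds the best model per metric, along with the relative gap between the best and worst MAE. The helper names `best_model` and `relative_spread` are illustrative.

```python
# (MAE, MSE) per model, copied from the two tables in this report.
validation = {
    "Linear Regression": (3953.871481, 3.070072e07),
    "Lasso": (3960.270761, 3.071415e07),
    "Ridge": (3969.455578, 3.074930e07),
}
test = {
    "Linear Regression": (4346.254246, 3.691318e07),
    "Lasso": (4349.066593, 3.691007e07),
    "Ridge": (4358.010750, 3.694051e07),
}


def best_model(results, metric_index):
    """Return the model name with the lowest value for the chosen metric."""
    return min(results, key=lambda name: results[name][metric_index])


def relative_spread(results, metric_index):
    """Gap between the worst and best model as a fraction of the best value."""
    values = [scores[metric_index] for scores in results.values()]
    return (max(values) - min(values)) / min(values)


for label, results in [("validation", validation), ("test", test)]:
    print(
        label,
        "| best MAE:", best_model(results, 0),
        "| best MSE:", best_model(results, 1),
        f"| MAE spread: {relative_spread(results, 0):.2%}",
    )
```

Linear Regression has the lowest MAE on both sets. The gap between the best and worst MAE stays below half a percent, and Lasso has a marginally lower MSE on the test set.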

Critical Discussion

 

When the selected Linear Regression model was used to predict on the entire dataset, it printed:



Several decisions and assumptions were made throughout the analysis in order to simplify the steps and make the models easier to understand. These include the encoding of categorical variables and the selection of alpha values for the Lasso and Ridge models.

Categorical Variable Encoding

Ordinal (integer) encoding was used for the categorical variables to preserve essential information. While this choice simplifies the modeling process, it imposes an arbitrary numeric order on region and may not capture complex relationships hidden within the data. One-hot encoding could have been a better alternative, as it often gives better results in more complex models.
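
For comparison, a one-hot encoding of region might look like the sketch below, assuming pandas and lowercase region labels. This is not what the report's models used. Each region gets its own 0/1 column, so a linear model no longer treats the region codes as an ordered quantity.

```python
import pandas as pd

# Region labels are assumed to be lowercase strings.
medical_data = pd.DataFrame(
    {"region": ["southwest", "southeast", "northwest", "northeast"]}
)

# One binary column per region; dtype=int gives 0/1 instead of True/False.
one_hot = pd.get_dummies(medical_data, columns=["region"], dtype=int)
print(one_hot)
```

When the encoded columns feed an unregularised linear model, passing `drop_first=True` to `get_dummies` drops one category and avoids perfectly collinear columns.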

Hyperparameter Selection

LassoCV and RidgeCV were used to select the alpha values. Although the selected values performed well, further hyperparameter tuning could have been done to improve the models' performance.

Data Limitations

The dataset used in the analysis was limited and may not include every factor that affects medical insurance costs. Had it also contained data on pre-existing medical conditions or lifestyle, the results could have been different.

Conclusion

 

In conclusion, this report presented an analysis of medical insurance cost prediction using three regression models: Linear Regression, Lasso Regression, and Ridge Regression. Data examination, model training, and evaluation showed that the Linear Regression model achieved the lowest MAE on both the validation and test sets, with MSE values nearly identical across the three models.

The chosen model showed good potential. However, further improvements can be made.

References

 

The statement about LassoCV and RidgeCV in the Model Training and Hyperparameter Selection section is based on:

sklearn.linear_model.LassoCV — scikit-learn 1.3.2 documentation

sklearn.linear_model.RidgeCV — scikit-learn 1.3.2 documentation

 

 
