This project applies linear regression to a dataset from a European Toyota dealer to model the sale prices of used Toyota Corolla cars. The analysis examines how factors such as age, mileage, and horsepower influence the sale price.
- Project Overview
- Data Description
- Installation
- Exploratory Data Analysis (EDA)
- Visualizations
- Linear Regression Model Development
- Model Comparison and Validation
- ANOVA for Model Selection
- Conclusion
The dataset, UsedCars.csv, consists of various attributes of used Toyota Corolla cars, including:
- Id: Identification number of the car.
- Model: Model name of the car.
- Price: Sales price in Euros.
- Age: Age of the car in months (as of August 2004).
- KM: Accumulated kilometers on the odometer.
- HP: Horsepower of the car.
- Metallic: Whether the car has a metallic color (1 for Yes, 0 for No).
- Automatic: Whether the car has an automatic transmission (1 for Yes, 0 for No).
- CC: Cylinder volume in cubic centimeters.
- Doors: Number of doors.
- Gears: Number of gears.
- Weight: Weight of the car in kilograms.
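As a minimal sketch of loading the data and confirming that the columns listed above are present, assuming `UsedCars.csv` sits in the working directory and uses exactly these column names (the helper name `load_used_cars` is hypothetical, not part of the project's scripts):

```r
# Columns described in the data dictionary above.
expected_cols <- c("Id", "Model", "Price", "Age", "KM", "HP", "Metallic",
                   "Automatic", "CC", "Doors", "Gears", "Weight")

# Hypothetical helper: read the CSV and fail early if a column is missing.
load_used_cars <- function(path = "UsedCars.csv") {
  cars <- read.csv(path, stringsAsFactors = FALSE)
  missing_cols <- setdiff(expected_cols, names(cars))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  cars
}

# Usage (assumes the file is in the working directory):
# cars <- load_used_cars("UsedCars.csv")
# str(cars)
```

Checking the column names up front keeps later model formulas from failing with less obvious errors.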
To replicate this analysis:
- Clone the repository:
  git clone https://github.com/your-username/used-toyota-car-sales-analysis.git
  cd used-toyota-car-sales-analysis
- Install R and required packages.
- Execute the R scripts for analysis.
Initial EDA revealed a negative correlation between Price and KM, suggesting that as KM increases, Price tends to decrease.
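The correlation mentioned above could be checked roughly as follows. This is a sketch that assumes a data frame with `Price` and `KM` columns; the function name `price_km_correlation` is hypothetical.

```r
# Pearson correlation between price and mileage. cor.test() also reports
# a confidence interval and a p-value for the correlation.
price_km_correlation <- function(cars) {
  cor.test(cars$Price, cars$KM, method = "pearson")
}

# Usage:
# result <- price_km_correlation(cars)
# result$estimate   # sample correlation coefficient
# result$conf.int   # 95% confidence interval
```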
- Scatter plot of Price vs. KM: shows the relationship between Price and KM, highlighting the negative trend.
- Residual plot: indicates potential violations of the linear regression assumptions, since the residuals show a pattern rather than random scatter.
- Normal Q-Q plot: assesses the normality of the residuals, showing deviations, particularly in the tails.
- Histogram of residuals: displays the distribution of the residuals, highlighting skewness or other deviations from normality.
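The four diagnostic views described above could be produced along these lines with base R graphics. This is a sketch under the assumption that the plots refer to a simple regression of `Price` on `KM`; the function name and plot titles are illustrative.

```r
# Draw a scatter plot, a residual plot, a normal Q-Q plot, and a residual
# histogram for a simple regression of Price on KM.
plot_price_km_diagnostics <- function(cars) {
  fit <- lm(Price ~ KM, data = cars)
  res <- resid(fit)

  old_par <- par(mfrow = c(2, 2))  # 2 x 2 panel layout
  on.exit(par(old_par))            # restore graphics settings on exit

  # 1. Scatter plot with the fitted regression line
  plot(cars$KM, cars$Price, xlab = "KM", ylab = "Price (EUR)",
       main = "Price vs. KM")
  abline(fit, col = "red")

  # 2. Residuals against fitted values; a visible pattern hints at
  #    non-linearity or non-constant variance
  plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals",
       main = "Residuals vs. fitted")
  abline(h = 0, lty = 2)

  # 3. Normal Q-Q plot; departures from the line in the tails indicate
  #    non-normal residuals
  qqnorm(res)
  qqline(res)

  # 4. Histogram of the residuals
  hist(res, breaks = 30, main = "Histogram of residuals",
       xlab = "Residual")

  invisible(fit)
}

# Usage:
# plot_price_km_diagnostics(cars)
```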
Model 1, a simple linear regression of Price on KM, revealed a moderate negative linear relationship:
- R-squared: 0.3824
- F-statistic: 781.4 on 1 and 1262 DF
- p-value: < 2.2e-16
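A minimal sketch of fitting this simple Price-on-KM regression and pulling out the statistics reported for it, assuming a data frame with `Price` and `KM` columns (the helper names `fit_model1` and `model_stats` are hypothetical):

```r
# Simple linear regression of Price on KM.
fit_model1 <- function(cars) {
  lm(Price ~ KM, data = cars)
}

# Extract R-squared, the F-statistic with its degrees of freedom, and the
# overall p-value from a fitted lm object.
model_stats <- function(fit) {
  s <- summary(fit)
  f <- s$fstatistic  # named vector: value, numdf, dendf
  list(
    r_squared     = s$r.squared,
    adj_r_squared = s$adj.r.squared,
    f_statistic   = unname(f["value"]),
    df            = unname(c(f["numdf"], f["dendf"])),
    p_value       = unname(pf(f["value"], f["numdf"], f["dendf"],
                              lower.tail = FALSE))
  )
}

# Usage:
# model1 <- fit_model1(cars)
# model_stats(model1)
```

In a regression with a single predictor, R-squared equals the squared correlation, so an R-squared of 0.3824 corresponds to a Price-KM correlation of about -0.62.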
Model 2 was introduced to address the heteroscedasticity and non-linearity in Model 1. A Box-Cox transformation suggested using the inverse of Price as the response, which improved the model fit and its adherence to the regression assumptions:
- R-squared: 0.4089
- F-statistic: 873.2 on 1 and 1262 DF
- p-value: < 2.2e-16
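Compared to Model 1, Model 2 showed a higher R-squared (0.4089 versus 0.3824) and better adherence to the regression assumptions. Because the two models use different response scales (Price versus 1/Price), the R-squared values are not strictly comparable, so the improved residual diagnostics are the stronger evidence.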
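The Box-Cox step could look roughly like the sketch below. It assumes the `MASS` package's `boxcox()` function is used (the source does not name the package) and that Model 2 regresses 1/Price on KM; the helper names are hypothetical. A best lambda near -1 corresponds to the inverse transformation.

```r
library(MASS)  # assumption: boxcox() from MASS; not named in the source

# Profile the Box-Cox log-likelihood over a grid of lambda values and
# return the lambda with the highest log-likelihood.
best_boxcox_lambda <- function(cars, lambdas = seq(-2, 2, by = 0.05)) {
  # y = TRUE stores the response in the fit, which boxcox() needs
  fit <- lm(Price ~ KM, data = cars, y = TRUE, qr = TRUE)
  bc <- boxcox(fit, lambda = lambdas, plotit = FALSE)
  bc$x[which.max(bc$y)]
}

# Regression with the inverse of Price as the response.
fit_model2 <- function(cars) {
  lm(I(1 / Price) ~ KM, data = cars)
}

# Usage:
# best_boxcox_lambda(cars)   # a value near -1 supports using 1/Price
# model2 <- fit_model2(cars)
# summary(model2)
```

Note that with 1/Price as the response, the sign of the KM coefficient flips: higher mileage raises 1/Price, which still means a lower price.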
- Full Model:
  - R-squared: 0.8649, Adjusted R-squared: 0.8639
  - F-statistic: 891.8 on 9 and 1254 DF
  - p-value: < 2.2e-16
- Reduced Model:
  - R-squared: 0.8648, Adjusted R-squared: 0.8642
  - F-statistic: 1341 on 6 and 1257 DF
  - p-value: < 2.2e-16
A full model with all nine predictors was compared against a reduced model that keeps only the six significant predictors. The reduced model is simpler and has a marginally higher adjusted R-squared (0.8642 versus 0.8639).
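A nested-model comparison of this kind could be sketched as follows, using `anova()` for the partial F-test. The function name is hypothetical, the response is assumed to be `Price`, and the source does not say which six predictors were kept, so the reduced set in the usage example is a placeholder only.

```r
# Fit a full and a reduced (nested) linear model and compare them with a
# partial F-test and adjusted R-squared.
compare_nested_models <- function(cars, full_terms, reduced_terms,
                                  response = "Price") {
  # The reduced model must use a subset of the full model's predictors
  stopifnot(all(reduced_terms %in% full_terms))

  full    <- lm(reformulate(full_terms, response = response), data = cars)
  reduced <- lm(reformulate(reduced_terms, response = response), data = cars)

  list(
    full          = full,
    reduced       = reduced,
    anova         = anova(reduced, full),  # partial F-test
    adj_r_squared = c(full    = summary(full)$adj.r.squared,
                      reduced = summary(reduced)$adj.r.squared)
  )
}

# Usage (the dropped variables below are a placeholder, not from the source):
# full_terms <- c("Age", "KM", "HP", "Metallic", "Automatic",
#                 "CC", "Doors", "Gears", "Weight")
# reduced_terms <- setdiff(full_terms, c("Metallic", "Doors", "Gears"))
# cmp <- compare_nested_models(cars, full_terms, reduced_terms)
# cmp$anova
# cmp$adj_r_squared
```

A large p-value in the `anova()` table means the dropped predictors do not significantly improve the fit, which supports preferring the reduced model.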
The analysis underscores the importance of data transformation in regression and provides insights into factors affecting used car prices, useful for dealers and buyers alike.