Revisiting the Big Sales Mart Regression Analysis: Introducing the tidymodels Framework
In a recent project, the team at RStudio used the tidymodels collection of packages to develop and evaluate a series of regression models for predicting Item Outlet Sales. This article outlines the key steps in that process.
Data Preparation
The first step was to split the dataset into training and testing sets using the rsample package. The aim was to develop and evaluate a model that could accurately predict Item Outlet Sales from a number of input variables.
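A train/test split of this kind can be sketched with rsample, the tidymodels package that handles resampling. The data frame below is simulated purely for illustration: its values, the `Outlet_Type` column name, and the 75/25 proportion are assumptions, not details taken from the project.

```r
library(rsample)

set.seed(123)  # make the random split reproducible

# Simulated stand-in for the Big Mart data (all values are made up)
n <- 200
mart <- data.frame(
  Item_MRP        = runif(n, 30, 270),
  Item_Visibility = runif(n, 0, 0.3),
  Outlet_Type     = factor(sample(c("Grocery Store", "Supermarket Type1"),
                                  n, replace = TRUE))
)
mart$Item_Outlet_Sales <- 15 * mart$Item_MRP + rnorm(n, sd = 300)

# Hold out a quarter of the rows for testing (proportion is an assumption)
mart_split <- initial_split(mart, prop = 0.75)
mart_train <- training(mart_split)  # rows used to fit models
mart_test  <- testing(mart_split)   # rows reserved for final evaluation
```

Keeping the split object around, rather than only the two data frames, makes it easy to pass the same partition to later steps.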
To prepare the data for modeling, the recipes package was used to specify preprocessing steps such as dummy coding of categorical variables, normalization, feature engineering, and feature elimination. The resulting recipe could then be prepped on the training data and baked on any dataset, ensuring consistent transformations.
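The prep-and-bake pattern described above might look roughly like the following sketch. It uses the recipes package with simulated data; the particular steps chosen (normalization followed by dummy coding) and the column names are assumptions for illustration, and the native pipe requires R 4.1 or later.

```r
library(recipes)

set.seed(123)

# Simulated stand-in data (all values are made up)
make_data <- function(n) {
  d <- data.frame(
    Item_MRP        = runif(n, 30, 270),
    Item_Visibility = runif(n, 0, 0.3),
    Outlet_Type     = factor(sample(c("Grocery Store", "Supermarket Type1"),
                                    n, replace = TRUE))
  )
  d$Item_Outlet_Sales <- 15 * d$Item_MRP + rnorm(n, sd = 300)
  d
}
train <- make_data(150)
test  <- make_data(50)

# Declare the preprocessing; nothing is computed yet
mart_recipe <- recipe(Item_Outlet_Sales ~ ., data = train) |>
  step_normalize(all_numeric_predictors()) |>  # center and scale numeric inputs
  step_dummy(all_nominal_predictors())         # dummy-code categorical inputs

# prep() estimates means, standard deviations, and factor levels from train only
mart_prep <- prep(mart_recipe, training = train)

# bake() applies those stored estimates to any dataset
train_baked <- bake(mart_prep, new_data = NULL)  # the prepped training data
test_baked  <- bake(mart_prep, new_data = test)
```

Because the test set is baked with statistics learned from the training set, no information from the held-out rows leaks into preprocessing.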
Model Specification
Two models were developed for demonstration purposes: linear regression and random forest. The package used for generating the regression models was parsnip. The linear regression model was generated by declaring a model specification and then fitting it. The random forest model, by contrast, was generated in just four lines of code.
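As a rough illustration of the two specifications, the sketch below uses parsnip with simulated data. The engines ("lm" and "ranger"), the number of trees, and the column names are assumptions rather than details from the project, and the random forest fit assumes the ranger package is installed.

```r
library(parsnip)

set.seed(123)

# Simulated stand-in for the training data (all values are made up)
n <- 150
train <- data.frame(
  Item_MRP        = runif(n, 30, 270),
  Item_Visibility = runif(n, 0, 0.3),
  Outlet_Type     = factor(sample(c("Grocery Store", "Supermarket Type1"),
                                  n, replace = TRUE))
)
train$Item_Outlet_Sales <- 15 * train$Item_MRP + rnorm(n, sd = 300)

# Linear regression: declare the specification, then fit it
lm_spec <- linear_reg() |> set_engine("lm")
lm_fit  <- fit(lm_spec, Item_Outlet_Sales ~ ., data = train)

# Random forest: specification and fit in four piped lines
rf_fit <- rand_forest(trees = 200) |>
  set_engine("ranger") |>
  set_mode("regression") |>
  fit(Item_Outlet_Sales ~ ., data = train)

# Both fits share the same predict() interface and return a .pred column
lm_pred <- predict(lm_fit, new_data = train)
rf_pred <- predict(rf_fit, new_data = train)
```

Separating the specification from the fit is what lets the same model definition be reused inside workflows and tuning code.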
Model Training and Tuning
Random forest models tend to overfit, and tuning their hyperparameters with the tune package can improve them. A tuning workflow was built, and a 4-fold cross-validation object was created to evaluate the performance of the tuned model.
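A tuning setup along these lines can be sketched with the workflows, rsample, and tune packages. Only the 4-fold cross-validation comes from the text; the tuned hyperparameters (`mtry` and `min_n`), the small hand-written grid, the ranger engine, and the simulated data are all assumptions.

```r
library(rsample)
library(parsnip)
library(workflows)
library(tune)
library(yardstick)

set.seed(123)

# Simulated stand-in for the training data (all values are made up)
n <- 150
train <- data.frame(
  Item_MRP        = runif(n, 30, 270),
  Item_Visibility = runif(n, 0, 0.3),
  Outlet_Type     = factor(sample(c("Grocery Store", "Supermarket Type1"),
                                  n, replace = TRUE))
)
train$Item_Outlet_Sales <- 15 * train$Item_MRP + rnorm(n, sd = 300)

# Mark the hyperparameters to tune with tune() placeholders
rf_tune_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 200) |>
  set_engine("ranger") |>
  set_mode("regression")

# Bundle the formula and the model into one tuning workflow
rf_workflow <- workflow() |>
  add_formula(Item_Outlet_Sales ~ .) |>
  add_model(rf_tune_spec)

# 4-fold cross-validation object
folds <- vfold_cv(train, v = 4)

# Hypothetical candidate grid: 3 values of mtry x 2 values of min_n
rf_grid <- expand.grid(mtry = 1:3, min_n = c(5, 20))

rf_tuned <- tune_grid(
  rf_workflow,
  resamples = folds,
  grid      = rf_grid,
  metrics   = metric_set(rmse)
)

# Pick the candidate with the lowest cross-validated RMSE
best_params    <- select_best(rf_tuned, metric = "rmse")
final_workflow <- finalize_workflow(rf_workflow, best_params)
```

Each candidate is scored on held-out folds rather than on the data it was fitted to, which is what makes the comparison useful for curbing overfitting.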
Evaluation
The root mean squared error (RMSE) was used as the evaluation metric for the models. The yardstick package was used to create a wrapper function for the specified error metrics, including RMSE, R-squared (R²), and mean absolute error (MAE).
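A metric wrapper of this kind is usually built with `metric_set()` from yardstick. The sketch below applies one to a tiny hand-made table of observed and predicted values; the numbers and the names `results` and `mart_metrics` are invented for illustration.

```r
library(yardstick)

# Hypothetical observed and predicted sales values
results <- data.frame(
  truth    = c(100, 200, 300, 400),
  estimate = c(110, 190, 320, 400)
)

# Combine several metrics into a single callable function
mart_metrics <- metric_set(rmse, rsq, mae)

# Returns one row per metric with columns .metric, .estimator, .estimate
metric_values <- mart_metrics(results, truth = truth, estimate = estimate)
metric_values
```

Here the absolute errors are 10, 10, 20, and 0, so the MAE is 10, and the RMSE is the square root of 150.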
Model Performance
Tuning slightly reduced overfitting in the final random forest model, by 2 RMSE units relative to the training data. The final random forest model was then fitted, and the vip package was used to highlight the top 10 most important features.
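Extracting feature importance from a fitted random forest can be sketched with vip as follows. The ranger engine, the permutation importance setting, and the simulated data are assumptions; the simulated data has only three predictors, so fewer than 10 bars would appear here.

```r
library(parsnip)
library(vip)

set.seed(123)

# Simulated stand-in for the training data (all values are made up)
n <- 150
train <- data.frame(
  Item_MRP        = runif(n, 30, 270),
  Item_Visibility = runif(n, 0, 0.3),
  Outlet_Type     = factor(sample(c("Grocery Store", "Supermarket Type1"),
                                  n, replace = TRUE))
)
train$Item_Outlet_Sales <- 15 * train$Item_MRP + rnorm(n, sd = 300)

# ranger only stores importance scores if asked to at fit time
rf_fit <- rand_forest(trees = 200) |>
  set_engine("ranger", importance = "permutation") |>
  set_mode("regression") |>
  fit(Item_Outlet_Sales ~ ., data = train)

# vi() returns the scores as a table; vip() draws them as a bar chart
importance_table <- vi(rf_fit$fit)
importance_plot  <- vip(rf_fit$fit, num_features = 10)
importance_plot
```

Passing `rf_fit$fit` hands vip the underlying ranger object directly, which keeps the example independent of how vip treats parsnip wrappers.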
Exploratory Data Analysis
Exploratory analysis of the individual variables showed the following: Item Weight has no apparent distribution, Item Fat Content has inconsistent labels, Item Visibility is right-skewed, Item Type has a variety of labels, Item MRP falls into four major groups, Outlet Identifier has 10 unique labels, Outlet Establishment Year is varied, Outlet Size has inconsistent labels, Outlet Location Type has three labels, Outlet Type has four labels, and the target, Item Outlet Sales, is right-skewed.
In relation to the target, Item_Outlet_Sales is spread evenly across the entire range of Item_Weight without any obvious pattern, Item_Visibility shows no relationship with Item_Outlet_Sales, and Item_MRP has a moderate correlation with Item_Outlet_Sales.
Conclusion
The tidymodels framework offers a tidy, modular approach to modeling in R that mirrors many best practices popularized by scikit-learn, with additional emphasis on tidy data principles and declarative preprocessing. For those interested in learning more about tidymodels, the author recommends visiting Julia Silge's website (https://juliasilge.com/) and the Tidy Modeling with R website (https://www.tmwr.org/). The Big Sales Mart dataset used in this project is available through Analytics Vidhya.
In short, the tidymodels packages underpinned every stage of the project: data preprocessing, model specification, training, tuning, evaluation, and feature importance analysis.