Notes on the making of this code/R markdown file, and total run-time are presented at the end of this document. To download the data and the source code, click here.
This project explores historical sales data from 45 Walmart stores spread across different regions, and evaluates different approaches to predicting the future weekly sales for every department in each store. The data was obtained from https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting.
The following are the libraries used in the script and their purposes:

- tidyverse: Data processing, data wrangling, and plots.
- lubridate: Framework for handling dates and times.
- viridis: Color palette.
knitr::opts_chunk$set(echo = TRUE, fig.align='center')
library(knitr)
# Progress bar
library(progress)
# Misc.
library(tidyverse) ## Includes ggplot2, dplyr, tidyr, and lubridate
library(viridis)   ## Color palette (loads viridisLite, see message below)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Loading required package: viridisLite
The training data for this project is stored in 10 separate “folds”. There is a train.csv file in every fold, each containing 5 columns:

- Store
- Dept
- Date
- Weekly_Sales
- IsHoliday
for (fold in 1:10) {
# Construct the path to the files for this fold
fold_folder <- file.path("./Proj2_Data", paste0("fold_", fold))
# Read the training data
train_path <- file.path(fold_folder, "train.csv")
assign(paste0("train_", fold), read.csv(train_path, header = TRUE))
}
These datasets look like the following:
## Store Dept Date Weekly_Sales IsHoliday
## 1 1 1 2010-02-05 24924.50 FALSE
## 2 1 1 2010-02-12 46039.49 TRUE
## 3 1 1 2010-02-19 41595.55 FALSE
## 4 1 1 2010-02-26 19403.54 FALSE
## 5 1 1 2010-03-05 21827.90 FALSE
## 6 1 1 2010-03-12 21043.39 FALSE
Prior to further analysis and transformation of the variables, it is critical to deal with missing values, if any. In this analysis, any incomplete observation is handled by transforming the “NA” into “0”; by doing so, removing observations is avoided. The complete.cases() function can provide insightful information for understanding whether there are any missing observations. With this function, an observation that has an NA in any variable would not be counted as a complete case:
for (fold in 1:10) {
# Construct the path to the files for this fold
train <- get(paste0("train_", fold))
# Count the complete cases in the training data
tmp <- as.matrix(t(table(complete.cases(train))))
cat("Complete.cases for ", paste0("fold_", fold), ": ", tmp, " \n , out of ", dim(train)[1],"\n\n")
}
## Complete.cases for fold_1 : 164115
## , out of 164115
##
## Complete.cases for fold_2 : 190674
## , out of 190674
##
## Complete.cases for fold_3 : 214217
## , out of 214217
##
## Complete.cases for fold_4 : 240603
## , out of 240603
##
## Complete.cases for fold_5 : 267184
## , out of 267184
##
## Complete.cases for fold_6 : 294132
## , out of 294132
##
## Complete.cases for fold_7 : 317928
## , out of 317928
##
## Complete.cases for fold_8 : 344667
## , out of 344667
##
## Complete.cases for fold_9 : 371242
## , out of 371242
##
## Complete.cases for fold_10 : 397841
## , out of 397841
This section explores in more depth how many Weekly_Sales observations there are in the training datasets, per store and department. For this purpose, the initial step taken is to analyse the count of observations per store and department in the training datasets:
for (fold in 1:10) {
# Get the training and testing data per fold
train <- get(paste0("train_", fold))
# Create a dynamic title based on the fold number
title <- paste("Count of Observations per Store and Dept - Training Dataset #", fold)
# Create a heatmap
g <- ggplot(train, aes(x = Dept, y = Store)) +
geom_bin2d() +
scale_x_continuous(breaks = seq(min(train$Dept), max(train$Dept), by = 3)) +
scale_y_continuous(breaks = seq(min(train$Store), max(train$Store), by = 3)) +
scale_fill_viridis_c(direction = 1) +
labs(x = "Department", y = "Store", fill = "Frequency", title = title) +
theme_void() +
theme(
axis.text.x = element_text(angle = 90, hjust = 1, margin = margin(t = 10, r = 0, b = 10, l = 0)),
axis.text = element_text(size = 10),
axis.title.y = element_text(angle = 90, size = 12, margin = margin(t = 0, r = 10, b = 0, l = 10)),
axis.title = element_text(size = 12),
plot.title = element_text(size = 14, hjust = 0.5)
)
# Print the heatmap for this fold
print(g)
}
The test data is stored in the same folds as the training data, with one test.csv in each fold. These can be loaded in a similar way as the training data was loaded:
for (fold in 1:10) {
# Construct the path to the files for this fold
fold_folder <- file.path("./Proj2_Data", paste0("fold_", fold))
# Read the test data
test_path <- file.path(fold_folder, "test.csv")
assign(paste0("test_", fold), read.csv(test_path, header = TRUE))
}
The test datasets contain the same variables as the training datasets, except for the Weekly_Sales variable, which will be the response variable in further analyses. These look like this:
## Store Dept Date IsHoliday
## 1 1 1 2012-03-02 FALSE
## 2 1 1 2012-03-09 FALSE
## 3 1 1 2012-03-16 FALSE
## 4 1 1 2012-03-23 FALSE
## 5 1 1 2012-03-30 FALSE
## 6 1 1 2012-04-06 FALSE
Note that each test set spans the weeks, roughly two months, that follow the corresponding training set’s last recorded date. For instance, when working with fold 1, the last training observation corresponds to:
## [1] "2011-02-25"
and the weeks of the test set correspond to the following weeks:
## [1] "2011-03-04" "2011-03-11" "2011-03-18" "2011-03-25" "2011-04-01"
## [6] "2011-04-08" "2011-04-15" "2011-04-22" "2011-04-29"
starting, of course, on the week right after:
## [1] "2011-03-04"
Conveniently, a dataset that contains the correct labels for the observations in the test sets is available: test_with_label.csv. It is loaded as follows:
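The loading chunk is not echoed in this document; a minimal sketch, assuming the file sits in the same Proj2_Data directory as the folds (the exact path is an assumption):

# Assumed location of the labels file
test_with_label <- read.csv(file.path("./Proj2_Data", "test_with_label.csv"), header = TRUE)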
This file contains the real values of Weekly_Sales for the dates included in the testing data time span, and has the same structure as the testing data (plus the Weekly_Sales column). This information becomes very useful later on to assess the accuracy of the predictions.
Since several approaches are explored for the predictive analysis described in the following section, the details on data manipulation and transformations are discussed in more detail in each approach’s subsection.
To test the accuracy of the predictions of this analysis, the metric used is the weighted mean absolute error, WMAE, in which holiday weeks receive a weight of 5 and all other weeks a weight of 1.
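Formally, with actual sales \(y_i\), predictions \(\hat{y}_i\), and weights \(w_i \in \{1, 5\}\):

\[
\text{WMAE} = \frac{\sum_i w_i \, \lvert y_i - \hat{y}_i \rvert}{\sum_i w_i}
\]

An implementation of a function that calculates this error is presented below: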
myeval = function(predict.name, test_with_label = get("test_with_label", envir = globalenv()), num_folds){
wmae = rep(0, num_folds)
for (i in 1:num_folds) {
# Use already loaded test data frame
test = get(paste0("test_", i))
test = test %>%
select(-IsHoliday) %>%
left_join(test_with_label, by = c('Date', 'Store', 'Dept'))
# Use already loaded predictions data frame
test_pred = get(paste0(predict.name, i))
new_test <- test %>%
left_join(test_pred, by = c('Date', 'Store', 'Dept'))
actuals = new_test$Weekly_Sales
preds = new_test$Weekly_Pred
weights = if_else(new_test$IsHoliday.x, 5, 1)
weighted.sum = weights * abs(actuals - preds)
weighted.sum[is.na(weighted.sum)] <- 0
wmae[i] = sum(weighted.sum) / sum(weights)
}
return(list(wmae = wmae, mean.wmae = mean(wmae)))
}
This function:

- Takes as input the prefix of the prediction data frames, predict.name, the target dataset test_with_label (which purposely defaults to the previously loaded dataframe that contains the real labels), and the number of folds, num_folds.
- Returns as output the wmae per fold and the mean.wmae of all the folds.
As a first approach to the predictive analysis, a naive solution could be using the sales data from the most recent week to forecast all subsequent weekly sales. For instance, when working with fold 1, the predictions for all the weeks in the test set that follow the last recorded week in the train set will have the same value of Weekly_Sales. Based on the nature of this data, a quick processing time would be greatly beneficial. The code below loops through every fold and predicts the Weekly_Sales as per this approach:
for (fold in 1:10) {
# Get the training and testing data per fold
train <- get(paste0("train_", fold))
test <- get(paste0("test_", fold))
##### Simplistic Strategy: #####
# Assigning the latest weekly sales to every combination of store and department as the prediction.
most_recent_date <- max(train$Date)
tmp_train <- train %>%
filter(Date == most_recent_date) %>%
dplyr::rename(Weekly_Pred = Weekly_Sales) %>%
select(-Date, -IsHoliday)
test_pred <- test %>%
left_join(tmp_train, by = c('Dept', 'Store'))
# assign zero to missing predictions
id = is.na(test_pred$Weekly_Pred)
test_pred$Weekly_Pred[id] = 0
assign(paste0("simplistic.pred_", fold), test_pred)
}
The structure of the resulting predictions looks as follows:
## Store Dept Date IsHoliday Weekly_Pred
## 1 1 1 2011-03-04 FALSE 19363.83
## 2 1 1 2011-03-11 FALSE 19363.83
## 3 1 1 2011-03-18 FALSE 19363.83
## 4 1 1 2011-03-25 FALSE 19363.83
## 5 1 1 2011-04-01 FALSE 19363.83
## 6 1 1 2011-04-08 FALSE 19363.83
The accuracy of these results can be evaluated using the previously implemented error function:
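The evaluating chunk is not echoed; presumably it is a call like the following (the prefix matches the data frames assigned above):

# Hypothetical call reproducing the output below
myeval(predict.name = "simplistic.pred_", test_with_label = test_with_label, num_folds = 10)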
## $wmae
## [1] 2078.726 2589.338 2253.936 2823.098 5156.012 4218.348 2269.904 2143.839
## [9] 2221.145 2372.425
##
## $mean.wmae
## [1] 2812.677
Since this is a simplistic way of “predicting” future sales,
more complicated methodologies are likely to give better results.
An interesting idea could be to use the temporal information in the data to fit a regression model. Two variables are derived from Date: Yr, the calendar year, and Wk, a numerical representation of each week of the year, ranging from 1 to 52 (or occasionally 53), obtained with the week function from the lubridate package. The intuition is to predict using the sales of the corresponding week of the previous year; for example, when predicting sales for Week 20 of 2011, use the data from Week 20 of 2010.
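As a quick check of this alignment, note that lubridate’s week() assigns matching indices to the same retail week across years once the 2010 weeks are shifted down by one, as the implementation below does; the two dates used here are the Super Bowl holiday weeks from the data:

week(as.Date("2010-02-12")) - 1  # 6: shifted 2010 week index
week(as.Date("2011-02-11"))      # 6: the same retail week one year later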
Considering these two variables as predictors, a linear regression model characterized as Y ~ Yr + Wk is fitted for every pair of store and department. Doing this would likely result in many linear models being fitted, approximately:
length(unique(train_10$Store)) * length(unique(train_10$Dept))*10 # multiplied by 10 to account for the 10 folds
## [1] 36450
, a task that would likely be computationally very expensive, so it is imperative to do this process in a time-economical way. The main idea behind the implementation is that not all department/store pairs need a prediction: departments not listed in the test sets and stores without data can be skipped. Such an efficient implementation is presented below:
# Progress
pb <- progress_bar$new(total = 10, format = "[:bar] [:percent] [Elapsed time: :elapsedfull] [ETA: :eta]", force = TRUE)
for (fold in 1:10) {
# Get the training and testing data per fold
train <- get(paste0("train_", fold))
test <- get(paste0("test_", fold))
# find the unique pairs of (Store, Dept) combo that appeared in both training and test sets
train_pairs <- train[, 1:2] %>% count(Store, Dept) %>% filter(n != 0)
test_pairs <- test[, 1:2] %>% count(Store, Dept) %>% filter(n != 0)
unique_pairs <- intersect(train_pairs[, 1:2], test_pairs[, 1:2])
# pick out the needed training samples, convert to dummy coding, then put them into a list
train_split <- unique_pairs %>%
left_join(train, by = c('Store', 'Dept')) %>%
mutate(Wk = factor(ifelse(year(Date) == 2010, week(Date) - 1, week(Date)), levels = 1:52)) %>%
mutate(Yr = year(Date))
test_split <- unique_pairs %>%
left_join(test, by = c('Store', 'Dept')) %>%
mutate(Wk = factor(ifelse(year(Date) == 2010, week(Date) - 1, week(Date)), levels = 1:52)) %>%
mutate(Yr = year(Date))
# construct the design matrix only once
train_split = as_tibble(model.matrix(~ Weekly_Sales + Store + Dept + Yr + Wk, train_split)) %>%
group_split(Store, Dept)
test_split = as_tibble(model.matrix(~ Store + Dept + Yr + Wk, test_split)) %>%
mutate(Date = test_split$Date) %>%
group_split(Store, Dept)
# pre-allocate a list to store the predictions
test_pred <- vector(mode = "list", length = nrow(unique_pairs))
# perform regression for each split, note we used lm.fit instead of lm
for (i in 1:nrow(unique_pairs)) {
# Training Data
tmp_train <- train_split[[i]]
x.train <- as.matrix(tmp_train[, -(2:4)])
x.train <- as.matrix( cbind(x.train[,c(1:2), drop = FALSE],
"Yr2" = (x.train[,c(2), drop = FALSE])^2,
x.train[,-c(1, 2), drop = FALSE]) ) # inserts a Yr^2 column right after the Yr column
y.train <- tmp_train$Weekly_Sales
# Testing Data
tmp_test <- test_split[[i]]
x.test <- as.matrix(tmp_test[, 4:55])
x.test <- as.matrix( cbind("Yr" = x.test[,c(1), drop = FALSE],
"Yr2" = (x.test[,c(1), drop = FALSE])^2,
x.test[,-c(1), drop = FALSE])) # inserts a Yr^2 column right after the Yr column
# Model
mycoef <- lm.fit(x.train, y.train)$coefficients
mycoef[is.na(mycoef)] <- 0
# Prediction
tmp_pred <- mycoef[1] + x.test %*% mycoef[-1]
tmp_pred[is.na(tmp_pred)] <- 0
tmp_pred.df <- data.frame( Store = unique(tmp_test$Store),
Dept = unique(tmp_test$Dept),
Date = tmp_test$Date,
IsHoliday = test[test$Store == unique(tmp_test$Store) & test$Dept == unique(tmp_test$Dept),]$IsHoliday,
Weekly_Pred = tmp_pred)
test_pred[[i]] <- tmp_pred.df
}
# This combines the results of all the linear models as dataframes already into a big dataframe for the fold.
assign(paste0("test.pred_", fold), bind_rows(test_pred))
# Update the progress bar
pb$tick()
}
## [=================================] [100%] [Elapsed time: 00:03:09] [ETA: 0s]
Just like before, the accuracy can be evaluated using the implemented WMAE function:
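Again the evaluating chunk is not echoed; presumably:

# Hypothetical call (prefix matches the data frames assigned above)
myeval(predict.name = "test.pred_", test_with_label = test_with_label, num_folds = 10)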
## $wmae
## [1] 2045.088 1466.872 1449.774 1593.858 2321.792 1677.478 1646.047 1365.176
## [9] 1358.769 1345.174
##
## $mean.wmae
## [1] 1627.003
When compared to the first approach, these models take a few minutes to run, but yield significantly better predictions in terms of the defined error metric. Furthermore, the same analysis performed with a slightly different model, Y ~ Yr + Yr^2 + Wk, where the quadratic term looks to leverage the ‘year’ information even further, produces slightly better results. But it turns out that it is possible to improve the results even further.
Starting from recommendations from previous works on this data (Prof. Feng Liang), and through trial and error, it was determined that the performance of the predictive models improved significantly when working with a ‘smoothed/denoised’ version of the training data. To achieve this, Singular Value Decomposition (SVD) is performed using the svd() function included in base R. The idea behind this is that the dominant shared Principal Components (PCs) likely represent meaningful signals, whereas PCs with minimal variances are mostly noise. Thus, the original “noisy” dataset \({X}\) is set aside, and the predictive analysis is performed with the de-noised dataset \(\tilde{X}\). In this analysis, the number of PCs kept is set to d = 8.
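Concretely, for each department, the code below arranges the data as a Store × Week matrix \({X}\), centers each row by its store mean \(\bar{x}\), computes the SVD of the centered matrix \(X - \bar{x} = U D V^{\top}\), and reconstructs the smoothed data from the leading components:

\[
\tilde{X} = U_d D_d V_d^{\top} + \bar{x}
\]

where \(U_d\), \(D_d\), and \(V_d\) retain only the first \(d = 8\) singular vectors and values.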
# Progress
pb <- progress_bar$new(total = 10, format = "[:bar] [:percent] [Elapsed time: :elapsedfull] [ETA: :eta]", force = TRUE)
for (fold in 1:10) {
# Get the training and testing data per fold
train <- get(paste0("train_", fold))
train_smoothed <- data.frame()
d <- 8 # Specify the number of components to keep
existing.dept = unique(train$Dept)
for (department in existing.dept) {
train_dep <- train[ which(train$Dept== department ), ]
if (length(unique(train_dep$Store))<2) {
# Too few stores to smooth with SVD; keep this department's data as-is
data_frame_temp <- data.frame(Store = train_dep$Store, Dept=train_dep$Dept, Date=train_dep$Date, Weekly_Sales = train_dep$Weekly_Sales)
}
else{
train_temp2 <- train_dep %>%
select(Store, Date, Weekly_Sales) %>%
spread(Store, Weekly_Sales)
train_temp2[is.na(train_temp2)] <- 0
train_temp3 <- t(train_temp2[, -1])
# To center the dataset
store.mean <- rowMeans(train_temp3)
train_temp4 <- (train_temp3)-store.mean
# SVD
svd_result <- svd(train_temp4)
U <- svd_result$u
D <- diag(svd_result$d)
V <- svd_result$v
min.com = ncol(U)
X_reduced_temp <- (U[,1:min(min.com,d)] %*% D[1:min(min.com,d),1:min(min.com,d)] %*% t(V[,1:min(min.com,d)]))
X_reduced <- X_reduced_temp + matrix(rep(store.mean, times = dim(X_reduced_temp)[2]), nrow = length(store.mean), ncol = dim(X_reduced_temp)[2])
X_reduced_df<- as.data.frame(t(X_reduced))
names(X_reduced_df) <- names(train_temp2)[-1]
X_reduced_df$Date <- train_temp2$Date
data_frame_temp <- gather(X_reduced_df, key = "Store", value = "Weekly_Sales", -Date)
data_frame_temp$Dept <-rep(department, times = dim(data_frame_temp)[1])
desired_order <- c("Store", "Dept", "Date","Weekly_Sales")
data_frame_temp <- data_frame_temp[, desired_order]
data_frame_temp$Store<- as.numeric(data_frame_temp$Store)
}
train_smoothed <- rbind(train_smoothed, data_frame_temp)
}
assign(paste0("train.smooth_", fold), train_smoothed)
# Update the progress bar
pb$tick()
}
## [=================================] [100%] [Elapsed time: 00:00:22] [ETA: 0s]
The resulting reconstructed and denoised train.smooth
dataframes hold the same structure as the original training data:
## Store Dept Date Weekly_Sales
## 1 1 1 2010-02-05 25385.79
## 2 1 1 2010-02-12 46887.33
## 3 1 1 2010-02-19 38562.33
## 4 1 1 2010-02-26 20160.12
## 5 1 1 2010-03-05 21865.19
## 6 1 1 2010-03-12 21068.78
Now, for this predictive modeling, the training and testing data undergo the same transformations required for the previous approach (4.2.):

- For each store and each department within that store, a list of the unique pairs of departments and stores that are present in both the training and testing set is built.
- Using the lubridate library, a new variable Wk (week) replaces the Date variable. This makes working with the dates easier and more systematic.
- For each unique pair of store and department, using the variable Yr and its square Yr^2 as numerical predictors and Wk as a categorical predictor, a linear regression model characterized as Y ~ Yr + Yr^2 + Wk is fitted using the lm.fit() function. With such models, the predictions are made for the corresponding test dataset.
The following code loops through all the folds, handles the described
transformations, fits a multiple linear regression model to predict the
Weekly_Sales
for the weeks of the testing datasets, and
stores the predictions in a test.pred.smooth_i dataframe:
# Progress
pb <- progress_bar$new(total = 10, format = "[:bar] [:percent] [Elapsed time: :elapsedfull] [ETA: :eta]", force = TRUE)
for (fold in 1:10) {
# Get the training and testing data per fold
train <- get(paste0("train.smooth_", fold))
test <- get(paste0("test_", fold))
# find the unique pairs of (Store, Dept) combo that appeared in both training and test sets
train_pairs <- train[, 1:2] %>% count(Store, Dept) %>% filter(n != 0)
test_pairs <- test[, 1:2] %>% count(Store, Dept) %>% filter(n != 0)
unique_pairs <- intersect(train_pairs[, 1:2], test_pairs[, 1:2])
# pick out the needed training samples, convert to dummy coding, then put them into a list
train_split <- unique_pairs %>%
left_join(train, by = c('Store', 'Dept')) %>%
mutate(Wk = factor(ifelse(year(Date) == 2010, week(Date) - 1, week(Date)), levels = 1:52)) %>%
mutate(Yr = year(Date))
test_split <- unique_pairs %>%
left_join(test, by = c('Store', 'Dept')) %>%
mutate(Wk = factor(ifelse(year(Date) == 2010, week(Date) - 1, week(Date)), levels = 1:52)) %>%
mutate(Yr = year(Date))
# construct the design matrix only once
train_split = as_tibble(model.matrix(~ Weekly_Sales + Store + Dept + Yr + Wk, train_split)) %>%
group_split(Store, Dept)
test_split = as_tibble(model.matrix(~ Store + Dept + Yr + Wk, test_split)) %>%
mutate(Date = test_split$Date) %>%
group_split(Store, Dept)
# pre-allocate a list to store the predictions
test_pred <- vector(mode = "list", length = nrow(unique_pairs))
# Perform regression for each split, note we used lm.fit instead of lm
for (i in 1:nrow(unique_pairs)) {
# Training Data
tmp_train <- train_split[[i]]
x.train <- as.matrix(tmp_train[, -(2:4)])
x.train <- as.matrix( cbind(x.train[,c(1:2), drop = FALSE],"Yr2" = (x.train[,c(2), drop = FALSE])^2, x.train[,-c(1, 2), drop = FALSE]) ) # inserts a Yr^2 column right after the Yr column
y.train <- tmp_train$Weekly_Sales
# Testing Data
tmp_test <- test_split[[i]]
x.test <- as.matrix(tmp_test[, 4:55])
x.test <- as.matrix( cbind("Yr" = x.test[,c(1), drop = FALSE],"Yr2" = (x.test[,c(1), drop = FALSE])^2, x.test[,-c(1), drop = FALSE])) # inserts a Yr^2 column right after the Yr column
# Model
mycoef <- lm.fit(x.train, y.train)$coefficients
mycoef[is.na(mycoef)] <- 0
tmp_pred <- mycoef[1] + x.test %*% mycoef[-1]
tmp_pred[is.na(tmp_pred)] <- 0
tmp_pred.df <- data.frame(
Store = unique(tmp_test$Store),
Dept = unique(tmp_test$Dept),
Date = tmp_test$Date,
IsHoliday = test[test$Store == unique(tmp_test$Store) & test$Dept == unique(tmp_test$Dept),]$IsHoliday,
Weekly_Pred = tmp_pred)
test_pred[[i]] <- tmp_pred.df
}
# This combines the results of all the linear models as dataframes already into a big dataframe for the fold.
assign(paste0("test.pred.smooth_", fold), bind_rows(test_pred))
# Update the progress bar
pb$tick()
}
## [=================================] [100%] [Elapsed time: 00:03:14] [ETA: 0s]
The accuracy of the predictions obtained with this methodology can be evaluated using the WMAE function implemented earlier:
## $wmae
## [1] 1938.903 1360.881 1377.585 1523.277 2304.853 1632.857 1611.032 1353.578
## [9] 1334.890 1330.739
##
## $mean.wmae
## [1] 1576.86
This approach takes roughly the same amount of time as 4.2. to perform the analyses and yields improved performance. Such improvements lead to results comparable to the benchmark discussed in the Kaggle competition from which this data was obtained.
The predicted sales can be visualized against the real values for each fold and the time span that fold corresponds to:
for (fold in 1:10) {
# Get the training and testing data per fold
target <- get(paste0("test_", fold))
test.pred <- get(paste0("test.pred.smooth_", fold))
#Target/labels
target_ = target %>%
select(-IsHoliday) %>%
left_join(test_with_label, by = c('Date', 'Store', 'Dept'))
target_ = target_ %>%
group_by(Date) %>% # Group by week
summarize(Target = mean(Weekly_Sales))
#Predictions
predct_ = test.pred %>%
group_by(Date) %>% # Group by week
summarize(Prediction = mean(Weekly_Pred))
# Combine the 'target_' and 'predct_' datasets into one
test.vs.pred <- cbind(
"Date" = target_$Date,
"Target" = target_[,2, drop = FALSE],
"Prediction" = predct_[,2, drop = FALSE])
# Dynamic title based on the fold number
title <- paste("Weekly Sales - Target vs. Prediction, Fold #", fold)
p <- ggplot(test.vs.pred, aes(x = as.Date(Date))) +
geom_line(aes(y = Target, color = "Target"), linetype = "solid") +
geom_point(aes(y = Target, color = "Target")) +
geom_line(aes(y = Prediction, color = "Prediction"), linetype = "solid") +
geom_point(aes(y = Prediction, color = "Prediction")) +
scale_color_manual(
# named values map each series to its color; explicit labels are unnecessary
# (and, with alphabetically ordered breaks, would swap the legend names)
values = c("Target" = "#440154FF", "Prediction" = "#55C667FF")
) +
labs(
x = "Date",
y = "Weekly Sales",
title = title,
color = "Data Source" # This will be the title of the legend for color
) +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1),
axis.text = element_text(size = 10),
axis.title = element_text(size = 12, face = "bold"),
plot.title = element_text(size = 12, face = "bold", hjust = 0.5)
)
print(p)
}
During this project, several other methodologies were implemented. Among the explored ideas were (a.) including the holiday indicator IsHoliday as a categorical variable in the regression, and (c.) robust regression with the L1 norm, minimizing the objective below with optim() and using the least-squares (L2) coefficients as the starting point:

my.L1 <- function(params) {
# params is a vector of coefficients (intercept first)
# ** fitted values: y_hat = intercept + X %*% beta **
y_hat = params[1] + x.train[,-1] %*% params[-1]
# ** residuals **
res = y_hat - y.train
# ** L1 norm of the residuals **
return(norm(res, type = "1"))
}
##### Robust Regression with L1 norm
## L2 coeffs used as a starting point. Updated with:
mycoef <- optim(mycoef, my.L1)$par
These approaches came at a significant computational price (especially a. and c., given the nature of the data and the models, and the optimization steps), but no significant improvements were found. A more extensive, in-depth analysis may result in interesting findings on these approaches (or any other relevant techniques, for that matter) for this problem.
Notes:
The total runtime of this R code is:
## Time difference of 7.609353 mins
The presented code/R markdown document builds on material from, and work done for, the course STAT 542: Statistical Learning.