Time series forecasting is a classical problem: given historical data, predict the values at future timesteps.
The M5 Forecasting competition on Kaggle provides an excellent dataset for time series forecasting. The ultimate goal of the competition is to predict the sales of a variety of products. Several data files are included:
- sales_train_validation.csv. Contains a list of items along with their respective category, store location, and daily unit sales for 1,913 days (a little over 5 years).
- sell_prices.csv. Contains the price of each item in each store, recorded by week.
- calendar.csv. Used to convert calendar dates to the day labels the competition uses. Also includes major holidays and SNAP (food stamp) days.
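To make the relationship between the files concrete, here is a minimal sketch of how the wide sales table could be reshaped to one row per item per day and joined to the calendar. It uses tiny made-up frames instead of the real CSVs; the column names are modeled on the competition files, and the variable names (`long`, `daily_total`, `by_state`) are my own for illustration.

```python
import pandas as pd

# Tiny synthetic stand-ins for sales_train_validation.csv and calendar.csv.
# Values are made up; only the layout mirrors the competition files.
sales = pd.DataFrame({
    "id": ["FOODS_1_001_CA_1_validation", "HOBBIES_1_001_TX_1_validation"],
    "item_id": ["FOODS_1_001", "HOBBIES_1_001"],
    "cat_id": ["FOODS", "HOBBIES"],
    "store_id": ["CA_1", "TX_1"],
    "state_id": ["CA", "TX"],
    "d_1": [3, 0],
    "d_2": [1, 2],
    "d_3": [4, 1],
})
calendar = pd.DataFrame({
    "d": ["d_1", "d_2", "d_3"],
    "date": pd.to_datetime(["2011-01-29", "2011-01-30", "2011-01-31"]),
})

# Wide -> long: one row per (item, day) instead of one column per day.
id_cols = ["id", "item_id", "cat_id", "store_id", "state_id"]
long = sales.melt(id_vars=id_cols, var_name="d", value_name="sales")

# Attach real calendar dates to the competition's day labels.
long = long.merge(calendar, on="d", how="left")

# Aggregations of the kind used for exploratory plots.
daily_total = long.groupby("date")["sales"].sum()
by_state = long.groupby("state_id")["sales"].sum()
```

With the real files, the two frames would come from `pd.read_csv` instead, and the same grouping works for category or store.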
I generated several plots with Plotly to explore the data.
This plot shows the total number of sales on each day. Sales increased over time and were higher in the summer.
Most of the sales came from California, while sales in Texas and Wisconsin were roughly equal.
Most of the sales were food items, which depend much more on the season than hobby or household items.
This is a histogram of how much the prices of items varied.
This graph shows the yearly percent increase in item prices alongside USD inflation. The increase in item prices generally followed inflation, as expected, since retail prices tend to rise with inflation of the US dollar.
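For reference, a yearly percent increase can be computed from the average price in each year. The sketch below shows the arithmetic only; the function name and the prices are hypothetical and not taken from the dataset.

```python
def yearly_percent_change(mean_price_by_year):
    """Return {year: percent change in mean price versus the previous year}."""
    years = sorted(mean_price_by_year)
    changes = {}
    for prev, curr in zip(years, years[1:]):
        old, new = mean_price_by_year[prev], mean_price_by_year[curr]
        changes[curr] = 100.0 * (new - old) / old
    return changes


# Made-up average prices, for illustration only.
mean_price = {2012: 4.00, 2013: 4.10, 2014: 4.20}
changes = yearly_percent_change(mean_price)

for year, pct in changes.items():
    print(f"{year}: {pct:.2f}%")
```

The resulting series can then be plotted next to a published inflation rate for the same years.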
To predict the sales figures, I used an LSTM network in PyTorch.

```python
import torch.nn as nn


class M5LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        # Maps each timestep's hidden state to a single sales prediction.
        self.fc = nn.Linear(hidden_dim, 1)

    # seqs shape [batch_size, timesteps, features]
    # out shape [batch_size * timesteps, 1]
    def forward(self, seqs, hidden):
        out, hidden = self.lstm(seqs, hidden)
        # Flatten batch and time so the linear layer runs on every timestep.
        out = out.reshape(-1, self.hidden_dim)
        out = self.fc(out)
        return out, hidden

    def init_hidden(self, batch_size):
        # Zero-filled (h_0, c_0) with the same dtype and device as the model weights.
        weight = next(self.parameters()).data
        hidden = (weight.new(self.num_layers, batch_size, self.hidden_dim).zero_(),
                  weight.new(self.num_layers, batch_size, self.hidden_dim).zero_())
        return hidden
```
I passed in the number of sales along with other features such as the day of the week, and the model predicts the number of sales for each future day.
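As a rough illustration of that setup, the sketch below builds an input tensor from sales plus a one-hot day of week and runs one training step in which the target at each timestep is the next day's sales. The model class is repeated so the block runs on its own. Everything else is an assumption on my part for illustration: the `make_features` helper, the synthetic data, the sizes, the MSE loss, and the Adam optimizer are not necessarily what the original code uses.

```python
import torch
import torch.nn as nn


class M5LSTM(nn.Module):
    # Same architecture as the model above, repeated so the sketch is self-contained.
    def __init__(self, input_dim, hidden_dim, num_layers):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    def forward(self, seqs, hidden):
        out, hidden = self.lstm(seqs, hidden)
        out = out.reshape(-1, self.hidden_dim)
        return self.fc(out), hidden

    def init_hidden(self, batch_size):
        weight = next(self.parameters())
        shape = (self.num_layers, batch_size, self.hidden_dim)
        return (weight.new_zeros(shape), weight.new_zeros(shape))


def make_features(sales, day_of_week):
    """Hypothetical helper: stack sales and one-hot day of week into [batch, timesteps, 8]."""
    dow = nn.functional.one_hot(day_of_week, num_classes=7).float()
    return torch.cat([sales.unsqueeze(-1), dow], dim=-1)


torch.manual_seed(0)
batch_size, timesteps = 4, 28  # hypothetical sizes

# Synthetic daily sales counts and a repeating day-of-week index (0..6).
sales = torch.randint(0, 10, (batch_size, timesteps + 1)).float()
day_of_week = (torch.arange(timesteps + 1) % 7).repeat(batch_size, 1)

# Inputs cover days 0..T-1; each target is the sales one day later (days 1..T).
inputs = make_features(sales[:, :-1], day_of_week[:, :-1])
targets = sales[:, 1:].reshape(-1, 1)  # matches the model's flattened output

model = M5LSTM(input_dim=inputs.shape[-1], hidden_dim=32, num_layers=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer
criterion = nn.MSELoss()                                   # assumed loss

# One training step.
hidden = model.init_hidden(batch_size)
preds, hidden = model(inputs, hidden)
loss = criterion(preds, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because `forward` flattens batch and time into one dimension, the targets have to be flattened in the same row-major order, which `reshape(-1, 1)` on a `[batch, timesteps]` tensor does.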
Training the network gave me these results:
Loss decreased to about 1.80 over 20 epochs. This is an example output of the model:
Although the model does not capture sales peaks well, its predictions generally followed the true sales. The code is available on GitHub.