Time series forecasting is a classical problem in which you predict the values at future timesteps, given historical data.

The M5 Forecasting competition on Kaggle provides an excellent dataset for time series forecasting. The ultimate goal of the competition is to predict the sales of a variety of products. Several data files are included (a quick loading sketch follows this list):

  1. sales_train_validation.csv. Contains a list of items along with their respective category, store location, and daily sales over more than three years.
  2. sell_prices.csv. Contains the weekly price of each item at each store.
  3. calendar.csv. Maps the competition's day labels to real calendar dates. Also marks major holidays and "snap" events (food stamps).
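
To get oriented, the three files can be read with pandas. This is a minimal sketch, assuming the files sit in the working directory:

import pandas as pd

# Wide table: one row per item/store pair, one column per day (d_1, d_2, ...)
sales = pd.read_csv("sales_train_validation.csv")

# Weekly prices keyed by store, item, and the Walmart week id (wm_yr_wk)
prices = pd.read_csv("sell_prices.csv")

# Maps the d_* day labels to real dates, holidays, and SNAP days
calendar = pd.read_csv("calendar.csv")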

Data Exploration

I generated various plots with plot.ly to explore the data.

This plots the total number of sales on each day. The number of sales increased over time, with a greater number of sales during the summer season.
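
A plot like this can be produced by summing the day columns of the sales table. A sketch, assuming the wide d_1, d_2, ... layout described above:

import plotly.express as px

# Sum unit sales across all items and stores for each day column
day_cols = [c for c in sales.columns if c.startswith("d_")]
daily_totals = sales[day_cols].sum()

# Attach real calendar dates so the x-axis is readable
dates = calendar.set_index("d").loc[day_cols, "date"]
fig = px.line(x=dates, y=daily_totals.values,
              labels={"x": "date", "y": "total units sold"})
fig.show()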

Most of the sales came from California, while the numbers of sales from Texas and Wisconsin were roughly the same.

Most of the sales were food items, which are much more dependent on season than hobby or household items.
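
Both of these breakdowns fall out of simple group-bys over the same table, using its state_id and cat_id columns:

# Total units sold per state and per category over the whole history
state_totals = sales.groupby("state_id")[day_cols].sum().sum(axis=1)
cat_totals = sales.groupby("cat_id")[day_cols].sum().sum(axis=1)

print(state_totals)  # CA dominates; TX and WI are roughly even
print(cat_totals)    # FOODS is the largest category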

This histogram shows how much the prices of individual items varied.
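
One way to quantify that variation (the post doesn't specify the exact statistic, so the coefficient of variation here is an assumption) is the per-item price spread from sell_prices.csv:

# Coefficient of variation of each item's weekly price at each store
cv = (prices.groupby(["store_id", "item_id"])["sell_price"]
      .agg(lambda s: s.std() / s.mean()))

fig = px.histogram(x=cv, labels={"x": "price coefficient of variation"})
fig.show()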

This plot shows the yearly percent increase in item prices alongside USD inflation. The increase in item prices generally tracked USD inflation, as expected, since the prices of items are tied to the inflation of the US dollar.
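
The price side of that comparison can be computed from the weekly prices; the inflation series itself would come from an external CPI source. A sketch:

# Average sell price per calendar year, via the calendar's wm_yr_wk -> year mapping
week_year = calendar.drop_duplicates("wm_yr_wk")[["wm_yr_wk", "year"]]
yearly = (prices.merge(week_year, on="wm_yr_wk")
          .groupby("year")["sell_price"].mean())

# Year-over-year percent increase, to plot against published US inflation figures
print(yearly.pct_change() * 100)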


The Model

To predict the sales figures, I used an LSTM network in PyTorch.

import torch
import torch.nn as nn

class M5LSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers

        # batch_first=True expects inputs shaped [batch_size, timesteps, features]
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, 1)

    # seqs shape [batch_size, timesteps, features]
    # out shape [batch_size * timesteps, 1]
    def forward(self, seqs, hidden):
        out, hidden = self.lstm(seqs, hidden)
        # Flatten all timesteps so the linear layer maps each one to a prediction
        out = out.reshape(-1, self.hidden_dim)
        out = self.fc(out)

        return out, hidden

    def init_hidden(self, batch_size):
        # Create zeroed hidden and cell states on the same device/dtype as the weights
        weight = next(self.parameters()).data

        hidden = (weight.new(self.num_layers, batch_size, self.hidden_dim).zero_(),
                  weight.new(self.num_layers, batch_size, self.hidden_dim).zero_())

        return hidden

An LSTM model

I passed in the daily sales counts along with other features like the day of the week, and the model predicts the number of sales for each day in the future.
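
The post doesn't show how the training sequences were assembled, but a sliding-window construction along the following lines is typical. The window length, the two-feature setup, and the array names sales_series and day_of_week are all assumptions for illustration:

import numpy as np

def make_sequences(sales_series, day_of_week, window=28):
    # Each input holds `window` days of [sales, day_of_week] features;
    # the target is the next day's sales at every timestep, matching the
    # per-timestep output of M5LSTM.forward.
    xs, ys = [], []
    for i in range(len(sales_series) - window - 1):
        features = np.stack([sales_series[i:i + window],
                             day_of_week[i:i + window]], axis=-1)
        xs.append(features)
        ys.append(sales_series[i + 1:i + window + 1])
    return np.array(xs, dtype=np.float32), np.array(ys, dtype=np.float32)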

Training the network gave me these results:

Loss decreased to about 1.80 over 20 epochs. This is an example output of the model:

Although sales peaks are not represented well, the model generally followed the true sales amounts. The code is available on GitHub.
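
For reference, a minimal training loop consistent with the model above might look like the following. The MSE loss, Adam optimizer, hyperparameters, and full-batch setup are assumptions rather than details from the original code:

import torch

model = M5LSTM(input_dim=2, hidden_dim=64, num_layers=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.MSELoss()

# Hypothetical feature arrays built by the make_sequences sketch above
xs, ys = make_sequences(sales_series, day_of_week)
xs, ys = torch.from_numpy(xs), torch.from_numpy(ys)

for epoch in range(20):
    hidden = model.init_hidden(batch_size=xs.shape[0])
    optimizer.zero_grad()
    out, hidden = model(xs, hidden)
    # out is [batch * timesteps, 1]; flatten the targets to match
    loss = criterion(out.squeeze(-1), ys.reshape(-1))
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")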


UPDATE: The competition has finished, and you can view a synopsis of the first place solution here. They used the LightGBM gradient boosting framework along with a clever loss metric and feature selection.