Predicting bakery sales with machine learning in Apache Spark™

Who doesn’t like a nice warm mug of hot chocolate, especially on a cool winter’s day? We at Xerini certainly do, and in this article we explore the prognostic powers of Decision Trees to predict how many hot chocolates are sold by an Edinburgh-based bakery in a given hour on a given day.

While perusing the various datasets available on Kaggle.com, we happened across an interesting set of data from a bakery based in Edinburgh called “The Bread Basket”. Sadly, this dataset is no longer available, but it contained a list of all transactions from the bakery between 30th October 2016 and 9th April 2017. An example of the dataset is as follows:

Date        Time      Transaction  Item
2016-10-30  09:58:11  1            Bread
2016-10-30  10:05:34  2            Scandinavian
2016-10-30  10:05:34  2            Scandinavian
2016-10-30  10:07:57  3            Chocolate
2016-10-30  10:07:57  3            Jam
2016-10-30  10:07:57  3            Cookies
2016-10-30  10:08:41  4            Muffin

The original purpose of the dataset was to drive development of Apriori models. Apriori algorithms are a form of association rule learning and are useful for building up suggestions of other products people may want to buy given what’s already in a basket, much like the “Frequently bought together” feature on Amazon.

While this is useful for online stores, you might question how effectively it can be applied to a bricks-and-mortar retail store. For a bakery, we thought it would be far more useful to see if we could predict the type and volume of products sold at a given time on a given day. This could have real business implications by reducing costs incurred through wastage, thereby improving profits.

We were initially sceptical that the original dataset would contain enough information to accurately predict the quantities of products sold, so we decided to combine the original dataset with weather data sourced from Edinburgh airport for the same period. An example of the weather data we recorded is as follows:

Date        Time   Temperature  Wind  Speed  Condition
2016-10-31  17:20  50           WSW   6      Fair
2016-10-31  17:50  50           WSW   6      Fair
2016-10-31  18:20  52           WSW   2      Partly Cloudy
2016-10-31  18:50  52           WSW   5      Mostly Cloudy
2016-10-31  19:20  52           WSW   5      Light Rain

In order to improve training performance, we decided to initially limit the prediction to one item type, Hot Chocolate, with a view to extending the scope later should the algorithm prove successful. Hot Chocolate was picked first as we felt this product type lends itself to effective prediction given that the dataset now includes temperature and weather data.

The first task is to pre-process our weather and bakery data into a single dataset ready to be fed into our training model. The weather data contains two data points per hour, one at twenty past and another at fifty past the hour. We only need one entry per hour, so we filter the data, keeping the reading taken at twenty past each hour. The weather condition data also contains some errors: there are some erroneous entries for foggy days, and some of the conditions are a bit too granular. We therefore filter and simplify the data for the condition attribute.
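By way of illustration, here is a minimal sketch of how this clean-up could be expressed in plain Scala; the case class and the specific condition mappings shown are illustrative rather than the exact ones we used:

// Illustrative representation of a raw weather reading (field names assumed).
case class WeatherRow(date: String, time: String, temp: Int, wind: String, speed: Int, condition: String)

// Collapse overly granular (or erroneous) condition labels into a simpler set.
// The mappings below are examples only.
val simplifyCondition: String => String = Map(
  "Patches of Fog"         -> "Fog",
  "Shallow Fog"            -> "Fog",
  "Light Freezing Drizzle" -> "Light Drizzle"
).withDefault(identity)

def cleanWeather(rows: Seq[WeatherRow]): Seq[WeatherRow] =
  rows
    .filter(_.time.endsWith(":20"))                                  // keep one reading per hour (twenty past)
    .map(row => row.copy(condition = simplifyCondition(row.condition)))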

Next, we join the two datasets on the day and hour and produce the following results:

DayOfWeek  Hour  Temp  Condition      NumberOfItems
3          9     48    Partly Cloudy  2
6          8     37    Light Rain     0
2          13    48    Partly Cloudy  2
1          23    41    Cloudy         0
3          22    48    Cloudy         0
2          5     30    Fair           0
5          15    39    Fair           0
1          21    34    Fair           0
2          13    52    Mostly Cloudy  1

We are now ready to read this data into a Spark DataFrame for processing. As the model requires numeric attributes, we must define a CSV schema; otherwise every column will be read as a string. We then load the data as follows:

import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val csvSchema =
  StructType(
    Array(
      StructField("DayOfWeek", IntegerType),
      StructField("Hour", IntegerType),
      StructField("Temp", IntegerType),
      StructField("Condition", StringType),
      StructField("NumberOfItems", IntegerType)))

val df = spark.read
  .format("csv")
  .option("header", "true")
  .schema(csvSchema)
  .load(filename)

We will be training our model using a decision tree, which is an extremely versatile and powerful machine learning algorithm. One of the great advantages of decision trees is that the input values do not require normalisation, which simplifies our pre-processing steps.

Before we do anything else, we split the data into a training set containing 80% of the instances and a test set containing 20% of the instances. When we measure the accuracy of our model, it is important to use data that the model was not exposed to during training:

val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)

The random seed gives us determinism between test runs, but could be omitted if this is not required.

The goal of the next piece of processing is to transform this data into two essential columns: a ‘features’ column, which contains a numeric vector of the features that make up each training sample, and a target column, which specifies the target value we are trying to predict. In our case this is the number of hot chocolates sold during a given hour on a given day.

As the weather condition is a text-based categorical attribute rather than a numeric value, we need to do a bit of work to get this into the right shape. We will use one-hot encoding, which adds a column for each of the different categories, setting a value of 1 for the assigned category and zero for the others.

import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StringIndexer, VectorAssembler}

val conditionIndexer =
  new StringIndexer()
    .setInputCol("Condition")
    .setOutputCol("ConditionIndex")

val oneHotEncoder =
  new OneHotEncoderEstimator()
    .setDropLast(false)
    .setInputCols(Array("ConditionIndex"))
    .setOutputCols(Array("OneHotCondition"))

The StringIndexer first turns each condition category into an integer index (e.g. 0 for Fair, 1 for Rain, 2 for Cloudy, etc), adding a new column ‘ConditionIndex’ for this data.

The OneHotEncoderEstimator then uses this index column to create the one-hot vector under the new column ‘OneHotCondition’. The dropLast attribute allows the encoder to drop the last category and represent it implicitly as an all-zero vector. This is not ideal for the training algorithm that we are using, so we set this value to false.

Next, we add a feature assembler that combines all of the attributes that we are interested in into a single vector, adding this as a new 'features' column to the dataframe:

val featuresAssembler =
  new VectorAssembler()
    .setInputCols(Array("DayOfWeek", "Hour", "Temp", "OneHotCondition"))
    .setOutputCol("features")

We now have all that we need in order to assemble our pre-processing pipeline and train our model:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.regression.DecisionTreeRegressor

val pipeline = new Pipeline().setStages(Array(conditionIndexer, oneHotEncoder, featuresAssembler))
val pipelineModel = pipeline.fit(train)
val preparedData = pipelineModel.transform(train)
val decisionTreeModel =
  new DecisionTreeRegressor()
    .setLabelCol("NumberOfItems")
    .setFeaturesCol("features")
    .fit(preparedData)

To make predictions on our test data we can simply pass the processed test data to the transform method of our model. This will add a new ‘prediction’ column containing the predicted value, which can then be compared with the actual label value ‘NumberOfItems’ in order to measure accuracy:

val predicted = decisionTreeModel.transform(pipelineModel.transform(test))

As we are using a regression decision tree model, as opposed to a classifier, the prediction value will be a double representation of an average value (e.g. 2.3, 1.5) rather than an integer value. In our case, this represents the average number of hot chocolates expected to be sold in that hour. If we want whole numbers, we could use a decision tree classifier, or we could just apply the following transformation to round the predicted value (which we’ve found to be a bit more accurate):

import org.apache.spark.sql.functions.round

// Round the regression output and replace the 'prediction' column with the rounded values.
val rounded = predicted
  .withColumn("ptemp", round(predicted("prediction")))
  .drop("prediction")
  .withColumnRenamed("ptemp", "prediction")

So, how do we measure the accuracy of our model? As we are using a regressor rather than a classifier, the standard approach is to calculate the root mean square error (RMSE). This gives an indication of how much error the model makes in its predictions, weighting larger errors more heavily.

However, as we have rounded the predictions, our model is behaving a bit like a classifier and so we can also apply an accuracy evaluator, which returns the number of correct predictions as a percentage of the total instances. We need to be careful with accuracy as a performance metric, however. If 95% of our data instances contain zero hot chocolate purchases and we implement a simplistic model that predicts 0 for everything, then this model will score an accuracy of 95%. Not ideal. A better metric for classifiers is to use the F1 score, which calculates a combined metric considering both the true positives vs false positives (precision) and the true positives vs false negatives (recall).
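For completeness, here is a sketch of how both metrics can be computed with Spark’s built-in evaluators, assuming the predicted and rounded dataframes from the previous steps (the cast is there because the multiclass evaluator expects double-typed columns):

import org.apache.spark.ml.evaluation.{MulticlassClassificationEvaluator, RegressionEvaluator}
import org.apache.spark.sql.functions.col

// RMSE on the raw regression output.
val rmse = new RegressionEvaluator()
  .setLabelCol("NumberOfItems")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
  .evaluate(predicted)

// F1 score on the rounded predictions, treating each count as a class.
val f1 = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("f1")
  .evaluate(rounded.withColumn("label", col("NumberOfItems").cast("double")))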

Our model scores an RMSE of 0.55 (lower is better) and an F1 score of 0.83 (higher is better), which isn’t bad for a first pass, especially given the concerns we had that the data would not contain enough of a signal to make accurate predictions.

One of the great features of decision trees is that once the model is trained, it’s possible to interrogate which features contributed most to the prediction. It does this by measuring how much the tree nodes that use a given feature reduce the average impurity. Below is the feature importance map for our example:

Feature        Importance
Day Of Week    0.0976802843488258
Hour           0.760390309797637
Temp           0.123111801930883
Fair           0.0024406441394818
Mostly Cloudy  0
Light Rain     0
Partly Cloudy  0.0163769597831715
Cloudy         0
Fog            0
Mist           0
Light Drizzle  0
Haze           0
Rain           0
Wintry Mix     0
Light Snow     0
Drizzle        0

Interestingly, the most significant feature is the hour of the day, followed by the temperature and then the day of week. The weather condition is all but irrelevant with a very small importance attached to Fair and Partly Cloudy conditions.
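For reference, here is a minimal sketch of how these importances can be read off the trained model, assuming the pipelineModel and decisionTreeModel defined earlier; the ordering mirrors the VectorAssembler’s inputs, with the one-hot condition columns expanded at the end:

import org.apache.spark.ml.feature.StringIndexerModel

// The categories learned by the StringIndexer determine the order of the
// one-hot columns within the feature vector.
val conditionLabels =
  pipelineModel.stages.head.asInstanceOf[StringIndexerModel].labels

val featureNames = Array("DayOfWeek", "Hour", "Temp") ++ conditionLabels

featureNames.zip(decisionTreeModel.featureImportances.toArray).foreach {
  case (name, importance) => println(f"$name%-15s $importance%.4f")
}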

The model used so far is fairly basic and we’ve not performed any tuning of the model parameters available to us. So, can we improve upon our prediction metrics? One technique is to use a Random Forest, which trains a collection, or "ensemble", of Decision Trees that tends to make better predictions when the trees are queried in aggregate. Each tree in the forest is trained on a random sample of the data (and a random subset of the features), so combining many trees tends to even out the errors that any individual tree makes. We can also use cross validation to find optimal settings for parameters such as the maximum depth of each tree, the minimum instances per leaf node and the number of trees in the Random Forest.
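To give a flavour of what this looks like in code, here is a sketch of the tuning step; the grid values below are illustrative rather than the exact grid we used:

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.regression.RandomForestRegressor
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val randomForest = new RandomForestRegressor()
  .setLabelCol("NumberOfItems")
  .setFeaturesCol("features")

// Candidate values for the parameters we want to tune (examples only).
val paramGrid = new ParamGridBuilder()
  .addGrid(randomForest.maxDepth, Array(5, 10, 15))
  .addGrid(randomForest.minInstancesPerNode, Array(1, 5, 10))
  .addGrid(randomForest.numTrees, Array(20, 50, 100))
  .build()

// 3-fold cross validation, selecting the parameter combination with the lowest RMSE.
val crossValidator = new CrossValidator()
  .setEstimator(randomForest)
  .setEvaluator(new RegressionEvaluator().setLabelCol("NumberOfItems").setMetricName("rmse"))
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

val tunedModel = crossValidator.fit(preparedData)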

Putting all this together leads to some improvement in our performance and we end up with an RMSE value of 0.52 and an F1 score of 0.85, which, while not a huge difference, does equate to an 11% reduction in the error rate.

So far, we’ve only considered the number of hot chocolates in our prediction model. What about other product types? Listed below is a table showing a few more items along with the corresponding RMSE and F1 score:

Item             RMSE  F1 Score
Cookies          0.58  0.78
Coffee           1.65  0.62
Scone            0.42  0.89
Muffin           0.46  0.88
Pastry           0.70  0.79
Fudge            0.30  0.92
Victoria Sponge  0.03  0.99

We can see that there’s good accuracy for predicting the amount of fudge or the number of scones sold, but the number of coffees sold is a bit harder to predict.

We hope that this has been an insightful demonstration of the predictive power of decision trees and how they can be applied to a real-world business example in order to reduce costs and improve profits.