## Introduction

The need to find abnormalities in a system is undeniably universal. From detecting incoming emails offering unsolicited products to figuring out if a gear will malfunction in the engine you just built, timely knowledge about abnormalities is important.

In this week-long case study we will look at a recent task Xerini was given by a major Australian-based railway operator. They wanted the ability to predict failures of their railway network in a timely manner (a minimum of 24 hours in advance). The solution should be automated and accurately capture the real railway system’s properties – a complicated task that can be tackled with Xerini’s expertise in *data-driven methodologies*. So let’s look at what data we have so we can gain more insight into how to solve this problem.

## Defining the Machine Learning problem

Before choosing the modelling techniques, we need to define an ML formulation of the problem. Thankfully that is straightforward, as we have the data and the client requirements. We would like to distinguish between failures and OK readings, so we have a *binary classification* problem on *time-series* data. We know where failures have happened historically and we can encode that in the model by *labelling* the readings, making this task *supervised*. A good question to ask is: *how far back in time should we mark the readings as indicating failures?* If we go only a couple of hours back, we might not be able to get timely predictions; if we go too far back, we are saying that nominal readings are indicative of failure. This is a design question that is best answered by the domain experts, so we will use our client-specified minimum of 24 hours. At this point we note an important (and intuitive) observation: failures are quite rare. Less than one percent of all readings are labelled as failures, which will prove a problem later on when we are training our ML model. With this in mind, we can load the `anaconda` environment and start dissecting the data in preparation for our modelling.
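The labelling rule described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than the code used in the project; the function name `label_readings`, the timestamps and the 6-hour reading interval are all made up for the example. A reading is marked as a failure precursor if a known failure occurs within the following 24 hours.

```python
from datetime import datetime, timedelta

def label_readings(reading_times, failure_times, horizon=timedelta(hours=24)):
    """Return 1 for each reading taken at most `horizon` before a failure, else 0."""
    labels = []
    for t in reading_times:
        # Positive if some failure happens in the interval [t, t + horizon].
        is_precursor = any(timedelta(0) <= f - t <= horizon for f in failure_times)
        labels.append(1 if is_precursor else 0)
    return labels

# Hypothetical data: one reading every 6 hours, one failure at midnight on 3 January.
start = datetime(2020, 1, 1)
reading_times = [start + timedelta(hours=6 * i) for i in range(10)]
failure_times = [datetime(2020, 1, 3)]
labels = label_readings(reading_times, failure_times)
print(labels)
```

Widening `horizon` produces more positive labels but, as noted above, risks marking nominal readings as failure precursors.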

## Assumption

One key assumption is made for any supervised machine learning problem, namely that we have chosen *reasonable* predictors. For this case, the assumption is formulated as: *deviations in the sensor readings are good failure indicators*. This is an important assumption that we can attempt to validate to some extent by using the MRMR algorithm [1] in `MATLAB`. It measures how much the readings reveal about a failure occurring. More specifically, it measures the mutual information between each of our predictors and the label, and hence ranks how much a predictor *explains* the outcome label. As a base case, we apply this algorithm disregarding the time dimension to see how unsequenced raw readings behave:

Intuitively, `AverageCurrent` and `AmbientTemp` seem to be the most important predictors and the direction of travel has no effect on failures. But in reality the scale of importance is negligible. The most important predictor has `importance(AverageCurrent) == 0.0005`, whereas full explainability would have to be the entropy value of our target class, which is `~0.0558`. This is a problem because if we have weak predictors in the data, the model would not find an underlying pattern for rail failures. This is okay for now, because the low importance values only say that the non-time-sequenced data has poor predictive capabilities. We keep this observation in mind as we continue with all predictors to the time-series modelling phase.
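To make the comparison between a predictor's importance and the entropy of the target more concrete, here is a small sketch of both quantities for discrete data. It is not the MRMR implementation used above: it computes only the mutual-information (relevance) part, it assumes continuous sensor readings have already been binned, and the toy data is invented for illustration.

```python
from collections import Counter
from math import log

def entropy(values):
    """Shannon entropy (in nats) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * log(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

# Hypothetical toy data: a rare failure label and two binned predictors.
label = [0] * 96 + [1] * 4
informative = [0] * 96 + [1] * 4   # tracks the label exactly
uninformative = [0, 1] * 50        # statistically independent of the label

print(entropy(label))
print(mutual_information(informative, label))
print(mutual_information(uninformative, label))
```

A predictor that fully determines the label reaches the entropy of the label, which is the upper bound mentioned above; an independent predictor scores zero.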

## Preprocessing

There are some universal preprocessing techniques that should be applied prior to any model training. Namely, we encode our categorical `MovementType` into a machine-readable form: `normal_direction -> 0` and `reverse_normal -> 1`. Another very important step is handling missing values in the dataset. Unfortunately, `PeakForce` and `AverageForce` have many `NULL` entries. Empty readings are a common occurrence when gathering sensor data and it is usually impossible to retroactively gather the exact reading that was missed. If they are not that common, we can simply drop the incomplete entries without worrying. But in cases like ours, where 57% of the entries have missing values, we cannot afford to lose that much data. A simple solution is to substitute the missing values with a static value that makes sense. Using 0 to denote no force reading, or using the mean of the feature, both sound intuitively reasonable. More complicated methods exist, such as imputing the values from adjacent readings or using another machine learning model to guess the missing predictors from the existing ones. Due to the time constraint, we will use a simple method: we treat them as readings that have recorded negligible force values and set `NULL -> 0`.
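The two steps above can be sketched as follows. This is a plain-Python illustration under the assumption that each reading is a dictionary and that `None` stands in for a `NULL` entry; the helper name `preprocess` and the sample values are hypothetical.

```python
MOVEMENT_CODES = {"normal_direction": 0, "reverse_normal": 1}
FORCE_COLUMNS = ("PeakForce", "AverageForce")

def preprocess(reading):
    """Return a copy of one reading with the direction encoded and NULL forces set to 0."""
    cleaned = dict(reading)
    cleaned["MovementType"] = MOVEMENT_CODES[reading["MovementType"]]
    for column in FORCE_COLUMNS:
        if cleaned[column] is None:  # None stands in for a NULL entry
            cleaned[column] = 0.0
    return cleaned

# Hypothetical readings; the values are made up for illustration.
raw = [
    {"MovementType": "normal_direction", "PeakForce": 46.9, "AverageForce": 28.9},
    {"MovementType": "reverse_normal", "PeakForce": None, "AverageForce": None},
]
clean = [preprocess(r) for r in raw]
print(clean)
```

Swapping the constant `0.0` for a per-column mean would give the mean-imputation variant mentioned above.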

Preprocessing the data is also the time when we can get to understand it in depth. After we load the raw data using `pandas`, we can get a summary of our numerical predictors:

Predictor | count | mean | std | min | max |
---|---|---|---|---|---|
MovementTime | 585933 | 4.041748 | 1.903605 | 1 | 10 |
PeakCurrent | 585933 | 4.646638 | 3.329239 | 0 | 75.019997 |
AverageCurrent | 585933 | 3.254558 | 2.863608 | -1.6832 | 75.019997 |
PeakForce | 585933 | 46.972255 | 56.528667 | -756.080017 | 161.240005 |
AverageForce | 585933 | 28.922978 | 35.412441 | -756.130005 | 134.259995 |
AmbientTemp | 585933 | 28.471268 | 6.513162 | 0 | 53.900002 |

We notice that both `PeakForce` and `AverageForce` have a very low minimum. Usually outliers such as these are important predictors: intuitively, an anomaly in the reading implies an anomaly in the normal operation of the railway. But upon further inspection we can see that both values come from a single reading and there are no other readings that are remotely close to the `-700` range. Furthermore, other sensor readings recorded at similar times of day have values that increase and decrease together, whereas this value appears in one reading and is not reflected in any of the other readings for that location. To validate our assumption, we cross-reference with our mapping to see if this reading has been labelled as a failure. It has not, so we treat it as erroneous and discard it.

The next thing we are interested in is the range of the `Force` and `Current` readings. They have negative values because the sensors measure around a reference point and the sign indicates direction in relation to that reference point. We assume that a change in magnitude would indicate failures irrespective of the reference point of measurement. Therefore we replace the readings with their absolute values.

We can see clearly that all predictors are on different scales, which is undesirable for most ML algorithms. That is because a small change in `PeakForce` of 5 units is a massive change for `MovementTime`. To monitor changes in our data accordingly, we *normalize* the data by removing the mean (centring the data around 0) and scaling to unit variance. This normalization method (often called *z-score normalization*) is preferred when the feature bounds are unknown (i.e. we do not know the maximum current that can flow) [2].
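As a minimal sketch of z-score normalization, the snippet below centres a column on 0 and scales it to unit variance using only the standard library. The use of the population standard deviation is an assumption made for this example, and the column values are invented.

```python
from statistics import fmean, pstdev

def z_score(values):
    """Centre a column on 0 and scale it to unit variance."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (an assumption); must be non-zero
    return [(v - mean) / std for v in values]

# Hypothetical columns on very different scales.
movement_time = [1.0, 3.0, 4.0, 6.0, 10.0]
peak_force = [0.0, 20.0, 45.0, 90.0, 160.0]

scaled_time = z_score(movement_time)
scaled_force = z_score(peak_force)
print(scaled_time)
print(scaled_force)
```

After scaling, a change of one unit means "one standard deviation" for every predictor, so changes in `PeakForce` and `MovementTime` become comparable.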

Printing the data summary for the numerical predictors now looks more consistent and ready for our modelling stage.

Predictor | count | mean | std | min | max |
---|---|---|---|---|---|
MovementTime | 585932.0 | -1.081023e-15 | 1.000001 | -1.597896 | 3.130015 |
PeakCurrent | 585932.0 | -7.179012e-17 | 1.000001 | -1.396202 | 21.146069 |
AverageCurrent | 585932.0 | 3.601148e-16 | 1.000001 | -1.137192 | 25.075901 |
PeakForce | 585932.0 | 3.686520e-16 | 1.000001 | -0.831134 | 2.021745 |
AverageForce | 585932.0 | -1.940274e-17 | 1.000001 | -0.817161 | 2.975816 |
AmbientTemp | 585932.0 | -9.421968e-16 | 1.000001 | -4.371345 | 3.904212 |

## Modelling

We would like to be able to predict a minimum of 24 hours ahead of time. This means our sequences will contain a minimum of 100 timesteps (because of the frequency of sensor readings). A model that has been shown to handle sequences of this length both in research and in industry is the *Long short-term memory* (LSTM) recurrent neural network [3].

The question now is: what network architecture is optimal? We are interested in 24-hour predictions and because of that a `sequence-to-vector` LSTM network is ideal. It takes a sequence (history) of readings and outputs one prediction for the next time frame (24 hours). Other networks that attempt to predict multiple time-steps ahead can be considered but, due to the scope of this project, we will focus only on the next 24 hours.

Trial and error and applying Occam’s razor principles produced the following network structure:

- input layer :: accepts the time-sequenced data as explained in the next section
- LSTM(0) :: 32 densely connected neurons using the `tanh` activation and having `dropout=0.1`
- LSTM(1) :: 32 densely connected neurons using the `tanh` activation and having `dropout=0.1`
- output layer :: one neuron (failure) using the `sigmoid` activation

The `tanh` activation function is selected as it is by far the fastest because of its integration with the hardware acceleration provided by cuDNN [4]. `dropout` is set because, through trial and error, it has been shown to help reduce overfitting. It does so by randomly setting unit activations (rather than the trained weights themselves) to 0 at a rate of 0.1 during training, and as such helps reduce the reliance on particular neuron pathways.
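To illustrate what a dropout rate of 0.1 does, here is a small sketch of the common "inverted dropout" formulation in plain Python: each value is zeroed with probability `rate` and the survivors are rescaled so the expected sum stays the same. This is a conceptual illustration only, not the internals of any particular library; the function name and data are hypothetical.

```python
import random

def dropout(activations, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest."""
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]

rng = random.Random(0)          # fixed seed so the example is repeatable
activations = [1.0] * 10_000    # hypothetical layer activations
dropped = dropout(activations, rate=0.1, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
print(zero_fraction)
```

Dropout is only active during training; at prediction time all units are kept.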

## Training the LSTM network

By far the most difficult issue to overcome is the lack of balance in our failure/OK labels. As passengers it is comforting to know that railway failures are rare but, as engineers, trying to reduce those failures to 0 is difficult when they are seldom observed. Waiting for more data to be gathered would undoubtedly solve this issue, but due to our time constraint of one week we have to resort to workarounds. We found that a simple oversampling technique gives us promising results. We oversample by duplicating instances of the minority class (failures) until they match the majority class in count. This naïve method does not add any new information, so on its own it will not improve the accuracy of the model. Despite this, it is still useful, as it has been shown in practice that it contextualises the error metrics at the end, making them more robust and believable [5].
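A minimal sketch of this kind of random oversampling is shown below. It assumes failures (label 1) are the minority class and that at least one failure exists; the function name and toy data are hypothetical, and duplicates are drawn at random with replacement.

```python
import random

def oversample_failures(samples, labels, rng):
    """Duplicate randomly chosen failure samples until both classes are the same size."""
    failures = [s for s, y in zip(samples, labels) if y == 1]
    n_ok = len(labels) - len(failures)
    # Draw as many duplicates as needed to match the OK class.
    extra = [rng.choice(failures) for _ in range(n_ok - len(failures))]
    out_samples = list(samples) + extra
    out_labels = list(labels) + [1] * len(extra)
    # Shuffle so the duplicates are not all grouped at the end.
    order = list(range(len(out_samples)))
    rng.shuffle(order)
    return [out_samples[i] for i in order], [out_labels[i] for i in order]

# Hypothetical data: 100 samples, of which only the last 2 are failures.
samples = list(range(100))
labels = [1 if s >= 98 else 0 for s in samples]
balanced_samples, balanced_labels = oversample_failures(samples, labels, random.Random(0))
print(len(balanced_labels), sum(balanced_labels))
```

One caveat worth keeping in mind: duplication is normally applied to the training data only, so that copies of the same failure do not end up in both the training and the evaluation sets.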

The data needs to be sequenced in batches before we can feed it into the LSTM network. Each batch contains 24 hours’ worth of readings. We do this because the network has no inherent understanding of time, but we can simulate that by giving it *slices* of the data in sequence. All the failure/OK labels are put in a separate vector, where the element at each position is the label of the corresponding 24-hour slice. See fig. 1 and 2 for a visualisation of the batches and corresponding label vector.
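The slicing step can be sketched as a sliding window over the time-ordered readings. The details here are assumptions made for illustration: windows overlap with a stride of one reading, and each window takes the label of its last reading. The function name and the tiny dataset are hypothetical.

```python
def make_windows(readings, labels, window):
    """Slice a time-ordered series into overlapping windows with one label each."""
    sequences, targets = [], []
    for end in range(window, len(readings) + 1):
        sequences.append(readings[end - window:end])
        targets.append(labels[end - 1])  # label of the last reading in the window
    return sequences, targets

# Hypothetical data: 6 timesteps with 2 features each.
readings = [[float(t), float(t) * 2] for t in range(6)]
labels = [0, 0, 0, 1, 1, 0]
sequences, targets = make_windows(readings, labels, window=3)
print(len(sequences), targets)
```

The result has the `(number of windows, timesteps, features)` layout that sequence models typically expect, with `targets` playing the role of the label vector.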

We randomly take the readings for one location out and use the other 17 in the training process. The readings from those 17 locations are split into training and validation sets with a ratio of 0.7 : 0.3. The validation set is used to tune the parameters of our network during the training process, which incurs bias and makes its metrics overoptimistic. Because of this, we should not use it as our definitive metric for performance. A more interesting measurement is to see how the best model performs for an unseen location.
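The hold-out scheme above can be sketched as follows. The shuffling of the remaining windows before the 0.7 : 0.3 cut is an assumption for this example (a chronological cut would be an equally plausible reading), and the function name, location names and placeholder windows are all hypothetical.

```python
import random

def split_by_location(windows_by_location, rng, train_fraction=0.7):
    """Hold out one whole location for testing; split the rest into train/validation."""
    locations = sorted(windows_by_location)
    test_location = rng.choice(locations)
    remaining = [
        w for loc in locations if loc != test_location
        for w in windows_by_location[loc]
    ]
    rng.shuffle(remaining)
    cut = round(len(remaining) * train_fraction)
    return {
        "test_location": test_location,
        "train": remaining[:cut],
        "validation": remaining[cut:],
        "test": list(windows_by_location[test_location]),
    }

# Hypothetical data: 18 locations with 10 placeholder windows each.
data = {f"loc_{i:02d}": [(f"loc_{i:02d}", j) for j in range(10)] for i in range(18)}
split = split_by_location(data, random.Random(42))
print(split["test_location"], len(split["train"]), len(split["validation"]))
```

Because the test location never contributes to training or validation, its metrics give a less biased view of how the model might behave on a location it has not seen.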

## Evaluation

Evaluating model performance for a complicated multi-dimensional problem is not straightforward and there are many avenues that can be explored and different conclusions can be drawn, some more accurate than others. To keep ourselves focused and avoid incorrectly interpreting the model performance, we systematize the evaluation in four steps:

- Choose metrics suitable for the problem
- Build a naïve model that should theoretically be unusable and use this as our null hypothesis model (i.e. all models we build should be better than this one)
- Experimentally train and evaluate different LSTM architectures, choosing the best one in terms of the metrics chosen
- Evaluate the best model against the unseen test location we extracted earlier.

## Metrics

The model is evaluated against three distinct metrics:

- loss: binary cross-entropy
- precision
- recall

Cross-entropy measures the distance between the probability distributions of two random variables. If we take the random variables to be our prediction and the true class labels, measuring the distance gives us a function to be minimised (if `cross-entropy == 0` we predict perfectly!).
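For a concrete feel of the loss, here is a small sketch of mean binary cross-entropy in plain Python. Predictions are clipped away from 0 and 1 to avoid taking the logarithm of zero; the clipping constant `eps` and the example predictions are choices made for this illustration.

```python
from math import log

def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * log(p) + (1 - y) * log(1.0 - p))
    return total / len(y_true)

y_true = [0, 0, 1, 1]
confident_right = binary_cross_entropy(y_true, [0.0, 0.0, 1.0, 1.0])
undecided = binary_cross_entropy(y_true, [0.5, 0.5, 0.5, 0.5])
confident_wrong = binary_cross_entropy(y_true, [1.0, 1.0, 0.0, 0.0])
print(confident_right, undecided, confident_wrong)
```

Confident correct predictions drive the loss towards 0, a model that always outputs 0.5 scores ln 2 (about 0.693), and confident wrong predictions are penalised heavily.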

In terms of prediction accuracy, using pure accuracy (how many we got right out of how many in total) is not very useful because the original dataset is unbalanced. If only 1% of the data is labelled as “failure”, then a model that *always* chooses “OK” will have 99% accuracy. Because of this, we split the accuracy measure in two: Precision and Recall. Precision tells us what percentage of the time we are correct when we label a reading as “failure”, and Recall tells us what percentage of the failures we have actually caught. There is a trade-off between these two measurements that we need to consider. In the context of our problem, failing to detect a “failure” in the network is more dangerous than saying there is a failure when everything is actually OK. It is therefore acceptable for the model to have lower Precision as long as it has high Recall: we will detect a large share of the failures that will occur and suffer the occasional false positive, i.e. flag a failure when everything is alright. A false positive can be checked manually by a human, but a missed failure can be detrimental.
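The 99% accuracy trap can be reproduced with a few lines of plain Python. The two "models" below are hypothetical prediction vectors, and returning 0 for precision or recall when the denominator is zero is a convention chosen for this sketch.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive ("failure") class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0  # convention: 0 when nothing is flagged
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical data: 1 failure in 100 readings.
y_true = [1] + [0] * 99
always_ok = [0] * 100                  # never predicts a failure
cautious = [1] + [1] * 9 + [0] * 90    # catches the failure, raises 9 false alarms

print(accuracy(y_true, always_ok), precision_recall(y_true, always_ok))
print(accuracy(y_true, cautious), precision_recall(y_true, cautious))
```

The always-OK model looks excellent on accuracy but catches nothing, while the cautious model trades precision for full recall, which is the side of the trade-off preferred above.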

## Benchmarks against unseen test location

Training time on a GeForce RTX 2060 (coreClock: 1.2GHz coreCount: 30 deviceMemorySize: 6.00GiB deviceMemoryBandwidth: 312.97GiB/s) accelerated device with cuDNN enabled for 92 epochs with early stopping enabled (tolerance 50): ~2 hours 26 min.

*Note: benchmarks are rounded down to 4 decimal places*

Model | Loss | Precision | Recall |
---|---|---|---|
logit (base) | 0.9261 | 0.0000 | 0.0000 |
(2 layer) LSTM | 1.9735 | 0.0011 | 0.0909 |

The basic model, which disregards the time sequencing of the data, behaves poorly. This is unsurprising, as it was also hinted at by our feature importance when we were stating our assumption.

What is more surprising is that the best LSTM model does not converge when evaluating against the unseen test case. Furthermore, our initial suspicion that we have poor predictors gets confirmed by the low precision and recall values. By plotting the predictions against the actual values we can see that we are able to predict some failures 24 hours in advance, but with high noise.

So, what are the next steps after this one week?

## Next steps

As showcased, the most important aspect to be improved is the quality of the predictors. At the end of week one we would go back to the drawing board with our client to engineer predictors with greater predictive capabilities. For example, the railway operator also has a `RailTemperature` log that we had not obtained with this initial dataset, and it might be a very good predictor of failures. Furthermore, a better predictive model can be achieved when we revisit the simplifications made in preprocessing the data (the `null` to `0` substitutions) and the naïve handling of class imbalance, as previously explained.

A basic predictive model opens a plethora of avenues for our client. The model can be integrated in an *online* system, so that it gets updated as new data comes in. Allowing users to give feedback on the accuracy will also help keep the model up to date. A dashboard for monitoring and relabelling predictions can also be developed and tailored to the client’s needs.

For this experiment, the model was trained on our local GPU enabled machines. For a production model we can leverage AWS’ cloud GPU instances for even faster training when the data to be processed becomes too large for a single machine.

On another note, we at Xerini believe that keeping a broad mind is as important as training a model that can solve a specific problem. For example, the railway operator’s data also included work orders for routine rail lubrications and, although it is not in the current scope, with the same data we could analyse when these expensive chores are useful and when they are just an unnecessary cost.

## References

[1] Ding, C., and H. Peng. “Minimum redundancy feature selection from microarray gene expression data.” Journal of Bioinformatics and Computational Biology. Vol. 3, Number 2, 2005, pp. 185–205.

[2] L. Al Shalabi and Z. Shaaban, “Normalization as a Preprocessing Engine for Data Mining and the Approach of Preference Matrix,” 2006 International Conference on Dependability of Computer Systems, Szklarska Poreba, 2006, pp. 207-214, doi: 10.1109/DEPCOS-RELCOMEX.2006.38.

[3] A. Géron, Hands-on machine learning with Scikit-Learn, Keras, and TensorFlow: Concepts, tools, and techniques to build intelligent systems, O’Reilly Media 2019, ch. 15, pp. 511-517.

[4] Team, K., 2020. Keras Documentation: LSTM Layer. [online] Keras.io. Available at: https://keras.io/api/layers/recurrent_layers/lstm/ [Accessed 1 November 2020].

[5] L. Kumar and A. Sureka. 2018. Feature Selection Techniques to Counter Class Imbalance Problem for Aging Related Bug Prediction: Aging Related Bug Prediction. In Proceedings of the 11th Innovations in Software Engineering Conference (ISEC ’18). Association for Computing Machinery, New York, NY, USA, Article 2, 1–11. DOI: https://doi.org/10.1145/3172871.3172872