The compliance director of one of our clients approached us recently about creating a software system to help them detect fraud or market abuse within the transaction data from their trading team. There are plenty of off-the-shelf offerings in this space, but they tend to come with a hefty price tag, are complicated to use, and sometimes require labelled data, or data converted into a specific format, which further adds to the expense. Our client wanted to see if we could devise a bespoke solution, tailored to their needs, that also proves cost-effective, particularly over the longer term.
What our client was after was a simple tool that could ingest their transaction data over a given time window and return a list of potentially anomalous transactions that may require further manual investigation.
To prepare for our pitch, we turned to Kaggle to find some fraud data in the public domain on which we could practise. We found the Credit Card Fraud Detection dataset, which seemed perfect for our needs. This dataset contains over 280,000 anonymised credit card transactions, of which 492 have been labelled as fraudulent. Given that such a low percentage of the transactions are fraudulent, we need to be careful about how we measure the accuracy of our models. A naive classifier that simply labelled everything as non-fraudulent would score an accuracy of over 99%! When measuring performance on such an imbalanced dataset, one should also consider metrics such as precision and recall; indeed, there's a note on the Kaggle description to this effect.
The Kaggle dataset contains a few obvious attributes: a timestamp for the transaction, the amount of the transaction and a label flag, which has a value of 0 for non-fraudulent transactions and 1 for fraudulent ones. The other attributes are labelled v1 – v28 and contain values such as -1.3598071 and 1.19185711. These values have been derived from the original dataset using a technique called Principal Component Analysis (PCA). PCA is outside the scope of this article; essentially, it reduces the dimensionality of the data by picking out the directions that best describe its variance. It is used here to anonymise the original dataset, which is unfortunate because we lose the context of what these attributes relate to and whether they are useful to include in our model. One advantage, though, is that they are already normalised with a mean of 0 and a standard deviation of 1, which is helpful because many models work best when trained on normalised data.
Given that the Kaggle dataset is labelled, so we know which transactions are fraudulent and which are not, we could use a supervised learning algorithm to create a model that classifies future transactions for us. However, given our client's brief, we will not have the luxury of labelled data in the real world, so we're interested in creating a model that can identify potentially anomalous transactions in an unlabelled dataset. For this we need unsupervised learning.
One popular unsupervised learning technique is clustering, which takes a list of instances and groups similar instances together into clusters. Anomalies in the data can then be identified by looking for instances that are isolated, or that lie further from the centre of every cluster than other instances do. It is often the case that fraudulent transactions are also anomalous and do not fit the patterns inherent in legitimate transactions. This is a powerful technique and one that we can put to use to satisfy our brief.
One of the most popular clustering algorithms is k-means, a simple algorithm capable of clustering many kinds of datasets quickly and efficiently. However, it doesn't work well with clusters of varying sizes or densities, or with non-spherical shapes.
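To see this weakness concretely, here's a small sketch using scikit-learn's make_moons toy generator (nothing to do with the transaction data): two crescent-shaped clusters are obvious to the eye, but k-means, which carves space into roughly spherical regions, cannot recover them cleanly.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

# Two interleaving crescents: clearly two clusters, but not spherical.
X, y = make_moons(n_samples=400, noise=0.05, random_state=0)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Fraction of points assigned to the correct crescent, taking the
# better of the two possible cluster-label orderings.
acc = np.mean(km.labels_ == y)
score = max(acc, 1 - acc)
print(f"{score:.0%}")
```

On this toy example k-means lands well short of a clean split, because no placement of two centroids can follow the crescents' shapes.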
We are therefore going to look at the DBSCAN algorithm, which uses an approach based on local density estimation and works a lot better when identifying clusters of arbitrary shapes. The DBSCAN algorithm works as follows:
- For each instance, we count how many instances are located within a given distance of it; that distance is defined by the epsilon (eps) parameter.
- A second parameter, minimum samples, is defined; if an instance has at least this many instances in its neighbourhood (including itself), it is considered a core instance.
- Clusters are then derived from the neighbourhoods of core instances, which may include other core instances. Long sequences of neighbouring core instances can form a single cluster.
- Any instance that does not belong to a cluster is considered anomalous.
The last point is what we're looking for: an easy way to identify anomalies. An advantage of DBSCAN is that the scikit-learn implementation has a nice API for identifying these outliers. Once we fit the model to a given dataset, anomalous instances are those assigned the cluster label -1, accessed via the labels_ attribute.
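A minimal sketch of that API on synthetic two-dimensional data (a tight blob with a few planted outliers, standing in for real transactions):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
# A dense blob of "normal" points plus three far-away outliers
# (a synthetic stand-in, not the Kaggle data).
normal = rng.normal(0.0, 0.1, size=(200, 2))
outliers = np.array([[5.0, 5.0], [-4.0, 6.0], [7.0, -3.0]])
X = np.vstack([normal, outliers])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)
# Instances that belong to no cluster are labelled -1.
print(X[db.labels_ == -1])
```

The three planted points sit alone, far from any dense neighbourhood, so DBSCAN leaves them out of every cluster and they come back with the label -1.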
It’s now time to see how well the DBSCAN algorithm works on the credit card fraud dataset from Kaggle. Running it with the default model parameters, an eps of 0.5 and a minimum samples value of 5, yields 196,102 outliers, or just under 70% of all instances! Granted, the outliers found do include 98% of the fraudulent transactions, but the number of false positives is far too high for this model to be of any practical use. We can do better.
Tweaking these parameters a bit and using an eps of 0.99 and a minimum samples value of 2 yields 107,181 outliers, which is 38% of all instances and still captures 88% of the fraudulent transactions. This is better but still not great. Right now, we’re using all of the attributes from the dataset, v1 – v28 plus the transaction amounts, and this complexity is making it hard for the algorithm to identify clusters. Cutting the number of attributes down by taking just the first 9 attributes plus the transaction amounts yields 23,149 (8%) outliers but still manages to capture 74% of the fraudulent transactions. Simplifying the input even further to the first three attributes plus the transaction amount yields 1,996 outliers that contain 26% of the fraudulent transactions, which is a much more manageable list.
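The shape of this experiment can be sketched as below. The helper is illustrative, shown running on a small synthetic frame that mimics the Kaggle column layout (V1–V3 plus the amount and the fraud flag); with the real data you would load creditcard.csv instead.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

def dbscan_outliers(df, feature_cols, eps, min_samples):
    """Fit DBSCAN on the chosen columns and return a boolean outlier mask."""
    model = DBSCAN(eps=eps, min_samples=min_samples).fit(df[feature_cols])
    return model.labels_ == -1

# Small synthetic stand-in with the same column layout as the Kaggle data.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "V1": rng.normal(size=n), "V2": rng.normal(size=n),
    "V3": rng.normal(size=n), "Amount": rng.normal(size=n),
    "Class": np.zeros(n, dtype=int),
})
# Plant five isolated "fraudulent" rows far from the bulk of the data.
for i in range(5):
    df.loc[i, ["V1", "V2", "V3", "Amount"]] = 50.0 + 20.0 * i
    df.loc[i, "Class"] = 1

mask = dbscan_outliers(df, ["V1", "V2", "V3", "Amount"], eps=0.99, min_samples=2)
recall = df.loc[mask, "Class"].sum() / df["Class"].sum()
print(mask.sum(), f"recall: {recall:.0%}")
```

Swapping in different feature_cols lists is all it takes to reproduce the attribute-subsetting runs described above.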
What about only detecting 26% of the fraudulent transactions? If we were building a classifier designed to detect all possible fraudulent transactions then this wouldn’t be a great metric. However, that’s not the goal here. We’re after a system that can identify a manageable portion of transactions that could potentially be fraudulent and that can then be investigated further through manual checks. It’s worth reiterating how few transactions in this dataset were fraudulent, 492, and this model found 129 of them. We ran some checks: across 1,000 iterations, sampling 1,996 transactions at random yielded a maximum of only 10 fraudulent ones, giving our result a p-value close to zero and confirming that the model is doing far better than chance.
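That check amounts to drawing 1,996 transactions at random and counting the frauds captured, which is a hypergeometric draw. A quick simulation sketch, assuming the dataset's full count of 284,807 transactions with 492 frauds (the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n_total, n_fraud, sample_size = 284_807, 492, 1_996

# Number of frauds captured by each of 1,000 random samples of 1,996.
counts = rng.hypergeometric(ngood=n_fraud, nbad=n_total - n_fraud,
                            nsample=sample_size, size=1_000)

# Empirical p-value: how often does chance match DBSCAN's 129 frauds?
p_value = np.mean(counts >= 129)
print(counts.max(), p_value)
```

A random sample captures only a handful of frauds on average, so chance never comes close to the 129 the model found.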
We also simplified the input by omitting attributes arbitrarily; we could instead have used PCA to reduce the dimensionality for us. However, in this case, using PCA to reduce the dataset to four dimensions produces similar results, with 1,981 anomalies containing 26% of the fraudulent transactions. If more were known about the underlying dataset and it hadn’t been anonymised in this way, we could have been smarter about the attributes selected for the model and perhaps obtained superior results. Nonetheless, we’re impressed with these results and with the power of the DBSCAN algorithm for finding instances of fraud through anomaly detection. We look forward to building this work out further for our client.
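To close, here is what that PCA-then-DBSCAN variant might look like as a sketch, again on synthetic 29-column data with a few planted anomalies rather than the real dataset:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

rng = np.random.default_rng(7)
# Synthetic stand-in for the 29 attributes (v1 - v28 plus the amount).
X = rng.normal(size=(300, 29))
# Plant three anomalies far from the bulk, in different directions.
X[0] += 30.0
X[1] -= 30.0
X[2, 0] += 200.0

# Let PCA choose the four directions of greatest variance,
# then run DBSCAN in that reduced space.
X4 = PCA(n_components=4).fit_transform(X)
labels = DBSCAN(eps=0.99, min_samples=2).fit(X4).labels_

# The planted anomalies end up with the outlier label -1.
print(labels[:3], (labels == -1).sum())
```

Because the planted points dominate the variance, the four retained components still separate them cleanly from the bulk of the data.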