## Anomaly detection is the process of finding events or observations in a particular dataset that do not conform to an expected pattern.

A good example of an anomaly is a fraudulent bank transaction. Say that you mostly use your bank card when shopping for groceries and the occasional night out. One day, your bank suddenly registers that you tried to buy two new iPhones in the span of 24 hours. The bank notices the discrepancy, temporarily blocks these two transactions and quickly checks if it was indeed you that made those purchases. This way, you’re trading a possible slight inconvenience for the chance of saving yourself (or your bank) a lot of time and money in case your bank details or debit card ever get stolen.

But how can we tell if a bank transaction is fraudulent? How do we decide if a particular purchase is out of the ordinary, based on the previous behaviour of a bank customer?

Of course, there are many approaches one could use, but they are all, in essence, statistical: we use data relating to the customer’s earlier purchase history to build an idea of what “expected” behaviour looks like, and then test new purchases against it.

**Estimating the location of data**

When attempting to understand an unknown random process, an often used first approach is to try to estimate the average. This statistic is an example of a location parameter – the average gives you an idea of where the most probable values are likely to be found. It provides information about the location of the highest density of probability. This is especially true given a small sample; the idea is that in such a collection, almost all values will be those that you are very likely to observe anyway.

Only when your sample gets larger will you start recording measurements that are less common.

But here you can start running into trouble. Remember that the average is computed by summing up all the measurements in the sample and dividing by the number of observations made. In other words, every element of the sample is given equal weight in the final estimate – which is a problem if you are trying to home in on an area of high probability. Ideally, you would want to assign low (or even no) relevance to improbable outcomes and high relevance to the probable ones. Otherwise, your sample average will be a poor estimate of the location of data, as the low-probability values will start “dragging” the average value towards them and away from the centre of the bulk of other measurements.

Unusual observations are often called outliers, and can be caused either by measurement error or by observing infrequent events. In either case, they can lead to large estimation errors. In statistical terms, we say that the sample average has a low breakdown point. The breakdown point is the smallest percentage of outliers in the sample that can bias the estimate by an arbitrarily large amount. For the sample average, a single sufficiently extreme outlier is enough, so its breakdown point is effectively 0%, which is as bad as it gets. And, intuitively, the breakdown point also cannot be higher than 50% since, if half the data is unusual, who can argue which half is strange and which is normal?

A much better estimator of location is the sample median. The median is the value right in the “middle” of the sample – half of the data is less than or equal to it and the other half is larger than it. **The median is a robust estimator of a random variable’s location** – its breakdown point is 50%.

To see why the median is virtually unaffected by an outlier, let’s say that you are trying to compute the average temperature of objects in your apartment. You make a few measurements – the table, chair, wardrobe – all are around 20 degrees centigrade. You are getting ready to bake, so your oven is at 180. The average temperature would be something like

1/4*(20 + 20 + 20 + 180) = 60 degrees.

The outlier (the hot oven) dragged the sample average way higher than what is likely a more reasonable average temperature of 20 degrees. However, the flat’s median temperature is 20 degrees. Why? Half the data (say, the wardrobe and the chair) have temperatures less than or equal to it, and the other temperatures (the table and oven) are greater than, or equal to 20. What’s more, even adding one more hot oven to your flat wouldn’t change the median’s value!
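The flat example is small enough to check directly. Here is a minimal sketch using Python's standard `statistics` module; the variable names are my own choices for illustration.

```python
from statistics import mean, median

# Temperatures (degrees centigrade) of the table, chair, wardrobe and oven.
temperatures = [20, 20, 20, 180]

sample_mean = mean(temperatures)      # dragged upwards by the hot oven
sample_median = median(temperatures)  # stays with the bulk of the data

# Adding a second hot oven moves the mean even further, but not the median.
with_second_oven = temperatures + [180]
mean_two_ovens = mean(with_second_oven)
median_two_ovens = median(with_second_oven)

print("one oven: ", sample_mean, sample_median)
print("two ovens:", mean_two_ovens, median_two_ovens)
```

With one oven the average is 60 and the median is 20; with two ovens the average climbs to 84 while the median stays put at 20.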

**Measuring random variable scale**

Knowing the location of data is important – it gives us a hint as to where the bulk of measurements is likely to lie. However, it does not give us the whole picture. What we also need to know is how spread out the data are. Are data points mostly clustered around the “central” part of the distribution OR do they tend to frequently be much larger (or smaller) than common values?

A classical measure of random variable scale, related to the sample average, is the standard deviation. It is defined as the square root of the **sample variance**, which itself measures the *average square discrepancy* of the data from its sample mean.

Why square discrepancy? The very practical reason is that, if we were to just use regular discrepancies (*i.e*. the difference between a given data point and the sample mean) we might run into the problem of negative discrepancies cancelling out the positive ones, making the average discrepancy artificially small.

The reason we then take a square root to compute the standard deviation is that the sample variance is no longer in the same units as the original data, itself being computed from squares of deviations from the sample mean. Taking the square root remedies that.
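To make the cancellation argument concrete, here is a small sketch on the flat temperatures from earlier. I am assuming the "average square discrepancy" divides by the number of observations (the population form), which is what `statistics.pvariance` and `statistics.pstdev` compute; dividing by one less than the sample size is another common convention.

```python
from statistics import mean, pstdev, pvariance

temperatures = [20, 20, 20, 180]
centre = mean(temperatures)

# Plain discrepancies: negative and positive values cancel out exactly.
discrepancies = [t - centre for t in temperatures]
average_discrepancy = sum(discrepancies) / len(discrepancies)

# Square discrepancies cannot cancel, but they are in squared degrees.
variance = pvariance(temperatures)

# Taking the square root brings us back to degrees.
standard_deviation = pstdev(temperatures)

print(discrepancies, average_discrepancy)
print(variance, round(standard_deviation, 3))
```

The plain discrepancies average out to exactly zero, which says nothing about the spread, while the variance is 4800 squared degrees and the standard deviation is roughly 69.3 degrees.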

Unfortunately, the standard deviation suffers from the same issues as the sample average. How could it not, when the sample average is directly used to compute it? Once again, the presence of **outliers** will bias the sample average and, as a consequence, the standard deviation itself.

As we have already seen, a more robust choice for a location estimator is the **median**. In the same way, for estimating the scale of a random variable, we can turn to the **median absolute deviation** (often abbreviated as **MAD**).

To compute the MAD of a sample, we take the median of the absolute deviations (as opposed to square deviations) of each sample point from the sample median (instead of the average).

You might imagine that the MAD would be a very robust estimator of a random quantity’s scale and you would be right – its breakdown point is the same as the median’s. For example, the MAD of the sample temperatures in our flat (including the hot oven) is 0. This indicates that, as we might expect, most objects are very close in temperature and, consequently, there is (almost) no spread in the data.
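The recipe translates almost word for word into code. This is a minimal sketch; the function name `mad` is my own, not a standard library function.

```python
from statistics import median

def mad(sample):
    """Median absolute deviation: the median of |x - median(sample)|."""
    centre = median(sample)
    absolute_deviations = [abs(x - centre) for x in sample]
    return median(absolute_deviations)

# The flat temperatures, hot oven included.
flat_mad = mad([20, 20, 20, 180])

# A second toy sample (made up for illustration) with one wild value.
toy_mad = mad([1, 2, 3, 4, 100])

print(flat_mad, toy_mad)
```

For the flat, three of the four absolute deviations are zero, so their median is zero as well. In the toy sample the value 100 contributes a deviation of 97, but the median of the deviations is still just 1.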

**Standard scores**

Now that we have an idea of a random variable’s location (where the bulk of the data tends to be) and scale (a measure of how spread out the bulk of the data is), we can determine if a particular data point’s value is strange or not: we can start quantifying the abnormality of a specific value.

The logic is as follows. Let $\hat{l}$ be an estimate of a quantity’s location (say, its median) and $\hat{s}$ an estimate of its scale (its MAD). Next, take a data point $x$ and compute

\[ \frac{x - \hat{l}}{\hat{s}} \]

This number is often referred to as the standard score (especially when $\hat{l}$ is the sample average and $\hat{s}$ is the standard deviation). It tells us how far away a particular data point is from the bulk of the data (as indicated by the location parameter), in units of scale (which is why we divide by $\hat{s}$).

Note that negative standard scores correspond to data to the left of (*i.e*. smaller than) the bulk of the other measurements, and positive scores indicate data to the right of (greater than) the centre of data, as localised by $\hat{l}$.
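As a sketch, the score is a one-line function of the data point and the two estimates. The name `standard_score` and the example numbers are hypothetical; the check for a zero scale is my own addition, since the ratio is undefined in that case (which can happen, for instance, when the MAD is 0).

```python
def standard_score(x, location, scale):
    """Distance of x from the location estimate, in units of scale."""
    if scale == 0:
        raise ValueError("scale estimate is zero; the score is undefined")
    return (x - location) / scale

# A point below the centre gets a negative score...
below = standard_score(7.0, location=10.0, scale=1.5)

# ...and a point above the centre gets a positive one.
above = standard_score(14.5, location=10.0, scale=1.5)

print(below, above)
```

Here the first point sits 2 units of scale to the left of the centre and the second 3 units to the right.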

**What does all this have to do with anomaly detection?**

As we have seen, the standard score can give us information such as: “this datum is 3 units of scale smaller than the centre of the bulk of other data”. In this particular example, this would perhaps indicate that the datum in question is unusually small in value.

Extreme standard scores have a probabilistic interpretation as well. Almost by definition, only very few observations will have very high or very low standard scores: these are the values that lie “far away” from the central bulk of data, in the “tails” of the distribution.

If the data come from the **normal distribution**, then this connection is particularly simple – it can be shown that approximately $99.7\%$ of all observations will have a standard score between $-3$ and $3$. This is the so-called $3\sigma$ rule.
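The $3\sigma$ figure is easy to verify numerically. The sketch below uses `statistics.NormalDist` from the standard library (Python 3.8 or later) and assumes scores computed with the true mean and standard deviation of a normal distribution; the helper name `coverage` is mine.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def coverage(k):
    """Probability that a standard normal value lies between -k and k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sigma: {coverage(k):.4f}")
```

The three probabilities come out at about 68.27%, 95.45% and 99.73%.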

In the case of other probability distributions, the connection is a little harder to analyse, especially when it is not known which distribution the data come from. In that case, heuristics can be used, such as starting with the obtained sample, computing the standard scores of all data, and then looking at the distribution of the standard score values. For example, if we notice that $99\%$ of standard scores so computed seem to be between $-4$ and $5$, then we could use those values as cut-offs, indicating “extreme” data, in a way similar to the $3\sigma$ rule.
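One way to sketch this heuristic is to take empirical quantiles of the computed scores. Everything here is an assumption made for illustration: the data are simulated from a skewed (exponential) distribution, the scores use the median and MAD, and the cut-offs are the 0.5% and 99.5% quantiles, so that roughly 99% of scores fall between them.

```python
import random
from statistics import median, quantiles

def robust_scores(sample):
    """Standard scores using the median as location and the MAD as scale."""
    centre = median(sample)
    scale = median(abs(v - centre) for v in sample)
    return [(v - centre) / scale for v in sample]

# Hypothetical skewed data: 1000 simulated waiting times.
rng = random.Random(0)
sample = [rng.expovariate(1.0) for _ in range(1000)]

scores = robust_scores(sample)

# With n=200, the first and last cut points are the 0.5% and 99.5% quantiles.
cut_points = quantiles(scores, n=200)
a, b = cut_points[0], cut_points[-1]

coverage = sum(a <= s <= b for s in scores) / len(scores)
print(f"a = {a:.2f}, b = {b:.2f}, coverage = {coverage:.3f}")
```

Because the simulated data have a long right tail, the upper cut-off ends up much further from zero than the lower one, which is the kind of asymmetry a fixed $\pm 3$ rule would miss.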

All this suggests a procedure for anomaly testing. We know that only few data from a particular distribution will have an extreme standard score. Furthermore, we can utilise a rule or a heuristic to determine exactly what extreme means: usually it will be a variation of “given that data come from distribution $D$, $x\%$ of them will have a standard score between $a$ and $b$”, where $x$ would be a number arbitrarily close to $100\%$.

It is important to note that $x$ is a free parameter – we decide how to set it and, concomitantly, what does and what does not constitute an extreme standard score. Setting $x$ to a specific value gives us the lower and upper bounds for the standard score values – $a$ and $b$, respectively.

Assuming we have done all this, we then observe a new data point and proceed to compute its standard score. If the value of its score seems to be similar to the other data we have collected (that is, it falls between $a$ and $b$), then no harm done. If, on the other hand, it looks extreme, this means one of two things has taken place:

either we have witnessed an event of low probability – the datum is unlikely to be observed, but still possible; or

the new datum with an extreme standard score is somehow anomalous – it does not seem to come from the same distribution as the other data.

It may now be more evident how choosing the value of $x$ affects our anomaly detection procedure – $x$ determines the probability of **type I error**, also commonly known as a **false positive**.

To make things more concrete, assume $x$ is $99\%$. Also, assume we have calculated the standard score of a new point and determined that it falls outside of the “normal” interval, determined by $a$ and $b$. As noted above, this kind of event may be due to the new datum being anomalous in some way, but it might also be due to random chance – in fact, we should see standard scores at least as extreme as this $1\%$ of the time! This is precisely because we have set $x$ to be $99\%$ – in doing so, we have determined an interval in which standard scores take the value in $99\%$ of cases.

By setting $x$, we have thus agreed to a kind of compromise – $1\%$ of the time, our anomaly detector will “light up” and declare an ordinary point abnormal. In other words, it will give us a false positive. The other $99\%$ of the time, an ordinary point will fall between $a$ and $b$ and be correctly left alone, so when the detector does light up, we have reasonable grounds to classify the point as anomalous.
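A quick simulation illustrates the false positive rate. This is only a sketch under strong assumptions: the ordinary data are standard normal, their standard scores are the values themselves, and $a$ and $b$ are chosen symmetrically so that $99\%$ of the probability lies between them.

```python
import random
from statistics import NormalDist

x = 0.99

# Symmetric bounds that contain a fraction x of a standard normal.
b = NormalDist().inv_cdf(0.5 + x / 2)
a = -b

# Draw only ordinary points, so every alarm is a false positive.
rng = random.Random(42)
n_points = 100_000
false_positives = sum(
    1 for _ in range(n_points) if not (a <= rng.gauss(0.0, 1.0) <= b)
)
false_positive_rate = false_positives / n_points

print(f"bounds: [{a:.3f}, {b:.3f}]")
print(f"false positive rate: {false_positive_rate:.4f}")
```

Even though none of the simulated points is anomalous, roughly one in a hundred of them lands outside the interval.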

But why, you may ask, would we not want to set $x$ to be as large as possible?

Well, for one thing, making $x$ larger also makes the interval between $a$ and $b$ larger – the more probability mass we wish to encompass, the larger the interval of values needs to become, in order to collect all the unlikely, but possible measurements.

However, if we make the interval too big in such a way, we risk never detecting any anomalies. What we are in essence saying by making $x$ ridiculously large, say something like $99.9999\%$, is that even extraordinarily unlikely standard scores do not indicate anything out of the ordinary. Thus, even when we DO encounter an anomalous data point, our test will treat its standard score as a non-extreme, albeit very unlikely, value. The anomaly will remain undetected, and we will have made a **type II error**, *i.e*. we will have gotten a false negative.

This, in a nutshell, is the trade-off we have to consider: either we make peace with a high false positive rate, which also makes actually detecting anomalies in the data more likely, OR we decrease the number of false positives, while also decreasing our anomaly-detecting power.
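To see the trade-off in numbers, the sketch below computes symmetric bounds for several values of $x$, again assuming that standard scores of ordinary data follow a standard normal distribution. The helper `symmetric_bounds` and the example score of 4.5 are hypothetical.

```python
from statistics import NormalDist

def symmetric_bounds(x):
    """Bounds (a, b) that contain a fraction x of a standard normal."""
    tail = (1.0 - x) / 2.0
    b = NormalDist().inv_cdf(1.0 - tail)
    return -b, b

levels = (0.95, 0.99, 0.997, 0.999999)
bounds = {x: symmetric_bounds(x) for x in levels}

for x, (a, b) in bounds.items():
    print(f"x = {x}: a = {a:.3f}, b = {b:.3f}")

def is_flagged(score, x):
    """True if the score falls outside the interval for this x."""
    a, b = bounds[x]
    return not (a <= score <= b)

# A hypothetical anomalous point with a standard score of 4.5.
flagged_at_99 = is_flagged(4.5, 0.99)
flagged_at_six_nines = is_flagged(4.5, 0.999999)
print(flagged_at_99, flagged_at_six_nines)
```

The interval widens steadily as $x$ grows. A score of 4.5 is flagged when $x$ is $99\%$, but slips through unnoticed when $x$ is $99.9999\%$: a false negative.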

**An example in anomaly detection**

Finally, let us once again consider our temperature example. To make things slightly more realistic, let us say that the measured temperatures of our chair, table, wardrobe and oven are, in degrees centigrade,

20, 21, 21.5, 180.

Our sample average is 60.625 and the standard deviation is 68.923; the median, however, is 21.25 and the MAD is 0.75.

Let us now compute standard scores for each data point.

First, we use the sample average and standard deviation and obtain

-0.589, -0.575, -0.568, 1.732.

The oven has a higher score than the other three observations but, if we were using a $3\sigma$ rule inspired cut-off, the oven would not be detected as an anomaly.

On the other hand, using the median and MAD, we get the following scores

-1.667, -0.333, 0.333, 211.667.

The difference is pretty evident – the oven has a standard score two orders of magnitude higher than the other data. It obviously stands out, which was not the case when we used non-robust estimators for location and scale.
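The whole example fits in a few lines of Python. In this sketch I use `statistics.pstdev`, which divides by the number of observations; that convention is an assumption on my part, chosen because it matches the standard deviation quoted above.

```python
from statistics import mean, median, pstdev

# Chair, table, wardrobe and oven, in degrees centigrade.
temperatures = [20, 21, 21.5, 180]

# Classical estimates of location and scale.
average = mean(temperatures)
std_dev = pstdev(temperatures)
classical_scores = [(t - average) / std_dev for t in temperatures]

# Robust estimates of location and scale.
centre = median(temperatures)
mad = median(abs(t - centre) for t in temperatures)
robust_scores = [(t - centre) / mad for t in temperatures]

print("classical:", [round(s, 3) for s in classical_scores])
print("robust:   ", [round(s, 3) for s in robust_scores])

# Which points would a 3-sigma style cut-off flag?
flagged_classical = [t for t, s in zip(temperatures, classical_scores) if abs(s) > 3]
flagged_robust = [t for t, s in zip(temperatures, robust_scores) if abs(s) > 3]
print(flagged_classical, flagged_robust)
```

With a cut-off of 3 on the absolute score, the classical version flags nothing, while the robust version flags only the oven.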

**A word or two about robustness**

When speaking about estimating location and scale, I mentioned the median and the MAD as robust alternatives to the sample average and standard deviation.

This is not to say that the latter two should never be used; indeed, if you have reason to believe that your data come from a Gaussian distribution, or a distribution in which values far away from the central mass of data are quite unlikely (*i.e*. distributions with a “light tail”) there is almost no reason not to use the classical estimators.

However, that being said, in most practical circumstances there are few drawbacks to using the median and MAD as measures of location and scale.

One can argue, for example, that the median does not take into account the precise values of the observations and, therefore, does not use all of the information present in the data. On the other hand, this is exactly the reason it is insensitive to outliers.

The MAD itself is not without disadvantages. By its very definition, it can be seen to give the same weight to deviations on either side of the median; it makes no distinction as to whether an observation is to the left or right of the median, which makes little sense in the case of asymmetric distributions.

There are, of course, quite a few estimators out there and which is the best choice for a particular problem must be determined on a case-by-case basis. After all, we at Adverai chose to use the median and MAD because they provide us with robust, informative results.

**Conclusion**

The anomaly detection procedure I have described is a slight variation on a classic statistical approach for determining outliers, based on using extreme values of standard scores. The small, but important, change we have made is to consider the median and MAD, respectively, as the estimator of location and scale in the standard score formula.

It may not be very fancy, but it gives us valuable insight, **at a low computational cost.**

In addition to its vaunted robustness, a significant advantage of this approach is **interpretable anomaly scores**, which are easy to calculate, even in a data-streaming setting.
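As one possible illustration of the streaming case, here is a sketch that scores each incoming point against a sliding window of recent observations. This is not a description of any particular production system: the class name, the window size and the cut-offs are all hypothetical, and recomputing the median over the window at every step is the simplest option rather than the most efficient one.

```python
from collections import deque
from statistics import median

class SlidingWindowDetector:
    """Hypothetical detector that scores new points against recent data."""

    def __init__(self, window_size=100, lower=-3.0, upper=3.0):
        self.window = deque(maxlen=window_size)  # oldest points drop out
        self.lower = lower
        self.upper = upper

    def score(self, x):
        """Robust standard score of x, or None if it cannot be computed."""
        if len(self.window) < 2:
            return None
        centre = median(self.window)
        scale = median(abs(v - centre) for v in self.window)
        if scale == 0:
            return None
        return (x - centre) / scale

    def observe(self, x):
        """Score x, add it to the window, and report if it looked anomalous."""
        s = self.score(x)
        self.window.append(x)
        return s is not None and not (self.lower <= s <= self.upper)

detector = SlidingWindowDetector(window_size=50, lower=-5.0, upper=5.0)
readings = [20, 21, 21.5, 20.5, 22, 21, 180]
flags = [detector.observe(r) for r in readings]
print(flags)
```

Only the last reading is flagged. Note that this sketch adds every point to the window, anomalous or not; whether flagged points should be kept out of the window is a design choice that depends on the application.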

**Determining the cut-offs for what is considered an anomalous standard score, however, is something of an art and requires tweaking to get just right.**