- Overview of Anomaly detection
- Introduction of methods
- Classic algorithms provided by Scikit-learn

- Task of discerning unusual samples in data
- Process of identifying unexpected observations or events in data
- (usually treated as an unsupervised learning problem)

- Abnormal or anomalous observation, abnormality, outlier, novelty, etc.

- Actually, not easy to define, and subjective
- Samples that do not fit a general, well-defined normal pattern
- Simply: few and different samples
- In this lecture, we use only two labels (normal and abnormal)

- Anomalies are hard to define
- Inaccurate boundaries between the outlier and normal behavior
- Labeled data might be hard (or even impossible) to obtain
- Imbalanced data set
- Noise in the data which mimics real outliers and therefore makes it challenging to distinguish and remove them

Image source: Where is Wally?

- Anomalies can be classified into three types:

- Point anomalies: represent an irregularity or deviation that happens randomly and may have no particular interpretation

- Contextual anomalies: identified by considering both contextual and behavioural features

- Collective anomalies: each of the individual points in isolation appears as a normal data instance, while the points observed as a group exhibit unusual characteristics

- Anomaly detection can be divided into supervised, unsupervised and semi-supervised learning

- Supervised: a labeled sample consisting of both normal and anomalous examples

- the data set contains few anomalies

- Unsupervised: an unlabeled sample that is contaminated with abnormal instances; an estimate of the expected ratio of anomalies is often known in advance

- Semi-supervised: samples only from the normal class of instances; the goal is to construct a classifier capable of detecting out-of-distribution abnormal instances

Outlier detection

- The training data contains outliers that are far from the others
- Outlier detection estimators try to fit the regions where the training data is most concentrated

Novelty detection

- The training data is not polluted by outliers, and we are interested in detecting whether a new observation is an outlier (called a novelty)
- Assume that all observations in the training data are normal

In outlier detection (unlike novelty detection), there may be normal samples in the training data that are far from the other points.


- Data logs and process logs
- Fraud detection and intrusion detection
- Security and surveillance
- Fake news and information, social networks
- Health care analysis and medical diagnosis
- Data sources of transactions
- Sensor networks and databases
- Data quality and data cleaning
- Time series monitoring and data streams
- Internet of things


- Parametric and non-parametric methods
- Data points are modeled using a well-known stochastic distribution
- Robust covariance, Gaussian mixture model (GMM), kernel density estimation (KDE)
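As a quick sketch of the density-estimation route, scikit-learn's `KernelDensity` can score every sample by its estimated density; in the toy setup below (the far-away point and the bandwidth are illustrative assumptions) the planted point lands in the lowest-density region:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# 100 inliers around the origin plus one planted far-away point (toy assumption)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])

# Fit a Gaussian KDE; the bandwidth is chosen by hand for this sketch
kde = KernelDensity(kernel="gaussian", bandwidth=0.8).fit(X)
log_density = kde.score_samples(X)  # log-density at every sample

# The planted point (index 100) sits in the lowest-density region
print(int(np.argmin(log_density)))  # 100
```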

- Outliers can be found in low-density regions, whereas inliers are assumed to appear in dense neighborhoods
- Consider the neighborhood of an object, defined by a given radius


- Detect outliers by computing the distances between points
- A point that is far from its nearest neighbors is regarded as an outlier
- So not robust to scale variance
- Local Outlier Factor (LOF)
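A minimal LOF sketch with scikit-learn, using a toy data set with one planted far-away point (the data and the `contamination` value are illustrative assumptions):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# 100 inliers around the origin plus one planted outlier (toy assumption)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])

# fit_predict labels each training sample: +1 inlier, -1 outlier
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)
print(np.where(labels == -1)[0])  # indices flagged as outliers
```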

- Use clustering methods to describe the behavior of the data
- Points in smaller clusters are deemed outliers

- Assume that we know the labels of the test data $X$ (essential)
- Evaluate a binary classifier (decision function) $$ h(x): X \longrightarrow \{ 0, 1\} $$
- WLOG, label 1 means outlier (positive) and label 0 means inlier (negative)
- The best classifier satisfies

$$ h(x) = 1 \text{, whenever } x \text{ is an outlier} $$

and

$$ h(x) = 0 \text{, whenever } x \text{ is an inlier} $$

- Example
- There are 9990 normal observations but only 10 anomalous observations

| | Positive prediction | Negative prediction | Total |
|---|---|---|---|
| Outlier (positive) | 0 | 10 | 10 |
| Inlier (negative) | 0 | 9990 | 9990 |
| Total | 0 | 10000 | 10000 |

- Baseline of accuracy: the trivial classifier that always predicts "inlier" already achieves

$$ \text{Accuracy} = \frac{N}{P+N} = \frac{9990}{10000} = 0.999, $$

where $N$: the number of negative observations and $P$: the number of positive observations

| | Positive prediction | Negative prediction |
|---|---|---|
| Observed positive | True positive (TP) | False negative (FN) |
| Observed negative | False positive (FP) | True negative (TN) |

- TP: instance is positive and is classified as positive
- FN: instance is positive but is classified as negative
- TN: instance is negative and is classified as negative
- FP: instance is negative but is classified as positive

- Confusion matrix

| | Positive prediction | Negative prediction |
|---|---|---|
| Observed positive | True positive (TP) | False negative (FN) |
| Observed negative | False positive (FP) | True negative (TN) |

- Evaluation metric

| Prediction based | Label based |
|---|---|
| Positive predictive value $= \frac{TP}{TP + FP}$ | True positive rate $= \frac{TP}{P}$ |
| Negative predictive value $= \frac{TN}{TN+FN}$ | True negative rate $= \frac{TN}{N}$ |
| False positive rate $= \frac{FP}{N}$ | |
| Accuracy $=\frac{TP + TN}{P+N}$ | |

- Example

| | Positive prediction | Negative prediction | Total |
|---|---|---|---|
| Outlier (positive) | 3 | 15 | 18 |
| Inlier (negative) | 2 | 80 | 82 |
| Total | 5 | 95 | 100 |

- Evaluation metric

| Prediction based | Label based |
|---|---|
| Positive predictive value $= \qquad\qquad$ | True positive rate $= \qquad\qquad$ |
| Negative predictive value $= \qquad\qquad$ | True negative rate $= \qquad\qquad$ |
| False positive rate $= \qquad\qquad$ | |
| Accuracy $= \qquad\qquad$ | |
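These metrics are easy to check numerically; the sketch below rebuilds the 100-sample example above with scikit-learn (label 1 = outlier, label 0 = inlier):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Ground truth from the example: 18 outliers (1) and 82 inliers (0)
y_true = np.array([1] * 18 + [0] * 82)
# Predictions: 3 of the outliers and 2 of the inliers were flagged positive
y_pred = np.array([1] * 3 + [0] * 15 + [1] * 2 + [0] * 80)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
ppv = tp / (tp + fp)        # positive predictive value
tpr = tp / (tp + fn)        # true positive rate
fpr = fp / (fp + tn)        # false positive rate
acc = accuracy_score(y_true, y_pred)
print(tp, fn, fp, tn, ppv, acc)
```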

Assume higher scores indicate that samples are more likely to be outliers, e.g. $s(x)$ indicates the estimated probability that the sample $x$ is an outlier.

Here $X_{out}$ is the collection of outliers in $X$ and $X_{in} = X\setminus X_{out}$ (we assumed that we know the labels of the test data).

Once such a scoring function has been learned (obtained), a classifier can be constructed by a threshold $\lambda \in \mathbb R$:

$$ h^\lambda (x):=\begin{cases} 1 & s(x) \geq \lambda \\ 0 & s(x)< \lambda \end{cases} $$

- True positive rate, $\mathrm{TPR} = \frac{TP}{P}$

- False positive rate, $\mathrm{FPR} = \frac{FP}{N}$

- ROC graph: a 2-dim graph in which **tp rate** is plotted on the y-axis and **fp rate** is plotted on the x-axis
- Performance metric to measure the quality of the tradeoff of a **scoring function** $s(x)$: the tradeoff between benefits (true positive rate) and costs (false positives, or false alarm rate)

The classifier $h^\lambda(x)$ is determined by the threshold $\lambda$, so each $\lambda$ yields one point in the ROC graph

Simply, for a scoring function $s(x)$:

- $(0,0)$: the strategy of never issuing a positive classification; such a classifier commits no false positive errors (but also gains no true positives)
- $(0,1)$: perfect classification
- $(1,1)$: unconditionally issuing positive classifications

- If a classifier guesses the positive class $t\%$ of the time, it can be expected to get $t\%$ of the positives correct, but its false positive rate will be $t\%$ as well $(0\leq t \leq 100)$, yielding $(t,t)$ in ROC space
- The diagonal line $y=x$ represents the strategy of randomly guessing a class
- A classifier below the diagonal may be said to have useful information, but it is applying the information incorrectly

- Informally, points to the northwest are better than others (**tp rate** is higher and **fp rate** is lower)
- Classifiers with low **fp rate** may be thought of as "conservative": they make positive classifications only with strong evidence, so they make few false positive errors (but often have low true positive rates)
- Classifiers with high **fp rate** may be thought of as "liberal": they make positive classifications with weak evidence, so they classify nearly all positives correctly (but at the cost of high false positive rates)

- Area under the ROC curve(AUROC) is a single scalar value representing expected performance
- Baseline of AUROC $= 0.5$, the area under line $y=x$
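A small sketch with made-up scores (purely illustrative) showing how `roc_curve` traces one $(\mathrm{FPR}, \mathrm{TPR})$ point per threshold and how `roc_auc_score` summarizes the whole curve into one scalar:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up scores: outliers (label 1) tend to score higher than inliers (label 0)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.2, 0.2, 0.3, 0.35, 0.8, 0.4, 0.7, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (fpr, tpr) point per threshold
auc = roc_auc_score(y_true, scores)
print(auc)  # 16 of the 18 (negative, positive) pairs are ranked correctly: 16/18
```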

- Statistical based approach (parametric)
- (Assumption) Observations are sampled from an **elliptically symmetric unimodal distribution** (e.g. multivariate normal distribution)
- In other words, the density function $f(\mathbf x)$, $\mathbf x\in\mathbb R^p$, can be written as

$$ f(\mathbf x) = \frac{1}{\sqrt{\det \Sigma}}\, g\big(d^2(\mathbf x, \mu, \Sigma)\big) $$

- where $d(\mathbf x, \mu, \Sigma) = \sqrt{(\mathbf x- \mu)^t \Sigma^{-1} (\mathbf x- \mu)}$ and $g$ is a strictly decreasing real function
- Unknown parameters: location $\mu \in \mathbb R^p$ and a positive definite $p\times p$ scatter matrix $\Sigma$

For instance, the density function of the multivariate normal distribution:

$$ f(\mathbf x) = \frac{1}{\sqrt{(2\pi)^p \det \Sigma}} \exp\Big( -\frac{1}{2} d^2(\mathbf x, \mu, \Sigma) \Big) $$

- i.e. $g(t)= \frac{1}{\sqrt{(2\pi)^p}} \exp(-\frac{t}{2})$
- The statistical distance $d(\cdot, \mu, \Sigma)$ represents how far away $\mathbf x$ is from the location $\mu$ relative to the scatter $\Sigma$

- Using the whole set of observations, estimate the location $\mu$ and scatter $\Sigma$ as:

$$ \mu:=\bar{\mathbf x}:= \frac{1}{N} \sum_{i=1}^N \mathbf x_i, \text{ (sample mean)} $$

and $$ \Sigma:=cov(X):= \frac{1}{N} \sum_{i=1}^N (\mathbf x_i - \bar{\mathbf x})(\mathbf x_i - \bar {\mathbf x})^t, \text{ (sample covariance)} $$

- In this case, the statistical distance is called the Mahalanobis distance (MD)

- Because of the masking effect (outliers pull the sample mean toward themselves and inflate the sample covariance, so their own distances look small), this is not a good choice

- We need reliable estimators that can resist outliers when they occur, i.e. we need the ellipse that is smaller and only encloses the regular points
- Try to remove the masking effect, noise, or outliers to yield a "pure" subset
- Idea: find some (representative) observations whose empirical covariance has the smallest determinant

- Find $h$ (a fixed number of) observations s.t. the determinant of their covariance matrix is as small as possible
- The larger $|\Sigma|$, the more dispersed the observations
- The robust distance (RD) is defined by

$$ RD(\mathbf x) := d(\mathbf x, \hat{\mu}_{MCD}, \hat{\Sigma}_{MCD}) $$

- $\hat{\mu}_{MCD}$: MCD estimate of location (from the $h$ observations)
$\hat{\Sigma}_{MCD}$: MCD covariance estimate (from the $h$ observations)

Generally, $\binom{N}{h}$ is too many subsets to check exhaustively, so we need something smarter...
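Before the formal algorithm, here is what MCD buys in practice: a sketch (with planted outliers, purely illustrative) comparing squared Mahalanobis distances from the classical estimates against squared robust distances from scikit-learn's `MinCovDet`:

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.default_rng(0)
# 95 inliers plus 5 planted outliers (illustrative assumption)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(8, 0.5, (5, 2))])

md2 = EmpiricalCovariance().fit(X).mahalanobis(X)  # squared MD (classical)
rd2 = MinCovDet(random_state=0).fit(X).mahalanobis(X)  # squared RD (robust)

# Robust distances separate the planted outliers far more sharply,
# because the MCD estimates are not pulled toward the outliers
print(md2[-5:].min() / md2[:95].max(), rd2[-5:].min() / rd2[:95].max())
```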

Consider a data set $X:=\{ \mathbf x_1, ... ,\mathbf x_N \}$ of $p$-variate observations. Let $H_1 \subset \{1,...,N\}$ with $|H_1|=h$ and put

$$ T_1:= \frac{1}{h} \sum_{i\in H_1} \mathbf x_i, \quad S_1:= \frac{1}{h} \sum_{i\in H_1} (\mathbf x_i - T_1)(\mathbf x_i- T_1)^t. $$

If $\det (S_1) \neq 0$, then define the relative distances

$$ d_1(i):= \sqrt{(\mathbf x_i - T_1)^t S_1^{-1}(\mathbf x_i- T_1)}, \text{ for } i=1,...,N. $$

Sort these $N$ distances from the smallest, $d_1(i_1)\leq d_1(i_2)\leq \cdots \leq d_1(i_N)$; we obtain the ordered tuple $(i_1, i_2, ...,i_N)$ (which is some permutation of $(1,2,...,N)$). Let $H_2:=\{i_1,...,i_h\}$ and compute $T_2$ and $S_2$ based on $H_2$. Then

$$ \det(S_2) \leq \det(S_1), $$

with equality if and only if $T_2=T_1$ and $S_2=S_1$.

*Proof.* Assume that $\det(S_2)>0$, otherwise the result is already satisfied. We can thus compute $d_2(i)=d_{(T_2, S_2)}(i)$ for all $i=1,...,N$. Using $|H_2|=h$ and the definition of $(T_2, S_2)$ we find

$$\begin{equation*} \tag{A.1} \frac{1}{hp} \sum_{i\in H_2} d_2^2(i) = 1. \label{A.1} \end{equation*}$$

Moreover, put

$$\begin{equation*} \tag{A.2} \lambda:= \frac{1}{hp} \sum_{i\in H_2} d_1^2(i) = \frac{1}{hp} \sum_{k=1}^h d_1^2(i_k)\leq \frac{1}{hp} \sum_{j\in H_1} d_1^2(j) =1, \label{A.2} \end{equation*}$$

where $\lambda>0$ because otherwise $\det(S_1)=0$. Combining (\ref{A.1}) and (\ref{A.2}) yields

$$ \frac{1}{hp} \sum_{i\in H_2} d^2_{(T_1,\lambda S_1)}(i) = \frac{1}{hp} \sum_{i\in H_2}(\mathbf x_i -T_1)^t \frac{1}{\lambda}S_1^{-1}(\mathbf x_i - T_1) = \frac{1}{\lambda hp} \sum_{i\in H_2}d_1^2(i) =\frac{\lambda}{\lambda}=1. $$

Grübel (1988) proved that $(T_2, S_2)$ is the unique minimizer of $\det(S)$ among all $(T,S)$ for which $\frac{1}{hp}\sum_{i\in H_2} d^2_{(T,S)}(i) = 1$. This implies that $\det(S_2)\leq \det(\lambda S_1)$. On the other hand, it follows from the inequality (\ref{A.2}) that $\det(\lambda S_1)\leq \det(S_1)$, hence

$$\begin{equation*} \tag{A.3} \det(S_2)\leq \det(\lambda S_1)\leq \det(S_1). \label{A.3} \end{equation*}$$

Moreover, note that $\det(S_2)=\det(S_1)$ if and only if both inequalities in (\ref{A.3}) are equalities. For the first, we know from Grübel's result that $\det(S_2)=\det(\lambda S_1)$ if and only if $(T_2, S_2)= (T_1,\lambda S_1)$. For the second, $\det(\lambda S_1) = \det(S_1)$ if and only if $\lambda = 1$, i.e. $S_1 = \lambda S_1$. Combining both yields $(T_2, S_2) = (T_1, S_1)$. $\square$

- Fix $h$ s.t. $[(N+p+1)/2] \leq h \leq N$ and $h>p$
- Given the $h$-subset $H_{old}$ and its $(T_{old}, S_{old})$
- Compute $d_{old}(i)$ for $i=1,...,N$ and sort these distances, which yields a permutation $\pi$ for which

$$ d_{old}(\pi(1)) \leq d_{old}(\pi(2)) \leq \cdots \leq d_{old}(\pi(N)) $$

- Let

$$ H_{new} := \{ \pi(1), \pi(2), ... , \pi(h) \} $$

- Compute

$$ T_{new}:= \frac{1}{h} \sum_{i\in H_{new}} \mathbf x_i, \quad S_{new}:= \frac{1}{h} \sum_{i\in H_{new}} (\mathbf x_i - T_{new})(\mathbf x_i- T_{new})^t $$
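The C-step above can be sketched directly in NumPy; by the theorem, $\det(S)$ never increases from one iteration to the next (the toy data with planted outliers is illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
# 95 inliers plus 5 planted outliers (toy assumption)
X = np.vstack([rng.normal(0, 1, (95, 2)), rng.normal(8, 0.5, (5, 2))])
N, p = X.shape
h = (N + p + 1) // 2  # subset size

def c_step(X, H):
    """Estimate (T, S) on H, then return the h indices with smallest distances."""
    T = X[H].mean(axis=0)
    D = X[H] - T
    S = D.T @ D / len(H)
    # squared statistical distances of ALL points w.r.t. (T, S)
    d2 = np.einsum('ij,jk,ik->i', X - T, np.linalg.inv(S), X - T)
    return np.argsort(d2)[:h], np.linalg.det(S)

H = rng.choice(N, size=h, replace=False)  # random initial h-subset
dets = []
for _ in range(50):
    H_new, det = c_step(X, H)
    dets.append(det)
    if set(H_new) == set(H):  # fixed point: det can no longer decrease
        break
    H = H_new

print(dets)  # non-increasing sequence of determinants
```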

- One-class SVM is a variant of the SVM (support vector machine)
- We have to understand the SVM and the kernel method (~~however, not easy~~)!

- Data matrix $X$: a collection of $\mathbf x_i \in \mathbb R^p$ with labels $y_i= 1$ or $-1$
- SVM is a supervised learning method for classification (a non-probabilistic binary linear classifier)
- Aim: find a $(p-1)$-dimensional hyperplane separating the two classes

- When $\mathbf x$ is on or above this boundary with label 1,

$$ \mathbf w \cdot \mathbf x - b \geq 1 $$

- When $\mathbf x$ is on or below this boundary with label $-1$,

$$ \mathbf w \cdot \mathbf x - b \leq -1 $$

- Maximize $\frac{2}{\| \mathbf w \|}$ (i.e. minimize $\| \mathbf w \|$) with the following constraints

$$ \mathbf w \cdot \mathbf x_i - b \geq 1, \quad \text{ if } y_i =1 $$

and

$$ \mathbf w \cdot \mathbf x_i - b \leq -1, \quad \text{ if } y_i =-1 $$

- Rewrite both constraints as one:

$$ y_i (\mathbf w \cdot \mathbf x_i - b) \geq 1, \quad \text{ for all } i=1,...,N $$

- Hinge loss

$$ \min_{\mathbf w, b} \; \lambda \| \mathbf w\|^2 + \frac{1}{N} \sum_{i=1}^N \max\big(0,\, 1- y_i(\mathbf w \cdot \mathbf x_i - b)\big) $$

- $\lambda$: role of trade-off between margin size and ensuring that $\mathbf x_i$ lie on the correct side of the margin
- High $\lambda$ ~ underfitting (maximize margin)
- Low $\lambda$ ~ overfitting

- Note that $\zeta_i:= \max(0, 1- y_i(\mathbf w \cdot \mathbf x_i - b))$ is the smallest nonnegative number $t\geq 0$ satisfying

$$ y_i (\mathbf w \cdot \mathbf x_i - b) \geq 1 - t $$

- So rewrite the hinge loss as a quadratic optimization

$$ \min_{\mathbf w, b, \zeta} \; \lambda \|\mathbf w\|^2 + \frac{1}{N}\sum_{i=1}^N \zeta_i \quad \text{s.t. } y_i(\mathbf w \cdot \mathbf x_i - b) \geq 1 - \zeta_i, \; \zeta_i \geq 0 $$

- By solving for the Lagrangian dual of the primal problem, we obtain the simplified problem

$$ \max_{c} \; \sum_{i=1}^N c_i - \frac{1}{2} \sum_{i=1}^N\sum_{j=1}^N y_i c_i (\mathbf x_i \cdot \mathbf x_j) y_j c_j \quad \text{s.t. } \sum_{i=1}^N c_i y_i = 0, \; 0 \leq c_i \leq \frac{1}{2N\lambda} $$

- Here the variables $c_i$ are defined such that

$$ \mathbf w = \sum_{i=1}^N c_i y_i \mathbf x_i $$
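A tiny linearly separable example (the data is made up for illustration) with scikit-learn's `LinearSVC`, whose `C` parameter plays the inverse role of $\lambda$ (large `C` $\approx$ small $\lambda$, i.e. a harder margin):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Made-up, clearly linearly separable data
X = np.array([[2.0, 2.0], [2.5, 3.0], [3.0, 2.5],
              [-2.0, -2.0], [-2.5, -3.0], [-3.0, -2.5]])
y = np.array([1, 1, 1, -1, -1, -1])

# Large C ~ small lambda: prioritize classifying every point correctly
clf = LinearSVC(C=10.0).fit(X, y)
pred = clf.predict(np.array([[4.0, 4.0], [-4.0, -4.0]]))
print(pred)  # [ 1 -1 ]
```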

- Generally, data is not linearly separable
- Original maximum margin hyperplane classifier (1963)
- Kernel method (1992)

For instance,

- Feature map $\varphi: X \longrightarrow \cal F \subset \mathbb R^3$ mapping the 2-dim data set to a 3-dim feature space $\cal F$, e.g. the quadratic map $\varphi(x_1, x_2):=(x_1^2, \sqrt{2}\, x_1 x_2, x_2^2)$

- Linear SVM classifier in this feature space $\cal F$
- A hyperplane in the feature space $\cal F \subset \mathbb R^3$ can be determined by $\mathbf w=(w_1, w_2, w_3)$ and $b\in \mathbb R$ and expressed as

$$ \mathbf w \cdot \varphi(\mathbf x) - b = 0 $$

- This is nonlinear in $\mathbb R^2$
- But it is linear in the feature space $\cal F$ (in $\mathbb R^3$)

- Feature map $\varphi: X \longrightarrow \cal F \subset \mathbb R^s$
- Find a maximal (soft) margin hyperplane in the feature space $\cal F$
- Hinge loss (because we know the information of $\varphi$)

$$ \min_{\mathbf w, b} \; \lambda \| \mathbf w\|^2 + \frac{1}{N} \sum_{i=1}^N \max\big(0,\, 1- y_i(\mathbf w \cdot \varphi(\mathbf x_i) - b)\big) $$

- Here $\mathbf w \in \mathbb R^s$ and $\mathbf w \cdot \varphi(\mathbf x_i)$ is an inner product in $\mathbb R^s$

- So rewrite the hinge loss as a quadratic optimization

$$ \min_{\mathbf w, b, \zeta} \; \lambda \|\mathbf w\|^2 + \frac{1}{N}\sum_{i=1}^N \zeta_i \quad \text{s.t. } y_i(\mathbf w \cdot \varphi(\mathbf x_i) - b) \geq 1 - \zeta_i, \; \zeta_i \geq 0 $$

- By solving for the Lagrangian dual of the primal problem, we obtain the simplified problem

$$ \max_{c} \; \sum_{i=1}^N c_i - \frac{1}{2} \sum_{i=1}^N\sum_{j=1}^N y_i c_i \big(\varphi(\mathbf x_i) \cdot \varphi(\mathbf x_j)\big) y_j c_j \quad \text{s.t. } \sum_{i=1}^N c_i y_i = 0, \; 0 \leq c_i \leq \frac{1}{2N\lambda} $$

- Here the variables $c_i$ are defined such that

$$ \mathbf w = \sum_{i=1}^N c_i y_i \varphi(\mathbf x_i) $$

- Using a feature map and finding a hyperplane in the feature space seems intuitive and easy.
However, we have some problems:

- Unknown distribution of observations
- Choice of feature map, e.g. a random mapping $ \varphi: \mathbb R^2 \longrightarrow \mathbb R^\infty$

$$ \varphi(x_1, x_2):=(\sin(x_2), \exp(x_1+x_2), x_2, x_1^{\tan(x_2)},...) $$

- Generally, the dimension of the feature space may be high

- Lots of computation

- The feature mapping can be considered as a replacement of the inner product by a kernel function, e.g. the Gaussian kernel $k(\mathbf x, \mathbf y) := \exp(-\gamma \|\mathbf x - \mathbf y\|^2)$

- Question: can we not go the other way (implicitly)? Given a kernel function $k$,

- might there exist a map $\varphi$ s.t. $k(\mathbf x, \mathbf y) = \varphi(\mathbf x)\cdot \varphi(\mathbf y)$ and a feature space $\cal F$?
- Idea: choose a nice kernel function $k$ rather than an ugly (explicit) feature mapping
- We only need the values $\varphi(\mathbf x_i)$, $\forall i =1,...,N$, not an explicit formula for $\varphi$

- $X = \{\mathbf x_1, ... , \mathbf x_N \}$, $\mathbf x_i \in \mathbb R^p$, and $k:X \times X \longrightarrow [0, \infty)$ a positive semi-definite kernel
- Then we can construct a feature space $\cal F$ and a feature map $\varphi$
- Let $G$ be the $N\times N$ matrix with $G_{ij}:=k(\mathbf x_i, \mathbf x_j)$ (can be calculated). Then $G$ is symmetric PSD, so

$$ G = U \Sigma U^t, $$

where $U$ is orthogonal and $\Sigma = \mathrm{diag}(\sigma_1, ..., \sigma_N)$ with $\sigma_i \geq 0$; let $\mathbf u_i$ denote the $i$-th column of $U^t$

- Let $\varphi(\mathbf x_i):=\Sigma^{1/2} \mathbf u_i \in \mathbb R^m$, $i=1,...,N$
- Then such $\varphi(\mathbf x_i)$ leads to the Gram matrix $G$ because

$$ \varphi(\mathbf x_i) \cdot \varphi(\mathbf x_j) = \mathbf u_i^t \Sigma^{1/2} \Sigma^{1/2} \mathbf u_j = \mathbf u_i^t \Sigma \mathbf u_j = G_{ij} $$
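The construction above can be checked numerically: build a Gram matrix from a PSD kernel (here a Gaussian kernel with an arbitrarily chosen $\gamma$, and made-up data), eigendecompose it, and verify that the constructed features reproduce $G$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))  # made-up data set

# Gaussian (RBF) kernel; gamma is chosen arbitrarily for this sketch
def k(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

G = np.array([[k(xi, xj) for xj in X] for xi in X])  # Gram matrix

# Symmetric PSD => eigendecomposition G = U diag(s) U^t with s >= 0
s, U = np.linalg.eigh(G)
s = np.clip(s, 0, None)       # guard against tiny negative eigenvalues
Phi = U * np.sqrt(s)          # row i is the constructed feature phi(x_i)

# Inner products of the constructed features reproduce the Gram matrix
print(np.allclose(Phi @ Phi.T, G))  # True
```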

- Given a PSD kernel $k: X \times X \longrightarrow [0, \infty)$ and a finite data set $X = \{\mathbf x_1, ... , \mathbf x_N \}$, $\mathbf x_i \in \mathbb R^p$
- Then we can construct a feature map $\varphi:X \longrightarrow \cal F$ and a feature space $\cal F = span \{ \varphi(\mathbf x_i) : \mathbf x_i \in$ $X \}$
- We can find a maximal margin hyperplane in $\cal F$, i.e. $\mathbf w \in \cal F$

- By solving for the Lagrangian dual of the primal problem, we obtain the simplified problem

$$ \max_{c} \; \sum_{i=1}^N c_i - \frac{1}{2} \sum_{i=1}^N\sum_{j=1}^N y_i c_i k(\mathbf x_i, \mathbf x_j) y_j c_j \quad \text{s.t. } \sum_{i=1}^N c_i y_i = 0, \; 0 \leq c_i \leq \frac{1}{2N\lambda} $$

- Here the variables $c_i$ are defined such that

$$ \mathbf w = \sum_{i=1}^N c_i y_i \varphi(\mathbf x_i) $$

- We may not know the explicit formula of $\varphi$. But for a new sample $\mathbf x$, we know that

$$ \mathbf w \cdot \varphi(\mathbf x) = \sum_{i=1}^N c_i y_i \big(\varphi(\mathbf x_i)\cdot\varphi(\mathbf x)\big) = \sum_{i=1}^N c_i y_i \, k(\mathbf x_i, \mathbf x) $$

- Finds a maximum (soft) margin hyperplane in the feature space that best separates the mapped data from the origin

- Data matrix $X = \{\mathbf x_1, ... , \mathbf x_N \}$, $\mathbf x_i \in \mathbb R^p$
- OC-SVM solves the following primal problem (by quadratic programming)

$$ \min_{\mathbf w, \zeta, \rho} \; \frac{1}{2}\|\mathbf w\|^2 + \frac{1}{\nu N}\sum_{i=1}^N \zeta_i - \rho \quad \text{s.t. } \mathbf w \cdot \varphi(\mathbf x_i) \geq \rho - \zeta_i, \; \zeta_i \geq 0 $$

- By the Lagrangian dual of the primal problem,

$$ \min_{c} \; \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N c_i c_j k(\mathbf x_i, \mathbf x_j) \quad \text{s.t. } 0 \leq c_i \leq \frac{1}{\nu N}, \; \sum_{i=1}^N c_i = 1 $$

- A new sample $\mathbf x$ with $\mathbf w \cdot \varphi(\mathbf x) < \rho$ is deemed to be anomalous
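A minimal scikit-learn sketch (toy data and illustrative parameter choices): `OneClassSVM` is trained on assumed-normal samples and then labels far-away test points as anomalous:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0, 1, size=(200, 2))            # assumed-normal training data
X_test = np.vstack([rng.normal(0, 1, (5, 2)),        # points like the training data
                    rng.normal(6, 0.5, (5, 2))])     # planted far-away points

# nu upper-bounds the fraction of training points allowed outside the boundary
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)
pred = clf.predict(X_test)  # +1 inlier, -1 outlier
print(pred)
```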
