Anomaly Detection with Scikit-learn





https://mjgim.me/slides/AD2020/Anomaly_detection.slides.html

Minjung Gim(mjgim@nims.re.kr) at NIMS

2020

Ph.D. in Mathematics (Probability)

Senior research scientist at NIMS

Chief of ICIM

Research interests

  • Anomaly detection methods
  • Classification and clustering of imbalanced data
  • Mathematical data science

Goal

  • Overview of Anomaly detection
  • Introduction of methods
  • Classic algorithms provided by Scikit-learn

Anomaly detection (Korean: 이상 감지)

  • Task of discerning unusual samples in data
  • Process of identifying unexpected observations or events in data
  • (typically treated as an unsupervised learning problem)

Synonym of Anomaly

  • Abnormal or anomalous observation, abnormality, outlier, novelty, etc.

What are anomalies?

  • In practice, not easy to define precisely, and inherently subjective
  • Samples that do not conform to a general, well-defined normal pattern
  • Simply, few and different samples
  • In this lecture, we use only two labels (normal and abnormal)

Anomaly detection is complicated because:

  • Anomalies are hard to define
  • The boundary between outlying and normal behavior is imprecise
  • Labeled data might be hard (or even impossible) to obtain
  • Imbalanced data set
  • Noise in the data can mimic real outliers, which makes it challenging to distinguish and remove them

Masking effect: an outlier can be hidden by the presence of other adjacent outliers (colluders)

Image source: Where is Wally?

Type of anomalies

  • Anomalies can be classified into three types:

1. Point anomalies

  • represents an irregularity or deviation that happens randomly and may have no particular interpretation

2. Contextual(conditional) anomalies

  • identified by considering both contextual and behavioural features

3. Group(collective) anomalies

  • each individual point appears normal in isolation, while the points observed as a group exhibit unusual characteristics

Prior knowledge

  • Anomaly detection problems can be divided into supervised, unsupervised, and semi-supervised settings

1. Binary classification

  • a labeled sample consisting of both normal and anomalous examples

2. Highly imbalanced binary classification

  • the data set contains only a few anomalies

3. Outlier detection

  • an unlabeled sample that is contaminated with abnormal instances; an estimate of the expected ratio of anomalies is often known in advance

4. One-class classification (semi-supervised)

  • samples come only from the normal class, and the goal is to construct a classifier capable of detecting out-of-distribution (abnormal) instances

Scikit-learn

  • Outlier detection

    • The training data contains outliers that are far from the others
    • Outlier detection estimators try to fit the regions where the training data is the most concentrated.
  • Novelty detection

    • The training data is not polluted by outliers, and we are interested in detecting whether a new observation is an outlier (called a novelty)
    • Assumes that all observations in the training data are normal
  • In novelty detection, there may be normal samples in the training data that lie far from the other points (see the sketch below)
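A minimal sketch of the two settings in scikit-learn (my own illustration, not from the slides; the toy data and the choice of estimators are arbitrary):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_normal = rng.normal(0, 1, size=(200, 2))       # inliers
X_outliers = rng.uniform(-6, 6, size=(10, 2))    # scattered outliers

# Outlier detection: fit on contaminated data, flag outliers within that same data
X_train = np.vstack([X_normal, X_outliers])
iso = IsolationForest(contamination=0.05, random_state=0)
labels = iso.fit_predict(X_train)                # +1 = inlier, -1 = outlier

# Novelty detection: fit on clean (assumed normal) data, then score new observations
oc_svm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_normal)
new_labels = oc_svm.predict(X_outliers)          # -1 = novelty
```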

Application areas

not important

  • Data logs and process logs
  • Fraud detection and intrusion detection
  • Security and surveillance
  • Fake news and information, social networks
  • Health care analysis and medical diagnosis
  • Data sources of transactions
  • Sensor networks and databases
  • Data quality and data cleaning
  • Time series monitoring and data streams
  • Internet of things

Needs domain knowledge

Methods

not important

A. Statistical based approaches

  • Parametric and non-parametric methods
  • Data points are modeled using a well known stochastic distribution
  • Robust covariance, Gaussian mixture model (GMM), Kernel density estimation (KDE); a KDE sketch follows below
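For instance, a hedged sketch of KDE-based anomaly scoring with scikit-learn's KernelDensity (the data and bandwidth below are made up); a low estimated density is read as a high anomaly score:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),    # dense normal cluster
               rng.uniform(-6, 6, size=(5, 2))])   # a few scattered points

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X)
log_density = kde.score_samples(X)                 # higher = more "normal"
anomaly_score = -log_density                       # higher = more anomalous
suspects = np.argsort(anomaly_score)[-5:]          # indices of the 5 highest scores
```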

B. Density based approaches

  • Outliers are found in low-density regions, whereas inliers are assumed to appear in dense neighborhoods
  • Consider the neighborhood of an object which is defined by a given radius

Methods`

not important

C. Distance based approaches

  • Detect outliers by computing the distances between points
  • A point that lies far from its nearest neighbors is regarded as an outlier
  • Hence not robust to differences in scale
  • Local Outlier Factor (LOF); see the sketch below
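A minimal LOF example with scikit-learn (values are illustrative only):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)),
               rng.uniform(-6, 6, size=(5, 2))])   # a few isolated points

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
labels = lof.fit_predict(X)                        # +1 = inlier, -1 = outlier
scores = -lof.negative_outlier_factor_             # larger = more outlying
```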

D. Clustering based approaches

  • Use clustering methods to describe the behavior of the data
  • Points in small clusters are deemed outliers

E. Reconstruction based or Deep approaches

Classifier(decision function) performance

  • Assume that we know the labels of the test data $X$ (essential)
  • Evaluate a binary classifier (decision function) $$ h(x): X \longrightarrow \{ 0, 1\} $$
  • WLOG, label 1 means outlier (positive) and label 0 means inlier (negative)
  • The best classifier satisfies that
$$ h(x) = 1 \text{, whenever } x \text{ is an outlier} $$

and

$$ h(x) = 0 \text{, whenever } x \text{ is an inlier} $$

Accuracy is not a good measure for assessing a classifier

  • There are 9990 normal observations but only 10 anomalous observations
  • Example

                         Positive prediction   Negative prediction   Total
    Outlier (positive)                     0                    10      10
    Inlier (negative)                      0                  9990    9990
    Total                                  0                 10000   10000
  • Note that this classifier never predicts a positive, yet its accuracy is $9990/10000 = 0.999$
  • Baseline of Accuracy
$$ \max\left(\frac{P}{P+N} , \frac{N}{P+N}\right) $$

where $N$: the number of negative observations and $P$: the number of positive observations

Confusion matrix

Positive prediction Negative prediction
Observed positive True positive (TP) False negative (FN)
Observed negative False positive (FP) True negative (TN)

  • TP: instance is positive and is classified as positive
  • FN: instance is positive but is classified as negative
  • TN: instance is negative and is classified as negative
  • FP: instance is negative but is classified as positive

Accuracy

$$ \text{Accuracy}=\frac{\text{TP} + \text{TN}}{\text{P}+\text{N}} $$

Sensitivity or Recall (True positive rate, TPR)

$$ \text{Sensitivity} = \frac{\text{TP}}{\text{P}} $$

Specificity (True negative rate, TNR)

$$ \text{Specificity} = \frac{\text{TN}}{\text{N}} $$

Precision (Positive predictive value, PPV)

$$ \text{Precision} = \frac{\text{TP}}{\text{TP+FP}} $$

Negative predictive value, NPV

$$ \text{NPV} = \frac{\text{TN}}{\text{TN+FN}} $$

False positive rate, FPR

$$ \text{FPR} = \frac{\text{FP}}{\text{N}} = 1-\text{TNR} $$

Evaluation metric

  • Confusion matrix
Positive prediction Negative prediction
Observed positive True positive (TP) False negative (FN)
Observed negative False positive (FP) True negative (TN)
  • Evaluation metric
    Prediction based                                    Label based
    Positive predictive value $= \frac{TP}{TP + FP}$    True positive rate $= \frac{TP}{P}$
    Negative predictive value $= \frac{TN}{TN+FN}$      True negative rate $= \frac{TN}{N}$
                                                        False positive rate $= \frac{FP}{N}$
    Accuracy $= \frac{TP + TN}{P+N}$

Evaluation metric`

  • Example

                         Positive prediction   Negative prediction   Total
    Outlier (positive)                     3                    15      18
    Inlier (negative)                      2                    80      82
    Total                                  5                    95     100
  • Evaluation metric (fill in the blanks; a quick check in code follows below)

    Prediction based                              Label based
    Positive predictive value $= \qquad\qquad$    True positive rate $= \qquad\qquad$
    Negative predictive value $= \qquad\qquad$    True negative rate $= \qquad\qquad$
                                                  False positive rate $= \qquad\qquad$
    Accuracy $= \qquad\qquad$
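For checking the blanks above, a quick computation from the example counts (plain Python, my own addition):

```python
TP, FN = 3, 15               # outliers: P = 18
FP, TN = 2, 80               # inliers:  N = 82
P, N = TP + FN, FP + TN

ppv = TP / (TP + FP)         # positive predictive value = 3/5    = 0.60
npv = TN / (TN + FN)         # negative predictive value = 80/95  ≈ 0.84
tpr = TP / P                 # true positive rate        = 3/18   ≈ 0.17
tnr = TN / N                 # true negative rate        = 80/82  ≈ 0.98
fpr = FP / N                 # false positive rate       = 2/82   ≈ 0.02
acc = (TP + TN) / (P + N)    # accuracy                  = 83/100 = 0.83
```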

Scoring function(Anomaly score)

$$ s(x): X \longrightarrow \mathbb R $$

Assume higher scores indicate that samples are more likely to be outliers, e.g. $s(x)$ indicates the estimated probability that the sample $x$ is an outlier.

Want: $s(X_{out})$ and $s(X_{in})$ are completely separated!

where $X_{out}$ is the collection of outliers in $X$ and $X_{in} = X\setminus X_{out}$ (we assumed that we know the labels of test data).

Once such a scoring function has been learned (obtained), a classifier can be constructed by choosing a threshold $\lambda \in \mathbb R$ (sketched below):

$$ h^\lambda (x):=\begin{cases} 1 &s(x) \geq \lambda \\ 0 &s(x)< \lambda \end{cases} $$
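A tiny NumPy sketch of $h^\lambda$ (the scores and the threshold below are made up):

```python
import numpy as np

scores = np.array([0.1, 0.4, 0.35, 0.8, 0.95])   # s(x) for five samples
lam = 0.5                                        # threshold lambda
h = (scores >= lam).astype(int)                  # 1 = outlier, 0 = inlier
# h -> array([0, 0, 0, 1, 1])
```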

Examples of scoring function

Examples of scoring function and evaluation metric

Examples of scoring function

TPR and FPR

  • True positive rate, TPR
$$ \text{TPR} = \frac{\text{TP}}{\text{P}} $$
  • False positive rate, FPR
$$ \text{FPR} = \frac{\text{FP}}{\text{N}} = 1-\text{TNR} $$

ROC(Receiver operating characteristic) curve

[Figure: auroc.png]

  • A 2-dimensional graph in which the TP rate is plotted on the y-axis and the FP rate on the x-axis
  • Performance metric to measure the quality of the tradeoff of a scoring function $s(x)$
  • Tradeoffs between benefits(true positives rate) and cost(false positives or false alarm rate)

  • The classifier $h^\lambda(x)$ is determined by the threshold $\lambda$, so each $\lambda$ yields one point in the ROC graph

  • Simply, for a scoring function $s(x)$

$$ \lambda \longrightarrow h^\lambda \longrightarrow \text{confusion matrix} \longrightarrow \text{one point in ROC graph} $$

Several important points in ROC space

  • $(0,0)$: the strategy of never issuing a positive classification; such a classifier commits no false positive errors but also gains no true positives
  • $(0,1)$: perfect classification
  • $(1,1)$: unconditionally issuing positive classifications

[Figure: auroc.png]

Random performance

  • If it guesses the positive class $t\%$ of the time, it can be expected to get $t\%$ of the positives correct, but its false positive rate will be $t\%$ as well $(0\leq t \leq 100)$, yielding $(t,t)$ in ROC space
  • The diagonal line $y=x$ represents the strategy of randomly guessing a class
  • A classifier below the diagonal may be said to have useful information but it is applying the information incorrectly

Analysis of ROC space

  • Informally, points to the northwest are better than others (TP rate is higher and FP rate is lower)
  • Classifiers with a low FP rate may be thought of as "conservative"; they make positive classifications only with strong evidence, so they make few false positive errors but often also have low true positive rates
  • Classifiers with a high FP rate may be thought of as "liberal"; they make positive classifications with weak evidence, so they classify nearly all positives correctly but at the cost of high false positive rates

[Figure: auroc.png]

AUROC

  • Area under the ROC curve(AUROC) is a single scalar value representing expected performance
  • Baseline of AUROC $= 0.5$, the area under the line $y=x$ (see the scikit-learn sketch below)
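A hedged scikit-learn sketch of computing the ROC curve and AUROC from a scoring function (the labels and scores below are made up):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true  = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])   # 1 = outlier, 0 = inlier
y_score = np.array([0.1, 0.2, 0.15, 0.3, 0.8, 0.7, 0.4, 0.35, 0.9, 0.05])

fpr, tpr, thresholds = roc_curve(y_true, y_score)    # one (FPR, TPR) point per threshold
auroc = roc_auc_score(y_true, y_score)               # area under the ROC curve
```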

Anomaly detection methods

Notations

  • Goal: Detecting anomalous points
  • Data matrix $X\in \mathbb R^{N\times p}$, i.e. a row of $X$ represents one observation
  • A $p$-variate observation $\mathbf x$ is considered as a column vector

Robust Covariance(Minimum covariance determinant)

  • Statistical based approach(parametric)
  • (Assumption) Observations are sampled from an elliptically symmetric unimodal distribution(e.g. multivariate normal distribution)
  • In other words, density function $f(\mathbf x)$, $\mathbf x\in\mathbb R^p$ can be written by
$$ f(\mathbf x)=\frac{1}{\sqrt{|\Sigma|}} g(d^2(\mathbf x, \mu, \Sigma)) $$
  • where $d(\mathbf x, \mu, \Sigma) = \sqrt{(\mathbf x- \mu)^t \Sigma^{-1} (\mathbf x- \mu)}$ and a strictly decreasing real function $g$
  • Unknown parameters: location $\mu \in \mathbb R^p$ and scatter positive definite $p\times p$ matrix $\Sigma$

  • For instance, density function of the multivariate normal distribution

$$ f(x)=\frac{1}{\sqrt{(2\pi)^p|\Sigma|}} \exp \left (-\frac{1}{2} (\mathbf x- \mu)^t \Sigma^{-1} (\mathbf x- \mu) \right ) $$
  • i.e. $g(t)= \frac{1}{\sqrt{(2\pi)^p}} \exp(-\frac{t}{2})$
  • Statistical distance $d(\cdot, \mu, \Sigma)$ represents how far away $\mathbf x$ is from the location $ \mu$ relative to scatter $\Sigma$

Aim: find $\mu$, $\Sigma$ such that the statistical distance is useful for detecting outliers

Humble approach

  • Using whole observations, compute location $\mu$ and scatter $\Sigma$ as:
$$ \mu:= \bar {\mathbf x} := \frac{1}{N} \sum_{i=1}^N \mathbf x_i, \text{ (sample mean)} $$

and $$ \Sigma:=cov(X):= \frac{1}{N} \sum_{i=1}^N (\mathbf x_i - \bar{\mathbf x})(\mathbf x_i - \bar {\mathbf x})^t, \text{ (sample covariance)} $$

  • In this case, the statistical distance is called the Mahalanobis distance (MD)
$$ MD(\mathbf x) = d(\mathbf x, \bar{\mathbf x}, cov(X)) = \sqrt{(\mathbf x- \bar{\mathbf x})^t cov(X)^{-1} (\mathbf x- \bar{\mathbf x})} $$
  • Because of the masking effect, this is not a good choice (see the sketch below)
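For illustration, a short sketch of this "humble" estimate with scikit-learn's EmpiricalCovariance (my own example; note that mahalanobis returns squared distances):

```python
import numpy as np
from sklearn.covariance import EmpiricalCovariance

rng = np.random.RandomState(0)
X = rng.multivariate_normal([0, 0], [[1, 0.6], [0.6, 1]], size=200)

emp = EmpiricalCovariance().fit(X)    # sample mean and sample covariance of all points
md = np.sqrt(emp.mahalanobis(X))      # Mahalanobis distance MD(x) for each row of X
```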

Idea

  • We need reliable estimators that can resist outliers when they occur. i.e. we need the ellipse that is smaller and only encloses the regular points.
  • Try to remove the masking effect, noise, and outliers to obtain a "pure" subset
  • The idea is to find a subset of (representative) observations whose empirical covariance has the smallest determinant

Minimum covariance determinant

  • Find $h$ (a fixed number of) observations such that the determinant of their covariance matrix is as small as possible
  • The larger $|\Sigma|$, the more dispersed
  • Robust distance(RD) is defined by
$$ RD(\mathbf x):= d(\mathbf x, \hat{\mu}_{MCD}, \hat{\Sigma}_{MCD}) $$
  • $\hat{\mu}_{MCD}$: MCD estimate of location (from $h$ observations)
  • $\hat{\Sigma}_{MCD}$: MCD covariance estimate (from $h$ observations)

  • Generally, $\binom{N}{h}$ is too large to search exhaustively, so we need a smarter procedure...

Theorem(key)

Consider a data set $X:=\{ \mathbf x_1, ... ,\mathbf x_N \}$ of $p$-variate observations. Let $H_1 \subset \{1,...,N\}$ with $|H_1|=h$ and put

$$ T_1:= \frac{1}{h} \sum_{i\in H_1} \mathbf x_i, \quad S_1:= \frac{1}{h} \sum_{i\in H_1} (\mathbf x_i - T_1)(\mathbf x_i- T_1)^t. $$

If $\det (S_1) \neq 0$, then define the relative distances

$$ d_1(i):= \sqrt{(\mathbf x_i - T_1)^t S_1^{-1}(\mathbf x_i- T_1)}, \text{ for } i=1,...,N. $$

Sort these $N$ distances from the smallest $d_1(i_1)\leq d_1(i_2)\leq \cdots \leq d_1(i_N)$, then we obtain the ordered tuple $(i_1, i_2, ...,i_N)$ (which is some permutation of $(1,2,...,N)$). Let $H_2:=\{i_1,...,i_h\}$ and compute $T_2$ and $S_2$ based on $H_2$. Then

$$ \det(S_2) \leq \det(S_1) $$

with equality if and only if $T_2=T_1$ and $S_2=S_1$.

Idea

Proof of Theorem

Proof. Assume that $\det(S_2)>0$, otherwise the result is already satisfied. We can thus compute $d_2(i)=d_{(T_2, S_2)}(i)$ for all $i=1,...,N$. Using $|H_2|=h$ and the definition of $(T_2, S_2)$ we find

$$\begin{align*} \frac{1}{hp}\sum_{i\in H_2}d_2^2(i) &= \frac{1}{hp} tr \sum_{i\in H_2}(\mathbf x_i -T_2)^t S_2^{-1}(\mathbf x_i - T_2) \\ \tag{A.1} &=\frac{1}{hp} tr \left( S_2^{-1} \sum_{i\in H_2}(\mathbf x_i -T_2) (\mathbf x_i - T_2)^t \right) = \frac{1}{hp} tr \left( S_2^{-1}\, h S_2 \right) = \frac{1}{p} tr(I_p) = 1. \label{A.1} \end{align*}$$

Moreover, put

$$\begin{equation*} \tag{A.2} \lambda:= \frac{1}{hp} \sum_{i\in H_2} d_1^2(i) = \frac{1}{hp} \sum_{k=1}^h d_1^2(i_k)\leq \frac{1}{hp} \sum_{j\in H_1} d_1^2(j) =1, \label{A.2} \end{equation*}$$

where $\lambda>0$ because otherwise $\det(S_1)=0$. Combining (\ref{A.1}) and (\ref{A.2}) yields

$$ \frac{1}{hp} \sum_{i\in H_2} d^2_{(T_1,\lambda S_1)}(i) = \frac{1}{hp} \sum_{i\in H_2}(\mathbf x_i -T_1)^t \frac{1}{\lambda}S_1^{-1}(\mathbf x_i - T_1) = \frac{1}{\lambda hp} \sum_{i\in H_2}d_1^2(i) =\frac{\lambda}{\lambda}=1. $$

Proof of Theorem`

Grübel (1988) proved that $(T_2, S_2)$ is the unique minimizer of $\det(S)$ among all $(T,S)$ for which $\frac{1}{hp}\sum_{i\in H_2} d^2_{(T,S)}(i) = 1$. This implies that $\det(S_2)\leq \det(\lambda S_1)$. On the other hand it follows from the inequality (\ref{A.2}) that $\det(\lambda S_1)\leq \det(S_1)$, hence

$$\begin{equation*} \tag{A.3} \det(S_2)\leq \det(\lambda S_1)\leq \det(S_1) \label{A.3} \end{equation*}$$

Moreover, note that $\det(S_2)=\det(S_1)$ if and only if both inequalities in (\ref{A.3}) are equalities. For the first, we know from Grübel's result that $\det(S_2)=\det(\lambda S_1)$ if and only if $(T_2, S_2)= (T_1,\lambda S_1)$. For the second, $\det(\lambda S_1) = \det(S_1)$ if and only if $\lambda = 1$, i.e. $\lambda S_1 = S_1$. Combining both yields $(T_2, S_2) = (T_1, S_1)$.

Method

  • Fix $h$ s.t. $[(N+p+1)/2] \leq h \leq N$ and $h>p$
  • Given the $h$-subset $H_{old}$ and its $(T_{old}, S_{old})$
  • Compute $d_{old}(i)$ for $i=1,...,N$ and sort these distances which yields a permutation $\pi$ for which
$$ d_{old}(\pi(1)) \leq d_{old}(\pi(2)) \leq \cdots \leq d_{old}(\pi(N)) $$
  • Let
$$ H_{new}:= \{ \pi(1), \pi(2),...,\pi(h)\} $$
  • Compute
$$T_{new} = \frac{1}{h} \sum_{i\in H_{new}} \mathbf x_{i}\text{ and }S_{new} = \frac{1}{h} \sum_{i\in H_{new}} (\mathbf x_i - T_{new})(\mathbf x_i- T_{new})^t $$
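A NumPy sketch of this update, often called a concentration step (my own illustration, assuming X is the $N\times p$ data matrix and H_old an index array of size $h$):

```python
import numpy as np

def c_step(X, H_old):
    """One concentration step: recompute (T, S) on H_old, then keep the h
    observations with the smallest statistical distances."""
    h = len(H_old)
    T_old = X[H_old].mean(axis=0)
    diff = X[H_old] - T_old
    S_old = diff.T @ diff / h                     # covariance with 1/h, as in the slides
    d2 = np.einsum("ij,jk,ik->i", X - T_old, np.linalg.inv(S_old), X - T_old)
    H_new = np.argsort(d2)[:h]                    # indices of the h smallest distances
    return H_new

# Iterating c_step until H stops changing makes det(S) non-increasing (the theorem above).
```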

Some practical aspects of the optimization

  • Initial choice of $H_1$
  • The number of steps(iterations)
  • $[(N+p+1)/2] \leq h \leq N$, e.g.
    • $h=N$,
    • $h=[0.75N]$

When does MCD work well?

Tutorial
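A minimal sketch of what such a tutorial typically covers, using scikit-learn's MCD implementation (MinCovDet, plus EllipticEnvelope as a ready-made outlier detector); the data and parameter values are illustrative:

```python
import numpy as np
from sklearn.covariance import MinCovDet, EllipticEnvelope

rng = np.random.RandomState(0)
X = np.vstack([rng.multivariate_normal([0, 0], [[1, 0.7], [0.7, 1]], size=190),
               rng.uniform(-8, 8, size=(10, 2))])    # ~5% contamination

mcd = MinCovDet(support_fraction=0.75, random_state=0).fit(X)
rd = np.sqrt(mcd.mahalanobis(X))          # robust distances RD(x)

# Or as a ready-made outlier detector built on MCD:
ee = EllipticEnvelope(contamination=0.05, random_state=0)
labels = ee.fit_predict(X)                # +1 = inlier, -1 = outlier
```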

One class SVM(support vector machine)

  • One-class SVM is a variant of the SVM (support vector machine)
  • We first have to understand the SVM and the kernel method (which is not easy)!

Linear SVM(simplest)

  • Data matrix $X$: a collection of $\mathbf x_i \in \mathbb R^p$, each with label $y_i= 1$ or $-1$
  • SVM is a supervised learning method for classification (a non-probabilistic binary linear classifier)
  • Aim to find a $(p-1)$-dimensional hyperplane that separates the two classes

Hard margin

  • Samples with label 1 lie on or above the boundary
$$ \mathbf w \cdot \mathbf x - b = 1 $$
  • Samples with label -1 lie on or below the boundary
$$ \mathbf w \cdot \mathbf x - b = -1 $$

Hard margin`

  • Maximize $\frac{2}{\| \mathbf w \|}$ (i.e. minimize $\| \mathbf w \|$) subject to the following constraints
$$ \mathbf w \cdot \mathbf x_i - b \geq 1, \quad \text{ if } y_i =1 $$

and

$$ \mathbf w \cdot \mathbf x_i - b \leq -1, \quad \text{ if } y_i =-1 $$
  • Rewrite
$$ minimize {\| \mathbf w \|} \text{ subject to } y_i(\mathbf w \cdot \mathbf x_i - b) \geq 1, \quad \forall i = 1,...,N $$

Soft margin

  • Hinge loss
$$ minimize \left [\frac{1}{N} \sum_{i=1}^N \max(0, 1- y_i(\mathbf w \cdot \mathbf x_i - b)) \right ] + \lambda \| \mathbf w \| ^2 $$
  • $\lambda$: role of trade-off between margin size and ensuring that $\mathbf x_i$ lie on the correct side of the margin
  • High $\lambda$ ~ underfitting (maximize the margin)
  • Low $\lambda$ ~ overfitting (see the scikit-learn example below)
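For reference, a soft-margin SVM fit with scikit-learn on made-up labeled data; in scikit-learn's parameterization the constant C plays the role of the trade-off, roughly $C \approx \frac{1}{2N\lambda}$ in the notation above:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)),    # class -1
               rng.normal(+2, 1, size=(50, 2))])   # class +1
y = np.r_[-np.ones(50), np.ones(50)]

clf = SVC(kernel="linear", C=1.0).fit(X, y)        # larger C -> harder margin
w, b = clf.coef_[0], clf.intercept_[0]             # hyperplane w.x + b = 0 (sklearn's sign convention)
pred = clf.predict(X)
```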

Primal problem

  • Note that $\zeta_i:= \max(0, 1- y_i(\mathbf w \cdot \mathbf x_i - b))$ is the smallest nonnegative number $t\geq 0$ satisfying
$$ y_i(\mathbf w \cdot \mathbf x_i - b) \geq 1 - t $$
  • So rewrite hinge loss as a quadratic optimization
$$ minimize \frac{1}{N} \sum_{i=1}^N \zeta_i + \lambda \| \mathbf w \| ^2 $$$$ \text{subject to } y_i(\mathbf w \cdot \mathbf x_i - b) \geq 1 - \zeta_i \text{ and } \zeta_i \geq 0 \quad \forall i=1,...,N $$

Dual problem

  • By solving for the Lagrangian dual of primal problem, we obtain the simplified problem
$$ maximize \sum_{i=1}^N c_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N y_i c_i (\mathbf x_i \cdot \mathbf x_j) y_j c_j $$$$ \text{subject to } \sum_{i=1}^N c_i y_i=0, \text{ and } 0\leq c_i \leq \frac{1}{2N\lambda}\quad \forall i=1,...,N $$
  • Here the variables $c_i$ are defined such that
$$ \mathbf w := \sum_{i=1}^N c_i y_i \mathbf x_i $$

Nonlinear SVM

  • Generally, data is not linearly separable
  • Original maximum margin hyperplane classifier(1963)
  • Kernel method(1992)

  • For instance,

Nonlinear SVM`

  • Feature map $\varphi: X \longrightarrow \cal F \subset \mathbb R^3$ given by (mapping the 2-dim data set to a 3-dim feature space $\cal F$)
$$ \varphi(\mathbf x ) = \varphi(x_1, x_2) = (x_1, x_2, x_1^2+x_2^2) \in \cal F \subset \mathbb R^3 $$
  • Linear SVM classifier in this feature space $\cal F$
  • Hyperplane in feature space $\cal F \subset \mathbb R^3$ can be determined by $\mathbf w=(w_1, w_2, w_3)$ and $b\in \mathbb R$ and expressed as
$$ w_1 x_1 + w_2 x_2 + w_3(\underbrace{x_1^2 + x_2^2}_{\text{3rd coordinate of } \varphi(\mathbf x)}) + b = 0 $$
  • This is nonlinear in $\mathbb R^2$
  • But this is linear in feature space $\cal F$ (in $\mathbb R^3$)

Feature mapping(Kernel method)

  • Feature map $\varphi: X \longrightarrow \cal F \subset \mathbb R^s$
  • Find a maximal(soft) margin hyperplane in feature space $\cal F$
  • Hinge loss (assuming we know $\varphi$ explicitly)
$$ minimize \left [\frac{1}{N} \sum_{i=1}^N \max(0, 1- y_i(\mathbf w \cdot \color{red}{\varphi(\mathbf x_i)} - b)) \right ] + \lambda \| \mathbf w \| ^2 $$
  • Here $\mathbf w \in \mathbb R^s$ and $\mathbf w \cdot \varphi(\mathbf x_i)$ is an inner product in $\mathbb R^s$

Primal problem for feature mapped space

  • So rewrite hinge loss as a quadratic optimization
$$ minimize \frac{1}{N} \sum_{i=1}^N \zeta_i + \lambda \| \mathbf w \| ^2 $$$$ \text{subject to } y_i(\mathbf w \cdot \color{red}{\varphi(\mathbf x_i)} - b) \geq 1 - \zeta_i \text{ and } \zeta_i \geq 0 \quad \forall i=1,...,N $$

Dual problem for feature mapped space

  • By solving for the Lagrangian dual of primal problem, we obtain the simplified problem
$$ maximize \sum_{i=1}^N c_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N y_i c_i (\color{red}{\varphi(\mathbf x_i) \cdot \varphi(\mathbf x_j)}) y_j c_j $$$$ \text{subject to } \sum_{i=1}^N c_i y_i=0, \text{ and } 0\leq c_i \leq \frac{1}{2N\lambda}\quad \forall i=1,...,N $$
  • Here the variables $c_i$ are defined such that
$$ \mathbf w = \sum_{i=1}^N c_i y_i \color{red}{\varphi(\mathbf x_i)} \in \cal F $$

Feature mapping(Kernel method)`

  • Using a feature map and finding a hyperplane in feature space seems intuitive and easy
  • However, there are some problems:

    • Unknown distribution of observations
    • Choice of feature map, e.g. random mapping $ \varphi: \mathbb R^2 \longrightarrow \mathbb R^\infty$

      $$ \varphi(x_1, x_2):=(\sin(x_2), \exp(x_1+x_2), x_2, x_1^{\tan(x_2)},...) $$

    • Generally, dim of feature space may be high

    • Lots of computation

Kernel method(Feature mapping)

  • Feature mapping can be viewed as replacing the inner product by a kernel function, e.g.
$$ k(\mathbf x,\mathbf y):= \varphi(\mathbf x)\cdot \varphi(\mathbf y) = \mathbf x \cdot \mathbf y + \|\mathbf x\|^2 \|\mathbf y\|^2 $$
  • Question: can we go the other way (implicitly)? Let
$$ k(\mathbf x, \mathbf y):= \exp(-\| \mathbf x - \mathbf y \|^2) $$
  • Might there exist a map $\varphi$ and a feature space $\cal F$ s.t. $k(\mathbf x, \mathbf y) = \varphi(\mathbf x)\cdot \varphi(\mathbf y)$?
  • Idea: Choose a nice kernel function $k$ rather than an ugly feature mapping(explicit)
  • Only need to know $\varphi(\mathbf x_i)$, $\forall i =1,...,N$ not explicit $\varphi$

Kernel method(Feature mapping)`

Finite set

  • $X = \{\mathbf x_1, ... , \mathbf x_N \}$, $\mathbf x_i \in \mathbb R^p$, $k:X \times X \longrightarrow [0, \infty)$, positive semi definite kernel
  • Then we can construct a feature space $\cal F$ and feature map $\varphi$
  • Let $G$ be a $N\times N$ matrix with $G_{ij}:=k(\mathbf x_i, \mathbf x_j)$ (can be calculated). Then $G$ is symmetric PSD so
$$ G = U \Sigma U^t \quad \text{by singular value decomposition} $$

$$ \text{s.t.}\quad U^t U = I_m,\quad m=rank(G),\quad U^t=\left[\begin{array}{cc} \mathbf u_1 \, \cdots\, \mathbf u_N\end{array}\right] \in \mathbb R^{m\times N}, \mathbf u_i \in \mathbb R^m $$$$ \Sigma = diag(\lambda_1,...,\lambda_m), \quad \lambda_1\geq \lambda_2\geq \cdots \geq \lambda_m>0 $$

Kernel method(Feature mapping)``

Finite set

  • Let $\varphi(\mathbf x_i):=\Sigma^{1/2} \mathbf u_i \in \mathbb R^m$, $i=1,...,N$
  • Then such $\varphi(\mathbf x_i)$ leads to the Gram matrix $G$ because
$$ \varphi(\mathbf x_i)\cdot \varphi(\mathbf x_j)= (\Sigma^{1/2} \mathbf u_i)^t \Sigma^{1/2} \mathbf u_j = \mathbf u_i^t \Sigma \mathbf u_j = G_{ij} $$
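A NumPy sketch of this construction (my own) for the Gaussian kernel on a small made-up data set, recovering $\varphi(\mathbf x_i)$ from the eigendecomposition of $G$:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.normal(size=(6, 2))                          # six 2-dim points

# Gram matrix of the Gaussian kernel k(x, y) = exp(-||x - y||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
G = np.exp(-sq_dists)

# G is symmetric PSD: G = U diag(lam) U^t
lam, U = np.linalg.eigh(G)
keep = lam > 1e-10                                   # numerical rank
Phi = U[:, keep] * np.sqrt(lam[keep])                # row i is phi(x_i)

assert np.allclose(Phi @ Phi.T, G)                   # phi(x_i) . phi(x_j) = G_ij
```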

Summary

  • Given a psd kernel $k: X \times X \longrightarrow [0, \infty)$ and finite data set $X = \{\mathbf x_1, ... , \mathbf x_N \}$, $\mathbf x_i \in \mathbb R^p$
  • Then we can construct a feature map $\varphi:X \longrightarrow \cal F$ and a feature space $\cal F = span \{ \varphi(\mathbf x_i) : \mathbf x_i \in$ $X \}$
  • We can find a maximal margin hyperplane in $\cal F$, i.e. $\mathbf w \in \cal F$

Kernel trick

Dual problem

  • By solving for the Lagrangian dual of primal problem, we obtain the simplified problem
$$ maximize \sum_{i=1}^N c_i - \frac{1}{2} \sum_{i=1}^N \sum_{j=1}^N y_i c_i \underbrace{\varphi(\mathbf x_i) \cdot \varphi(\mathbf x_j)}_{=k(\mathbf x_i, \mathbf x_j)} y_j c_j $$$$ \text{subject to } \sum_{i=1}^N c_i y_i=0, \text{ and } 0\leq c_i \leq \frac{1}{2N\lambda}\quad \forall i=1,...,N $$
  • Here the variables $c_i$ are defined such that
$$ \mathbf w = \sum_{i=1}^N c_i y_i \varphi(\mathbf x_i) $$
  • We may not know the explicit formula of $\varphi$. But for a new sample $\mathbf x$, we know that
$$ \underbrace{\mathbf w \cdot \varphi(\mathbf x) = \sum_{i=1}^N c_i y_i ( \varphi(\mathbf x_i) \cdot \varphi(\mathbf x))}_{unknown}= \underbrace{\sum_{i=1}^N c_i y_i k(\mathbf x_i, \mathbf x)}_{known} $$

One class SVM(2001)

  • Finds a maximum (soft) margin hyperplane in feature space that best separates the mapped data from the origin

  • Data matrix $X = \{\mathbf x_1, ... , \mathbf x_N \}$, $\mathbf x_i \in \mathbb R^p$
  • OC-SVM solves the following primal problem (by quadratic programming)
$$ minimize \frac{1}{2}\| \mathbf w \| ^2 - \rho + \frac{1}{\nu N} \sum_{i=1}^N \zeta_i $$$$ \text{subject to } \mathbf w \cdot \varphi(\mathbf x_i) \geq \rho - \zeta_i \text{ and } \zeta_i \geq 0 \quad \forall i=1,...,N $$
  • By the Lagrangian dual of the primal problem, the solution again depends on the data only through the kernel values $k(\mathbf x_i, \mathbf x_j)=\varphi(\mathbf x_i)\cdot\varphi(\mathbf x_j)$
  • A sample $\mathbf x$ with $\mathbf w \cdot \varphi(\mathbf x) < \rho$ is deemed anomalous

Tutorial
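A minimal scikit-learn sketch of what such a tutorial might contain (OneClassSVM; the data and the value of $\nu$ are made up):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(0)
X_train = rng.normal(0, 1, size=(200, 2))           # assumed-normal training data
X_test = np.vstack([rng.normal(0, 1, size=(20, 2)),
                    rng.uniform(-6, 6, size=(5, 2))])

oc = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_train)
scores = oc.decision_function(X_test)               # roughly w.phi(x) - rho; negative = anomalous
labels = oc.predict(X_test)                         # -1 = anomaly, +1 = normal
```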

Thank you for your attention!

with Datasaurus