k-Means is a data partitioning technique that is widely used for clustering. It has variants (such as mini-batch k-Means) that are very fast on large amounts of data, and its clustering results are easy to interpret. However, k-Means also has several applications that are rarely discussed:

  1. Non-Linear Dimensionality Reduction: it can represent thousands of features using only a few “transformed” features, which can serve as engineered features in an ML pipeline or be used for data visualization.
  2. Multivariate Outlier/Anomaly Detection
  3. Data Representation (for input to other algorithms)
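
The first two applications in the list can be sketched in a few lines. This is a minimal illustration rather than the article's own code: it assumes scikit-learn is available, uses synthetic data from `make_blobs`, and the choice of 4 clusters and a top-1% cutoff for outliers are assumptions made purely for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data (assumed for illustration): 500 points, 50 features, 4 groups.
X, _ = make_blobs(n_samples=500, n_features=50, centers=4, random_state=0)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# transform() returns the distance from each point to every centroid,
# so the 50 original features become 4 "transformed" features.
X_reduced = kmeans.transform(X)

# The distance to the nearest centroid can act as a simple outlier score.
outlier_score = X_reduced.min(axis=1)
threshold = np.percentile(outlier_score, 99)  # assumed cutoff: top 1%
outliers = np.where(outlier_score > threshold)[0]

print(X_reduced.shape, len(outliers))
```

The distance-to-centroid features are a non-linear function of the inputs, which is why this counts as non-linear dimensionality reduction. Whether a percentile cutoff is appropriate depends on the data; it is only one possible way to turn the score into labels.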

A guide to performing anomaly detection on multivariate data for business analysis or a machine learning pipeline, along with relevant Python code.

In my previous article (https://medium.com/analytics-vidhya/anomaly-detection-in-python-part-1-basics-code-and-standard-algorithms-37d022cdbcff) we discussed the basics of anomaly detection, the types of problems and the types of methods used. We covered EDA, univariate and multivariate methods of performing anomaly detection, with one example of each. We discussed why multivariate outlier detection is a difficult problem that requires specialized techniques. We also discussed the Mahalanobis distance method with FastMCD for detecting multivariate outliers.

In this article, we will discuss 2 other widely used methods to perform Multivariate Unsupervised Anomaly Detection. We will discuss:

  1. Isolation Forests
  2. OC-SVM (One-Class SVM)
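
Both methods in the list are available in scikit-learn. The sketch below is not taken from the article; it assumes scikit-learn is installed, uses made-up 2-D data, and the `contamination=0.05` and `nu=0.05` values are illustrative guesses at the outlier fraction rather than recommended settings.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# Made-up data: 300 "normal" points plus 15 scattered points.
rng = np.random.RandomState(0)
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
X_scattered = rng.uniform(low=-8.0, high=8.0, size=(15, 2))
X = np.vstack([X_inliers, X_scattered])

# Isolation Forest: contamination is the assumed fraction of outliers.
iso = IsolationForest(n_estimators=100, contamination=0.05, random_state=0)
iso_labels = iso.fit_predict(X)  # +1 = inlier, -1 = outlier

# One-Class SVM: nu upper-bounds the fraction of training points
# treated as errors (outliers).
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)
svm_labels = ocsvm.fit_predict(X)  # +1 = inlier, -1 = outlier

print((iso_labels == -1).sum(), (svm_labels == -1).sum())
```

Both estimators are unsupervised here: no labels are passed to `fit_predict`, and the two methods will not necessarily flag the same points.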

Some General Thoughts on Anomaly Detection

Anomaly detection is a tool to identify unusual or interesting occurrences…

An Anomaly/Outlier is a data point that deviates significantly from normal/regular data. Anomaly detection problems can be classified into 3 types:

  1. Supervised: In these problems, data contains both Anomalous and Clean data along with labels which tell us which examples are anomalous. We use classification algorithms to perform anomaly detection.
  2. Semi-Supervised: Here, we only have access to ‘Clean’ data during training. The model tries to capture what ‘normal’ data looks like, and labels data that looks ‘abnormal’ as outliers during prediction. Autoencoders are widely used in this category.
  3. Un-Supervised: Here, data contains both clean and anomalous examples —…

Principal Component Analysis (PCA) is among the most popular, fastest and easiest-to-interpret dimensionality reduction techniques. It exploits the linear dependence among variables. Some of its applications are:

  • Decorrelating Variables, i.e. making the features linearly uncorrelated
  • Outlier/Noise Removal
  • Data Visualization
  • Dimensionality Reduction

In the following article we will discuss the applications and why PCA works.

Why does Dimensionality Reduction using PCA Work?

Dimensionality reduction using PCA works because of the presence of collinearity (or linear dependence among features) in the data. Let us see what this means. Imagine the following 2 scenarios:

  1. Case A: Variables x1 and x2 are highly collinear(linearly dependent on…
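
The collinearity argument can be illustrated with a small experiment. This is a sketch under stated assumptions, not the article's example: the data is synthetic, and because the excerpt is cut off before its second scenario, the contrasting "independent" case below is my own assumption about what a natural comparison looks like.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
x1 = rng.normal(size=1000)

# Collinear case: x2 is almost a linear function of x1.
x2_collinear = 2.0 * x1 + rng.normal(scale=0.1, size=1000)
# Contrasting case (assumed): x2 is drawn independently of x1.
x2_independent = rng.normal(size=1000)

X_a = np.column_stack([x1, x2_collinear])
X_b = np.column_stack([x1, x2_independent])

# Fraction of total variance captured by each principal component.
ratio_a = PCA(n_components=2).fit(X_a).explained_variance_ratio_
ratio_b = PCA(n_components=2).fit(X_b).explained_variance_ratio_

print(ratio_a, ratio_b)
```

When the variables are collinear, nearly all the variance falls on the first component, so dropping the second loses little information. When they are independent, the variance is split roughly evenly and there is nothing to compress.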

Regularization is a method used to reduce the variance of a Machine Learning model; in other words, it is used to reduce overfitting. Overfitting occurs when a machine learning model performs well on the training examples but fails to yield accurate predictions for data that it has not been trained on.

In theory, there are 2 major ways to build a machine learning model with the ability to generalize well on unseen data:

  1. Train the simplest model possible for our purpose (according to Occam’s Razor).
  2. Train a complex or more expressive model on the data and perform regularization.
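
The second approach can be sketched as follows. The excerpt does not say which form of regularization it has in mind, so L2 (ridge) regularization is used here as one common choice; the degree-12 polynomial, `alpha=1.0` and the synthetic sine data are likewise assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Small noisy dataset (synthetic, for illustration only).
rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X[:, 0]) + rng.normal(scale=0.2, size=30)

# An expressive model: degree-12 polynomial, no regularization.
plain = make_pipeline(
    PolynomialFeatures(degree=12, include_bias=False), LinearRegression()
).fit(X, y)

# The same model with an L2 penalty on the coefficients.
ridge = make_pipeline(
    PolynomialFeatures(degree=12, include_bias=False), Ridge(alpha=1.0)
).fit(X, y)

# The penalty shrinks the coefficients, which limits how wildly the
# fitted curve can swing between training points.
plain_norm = np.linalg.norm(plain.named_steps["linearregression"].coef_)
ridge_norm = np.linalg.norm(ridge.named_steps["ridge"].coef_)

print(plain_norm, ridge_norm)
```

Smaller coefficients do not guarantee better predictions on unseen data; in practice `alpha` would be chosen by validation.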

It has been…

k-Means is a data partitioning algorithm and is often one of the first choices for clustering. Some reasons for its popularity are:

  1. Fast to Execute.
  2. Online and mini-batch implementations are also available, which require less memory.
  3. Easy interpretation. The centroid of a cluster often gives a fair idea of the data present in the cluster. The same cannot be said of some other clustering algorithms that can detect non-convex clusters, where the centroid might not even lie within the cluster.
  4. Results of k-Means can be used as starting points for other algorithms. It is often a…
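
The mini-batch point in the list can be sketched with scikit-learn's `MiniBatchKMeans`, which accepts data one chunk at a time through `partial_fit`. This is an illustration under assumptions (synthetic data, 3 clusters, chunks of 500 rows), not code from the article.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for a dataset too large to fit in memory.
X, _ = make_blobs(n_samples=3000, n_features=2, centers=3, random_state=0)

mbk = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0)

# Feed the data in chunks of 500 rows; only one chunk is needed at a time.
for start in range(0, len(X), 500):
    mbk.partial_fit(X[start:start + 500])

centroids = mbk.cluster_centers_  # one row per cluster, easy to inspect
labels = mbk.predict(X)

print(centroids.shape, np.bincount(labels))
```

Each row of `centroids` is a point in the original feature space, which is what makes the clusters straightforward to interpret.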

Nitish Kumar Thakur

Data Scientist @ Ford Motor Company. https://www.linkedin.com/in/nitish-kumar-thakur/
