Clustering is an unsupervised learning technique that groups unlabelled data points based on their similarities.
Clustering algorithms process raw, unclassified data into groups that reflect the structures and patterns in the data.
[Figure placeholder: clustering diagram to be added]
Clustering algorithms can be classified into the following types:
- Exclusive clustering: Exclusive clustering does not allow a data point to belong to more than one cluster, and is therefore also called ‘hard clustering’. The widely used ‘k-means clustering’ algorithm is an example of exclusive clustering.
- Overlapping clustering: Overlapping clustering allows a data point to belong to multiple clusters. It is also called ‘soft clustering’.
- Hierarchical clustering: Hierarchical clustering is divided into two types, ‘agglomerative’ and ‘divisive’. Agglomerative clustering follows a bottom-up approach: each data point starts as its own cluster, and the most similar clusters are merged iteratively until a single cluster remains. Divisive clustering is the opposite and takes a top-down approach: a single cluster containing all the data is split repeatedly based on the differences between data points.
- Probabilistic clustering: In probabilistic clustering, data points are clustered based on the likelihood that they belong to a particular probability distribution. The Gaussian Mixture Model (GMM) is one of the most commonly used probabilistic clustering methods.
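The bottom-up merging used by agglomerative clustering can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function names `single_linkage` and `agglomerate` are hypothetical, and measuring similarity as the Euclidean distance between the closest members of two clusters (single linkage) is an assumption made for the example.

```python
import math
from itertools import combinations


def single_linkage(a, b):
    """Distance between two clusters: the gap between their closest members."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points, n_clusters=1):
    """Merge the two closest clusters repeatedly until n_clusters remain."""
    clusters = [[p] for p in points]  # bottom-up: every point starts alone
    while len(clusters) > n_clusters:
        # Find the pair of clusters that are most similar (closest together).
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: single_linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]                  # j > i, so index i is unaffected
    return clusters


# Illustrative data: two tight pairs and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
three_groups = agglomerate(points, n_clusters=3)
one_group = agglomerate(points, n_clusters=1)
```

Stopping at `n_clusters=3` keeps the two tight pairs and the isolated point apart, while the default `n_clusters=1` continues merging until everything sits in a single cluster, as described above.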
Example uses of clustering include:
- Recommender systems, such as grouping similar viewing patterns on Netflix in order to recommend similar content.
- Anomaly detection, such as detecting fraud or defective mechanical parts.
- Genetic analysis, such as clustering DNA patterns to study evolutionary biology.
- Customer segmentation, in order to understand different customer segments and devise marketing strategies.
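As one illustration of the anomaly detection use case, a point can be flagged when it lies far from every cluster centre. The sketch below assumes the centres have already been found by some clustering algorithm; the function name `flag_anomalies`, the sample readings, and the distance threshold are all hypothetical values chosen for the example.

```python
import math


def flag_anomalies(points, centres, threshold):
    """Return the points lying further than `threshold` from every centre."""
    return [
        p for p in points
        if min(math.dist(p, c) for c in centres) > threshold
    ]


# Assumed output of an earlier clustering step: two cluster centres.
centres = [(0.0, 0.0), (10.0, 10.0)]

# Three readings sit close to a centre; (5.0, 5.0) belongs to neither cluster.
readings = [(0.2, -0.1), (9.8, 10.3), (5.0, 5.0), (0.4, 0.3)]
anomalies = flag_anomalies(readings, centres, threshold=2.0)
```

In practice the threshold would need tuning to the data, for example from the spread of distances observed within each cluster.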
Example clustering algorithms, as documented for instance in scikit-learn, include:
- Affinity propagation
- Spectral clustering
- Ward hierarchical clustering
- Agglomerative clustering
- Gaussian mixtures
- Bisecting K-Means
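A brief sketch of how a few of these algorithms might be run with scikit-learn is shown below. It assumes scikit-learn and NumPy are installed, uses synthetic data with illustrative cluster centres, and picks three estimators (`KMeans`, `AgglomerativeClustering` with Ward linkage, and `GaussianMixture`) simply as representatives of the exclusive, hierarchical and probabilistic types described earlier.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic data: three well-separated blobs (centres chosen for illustration).
X, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

# Exclusive ("hard") clustering: each point receives exactly one label.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical clustering: agglomerative merging with Ward linkage.
ward_labels = AgglomerativeClustering(n_clusters=3, linkage="ward").fit_predict(X)

# Probabilistic clustering: a Gaussian mixture gives per-cluster probabilities.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
gmm_probabilities = gmm.predict_proba(X)  # one row per point, rows sum to 1
gmm_labels = gmm_probabilities.argmax(axis=1)

# Print the cluster sizes found by each method.
for name, labels in [("k-means", kmeans_labels), ("Ward", ward_labels), ("GMM", gmm_labels)]:
    print(name, np.bincount(labels))
```

The label numbers themselves are arbitrary, so two methods can agree on the grouping while numbering the clusters differently. Only the Gaussian mixture exposes membership probabilities here; the other two return a single label per point.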