Unsupervised Machine Learning: Exploring Models That Work Without Output Labels
Learn how various unsupervised machine learning models work under the hood and how they generate predictions without output labels
Machine learning and artificial intelligence have gained substantial traction in the past decade. Numerous companies across various industries have created newer, more powerful models. The introduction and implementation
of large language models (LLMs) have significantly increased attention towards the field of artificial intelligence. Companies from many sectors are bracing themselves for massive disruption in the realm of machine
learning and general intelligence.
While these advances motivate people to learn the latest AI technologies, there is often an obstacle: a lack of high-quality labeled data for successful application in this field. For machine learning to operate under supervision, the training data must have labels, and getting access to such data can be quite challenging. Acquiring it is costly and time-consuming because it requires manual labeling by humans, and the process can also introduce inconsistent labels.
So, how can we make use of unlabeled data without labeling it manually? This is where unsupervised machine learning comes into play. As the name implies, these models can be trained on data that has no output labels. They identify inherent patterns and trends in the data and learn to group the records into categories based on a specific set of attributes or features. Once the groups are formed, one can identify commonalities between them and use this information to drive the business in the right direction.
Consider, for instance, a store that wishes to sell products to a range of customers and offer discounts. However, it is uncertain which customers are likely to convert. If customers do not respond positively to its advertisements, the business loses revenue. In such a case, unsupervised machine learning could be employed to perform customer segmentation, grouping customers into categories.
Consequently, the company can adjust its advertising strategy and target specific groups, ensuring successful conversions to purchase products. This is merely a simple business application of unsupervised machine learning. However,
there are countless other applications in fields such as anomaly detection, text mining, image recognition, dimensionality reduction, and fraud detection.
Having seen the usefulness of unsupervised machine learning, it’s now time to delve deeper and explore a variety of these models and their potential use cases across different scenarios. It’s important to note that there
are always pros and cons when using these models, and the best choice largely depends on the specific use case and constraints of a company.
K-Means Clustering
K-means clustering is a popular unsupervised machine learning model used in various industries for grouping a set of items into different categories. In this model, k is a hyperparameter that must be chosen based on either domain knowledge or standard techniques such as the elbow method or silhouette analysis. Below are the steps usually followed to group the data into k clusters; a short code sketch follows the list.
Randomly initialize a set of data points as centroids based on the value of k. If k is 10, for instance, 10 data points are chosen at random and marked as centroids.
Next, the distance between every remaining point and each centroid is computed, and each point is assigned to the centroid it is closest to. The distance metric can be Euclidean, cosine, or Manhattan distance.
In this way, every data point belongs to one of the centroids initialized in step 1.
To determine a new centroid for each cluster, the mean of the points in that cluster is computed and assigned as the new centroid. This is done for every cluster that was formed.
After finding the new set of centroids, steps 2 to 4 are repeated until convergence. In other words, the search stops once the centroids barely change from one iteration to the next.
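Here is a minimal sketch of this workflow using scikit-learn, assuming a numeric feature matrix; the synthetic data, the range of k values, and the use of silhouette scores to pick k are illustrative assumptions rather than a prescription.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Illustrative data: 300 points in 2 dimensions (replace with your own feature matrix).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(100, 2)) for c in ([0, 0], [5, 5], [0, 5])])

# Try several values of k and keep the one with the best silhouette score.
best_k, best_score = None, -1.0
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

# Fit the final model with the chosen k and inspect the centroids.
kmeans = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
print("chosen k:", best_k)
print("centroids:\n", kmeans.cluster_centers_)
print("first ten cluster labels:", kmeans.labels_[:10])
```

Because of the sensitivity to initialization discussed below, n_init reruns the algorithm from several random starts and keeps the best solution.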
Although this approach is intuitive and widely applicable, it has drawbacks, which we discuss below along with its strengths.
Pros
K-means clustering is known for its scalability and its ability to handle large datasets.
The algorithm is also interpretable, as the process of assigning data points to clusters is intuitive.
Cons
The algorithm is highly sensitive to the initial choice of centroids. Every run may start from a different set of initial centroids, which can change the final clustering output.
Even one or a few outliers in the data can distort the final clustering structure produced by this algorithm.
The optimal number of clusters must be determined by the practitioner rather than discovered by the algorithm, which can be time-consuming.
Hierarchical Clustering
One of the drawbacks of k-means clustering is having to determine the total number of clusters before generating them, which is often time-consuming and can lead to inconsistent results. In hierarchical clustering, by contrast, the number of clusters does not have to be fixed in advance; it can be chosen afterwards by inspecting the resulting hierarchy. There are two sub-categories of hierarchical clustering, discussed below; a short code sketch follows them.
Agglomerative Clustering: This method builds clusters in a bottom-up fashion. Initially, each data point is treated as its own cluster. The two closest clusters are then merged, and this step is repeated until all the data points end up in a single cluster. The method reveals the hierarchy in which clusters were formed along with how similar they are to each other. A dendrogram visualizes this tree-like structure and shows how the clusters were merged.
Divisive Clustering: This method follows a top-down approach in which all the data points start in a single cluster. The data points are then split into smaller clusters based on their distances, and the splitting continues until there are n clusters, where n is the total number of data points. A larger n means more clusters and higher computational cost. The results are again easy to visualize with a dendrogram, which also shows the similarities used to divide the clusters.
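Below is a minimal sketch of agglomerative clustering with SciPy, assuming a small numeric feature matrix; the synthetic data, the Ward linkage, and the cut at three clusters are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Illustrative data (replace with your own feature matrix).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(50, 2)) for c in ([0, 0], [4, 4], [0, 4])])

# Agglomerative (bottom-up) clustering: Ward linkage merges the two clusters
# whose union gives the smallest increase in within-cluster variance.
Z = linkage(X, method="ward")

# Cut the tree to obtain flat cluster labels; here we ask for 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print("first ten labels:", labels[:10])

# Optionally visualize the merge hierarchy as a dendrogram.
# import matplotlib.pyplot as plt
# dendrogram(Z)
# plt.show()
```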
Pros
There is no requirement to specify the number of clusters when using the hierarchical clustering approach.
This method is more intuitive as there is a hierarchy in the structure, and it is easier to interpret.
Hierarchical clustering is deterministic: running it multiple times on the same data produces the same clustering pattern.
Cons
The time complexity to run this algorithm is quite high. If we have a large dataset containing millions of records, it can be better to look for alternative algorithms.
The algorithm is also sensitive to outliers in the data, which can affect how the clusters are formed.
The choice of the distance metric used to split or merge clusters can have a significant impact on the final cluster outputs.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN)
This is an unsupervised machine learning approach that uses the relative density of points with respect to each other to determine the clusters. Below are the steps followed in the DBSCAN approach; a short code sketch follows the list.
Choose Initial Parameters: Choose an arbitrary starting data point that has not been visited. Set your parameters: eps (epsilon) is the maximum distance between two samples for them to be considered
as in the same neighborhood, and minPts is the minimum number of samples in a neighborhood for a data point to qualify as a core point.
Determine Core Points: For your starting point, calculate the number of data points within an eps radius to determine its density. If there are at least minPts within this radius, mark
the starting data point as a core point; otherwise, mark it as noise (this could be updated later).
Expand Clusters: For each core point, if it’s not already assigned to a cluster, create a new cluster. Find all points within the distance eps of the core point (including the core point itself)
and assign them to the same cluster. If they also have minPts within distance eps, they are also core points, so repeat the process for those points.
Assign Border Points: If a data point is within the eps distance of multiple clusters, assign it to the cluster of the first core point encountered.
Iteration: Continue the process until all points have been visited, assigned to a cluster, or marked as noise. This may require going back to previously marked noise points to check if they are in the eps radius of a newly found core point.
End: The algorithm stops when all points have been marked as either core points, border points, or noise, and assigned to clusters appropriately.
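Here is a minimal sketch using scikit-learn, where the library's eps and min_samples parameters correspond to the eps and minPts described above; the synthetic data and the specific parameter values are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative data: two dense blobs plus a few scattered noise points.
rng = np.random.default_rng(1)
blobs = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in ([0, 0], [5, 5])])
noise = rng.uniform(low=-3, high=8, size=(10, 2))
X = np.vstack([blobs, noise])

# eps is the neighborhood radius, min_samples is the minimum number of points
# (including the point itself) required for a point to qualify as a core point.
db = DBSCAN(eps=0.5, min_samples=5).fit(X)

labels = db.labels_                      # -1 marks points labeled as noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters)
print("noise points:", int(np.sum(labels == -1)))
```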
Pros
There is no need to specify the number of clusters as we did in the k-means clustering approach.
It is more robust to outliers in the data as it works on the concept of core and noise points.
It can handle arbitrarily shaped clusters, unlike the algorithms discussed above.
Cons
It requires careful tuning of the minPts and eps parameters depending on the dimensionality and density of the dataset, which can be cumbersome.
It cannot be used to make predictions on new data samples but instead can only cluster the data points that were used during training.
Running DBSCAN multiple times may not always produce identical results, since border points reachable from more than one cluster can be assigned differently depending on the order in which points are processed.
Gaussian Mixture Models (GMMs)
Gaussian mixture models work on the principle of expectation maximization (EM), where data points are assigned to clusters based on probability scores. The method assumes that the data are generated from a mixture of Gaussian (normal) distributions. There is an initialization step for the means, covariances, and mixing weights, after which the expectation maximization procedure determines the cluster for each data point. Below is a detailed explanation of the steps; a short code sketch follows the list.
The first step is to initialize the means, covariances, and mixing weights. This can be done randomly or with the help of the k-means clustering approach.
Next, we compute the posterior probability of each data point belonging to each of the k components. This is known as the expectation (E) step.
Using these probabilities, the means, covariances, and mixing weights are re-estimated so as to maximize the expected likelihood. This is the maximization (M) step.
Steps 2 and 3 are repeated until convergence, that is, until the parameters and the cluster memberships stop changing meaningfully between iterations.
Finally, we are left with data points grouped into clusters under the assumption that each cluster follows a Gaussian distribution.
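A minimal sketch with scikit-learn's GaussianMixture follows, assuming two elliptical clusters of synthetic data; the number of components, the covariance type, and the k-means initialization are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Illustrative data: two elongated (elliptical) clusters.
rng = np.random.default_rng(2)
X = np.vstack([
    rng.multivariate_normal(mean=[0, 0], cov=[[3.0, 1.5], [1.5, 1.0]], size=200),
    rng.multivariate_normal(mean=[8, 3], cov=[[1.0, -0.5], [-0.5, 2.0]], size=200),
])

# Fit a two-component GMM; init_params="kmeans" initializes the means with
# k-means, as mentioned in the steps above.
gmm = GaussianMixture(n_components=2, covariance_type="full",
                      init_params="kmeans", random_state=0).fit(X)

hard_labels = gmm.predict(X)        # hard assignments (most probable component)
soft_probs = gmm.predict_proba(X)   # soft assignments (posterior probabilities)
print("mixing weights:", gmm.weights_)
print("posterior for the first point:", soft_probs[0])
```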
Pros
One advantage of this approach is that it can handle elliptical as well as spherical cluster shapes, whereas k-means clustering favors spherical clusters.
It also performs soft clustering, which is handy because the model provides a confidence score when it assigns data points to each of the clusters.
They have higher flexibility as compared to other models such as k-means as they could account for clusters of different shapes and sizes.
Cons
The algorithm is highly sensitive to the initial means and covariances, so a poor initialization can hurt the performance of the model.
Determining the optimal number of mixture components can be difficult and often comes down to trial and error.
On large datasets, the computational cost is high and the model can be slow to converge.
Autoencoders
These models encode the input into a lower-dimensional representation. They do not require output labels; they only require the input data in the form of features. Once the lower-dimensional representation is produced, a decoder uses it to reconstruct the original data. Note that there can be discrepancies between the decoded output and the original input. This error is known as the reconstruction loss, and it is minimized during training. Below is a description of how autoencoders work; a short code sketch follows the list.
Using the input features, the encoder produces a low-dimensional representation. The size of this representation is usually chosen so that it still captures the data well.
The encoded representation is then fed to a decoder, which attempts to reconstruct the original input. The reconstruction is usually not perfect.
The reconstruction loss is therefore minimized during training: the weights are adjusted so that the reconstructed output closely matches the input data.
As the weights are updated and the loss decreases, we end up with a network whose weights carry the information needed to translate a given set of inputs into a lower-dimensional representation that is cheaper to compute with.
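Here is a minimal sketch of a dense autoencoder in Keras; the input dimensionality, layer sizes, latent size, and the random training data are illustrative assumptions, and in practice you would substitute your own feature matrix.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

input_dim = 20      # number of input features (an assumption for this sketch)
latent_dim = 3      # size of the compressed (encoded) representation

# Encoder: compress the input into a lower-dimensional code.
encoder = keras.Sequential([
    layers.Input(shape=(input_dim,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(latent_dim, activation="relu"),
])

# Decoder: reconstruct the original input from the code.
decoder = keras.Sequential([
    layers.Input(shape=(latent_dim,)),
    layers.Dense(16, activation="relu"),
    layers.Dense(input_dim, activation="linear"),
])

# The autoencoder is trained to reproduce its own input, so no labels are needed;
# mean squared error serves as the reconstruction loss.
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mse")

X = np.random.default_rng(3).normal(size=(1000, input_dim)).astype("float32")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)

codes = encoder.predict(X, verbose=0)   # lower-dimensional representation
print("encoded shape:", codes.shape)    # (1000, 3)
```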
Pros
They can be used to detect anomalies in the data, often more effectively than other methods.
They can be used to reduce the dimensionality of the input data which can lead to faster computation.
They are better at reducing the noise in the input data than many other models.
Cons
When the input data is large, running many training epochs to learn the encoded representations can be computationally expensive.
It can be difficult to interpret the meaning of the encoded (hidden) representations, as they are not intuitive.
Principal Component Analysis (PCA)
PCA is a statistical technique that relies on the covariance between features to determine a lower-dimensional space. The algorithm does not require output training labels, unlike supervised techniques such as linear discriminant analysis (LDA), which use the output labels to determine the best features. The result is a set of components that capture most of the variance in the dataset, obtained with the help of eigenvalues and eigenvectors. Below are the steps used to perform principal component analysis (PCA); a short code sketch follows the list.
The initial step before applying principal component analysis (PCA) is to standardize the data. PCA is highly sensitive to the scale of the data, and in practice features often come in very different scales, so it is important to standardize them before feeding the data to the model.
We compute the covariance of every feature with every other feature. This gives a good picture of how much the features vary together across the data.
Next, we compute the eigenvalues and eigenvectors of this covariance matrix.
We then sort the eigenvalues in descending order to see how much variance each component explains relative to the others, and we choose how many components to keep.
Finally, we multiply the standardized data by the selected eigenvectors to obtain the transformed features. This reduced representation is computationally cheaper to train on and to make predictions with.
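The following NumPy sketch mirrors these steps on synthetic data; the correlated features, the standardization choice, and keeping two components are illustrative assumptions (in practice a library routine such as scikit-learn's PCA would typically be used).

```python
import numpy as np

# Illustrative data: 200 samples with 5 correlated features (replace with your own).
rng = np.random.default_rng(4)
base = rng.normal(size=(200, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(200, 3))])

# Step 1: standardize each feature to zero mean and unit variance.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix of the standardized features.
cov = np.cov(X_std, rowvar=False)

# Step 3: eigenvalues and eigenvectors of the (symmetric) covariance matrix.
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 4: sort components by descending eigenvalue (explained variance).
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained = eigvals / eigvals.sum()
print("explained variance ratio:", np.round(explained, 3))

# Step 5: project the data onto the top-2 components.
n_components = 2
X_reduced = X_std @ eigvecs[:, :n_components]
print("reduced shape:", X_reduced.shape)
```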
Pros
Using this approach leads to dimensionality reduction which is useful for scenarios where we have a large number of features in our data.
It can also reduce the data down to two or three dimensions so that the features can be visualized in a plot.
This can reduce overfitting when we have a higher number of features as compared to the size of the dataset.
Cons
The principal components are hard to interpret. The original features are easier to reason about because they represent real-world quantities, whereas the components are linear combinations of those features rather than features themselves.
It is highly sensitive to scaling and having data of different scales can impact the performance of PCA to a large extent.
If there are outliers in the data, PCA may not perform well compared to more robust unsupervised machine learning models.
Conclusion
Exploring unsupervised machine learning models is a captivating endeavor. Each model, from k-means clustering to principal component analysis, carries unique strengths and limitations, shaping how we approach and solve problems. The key lies in understanding which tool to use given the specifics of your data and objectives.
By deepening your knowledge of these models, you’re empowering yourself to craft effective solutions and make a meaningful impact in the world of data science. Despite the challenges, the rewards are significant — the ability
to transform data into valuable insights and solve complex problems.
So, continue to learn, question, and innovate. Every step forward contributes to shaping the future of technology. Thank you for investing your time and curiosity into this important field. The realm of machine learning awaits your
contribution.