Jun 10, 2020 by Andrei Demit
Demystifying Machine Learning: Part II
A Three-Part Series on Machine Learning Techniques
“If intelligence was a cake, unsupervised learning would be the cake, supervised learning would be the icing on the cake, and reinforcement learning would be the cherry on the cake. We know how to make the icing and the cherry, but we don’t know how to make the cake.” – Yann LeCun
Unsupervised Learning is a type of Machine Learning (ML) method used to pull knowledge from a set of input data without labeled responses.
Unlike Supervised Learning, Unsupervised Learning involves no teaching from humans. It does not use labeled data to predict future outcomes (to refresh your memory about labeled data, check Demystifying Machine Learning: Part I). Its purpose is to structure unlabeled data and search it for previously unknown patterns.
In Supervised Learning, the machine tries to approximate a function that maps a given input to a predicted output. In Unsupervised Learning, the machine tries to describe the input data succinctly.
To make this concept clearer, let’s look at some examples:
When we see a group of animals, we can easily tell that some of them belong to the same family just by watching them, without a teacher telling us which family each one belongs to.
Also, take football, for instance. By watching enough football matches and how players interact on the field, we can figure out the rules of the game and be able to play it ourselves.
Babies regularly use Unsupervised Learning to understand their world by watching other people. They may not understand what something is, but they can quickly build associations such as “if I do that action, I get that thing or feeling.”
Millions of years of evolution have shaped the human brain for Unsupervised Learning. In some cases, however, a computer can do it too. It is more limited, but it gets better every year.
We can think of Unsupervised Learning as a way to model the world.
The most common uses for Unsupervised Learning algorithms can be divided into four categories: clustering, anomaly detection, pattern search, and dimensionality reduction.
Clustering algorithms take a set of entities as input and group them by similar characteristics.
Let’s take the example of a Marketing Campaign. We begin with a set of scattered data about a population and our goal is to have a clear understanding of our target groups. That means we want to segment the population into smaller clusters with similar demographics and purchasing habits. This knowledge allows us to target only our type of customer persona, thus maximizing the effectiveness of our marketing budget.
This is the kind of problem that the clustering algorithms of Unsupervised Learning can help solve.
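As a rough illustration of the segmentation idea, here is a minimal sketch of k-means, one common clustering algorithm (the article does not name a specific one, so this choice is an assumption). The customer data, the two features (age and monthly spend), and the function name `k_means` are hypothetical. Real data would usually be scaled first so that no single feature dominates the distance.

```python
import math

def k_means(points, k, iterations=20):
    """Group points into k clusters with plain k-means (Lloyd's algorithm)."""
    # Assumption: the first k points serve as starting centroids,
    # which keeps this sketch deterministic.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical customers: (age, monthly spend).
customers = [(22, 30), (60, 200), (25, 35), (58, 210), (24, 40), (62, 190)]
labels, centroids = k_means(customers, k=2)
```

Here `labels` holds one cluster index per customer, and `centroids` describes the "average customer" of each segment, which is the kind of summary a marketing team could use to define a persona.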
Anomaly detection algorithms aim to figure out whether a dataset contains an anomalous entry. By anomalous entry, we mean an event, data point, or record that deviates considerably from the others in the dataset. Usually, this involves time-series data.
Time-series data is nothing more than a sequence of values over time. Each entry is typically a pair of items: a timestamp for the metric that was measured and the value of that metric at that specific time.
Let’s consider the following example where we want to monitor the health of a company.
If we track the number of news items, tweets, and tags related to a specific company, along with its stock price, over time (time-series data), we can create a baseline for normal behavior in its primary KPIs. Once we understand the baseline, a time-series anomaly detection system can follow the cyclical patterns of behavior within key datasets. When a key dataset registers a value that diverges considerably from the baseline, we have an anomaly. That tells us something happened around the timestamp of the detected anomaly. By also analyzing other key time-series datasets, we can deliver valuable business insights in almost real time.
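The baseline idea can be sketched in a few lines. This is a deliberately simple version, assuming the baseline is just the mean and standard deviation of an initial window and that "considerable" means more than three standard deviations away. It does not model cyclical patterns. The function name `find_anomalies` and the daily mention counts are hypothetical.

```python
from statistics import mean, stdev

def find_anomalies(series, baseline_size, threshold=3.0):
    """Return (timestamp, value) pairs that deviate strongly from the baseline."""
    # Build the baseline from the first `baseline_size` observations.
    baseline = [value for _, value in series[:baseline_size]]
    center = mean(baseline)
    spread = stdev(baseline)
    # Flag later values that sit more than `threshold` standard
    # deviations away from the baseline mean.
    return [
        (timestamp, value)
        for timestamp, value in series[baseline_size:]
        if abs(value - center) > threshold * spread
    ]

# Hypothetical daily mention counts for a company: (timestamp, value).
mentions = [
    ("2020-06-01", 100), ("2020-06-02", 104), ("2020-06-03", 98),
    ("2020-06-04", 101), ("2020-06-05", 97), ("2020-06-06", 103),
    ("2020-06-07", 250), ("2020-06-08", 99),
]
anomalies = find_anomalies(mentions, baseline_size=6)
```

Each flagged entry keeps its timestamp, so it can be lined up against other time series to investigate what happened on that day.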
Pattern search algorithms aim to find recurring patterns in the input data. They are a valuable tool for marketing because they can identify customers’ buying behavior, which helps the advertising team target customers with ads for the products they will most likely buy. This inference is an association problem that pattern search can solve.
As the name suggests, an association problem arises when we want to find the rules in the input data that link one event to another. A rule defines the connection between two elements X and Y that may seem to have nothing in common: if X is in an event, Y is likely to appear in the same event.
We can use this approach to build associations such as the following:
- If a customer buys a pizza, then he will likely also buy a beer.
- If a customer wants to buy a car and lives in a cold climate, he is more likely to opt for a preheated windshield.
- If a customer listens to classical music, then he is more likely to visit certain kinds of places.
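To make rules like these measurable, two standard quantities are commonly used: support (how often X and Y appear together across all events) and confidence (how often Y appears among the events that contain X). The sketch below computes both for the pizza and beer rule. The shopping baskets and the function name `rule_metrics` are made up for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Compute support and confidence for the rule antecedent -> consequent."""
    # Transactions that contain every item of the antecedent (X).
    with_antecedent = [t for t in transactions if antecedent <= t]
    # Of those, the ones that also contain the consequent (Y).
    with_both = [t for t in with_antecedent if consequent <= t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence

# Hypothetical shopping baskets, one set of items per purchase.
baskets = [
    {"pizza", "beer"},
    {"pizza", "beer", "salad"},
    {"pizza", "soda"},
    {"salad", "water"},
    {"pizza", "beer", "soda"},
]
support, confidence = rule_metrics(baskets, {"pizza"}, {"beer"})
```

In this toy data, pizza and beer appear together in 3 of 5 baskets (support 0.6), and 3 of the 4 pizza baskets also contain beer (confidence 0.75).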
Dimensionality reduction algorithms help find the most relevant features in a dataset. Sometimes, datasets have many features that are correlated, thus redundant. Limiting the number of features to analyze makes it easier to visualize the training set and work on it.
The advantages of this technique are evident: fewer features to analyze mean faster computation and lower storage requirements. However, there are also some disadvantages: eliminating features means losing some information, and with techniques such as principal component analysis (PCA), we may not know in practice how many principal components to keep.
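As a hedged sketch of how two correlated features can be collapsed into one, the code below estimates the first principal component with power iteration and projects the data onto it. This is one simple way to approximate PCA, not a full implementation. The data and the function names `first_principal_component` and `project` are hypothetical.

```python
import math

def first_principal_component(rows, iterations=100):
    """Estimate the direction of greatest variance using power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Repeatedly multiply a vector by the covariance matrix and normalize it.
    # Assumption: the all-ones start is not orthogonal to the true component
    # and the data has nonzero variance.
    vec = [1.0] * d
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return means, vec

def project(rows, means, component):
    """Reduce each row to a single coordinate along the component."""
    return [
        sum((x - m) * c for x, m, c in zip(row, means, component))
        for row in rows
    ]

# Hypothetical data: the second feature is roughly twice the first,
# so the two features are largely redundant.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.0)]
means, component = first_principal_component(data)
reduced = project(data, means, component)
```

Each two-feature row is replaced by a single number in `reduced`. The small amount of variation that does not lie along `component` is discarded, which is the information loss mentioned above.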
Unsupervised learning methods find application in bioinformatics for sequence analysis and genetic clustering, data mining, pattern mining, image segmentation, and computer vision for object recognition, recommender systems, targeted marketing, big data visualization, and customer segmentation.
Another crucial thing to keep in mind is that the data explosion is accelerating. Cloud, mobility, IoT, social media, analytics, and more than 75 billion connected devices in 2020 have created 90% of the world’s data in the last two years alone. Even more exciting, 80% of the world’s data is unstructured. Hence, there is a golden opportunity for Unsupervised Learning to take advantage of it.
In the next and final part of this series, we will dive into the Reinforcement Learning technique.