Pros and Cons of Unsupervised Vs Supervised Machine Learning
By Oren Domaczewski, Product Manager, SecBI
Machine learning in cyber threat detection has been hyped as the answer to increasingly ineffective signature-based anti-virus solutions. Cybersecurity expert Oren Domaczewski argues that it often makes the security analyst’s job more difficult.
Much of what the industry calls machine learning is “supervised” machine learning, which is based on manual human feedback. In the cyber arms-race, evolution happens in milliseconds, making the supervised approach not only inaccurate but also unscalable and human-dependent.
“Unsupervised” machine learning, on the other hand, doesn’t just detect anomalies; it groups together all related evidence and then investigates each group to determine whether it is indicative of an attack. This process saves the analyst hours that would otherwise be spent digging through data.
These two types of machine learning are used in different settings. Supervised machine learning is often used in file analysis use cases, such as endpoint anti-virus, because there are few changes in the data being analyzed and labeled data is readily available.
In this example, supervised machine learning works well because file execution has a narrow scope, the APIs are known, API use and abuse are well documented, only a few applications are used each day, and applications typically access specific content for their operations.
This type of learning thrives in a setting where there is a vast history of good and bad application signatures to draw labeled data from, every application can be broken down to its API details, and the supervised model can be applied. The model falls short when malware is constantly improving, increasing its ability to “blend in” with legitimate applications or avoiding detection through advanced techniques such as “memory-only” execution.
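The supervised setting described above can be illustrated with a minimal sketch. It assumes each application has already been reduced to a small vector of API-usage counts and that labeled examples exist; the feature names, the sample values, and the nearest-centroid method are all illustrative assumptions, not a description of any vendor's model.

```python
from math import dist

# Hypothetical labeled samples: counts of (file writes, network calls,
# process injections) observed per application. Values are illustrative only.
LABELED = [
    ((2.0, 1.0, 0.0), "benign"),
    ((3.0, 2.0, 0.0), "benign"),
    ((1.0, 0.0, 0.0), "benign"),
    ((9.0, 8.0, 4.0), "malicious"),
    ((8.0, 9.0, 5.0), "malicious"),
    ((10.0, 7.0, 3.0), "malicious"),
]

def train(samples):
    """Compute one centroid (mean feature vector) per label."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(sum(col) / len(rows) for col in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

centroids = train(LABELED)
print(predict(centroids, (9.0, 9.0, 4.0)))  # resembles the malicious examples
print(predict(centroids, (3.0, 2.0, 1.0)))  # malware that "blends in" lands near benign
```

The second prediction shows the shortfall in miniature: a sample that performs one injection but otherwise mimics ordinary applications sits closest to the benign centroid, so a model trained only on past labels lets it through.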
Support of network traffic analysis
Unsupervised machine learning, on the other hand, is used in highly dynamic use cases such as network traffic analysis (NTA), where the data changes very frequently, new behaviors emerge constantly, and labels are scarce. In these instances, unsupervised machine learning is preferred because the network model is well defined. Millions of new domains, hosts, web pages, websites, and web applications appear every day, and users interact with many websites without knowing it (e.g. through redirections and embedded content). This type of learning thrives in a setting where the half-life of web-based learnings is extremely short as more and more applications become “web aware”.
However, it is important to note that the “physics” of the protocol dictates behavior. Behavior is sporadic and is determined by the user and the server being accessed. The weakness of unsupervised machine learning shows when the attack surface is extremely large, leaving many places to hide within the network.
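To make the idea of grouping related evidence without labels concrete, here is a minimal sketch. It assumes proxy-log records reduced to (internal host, destination domain, destination IP) and treats two records as related when they share a destination domain or IP; the record layout, the sample data, and the linking rule are assumptions for illustration, and real NTA products use far richer features.

```python
from collections import defaultdict

# Hypothetical proxy-log records: (internal host, destination domain, destination IP).
LOGS = [
    ("host-a", "cdn.example.net", "203.0.113.5"),
    ("host-b", "cdn.example.net", "203.0.113.5"),
    ("host-b", "update.example.org", "198.51.100.7"),
    ("host-c", "news.example.com", "192.0.2.10"),
    ("host-d", "mirror.example.org", "198.51.100.7"),
]

def cluster_records(records):
    """Group records that share a destination domain or IP (connected components)."""
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving keeps lookups short
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    first_seen = {}
    for idx, (_host, domain, ip) in enumerate(records):
        for key in (("domain", domain), ("ip", ip)):
            if key in first_seen:
                union(idx, first_seen[key])  # link to the earlier record with this value
            else:
                first_seen[key] = idx

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())

for members in cluster_records(LOGS):
    hosts = sorted({LOGS[i][0] for i in members})
    print(members, hosts)
```

No labels are involved: records 2 and 4 end up together only because two different domains resolve to the same IP, which is the kind of relationship an analyst would otherwise have to find by hand.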
Baseline abuse among others
Both types of machine learning have their merits and faults, and each has advantages over the other depending on the situation. In general, endpoint security vendors rely on supervised learning, while network traffic analysis vendors use unsupervised learning. However, both learning types may use a baseline. Baselining is a technique that has sadly been abused by cybersecurity vendors and has recently earned a very bad reputation because it has created, and continues to create, huge numbers of false positives, sending analysts chasing false leads. In a world where hackers continually change their tactics to evade detection, defining baselines without a proper unsupervised machine learning model can be frustrating and misleading. This is a fact of life for all types of vendors in threat and malware detection, one that leads to floods of alerts and anomalies for security analysts and makes their job more and more difficult to perform.
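The false-positive problem with naive baselining can be shown with a small sketch. It assumes a baseline built as the mean and standard deviation of a host's daily outbound traffic, with an alert raised on anything more than three standard deviations away; the traffic figures and the threshold are hypothetical.

```python
from statistics import mean, stdev

def baseline_alerts(history, observed, threshold=3.0):
    """Flag observed values more than `threshold` standard deviations from the baseline mean."""
    mu, sigma = mean(history), stdev(history)
    return [x for x in observed if abs(x - mu) > threshold * sigma]

# Hypothetical daily outbound traffic for one host, in megabytes.
history = [100, 104, 98, 101, 97, 103, 99]

# The host starts a legitimate nightly backup: nothing malicious, but off-baseline.
observed = [102, 450, 460, 455]
print(baseline_alerts(history, observed))

# A slow exfiltration that stays close to normal volume raises nothing.
print(baseline_alerts(history, [105]))
```

Every backup day is flagged even though nothing is wrong, while traffic that stays inside the learned range produces no alert at all, which is why a baseline on its own tends to flood analysts with anomalies rather than incidents.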
Better forecasting in threat detection
In contrast, SecBI has developed an unsupervised machine learning algorithm that ingests vast quantities of network logs and clusters related evidence, consolidating the full scope of each incident for better threat visibility. Only then does SecBI feed these clusters into its “cluster analysis” algorithm, a supervised learning model that prioritizes them according to their threat level to the organization, thereby reducing false positives and saving security analysts time and effort so they can focus on mitigating threats. In conclusion, unlike most threat and malware detection technologies, SecBI’s machine learning does not use a baseline, resulting in better detection and a significant decrease in false positives.