By Daniel Felman, Lead Data Scientist, SecBI

In Chinese philosophy, Yin and Yang represent how seemingly opposing forces can actually complement one another and achieve harmony. In cybersecurity, this ancient idea aptly describes the relationship between supervised and unsupervised machine learning. Supervised machine learning is mainly used for detection, while unsupervised machine learning is mainly used for clustering. When cybersecurity vendors provide a solution for one of the fronts in the cyber defense fortress, they tend to fall under one category, either supervised or unsupervised, as if the two were independent solutions. In fact, the two can cooperate to achieve a common goal: protecting your whole network from sneaky intruders.

Supervised Machine Learning

Supervised machine learning is commonly applied in cybersecurity to phishing prevention, fraud detection, network traffic analysis, and file scanning. By training on known malicious behavior, an algorithm learns to automatically detect incidents similar to those previously seen and occasionally, although infrequently, even malware it has never seen. But as technology evolves, so does hacking: new algorithms and techniques keep emerging that are designed to be unique and undetectable. In fact, every now and then hackers use supervised machine learning themselves to learn how to identify weaknesses in monitoring and defense mechanisms and evade them, creating a cat-and-mouse game that does not scale for any automated system.
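To make the idea of training on known behavior concrete, here is a minimal sketch of a supervised detector: a nearest-centroid classifier written with the Python standard library only. The feature set (URL length, number of dots, number of digits), the sample values, and the function names are all assumptions made for illustration; they do not come from any particular product, and a real detector would use far richer features and a stronger model.

```python
from math import dist
from statistics import mean

def train_centroids(samples, labels):
    """Compute one centroid (mean feature vector) per label."""
    grouped = {}
    for features, label in zip(samples, labels):
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }

def predict(centroids, features):
    """Return the label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Hypothetical labeled history: (URL length, dots in URL, digits in URL).
train_x = [
    (20, 1, 0), (25, 2, 1), (22, 1, 0),      # known benign
    (80, 6, 12), (95, 5, 15), (70, 7, 9),    # known malicious
]
train_y = ["benign"] * 3 + ["malicious"] * 3

model = train_centroids(train_x, train_y)

# New, unlabeled observations are assigned to the nearest known class.
verdict_known_like = predict(model, (85, 6, 11))
verdict_benign_like = predict(model, (23, 1, 0))
print(verdict_known_like, verdict_benign_like)
```

The sketch also shows the limitation described above: the model can only judge new samples by their resemblance to what it was trained on, so an attack crafted to look like the benign examples would be classified as benign.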

Unsupervised Machine Learning

Unsupervised machine learning algorithms, on the other hand, can associate and cluster different communications based on similarities in the individual and collective behavior of users and destination hosts. Then, by learning baselines and deviations from them, the algorithm can not only distinguish abnormal behavior but also group similar activities together to organize alerts and reduce noise. Translated into a product, this technique eases the day-to-day work of SOC teams by grouping together all the malicious activity generated from the same source, along with any other activity that behaves similarly; it can even identify incidents in their early stages, so they can be mitigated before they do any harm. It is important to note that, unless implemented carefully, this capability is prone to false positive alerts whenever it encounters unusual but legitimate cases, such as a software update or online streaming.
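The baseline-and-deviation idea can be sketched without any labels at all. The example below is an illustration under stated assumptions, not a description of a specific product: it assumes a hypothetical history of daily outbound traffic volumes per host, learns each host's own baseline (mean and standard deviation), and flags hosts whose traffic today lies more than a chosen number of standard deviations away. The host addresses, the numbers, and the threshold of 3.0 are all made up.

```python
from statistics import mean, stdev

def find_deviations(history, today, threshold=3.0):
    """Flag hosts whose traffic today deviates from their own baseline.

    history: host -> list of past daily volumes (the learned baseline)
    today:   host -> volume observed today
    Returns host -> z-score for every host beyond the threshold.
    """
    flagged = {}
    for host, past in history.items():
        baseline, spread = mean(past), stdev(past)
        if spread == 0:
            continue  # no variation in the history, z-score is undefined
        z_score = (today[host] - baseline) / spread
        if abs(z_score) > threshold:
            flagged[host] = round(z_score, 1)
    return flagged

# Hypothetical daily outbound volumes in megabytes.
history = {
    "10.0.0.5": [100, 110, 90, 105, 95],
    "10.0.0.7": [200, 210, 190, 205, 195],
    "10.0.0.9": [50, 55, 45, 52, 48],
}
today = {"10.0.0.5": 102, "10.0.0.7": 900, "10.0.0.9": 51}

flagged = find_deviations(history, today)
print(flagged)
```

Only the host with the sudden jump is flagged. Note that a large software update would produce exactly the same kind of spike, which is the false positive risk mentioned above: a deviation alone says that something is unusual, not that it is malicious.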

A New Solution: Semi-Supervised Machine Learning

In cybersecurity, these two machine learning branches can beautifully complement each other in cycles.
To start, basic profiles are extracted from all sources, destinations, and the communications between them using supervised machine learning. Then the first clustering (unsupervised) iteration is performed based on these individual properties. After that, another supervised round of detection is run, this time extracting more meaningful, cumulative behavior indicators from the clustering arrangement. The cycle continues with another clustering run that takes into account all the detections found so far.
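The cycle described above can be sketched as a small loop. This is a toy illustration under heavy assumptions, not the actual algorithm of any product: simple rules stand in for both the trained detector and the clustering step, the host and user names are invented, and the stopping rule (stop when a round yields no new detections) is an assumed criterion.

```python
def cluster_users(comms, flagged_hosts):
    """Clustering stand-in: group the users that share a flagged destination."""
    return {user for user, host, _ in comms if host in flagged_hosts}

def collective_detections(comms, cluster, flagged_hosts):
    """Detection stand-in working on collective properties of the cluster:
    flag hosts contacted by at least two cluster members and by nobody else."""
    visitors = {}
    for user, host, _ in comms:
        visitors.setdefault(host, set()).add(user)
    return {
        host
        for host, users in visitors.items()
        if host not in flagged_hosts and len(users) >= 2 and users <= cluster
    }

def run_cycle(comms, max_rounds=10):
    """Alternate clustering and detection until no new detections appear."""
    # Round zero: verdicts on individual communications (precomputed here).
    flagged = {host for _, host, suspicious in comms if suspicious}
    cluster = set()
    for _ in range(max_rounds):
        cluster = cluster_users(comms, flagged)
        new_hosts = collective_detections(comms, cluster, flagged)
        if not new_hosts:
            break  # assumed stopping rule: the arrangement is stable
        flagged |= new_hosts
    return cluster, flagged

# Hypothetical traffic: (user, destination host, individually suspicious?).
comms = [
    ("alice", "cdn.example", False),
    ("alice", "evil-c2.example", True),
    ("bob", "evil-c2.example", True),
    ("bob", "drop.example", False),
    ("alice", "drop.example", False),
    ("carol", "cdn.example", False),
    ("carol", "news.example", False),
]

cluster, flagged = run_cycle(comms)
print(sorted(cluster), sorted(flagged))
```

In this toy data, `drop.example` looks harmless when each communication is judged on its own. It is only flagged in the second pass, because the clustering step revealed that it is visited exclusively by the users already tied to the first detection.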
In other words, whenever the supervised machine learning algorithm runs, it generates new analytics that can be used by the unsupervised machine learning algorithm. And whenever the unsupervised machine learning algorithm runs, it collects the analyzed entities (for example, the network traffic between a user and a destination host) and rearranges them into new clusters, which the supervised algorithm can then analyze again to find suspicious activity in the new collective properties. This cycle goes on until the harmonic sweet spot is found.
In conclusion, it is safe to assert that when used together, supervised and unsupervised machine learning complement and strengthen each other, like Yin and Yang, and should be utilized as such.