Anton Chuvakin wrote a great blog post about the future of machine learning in cyber-security. My response was a bit too long for a comment, so I decided to publish it as a post of my own instead.
There are three different things that were called out in the blog:
1. Deterministic vs. non-deterministic:
This implies that, given the same input, the algorithm can produce different outputs. Users hate such surprises, and data scientists hate how hard this makes testing. This is why it’s fairly uncommon, unless there’s a good reason for it, such as when dealing with very big data. For example, we have an unsupervised, non-deterministic clustering algorithm that uses randomness so it can process the data without building huge distance matrices, which would take more storage than the original input.
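To make the trade-off concrete, here is a minimal sketch of the general idea. It is not our actual algorithm: the function name `sample_and_assign`, the toy points, and the choice of "pick k random points as centers" are all assumptions made for illustration. It computes only n×k distances instead of a full n×n matrix, and it shows how a fixed seed brings back the repeatability that testing needs.

```python
import math
import random


def sample_and_assign(points, k, seed=None):
    """Hypothetical sketch: pick k random points as centers, then assign
    every point to its nearest center. Only n*k distances are computed,
    so no n*n distance matrix is ever built."""
    rng = random.Random(seed)  # seed=None: centers may differ on every run
    centers = rng.sample(points, k)
    labels = []
    for p in points:
        distances = [math.dist(p, c) for c in centers]
        labels.append(distances.index(min(distances)))
    return centers, labels


# Toy data, invented for this example.
points = [(0.0, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (9.0, 0.1), (9.1, 0.3)]

# With a fixed seed, two runs on the same input give the same output.
run_a = sample_and_assign(points, k=3, seed=42)
run_b = sample_and_assign(points, k=3, seed=42)
print(run_a == run_b)  # True
```

Without the seed, two runs may pick different centers and so return different clusters for the same input, which is exactly the surprise users and testers dislike.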
2. Supervised vs. unsupervised:
“Supervised learning” can mean a lot of things — it is truly a very broad spectrum. The question, however, is whether there is value in “learning” and — equally important — what to learn, and from what teacher?
If you’re building an algorithm for a self-driving car, you probably do not want to be doing any active “learning” from the human in production.
However, if you’re building a security solution, it might be very helpful to learn from the user, but the trick is knowing what to learn. The biggest challenge in this case is over-fitting: learning too much from the user just scales up their strengths and, worse, their flaws:
“The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency.” — Bill Gates
3. Explainable vs. unexplainable:
I lose much sleep over this… A great algorithm, good results, but the user isn’t convinced. They want to know “how”: not how the algorithm works, but how they could reach the same result themselves. It’s natural for us to fear what we do not understand…
IMHO, there is no magic answer here: the data scientist must be an expert in both the algorithm and the problem being solved. That is the only way I have found to force, truly force, some explanation onto an otherwise unexplainable algorithm. The explanation doesn’t have to be about the algorithm, but it must make sense to the user.
“fast / correct / explainable – pick any two”
This is only partly true: if your algorithm is 100% “correct”, you rarely need to be “explainable”. You just automate it and call it a blacklist; you only have to be “fast (enough) & correct.”
The problem, in both machine learning and the vague field of cyber-security, is that algorithms are rarely that correct (regardless of speed), and that’s where you need humans: to investigate. Today we have many “anomaly/behavior” detection tools that fire alerts with varying degrees of “confidence” (because “correctness” is hard on marketing), and we have humans do the post-analysis. Those detection tools are truly amazing and, I predict, we will have more of them, more specialized and probably less explainable, at least on their own. In parallel, we need a different set of solutions that help analysts take all those highly specialized, hard-to-understand alerts and make them understandable.
We need all three!