Their offline training accuracy is garbage: 16% precision, so all of the real work is basically being done in the online training portion, which gets it to a respectable 82%+ precision.
But they don't tell you how many alerts they had to label to get those numbers. Maybe over the long run you get those numbers, but you really want to know if it takes 10 or 10,000 examples to get there.
Also, their dataset distribution is very different from reality: 7% of their dataset is annotated as real anomalies, and I don't think anyone in the real world wants anything like that share of their log entries flagged as anomalies. So I expect their precision numbers to be far worse on more realistically distributed logs.
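To put rough numbers on the base-rate point, here's a back-of-the-envelope sketch. The recall and false-positive rate below are made up for illustration (they are not from the paper); only the 7% anomaly share comes from their dataset.

```python
def precision(prevalence, recall, false_positive_rate):
    """Precision implied by Bayes' rule for a given base rate of true anomalies."""
    true_pos = prevalence * recall
    false_pos = (1.0 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)


# Hypothetical detector: catches 90% of real anomalies and
# wrongly flags 1% of normal log entries.
RECALL = 0.90
FALSE_POSITIVE_RATE = 0.01

for prevalence in (0.07, 0.01, 0.001, 0.0001):
    p = precision(prevalence, RECALL, FALSE_POSITIVE_RATE)
    print(f"anomaly rate {prevalence:.2%}: precision {p:.1%}")
```

With these made-up rates, precision is about 87% at a 7% anomaly rate and falls to roughly 8% at 0.1%, without the detector itself changing at all.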
Of course if you let an ML-powered "anomaly detection" engine run rampant on your logs, it's going to find anomalies...just like if you hire a ghost hunter, you'll be informed that your house is haunted. In the end, ghost chasing is all this anomaly nonsense turns out to be-- the justifications for conclusions by ML practitioners and ghost hunters alike tend to be equally mumbly and hand-wavy.
Me working from home is technically an anomaly, and one these systems are all too eager to flag. We get random logins from overseas VPSes-- it's an anomaly! Oh, wait, no, we onboarded a client application. Oh, look, a random login from China for a US-based employee with no history of foreign logins! Yeah, that guy just started in a new position with travel requirements. Hey, this IP just tried to log into 5000 user accounts! Congratulations, you just alerted me to the existence of carrier NAT.
None of this saves any time and usually wastes it, since it stirs up paranoia where none was otherwise warranted. It's a fun toy that gives the appearance of being productive when all it's actually doing is generating literally endless busywork. Good for justifying your SOC budget I suppose.
But in the end nobody wants to pay a quarter-million dollars for a black box that just sits there quietly-- if it's not constantly drawing attention to itself and all the badness it's pretending to find, you're not going to have any reason to renew the license.
"Renew it? Why? This thing didn't find anything at all last year."
One of the interesting facts we've been able to measure empirically over the past few years is that the magnitude of a statistical anomaly score (e.g. reconstruction error) is uncorrelated with how critical the anomaly is in terms of security / threat.
This means that in practice SOC operators need to label alerts on top of the anomaly detection, so that a supervised model can take over the reranking after a while.
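A minimal sketch of what that reranking could look like, assuming alerts carry a type and an anomaly score and operators mark each reviewed alert as threat or not. The alert types, field names and smoothing constants are all hypothetical, and a real setup would likely use a proper classifier over richer features.

```python
def fit_threat_rates(labels, prior_threats=1.0, prior_total=2.0):
    """Estimate P(threat | alert type) from operator labels, with additive smoothing."""
    threats, totals = {}, {}
    for alert_type, is_threat in labels:
        totals[alert_type] = totals.get(alert_type, 0) + 1
        threats[alert_type] = threats.get(alert_type, 0) + (1 if is_threat else 0)

    def rate(alert_type):
        # Alert types nobody has labeled yet fall back to the prior (0.5 here).
        seen_threats = threats.get(alert_type, 0)
        seen_total = totals.get(alert_type, 0)
        return (seen_threats + prior_threats) / (seen_total + prior_total)

    return rate


def rerank(alerts, rate):
    """Sort by learned threat rate first; the raw anomaly score only breaks ties."""
    return sorted(alerts, key=lambda a: (rate(a["type"]), a["score"]), reverse=True)


# Hypothetical operator feedback: (alert type, was it a real threat?)
labels = (
    [("new_geo_login", False)] * 3
    + [("many_accounts_one_ip", False)] * 2
    + [("priv_escalation", True)] * 2
    + [("priv_escalation", False)]
)

# Hypothetical new alerts with their raw anomaly scores.
alerts = [
    {"id": 1, "type": "new_geo_login", "score": 0.97},
    {"id": 2, "type": "priv_escalation", "score": 0.41},
    {"id": 3, "type": "many_accounts_one_ip", "score": 0.88},
]

rate = fit_threat_rates(labels)
ranked_ids = [a["id"] for a in rerank(alerts, rate)]
print(ranked_ids)
```

The alert with the lowest anomaly score ends up first because its type is the only one operators have actually confirmed as a threat, which is the whole point of labeling on top.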
> As shown by several prior work [9, 22, 39, 42, 45], an effective methodology is to extract a “log key” (also known as “message type”) from each log entry. The log key of a log entry e refers to the string constant k from the print statement in the source code which printed e during the execution of that code.
So if you're looking for a way to apply this to log data that varies wildly, like site access logs, you still have the difficult problem of converting the URIs to the numeric vectors needed by ML algorithms without losing the significant parts of the input.
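One rough way to attack that, sketched under heavy assumptions: mask the obviously variable parts of each URI to get a template (loosely analogous to a "log key"), then hash the template's tokens into a fixed-length count vector. The masking rules, function names and vector size here are hypothetical and would need tuning per site.

```python
import hashlib
import re
from urllib.parse import urlsplit

# Hypothetical masking rules: (pattern for a whole path segment, placeholder).
_MASKS = [
    (re.compile(r"^\d+$"), "<num>"),
    (re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$"), "<uuid>"),
    (re.compile(r"^[0-9a-fA-F]{16,}$"), "<hex>"),
]


def uri_template(uri):
    """Replace variable-looking path segments and drop query values."""
    parts = urlsplit(uri)
    segments = []
    for seg in parts.path.split("/"):
        for pattern, placeholder in _MASKS:
            if pattern.match(seg):
                seg = placeholder
                break
        segments.append(seg)
    # Keep only the query parameter names, sorted so ordering doesn't matter.
    keys = sorted(kv.split("=", 1)[0] for kv in parts.query.split("&") if kv)
    return "/".join(segments) + ("?" + "&".join(keys) if keys else "")


def featurize(uri, dim=32):
    """Hash the template's tokens into a fixed-length count vector."""
    tokens = [t for t in re.split(r"[/?&]", uri_template(uri)) if t]
    vec = [0.0] * dim
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    return vec


example = "/users/12345/orders/9f8e7d6c5b4a39281706f5e4d3c2b1a0?page=2&sort=asc"
print(uri_template(example))
print(featurize(example))
```

Note the trade-off: dropping query values and IDs is exactly the kind of information loss to worry about, since an injection payload usually lives in those values. Keeping length or character-class features of the masked parts would be one way to retain some of that signal.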
It allows for using different algorithms such as one-class SVM or MDS (including custom algorithms). It also allows for defining custom domain-specific features as an integral part of its analysis engine. In particular, for log analysis, frequencies of various event types have been generated.
It's much easier to make sense of logs when we don't discard that type structure.
Applying K-Means clustering across different features of online traffic always shows some weird and often malicious stuff:
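For anyone who wants to try the idea, here's a toy sketch in plain Python: cluster per-client traffic features with K-Means and score each client by its distance to the nearest centroid. The client names, features and numbers are invented, and the first-k-points initialization is a simplification (real implementations use something like k-means++ or random restarts, and scale the features first).

```python
import math


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm, naively initialized from the first k points."""
    centroids = list(points[:k])
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its members;
        # an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


def anomaly_scores(points, centroids):
    """Anomaly score = distance to the nearest centroid."""
    return [min(math.dist(p, c) for c in centroids) for p in points]


# Hypothetical per-client features: (requests per minute, errors per minute).
clients = {
    "browser-1": (10, 0), "browser-2": (12, 1), "browser-3": (11, 0), "browser-4": (9, 1),
    "api-1": (60, 2), "api-2": (62, 1), "api-3": (58, 3), "api-4": (61, 2),
    "scanner": (65, 40),
}

points = list(clients.values())
centroids = kmeans(points, k=2)
scores = anomaly_scores(points, centroids)
ranked = sorted(zip(clients, scores), key=lambda pair: pair[1], reverse=True)

for name, score in ranked[:3]:
    print(f"{name}: {score:.1f}")
```

The client with an API-like request rate but a pile of errors ends up far from both centroids and tops the list. Whether it's malicious or just a broken integration is still a human's call.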
I read a paper that used it for insurance fraud detection, but I don't know which other fields are using clustering to detect fraud and abnormalities.
I'd be grateful if someone can help.
See this - using K-Means clustering for anomaly detection in web traffic:
Using DBSCAN clustering for anomaly detection in healthcare claims data (detecting doctors who prescribe opioids anomalously), with the public CMS data set from 2015.
4 of the 8 top anomalies (doctors) were later convicted of crimes or ran into all sorts of trouble with the DOJ:
(Splunk Enterprise + free apps were used to ingest the data and build all this logic and the dashboards)
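Outside of Splunk, the DBSCAN part of this can be sketched in a few lines of plain Python. This is not the original analysis: the prescriber names, features and parameters below are invented toy values, not CMS data, and the idea is simply that points DBSCAN labels as noise are the anomaly candidates.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns a cluster id per point, NOISE for outliers."""
    labels = [None] * len(points)

    def neighbors(i):
        # Includes the point itself, so min_pts counts the point too.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


# Hypothetical prescribers: (opioid share of claims in %, average days of supply).
prescribers = {
    "dr_a": (5, 10), "dr_b": (6, 11), "dr_c": (5, 12),
    "dr_d": (7, 10), "dr_e": (6, 9), "dr_x": (60, 45),
}

labels = dbscan(list(prescribers.values()), eps=3, min_pts=3)
outliers = [name for name, label in zip(prescribers, labels) if label == NOISE]
print(outliers)
```

Unlike K-Means, there's no need to pick the number of clusters up front, but `eps` and `min_pts` have to be chosen per dataset, and unscaled features will skew the distances.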