As a power user, I am concerned about the possibility of widespread adoption of your product and/or others like it.
I don't want my bank to ban me just because I use a browser extension to capture my own cookies from my own valid session and pipe them into a shell script I wrote to invoke curl to harvest my latest bank statement as a PDF and store it locally.
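For concreteness, the tool is roughly this shape; the URL, cookie file, and output paths below are all made up for illustration:

```python
# Rough sketch of the archive-bank-statement tool described above.
# The bank URL, cookie-jar path, and output names are invented.
import datetime
import pathlib
import subprocess

def build_curl_cmd(cookie_jar, url, out_path):
    """Build the curl invocation: -b reuses the browser-exported session
    cookies, -sSf stays quiet but fails on HTTP errors instead of saving
    an error page as a "PDF"."""
    return ["curl", "-sSf", "-b", cookie_jar, "-o", out_path, url]

def archive_statement(cookie_jar="cookies.txt",
                      url="https://example-bank.example/statements/latest.pdf",
                      out_dir="statements"):
    pathlib.Path(out_dir).mkdir(exist_ok=True)
    out_path = f"{out_dir}/statement-{datetime.date.today():%Y-%m-%d}.pdf"
    subprocess.run(build_curl_cmd(cookie_jar, url, out_path), check=True)
    return out_path
```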
Supposing that your system wouldn't flag that activity as malicious, what about the vulgar things that I did to their servers while I was developing my archive-bank-statement tool?
NB. Please ignore the implication that my tool is complete or useful. It's not... :)
The main idea behind Wallarm is to gain inside knowledge of how the application works and how users use it. Based on this data, we craft dynamic rules for every single application or API.
The simplest example is what data is transmitted in different parameters of form fields or API calls. For example, it's OK if someone puts an SQL injection payload into a form on the Stack Overflow site while writing a security-related article; that can be normal behavior. Meanwhile, an SQL injection payload is probably malicious in a login form on your bank's website.
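A toy illustration of the same payload being fine on one endpoint and suspicious on another; the endpoints, markers, and allow-list here are invented, not our actual rules:

```python
# Same-payload, different-context sketch. Everything here is invented.
SQLI_MARKERS = ("' or 1=1", "union select")

# Endpoints where attack-looking text is legitimately expected,
# e.g. the body field of a security write-up on a Q&A site.
ALLOWS_ATTACK_TEXT = {"/articles/new": True, "/login": False}

def is_suspicious(endpoint, value):
    if ALLOWS_ATTACK_TEXT.get(endpoint, False):
        return False  # normal behavior for this endpoint
    return any(m in value.lower() for m in SQLI_MARKERS)
```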
We wouldn't ban a request only because it is sent with curl. There is a set of different factors and statistics that are taken into account. E.g. if you send these requests too quickly and they are sent with curl, that can be considered malicious activity.
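A toy illustration of combining several weak signals like that; the thresholds, weights, and window size are invented, not production logic:

```python
# Invented scoring sketch: no single factor (like a curl User-Agent)
# triggers a block on its own; signals are combined into a score.
from collections import deque
import time

WINDOW_SECONDS = 10
RATE_LIMIT = 50  # requests per window considered "too quick" in this sketch

class RiskScorer:
    def __init__(self):
        self.timestamps = deque()

    def score(self, user_agent, now=None):
        now = time.time() if now is None else now
        self.timestamps.append(now)
        # keep only requests inside the sliding window
        while self.timestamps and now - self.timestamps[0] > WINDOW_SECONDS:
            self.timestamps.popleft()
        score = 0
        if "curl" in user_agent.lower():
            score += 1  # weak signal on its own
        if len(self.timestamps) > RATE_LIMIT:
            score += 2  # high request rate within the window
        return score    # e.g. act only when score crosses some threshold
```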
I'm glad to see innovation in this area. I have a few questions.
Can you tell me where (or even how) you acquired the data to train your machine learning system?
If you could go into some detail about the specific techniques you've used, that would also be great to know.
Finally, what does your service do that is not provided by something like SiftScience? I imagine there is overlap here - is it that you primarily focus on web application security instead of fraud signaling?
1. Customers analyze traffic with locally installed NGINX-based instances (there is no DNS takeover). They send application/traffic statistics to Wallarm Cloud so we can run the machine-learning stuff. We did a lot of work on the initial training of the system using our own experience in web app security (250+ pentests for top-tier companies, plus a lot of research done by our team, like the SSRF bible). We also use different honeypots and, now, statistics from customers with high-volume traffic.
2. Some details about the ML techniques are covered by Ivan in another comment.
3. We address different tasks than SiftScience. SiftScience provides fraud detection; Wallarm protects web apps and APIs against data breaches. But these tasks are related for some of our customers.
How much of your machine learning is used for understanding the application (as Ivan said elsewhere, clustering login functionality together), and how much is actually used for fingerprinting vulnerability identification attempts on the part of user input?
To place this in a broader context, you do not need machine learning to identify many cases of malicious user input; you can rely on simple heuristics. There is likely no legitimate reason for a user to submit `<script>alert(1);</script>`, which is an obvious test for low-hanging XSS fruit. Any good WAF will catch this.
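For instance, a signature check of this kind can be as simple as the following; the patterns are illustrative only, nowhere near production-grade:

```python
# A few hardcoded signatures of the kind any WAF heuristic would include.
import re

OBVIOUS_PROBES = [
    re.compile(r"<script\b[^>]*>", re.IGNORECASE),  # low-effort XSS test
    re.compile(r"\b(union\s+select|or\s+1\s*=\s*1)\b", re.IGNORECASE),  # classic SQLi
    re.compile(r"\.\./\.\./"),                      # path traversal
]

def looks_like_probe(value):
    return any(p.search(value) for p in OBVIOUS_PROBES)
```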
Given that, does Wallarm use mostly heuristics for identifying malicious user input, or does it also combine machine learning into this process at all to find non-obvious input patterns that could be indicative of penetration testing attempts?
Our attack-type recognition is based on machine learning that first produces lexems and then syntax constructions (patterns) from existing attacks. For example, in the case of memcached injections (more details: https://www.blackhat.com/docs/us-14/materials/us-14-Novikov-...) we can train the system to detect these attacks without regexps or new heuristic rules.
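A rough sketch of that two-stage idea: tokenize the raw value into lexems, then match the lexem sequence against attack-shaped patterns. The token rules and the pattern below are hardcoded for illustration; the point of the ML approach is that such patterns are learned rather than written by hand:

```python
# Stage 1: lex the value; stage 2: look for an attack-shaped token sequence.
import re

TOKEN_SPEC = [
    ("CMD",  re.compile(r"(get|set|delete|incr|flush_all)\b", re.I)),  # memcached verbs
    ("CRLF", re.compile(r"(\r\n|%0d%0a)", re.I)),
    ("NUM",  re.compile(r"\d+")),
    ("WORD", re.compile(r"\w+")),
]

def lexems(value):
    tokens, pos = [], 0
    while pos < len(value):
        for name, pat in TOKEN_SPEC:
            m = pat.match(value, pos)
            if m:
                tokens.append(name)
                pos = m.end()
                break
        else:
            pos += 1  # skip characters no rule matches
    return tokens

def looks_like_memcached_injection(value):
    # a line break followed by a storage command is the shape of the attack
    toks = lexems(value)
    return any(a == "CRLF" and b == "CMD" for a, b in zip(toks, toks[1:]))
```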
CEO and cofounder of Sift Science here. I think we are complementary, actually. Wallarm focuses on security vulnerabilities (like a more automated HackerOne), and we focus more on "application abuse" (user-level fraud).
HackerOne and Bugcrowd do a great job, and we recommend running bug-bounty programs all the time.
But for companies that move fast and deploy code every day (or several times a day) with CI/CD, it's almost impossible not to introduce new vulnerabilities. This is where solutions for continuous security are incredibly helpful.
There are a few different tasks for machine learning here.
1. Traffic clustering (hierarchical clustering algorithms). We use ML to understand how your application works in terms of business logic, e.g. clustering HTTP requests for /login into a cluster determined by (HTTP_header->HOST="yoursite.com" + HTTP_URL->"/login" + ...).
2. Data profiling inside clusters. We use statistical distribution algorithms to understand which data is normal for the fields POST->login and POST->password inside the cluster from p. 1. These are not hardcoded data templates like "only digits" or something similar; Wallarm generates profiles dynamically.
3. Fuzzy search. For the data that is abnormal (from p. 2), we determine whether it looks like XSS, SQLi, or any other attack.
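A toy end-to-end sketch of the three steps above. Plain grouping stands in for hierarchical clustering, value length stands in for the statistical profile, and difflib stands in for the real fuzzy matcher; all names, thresholds, and attack samples are invented for illustration:

```python
# Invented three-step pipeline: cluster -> profile -> classify.
import difflib
import statistics
from collections import defaultdict

# Step 1: group requests into clusters keyed by host + normalized path.
def cluster_requests(requests):
    clusters = defaultdict(list)
    for req in requests:
        key = (req["headers"]["Host"], req["url"].split("?")[0])
        clusters[key].append(req)
    return clusters

# Step 2: learn what "normal" looks like per field (here: just value length).
class FieldProfile:
    def __init__(self):
        self.lengths = []

    def learn(self, value):
        self.lengths.append(len(value))

    def is_abnormal(self, value, z_cutoff=3.0):
        mean = statistics.mean(self.lengths)
        stdev = statistics.pstdev(self.lengths) or 1.0
        return abs(len(value) - mean) / stdev > z_cutoff

# Step 3: fuzzy-match abnormal values against known attack samples.
KNOWN_ATTACKS = {
    "xss": ["<script>alert(1)</script>", "<img src=x onerror=alert(1)>"],
    "sqli": ["' or 1=1 --", "union select null,null --"],
}

def classify(value, threshold=0.6):
    best_label, best_ratio = None, 0.0
    for label, samples in KNOWN_ATTACKS.items():
        for sample in samples:
            ratio = difflib.SequenceMatcher(None, value.lower(), sample).ratio()
            if ratio > best_ratio:
                best_label, best_ratio = label, ratio
    return best_label if best_ratio >= threshold else None
```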