> Are you doing a lot of non-io processing or computations?
Unfortunately.
From metrics, computing AWS signatures takes up an absurdly large amount of CPU time. The actual processing of events is quite minimal and honestly well-architected; a lot of stuff is loaded into memory rather than read from disk. There's syncing that happens fairly frequently over the internet, which refreshes the cache.
The big problem is that each event computes a new signature to send back to the API. I do have to wonder if the AWS signature is 99% of the problem and, once I take that burden off, the entire system will roar to life. That's what makes me so confused, because I had heard Erlang / Elixir could handle significantly more events per minute even with pretty puny hardware.
One thing I am working on is batching; then I am considering dropping the AWS signatures in favor of short-lived tokens, since either way it's game over if someone gets onto the system, as they could exploit the privilege. The systems are air-gapped anyway, so the risk is minimal in my opinion.
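Roughly the shape I have in mind for the batching, with hypothetical `sign_fun` / `send_fun` standing in for the real signing and HTTP calls:

```elixir
defmodule MyApp.BatchSender do
  @batch_size 100

  # One signature per batch instead of one per event. sign_fun/1 and
  # send_fun/2 are placeholders for the actual SigV4 + API client calls.
  def flush(events, sign_fun, send_fun) do
    events
    |> Enum.chunk_every(@batch_size)
    |> Enum.each(fn batch ->
      payload = :erlang.term_to_binary(batch)
      signature = sign_fun.(payload)
      send_fun.(payload, signature)
    end)
  end
end
```

Even a batch size of 100 cuts the signature count by two orders of magnitude.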
> One possibility is you're using a single process instead of parallelizing things. For example, you may want to use one process per event, etc.
This is done by pushing each event to a task, i.e. `Task.Supervisor.async_nolink`? That's largely where I found my gains, actually.
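For reference, this is roughly the shape of it; `process_fun` is a stand-in for the real per-event work:

```elixir
defmodule MyApp.EventFanout do
  # Assumes a Task.Supervisor started in the supervision tree as:
  #   {Task.Supervisor, name: MyApp.TaskSupervisor}

  def handle_events(events, process_fun) do
    Enum.each(events, fn event ->
      # async_nolink: the task is supervised but not linked to the caller,
      # so one bad event can't crash the process that enqueued it.
      Task.Supervisor.async_nolink(MyApp.TaskSupervisor, fn ->
        process_fun.(event)
      end)
    end)
  end
end
```

One thing I learned along the way: if you never await the result, `Task.Supervisor.start_child/2` is the better fit, since `async_nolink/2` sends its reply and `:DOWN` message back to the caller.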
It took a dive into how things are scheduled, because a big issue was that the queue would get massively backed up, and I realized that I apparently needed to toggle on a flag telling the VM to pack the schedulers more (`+scl true`). I also looked into the wake-up lengths of threads. I am starting to get my head around "dirty schedulers" but I am not entirely sure how to affect those, or how I can besides the VM doing it for me.
The other one, just for posterity: I believe events get unnecessarily queued because they don't (or didn't) have locks. So if event A gets queued and then creates a timer to re-queue it in 5 minutes, event A could, and would, continue to get queued despite the fact that the first event A hadn't been processed yet. So the queue would just continue to compound and starve itself.
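The fix I have in mind is a small in-flight guard so a key can't be queued twice; a minimal sketch:

```elixir
defmodule MyApp.InflightGuard do
  # Tracks which event keys are already queued/running so a timer can't
  # re-enqueue event A while the first A is still in flight.
  use GenServer

  def start_link(_), do: GenServer.start_link(__MODULE__, MapSet.new(), name: __MODULE__)

  # Returns :ok if the key was claimed, :already_running otherwise.
  def claim(key), do: GenServer.call(__MODULE__, {:claim, key})
  def release(key), do: GenServer.cast(__MODULE__, {:release, key})

  @impl true
  def init(set), do: {:ok, set}

  @impl true
  def handle_call({:claim, key}, _from, set) do
    if MapSet.member?(set, key) do
      {:reply, :already_running, set}
    else
      {:reply, :ok, MapSet.put(set, key)}
    end
  end

  @impl true
  def handle_cast({:release, key}, set), do: {:noreply, MapSet.delete(set, key)}
end
```

The timer callback would call `claim/1` first and skip the enqueue on `:already_running`, then `release/1` once processing finishes.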
I don't know the specifics of your app so I don't feel comfortable commenting in more than generalities, but generally speaking, if you are doing work in native code, and that native code work is CPU-bound (roughly, more than a millisecond of CPU time), you should try to do it in a dirty scheduler. If you don't, that CPU-bound code will interfere with the "regular" BEAM schedulers, meaning it will start to interfere with how the BEAM schedules all of the other work in your app, from regular function calls to IO to job queuing to serving requests, and whatever else.
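One way to check whether this is happening, assuming OTP 21+ with `:runtime_tools` available:

```elixir
# Sample scheduler utilization for 5 seconds while the system is under load.
:scheduler.utilization(5)
# Normal schedulers show up as :normal rows and dirty-CPU schedulers as
# :cpu rows. Pegged :normal rows alongside idle :cpu rows suggest the
# CPU-bound native work is running on the regular schedulers.
```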
I'm also suspicious of the `+scl true` setting as maybe being a bit of a red herring. I've been using BEAM off and on for 10 years both professionally and as a hobbyist and I've never used this myself nor seen anyone else ever need to use it. I'm sure there are circumstances where someone, somewhere has used this, but in a given Elixir app it is extremely likely that there is lower-hanging fruit than messing with scheduler flags.
In terms of queuing, are you using Oban or Broadway or something hand-built? It's common for folks to DIY this kind of DAG/queuing stuff when 99.9% of the time using something like Oban or Broadway would be better than hand-rolling it.
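For reference, a minimal Oban worker for this kind of job might look like the following; `MyApp.Events.process/1` is a placeholder for the actual sign-and-send:

```elixir
defmodule MyApp.SignAndSend do
  # unique: [period: 300] makes Oban refuse duplicate jobs with the same
  # args for 5 minutes, which also covers the re-queueing issue upthread.
  use Oban.Worker, queue: :events, max_attempts: 3, unique: [period: 300]

  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"event_id" => event_id}}) do
    # Placeholder for the real work; should return :ok on success.
    MyApp.Events.process(event_id)
  end
end

# Enqueueing:
# %{event_id: 123} |> MyApp.SignAndSend.new() |> Oban.insert()
```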
It looks like others have addressed the first 90% of your post, so I'll refrain from commenting on that. I am curious about your timer code, though, because the timer shouldn't be firing at all unless the task associated with it has completed successfully. You shouldn't run into an issue where a timer is re-queueing the same task in Elixir.
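The usual shape, for comparison with what you have, is to re-arm the timer only after the work finishes:

```elixir
defmodule MyApp.Poller do
  use GenServer

  @interval :timer.minutes(5)

  def start_link(_), do: GenServer.start_link(__MODULE__, nil, name: __MODULE__)

  @impl true
  def init(state) do
    schedule()
    {:ok, state}
  end

  @impl true
  def handle_info(:tick, state) do
    do_work()  # blocks this process until the run is done
    schedule() # re-arm only once the work has completed
    {:noreply, state}
  end

  defp schedule, do: Process.send_after(self(), :tick, @interval)
  defp do_work, do: :ok # placeholder for the real event processing
end
```

With this pattern a slow run simply delays the next tick instead of stacking duplicates behind itself.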
> From metrics, computing AWS signatures takes up an absurdly large amount of CPU time. The actual processing of events is quite minimal and honestly well-architected; a lot of stuff is loaded into memory rather than read from disk. There's syncing that happens fairly frequently over the internet, which refreshes the cache.
Oh, sounds nice! Caching in Elixir really is nice.
Okay, that makes sense. Elixir isn't fast at pure compute; it can actually be slower than Python or Ruby. However, the signatures are likely computed in NIFs (native code). If the AWS signatures are computed using NIFs, then the CPUs likely just can't keep up with them. Tokens would make sense in that scenario, but you should check the lib or code you're using for them.
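One quick sanity check, since SigV4 signing is mostly a handful of HMAC-SHA256 calls (which go through the `:crypto` NIFs): time the raw primitive and compare it to your full signing path.

```elixir
key = :crypto.strong_rand_bytes(32)

{micros, _} =
  :timer.tc(fn ->
    for _ <- 1..10_000, do: :crypto.mac(:hmac, :sha256, key, "example payload")
  end)

IO.puts("#{micros / 10_000} µs per HMAC-SHA256")
```

If the full signing path costs orders of magnitude more than the raw HMAC, the overhead is in the signing library rather than the crypto itself.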
> The big problem is that each event computes a new signature to send back to the API. I do have to wonder if the AWS signature is 99% of the problem and, once I take that burden off, the entire system will roar to life. That's what makes me so confused, because I had heard Erlang / Elixir could handle significantly more events per minute even with pretty puny hardware.
Yeah, crypto compute can be expensive, especially on older / smaller CPUs without built-in crypto primitives. Usually I find Elixir performs better than equivalent NodeJS, Python, etc. due to its built-in parallelism.
Also, one thing to look out for would be NIF C functions blocking the BEAM VM. The VM can now do "dirty NIFs", but if they're not used and the code assumes the AWS signatures will run fast, it could create knock-on effects by blocking the BEAM VM's schedulers. That's also not always easy to find with the BEAM's built-in tools.
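One built-in tool worth trying here is microstate accounting from `:runtime_tools`; fair warning that the dedicated `:nif` column only appears on emulators compiled with extended states, otherwise NIF time is folded into `:emulator`:

```elixir
:msacc.start(1_000)  # collect microstate stats for one second, then stop
:msacc.print()       # a large :nif (or inflated :emulator) share on the
                     # normal scheduler threads points at blocking NIFs
```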
On that note, make sure you've tried the `:observer` tooling (`:observer.start()` from IEx). It's fantastic.
> One thing I am working on is batching; then I am considering dropping the AWS signatures in favor of short-lived tokens, since either way it's game over if someone gets onto the system, as they could exploit the privilege. The systems are air-gapped anyway, so the risk is minimal in my opinion.
> I also looked into the wake-up lengths of threads. I am starting to get my head around "dirty schedulers" but I am not entirely sure how to affect those, or how I can besides the VM doing it for me.
Note that dirty schedulers really only matter for NIFs which run longer than what the BEAM schedulers expect. I mentioned it because of the possibility that the AWS sigs are taking longer than they should; if so, they'd cause havoc on the schedulers.
Once upon a time I needed to do hashes en masse for a specific blockchain project. Just a tad of Rust (via NIF) really helped the performance. Might be of help to you; check this out (not my lib)