>"run this daemon that tries to beat the kernel OOM to the punch"
Considering that the kernel OOM killer tends to be way too late in doing its thing, I don't see how this is inelegant. Maybe there's a reason you can't just have the kernel kill processes earlier in the face of memory pressure.
The kernel OOM is just plain broken. I can't understand how it can be that, on Windows, whenever I run out of RAM the OS kills whatever process is consuming too much and the computer keeps running flawlessly. However, on Linux, my computer just... freezes. It freezes and stops responding. Not even the mouse moves. Having to use a userspace OOM killer is the most inelegant thing I've seen. So I need to have two OOM killers so that the good one can beat the bad one? It's so redundant, it's plain stupid. How come NOBODY is doing anything? If I knew C++ I would for sure send a patch.
The reason the OOM killer never kicks in is that you never, or almost never, actually reach a true OOM.
What usually happens is that in near-OOM conditions, the kernel starts reclaiming file-backed memory pages, which leads to "thrashing". This manages to keep some extra memory available, but it makes the system almost unresponsive because it is constantly copying pages back and forth between RAM and disk. It may take anywhere from minutes to hours before the system finally OOMs and the OOM killer is invoked.
This problem has been there forever, but it has been made worse by the improved speed of modern storage: fast I/O lets the kernel sustain thrashing for much longer, whereas with slower disks the actual OOM condition was reached sooner.
There are several solutions:
- Buy more RAM: if your system routinely comes close to OOM, something is not right.
- Add a (small) swap. It doesn't have to be a partition: nowadays most filesystems support swap files. Just create a file of the right size, format it as swap, and enable it (see the sketch after this list).
- Limit the amount of thrashing, or protect some pages from being reclaimed. Google proposed this first, and several other people have since, but AFAIK it has never been merged into the mainline kernel.
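For the swap-file route, the usual recipe looks something like this (the path and the 2G size are just examples, it all needs root, and on btrfs the file needs extra handling):

fallocate -l 2G /swapfile                               # allocate the backing file
chmod 600 /swapfile                                     # swap must not be world-readable
mkswap /swapfile                                        # write the swap signature
swapon /swapfile                                        # enable it immediately
echo '/swapfile none swap defaults 0 0' >> /etc/fstab   # enable it at boot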
Regarding the last solution, there is a patchset called le9-patch[1] that is included in some alternative Linux kernels; it should be relatively safe to use.
I'm not sure if you misread the previous comment, but the problem they describe is quite commonly experienced by people who use Linux alongside macOS/Windows.
With all hardware being the same (RAM, SSD, CPU), as OOM is approached Linux will freeze, whereas Windows continues to run smoothly. All OSes try to reclaim memory pages; Linux just seems to hang userspace while doing so.
As someone who has dual-booted Windows and Linux for a decade, I can 100% attest to this glaring problem.
I am sure that this distinction between a near-OOM condition and an actual OOM condition matters to someone familiar with the current kernel implementation. You seem confident describing what happens when the memory gets closer to full, so I believe you. But the user experiences the PC freeze during certain conditions, however you choose to name them, and it is during that freeze period that the user needs a program to be killed to free some memory and prevent the freeze. I would take one crashed program over power cycling the entire PC any day of the week.
> I am sure that this distinction between a near-OOM condition and an actual OOM condition matters to someone familiar with the current kernel implementation. You seem confident describing what happens when the memory gets closer to full, so I believe you.
I'm not a kernel developer or anything like that; I've just spent some time investigating why this issue happens, and it has been happening for more than 10 years now.
> the user experiences the PC freeze during certain conditions, however you choose to name them
I'm not trying to defend the Linux kernel, I just described how it works. In particular, it's not true that the OOM killer "takes too long" or doesn't work: it's just not invoked at all. If you invoke it manually (enable the magic SysRq with `sysctl kernel.sysrq=1` and press `alt-sysrq-f`), it does its job and resolves the OOM instantly.
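To have that survive reboots, something like this should do (1 enables all SysRq functions; a stricter bitmask such as 64, which only permits signalling processes, still covers the OOM-kill key):

echo 'kernel.sysrq = 1' | sudo tee /etc/sysctl.d/90-sysrq.conf
sudo sysctl --system    # reload all sysctl drop-ins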
So, if you don't want to deal with lockups and don't like a userspace OOM daemon (I don't), those are the possible solutions.
> I would take one crashed program over power cycling the entire PC any day of the week.
On a laptop or desktop PC, you don't need to power cycle in a near-OOM: use the magic SysRq key.
> On a laptop or desktop PC, you don't need to power cycle in a near-OOM: use the magic SysRq key.
Thanks for the tip! If my Linux ever starts locking up regularly, I will apply it.
But right now (so I don't have to give up what my SysRq key is currently used for) I would prefer some method for determining, after I forcefully power the computer down and back up again, whether the lockup or slowdown that motivated the forced power-down was caused by a near-OOM condition.
I don't think so, sorry. The kernel emits a few messages when an OOM is detected, including which tasks were killed to free memory, but in a near-OOM there is probably nothing: the system is technically still working normally, just very slowly.
> Add a (small) swap. It doesn't have to be a partition: nowadays most filesystems support swap files. Just create a file of the right size, format it as swap, and enable it.
My experiences with swap on Linux have been similarly bad. If even brief memory pressure forces the kernel to move things to swap, the only way to revert that in any reasonable timeframe is to disable the swap with swapoff or to restart the machine.
Meanwhile, Windows with a page file twice the size of physical RAM runs smooth as butter. I have a 200GB page file right now and no problems.
> I can't understand how it can be that, on Windows, whenever I run out of RAM the OS kills whatever process is consuming too much and the computer keeps running flawlessly.
I've long wondered this too. How does Windows handle memory pressure differently?
I avoid swap, since it would need to be encrypted to protect sensitive data written out from memory to disk. Instead I reserve more memory for the kernel via vm.min_free_kbytes based on the installed RAM; following some Red Hat suggestions, I also reserve more memory in vm.admin_reserve_kbytes and vm.user_reserve_kbytes, adjust vm.vfs_cache_pressure based on the server's role, and finally set vm.overcommit_ratio to 0. This worked well on over 50k bare-metal servers with no swap. OOM was extremely rare outside of dev; it basically only happened when automation had human-induced bugs that deployed too many Java instances to a server. All of the servers had anywhere from 512GB to 3TB of RAM, and nearly all of the memory was in use at all times.
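Roughly, the shape of it as a sysctl drop-in (the values here are illustrative placeholders, not our production numbers):

# /etc/sysctl.d/90-mem.conf
vm.min_free_kbytes = 1048576       # keep ~1 GiB free for kernel/atomic allocations
vm.admin_reserve_kbytes = 262144   # headroom so root can still log in and recover
vm.user_reserve_kbytes = 262144    # headroom for userspace during recovery
vm.vfs_cache_pressure = 50         # reclaim dentry/inode caches less aggressively
vm.overcommit_ratio = 0            # only meaningful with vm.overcommit_memory = 2

Load it with `sysctl --system` after editing.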
The kernel OOM killer is only concerned about kernel survival. It isn't designed to care about user perception of system responsiveness.
That's what resource control via cgroups is about. Fedora desktop folks (both GNOME and KDE) are working on ensuring minimum resources are available for the desktop experience, via cgroups, which then applies CPU, memory, and IO isolation when needed to achieve that. Also, systemd-oomd is enabled by default. The resource control picture isn't completely in place yet, but things are much improved.
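If you want to experiment with that kind of resource control by hand, systemd exposes it directly; a rough example (the values are made up, and MemoryMin requires cgroup v2):

# reserve memory and boost CPU weight for the user session's cgroup
sudo systemctl set-property user.slice MemoryMin=1G CPUWeight=200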
cgroups often make the situation worse, not better, by forcing a small memcg to drop its caches because that control group is full, even while the system overall has plenty of resources. This can lead to a system swapping heavily for no apparent reason.
Putting desktop apps into individual cgroups is one of the more counter-productive ideas that have cropped up lately.
Huh? I have never seen desktop Windows kill a process due to out-of-memory -- does it even do that?
It does thrash much more gracefully than Linux, though. In fact, the "your computer is low on memory" prompt can actually show up even while severely thrashing, something all but impossible on Linux (where even starting something like zenity may take hours...).
You can already disable the Linux memory overcommit feature if you want Linux to never allow more memory to be allocated than exists. However, you may run into problems with programs that rely on the ability to allocate more memory than they need, or if your computer has a low amount of memory.
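For reference, these are the knobs (with overcommit disabled, the commit limit becomes swap plus overcommit_ratio percent of RAM; 100 is just an example value):

sudo sysctl vm.overcommit_memory=2    # 2 = never overcommit
sudo sysctl vm.overcommit_ratio=100   # commit limit = swap + 100% of RAM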
The reason is that Windows doesn't have fork(), and therefore doesn't have to promise huge multiples of the available memory only to be left holding the bag when that fiction fails. Look up "overcommit" if you're interested.
Not at all. I'm aware that it has the potential to put people off, but sometimes you have to shake people up to get the discussion going. There is actually a huge difference between "forcing people to do something" and merely "asking a question, albeit in a tough way," and people don't really seem to get this difference around here.
> The kernel is different from userspace projects - more difficult in some respects (we use a lot of very odd header files that pushes the boundary of what can be called "C"), but easier in many other respects (mainly in the sense that the kernel is fairly self-contained, and then doesn't rely on other projects for the final binary).
I'm interested in what Torvalds meant by these odd header files; does anyone know?
/*
 * This returns a constant expression while determining if an argument is
 * a constant expression, most importantly without evaluating the argument.
 * Glory to Martin Uecker <Martin.Uecker@med.uni-goettingen.de>
 */
#define __is_constexpr(x) \
	(sizeof(int) == sizeof(*(8 ? ((void *)((long)(x) * 0l)) : (int *)8)))
I can try... In short, this is about how a C compiler is supposed to deduce the type of a ternary expression. If x is a constant expression, the compiler can perform the multiplication by 0l, so the second operand is a null pointer constant; the type of the ternary is then whatever the third operand says (a pointer to int), the sizeof of the pointed-to type is sizeof(int), and the comparison succeeds. Otherwise the compiler cannot fold the multiplication, takes the type of the second operand at its "face value", i.e. as a pointer to void, and converts the type of the third operand to it; sizeof(void) is 1 under GCC's extension, which of course makes the comparison false in this case.
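If you want to see it for yourself, here's a quick throwaway test (my own, not from the kernel; note it leans on the GNU extension that sizeof(void) is 1, so build it as GNU C):

cat > /tmp/constexpr_test.c <<'EOF'
#include <stdio.h>

#define __is_constexpr(x) \
	(sizeof(int) == sizeof(*(8 ? ((void *)((long)(x) * 0l)) : (int *)8)))

int main(void)
{
	int n = 42;
	printf("%d\n", __is_constexpr(5)); /* prints 1: a constant expression */
	printf("%d\n", __is_constexpr(n)); /* prints 0: a runtime value */
	return 0;
}
EOF
gcc -std=gnu11 -o /tmp/constexpr_test /tmp/constexpr_test.c && /tmp/constexpr_test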
Small nit: the PRO versions of Ryzen APUs do support ECC[0], and ASRock has been quoted as saying that all of their AM4 motherboards support ECC, even the low-end offerings with the A320 chipset.
This version has got to be the worst kernel released in a while in terms of regressions, from an AMDGPU null-pointer-dereference crash[0] to an f2fs data-corruption bug[1] and now this. Fixes for these are on their way as far as I can tell, but since the stable team is probably on Christmas vacation it might take a while.
These are some attitude goals for me. It's so easy to take things personally. Being able to take things constructively even when they might be personal is a great skill.
5.10 is a Long Term Support release that is going to get used by many distros for a long time. Maintainers might have tried (unsurprisingly) to get some interesting features merged.
On the bright side, updating to 5.10 fixed a regression from a 5.4-to-5.8 kernel upgrade for me. The fix might have been in 5.9, but I only got the idea of upgrading after the 5.10 release.
Anyways, Linux needs some more CI so that such bugs can be found during the RC phase.
Where is the current CI that we have today lacking, and what needs to be improved? We always want more testing and testers; what is preventing everyone from helping with this?
I'm a software engineer who's not involved in Linux Kernel Dev... but I've got a stack of old laptops that I'd be happy to set up to run automated CI if that'd be helpful.
Is there a webpage or doc somewhere I can look at?
(I'm not trying to snark - the fact that you're you and you're here asking for help is making me want to dip my toe in).
Simplest thing to do: just run Linus's latest releases (the -rc releases), or builds from his git tree, on your machine and report any problems.
Second-simplest thing to do is to run the linux-next branch/tree on your machines and report any build warnings and runtime issues you find. That's what will be the "next" kernel releases and is where all of the developer/maintainer trees are merged together before they are sent to Linus.
Both of those should be very easy to do, and any problems found there should be easy to fix and resolve before they get to a "real" release.
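For anyone unfamiliar, the mechanics are roughly this (a sketch only; config handling varies by distro, and keep a known-good kernel around to fall back to):

git clone https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
cd linux-next
cp /boot/config-"$(uname -r)" .config   # seed from the running kernel's config
make olddefconfig                       # accept defaults for any new options
make -j"$(nproc)"
sudo make modules_install install       # most distros update the bootloader for you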
I haven't been following kernel dev for years; what does the CI setup look like? Did the Phoronix Test Suite ever find its way into widespread use?
Back when I was building kernels for embedded hardware (Sheevaplug) in the 2.6.33 timeframe, I found a USB audio regression between 2.6.33.7 and later versions. If there were a semi-turnkey way to set up a testbench that could automatically reboot hardware into every new kernel, run through some basic tests, and report any deviation, I probably would have been more likely to do so. At the time I was working solo, trying to release a polished consumer product (sadly, though the product was released, the business didn't work out), and didn't have time to dig into and report bugs.
We have so many different CI systems running on the kernel on an hourly basis.
We have the 0-day bot from Intel that runs so many things on all developer trees, we have kernelci running on many, many different hardware platforms, and we have Linaro test systems also running on many different branches and hardware platforms.
If you want to tie your own hardware into the system, kernelci is the best place to start, I recommend looking into that.
> How is the kernel tested? Weren't there any tests covering any of this?
Despite appearances, "the kernel" is not a single monolithic thing. There is a core of about 100 kLOC (but I haven't looked up that number in years). The rest (hardware drivers, network protocols, file systems, crypto, RAID, ...) bolts on as modules.
Those modules are maintained by separate teams. They are about as related to the kernel as the phone dialler app is to Android. The quality of each module is the responsibility of that team, not "the kernel" team, and that applies to testing the module as well.
In a sense, "the kernel" team is more like Debian or Red Hat than like an ordinary development team. What they have done is develop a framework that lets them take bits created and maintained by a cast of thousands and bolt them together into what appears, from the outside, to be a single coherent thing. So the answer to "how is the kernel tested" is: it's complex, and not centrally planned.
The other answer is that what you are seeing is in fact part of the testing process. Most people use kernels packaged by their distribution; kernel.org releases are more like Microsoft's pre-releases of Windows. Most Debian users, for example, won't see 5.10 until it gets to Debian testing. To get there it must pass through Debian experimental (which is where 5.10 sits now), then sit in Debian unstable without bug reports for a while. Those release names should give you a hint about the anticipated stability of the kernel version. I personally won't use it until it takes one more step, from Debian testing to Debian backports (which is when it becomes available to Debian stable users who are willing to risk compatibility issues).
This means that, for most users, 5.10 isn't done yet, as it has barely begun its testing regime.