Hacker News | nicebyte's comments

I've found Trellis specifically to be very "over-promise and under-deliver".

Nothing I tried with it got even close to the level of quality that they were advertising - felt like a bunch of examples were hand-picked, at best.


> Not sure if this is an "oh, no" event.

it's not.

descriptor sets are realistically never getting deprecated. old code doesn't have to be rewritten if it works. there's no point.

if you're doing bindless (which you most certainly aren't if you're still stuck with descriptor sets) this offers a better way of handling that.

if you care to upgrade your descriptor set based path to use heaps, this extension offers a very nice pathway to doing so _without having to even recompile shaders_.

for new/future code, this is a solid improvement.

if you're happy where you are with your renderer, there isn't a need to do anything.


And apparently if you do mobile you stay away from a big chunk of dynamic rendering and use Vulkan 1.0-style renderpasses... or you leave performance on the floor (based on guidelines from various mobile GPU vendors)


Vulkan on mobile and web runs several years behind Vulkan on desktop. This is a problem for portable toolkits such as WGPU.


People need to stop cucking their wrapper APIs by catering to apple/mobile


> or you leave performance on the floor

The dynamic rendering local read feature is for this. You can get tiler gpus working on the happy path without the need for verbose render passes.

But mobile driver support being what it is, it'll take time until it's widespread in consumer devices (which don't get driver updates).


just use the vma library. the low level memory allocation interface is for those who care to have precise control over allocations. vma has shipped in production software and is a safe choice for those who want to "just allocate memory".


Nah, I know about VMA and it's a poor bandaid. I want a single-line malloc with zero care about usage flags and which only produces one single pointer value, because that's all that's needed in pretty much all of my use cases. VMA does not provide that.

And Vulkan's unnecessary complexity doesn't stop at that issue; there are plenty of follow-up issues that I also have no intention of dealing with. Instead, I'll just use Cuda, which doesn't bother me with useless complexity until I actually opt in to it when it's time to optimize. Cuda lets you easily get stuff done first, then look at the more complex stuff to optimize, unlike Vulkan, which unloads its entire complexity on you right from the start, before you have any chance to figure out what to do.


> I want a single-line malloc with zero care about usage flags and which only produces one single pointer value

That's not realistic on non-UMA systems. I doubt you want to go over PCIe every time you sample a texture, so the allocator has to know what you're allocating memory _for_. Even with CUDA you have to do that.

And even with unified memory, only the implementation knows exactly how much space is needed for a texture with a given format and configuration (e.g. due to different alignment requirements and such). "just" malloc-ing gpu memory sounds nice and would be nice, but given many vendors and many devices the complexity becomes irreducible. If your only use case is compute on nvidia chips, you shouldn't be using vulkan in the first place.


> Even with CUDA you have to do that.

No you don't: cuMemAlloc(&ptr, size) will just give you device memory, and cuMemAllocHost will give you pinned host memory. The usage flags are entirely pointless. Why would UMA be necessary for this? There is a clear separation between device and host memory. And of course you'd use device memory for the texture data. Not sure why you're constructing a case where I'd fetch texels from the host over PCIe; that's absurd.

> only the implementation knows exactly how much space is needed for a texture with a given format and configuration

OpenGL handles this trivially, and there is also no reason for a device malloc to not also work trivially with that. Let me create a texture handle, and give me a function that queries the size that I can feed to malloc. That's it. No heap types, no usage flags. You're making things more complicated than they need to be.


> No you don't, cuMemAlloc(&ptr, size) will just give you device memory, and cuMemAllocHost will give you pinned host memory.

that's exactly what I said. You have to explicitly allocate one or the other type of memory, i.e. you have to think about what you need this memory _for_. It's literally just usage flags with extra steps.

> Why would UMA be necessary for this?

UMA is necessary if you want to be able to "just allocate some memory without caring about usage flags". Which is something you're not doing with CUDA.

> OpenGL handles this trivially,

OpenGL also doesn't allow you to explicitly manage memory. But you were asking for an explicit malloc. So which one do you want, "just make me a texture" or "just give me a chunk of memory"?

> Let me create a texture handle, and give me a function that queries the size that I can feed to malloc. That's it. No heap types, no usage flags.

Sure, that's what VMA gives you (modulo usage flags, which as we had established you can't get rid of). Excerpt from some code:

```
VmaAllocationCreateInfo vma_alloc_info = {
  .usage         = VMA_MEMORY_USAGE_GPU_ONLY,
  .requiredFlags = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT
};

VkImage       img;
VmaAllocation allocn;
const VkResult create_alloc_vkerr = vmaCreateImage(
    vma_allocator,
    &vk_image_info,  // <-- populated earlier with format, dimensions, etc.
    &vma_alloc_info,
    &img,
    &allocn,
    NULL);
```

Since I don't care about resource aliasing, that's the extent of "memory management" that I do in my rhi. The last time I had to think about different heap types or how to bind memory was approximately never.


No, it's not usage flags with extra steps, it's fewer steps. It's explicitly saying you want device memory without any kind of magical guesswork about what your numerous potential combinations of usage flags may end up giving you. Just one simple device malloc.

Likewise, your claim about UMA makes zero sense. Device malloc gets you a pointer or handle to device memory, UMA has zero relation to that. The result can be unified, but there is no need for it to be.

Yeah, OpenGL does not do malloc. I'm flexible, I don't necessarily need malloc. What I want is a trivial way to allocate device memory, and Vulkan and VMA don't do that. OpenGL is also not the best example since it also uses usage flags in some cases, it's just a little less terrible than Vulkan when it comes to texture memory.

I find it fascinating how you're giving a bad VMA example and passing that off as exemplary. Like, why is there both gpu-only and device-local? That vma alloc info as a whole is completely pointless, because a theoretical vkMalloc should always give me device memory. I'm not going to allocate host memory for my 3d models.


> It's explicitly saying you want device memory

You are also explicitly saying that you want device memory by specifying DEVICE_LOCAL_BIT. There's no difference.

> Likewise, your claim about UMA makes zero sense. Device malloc gets you a pointer or handle to device memory,

It makes zero sense to you because we're talking past each other. I am saying that on systems without UMA you _have_ to care where your resources live. You _have_ to be able to allocate both on host and device.

> Like, why is there gpu-only and device-local.

Because there's such a thing as accessing GPU memory from the host. Hence, you _have_ to specify explicitly that no, only the GPU will try to access this GPU-local memory. And if you request host-visible GPU-local memory, you might not get more than around 256 megs unless your target system has ReBAR.

> a theoretical vkMalloc should always give me device memory.

No, because if that's the only way to allocate memory, how are you going to allocate staging buffers for the CPU to write to? In general, you can't give the copy engine a random host pointer and have it go to town. So, okay now we're back to vkDeviceMalloc and vkHostMalloc. But wait, there's this whole thing about device-local and host visible, so should we add another function? What about write-combined memory? Cache coherency? This is how you end up with a zillion flags.

This is the reason I keep bringing UMA up but you keep brushing it off.


> You are also explicitly saying that you want device memory by specifying DEVICE_LOCAL_BIT. There's no difference.

There is. One is a simple malloc call; the other uses arguments with numerous combinations of usage flags which all end up doing exactly the same thing, so why do they even exist?

> You _have_ to be able to allocate both on host and device.

cuMemAlloc and cuMemAllocHost, as mentioned before.

> Because there's such a thing as accessing GPU memory from the host

Never had the need for that, just cuMemcpyHtoD and DtoH the data. Of course host-mapped device memory can continue to exist as a separate, more cumbersome API. The 256MB limit is cute, but apparently not relevant in Cuda, where I've been memcpying buffers gigabytes in size between host and device for years.

> No, because if that's the only way to allocate memory, how are you going to allocate staging buffers for the CPU to write to?

With the mallocHost counterpart.

cuMemAllocHost, so a theoretic vkMallocHost, gives you pinned host memory where you can prep data before sending it to device with cuMemcpyHtoD.

> This is how you end up with a zillion flags.

Apparently only if you insist on mapped/host-visible memory. This and usage flags never ever come up in Cuda where you just write to the host buffer and memcpy when done.

> This is the reason I keep bringing UMA up but you keep brushing it off.

Yes, I think I now get why you keep bringing up UMA - because you want to directly access buffers between host and device via pointers. That's great, but I don't have the need for that and I wouldn't trust the performance behaviour of that approach. I'll stick with memcpy, which is fast, simple, has fairly clear performance behaviours and requires none of the nonsense you insist on being necessary. But what I want isn't either this or that approach; I want the simple approach in addition to what exists now, so we can both have our cakes.


What exactly is the difference between these?

cuMemAlloc -> vmaAllocate + VMA_MEMORY_USAGE_GPU_ONLY

cuMemAllocHost -> vmaAllocate + VMA_MEMORY_USAGE_CPU_ONLY

It seems like the functionality is the same, just the memory usage is implicit in cuMemAlloc instead of being typed out? If it's that big of a deal, write a wrapper function and be done with it.

Usage flags never come up in CUDA because everything is just a bag-of-bytes buffer. Vulkan needs to deal with render targets and textures too which historically had to be placed in special memory regions, and are still accessed through big blocks of fixed function hardware that are very much still relevant. And each of the ~6 different GPU vendors across 10+ years of generational iterations does this all differently and has different memory architectures and performance cliffs.

It's cumbersome, but can also be wrapped (i.e. VMA). Who cares if the "easy mode" comes in vulkan.h or vma.h, someone's got to implement it anyway. At least if it's in vma.h I can fix issues, unlike if we trusted all the vendors to do it right (they wont).


> and are still accessed through big blocks of fixed function hardware that are very much still relevant

But is it relevant for malloc? Everything is put into the same physical device memory, so what difference would the usage flag make? Specialized texture fetching and caching hardware would come into play anyway when you start fetching texels via samplers.

> It seems like the functionality is the same, just the memory usage is implicit in cuMemAlloc instead of being typed out? If it's that big of a deal write a wrapper function and be done with it?

The main reason I did not even give VMA a chance is the github example that does in 7 lines what Cuda would do in 2. You now say it's not too bad, but that's not reflected in the very first VMA examples.


> I want the simple approach in addition what exists now, so we can both have our cakes.

The simple approach can be implemented on top of what Vulkan exposes currently.

In fact, it takes only a few lines to wrap that VMA snippet above and you never have to stare at those pesky structs again!
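To make that concrete, here's a sketch of what such a wrapper could look like (the name `rhi_create_image` and the struct are made up for illustration; it assumes a VmaAllocator was created at startup and that GPU-only images are all you need):

```c
// Hypothetical one-call "just make me a GPU image" helper that hides the
// VMA structs. Not a real API -- just a sketch of the wrapping idea.
typedef struct {
  VkImage       image;
  VmaAllocation allocation;
} rhi_image;

static VkResult rhi_create_image(VmaAllocator allocator,
                                 const VkImageCreateInfo *image_info,
                                 rhi_image *out) {
  const VmaAllocationCreateInfo alloc_info = {
    .usage         = VMA_MEMORY_USAGE_GPU_ONLY,
    .requiredFlags = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT,
  };
  // vmaCreateImage allocates device memory and binds it to the image in one call.
  return vmaCreateImage(allocator, image_info, &alloc_info,
                        &out->image, &out->allocation, NULL);
}
```

After this, every call site sees a single function taking an image description, which is about as close to "cuMemAlloc for images" as Vulkan gets.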

But Vulkan the API can't afford to be "like CUDA" because Vulkan is not a compute API for Nvidia GPUs. It has to balance a lot of things, that's the main reason it's so un-ergonomic (that's not to say there were no bad decisions made. Renderpasses were always a bad idea.)


> In fact, it takes only a few lines to wrap that VMA snippet above and you never have to stare at those pesky structs again!

If it were just this issue, perhaps. But there are so many more unnecessary issues that I have no desire to deal with, so I just started software-rasterizing everything in Cuda instead. Which is way easier because Cuda always provides the simple API and makes complexity opt-in.


But what if you want both on a shared memory system?


No problem: Then you provide an optional more complex API that gives you additional control. That's the beautiful thing about Cuda, it has an easy API for the common case that suffices 99% of the time, and additional APIs for the complex case if you really need that. Instead of making you go through the complex API all the time.


some of this is what khronos standards are theoretically supposed to achieve.

surprise, it's very difficult to do across many hw vendors and classes of devices. it's not a coincidence that metal is much easier to program for.

maybe consider joining khronos since you apparently know exactly how to achieve this very simple goal...


> it's not a coincidence that metal is much easier to program for

Tbf, Metal also works on non-Apple GPUs and with only minimal additional hints to manage resources in non-unified memory.


assembler is far from trivial, at least for x86, where there are many possible encodings for a given instruction. emitting the most optimal encoding that does the correct thing depends on surrounding context, and you'd have to do multiple passes over the input.


What is a single example where the optimal encoding depends on context? (I am assuming you're just doing an assembler where registers have already been chosen, vs. a compiler that can choose sse vs. scalar and do register allocation etc.)?


“mov rcx, 0”. At least one assembler (the Go assembler) would at one point blindly (and arguably, incorrectly) rewrite this to “xor rcx, rcx”, which is smaller but modifies flags, which “mov” does not. I believe Go fixed this later, possibly by looking at surrounding instructions to see if the flags were being used, for instance by an “adc” later, to know if the assembler needs to pick the larger “mov” encoding.

Whether that logic should belong in a compiler or an assembler is a separate issue, but it definitely was in the assembler there.


Ok fair, I saw that as out of scope for an assembler - since that's a different instruction, not just a different way to encode the same one.


jumps are another one. jmp can have many encodings depending on where the target offset you're jumping to is. but oftentimes the offset is not yet known when you first encounter the jump insn and have to assemble it.


shameless plug: if you want to understand the content of this post better, first read the first half of my article on jumps [1] (up to syscall). goes into detail about relocations and position-independent code.

[1] https://gpfault.net/posts/asm-tut-4.html


including ai generated illustrations in your articles or presentations is very cringe


yeah no. I mainlined dwm + dmenu all the way back in 200x; I've written tons of makefiles and have the scars to prove it.

These days I'm off this minimalism crap. It looks good on paper, but never survives collision with reality [1] (funny that this post is on the hn front page today as well!).

[1] http://johnsalvatier.org/blog/2017/reality-has-a-surprising-...


I like these tools because they are minimalist. I don't really care for the fact that they are C/make-oriented, and would rather help someone rewrite them in go or rust than show that I have a non-minimal amount of scar tissue from working with a needlessly complicated past.


my comment isn't about things being written using c/make/whatever, it's precisely about the faulty assumption that complexity is needless.


Oh then I totally disagree (or don't understand why you would need to see a psychoanalysis of a blacksmith to evaluate their offerings?). Many projects have places that need some complexity, configuration or advanced tools that doesn't imply the hardware store should stop selling average hammers or make you wade through an aisle of crap from providers like peloton to see if they better meet your needs.

(I.e. show me where in the article he replaced a standard tool like the hammer or pot with a complex one customized to exactly what he wanted to solve or explain why that advanced tool wouldn't suck given that there's a lot more details than one would expect.)


I just went back to fedora+gnome on my PCs from FreeBSD+(tiling wm). I think minimalism is good when your workflow is very focused and you already know the requirements for your stack. But if you have unexpected workflows coming in everyday, the maintenance quickly becomes a burden. Gnome may not be perfect, but it's quite nice as a baseline for a working environment.


Same. I ran dwm for a long time. These days I just run Gnome. You can make it work very similar to a tiling window manager, and all that random crap the world throws at you (printers, projectors, random other monitors, Java programs) "Just Work".


I bet 90% of the reason this is on the front page is the Berkeley mono font. the system itself sucks.


The first time it was posted I said: I hate the system, but I like the presentation.

The system is great if you like to remember the IPs of the sites you need instead of the URLs…


How did you draw that conclusion from reading the contents of the link? This is a benchmark.

> We evaluate model performance and find that frontier models are still unable to solve the majority of tasks.

