
Yeah, luckily, you can unit test these and fix them. They are not concurrency bugs (again, luckily).

BTW, numeric differentiation can only be tested in a very limited way (due to algorithmic complexity when you're doing big matrices). It is much easier / more effective to test against multiple implementations.


You can easily test a gradient using only the forward pass by checking f(x+h) ~ f(x) + dot(g, h) for a random h.
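
A minimal sketch of that check in Python (the toy f and grad_f here are stand-ins for whatever forward pass and gradient you are actually testing):

    import numpy as np

    def f(x):
        return np.sum(x ** 2)  # toy forward pass

    def grad_f(x):
        return 2.0 * x  # analytic gradient under test

    x = np.random.randn(5)
    g = grad_f(x)
    h = np.random.randn(5) * 1e-6  # small random perturbation

    # first-order Taylor expansion: f(x + h) ~ f(x) + dot(g, h)
    lhs = f(x + h)
    rhs = f(x) + np.dot(g, h)
    print(abs(lhs - rhs))  # O(|h|^2) -- tiny if the gradient is right

The nice part is that it needs only one extra forward pass per random direction, so it sidesteps the cost of full numeric differentiation on big matrices.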


And it has always felt to me that it has lineage from the neural Turing machine line of work as prior. The transformative parts were: 1. finding a good task (machine translation) and a reasonable way to stack (the encoder-decoder architecture); 2. running the experiment; 3. ditching the external KV store idea and just using self-projected KV.
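
For illustration, a minimal single-head version of that "self-projected KV" idea in Python (the dimensions, and the absence of masking and multi-head, are simplifications for the sketch): Q, K, and V are all projections of the same input sequence, rather than reads from an external memory as in the NTM lineage.

    import numpy as np

    def self_attention(x, Wq, Wk, Wv):
        # queries, keys, and values all come from the same sequence x
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        scores = q @ k.T / np.sqrt(k.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)  # row-wise softmax
        return w @ v

    d = 16
    x = np.random.randn(8, d)  # 8 tokens, d-dim embeddings
    Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
    out = self_attention(x, Wq, Wk, Wv)  # shape (8, d)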

Related thread: https://threadreaderapp.com/thread/1864023344435380613.html


mmap is a good crutch when you 1. don't have a busy-polling / async IO API available and want to do some quick & dirty preloading tricks; 2. don't want to manage the complexity of an in-memory cache, especially a cross-process one.

Obviously, if you have kernel-backed async IO APIs (io_uring) and are willing to dig into the deeper end (for a better managed cache), you can get better performance than mmap. But in many cases, mmap is "good-enough".
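
As an illustration of the "quick & dirty preloading" use, a small sketch in Python (assumes Linux/macOS and Python 3.8+ for madvise; "data.bin" is a hypothetical file):

    import mmap, os

    fd = os.open("data.bin", os.O_RDONLY)
    size = os.fstat(fd).st_size
    with mmap.mmap(fd, size, prot=mmap.PROT_READ) as mm:
        # hint the kernel to prefetch the whole mapping
        mm.madvise(mmap.MADV_WILLNEED)
        header = mm[:16]  # pages fault in on demand
    os.close(fd)

The kernel's page cache does the caching, and it is shared across processes for free, which is exactly the complexity you get to not manage yourself.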



All of this baloney they added to their OS they now need to support for who knows how many years across who knows how many devices. What a waste.


Perhaps it was a move to ensure job security.

As a user, I would have looked forward to a few years of simply fixing bugs and making the OS more efficient.


Faster compute helps for things like vision language models that require a bigger context to be filled. My understanding is that the ANE is still optimized for convolution loads and compute efficiency, while the new neural accelerators are optimized for flexibility and performance.


The old ANE enabled arbitrary statically scheduled multiply-adds on INT8 or FP16. That's good for convolution but not specifically geared for it.


I am not an expert on the ANE, but I think it is related to the size of the register files and how that is smaller than what we need for GEMM on modern transformers (especially the fat ones with MoE).


AIUI the ANE makes use of data in unified memory, not in the register file. So this wouldn't be an inherent limitation. (OTOH, that's why it wastes memory bandwidth for most newer transformer models, which use heavily quantized data - the ANE will have to read padded/unquantized values and the fraction of memory bandwidth that's used for that padding is pure waste.)


That would be an interesting approach if true. I hope someone gets to the bottom of it once we have hardware in our hands.


Feels like a side-effect of the forever-0.x versioning symptom (which I am guilty of as well). Even though semver says 0.x can do whatever, people don't associate disruptive changes with it strongly enough; whereas if that 0.4.x had been 1.x, it would be much clearer that this is a 2.x-scale release.

All things considered, this is probably just a tiny footnote in this software's life.


They jumped from v0.3.x to v0.5.0 after a couple of years of v0.3.x (with an unreleased v0.4.x in between).

That alone should hint to everyone that it was a big leap.


And 300ms for a DB call is slow in any case. We really shouldn't accept that as a normal cost of doing business. 300ms is only acceptable if we are doing scrypt-type things.


> in any case.

In some cases. Are you looking up a single indexed row in a small K-V table? Yep, slow. Are you generating reports on the last 6 years of sales, grouped by division within larger companies? That might be pretty fast.

I'm not sure why you'd even generalize that so overly broadly.


To put it in perspective, 300ms is about enough to loop over 30GiB of data in RAM, load 800MiB of data from an SSD, or do 1 TFLOPS of compute on a single-core computer.

300ms to generate a report would let you go through ~100M rows at least (on a single core).
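
A back-of-envelope version of those numbers (the bandwidth and per-row figures are assumptions, not measurements):

    budget_s = 0.3
    ram_bw = 100e9  # ~100 GB/s DRAM read bandwidth (assumed)
    ssd_bw = 2.7e9  # ~2.7 GB/s NVMe sequential read (assumed)

    print(ram_bw * budget_s / 2**30)  # ~28 GiB scanned from RAM
    print(ssd_bw * budget_s / 2**20)  # ~772 MiB loaded from SSD
    print(budget_s / 3e-9)            # ~100M rows at ~3ns per row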

And the implicit assumption in the comment I made earlier, of course, is that we are not talking about a 100M-row scan. If there was any confusion, I am sorry.


That's all true, so long as you completely ignore doing any processing on the data, like evaluating the rows and selectively appending some of them into a data structure, then sorting and serializing the results, let alone optimizing the query plan for the state of the system at that moment and deciding whether it makes more sense to hit the indexes or just slurp in the whole table given that N other queries are also executing right now, or mapping a series of IO queries to their exact address in the underlying disks, and performing the parity checks as you read the data off the RAID and combine it into a single, coherent stream of not-block-aligned tuples.

There's a metric boatload of abstractions between sending a UTF-8 query string over the packet-switched network and receiving back a list of results. 300ms suddenly starts looking like a smaller window than it originally appears.


There is nothing more for us to take away from this discussion. So let me be the first to tune it down. All I want to say is: don't take that 300ms as a given; it sits in an uncomfortable region, too short to be treated as an async op and too long to go unnoticed (anything between 50ms and 2s fits this bill). Most likely the query is doing something suspicious and would benefit from a closer look.
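
As a sketch of what "a closer look" could mean, assuming Postgres with psycopg2 (both assumptions; "orders" and the filter are hypothetical):

    import psycopg2

    conn = psycopg2.connect("dbname=app")  # hypothetical DSN
    with conn.cursor() as cur:
        # EXPLAIN ANALYZE runs the query and shows where the time goes:
        # seq scans, sorts spilling to disk, bad row estimates, etc.
        cur.execute(
            "EXPLAIN ANALYZE SELECT * FROM orders WHERE user_id = %s",
            (42,))
        for (line,) in cur.fetchall():
            print(line)
    conn.close()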


I was totally with you until that last sentence, then you lost me again.

Saying a DB query is too long by giving an arbitrary number is like saying a rope is too long. That’s solely dependent on what you’re doing with it. It’s literally impossible to say that X is too long unless you know what it’s used for.


No, an iPad Pro won't be faster than 4090s or 4070s (or even reach 5% of the speed of a 4090).

But newer chips might contain the Neural Accelerators to close the gap a little bit (to maybe 10%?).

(I maintain https://apps.apple.com/us/app/draw-things-ai-generation/id64...)


What improvements did the A19 Pro provide for Draw Things?



That's amazing! Curious how this will translate to the M5 Pro/Max Macs...


They had been an acquisition target since 2017 (per the OpenAI internal emails). So the lack of an acquisition is not for lack of interest. It makes you wonder what happened in those due-diligence processes.


Video generation is extremely exciting, see https://video-zero-shot.github.io/

However, personalization (teleporting yourself into a video scene) is boring to me. At its core, it doesn't generate a new experience for me. My experience is not defined by the photos / videos I took on a trip.

