> I shared the numbers internally and someone asked about the ROI. Production cost for jsonata-js in the previous month was about $25K - now it was 0. That conversation ended up being pretty short.
I'm obviously projecting from my own experience, but it echoes so clearly how power can be wielded without actual insight, almost arrogantly: "OK, all very nice, but the ROI...?"
The article seems to come from a company with stellar engineering, so maybe this doesn't apply here. But the tone I imagine from that comment still stands out to me, even more so precisely because of the mature engineering.
Of course ROI is important, and a company exists to produce it. I'm extrapolating from something tiny, but it makes me think of the Boeing culture shift: https://news.ycombinator.com/item?id=25677848
In short, why can't good engineering just be good engineering, fostered with trust, with the profits following from that?
In my mind, this "observation" (if I can call it that) may explain, or at least relate to, what other commenters bring up:
> I don't know what to think. These blog articles are supposed to be a showcase of engineering expertise, but bragging about having AI vibecode a replacement for a critical part of your system that was questionably designed and costing as much as a fully-loaded FTE per year raises a lot of other questions.
I really appreciate this post. Freely and humbly sharing real insights from an interesting project. I almost feel like I got a significant chunk of the reward for your investment into this project just by reading.
Oh, I didn't know GitHub had free macOS CI runners.
Maybe that would solve my dreaded upcoming issue of having to update my Mac to a version with Liquid Glass just to be able to build for the App Store.
Can this be read as the financial incentives to join the AI silicon race finally becoming too tempting? Have the incentives to sell chips finally become stronger than the cost of competing with your own licensees?
If I were an investor in an AI provider, I would be quite worried.
1) Switching between LLM APIs is incredibly easy if you are not concerned with differences in personality. As the models get better, picking the best one matters less.
2) The products built to bundle the API with a user experience are difficult to build at a level that outclasses open-source alternatives.
3) Building an understanding of the user to increase the product's value over time and create stickiness is effective, but imho it becomes less effective as time passes and the user changes. For example, I suspect these adaptations have a hard time unlearning things that are no longer true. Learning about the user opaquely is less useful to the user, and doing it overtly makes it easier to take the learnings and go. (Besides, it is probably not legal under the GDPR to prevent the user from exporting the learnings and taking them to another provider.)
Taken together, the moat becomes quite shallow. I see why they aggressively ban any tools that demonstrate that open alternatives are in fact better than their own walled gardens.
Caveat: I am not an expert, so this is a semi-educated guess.
I imagine it would depend on whether DINOv3 captures the information of whether a given person is in the image, which I think is really a question about training data. So naively, I would guess the answer is yes for celebrities and no for non-celebrities. Partially for data/technical reasons, but also maybe due to the murkier legal expectation of privacy for famous people.
Foundation models like DINO learn representations of their inputs. That is, they generate very high-dimensional numerical descriptions of what you put into them. The models aren't trained on labelled data, but they're trained on some pretext task like "given this image with a cutout, fill in the cutout" (see Masked Auto-Encoders). So the basic output from a model is a vector - often called an embedding. Literally a 1D list of numbers, O(1k)-dimensional. Your goal is to get an embedding that assigns (well) linearly separable vectors for all the things you want to classify.
Vision transformers also output patch tokens, which can be assembled into a low-resolution feature map (w/32, h/32 is common). So what you do with that data depends on the task. Classification can be as simple as linearly classifying the (whole image) embedding. A semantic segmentation task can do the same, but for every pixel. This is why the DINO authors show a PCA representation of a bunch of images, which show that semantically similar objects are grouped together by colour. Object detectors are more complicated, but the key thing is that once you have these pixel-level features, you can use them as input into existing architectures.
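To make the "linear classification on a frozen embedding" idea concrete, here is a minimal sketch. The embeddings below are synthetic Gaussian clusters standing in for real model outputs (the dimension, class names, and cluster separation are all made up for illustration; in practice you'd get the vectors by running images through the frozen backbone):

```python
# Minimal sketch of a linear probe on frozen embeddings.
# Synthetic stand-ins: two Gaussian clusters play the role of
# "cat" and "dog" embeddings from a frozen vision backbone.
import numpy as np

rng = np.random.default_rng(0)
dim = 384  # a plausible embedding size for a small ViT

cats = rng.normal(loc=+0.5, scale=1.0, size=(100, dim))
dogs = rng.normal(loc=-0.5, scale=1.0, size=(100, dim))
X = np.vstack([cats, dogs])
y = np.array([0] * 100 + [1] * 100)

# Fit a linear classifier by least squares against +/-1 targets.
# (Logistic regression is the usual choice; the idea is identical:
# the backbone stays frozen, only this linear layer is fitted.)
targets = np.where(y == 0, 1.0, -1.0)
w, *_ = np.linalg.lstsq(X, targets, rcond=None)

pred = np.where(X @ w > 0, 0, 1)
accuracy = (pred == y).mean()
print(accuracy)
```

If the embedding space is "good" in the sense the comment describes (linearly separable for the classes you care about), even this crude fit scores near 100% here.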
Now to your question: face recognition is a specific application of object re-identification (keyword: Re-ID). The way most of these models work is from the whole-image embedding. Normally you'd run a detector to extract the face region, then compute the embedding, put it in a vector database and then query for nearest neighbours using something like the cosine distance. I've only worked in this space for animals, but humans are far more studied. Whether DINOv3 is good enough out-of-the-box I don't know, but certainly there's a lot of literature looking at these sorts of models for Re-ID.
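The embed-then-nearest-neighbour step can be sketched in a few lines. The gallery vectors below are random placeholders for real face embeddings, and the names are invented; a real system would use a proper vector database rather than a numpy matrix:

```python
# Sketch of embedding lookup via cosine similarity.
# Gallery rows are placeholder "embeddings" of known identities.
import numpy as np

rng = np.random.default_rng(42)

def cosine_sim(query, gallery):
    """Cosine similarity between a query vector and each gallery row."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return g @ q

gallery = rng.normal(size=(5, 128))          # 5 known people, 128-d vectors
names = ["alice", "bob", "carol", "dan", "erin"]

# A query: a noisy copy of carol's embedding, as if from a new photo.
query = gallery[2] + 0.1 * rng.normal(size=128)

sims = cosine_sim(query, gallery)
best = names[int(np.argmax(sims))]
print(best)  # → carol
```

Random high-dimensional vectors are nearly orthogonal, so the noisy copy is by far the closest match; real embeddings behave similarly when the model's features actually separate identities.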
The challenge with Re-ID is that the model has to produce features which discriminate between specific individuals, rather than just grouping similar-looking ones. For example, with the vanilla model you probably have a very good tool for visual search. But that's not the same task: if you give it a picture of someone in a field, you'll get back pictures of other people in fields. Re-ID usually requires re-training on labelled imagery where you have a few examples of each person. The short answer is that there are already very good models for doing this, and they don't necessarily even need ML to do a decent job (though it might be used for keypoint detection of facial landmarks).
I'm not confident in what I'm saying here, so please correct me if I'm wrong as I'd like to learn:
Human hearing isn't linear in terms of loudness. A 3 dB increase sounds like just "an increase", but the sound power is actually double (doubling the pressure is about +6 dB). Hence it makes sense to use dB to describe loudness even in the context of perceived loudness to human hearing.
This is similar to brightness. In photography, "stops" are used to measure brightness. One stop brighter is technically twice the light, but to the human eye it just looks "somewhat brighter", as human brightness perception is roughly logarithmic, just like "stops" and dB.
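The arithmetic behind both scales is just a logarithm of a ratio, which a few lines make explicit (the ratios chosen here are my own examples, not from the comment):

```python
# Decibels: 10*log10 of a power ratio. Pressure uses 20*log10,
# because power scales with pressure squared. Photography stops
# are the same idea in base 2: one stop = 2x the light.
import math

def power_db(ratio):
    return 10 * math.log10(ratio)

def pressure_db(ratio):
    return 20 * math.log10(ratio)

print(round(power_db(2), 2))     # doubling power: ~3.01 dB
print(round(pressure_db(2), 2))  # doubling pressure: ~6.02 dB

stops = math.log2(4)  # 4x the light
print(stops)          # 2.0 stops
```

So "+3 dB" and "+1 stop" are both a doubling on the physical scale, perceived as a modest step up.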