This is extremely true. In fact, from what we see, many (if not most) of the problems people solve with LLMs have no ground-truth values; even hand-labeled data tends to be largely subjective.
We build a product that's somewhat similar in spirit to DSPy, but people come to us for different reasons than the OP listed here.
1) It's slow: you first have to get acquainted with DSPy and then gather hand-labeled data for prompt optimization. This can be a slow process, so it's important to label only the ambiguous cases, not the obvious ones.
2) They know manual prompt engineering is brittle, and they want a prompt that's optimized and robust against the model they're invoking, which DSPy offers. However, it's really the optimizer (e.g., GEPA) doing the heavy lifting; see the sketch after this list.
3) They don't actually want a model or prompt at all. They want a task completed, reliably, and they want that task to not regress in performance. Ideally, the task keeps improving in production.
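For folks who haven't seen that flow, here's a minimal sketch of what "the optimizer doing the heavy lifting" looks like in DSPy. Treat the constructor arguments as assumptions from memory (they shift between DSPy versions), and the tiny trainset as a placeholder for real labeled data:

    import dspy

    # Assumption: any LiteLLM-style model identifier works here.
    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

    # A one-line signature stands in for a hand-written prompt.
    classify = dspy.Predict("ticket -> category")

    # Placeholder labeled data; per point 1, label the ambiguous cases.
    trainset = [
        dspy.Example(ticket="Card was charged twice", category="billing").with_inputs("ticket"),
        dspy.Example(ticket="App crashes on login", category="bug").with_inputs("ticket"),
    ]

    # The metric is where your (often subjective) labels enter the loop.
    def metric(gold, pred, trace=None, pred_name=None, pred_trace=None):
        return float(gold.category == pred.category)

    # GEPA iteratively rewrites the prompt text against the metric.
    # (Arguments here are assumptions and may differ by DSPy version.)
    optimizer = dspy.GEPA(metric=metric, auto="light",
                          reflection_lm=dspy.LM("openai/gpt-4o"))
    optimized = optimizer.compile(classify, trainset=trainset)

The point is that the hand-written prompt disappears entirely; you hold the signature and the metric fixed and let the optimizer search the prompt space.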
Curious if folks in this thread feel more of these pains than the ones in the article.
I think in some sense, this is the real thing everyone wants. Everything else is kind of an implementation detail! Would be really curious to see what you're building!
An under-discussed superpower of LLMs is open-set labeling, which I think of as inverse classification: instead of applying a static set of predetermined labels, you use the LLM to discover the semantic clusters within a corpus of unstructured data. It feels like "data mining" in the truest sense.
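Roughly what I mean, as a minimal sketch; `complete()` is a hypothetical stand-in for whatever LLM client you use, and I'm assuming the model returns valid JSON:

    import json
    from collections import Counter

    def complete(prompt: str) -> str:
        """Hypothetical stand-in for your LLM client of choice."""
        raise NotImplementedError

    def open_set_label(doc: str) -> list[str]:
        # No fixed taxonomy: the model proposes whatever labels fit.
        prompt = (
            "Propose 1-3 short topical labels for the text below. "
            "Return a JSON list of strings only.\n\n" + doc
        )
        return json.loads(complete(prompt))

    def mine_clusters(corpus: list[str]) -> Counter:
        # The label frequency table *is* the discovered cluster structure.
        counts = Counter()
        for doc in corpus:
            counts.update(label.lower() for label in open_set_label(doc))
        return counts

Run it over a few thousand documents and the head of the counter is your emergent taxonomy; the long tail is where the interesting outliers live.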
The models you called out at the beginning were all released this year. What do you think is the difference between this generation of models and previous ones?
Yes, we're a startup! And LLM inference is a major component of what we do - more importantly, we're working on making these models accessible as analytical processing tools, so we have a strong focus on making them cost-effective at scale.
I see your pricing page lists the average cost per million tokens. Is that because you're using the formula you describe, which depends on hardware time and throughput?
> API Price ≈ (Hourly Hardware Cost / Throughput in Tokens per Hour) + Margin
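Plugging in illustrative numbers (these are assumptions, not your actuals), the formula would give something like:

    hourly_hardware_cost = 4.00      # USD/hr, one H100-class GPU (illustrative)
    throughput_tok_per_sec = 2_500   # aggregate tokens/sec across batched requests (illustrative)

    tokens_per_hour = throughput_tok_per_sec * 3600   # 9,000,000 tokens/hr
    cost_per_million = hourly_hardware_cost / tokens_per_hour * 1e6
    print(f"${cost_per_million:.2f} per 1M tokens before margin")  # ~$0.44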
My two cents here is the classic answer - it depends. If you need general "reasoning" capabilities, I see this being a strong possibility. If you need specific, factual information baked into the weights themselves, you'll need something large enough to store that data.
I think the best of both worlds is a sufficiently capable reasoning model with access to external tools and data that can perform CPU-based lookups for information that it doesn't possess.
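As a sketch of that pattern (the dispatch loop is schematic, and `call_model`, the tool registry, and the `facts.db` schema are all hypothetical):

    import sqlite3

    def lookup_fact(query: str) -> str:
        # Cheap CPU-side retrieval: exact facts live in a database,
        # not in the model's weights. (Schema is illustrative.)
        con = sqlite3.connect("facts.db")
        row = con.execute(
            "SELECT answer FROM facts WHERE question LIKE ?", (f"%{query}%",)
        ).fetchone()
        con.close()
        return row[0] if row else "no match"

    TOOLS = {"lookup_fact": lookup_fact}

    def run_agent(task: str, call_model) -> str:
        # call_model is a hypothetical stand-in for a small reasoning model
        # that emits either ("tool", name, arg) or ("final", answer).
        context = task
        while True:
            kind, *payload = call_model(context)
            if kind == "final":
                return payload[0]
            name, arg = payload
            context += f"\n[{name}({arg!r}) -> {TOOLS[name](arg)}]"

The model only needs to be smart enough to decide when to look something up; the lookup itself costs essentially nothing compared to baking the facts into the weights.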
Both great points, but they more or less speak to the same root cause: customer usage patterns are becoming more of a driver for pricing than underlying technology improvements. If so, we've likely hit a "soft" floor on pricing for now. Do you not see it this way?
Even given how much prices have decreased over the past 3 years I think there's still room for them to keep going down. I expect there remain a whole lot of optimizations that have not yet been discovered, in both software and hardware.
No doubt prices will continue to drop! We just don't think the drops will be anything like the orders-of-magnitude YoY improvements we're used to seeing. Consequently, developers shouldn't expect the cost of building and scaling AI applications to be anywhere close to "free" in the near future, as many seem to expect.
I do not see it this way. Google is a publicly traded company responsible for creating value for its shareholders. When they became dicks about ad blockers on YouTube a year or so back, was it because they hit a bandwidth Moore's law? No. It was a money grab.
ChatGPT is simply what Google should have been 5-7 years ago, but Google was more interested in showing me ads to click on than in helping me find what I was looking for. ChatGPT handles at least 50% of my searches now, and Google is losing revenue because of it.
I run a batch inference/LLM data processing service and we do a lot of work around cost and performance profiling of (open-weight) models.
One odd disconnect that still exists in LLM pricing: providers charge linearly with respect to token consumption, but the underlying compute cost grows quadratically with sequence length, since each new token attends to all prior tokens.
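A back-of-the-envelope illustration of that gap (the hidden dimension and the FLOP model are deliberately crude):

    def attention_flops(n: int, d: int = 4096) -> float:
        # Rough O(n^2 * d) scaling for full self-attention over n tokens.
        return n * n * d

    for n in (1_000, 10_000, 100_000):
        per_token = attention_flops(n) / n
        print(f"n={n:>7}: attention compute per token ~ {per_token:.2e}")
    # Compute per token grows 10x for every 10x of context,
    # but a flat per-token price doesn't.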
At this point, since most models have converged on similar architectures, inference algorithms, and hardware, the chosen prices likely come from a historical, statistical analysis of the shape of customer requests. In other words, I'm not surprised to see prices increase as providers gather more data about real-world consumption patterns.
Sutro.sh (fka Skysight) | Infrastructure/LLMs & Research Engineering | SF Bay Area | Full-time
We are building batch inference infrastructure and a great user/developer experience around it. We believe LLMs have not yet been meaningfully unlocked as data processing tools - we're changing that.
Our work involves interesting distributed systems and LLM research problems, newly-imagined user experiences, and a meaningful focus on mission and values.
If you're interested in applying, please send an email to jobs@sutro.sh with a resume/LinkedIn Profile. For extra priority, please include [HN] in the subject line.