More

sqreept · 2025-12-15T18:49:33 1765824573

This CSI plugin is fully vibe coded by Claude Sonnet 4.5 & Haiku 4.5 and tested on a real cluster in a loop by claude-code.

My only contribution was to uphold standards as this is one big thing where LLMs struggle probably because there's so few examples out there.

Hope it helps you!

sqreept · on March 18, 2024

What are the languages supported by it?

cyanydeez · on March 18, 2024

Tweets.

sqreept · on March 8, 2024

M1, M2, M3 still have very low number of GPU cores. Apple should release some better hardware to take advantage of their recently released MLX library.

sbinnee · on March 8, 2024

At this moment it looks clear to me that Apple won’t go that way. It’s enough for them to focus on inference and actual application not the heavy training part. They have been probably training models on a cluster with non Apple silicon and make them available for their chips only for inference.

ttul · on March 8, 2024

Not to mention entirely outsourcing training workloads to specialist firms. Apple does a lot of secretive outsourcing of things you might think they would or should do in-house. This contrasts with Google and Meta who seem to like keeping everything in-house.

kergonath · on March 8, 2024

It’s true that their GPUs are slower than Nvidia’s. But keep in mind that cores are really different and cannot be compared across architectures. You want more Gflops, not necessarily more cores.

int_19h · on March 8, 2024

They do, but for inference at least, it's memory bandwidth that is the primary limiting factor for home LLMs right now, not raw compute.

sroussey · on March 9, 2024

Wonder if the apple silicon ultra series will start using HBM3(e) on desktop in the future.

sqreept · on March 5, 2024

I just published 3 things:

1. A LoRA that adds Romanian diacritics to texts that don't have them: huggingface.co/sqreept/ro_dia… (includes an example of using it)

2. The dataset used to build the above LoRA: huggingface.co/datasets/sqree…

3. The Colab used to fine-tune the above LoRA: colab.research.google.com/drive/1GMg9fS3…

This is what I call open source in Machine Learning: model weights, dataset and code. Anything less is not open source.

sqreept · on Feb 26, 2024

I've read twice the announcement and I can't tell what this is good for. Can you please dumb it down for me?

sqreept · on Feb 21, 2024

Tried inference with the 7B model and without flash attention this is soooooo slow. With flash attention the fine-tunning requires A100 or H100. Also the inference doesn't always stop generating resulting in garbage being added to the response.

brucethemoose2 · on Feb 21, 2024

> Also the inference doesn't always stop generating resulting in garbage being added to the response.

That sounds like a chat format misconfiguration.

This could partially be Google's fault, as they used yet another novel prompting format.

Also, for sane inference speed on H100s, you'll have to wait for architecture support from the optimized frameworks. Vanilla transformers is beyond awful even with FA2.

alekandreev · on Feb 21, 2024

We have implementations in different ML frameworks, so I am not quite sure which one you are referring to. Would you like to file a bug at the relevant GitHub repo?

sqreept · on Feb 21, 2024

First of all, I'm using 2 x 4090 for testing. 4090 has 16384 CUDA cores which will become relevant a bit later.

I dug a bit deeper and it seems that with transformers==4.37.0 everything works fine with other HF hosted models (like Llama) but you'll rightfully get this when trying to use Gemma:

ImportError: cannot import name 'GemmaForCausalLM' from 'transformers'

After installing transformers==4.38.0 the fine-tunning speed of Llama drops to 25% (?!?) of what used to be for a reason that I think HF should fix. Testing Gemma it seems I'm hitting a hardware limit as Gemma has a hidden size which is bigger than the available CUDA cores. This seems to make both inference & fine-tunning about 25 times slower than similarly sized Llama 7B. I guess some operations have to be broken down in multiple round trips to the GPU due to my low CUDA core count.

All in all, even if HF fixes the recently introduced slowdown, Gemma seems to be fine-tuneable in reasonable amount of time only by the lucky ones with access to A100/H100.

EDIT: I managed to hack my env to be able to run inference on Gemma with transformers==4.37.0 by keeping the necessary classes in loaded in RAM. It works about 4x faster but still very slow. And both the 7B and the 2B versions behave the same way.

EDIT2: I tried latest transformers from main branch (4.39.0.dev) and behaves the same as 4.38.0.

sqreept · on Feb 21, 2024

What are the supported languages of these models?

alekandreev · on Feb 21, 2024

This v1 model is focused on English support, but you may find some multilingual capabilities.

sqreept · on Dec 27, 2023

In the era of AI, naming variables can and should be automated. Without good names, the code is very hard to read, and code should be, before anything else, readable.

sqreept · on Jan 19, 2023

I see it stopped crawling our website last night after hammering it for over a week.

We used a combination of ASN and UA to block them.

sqreept · on Jan 19, 2023

It is aggressive in what content is trying to access. It looks for security vulnerabilities and normal bots don't do that (with the notable exception of some security testing software). Also it's not spidering, somehow it knows very old URLs which are not even public which were probably obtained from a malicious browser extension.