
With an average latency of 4 seconds, this still couldn't be used for real-time video, correct?

[Update: I should have mentioned I got the 4-second figure from the roboflow.com links in this thread]



Didn't see where you got those numbers, but surely that's just a problem of throwing more compute at it? From the blog post:

> This excellent performance comes with fast inference — SAM 3 runs in 30 milliseconds for a single image with more than 100 detected objects on an H200 GPU.


For the first SAM model, you needed to encode the input image, which took about 2 seconds (on a consumer GPU), but any detection you then ran against the encoded image was on the order of milliseconds. The blog post isn't entirely clear on this, but I'm assuming the 30ms covers the encoder plus 100 runs of the detector.
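For reference, the original segment-anything library makes this encode-once/prompt-many split explicit: set_image() runs the heavy image encoder a single time, and each subsequent predict() call reuses the cached embedding. A minimal sketch (the checkpoint path, frame, and prompt points are placeholders):

  import numpy as np
  from segment_anything import sam_model_registry, SamPredictor

  # "sam_vit_h.pth" is a placeholder checkpoint path.
  sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
  predictor = SamPredictor(sam)

  image = np.zeros((720, 1280, 3), dtype=np.uint8)  # stand-in for a real frame
  predictor.set_image(image)  # heavy step: runs the encoder once (~seconds on consumer GPUs)

  # Each prompt below reuses the cached embedding and runs in milliseconds.
  for point in [(100, 200), (400, 300)]:
      masks, scores, _ = predictor.predict(
          point_coords=np.array([point]),
          point_labels=np.array([1]),  # 1 = foreground point
      )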


Even if it were 4s, you can always parallelize across frames to do it "realtime"; only the output latency would be 4s. At 4s per frame you'd need a cluster of roughly 120 GPUs for 30fps or 240 for 60fps running in parallel, whereas at 30ms per image you only need 2 GPUs to keep up with a 60fps stream. A sketch of that dispatch pattern is below.
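The arithmetic is just pipelining: if per-frame inference takes t seconds and frames arrive at fps, you need ceil(t * fps) workers to sustain throughput (ceil(4.0 * 60) = 240; ceil(0.03 * 60) = 2), and the output lags the input by t. A hedged sketch of the dispatch, where infer_frame is a placeholder for whatever per-GPU model call you use, over a finite list of frames:

  import math
  from concurrent.futures import ThreadPoolExecutor

  FPS = 60
  LATENCY_S = 4.0                       # assumed per-frame inference time
  WORKERS = math.ceil(LATENCY_S * FPS)  # 240 workers to sustain 60fps at 4s/frame

  def infer_frame(frame):
      """Placeholder: send one frame to a GPU worker and return its masks."""
      ...

  def process_stream(frames):
      # Submit frames round-robin across workers; results come back in
      # order, each delayed by roughly LATENCY_S.
      with ThreadPoolExecutor(max_workers=WORKERS) as pool:
          futures = [pool.submit(infer_frame, f) for f in frames]
          for fut in futures:
              yield fut.result()  # blocks until that frame's inference finishes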


The model is massive and heavy. I have a hard time seeing this used in real-time. But it's so flexible and accurate that it makes an amazing teacher for lean CNNs; that's where the real value lies.

I don't even care about the numbers; a vision transformer encoder whose output is too heavy for many edge-compute CNNs to consume as input isn't gonna cut it.
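Using the heavy model as a teacher usually means distillation: run SAM 3 offline over unlabeled frames, then train a small CNN against its masks as pseudo-labels. A minimal sketch of one such training step, assuming teacher and student are both modules mapping (B, 3, H, W) images to (B, 1, H, W) mask logits (all names here are placeholders, not anything from the SAM 3 release):

  import torch
  import torch.nn.functional as F

  def distill_step(teacher, student, optimizer, frames):
      """One distillation step: the student learns to mimic the teacher's masks."""
      with torch.no_grad():
          pseudo_masks = torch.sigmoid(teacher(frames))  # soft targets from the heavy model

      student_logits = student(frames)
      loss = F.binary_cross_entropy_with_logits(student_logits, pseudo_masks)

      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
      return loss.item()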


p50 latency on the Roboflow serverless API is 300-400ms roundtrip for a SAM 3 image query with a text prompt.

You can get an easy-to-use API endpoint by creating a workflow in Roboflow with just the sam3 block in it (and hooking up an input parameter to forward the prompt to the model); the workflow is then available as an HTTP endpoint. You can use the sam3 template and remove the visualization block if you only need the JSON response, for a bit lower latency and a smaller payload.

Internally we're seeing roughly ~200ms HTTP roundtrips, but our user-facing API currently has some additional latency because we have to proxy to a different cluster where we have more GPU capacity allocated for this model than we can currently get on GCP.
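Calling such a workflow endpoint is a plain JSON POST. An illustrative sketch only: the exact URL, input names, and payload schema come from the endpoint snippet Roboflow generates for your specific workflow, and "prompt" here assumes you wired up an input parameter as described above:

  import requests

  # Placeholder URL: copy the real one from your workflow's deploy snippet.
  WORKFLOW_URL = "https://serverless.roboflow.com/infer/workflows/<workspace>/<workflow-id>"

  resp = requests.post(
      WORKFLOW_URL,
      json={
          "api_key": "<YOUR_API_KEY>",
          "inputs": {
              "image": {"type": "url", "value": "https://example.com/frame.jpg"},
              "prompt": "forklift",  # forwarded to the sam3 block's text prompt
          },
      },
      timeout=30,
  )
  resp.raise_for_status()
  print(resp.json())  # JSON-only response if the visualization block was removed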



