I've had to repeatedly tell our AWS account reps that we're not even a little interested in the Trainium or Inferentia instances unless they have a provably reliable track record of working with the standard libraries we have to use like Transformers and PyTorch.
I know they claim they work, but that's only on their happy path with their very specific AMIs and the nightmare that is the Neuron SDK. You try to do any real work with them and use your own dependencies and things tend to fall apart immediately.
It was just in the past couple of years that it really became worthwhile to use TPUs if you're on GCP, and that's only because of the huge investment on Google's part into software support. I'm not going to sink hours and hours into beta testing AWS's software just to use their chips.
IMO AWS, once you get off the core services, is full of beta services. S3, Dynamo, Lambda, ECS, etc. are all solid. But a lot of their other services have some big rough patches.
RDS, Route 53, and ElastiCache are decent, too. But yes, I've also been bitten badly in the distant past by attempting to rely on their higher-level services. I guess some things don't change.
I wonder if the difference is stuff they dogfood versus stuff they don't?
I once used one of their services (I forget which, but I think it was their serverless product) that “supported” Java.
… but the official command line tools had show-stopper bugs if you were deploying Java to this service, that’d been known for months, and some features couldn’t be used in Java, and the docs were only like 20% complete.
But this work-in-progress alpha (not even beta quality because it couldn’t plausibly be considered feature complete) counted as “supported” alongside other languages that were actually supported.
(This was a few years ago and this particular thing might be a lot better now, but it shows how little you can trust their marketing pages and the AWS console dashboards.)
I'm assuming you're talking about Lambda. I don't mess with their default images. Write a Dockerfile and use containerized Lambdas. Saves so many headaches. Still have to deal with RIE though, which is annoying.
But yes, the less of a core building block the specific service is (or widely used internally in Amazon), the more likely you are to run into significant issues.
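For what it's worth, here's a rough sketch of the kind of handler you'd ship in a containerized Lambda. The module name `app`, the function name `handler`, and the event shape are all made up for illustration; the only thing I'm relying on is that a Python Lambda handler is a plain function taking `(event, context)`. That means you can smoke test the logic by calling it directly, and only reach for RIE when you need to exercise the container's runtime interface itself.

```python
import json


def handler(event, context):
    """Hypothetical entry point; a container image would reference it as "app.handler"."""
    # Assumed event shape for this sketch: an optional "name" key.
    name = event.get("name", "world")
    return {
        "statusCode": 200,
        "body": json.dumps({"message": f"hello {name}"}),
    }


if __name__ == "__main__":
    # Local smoke test: the handler is an ordinary function, so call it with a
    # fake event. The context argument is unused here, so None is enough.
    print(handler({"name": "local"}, None))
```

This doesn't replace testing the actual image, but it keeps most of the feedback loop out of Docker.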
Hmm, is it actually that bad? Keep in mind R2 data is only stored in one region, which is chosen when the bucket is first created, so that might be what you're seeing.
But I've never really looked too closely because I just use it for non-latency-critical blob storage.
Personally, EMR has never shaken off the "scrappy" feeling (sometimes it feels OK if you're using Spark), and it feels even more neglected recently as they seem to want you on AWS Glue or Athena. Lake Formation is... a thing that I'm sure is good in theory if you're using only managed services, but in practice is like taking a quick jaunt on the Event Horizon.
Glue Catalog has some annoying assumptions baked in.
Frankly the entire analytics space on AWS feels like a huge mess of competing teams and products instead of a uniform vision.