But these days, just use Trino or whatever. There are lots of newer ways to work on data that are as big a step up over Spark - in ergonomics, performance, and price - as Spark was over Hadoop.
The nice thing about Spark is the Scala/Python/R APIs. They help you avoid a lot of the irritating things about SQL (applying the same transformation to multiple columns is a big one).
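For instance, here's a minimal PySpark sketch (the data and column names are made up) of the kind of thing that becomes a loop in the DataFrame API but copy-pasted expressions in plain SQL:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("multi-column-example").getOrCreate()

# Hypothetical data and columns, just to illustrate the pattern
df = spark.createDataFrame(
    [(1, 10.0, 1.5, 3.0), (2, 20.0, 3.0, 5.0)],
    ["id", "price", "tax", "shipping"],
)

# Apply the same transformation to several columns with a loop,
# instead of repeating the expression once per column in a SELECT
for c in ["price", "tax", "shipping"]:
    df = df.withColumn(c, F.round(F.col(c) * 1.1, 2))

df.show()
```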
I really can't speak highly enough of Trino (though I used it as AWS Athena, and this was back when Trino was called Presto). It's impressive how well it took "ever growing pile of CSV/JSON/Excel/Parquet/whatever" and let you query it via SQL as-is without transforming it and putting it into some other system.
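Roughly like this with today's Trino Python client (the host, catalog, schema, and table names are placeholders, and you still need a table in the Hive/Glue catalog pointing at the files), but the point stands: it's just SQL over the files where they already live.

```python
import trino

# Placeholder connection details; assumes a Trino cluster with a Hive/Glue
# catalog whose external tables point at raw CSV/JSON/Parquet files in place
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)

cur = conn.cursor()
# Plain SQL over the files as they sit in storage; no ETL into another system
cur.execute("SELECT event_type, count(*) FROM raw_events GROUP BY event_type")
for row in cur.fetchall():
    print(row)
```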
Hadoop was fundamentally a batch processing system for large data files; it was never intended for the sort of online reporting and analytics workloads that the DW concept addressed. No amount of Pig and Hive and HBase and subsequent tools layered on top of it could ever change that basic fact.
The biggest gripe I have is how crazy expensive it is.