
I was doing historical evaluation for a few sites, so I was running a query for each month going back to 2016 for each site. I've done this before with no real issues, and if I'd known the charges were rapidly exploding I'd have halted the script immediately - but instead it ran for 2 hours and the first notice I got was the CC charge.


My guess is you were querying all the data each time.

If you instead filter down to just the rows you are interested in (e.g. the particular "few sites" by their URL) and put them in a new table, querying the resulting, tiny table will be very cheap.
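
Something along these lines (just a sketch; the destination table, columns, and domains here are placeholders, not the real schema):

  -- One expensive scan up front to materialize only the rows you care about.
  CREATE TABLE `my_project.my_dataset.my_sites` AS
  SELECT page, url, payload
  FROM `{table}`
  WHERE page LIKE '%example-site-1.com/%'
     OR page LIKE '%example-site-2.com/%';

  -- Every follow-up query then hits the small table, which costs next to nothing.
  SELECT page, url, payload
  FROM `my_project.my_dataset.my_sites`
  WHERE url LIKE '%example.com%';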


I haven't looked at the exact schema for this dataset, but for this query pattern to be efficient the data would need to be partitioned by date.[1] I'm guessing it isn't, so each query that looked at "one month" of data did a full table scan: query N months and you pay for N table scans. The same results could have been had, even without partitioning, from a single table scan with some kind of aggregation (e.g. GROUP BY) clause - see the sketch below.

[1]: https://cloud.google.com/bigquery/docs/partitioned-tables
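
A sketch of what I mean, assuming a date column exists (I'm calling it `date` here; the real schema may differ, and the filter is a placeholder):

  -- One table scan covering every month at once, instead of N monthly scans.
  SELECT
    FORMAT_DATE('%Y-%m', date) AS month,
    COUNT(*) AS pages
  FROM `{table}`
  WHERE page LIKE '%example.com/%'
  GROUP BY month
  ORDER BY month;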


Can you be more specific? What filtering did you apply? How many columns did you select?


SELECT page, url, payload FROM `{table}` WHERE page like '%{site_domain}/%' AND url like '%[EXAMPLE.COM]%'


I wouldn’t expect either of those filters to utilize a partition key if one exists. So yeah, you probably did a full table scan every time. Is the partitioning documented somewhere?
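
For reference, if the table were partitioned on a date column, the filter would have to reference that column directly for BigQuery to prune partitions. A hypothetical sketch (the `date` column is my assumption, not the documented schema):

  -- Scans only the January 2016 partitions *if* the table is partitioned on `date`;
  -- otherwise this is still a full scan.
  SELECT page, url, payload
  FROM `{table}`
  WHERE date BETWEEN DATE '2016-01-01' AND DATE '2016-01-31'
    AND page LIKE '%{site_domain}/%';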


Yeah, 'LIKE' with a leading wildcard usually gives you a full table scan, which is brutal. If it were my own data I'd chop the fields up and index them properly - which is the issue here: it's not your data, so you don't get a say in the indexes, but you still pay for every byte scanned even though you can't apply an index of your own.
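
The closest thing you can do in BigQuery is materialize your own copy with the host split out and the table clustered on it, so filters become exact matches instead of leading-wildcard LIKEs. A rough sketch (NET.HOST is a BigQuery function; the destination table name is made up, and I'm assuming clustering without partitioning is available to you):

  -- Pay for one full scan to break the host out and cluster on it...
  CREATE TABLE `my_project.my_dataset.pages_by_host`
  CLUSTER BY host AS
  SELECT NET.HOST(url) AS host, page, url, payload
  FROM `{table}`;

  -- ...then equality filters on host prune most of the data.
  SELECT page, url, payload
  FROM `my_project.my_dataset.pages_by_host`
  WHERE host = 'example.com';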


Seems like an ideal case for pre-processing. You still have to do one full scan, but only one.

I’m not familiar with your use case or BigQuery, but in Redshift I’d just do a COPY from S3 into a local table, then a CREATE TABLE AS SELECT with some logic to split those URLs for your purpose.

You might even be able to do it all in one step with Spectrum.
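
Roughly like this (a sketch only; the bucket, IAM role, table names, and column types are all made up, and the COPY format depends on how the export was written):

  -- Staging table must exist before COPY; columns and types here are placeholders.
  CREATE TABLE raw_pages (page VARCHAR(2048), url VARCHAR(2048), payload VARCHAR(MAX));

  -- One bulk load from S3 into the local staging table...
  COPY raw_pages
  FROM 's3://my-bucket/har-export/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
  FORMAT AS PARQUET;

  -- ...then one CREATE TABLE AS SELECT that splits the URL into a host column,
  -- e.g. 'https://example.com/page' -> 'example.com', for cheap filtering later.
  CREATE TABLE pages_by_site AS
  SELECT SPLIT_PART(url, '/', 3) AS host, page, url, payload
  FROM raw_pages;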



