
I was doing historical evaluation for a few sites, so I was running a query for each month going back to 2016 for each site. I've done this before with no real issues, and if I'd known the charges were rapidly exploding I'd have halted the script immediately - but instead it ran for 2 hours and the first notice I got was the CC charge.


My guess is you were querying all the data each time.

If you instead filter down to just the rows you are interested in (e.g. the particular "few sites" by their URL) and put them in a new table, querying the resulting, tiny table will be very cheap.
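
Something along these lines (just a sketch; the destination table, columns, and domains here are placeholders, not the real schema):

  -- One expensive scan up front to materialize only the rows you care about.
  CREATE TABLE `my_project.my_dataset.my_sites` AS
  SELECT page, url, payload
  FROM `{table}`
  WHERE page LIKE '%example-site-1.com/%'
     OR page LIKE '%example-site-2.com/%';

  -- Every follow-up query then hits the small table, which costs next to nothing.
  SELECT page, url, payload
  FROM `my_project.my_dataset.my_sites`
  WHERE url LIKE '%example.com%';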


I haven't looked at the exact schema for this dataset, but for this query pattern to be efficient the data would need to be partitioned by date.[1] I'm guessing it isn't, so each query that looked at "one month" of data did a full table scan: query N months and you pay for N table scans. The same results could have been had, even without partitioning, from a single table scan with some kind of aggregation (e.g. GROUP BY) clause - see the sketch below.

[1]: https://cloud.google.com/bigquery/docs/partitioned-tables
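
A sketch of what I mean, assuming a date column exists (I'm calling it `date` here; the real schema may differ, and the filter is a placeholder):

  -- One table scan covering every month at once, instead of N monthly scans.
  SELECT
    FORMAT_DATE('%Y-%m', date) AS month,
    COUNT(*) AS pages
  FROM `{table}`
  WHERE page LIKE '%example.com/%'
  GROUP BY month
  ORDER BY month;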


Can you be more specific? What filtering did you apply? How many columns did you select?


SELECT page, url, payload FROM `{table}` WHERE page like '%{site_domain}/%' AND url like '%[EXAMPLE.COM]%'


I wouldn’t expect either of those filters to utilize a partition key if one exists. So yeah, you probably did a full table scan every time. Is the partitioning documented somewhere?
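
For reference, if the table were partitioned on a date column, the filter would have to reference that column directly for BigQuery to prune partitions. A hypothetical sketch (the `date` column is my assumption, not the documented schema):

  -- Scans only the January 2016 partitions *if* the table is partitioned on `date`;
  -- otherwise this is still a full scan.
  SELECT page, url, payload
  FROM `{table}`
  WHERE date BETWEEN DATE '2016-01-01' AND DATE '2016-01-31'
    AND page LIKE '%{site_domain}/%';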


Yeah, 'LIKE' with a leading wildcard usually gives you a full table scan, which is brutal. If it were my own data I'd chop the fields up and index them properly - which is the issue here: it's not your data, so you don't get a say in the indexes, but you still pay for every byte scanned even though you can't apply an index of your own.
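
The closest thing you can do in BigQuery is materialize your own copy with the host split out and the table clustered on it, so filters become exact matches instead of leading-wildcard LIKEs. A rough sketch (NET.HOST is a BigQuery function; the destination table name is made up, and I'm assuming clustering without partitioning is available to you):

  -- Pay for one full scan to break the host out and cluster on it...
  CREATE TABLE `my_project.my_dataset.pages_by_host`
  CLUSTER BY host AS
  SELECT NET.HOST(url) AS host, page, url, payload
  FROM `{table}`;

  -- ...then equality filters on host prune most of the data.
  SELECT page, url, payload
  FROM `my_project.my_dataset.pages_by_host`
  WHERE host = 'example.com';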


Seems like an ideal case for pre-processing. You still have to do one full scan, but only one.

I’m not familiar with your use case or BigQuery, but in Redshift I’d just do a COPY from S3 into a local table, then a CREATE TABLE AS SELECT with some logic to split those URLs for your purpose.

You might even be able to do it all in one step with Spectrum.
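
Roughly like this (a sketch only; the bucket, IAM role, table names, and column types are all made up, and the COPY format depends on how the export was written):

  -- Staging table must exist before COPY; columns and types here are placeholders.
  CREATE TABLE raw_pages (page VARCHAR(2048), url VARCHAR(2048), payload VARCHAR(MAX));

  -- One bulk load from S3 into the local staging table...
  COPY raw_pages
  FROM 's3://my-bucket/har-export/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
  FORMAT AS PARQUET;

  -- ...then one CREATE TABLE AS SELECT that splits the URL into a host column,
  -- e.g. 'https://example.com/page' -> 'example.com', for cheap filtering later.
  CREATE TABLE pages_by_site AS
  SELECT SPLIT_PART(url, '/', 3) AS host, page, url, payload
  FROM raw_pages;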



