
Update: Google has been helping me out now, thankfully. Hopefully we can make sure this doesn't happen to others.


Can you please pick a different username that we can rename your account to? Some users are complaining that your current username is misleading (e.g. here: https://news.ycombinator.com/item?id=39447421)

Edit: since I didn't hear back from you, I've consed a 'not' onto the username 'httparchive'. If you prefer a different name, feel free to contact us at hn@ycombinator.com.


Yeah, I have spent much more than $14k to date and would have spent much more over time; losing my business isn't rational for them. I think it's just another "Google can't do customer support to literally save their life" example.


I've forgotten more SQL than most people ever learn. Time is also valuable and I make trade-offs. Should I spend hours (i.e., $$$) to optimize, or run a non-optimized query in the background for a different cost? I didn't think the time/benefit/cost equation favored tuning; had I known the real cost, I'd have spent the time on tuning. If you offer something for "free", then change the cost, and don't have any alerting mechanism for inefficient queries, it's impossible to evaluate the trade-offs.


Can you post what a $14,000 SQL query looks like?

If nothing else, it can be an example in my SQL 101 course.


It's rarely interesting logic that makes a query expensive, because the per-query charge is based not on compute cycles but on the amount of data scanned. This is sufficient:

`SELECT * FROM super_wide_table_with_lots_of_text WHERE NOT filter_on_partitions_or_clusters`

SELECT * is dangerous because BigQuery is a column store: you really need to look at the schema and select only the columns you want. And when exploring the data, it's important to use sane limits and pull from a single partition.
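A minimal sketch of the safer pattern (the table, the partition column, and its name are made up for illustration; check the real schema first):

    -- Scan only the columns you need, from a single partition.
    SELECT page, url
    FROM `project.dataset.super_wide_table_with_lots_of_text`
    WHERE partition_date = '2024-01-01'  -- prunes the scan to one partition
    LIMIT 100  -- caps rows returned; on-demand billing still charges for bytes scanned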


Here you go!

    SELECT page, url, payload
    FROM `{table}`
    WHERE page LIKE '%{site_domain}/%'
      AND url LIKE '%[EXAMPLE.COM]%'

---

There's no LIMIT on it b/c I actually need all the results.


This would make a great educational blog post


> Time is also valuable and I make trade-offs.

I'd say!


I was doing a historical evaluation of a few sites, so for each site I was running a query for each month going back to 2016. I've done this before with no real issues, and if I had known the charges were rapidly exploding I'd have halted the script immediately - but instead it ran for two hours, and the first notice I got was the credit card charge.


My guess is you were querying all the data each time.

If you instead filter down to the rows you are interested in (e.g. the particular "few sites" by their URL) and put them in a new table, querying the resulting, tiny table will be very cheap.
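A rough sketch of that, reusing the query posted elsewhere in the thread (the scratch project and dataset names are placeholders):

    -- Pay for one expensive scan to materialize the slice you need...
    CREATE TABLE `myproject.scratch.my_sites` AS
    SELECT page, url, payload
    FROM `{table}`
    WHERE page LIKE '%{site_domain}/%';

    -- ...then every follow-up query scans only the small table.
    SELECT page, url, payload
    FROM `myproject.scratch.my_sites`
    WHERE url LIKE '%[EXAMPLE.COM]%';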


I haven't looked at the exact schema for this dataset, but for this type of query pattern to be efficient, the data would need to be partitioned by date.[1] I'm guessing that it's not, and that each of these "one month" queries was therefore doing a full table scan - so querying N months cost N table scans, even though the same results could have been achieved without partitioning by doing a single table scan with some kind of aggregation, e.g. a GROUP BY clause (see the sketch below the footnote).

[1]: https://cloud.google.com/bigquery/docs/partitioned-tables
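Something like this, as a sketch - assuming the table has a date column to filter or partition on (the column name is a guess):

    -- One full scan instead of N: pull every month at once,
    -- tagged with the month it belongs to.
    SELECT FORMAT_DATE('%Y-%m', date) AS month, page, url, payload
    FROM `{table}`
    WHERE page LIKE '%{site_domain}/%'
      AND date >= '2016-01-01'
    ORDER BY month;  -- or GROUP BY month instead, if you only need aggregates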


Can you be more specific? What filtering did you apply? How many columns did you select?


    SELECT page, url, payload
    FROM `{table}`
    WHERE page LIKE '%{site_domain}/%'
      AND url LIKE '%[EXAMPLE.COM]%'


I wouldn’t expect either of those filters to utilize a partition key if one exists. So yeah, you probably did a full table scan every time. Is the partitioning documented somewhere?


Yeah, 'LIKE' ops usually give you a full table scan, which is brutal. If it were my own data I'd chop the fields up and index them properly - which is the issue here: it's not your data, so you don't get a say in the indexes, but you do have to pay for every byte scanned even though you can't apply an index of your own.
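For illustration, here's what the "if it were my own data" version could look like in BigQuery, which uses clustering rather than classic indexes (the table names are hypothetical; NET.HOST and CLUSTER BY are real features):

    -- Materialize the host into its own column once, clustered for pruning...
    CREATE TABLE `myproject.mydata.pages_by_host`
    CLUSTER BY host AS
    SELECT NET.HOST(url) AS host, page, url, payload
    FROM `{table}`;

    -- ...then equality filters on the clustering column cut the bytes scanned.
    SELECT page, url
    FROM `myproject.mydata.pages_by_host`
    WHERE host = 'example.com';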


Seems like an ideal case for pre-processing. You still have to do one full scan, but only one.

I’m not familiar with your use case or BigQuery but in Redshift I’d just do a COPY to a local table from S3 then do a CREATE TABLE AS SELECT with some logic to split those URLs for your purpose.
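A hedged sketch of that (the bucket, IAM role, and schema are placeholders, and it assumes raw_pages was already created with a matching column layout):

    -- Load the export from S3 into a local Redshift table...
    COPY raw_pages
    FROM 's3://my-bucket/export/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    FORMAT AS PARQUET;

    -- ...then split the URLs once, up front.
    -- SPLIT_PART(url, '/', 3): 'https://example.com/path' -> 'example.com'
    CREATE TABLE pages_by_host AS
    SELECT SPLIT_PART(url, '/', 3) AS host, page, url
    FROM raw_pages;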

You might even be able to do it all in one step with Spectrum.


Yes, sure, there's stuff I could have done better, had I stayed up all night reading the fine print. But that's not the point - this is a *warning* to other people who see the Internet Archive logo and the word "public", and who for some dumb reason also trust Google. I'm hoping this doesn't happen to others; I learned a costly lesson.


It wasn't personal use - it was for business - but I'm bootstrapping a startup, so it's a very tough lesson to learn.


Yup, I'm already having to pay legal fees - which is why you have a biz lawyer on retainer to start with - but I'm not sure I have any standing.


IANAL, but if this happened to me I would be gathering as many examples as I could of this having happened to other people. The angle being: Google knows this is a huge issue. Effectively, they know that they have (presumably accidentally) created a really dangerous trap for small players, and have chosen to do nothing about it.

In some jurisdictions I think that reduces the legitimacy of their claim that you actually owe them money.

EDIT: Even better, focus on the examples where Google "forgave" the debt; you could argue that those examples prove that Google knows it's at least partly their fault.


The FTC is already investigating: https://www.ftc.gov/policy/advocacy-research/tech-at-ftc/202...

I think we (the developer community) need to start pushing back against this abuse; it's getting out of control.

The thing that bothers me the most is that I only caught this $14k charge b/c I'm a small fry and that money matters to me. How many big accounts just wouldn't notice it? I can't help but think a very non-trivial % of all cloud revenue is just obscure fees that nobody notices - engineers do the engineering, accounts receivable pays the bills, and the cloud providers get fat.


I would love to see an example of this working.

I know that if it did work, it would change the opportunity cost of forgiving debt in these cases dramatically.


I honestly think it would be better if they didn't have the option to "forgive the debt" — at least without following up by eliminating the trap that created said debt.

How often is one of these accidental debts created? How often do customers just pay up because it's small enough that it's not worth fighting? How often does AWS (or Google or whoever) decide whether to forgive the debt based on PR damage control rather than the legitimacy of the debt? Jeez I hope someone leaks those numbers one day.

It reminds me of all those horror stories of hospital visits in the USA, where the first bill you receive is just a test to see if they can squeeze that much out of you, but if you know what you're doing or just can't pay then the actual bill is way lower. It's all just yucky.

If big cloud providers couldn't selectively choose which of these debts to enforce, I bet there would be a media shitstorm and then they would suddenly discover that it's not all thaaaaat hard to implement real time billing and hard caps after all.


Well, the "trap" is the lack of hard limits - which, if implemented, would enable some companies to blow up their own businesses by hitting the cap. That's arguably a better outcome than people who can't afford it getting big bills, but it is a trade-off, even aside from the providers arguably collecting some money people didn't intend to give them.


Yeah, I'm basically just having to write this off, so it sucks for me (a lot - I'm bootstrapping a startup), but I'm more worried about other people (especially students) getting caught up in what feels like a scam, given that the language on the website doesn't, ya know, mention the risk of being charged $14k.


The getting started guide linked by the website states:

> Note: The size of the tables you query is important because BigQuery is billed based on the amount of data processed. There is 1TB of processed data included in the free tier, so running a full-scan query on one of the larger tables can easily eat up your quota. This is where it becomes important to design queries that process only the data you wish to explore

Could this be a bigger warning? Sure.

Is something a scam just because they don't explain the general implications of entering your payment information into a usage-billed product? Not really.


There's "scam" in the sense of "it didn't do what they said"/"charged me more than they said", and there's a more colloquial "scam" where the UI is designed to obscure the cost of a task (quintessential dark pattern stuff). I don't think the reporter is saying "they lied about big query", they're saying the UI is set up to make extremely expensive mistakes very easy, and it's set up to hide the actual cost of the query.

Estimating the total cost of a query is obviously fraught, but from the UI and other comments it sounds like BigQuery knows up front how much data a query will scan, and there's a fixed price per TB, so the UI could just say "this will cost at least $X". Instead it shows a very basic "this will process X PB of data" note. So they're charging by the TB but showing the usage in PB, which is a) a number 1,000x smaller, and b) visually similar to "TB".
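For scale, assuming the on-demand rate is about $6.25 per TiB (an assumption - rates vary by region and have changed over time), a $14k bill implies something like:

    2,240 TiB scanned × $6.25/TiB ≈ $14,000

...and the UI would have shown that scan as roughly "2.2 PB", a number that's easy to skim past.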

It's very hard to see that as anything other than "design to obscure cost": there's no reason not to say "this will cost $X" when the price is per TB; even setting that aside, the pricing is per TB but the estimate is shown in PB; the checkbox and the textual description are smaller than other text on the page; and there's no way to specify a cost cap.


I understand the argument against hard circuit-breakers ("yeah, it seems like a good idea, until a big traffic spike hits and I'm down"). But it makes even me cautious about scenarios where I could just fat-finger something. There are some controls, but in most cases there are no guarantees.


Yes and no: I've run the script before and the fee wasn't that high (they jacked the price up last summer). Usually I have to jump through a ton of hoops just to add more CPU cores to my VMs, so I "trusted" that GCP would warn me if I ever made an error.

One of the bigger issues is that they charged my card before I had literally any notice of what the bill was - it wasn't even in the dashboard yet. I would have terminated the script ASAP had I gotten *any* warning.


I learned of that billing-limiting mechanism after the $14k was charged to my account. As designed.

