
Update: Google has been helping me out now, thankfully. Hopefully we can make sure this doesn't happen to others.


Can you please pick a different username that we can rename your account to? Some users are complaining that your current username is misleading (e.g. here: https://news.ycombinator.com/item?id=39447421)

Edit: since I didn't hear back from you, I've consed a 'not' onto the username 'httparchive'. If you prefer a different name, feel free to contact us at hn@ycombinator.com.


Yeah, I have spent much more than $14k to date and would have spent much more over time; losing my business isn't rational for them. I think it's just another "Google can't do customer support to literally save their life" example.


I've forgotten more SQL than most people ever learn. Time is also valuable and I make trade-offs. Should I spend hours (i.e., $$$) to optimize, or run a non-optimized query in the background for a different cost? I didn't think the time/benefit/cost equation favored tuning; had I known the real cost, I'd have spent the time on tuning. If you offer something for "free", then change the cost, and don't have any alerting mechanism for inefficient queries, it's impossible to evaluate the trade-offs.


Can you post what a $14,000 SQL query looks like?

If nothing else, it can be an example in my SQL 101 course.


It's rarely interesting logic that makes a query expensive, because the per-query charge is based not on compute cycles but on the amount of data scanned. This is sufficient:

`SELECT * FROM super_wide_table_with_lots_of_text WHERE NOT filter_on_partitions_or_clusters`

SELECT * is dangerous because BigQuery is a column store: you really need to look at the schema and select only the columns you want. And when exploring the data, it's important to use sane limits and pull from a single partition.
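A minimal sketch of the safer pattern (the table, the partition column, and its name are made up for illustration; check the real schema first):

    -- Scan only the columns you need, from a single partition.
    SELECT page, url
    FROM `project.dataset.super_wide_table_with_lots_of_text`
    WHERE partition_date = '2024-01-01'  -- prunes the scan to one partition
    LIMIT 100  -- caps rows returned; on-demand billing still charges for bytes scanned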


Here you go!

    SELECT page, url, payload
    FROM `{table}`
    WHERE page LIKE '%{site_domain}/%'
      AND url LIKE '%[EXAMPLE.COM]%'

---

There's no LIMIT on it b/c I actually need all the results.


This would make a great educational blog post


> Time is also valuable and I make trade-offs.

I'd say!


I was doing a historical evaluation of a few sites, so for each site I was running a query for each month going back to 2016. I've done this before with no real issues, and if I had known the charges were rapidly exploding I'd have halted the script immediately - but instead it ran for two hours, and the first notice I got was the credit card charge.


My guess is you were querying all the data each time.

If you instead filter down to the rows you are interested in (e.g. the particular "few sites" by their URL) and put them in a new table, querying the resulting, tiny table will be very cheap.
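A rough sketch of that, reusing the query posted elsewhere in the thread (the scratch project and dataset names are placeholders):

    -- Pay for one expensive scan to materialize the slice you need...
    CREATE TABLE `myproject.scratch.my_sites` AS
    SELECT page, url, payload
    FROM `{table}`
    WHERE page LIKE '%{site_domain}/%';

    -- ...then every follow-up query scans only the small table.
    SELECT page, url, payload
    FROM `myproject.scratch.my_sites`
    WHERE url LIKE '%[EXAMPLE.COM]%';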


I haven't looked at the exact schema for this dataset, but for this type of query pattern to be efficient, the data would need to be partitioned by date.[1] I'm guessing that it's not, and that each of these "one month" queries was therefore doing a full table scan - so querying N months cost N table scans, even though the same results could have been achieved without partitioning by doing a single table scan with some kind of aggregation, e.g. a GROUP BY clause (see the sketch below the footnote).

[1]: https://cloud.google.com/bigquery/docs/partitioned-tables
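Something like this, as a sketch - assuming the table has a date column to filter or partition on (the column name is a guess):

    -- One full scan instead of N: pull every month at once,
    -- tagged with the month it belongs to.
    SELECT FORMAT_DATE('%Y-%m', date) AS month, page, url, payload
    FROM `{table}`
    WHERE page LIKE '%{site_domain}/%'
      AND date >= '2016-01-01'
    ORDER BY month;  -- or GROUP BY month instead, if you only need aggregates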


Can you be more specific? What filtering did you apply? How many columns did you select?


    SELECT page, url, payload
    FROM `{table}`
    WHERE page LIKE '%{site_domain}/%'
      AND url LIKE '%[EXAMPLE.COM]%'


I wouldn’t expect either of those filters to utilize a partition key if one exists. So yeah, you probably did a full table scan every time. Is the partitioning documented somewhere?


Yeah, 'LIKE' ops usually give you a full table scan, which is brutal. If it were my own data I'd chop the fields up and index them properly - which is the issue here: it's not your data, so you don't get a say in the indexes, but you do have to pay for every byte scanned even though you can't apply an index of your own.
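For illustration, here's what the "if it were my own data" version could look like in BigQuery, which uses clustering rather than classic indexes (the table names are hypothetical; NET.HOST and CLUSTER BY are real features):

    -- Materialize the host into its own column once, clustered for pruning...
    CREATE TABLE `myproject.mydata.pages_by_host`
    CLUSTER BY host AS
    SELECT NET.HOST(url) AS host, page, url, payload
    FROM `{table}`;

    -- ...then equality filters on the clustering column cut the bytes scanned.
    SELECT page, url
    FROM `myproject.mydata.pages_by_host`
    WHERE host = 'example.com';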


Seems like an ideal case for pre-processing. You still have to do one full scan, but only one.

I’m not familiar with your use case or BigQuery but in Redshift I’d just do a COPY to a local table from S3 then do a CREATE TABLE AS SELECT with some logic to split those URLs for your purpose.
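A hedged sketch of that (the bucket, IAM role, and schema are placeholders, and it assumes raw_pages was already created with a matching column layout):

    -- Load the export from S3 into a local Redshift table...
    COPY raw_pages
    FROM 's3://my-bucket/export/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-role'
    FORMAT AS PARQUET;

    -- ...then split the URLs once, up front.
    -- SPLIT_PART(url, '/', 3): 'https://example.com/path' -> 'example.com'
    CREATE TABLE pages_by_host AS
    SELECT SPLIT_PART(url, '/', 3) AS host, page, url
    FROM raw_pages;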

You might even be able to do it all in one step with Spectrum.


Yes, sure, there's stuff I could have done better, had I stayed up all night reading the fine print. But that's not the point - this is a *warning* to other people who see the Internet Archive logo and the word "public", and who for some dumb reason also trust Google. I'm hoping this doesn't happen to others; I learned a costly lesson.


It wasn't personal use - it was for business - but I'm bootstrapping a startup, so it's a very tough lesson to learn.


Yup, I'm already having to pay legal fees - which is why you have a biz lawyer on retainer to start with - but I'm not sure I have any standing.


IANAL, but if this happened to me I would be gathering as many examples as I could of this having happened to other people. The angle being: Google knows this is a huge issue. Effectively, they know that they have (presumably accidentally) created a really dangerous trap for small players, and have chosen to do nothing about it.

In some jurisdictions I think that reduces the legitimacy of their claim that you actually owe them money.

EDIT: Even better, focus on the examples where Google "forgave" the debt; you could argue that those examples prove that Google knows it's at least partly their fault.


The FTC is already investigating: https://www.ftc.gov/policy/advocacy-research/tech-at-ftc/202...

I think we (the developer community) need to start pushing back against this abuse; it's getting out of control.

The thing that bothers me the most is that I only caught this $14k charge b/c I'm a small fry and that money matters to me. How many big accounts just wouldn't notice it? I can't help but think a very non-trivial % of all cloud revenue is just obscure fees that nobody notices - engineers do the engineering, accounts receivable pays the bills, and the cloud providers get fat.


I would love to see an example of this working.

I know that if it did work, it would change the opportunity cost of forgiving debt in these cases dramatically.


I honestly think it would be better if they didn't have the option to "forgive the debt" — at least without following up by eliminating the trap that created said debt.

How often is one of these accidental debts created? How often do customers just pay up because it's small enough that it's not worth fighting? How often does AWS (or Google or whoever) decide whether to forgive the debt based on PR damage control rather than the legitimacy of the debt? Jeez I hope someone leaks those numbers one day.

It reminds me of all those horror stories of hospital visits in the USA, where the first bill you receive is just a test to see if they can squeeze that much out of you, but if you know what you're doing or just can't pay then the actual bill is way lower. It's all just yucky.

If big cloud providers couldn't selectively choose which of these debts to enforce, I bet there would be a media shitstorm and then they would suddenly discover that it's not all thaaaaat hard to implement real time billing and hard caps after all.


Well, the "trap" is the lack of hard limits - which, if implemented, would enable some companies to blow up their own businesses by hitting the cap. That's arguably a better outcome than people who can't afford it getting big bills, but it is a trade-off, even aside from the providers arguably collecting some money people didn't intend to give them.


Yeah, I'm basically just having to write this off, so it sucks for me (a lot - I'm bootstrapping a startup), but I'm more worried about other people (especially students) getting caught up in what feels like a scam, given that the language on the website doesn't, ya know, mention the risk of being charged $14k.


The getting started guide linked by the website states:

> Note: The size of the tables you query is important because BigQuery is billed based on the amount of data processed. There is 1TB of processed data included in the free tier, so running a full-scan query on one of the larger tables can easily eat up your quota. This is where it becomes important to design queries that process only the data you wish to explore

Could this be a bigger warning? Sure.

Is something a scam just because they don't explain the general implications of entering your payment information into a usage-billed product? Not really.


There's "scam" in the sense of "it didn't do what they said"/"charged me more than they said", and there's a more colloquial "scam" where the UI is designed to obscure the cost of a task (quintessential dark pattern stuff). I don't think the reporter is saying "they lied about big query", they're saying the UI is set up to make extremely expensive mistakes very easy, and it's set up to hide the actual cost of the query.

Estimating the total cost of a query is obviously fraught, but from the UI and other comments it sounds like BigQuery knows up front how much data a query will scan, and there's a fixed price per TB, so the UI could just say "this will cost at least $X". Instead it shows a very basic "this will process X PB of data" note. So they're charging by the TB but showing the usage in PB, which is a) a number 1,000x smaller, and b) visually similar to "TB".
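For scale, assuming the on-demand rate is about $6.25 per TiB (an assumption - rates vary by region and have changed over time), a $14k bill implies something like:

    2,240 TiB scanned × $6.25/TiB ≈ $14,000

...and the UI would have shown that scan as roughly "2.2 PB", a number that's easy to skim past.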

It's very hard to see that as anything other than "design to obscure cost": there's no reason not to say "this will cost $X" when the price is per TB; even setting that aside, the pricing is per TB but the estimate is shown in PB; the checkbox and the textual description are smaller than other text on the page; and there's no way to specify a cost cap.


I understand the argument against hard circuit-breakers ("yeah, it seems like a good idea, until a big traffic spike hits and I'm down"). But it makes even me cautious about scenarios where I could just fat-finger something. There are some controls, but in most cases there are no guarantees.


Yes and no: I've run the script before and the fee wasn't that high (they jacked the price up last summer). Usually I have to jump through a ton of hoops just to add more CPU cores to my VMs, so I "trusted" that GCP would warn me if I ever made an error.

One of the bigger issues is that they charged my card before I had literally any notice of what the bill was - it wasn't even in the dashboard yet. I would have terminated the script ASAP had I gotten *any* warning.


I learned of that billing-limiting mechanism after the $14k was charged to my account. As designed.

