AlexClickHouse's comments

I've recently implemented a similar service: https://adsb.exposed/?dataset=Birds&zoom=5 - a viewer for eBird data, where you can filter by various species and make visualizations with SQL.

Full write-up on how it is created: https://clickhouse.com/blog/birds


USearch is this type of library: https://github.com/unum-cloud/usearch

It is used in ClickHouse and a few other DBMSs.


I implemented a similar site a few years ago, with one crucial difference that makes it even simpler: https://pastila.nl/

The difference is that there is no "share" button, so there is nothing to press; you just copy the page URL at any time.


Neat! I like your solution, much better for simple text sharing.


Exactly as in MS Access, Interbase/Firebird, and dBase II.


Flawed methodology - incorrect results.


Thanks for creating this issue, it is worth investigating!

I see you also created similar issues in Polars: https://github.com/pola-rs/polars/issues/17932 and DuckDB: https://github.com/duckdb/duckdb/issues/17066

ClickHouse has a built-in memory tracker, so even if there is not enough memory, it will stop the query and send an exception to the client, instead of crashing. It also allows fair sharing of memory between different workloads.
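
For example, here is a minimal sketch of how the tracker behaves under a per-query limit (max_memory_usage is a real setting; the limit value and the query are illustrative):

    SET max_memory_usage = 10000000000;  -- cap this query at ~10 GB

    -- An aggregation needing far more memory than that stops with
    -- "Code: 241. DB::Exception: Memory limit (for query) exceeded"
    -- instead of crashing the server:
    SELECT number % 1000000000 AS k, count()
    FROM numbers(10000000000)
    GROUP BY k;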

You need to provide more info on the issue for reproduction, e.g., how to fill the tables. 16 GB of memory should be enough even for a CROSS JOIN between a 10-billion-row table and a 100-row table, because it is processed in a streaming fashion, without accumulating a large amount of data in memory. The same should be true for a merge join.
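
As a rough illustration (a sketch using the table shapes mentioned above; runtime aside, peak memory stays small):

    -- The 100-row side is held in memory; the 10-billion-row side
    -- streams through block by block, so no large buffer accumulates:
    SELECT count()
    FROM numbers(10000000000) AS big
    CROSS JOIN numbers(100) AS small;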

However, there are places where a large buffer might be needed. For example, inserting data into a table backed by S3 storage requires a buffer that can be on the order of 500 MB.
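
The size of that upload buffer is tunable; for instance (these setting names exist in ClickHouse, but the values below are purely illustrative):

    SET s3_min_upload_part_size = 536870912;         -- 512 MiB multipart chunks
    SET s3_max_single_part_upload_size = 268435456;  -- 256 MiB single-part cap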

There is a possibility that your machine has 16 GB of memory, but most of it is consumed by Chrome, Slack, or Safari, and not much is left for the ClickHouse server.


Yeah, I feel like I'm on crazy pills: I'm OOMing all these big-data tools that everyone loves, completely trivially -- DuckDB OOM'd just loading a CSV file, and Polars OOM'd just reading the first couple of rows of a Parquet file?

I do want to get a better reproduction on CH, because it seems to be some interplay within the INSERT INTO...SELECT. It's just a bit of work to generate synthetic data with the same profile as my production data (for what it's worth, I did put quite a bit of effort into following the doc guidelines for dealing with low-memory machines).


ClickHouse predates Apache Arrow.


I vaguely remember an old bug in atop, leading to a very unusual consequence.

Atop would do an invalid memory write and crash with a segfault. But that write landed on a memory page mapped to a hardware timer. Even though the write never succeeded, merely touching that page somehow changed how the hardware timer behaved. The OS then detected that the timer had become inaccurate and switched to a different clock source (which you can see in /sys/devices/system/clocksource/clocksource0/current_clocksource). As a result, every call to clock_gettime became slower, and the whole system slowed down until it was restarted.

In short, a segfault in atop degraded the performance of the whole system. But this was found maybe seven years ago.


This was found by the very same Rachel who's sounding the alarm here:

https://rachelbythebay.com/w/2014/03/02/sync/


That is such an interesting bug!


ClickHouse does not allow external connections by default.

If someone wants to configure unauthenticated access from the Internet, they have to take the following extra steps (a config sketch follows the list):

- enable listening to the wildcard address;

- remove IP filtering for the default user;

- set up no-password authentication.
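
To make that concrete, the sketch below shows roughly what such a (deliberately bad) configuration looks like; the element names follow ClickHouse's XML config format, and the values are illustrative:

    <!-- config.xml: listen on the wildcard address -->
    <listen_host>0.0.0.0</listen_host>

    <!-- users.xml: default user with an empty password and no IP filtering -->
    <users>
      <default>
        <password></password>  <!-- no-password authentication -->
        <networks>
          <ip>::/0</ip>        <!-- IP filtering removed -->
        </networks>
      </default>
    </users>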

It is possible to ignore and turn off all the guardrails that the system has by default, but it takes extra effort. However, it's possible that someone copy-pasted a wrong configuration file from somewhere without knowing what was inside, or did something like listening on localhost while exposing the ports from Docker.

A use case for direct database access exists, and is acceptable, assuming you set up a readonly user, grant access to specific tables, limit queries by complexity, and limit total usage by quotas. This is demonstrated by the following public services:

https://play.clickhouse.com/

https://adsb.exposed/

https://reversedns.space/

In this way, ClickHouse can be used to implement public data APIs (which is probably not what DeepSeek wanted).
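
A hedged sketch of such a locked-down setup (the user name, table, and limits are all illustrative):

    CREATE USER demo IDENTIFIED WITH no_password
        SETTINGS readonly = 1;                     -- read-only access
    GRANT SELECT ON flights.positions TO demo;     -- one specific table only
    CREATE SETTINGS PROFILE public_limits SETTINGS
        max_execution_time = 30,                   -- complexity limits
        max_result_rows = 1000000
        TO demo;
    CREATE QUOTA demo_quota FOR INTERVAL 1 hour
        MAX queries = 1000 TO demo;                -- total usage quota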

ClickHouse has a wide range of security and access control features: authentication with SSL certificates or SSH keys; even simple password-based auth supports bcrypt and short-lived credentials; integration with LDAP and Kerberos; every authentication method can be restricted at the network level; full Role-Based Access Control; fine-grained restrictions on query complexity and resource consumption; and user quotas.
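
For instance, bcrypt-based auth combined with a network-level restriction can be set up like this (a sketch; the user name, passphrase, and address range are illustrative):

    CREATE USER analyst
        IDENTIFIED WITH bcrypt_password BY 'S3cret-passphrase'
        HOST IP '10.0.0.0/8';  -- this account can only connect from this network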

But still, according to Shodan, there are 33,000 misconfigured ClickHouse servers on the Internet (https://www.shodan.io/search?query=clickhouse). This can be attributed to the high popularity of ClickHouse (it is the most widely used analytical DBMS).

When you use ClickHouse Cloud, the commercial cloud service based on the open-source ClickHouse database (https://clickhouse.com/cloud), it ensures the needed security measures and strengthens the defaults even further: TLS, strong credentials, and IP filtering; plus it allows private links, data encryption with customer-managed keys, etc.


Thanks for your insight. I got ratioed to fuck for trying to defend the standpoint that it's an unusual expectation for a regular engineer to stand this up correctly.

https://news.ycombinator.com/item?id=42873134


If you're referring to the downvotes on https://news.ycombinator.com/item?id=42873211, I think that comment would have done better if you had omitted the swipes, as the site guidelines ask: https://news.ycombinator.com/newsguidelines.html.

e.g. "You are, in typical HN style, minimising the problem into insignificance" and "love how this is getting ratioed by egotistical self confessed x10 engineers". This is the sort of thing commenters here are asked to edit out of their comment, and when they don't, it's correct to downvote them (even though your underlying points may otherwise be correct).


lol, nice. Getting out in front of anyone even potentially pointing fingers at ClickHouse. Good initiative.

