More

andreimatei1 · on Nov 23, 2024

Print debugging is the tool most people reach for when they can, but its biggest problem is that you have to change the source code to add the printfs. This is impractical in many circumstances; it generally only works on your local machine. In particular, you can't do that in production environments, and that's where the most interesting debugging happens. Similarly, traditional debuggers are not available in production either for a lot of modern a software -- you can't really attach gdb to your distributed service, for many reasons.

What print debugging and debuggers have in common, in contrast to other tools, is that they can extract data specific to your program (e.g values of variables and data structures) that your program was not instrumented to export. It's really a shame that we generally don't have this capability for production software running at scale.

That's why I'm working on Side-Eye [1], a debugger that does work in production. With Side-Eye, you can do something analogous to print debugging, but without changing code or restarting anything. It uses a combination of debug information and dynamic instrumentation.

[1] https://side-eye.io/

surajrmal · on Nov 23, 2024

How is side-eye different from dtrace?

andreimatei1 · on Nov 23, 2024

Side-Eye is massively inspired by DTrace in some of its raw capabilities and the basic idea of dynamic instrumentation. Beyond that, they're very different. At a low level, DTrace is primarily geared towards debugging the kernel, whereas Side-Eye is about userspace. DTrace's support for the DWARF debug information format used on linux is limited. The interaction model is different - for DTrace you write scripts to collect and process data. DTrace works at the level of one machine, whereas Side-Eye monitors processes across a fleet. In Side-Eye you interact with a web application and you collect data into a SQL database that you can analyze. Side-Eye is also a cloud service that your whole team is supposed to use together over time.

And then there are more technically superficial, but crucial, aspects related to specific programming language support. Side-Eye understands Go maps and such, and the Go runtime. It can do stuff like enumerate all the goroutines and give you a snapshot of all their stacks. We're also working on integrating with the Go execution traces collected my the Go scheduler, etc.

andreimatei1 · on Jan 31, 2024

This answer should go to the top.

da39a3ee · on Jan 31, 2024

Except it doesn’t give much intuition for why?

andreimatei1 · on Nov 30, 2022

Very nice work! It seems that Delve is used for inspiration and development, but Golang code is always compiled with frame pointers, I think. Is that right?

javierhonduco · on Nov 30, 2022

Thanks!

As @brancz mentioned, Delve uses DWARF unwind information to produce backtraces (they are stored in the .debug_frame section for Go).

You are right, Go enabled frame pointers for all architectures as of 1.17 [0]. This is enabled to allow profilers to work well, without having to use to techniques such as the one we describe in our post.

When it gets funny is that there's `gopclntab`, a 3rd option in Go to unwind stacks, used by `panic` and I believe other parts of the runtime. If you are interested in more details, Felix Geisendörfer's repo contains way more details [1]

[0]: https://go.dev/doc/go1.17

[1]: https://github.com/DataDog/go-profiler-notes/blob/main/stack...

brancz · on Nov 30, 2022

That's right, but Delve also needs to use DWARF to walk stacks that involve CGO, where it can't rely on frame pointers being present.

brancz · on Nov 30, 2022

Quick correction, actually Delve uses DWARF unwind information for every code location, including those from Go, even if frame pointers are present.

andreimatei1 · on July 19, 2022

Global Tables let database clients in any region read strongly consistent data with region-local latencies. They’re an important piece of the multi-region puzzle — providing latency characteristics that are well suited for read-mostly, non-localized data.

brickbrd · on July 19, 2022

Instead of doing all this complicated thing, how about simply following a Raft-like consensus protocol with the minor modification that the leader won't include a write op its read processing until that write op has been applied to the log of all the replicas, not just the quorum. When the heartbeat response from replicas indicates to the leader that this write op has been applied everywhere, it can advance its internal marker to include this write op in the read operations.

This simple scheme allows all members including replicas to serve read-after-write consistency and penalizes the write op that happened. That write op wont be acknowledged to the caller until it has been applied everywhere.

There are no fault tolerance issues here btw. If any replica fails, as long as quorum was reached, the repair procedure will ensure that write will be eventually applied to all replicas. If the quorum itself could not been reached then the write is lost anyways and is no different than the typical case of reading just from the leader.

nvanbenschoten · on July 19, 2022

I don't think this scheme provides the "monotonic reads" property discussed in the blog post. Specifically, it would be possible for a reader to observe a new value from r2 (who received a timely heartbeat), then to later observe an older value from r3 (who received a delayed heartbeat). This would be a violation of linearizability, which mandates that operations appear to take place atomically, regardless of which replica is consulted behind the scenes. This is important because linearizability is compositional, so users of CockroachDB and internal systems within CockroachDB can both use global tables as a building block without needing to design around subtle race conditions.

However, for the sake of discussion, this is an interesting point on the design spectrum! A scheme that provides read-your-writes but not monotonic reads is essentially what you would get if you took global tables as described in this blog post, but then never had read-only transactions commit-wait. It's a trade-off we considered early in the design of this work and one that we may consider exposing in the future for select use cases. Here's the relevant section of the original design proposal, if you're interested: https://github.com/cockroachdb/cockroach/blob/master/docs/RF....

brickbrd · on July 21, 2022

Thanks. Yes this explanation is something I can agree with. It does not provide monotonic reads.

remram · on July 20, 2022

> until that write op has been applied to the log of all the replicas, not just the quorum

That removes all the fault tolerance. What do you do if you never get the acknowledgement from all replicas?

brickbrd · on July 21, 2022

That question doesn’t make much sense. If you have quorum then eventually repairs will kick in and will get replicated everywhere.

So it can tolerate up to N/2 failures just like other consensus system. Because this is basically Raft.

andreimatei1 · on Oct 7, 2021

Cockroach lets you hash-shard your keys n-ways if you choose to through a feature called hash-sharded indexes (https://www.cockroachlabs.com/docs/stable/hash-sharded-index...) That's a trade-off: costlier range scans (a range scan turns into n scans) for avoiding a throughput-limiting hot spot.

andreimatei1 · on July 17, 2021

> its lack of ACID guarantees

Our transactions implementation is our crown jewel. You might want to check your sources.

andreimatei1 · on Jan 26, 2021

I think the "default" will evolve with whatever offers the best "serverless" experience in the public clouds. In particular, the cheapest and most granularly-billed option.

andreimatei1 · on Jan 26, 2021

> I agree that right now it doesn't make sense.

This CRDB engineer respectfully disagrees. This thread takes it as a given that a non-distributed DB is better if you don't need to scale up (i.e. if you run a single "node"). Let me offer a couterpoint: it's easier to embed CRDB into some software you're distributing than it is to embed Postgres. This is to say, we do try to compete at every scale (well, perhaps not at SQLite scale). CRDB doesn't have all the SQL features of PG, but it does have its pluses: CRDB does online schema changes, it is generally simpler to run, comes with a UI, comes with more observability, can backup to the cloud, can be more easily embedded into tests.

Online schema changes are a big deal; the other thing that I hope will help us win small-scale hearts and minds is the ever-improving observability story. I hope CRDB will develop world-class capabilities here. Other open-source databases traditionally have not had many capabilities out of the box.

csharptwdec19 · on Jan 28, 2021

For RDBMS, Firebird SQL has managed to survive well in this sphere, for all of it's warts.

You can run it either as an embedded DLL or as a standalone server, and changing between the two is often just a connection string change; It seems to be fairly popular in the POS space for this reason.

andreimatei1 · on March 13, 2020

Working on it.

andreimatei1 · on Feb 7, 2020

Would you mind providing a link to the empty arrays patch? Is this something CRDB should fix?

latch · on Feb 7, 2020

It was fixed in CRDB also: https://github.com/cockroachdb/cockroach/issues/42942