Hacker News: SnowflakeOnIce's comments

> you can get 100% GPU utilization by just reading/writing to memory while doing 0 computations

Indeed! Utilization is a proxy for what you actually want (which is good use of available hardware). 100% GPU utilization doesn't actually indicate this.

On the other hand, if you aren't getting 100% GPU utilization, you aren't making good use of the hardware.


This reminds me of the Linux/Unix disk busy "%util" metric in tools like sar and iostat. People sometimes interpret 100% util as a physical ceiling on the disk's I/O capacity, just like with CPUs ("we need more disks to get disk I/O utilization down!").

It is a correct metric when your block device has a single physical spinning disk that can only accept one request at a time (dispatch queue depth=1). But the moment you deal with SSDs (capable of highly concurrent NAND IO), SAN storage block devices striped over many physical disks or even a single spinning disk that can internally queue and reorder IOs for more efficient seeking, just hitting 100%util at the host block device level doesn't mean that you've hit some IOPS ceiling.
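For reference, %util itself is simple to compute: it's the fraction of wall-clock time the device had at least one request in flight, derived from the io_ticks counter ("time spent doing I/Os", in ms) in /proc/diskstats. A minimal sketch of the arithmetic (illustrative, not iostat's actual source):

```python
def percent_util(io_ticks_ms_start, io_ticks_ms_end, interval_ms):
    """%util as iostat derives it: the share of wall-clock time the
    device had at least one request outstanding. It says nothing about
    how *many* requests were in flight, which is why an SSD or a
    striped array can be far from saturated at 100%."""
    busy_ms = io_ticks_ms_end - io_ticks_ms_start
    return min(100.0, 100.0 * busy_ms / interval_ms)

# One request in flight for the whole second, or 32 in flight for the
# whole second: both report 100%.
print(percent_util(1000, 2000, 1000))  # 100.0
print(percent_util(1000, 1500, 1000))  # 50.0
```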

So, looks like the GPU "SM efficiency" analysis is somewhat like logging in to the storage array itself and checking how busy each physical disk (or at least each disk controller) inside that storage array is.


This sounds like the good old "having high test coverage is bad because I can get to 100% just by calling functions and doing nothing, asserting nothing with them".

100% test coverage doesn't mean your tests are good, but having only 50% (or pick your number) does mean they are bad / insufficient.


That isn't even necessarily true. For interpreted languages, a test that merely runs the code asserts that the code is able to run at all (i.e. that you are not, for example, calling a string object as a function). That's not always enough to assert functionality, but it's still better than nothing.


In other words it's "necessary, but not sufficient".


Yup, similar to SM efficiency in that sense too. If you aren't seeing >80%, there is certainly time left on the table. But a high SM efficiency value doesn't guarantee you're making good use of the hardware either (still a better proxy than GPU util, though).


This is not true. Lots of algorithms simply can't use 100% of the GPU even when they're written as optimally as possible. FFT is one.


In remote sensing | computational physics applications it's rare to have a single FFT to compute (whatever algorithm is chosen).

Hence the practice of stuffing many FFTs through GPU grids in parallel and working to max out the hardware usage in order to increase application throughput.

eg:

https://arxiv.org/pdf/1707.07263

https://ieeexplore.ieee.org/document/9835388
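The batching layout can be sketched on the CPU with NumPy; GPU libraries such as cuFFT expose the same idea through batched plans (this is only an illustration of the data layout, not GPU code):

```python
import numpy as np

rng = np.random.default_rng(0)
batch, n = 1024, 1024          # many small FFTs rather than one big one
signals = rng.standard_normal((batch, n))

# One call transforms every row; a GPU batched plan does the same,
# keeping the hardware fed instead of launching 1024 tiny kernels.
spectra = np.fft.fft(signals, axis=-1)
assert spectra.shape == (batch, n)

# Round-trip check on one row.
assert np.allclose(np.fft.ifft(spectra[0]).real, signals[0])
```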


I don't mean a single FFT. I mean that FFT algorithms are inherently not going to use the GPU at 100% utilization by any metric.


Not so inherently IMO.

What I mean is: where did you take that from? I program FFTs on GPUs, and I see no reason for the "inherently can't reach 100% utilization by any metric".


I interpret that comment as you're not going to be using every silicon block that the GPU provides, like video codecs and rasterizing. If you've maxed out compute without going over the power budget, for example, you'd likely still be able to decode video if the GPU has a separate block for it.


I had a similar read .. I packed a lot of parallel FFTs and other processing into custom TI DSP cards, but the DSP family chips were RISC and carried little 'baggage' - just fat 32 bit | 64 bit floating point pipelines with instruction sets optimised for modular ring indexing of scalar | vector operations.

Even then they ran @ 80% "by design" for expected hard real time usage .. they only went to 11 and dropped results during smoke tests and with operators that redlined the limits (and got feedback to that effect).


I'd be curious to see how you can do it. Try launching an FFT of any size and batch count and see if you can hit 100%.


> On the other hand, if you aren't getting 100% GPU utilization, you aren't making good use of the hardware.

Some of us like having more than 2 hours of battery life, and not scalding our skin in the process of using our devices.


The Common Crawl only pulls documents up to a small size limit (1 MiB last I checked). Without special handling in this project, documents bigger than that would be missing.

So indeed, not representative of the whole Internet.


From the article:

>Specifically, when Common Crawl gets to a pdf, it just stores the first megabyte of information and truncates the rest.

This is where SafeDocs, or CC-MAIN-2021-31-PDF-UNTRUNCATED, enters the picture. This corpus was originally created by the DARPA SafeDocs program, which re-fetched all the different PDFs from a snapshot of Common Crawl to obtain untruncated versions of them.
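One cheap heuristic for spotting such truncated PDFs: a well-formed PDF ends with a %%EOF marker, and a hard 1 MiB cut almost never lands exactly there. A rough sketch (not how the SafeDocs pipeline actually works, just the general idea):

```python
def looks_truncated(pdf_bytes: bytes) -> bool:
    """Heuristic: a complete PDF ends with '%%EOF' (possibly followed
    by trailing whitespace). A byte-limit cut rarely ends there."""
    tail = pdf_bytes[-1024:].rstrip()
    return not tail.endswith(b"%%EOF")

complete = b"%PDF-1.7\n...body...\n%%EOF\n"
cut_off = complete[: len(complete) // 2]   # simulate a 1 MiB-style cut

print(looks_truncated(complete))  # False
print(looks_truncated(cut_off))   # True
```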


'git clone --mirror' seems to pull down lots of additional content also.


There seems to be no such thing as a "private fork" on GitHub in 2024 [1]:

> A fork is a new repository that shares code and visibility settings with the upstream repository. All forks of public repositories are public. You cannot change the visibility of a fork.

[1] https://docs.github.com/en/pull-requests/collaborating-with-...


A fork of a private repo is private. When you make the original repo public, the fork is still a private repo, but the commits can now be accessed by hash.


According to the screenshot in the documentation, though, new commits made to the fork will not be accessible by hash. So private feature branches in forks may be accessible via the upstream that was changed to public, if those branches existed at the time the upstream's visibility changed, but new feature branches made after that time won't be accessible.


OK but say a company has a private, closed source internal tool, and they want to open-source some part of it. They fork it and start working on cleaning up the history to make it publishable.

After some changes which include deleting sensitive information and proprietary code, and squashing all the history to one commit, they change the repo to public.

According to this article, any commit on either repo which was made before the 2nd repo was made public, can still be accessed on the public repo.


> After some changes which include deleting sensitive information and proprietary code, and squashing all the history to one commit, they change the repo to public.

I know this might look like a valid approach at first glance but... it is stupid to anyone who knows how git or the GitHub API works. The remote (GitHub's) reflog is not GC'd immediately: you can get commit hashes from the events history via the API, and then try to fetch those commits from the reflog.


> it is stupid to anyone who knows how git or the GitHub API works

You need to know how git works and how GitHub's API works. I would say I have a pretty good understanding of how (local) git works internally, but I was deeply surprised by GitHub's brute-forceable short commit IDs and the existence of a public log of all reflog activity [1].

When the article said "You might think you’re protected by needing to know the commit hash. You’re not. The hash is discoverable. More on that later." I was not able to deduce what would come later. Meanwhile, data access by hash seemed like a non-issue to me: how would you compute the hash without having the data in the first place? Checking that a certain file exists in a private branch might be an information disclosure, but is not usually problematic.

And in any case, GitHub has grown so far away from its roots as a simple git hoster that implicit expectations change as well. If I self-host my git repository, my mental model is very close to git internals. If I use GitHub's web interface to click myself a repository with complex access rights, I assume they have concepts in place to thoroughly enforce these access rights. I mean, GitHub organizations are not a git concept.

[1] https://www.gharchive.org/
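To put numbers on the brute-forceable part: a short commit ID is just a hex prefix of the full SHA-1, and GitHub reportedly resolves prefixes as short as 4 characters, so the search space is tiny. A back-of-the-envelope sketch:

```python
# Each hex character carries 4 bits, so a prefix of length k has
# 16**k possible values -- trivially enumerable for small k.
for prefix_len in (4, 7, 12):
    space = 16 ** prefix_len
    print(f"{prefix_len}-char prefixes: {space:,} candidates")

# 4-char prefixes:  65,536 candidates (one HTTP request each is feasible)
# 7-char prefixes:  268,435,456 candidates
```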


> You need to know how git works and GitHub's API.

No; just knowing how git works is enough to understand that force-pushing squashed commits or removing branches on remote will not necessarily remove the actual data on remote.

GitHub's API (or just using the web UI) only makes these features more obvious. For example, you can find and check a commit referenced in PR comments even if it was force-pushed away.

> was deeply surprised about GitHub's brute-forceable short commit IDs

Short commit IDs are not a GitHub feature; they're a git feature.

> If I use GitHub's web interface to click myself a repository with complex access rights, I assume they have concepts in place to thoroughly enforce these access rights.

Have you ever tried to make a private GitHub repository public? There is a clear warning that code, logs, and activity history will become public. Maybe they should include an additional clause about forks there.


Dereferenced commits which haven't yet been garbage collected on a remote are not available to your local clones via git... I suppose there could be some obscure way to pull them from the remote if you know the hash (though I'm not actually sure), but either way (via the web interface or CLI) you'd have to know the hash.

And it's completely reasonable to assume no one external to the org when it was private would have those hashes.

It sounds like GitHub's antipattern here is retaining a log of all events, which may leak these hashes, and that's really not something I'd expect a git user to anticipate.


> Short commit IDs are not GitHub feature, they are git feature.

They're a local feature, sure. But locally you already have the list of commits: just open the .git directory.

Can you connect to a vanilla git server and enumerate every single hash?

> Maybe they should include additional clause about forks there.

It would help but they need much more than a clause about forks.

Ideally they would purge that extra data when making something public.


> Can you connect to a vanilla git server and enumerate every single hash?

If you have ssh access, yes, but I don't think you can do this with just git (and of course GitHub doesn't provide ssh access to the git repo servers).

The public distribution of commit hashes via their event log seems really irresponsible on github's part to me.


Yes. Even though I expect there are people who do exactly what the GP describes, if you know git it has severe "do not do that!" vibes.

Do not squash your commits and make the repository public. Instead, make a new repository and add the code there.


Why not just create a new public repo and copy all of the source code that you want to it?


Because they haven't read the article and this HN discussion?

"Why not just...". Once you already know something it can seem obvious.


What?


ChatGPT: given the following repo, create a plausible, perfect commit history for this repository.


Funnily enough, the docs are wrong: the GitHub CLI allows changing a fork's visibility. https://stackoverflow.com/a/78094654/12846952


Am I the only one who finds this conceptually confusing?


Nope, me too. The whole repo network thing is not user-facing at all. It is an internal thing at GitHub to allow easier pull requests between repos. But it isn't a concept git knows, and it doesn't affect GitHub users at all, except for this one weird thing.


I may be recalling incorrectly but I seem to remember it having some storage deduplication benefits on the backend.


> Nope, me too. The whole repo network thing is not user-facing at all.

There are some user-facing parts: You can find the fork network and some related bits under repo insights. (The UX is not great.)

https://github.com/apache/airflow/forks?include=active&page=...


Not through the GitHub interface, no. But you can copy all files in a repository and create a new repository. IIRC there's a way to retain the history via this process as well.


That's beside the point. The article is specifically about "GitHub forks" and their shortcomings. It's unrelated to pushing to distinct repositories not magically "linked" by the GH "fork" feature.


You can create a private repository on GitHub, clone it locally, add the repo being "forked" from as a separate git remote (I usually call this one "upstream" and my "fork", well, "fork"), fetch and pull from upstream, then push to fork.


All you should have to do is just clone the repo locally and then create a blank GitHub repository, set it as the/a remote and push to it.


That's not the GitHub concept / almost trademark of "fork" anymore though, which is what your parent was talking about


I mean, it's git: just git init, git remote add for origin and upstream (origin pointing to your private repo), git fetch upstream, git push to origin.


If your ranges end up sparsely distributed, using roaring bitmaps can speed things up a lot.

https://roaringbitmap.org/
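For a sense of why this helps: roaring's run containers store clustered bits as (start, length) runs rather than raw bitmaps. A toy sketch of that encoding idea (illustrative only; in Python the real library would be something like pyroaring):

```python
def to_runs(sorted_ints):
    """Collapse a sorted sequence of ints into (start, length) runs --
    the same idea behind roaring's run containers."""
    runs = []
    for x in sorted_ints:
        if runs and x == runs[-1][0] + runs[-1][1]:
            runs[-1] = (runs[-1][0], runs[-1][1] + 1)   # extend current run
        else:
            runs.append((x, 1))                         # start a new run
    return runs

# Two dense clusters compress to just two runs, however long they are.
ids = list(range(100, 105)) + list(range(1_000, 1_003))
print(to_runs(ids))  # [(100, 5), (1000, 3)]
```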


The classical synthesis approaches seem to have much more emphasis on correctness and specification than modern LLM-based synthesis though — things like deriving provably correct lock-free data structures.

The LLM-based synthesis work I've seen, in contrast, maybe uses a set of unit tests for correctness testing.

It doesn't feel like modern LLM-based synthesis supplants the classical approaches.


If you think about classical synthesis as fancy tricks around optimizing SAT encodings & search, neural synthesis becomes about how to solve np-hard problems there instantly. There are other perspectives as well, but that's already a dual LLM paper for a bunch of individual classical papers, without losing safety. Ex: leapfrog the concolic execution papers via a fast oracle.

Likewise, a lot of cool work is about automating the proof theory much better. Terence Tao has been going down that rabbit hole, which is amazing, and I hope the security community does too. I know Dawn Song, one of the big names here, is.

Another big area is making programming more accessible by assuming synthesizers, and the language design decisions around that. A lot of that work has been shifting because of LLMs too. Ex: historically, PLs had zero domain understanding, just logic, and that has really flipped since GPT-4, so the concept of a DSL is now also flipping.

Finally.. a lot of neural synthesis can and does entirely ignore the SAT aspect. There is a lot of depth to it on its own, and amazing results.


> about how to solve np-hard problems there instantly. ... without losing safety. Ex: leapfrog the concolic execution papers via a fast oracle

Just want to point out to unknowing readers: none of this is true and the author of these comments is "talking his book".

Source: do I really need to explain how LLMs do not solve NP-hard problems "instantly"?


Most of the papers in the syllabus are about the SAT/SMT-based solver era of program synthesis, and neural synthesis is popularly - with open benchmarks - succeeding on problems those failed to gain traction on.

The research question is shifting to how to either bridge the worlds (better together) or leave the old one. That doesn't make SMT era papers wrong, but as with Bayesian vs NN, relatively impractical or otherwise irrelevant for most people until a major breakthrough happens (if ever). Ex: Agentic loops are interesting from a direct reuse of classical synthesis results perspective, and I know big labs are experimenting with different methods.

I'm transparent that I was part of the old world, and while synthesis was a design hope for our startup (I was publicly demoing SOTA here 10 years ago!), it was never at a quality where we could invest in commercializing that school of methods. When GPT-4 hit, we could finally add synthesis-based methods that ship. I get the impression you are part of the older SMT world and reject modern advances, and would wonder the reverse about you. It's a bit galling to have voting with my feet - and shipping - after patiently waiting 10 years treated as an invalidating thing, instead of as a proof point that maybe there is something going on.


Have a few papers on hand to start reading about these?


Edit: I'd be curious which other papers/authors folks would add who are doing interesting work here.

One starting point is to look at great synthesis, mechanized proofs, and PL folks who have been exploring neural synthesis, and the papers they write + cite:

* Arjun Guha: StarCoder LLM, LLM type inference, etc.

* Sumit Gulwani & Rishabh Singh (whose pre-LLM work is on the syllabus)

* Nikhil Swamy (F*), Sorin Lerner (Coq)

* A practical area that has a well-funded intersection because of urgency is security, such as niches like smart contracts, and broader bug finding & repair. Eg, simpler applied methods by Trail of Bits, and any academics they cite. Same thing for DARPA challenge participants: they likely have relevant non-DARPA papers.

* I've been curious about the assisted proof work community Terence Tao has fallen into here as well

* Edit: There are a bunch of Devin-like teams (ex: a Princeton team markets a lot, ...), and while interesting empirically, they generally are not as principled, which is the topic here, so I'm not listing them. A lot of OOPSLA, MSR, etc. papers are doing incremental work here afaict, so there is work to sort out

An important distinction for me is whether it is an agentic system (which is much of it and can more easily use heavier & classic methods), training an existing architecture (often a stepping stone to more interesting work), or making a novel architecture (via traditional NLP+NN tricks vs synthesis-informed ones). Most practical results today have been clever agentic, better training sets, and basic NLP-flavored arch tweaks. I suspect that is because those are easiest and early days/years, and thus look more for diversity of explorations right now vs deep in any one track.


It predates LLVM. That's one reason, anyway.


Yes and no. Latency across a stage is one reason why orchestras have conductors. An orchestra split across a stage can have enough latency between one side and another to cause chaos sans conductor. It takes noticeable time for sound to cross the stage.
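For a sense of scale (a rough sketch; assumes sound travels at ~343 m/s in air):

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C

def delay_ms(distance_m):
    """One-way acoustic delay in milliseconds over a given distance."""
    return 1000.0 * distance_m / SPEED_OF_SOUND

# A 15 m wide stage: ~44 ms side to side, easily enough to smear
# attacks without a conductor giving a shared visual beat.
print(round(delay_ms(15), 1))   # 43.7
print(round(delay_ms(4.5), 1))  # 13.1 (about 10-15 feet)
```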


30ms seems high! Though I might believe that.

On higher-end pianos (mostly grands), there is "double escapement" action, which allows much faster note repetition than without. I suspect the latency would be lower on such pianos.

> Musicians learn to lead the beat to account for the activation delay of their instrument

Yes, this is absolutely a thing! I play upright bass, and placement of bass tones with respect to drums can get very nuanced. Slightly ahead or on top of the beat? Slightly behind? Changing partway through the tune?

It's interesting to note also how small discrepancies in latency can interfere: an extra ten milliseconds or so beyond the usual (say, from standing 10-15 feet farther away than accustomed, or from using peripherals that introduce delay) can change the pocket of a performance.


This already happens. I have seen recruiters trying to get domain experts in various fields to write articles for AI training.


LinkedIn built a whole platform inside their platform for doing exactly this. I think you get a badge or something on your profile claiming you're an expert in something if you write a couple paragraphs on a topic using the provided prompt.

They're very clear it's going into an AI-generated article on the topic, but you'd better believe that is also now core training data.

