> We present a calculational approach to the design of type checkers, showing how they can be derived from behavioural specifications using equational reasoning. We focus on languages whose semantics can be expressed as a fold, and show how the calculations can be simplified using fold fusion. This approach enables the compositional derivation of correct-by-construction type checkers based on solving and composing fusion preconditions. We introduce our approach using a simple expression language, to which we then add support for exception handling and checked exceptions.
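To make the setting concrete (this is my own minimal sketch, not code from the paper): the kind of language the abstract describes has its semantics written as a fold over the expression syntax, and the type checker one hopes to derive by fold fusion is itself a compositional fold. All names and details below are illustrative assumptions.

```haskell
-- A minimal sketch (my own, not from the paper) of the setting: a small
-- expression language whose semantics is a fold, next to the compositional
-- type checker one would aim to derive from it by fold fusion.
data Expr = Val Int | BoolLit Bool | Add Expr Expr | If Expr Expr Expr

data Value = VInt Int | VBool Bool deriving (Eq, Show)
data Ty    = TInt | TBool          deriving (Eq, Show)

-- The generic fold over Expr.
foldExpr :: (Int -> r) -> (Bool -> r) -> (r -> r -> r) -> (r -> r -> r -> r)
         -> Expr -> r
foldExpr val bool add iff = go
  where
    go (Val n)     = val n
    go (BoolLit b) = bool b
    go (Add a b)   = add (go a) (go b)
    go (If c t e)  = iff (go c) (go t) (go e)

-- Semantics as a fold; ill-typed programs simply have no value.
eval :: Expr -> Maybe Value
eval = foldExpr (Just . VInt) (Just . VBool) addV ifV
  where
    addV a b = do
      VInt x <- a
      VInt y <- b
      Just (VInt (x + y))
    ifV c t e = do
      VBool b <- c
      if b then t else e

-- The compositional checker: also a fold, intended to agree with eval in the
-- sense that a well-typed expression evaluates to a value of that type.
check :: Expr -> Maybe Ty
check = foldExpr (const (Just TInt)) (const (Just TBool)) addT ifT
  where
    addT a b = do
      TInt <- a
      TInt <- b
      Just TInt
    ifT c t e = do
      TBool <- c
      tt <- t
      te <- e
      if tt == te then Just tt else Nothing
```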
> This week I gave a lecture series at the School on Logical Frameworks and Proof Systems Interoperability. I spoke about programming language techniques for proof assistants. The lecture slides and the reference implementations of a minimalist type theory are available at:
As far as compiling continuations goes, sequent calculus (specifically as a compiler IR) is an interesting research direction. See Grokking the Sequent Calculus (Functional Pearl) from ICFP 2024: https://dl.acm.org/doi/abs/10.1145/3674639
"This becomes clearer with an example: When we want to evaluate the expression (2 + 3) ∗ 5,
we first have to focus on the subexpression 2 + 3 and evaluate it to its result 5. The remainder
of the program, which will run after we have finished the evaluation, can be represented with
the evaluation context □ ∗ 5. We cannot bind an evaluation context like □ ∗ 5 to a variable in
the lambda calculus, but in the λμ~μ-calculus we can bind such evaluation contexts to covariables.
Furthermore, the μ-operator gives direct access to the evaluation context in which the expression
is currently evaluated.
Having such direct access to the evaluation context is not always necessary for a programmer who wants to write an application, but it is often important for compiler implementors who write optimizations to make programs run faster. One solution that compiler writers use to represent evaluation contexts in the lambda calculus is called continuation-passing style. In continuation-passing style, an evaluation context like □ ∗ 5 is represented as a function λ x . x ∗ 5. This solution works, but the resulting types which are used to type a program in this style are arguably hard to understand. Being able to easily inspect these types can be very valuable, especially for intermediate representations, where terms tend to look complex. The promise of the λμμ̃-calculus is to provide the expressive power of programs in continuation-passing style without having to deal with the type-acrobatics that are usually associated with it."
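As a small illustration of that last point (my own sketch, not the paper's code): in continuation-passing style the context □ ∗ 5 really does become the ordinary function λx. x ∗ 5, and its type is just an opaque arrow.

```haskell
-- A minimal sketch, not from the paper: evaluating (2 + 3) * 5 directly
-- versus in continuation-passing style, where the evaluation context
-- [] * 5 is represented by the ordinary function \x -> x * 5.
evalDirect :: Int
evalDirect = (2 + 3) * 5

-- In CPS, every operation takes "the rest of the program" as an argument.
addK :: Int -> Int -> (Int -> r) -> r
addK x y k = k (x + y)

mulK :: Int -> Int -> (Int -> r) -> r
mulK x y k = k (x * y)

-- The continuation \v -> mulK v 5 k plays the role of the context [] * 5;
-- its type (Int -> r) says nothing about the multiplication it hides.
evalCPS :: (Int -> r) -> r
evalCPS k = addK 2 3 (\v -> mulK v 5 k)

-- ghci> evalCPS id
-- 25
```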
"In some sense the Sequent Calculus (SC) is quite similar to a CPS based IR. To sum up the benefits of SC over CPS in two concepts, I would say: Symmetry and non-opaque continuations.
Symmetry: A lot of pairs of dual concepts are modelled very naturally and symmetrically in SC: call-by-value and call-by-name, data types and codata types, exception based error handling and Result-type error handling, for example.
Non-Opaque continuations: Instead of continuations in CPS which are just a special kind of function whose internals cannot be inspected (easily), SC introduces consumers whose internal structure can be inspected easily, for example when writing optimization passes."
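To illustrate what an inspectable consumer buys you (this is my own toy encoding, not the representation used in the papers): once the context is a first-order data structure rather than a function, an optimisation pass can pattern-match on it directly.

```haskell
-- A toy sketch of "non-opaque continuations" (my own encoding, not the
-- papers' IR): consumers are data, so passes can pattern-match on them.
data Producer
  = Lit Int
  | Plus Producer Producer
  deriving Show

data Consumer
  = MulThen Int Consumer   -- the context [] * k, followed by more consumer
  | Ret                    -- top-level covariable: just hand back the value
  deriving Show

-- Run a producer against the consumer waiting for its value.
run :: Producer -> Consumer -> Int
run p c = feed (eval p) c
  where
    eval (Lit n)    = n
    eval (Plus a b) = eval a + eval b
    feed v (MulThen k rest) = feed (v * k) rest
    feed v Ret              = v

-- Because consumers have visible structure, a rewrite such as
-- "multiplying by 1 is the identity" is a one-line pattern match,
-- which is awkward to even state for an opaque (Int -> r) continuation.
simplify :: Consumer -> Consumer
simplify (MulThen 1 rest) = simplify rest
simplify (MulThen k rest) = MulThen k (simplify rest)
simplify Ret              = Ret

-- ghci> run (Plus (Lit 2) (Lit 3)) (MulThen 5 Ret)
-- 25
```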
There's an OOPSLA paper from this year (referred to from the link starting that thread), with two of the same authors, that goes into more detail about using it as a compiler IR: https://dl.acm.org/doi/10.1145/3720507
> Programs written in C/C++ often include inline assembly: a snippet of architecture-specific assembly code used to access low-level functionalities that are impossible or expensive to simulate in the source language. Although inline assembly is widely used, its semantics has not yet been formally studied.
> In this paper, we overcome this deficiency by investigating the effect of inline assembly on the consistency semantics of C/C++ programs. We propose the first memory model of the C++ Programming Language with support for inline assembly for Intel’s x86 including non-temporal stores and store fences. We argue that previous provably correct compiler optimizations and correct compiler mappings should remain correct under such an extended model and we prove that this requirement is met by our proposed model.
"This pearl grew out of the authors’ frustration with textbook presentations of binary search. Given that binary search is one of the standard algorithms every programmer and computer science student should know, the subject is inadequately covered at best. Many textbooks mention binary search only in passing or they treat only special cases, such as computing the square root or searching an ordered table. A negative mindset prevails: the search possibly fails; in the divide&conquer step one half of the search space is excluded from consideration because searching this half will definitely fail. The correctness argument requires that the data is ordered, suggesting that monotonicity in some sense drives the search. One notable exception is Anne Kaldewaij’s textbook (Reference Kaldewaij1990): when discussing “function inversion” (given n , find an argument i such that fi⪯n≺f(1+i) ) he emphasises repeatedly that the correctness of the search does not require that f is an ascending function.
The gist of this pearl is to approach search problems with a positive mind-set: the search always succeeds; in the divide&conquer step the search continues with the half that guarantees success. The correctness argument relies on a suitable functional invariant, not on monotonicity. The “Beat Your Neighbours!” problem, a concrete make-up for local maximum search, shows that the extra generality is actually needed."
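A concrete rendering of that idea (my own sketch based on the abstract, not the pearl's code): start from bounds where success is guaranteed, and at each step keep the half in which the invariant still guarantees success; monotonicity of f never enters the argument.

```haskell
-- A minimal sketch of the "positive mindset" search (my own code, following
-- the abstract): given lo < hi with f lo <= n < f hi, the invariant
-- guarantees an index i with f i <= n < f (i + 1), and each step keeps the
-- half in which that invariant still holds.  f need not be monotonic.
invert :: (Int -> Int) -> Int -> Int -> Int -> Int
invert f n lo hi
  | lo + 1 == hi = lo                    -- invariant now reads f lo <= n < f (lo + 1)
  | f mid <= n   = invert f n mid hi     -- success guaranteed in [mid, hi]
  | otherwise    = invert f n lo mid     -- success guaranteed in [lo, mid]
  where
    mid = (lo + hi) `div` 2

-- e.g. integer square root: for n >= 0 we have 0*0 <= n < (n+1)*(n+1), so
-- ghci> invert (\i -> i * i) 5 0 6
-- 2
```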
It's very brief (so if you only have time to read one thing, you won't waste a single minute: it's to the point) and, perhaps surprisingly given its publication date, highly relevant: in fact, right after reading Section IV.C (AMD K5) you should be able to jump straight to the microarchitectural diagrams of, say, AMD Zen 2 and understand them:
http://web.archive.org/web/20231213180041/https://en.wikichi...
...and any other contemporary CPU.
The Smith & Sohi paper is going to give you great intuition for what's essentially implemented by all modern CPUs: restricted dataflow (RDF) architecture.
For context, all superscalar dynamically scheduled (out-of-order) CPUs implement RDF, introduced by Yale Patt's research group's work on High Performance Substrate (HPS). This work pretty much defined the modern concept of a restricted dataflow CPU: breaking complex instructions into smaller micro-operations and dynamically scheduling these to execute out-of-order (or, to be more specific, in a restricted dataflow order) on multiple execution units.
(If you're curious, the "restricted" part comes from the finite instruction window delimiting the operations to be scheduled, which stems from the finite size of all the physical resources in real hardware, like the ROB/reorder buffer. The "dataflow" comes from having to respect data dependencies, like `A = 1` and `B = 2` having to execute before `C = A + B`, or `MOV A, 1` and `MOV B, 2` having to execute before `ADD C, A, B`; but note that you can freely reorder the aforementioned moves among themselves as long as you execute them before the add: a schedule/execution order of `MOV B, 2; MOV A, 1; ADD C, A, B` is just as valid).
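If it helps to see the scheduling rule spelled out (a toy sketch of my own, not how any real scheduler is implemented): within the window, a micro-op may issue once everything it reads has been produced, which is exactly what lets the two MOVs above reorder freely while the ADD waits for both.

```haskell
-- A toy model (my own sketch) of dataflow-order issue inside a finite
-- window: an op may fire once every register it reads has been written.
import           Data.List (partition)
import qualified Data.Set  as Set

data MicroOp = MicroOp { dst :: String, srcs :: [String], text :: String }

-- Repeatedly issue every ready op; the result groups ops by issue step.
schedule :: Set.Set String -> [MicroOp] -> [[String]]
schedule _     []  = []
schedule ready ops
  | null fire = error "cyclic or unsatisfiable dependencies"
  | otherwise = map text fire : schedule ready' rest
  where
    (fire, rest) = partition (all (`Set.member` ready) . srcs) ops
    ready'       = foldr (Set.insert . dst) ready fire

window :: [MicroOp]
window =
  [ MicroOp "A" []         "MOV A, 1"
  , MicroOp "B" []         "MOV B, 2"
  , MicroOp "C" ["A", "B"] "ADD C, A, B"
  ]

-- ghci> schedule Set.empty window
-- [["MOV A, 1","MOV B, 2"],["ADD C, A, B"]]
```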
For the historical background, see "HPS papers: A retrospective", https://www.researchgate.net/publication/308371952_HPS_paper...
(Minor pet peeve warning:) It also sets the record straight w.r.t. RISC being orthogonal to the historical development of modern superscalar out-of-order CPUs: in particular, it's worth noting that the aforementioned micro-operations have absolutely _nothing_ to do with RISC! Another great resource on that is Fabian's video "CPU uArch: Microcode", https://www.youtube.com/watch?v=JpQ6QVgtyGE (it's also worth noting that micro-operations and microcode are _very_ different concepts; that's also very well covered by the video).
Another good, succinct description of the historical background is the 2024 IEEE CS Eckert-Mauchly Award (Wen-mei Hwu was a PhD student in the aforementioned Yale Patt's group): "Hwu was one of the original architects of the High-Performance Substrate (HPS) model that pioneered superscalar microarchitecture, introducing the concepts of dynamic scheduling, branch prediction, speculative execution, a post-decode cache, and in-order retirement." - https://www.acm.org/articles/bulletins/2024/june/eckert-mauc...
On a side note, the load-store architecture introduced by the CDC 6600 (designed by Seymour Cray in the 1960s) is sometimes mistaken for RISC (which came more than a decade later, arguably introduced in the IBM 801 designed by John Cocke in the late 1970s/early 1980s), https://en.wikipedia.org/wiki/Load%E2%80%93store_architectur...
One could say the load-store architecture does have an influence on compiler backend implementation, after a fashion (thinking of instruction scheduling, where a complex operation combining, say, a LOAD with an arithmetic operation OP is broken down in the scheduling models into separate LOAD and OP operations/effects).
These are absolutely fantastic -- I followed all of these a few years back and can vouch for their quality and for their being up to date, or at least decades ahead of the general computer architecture textbooks: the lectures and readings cover contemporary work, including ISCA and MICRO papers that have only recently found their way into production CPUs (e.g., the hashed perceptron predictor, which is one of the branch predictor units in AMD Zen 2, http://web.archive.org/web/20231213180041/https://en.wikichi...).
There are more topic-specific texts that are very good, e.g., A Primer on Memory Consistency and Cache Coherence, Second Edition; Synthesis Lectures on Computer Architecture 15:1 (2020); Vijay Nagarajan, Daniel J. Sorin, Mark D. Hill, David A. Wood; open access (freely available) at https://doi.org/10.2200/S00962ED2V01Y201910CAC049. But when you're at this point, you're likely going to be good at finding the relevant resources yourself, so I'm going to leave it to you to explore further :-)