pavelstoev's comments

Not wrong, but Markdown with English may be the most used DSL, second only to natural language itself. Volume over quality.

keyword: "...talks..."


Nokia also makes complex, carrier-grade backbone network switches based on the intellectual-property portfolio it acquired from Nortel.


That kind of stuff is the closest they would come to competing with the others cited. They're all trying to get into datacenter gear, but Cisco specifically has gotten out of various levels of service-provider network gear (they sold off all their cable network equipment, for example), which is where Nokia, Ericsson, etc. all make their bread and butter.


Cisco is still in the SP networking space, but they've been pushing heavily into datacenter and core routers generally (vs. edge routers, which are more common in SP networks).

Granted, I only worked as a lowly dev in the Cisco SP routing team, and I haven’t been keeping up to speed with their work.


Much respect for the artist 50 Cent, who converted his rap-music success into respectable business ventures (Vitamin Water, among others). So he is worth much more now!


I've vibe-coded a website about vibe coding websites. I used GPT-5 and it inserted an easter egg that was found by a human front-end dev, to my amusement. Easter eggs must be in-distribution!

(No, I am not sharing the link, as I was downvoted for it before; search for it. Hint: built with vibe.)


It was my first engineering job: calibrating those inductive loops and circuit boards on I-93, just north of downtown Boston. Here is a photo from 2006: https://postimg.cc/zbz5JQC0

PEEK controller, 56K modem, Verizon telco lines, rodents: all included in one cabinet.


Here is a video of (approximately) what it looks like.

We built this system at the UofT WIRLab back in 2018-19: https://youtu.be/lTOUBUhC0Cg

And a link to the paper: https://arxiv.org/pdf/2001.05842


When I think about serving large-scale LLM inference (like ChatGPT), I see it a lot like high-speed web serving — there are layers to it, much like in the OSI model.

1. Physical/Hardware Layer: At the very bottom is the GPU silicon and its associated high-bandwidth VRAM. The model weights are partitioned, compiled, and placed so that each GPU chip and its VRAM are (ideally) used to the fullest. This is where low-level kernel optimizations, fused operations, and memory access patterns matter, so that everything above the chip plays nicely with the hardware underneath.

2. Intra-Node Coordination Layer: Inside a single server, multiple GPUs are connected via NVLink (or an equivalent high-speed interconnect). Here you use tensor parallelism (splitting matrices across GPUs), pipeline parallelism (splitting model layers across GPUs), or expert parallelism (activating only parts of the model per request) to make the model fit and run faster. The key is minimizing cross-GPU communication latency while keeping all GPUs fully loaded; many low-level software tricks live here (a rough tensor-parallel sketch appears after this list).

3. Inter-Node Coordination Layer: When the model spans multiple servers, high-speed networking like InfiniBand comes into play. Techniques like data parallelism (replicating the model and splitting requests across replicas), hybrid parallelism (mixing tensor/pipeline/data/expert parallelism), and careful orchestration of collectives (all-reduce, all-to-all) keep throughput high while hiding model communication (slow) behind model computation (fast); see the overlap sketch after this list.

4. Request Processing Layer: Above the hardware/multi-GPU layers is the serving logic: batching incoming prompts together and molding them into ideal shapes to max out compute, offloading less urgent work to background processes, caching key/value attention states (the KV cache) to avoid recomputing past tokens, and using paged caches to handle variable-length sequences (a minimal KV-cache sketch follows the list).

5. User-Facing Serving Layer: At the top are the optimizations users see only indirectly: multi-layer caching for common or repeated queries, fast transport protocols like gRPC or WebSockets for minimal overhead, and geo-distributed load balancing to route users to the lowest-latency cluster.
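
To make layer 2 concrete, here is a minimal tensor-parallel sketch for a single linear layer, assuming two visible CUDA devices ("cuda:0" and "cuda:1") and made-up dimensions. A real serving stack shards the weights once at load time and uses NCCL collectives over NVLink instead of explicit host-side copies.

    import torch

    hidden, out_features = 4096, 8192
    x = torch.randn(8, hidden)                   # a small batch of activations

    # Column-split the weight: each GPU owns half of the output features.
    w = torch.randn(out_features, hidden)
    w0 = w[: out_features // 2].to("cuda:0")
    w1 = w[out_features // 2 :].to("cuda:1")

    # Each GPU computes its shard of the output independently.
    y0 = x.to("cuda:0") @ w0.T                   # [8, out_features // 2] on GPU 0
    y1 = x.to("cuda:1") @ w1.T                   # [8, out_features // 2] on GPU 1

    # Gather the shards; in practice this is an all-gather over NVLink.
    y = torch.cat([y0.cpu(), y1.cpu()], dim=-1)  # [8, out_features]

With the shards resident on their own devices, the two matmuls run concurrently and only the final gather crosses GPUs.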
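
For layer 3, a sketch of hiding communication behind computation with an asynchronous all-reduce. It assumes torch.distributed is already initialized (e.g. with the NCCL backend) and that partial_out, next_microbatch, and weight_shard are placeholder tensors on this rank's GPU; it illustrates the overlap idea, not any particular framework's scheduler.

    import torch
    import torch.distributed as dist

    def overlapped_step(partial_out: torch.Tensor,
                        next_microbatch: torch.Tensor,
                        weight_shard: torch.Tensor) -> torch.Tensor:
        # Start summing this rank's partial activations across all ranks
        # without blocking (async_op=True returns a work handle).
        handle = dist.all_reduce(partial_out, op=dist.ReduceOp.SUM, async_op=True)

        # While the collective is in flight, push the next micro-batch
        # through this rank's weight shard.
        next_partial = next_microbatch @ weight_shard

        # Block only at the point where the reduced activations are needed.
        handle.wait()
        return next_partial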
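
And for layer 4, a framework-free sketch of the KV-cache idea with made-up dimensions. Production servers manage the cache in fixed-size pages (vLLM-style paged attention) rather than growing tensors, but the goal is the same: never recompute attention over tokens you have already processed.

    import torch

    class KVCache:
        """Per-request key/value cache, grown one decode step at a time."""

        def __init__(self, n_layers: int, n_heads: int, head_dim: int):
            # One (keys, values) pair per layer, grown along the sequence axis.
            self.keys = [torch.empty(n_heads, 0, head_dim) for _ in range(n_layers)]
            self.values = [torch.empty(n_heads, 0, head_dim) for _ in range(n_layers)]

        def append(self, layer: int, k: torch.Tensor, v: torch.Tensor) -> None:
            # k, v: [n_heads, new_tokens, head_dim] for the tokens just computed.
            self.keys[layer] = torch.cat([self.keys[layer], k], dim=1)
            self.values[layer] = torch.cat([self.values[layer], v], dim=1)

        def get(self, layer: int):
            # Attention for the next token reads the full cached history,
            # so past tokens are never recomputed.
            return self.keys[layer], self.values[layer]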

Like the OSI model, each “layer” solves its own set of problems, but together they make the whole system scale. That’s how you get from “this model barely runs on a single high-end GPU” to “this service handles hundreds of millions of users per week with low latency.”


I vibe-coded a site about vibe 2 code projects. https://builtwithvibe.com/


The "Yo dawg, I heard..." memes are writing themselves today.


Hi, Author. Thank you very much for the clear and relatively easy-to-understand MPK overview. Could you please also comment on the similarity of your project to Hidet? https://pytorch.org/blog/introducing-hidet/

Thank you!


