
Ha, after writing this post I found a Wikipedia article on this effect: https://en.wikipedia.org/wiki/Display_motion_blur


From the blog post: "more than 99% of them had no activity in the last month" https://developers.googleblog.com/en/google-url-shortener-li...

This is a classic product data decision-making fallacy. The right question is "how much total value do all of the links provide", not "what percent are used".


> The right question is "how much total value do all of the links provide", not "what percent are used".

Yes, but it doesn't bring home the sweet promotion, unfortunately. Ironically, if 99% of them don't see any traffic, you can scale back the infra, run it on 2 VMs, and make sure a single person can keep it up as a side quest, just for fun (but, of course, pay them for their work).

This beancounting really makes me sad.


Configuring a static set of redirects would take a couple of hours to set up and require literally zero maintenance forever.

Amazon should volunteer a free-tier EC2 instance to help Google in their time of economic struggles.


This is what I mean, actually.

If they’re so inclined, Oracle has an always free tier with ample resources. They can use that one, too.


If they wanted the sweet promotion they could add an interstitial. Yes, people would complain, but at least the old links would not stop working.


> just for fun (but, of course, pay them for their work).

Doing things for fun isn't in Google's remit.


Then they shouldn't have offered it as a free service in the first place. It's like that discussion about how Google, in all its 2-ton ADHD gorilla glory, will enter an industry, offer a (near) free service or product, decimate all competition, then decide it's not worth it and shut down, leaving behind a desolate crater of ruined businesses and angry, abandoned users.


I’m still sore about reader. Gap has never been filled for me.


Alas, it was, once upon a time.


It used to be. AdSense came from 20% time!


Indeed. I've probably looked at less than 1% of my family photos this month but I still want to keep them.


I bet 99% of URLs that exist on the public web had no activity last month. Might as well delete the entire WWW because it's obviously worthless.


Where'd all my porn go!?


From Google's perspective, the question is "How many ads are we selling on these links" and if it's near zero, that's the value to them.


Don't be confused! That's not how they made the decision; it's how they're selling it.


So how did they decide?


A new person got hired after the old person left. The new person says "we can save x% by shutting down these links; 99% aren't used" and the new boss, who's only been there for 6 months, says "yeah, sure".

Why does Google kill any project? The people who made it moved on, and the new people don't care because it doesn't make their resumes look any better.

Basically, nobody wants to own this service, and it requires upkeep to keep it running alongside other Google services.

Google's history shows a clear choice to reward new projects, not old ones.

https://killedbygoogle.com/


I expect it showed up as a cost on a budget sheet, and then an analysis was done about the impact of shutting it down.


You can't get promoted at Google for not changing anything.


They launched Firebase Dynamic Links and someone didn't like the overlap.


> "more than 99% of them had no activity in the last month"

Better to have a short URL and not need it, than need a short URL and not have it IMO.


What fraction of indexed Google sites, Youtube videos, or Google Photos were retrieved in the last month? Think of the cost savings!


YouTube already does this, to some extent, by slowly reducing the quality of your videos if they're not accessed frequently enough.

Many videos I uploaded in 4K are now only available in 480p, after about a decade.


I don’t think they’re actually that dumb. I think the dirty secret behind “data driven decision making” is managers don’t want data to tell them what to do, they want “data” to make even the idea of disagreeing with them look objectively wrong and stupid.


It's a bit like the difference between "rule of law" and "rule by law" (aka legalism).

It's less "data-driven decisions", more "how to lie with statistics".


"Data-driven decision making"


There may be tricks that I don't know about. One quick experimental answer I can give: if I change to looping over the sums and rerun Benchmark 3, my time in the aten::sum CUDA kernel increases from 0.779s (before) to 0.840s (after). So CUDA doesn't seem to automagically handle this.

I will note that these grouped operations occasionally cause a net loss in performance compared to "naive" looping, since it involves calling PyTorch's "x.view(...)" which is usually ~instant but sometimes adds some extra CUDA operations on the backward pass. It always reduces the time spent in aten::add, but adds these extra ops. A really smart vectorizer would use heuristics to decide how/whether to group operations according to the target hardware; my current vectorizer just does the grouping every time.
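As a purely illustrative sketch (pure-Python stand-ins, not Vexpr's or PyTorch's actual code), the grouping transformation above replaces a loop of per-group reductions with one reduction over a stacked batch; in PyTorch the stacked version becomes a single batched aten::sum launch instead of N separate ones:

```python
# Conceptual sketch of "grouping" equal-length reductions. The names
# and shapes here are hypothetical; they only show that the transform
# preserves results while trading N small calls for one batched pass.

def naive_sums(groups):
    # one reduction call per group, like a Python loop of torch.sum
    return [sum(g) for g in groups]

def grouped_sums(groups):
    # "stack" equal-length groups and reduce along the inner axis at once,
    # analogous to x.view(n, k).sum(dim=1) on a concatenated tensor
    out = [0] * len(groups)
    for col in zip(*groups):  # transpose: iterate columns of the stack
        out = [acc + x for acc, x in zip(out, col)]
    return out

groups = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
assert naive_sums(groups) == grouped_sums(groups) == [6, 15, 24]
```

In the real tensor setting the stacking itself (view/cat) is the overhead the comment above describes, which is why the grouped form can occasionally lose to the naive loop.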


Yeah, one unspoken theme of this blog post is "look how nice torch.compile is" :)

Fun fact, I had to put in extra work to get torch.compile working with my code, for understandable reasons. My library, Vexpr, literally runs an interpreter inside of Python, reading a big tree-like namedtuple-of-namedtuples "expression" data structure and evaluating it recursively. That data structure was way too fancy for torch.compile's guards, so I actually wrote code [1] that converts a Vexpr expression into a big Python code string and evals it, factoring the interpreter out of the code, then I pass that eval'd string into torch.compile.
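For illustration only (hypothetical mini-expression type and names, not Vexpr's actual API), the trick of factoring the interpreter out into a generated code string might look like:

```python
from collections import namedtuple

# Hypothetical tree node; Vexpr's real namedtuple-of-namedtuples
# structures are much richer than this.
Expr = namedtuple("Expr", ["op", "args"])

def to_source(e):
    """Recursively render an expression tree as a Python source string."""
    if not isinstance(e, Expr):
        return repr(e)
    if e.op == "var":
        return e.args[0]
    args = ", ".join(to_source(a) for a in e.args)
    return f"{e.op}({args})"

def compile_expr(e, arg_names, env):
    """Eval a generated lambda so a tracer like torch.compile sees plain
    Python code instead of the tree-walking interpreter."""
    src = f"lambda {', '.join(arg_names)}: {to_source(e)}"
    return eval(src, env)

# usage: compile max(x, y) + 1 against a small primitive environment
expr = Expr("add", (Expr("max", (Expr("var", ("x",)), Expr("var", ("y",)))), 1))
f = compile_expr(expr, ["x", "y"], {"add": lambda a, b: a + b, "max": max})
assert f(2, 5) == 6
```

The generated string contains only ordinary calls, so guard machinery never has to reason about the fancy expression data structure itself.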

One torch.compile capability I would be excited to see is compatibility with torch.vmap. One selling point of Vexpr is that you can use vmap with it, so I was sad when I found I couldn't use vmap and still support torch.compile. This made me convert a bunch of my GP kernels [2] to be batch-aware. (This missing capability is also understandable -- both vmap and compile are new.)

Anyway, I'm a fan of what y'all are doing!

[1] https://github.com/outergroup/vexpr/blob/e732e034768443386f9... [2] https://github.com/outergroup/outer-loop-cookbook/blob/5d94c...


I spend a lot of sweat on the guards - I am very interested in how it failed! Can you say more? Did guard creation fail? Or did guard check_fn perf overhead destroy it?

> One torch.compile capability I would be excited to see is compatibility with torch.vmap

We added support for torch.func.vmap, iirc - check out test_higher_order_ops.py, grep for vmap.


Glad to hear :)

Yes, I'm off doing my own thing now. Deep Learning went so much further than I ever expected, and now I'm drawn to all the things that can be built today. Who knows, maybe I'll swing back into neuroscience in a few years. (Still friends with my old coworkers / bosses.)


I wondered about this same thing. Your logic about cache/registers is certainly true on CPUs, but what about GPUs? Hence this blurb:

> I studied the CUDA traces closely and found that vectorization does indeed reduce many aspects of the GPU workload, greatly reducing the number of operations and decreasing the total amount of time spent on the fundamental computations of the algorithm. However it also introduces overhead (mentioned above) by interspersing operations that permute and reorder the tensors, or splitting them into groups then concatenating results. Sometimes the reduced “fundamental” time outweighs the additional overhead, while other times the overhead outweighs the reduction in fundamental time.

Here are some examples not included in the blog post:

- Total time spent in aten::cdist kernel

  - Baseline: 2.834s (4900 calls)
  - Vectorized: 2.686s (500 calls)
- Total time spent in aten::mul kernel

  - Baseline: 5.745s (80700 calls)
  - Vectorized: 5.555s (8100 calls)
This nice little win applies to tons of other kernels, almost across the board. As you point out, CPU intuition suggests this should have been slower, so this was an interesting outcome.

On the other hand, some specific increases occur:

- Total time spent in aten::cat kernel

  - Baseline: 0.680s
  - Vectorized: 1.849s
So working in fewer, larger batches doesn't only enable outrunning the GPU. It decreases the total GPU workload... then adds some overhead. But some of this overhead could be removed with custom CUDA kernels, so I think this is an interesting direction even if you solve the CPU problem some other way.

(The pow(x, 2) is only there in the toy code, not my actual kernel, so I didn't performance-tune it.)


Aha, I was hoping to learn about something like this, thanks for sharing. I'll try this some time. PyTorch does use different threads for the forward and backward pass, so as you suggest, setting that flag might only improve the forward pass.


The CUDA Runtime and Driver APIs have per-thread state, so using threads would unfortunately bypass our trick here to set the flag. Assuming you're on Linux, I might suggest creating a shared library to intercept calls to the Driver API, as all Runtime functions are implemented as wrappers around Driver functions. You'd have to intercept all calls to context creation and flag setting:

  * `cuCtxCreate`

  * `cuCtxCreate_v3`

  * `cuCtxSetFlags`

  * `cuDevicePrimaryCtxRetain`

  * `cuDevicePrimaryCtxSetFlags`
... and make sure that the three least significant bits of any `flags` variable are set to `CU_CTX_SCHED_BLOCKING_SYNC`.
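Concretely, the fix-up applied inside each intercepted call is just a bit mask. A small sketch of that logic (constant values as defined in cuda.h; shown in Python for clarity, though the interposer itself would be C):

```python
# From cuda.h: the three least significant bits of the context flags
# select the scheduling policy (the CU_CTX_SCHED_* values).
CU_CTX_SCHED_BLOCKING_SYNC = 0x04
CU_CTX_SCHED_MASK = 0x07

def force_blocking_sync(flags: int) -> int:
    """Clear the scheduling bits, then set blocking sync,
    leaving any other flag bits untouched."""
    return (flags & ~CU_CTX_SCHED_MASK) | CU_CTX_SCHED_BLOCKING_SYNC

assert force_blocking_sync(0x00) == 0x04          # AUTO -> blocking sync
assert force_blocking_sync(0x01) == 0x04          # SPIN -> blocking sync
assert force_blocking_sync(0x10 | 0x02) == 0x14   # non-sched bits preserved
```

The C interposer would apply this same masking to the `flags` argument before forwarding each call to the real Driver API function resolved via `dlsym(RTLD_NEXT, ...)`.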

cuDevicePrimaryCtxSetFlags: https://docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__PR...

dlsym(3): https://man.archlinux.org/man/dlsym.3.en

ld.so(8): https://man.archlinux.org/man/ld.so.8.en


One of my favorite photos from space is a photo of Alan Bean: https://upload.wikimedia.org/wikipedia/commons/9/97/Apollo12...


This is my iPhone case. It's an amazing photo by Pete Conrad.

https://society6.com/product/apollo-12-face-of-an-astronaut_...


One nitpick:

> That some of the most heavily used and reliable software in the world is built on C is proof that the flaws are overblown, and easy to detect and fix.

Easy to detect? Multiple times per year we find out about a security bug in Linux or Windows that has existed since the 90s.


The full index: http://history.nasa.gov/computers/contents.html

(it's just the "Index Page" button from the bottom)

