Hacker News | coolness's comments

This used to be the case: research was conducted mostly at academic institutions that did not provide degrees [1]. The "research university" is a relatively new thing.

[1] https://asteriskmag.com/issues/10/the-origin-of-the-research...


Interesting read. I always wondered where the idea of the "thesis" and other extracurricular activities came from, for both students and professors.

Nowadays, promotion of professors through the ranks (Assistant, Associate, Full Professor) depends solely on the number of papers they publish in Q1 journals. But the research may be entirely bogus: the same ideas repurposed hundreds of times by different professors.

The entire concept of "systematic knowledge" has gone downhill.


Even more important than the papers is whether you can raise the money required to fund the lab that produces your prestigious journal papers. And the further down the league table you go, the less important the "prestigious" part gets.


Slight tangent: I was wondering why DeepSeek would develop something like this. The linked paper says

> In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G).

That... doesn't sound legal


HathiTrust (https://en.wikipedia.org/wiki/HathiTrust) has 6.7 million volumes in the public domain, in PDF from what I understand. If we figure ~200 pages per volume, that's about 1.3 billion pages, or roughly 6,700 days of work for an A100-40G at 200k pages a day. That is one way to interpret what they say as being legal. I don't have any information on what happens at DeepSeek, so I can't say whether it's true.
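Checking the arithmetic without any intermediate rounding (all inputs are the figures quoted above; the pages-per-volume count is the commenter's rough estimate):

```python
# Back-of-the-envelope check of the HathiTrust numbers.
volumes = 6_700_000          # public-domain volumes
pages_per_volume = 200       # rough estimate
pages_per_day = 200_000      # DeepSeek-OCR throughput on one A100-40G

total_pages = volumes * pages_per_volume   # 1,340,000,000 pages
days = total_pages / pages_per_day         # 6,700 days, i.e. ~18 years on one GPU

print(total_pages, days)
```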


Great post. Super sad state of affairs, but we move on and learn new things. Programming was always a tool, and now the tool has changed from something that required skill and understanding to complaining to a neural net. We just have to focus more on the problem being solved.


> Programming was always a tool

This is the narrow understanding of programming that is the whole point of contention.


Damn, I was really excited to have a new article by this guy. He writes some of the best articles out there, for sure.


bb wake up, new chichanow.ski just dropped

oh wait, (2020) :(

thankfully i don't remember much from this one, so i was still able to extract some dopamine from it


I find it refreshing that a webpage can give such joy, to the point of having people talk about it the same way you talk about books and movies. You know, being able to enjoy it for the first time and so on.


Absolutely agree, it's one of the modern internet's gems.

I can't wait until one of my kids, who seems very interested in physics-adjacent topics, is old enough to go through these pages with me.


Wow, that's very cool. I was puzzled at first as to why the Pokémon types were in Finnish!


Yeah, I don't really understand why someone would make a blog and use AI to write the articles. Isn't having a blog more about the joy of writing and the learning you do while writing it?


Because it's what cool people do, so if you want to be cool you do it. They didn't realise the cool part was having the knowledge and actually writing the text.

There are many similar things where people take shortcuts because they don't understand that the interesting part is the process and the skill, not the final result. It probably has to do with external validation. Reddit is full of "art" subs being polluted by these people, and generative AI is even leaking into leather work, wood carving, and lino cutting. It's a cancer.


Well, the world has become very superficial. People rarely question how they end up with a specific result, which makes cheating/outsourcing quite a good deal and even profitable for many.


Also, resume padding.


Great post and amazing progress in this field! However, I have to wonder if some of these letters were part of the training data for Gemini, since they are well-known and someone has probably already done the painstaking work of transcribing them...


Most likely, and it's probably inferring the structure from texts with "similar" writing forms. I tried it with my handwriting (in Italian) and the performance wasn't that stellar. More annoyingly, it is still an LLM and not a "pure" OCR, so some sentences were partially rephrased with different words than the ones in the text. This is crucially problematic if it were used to transcribe historical documents.


> Tried with my handwriting (in italian) and the performance wasn't that stellar.

Same here, for diaries/journals written in mixed Swedish/English/Spanish and in absolutely terrible handwriting.

I'd love to see the day handwriting recognition is truly solved, which is something I bet on when I started my journals, but it seems that day has yet to come. I'm eager to get there, though, so I can archive all of it!


> it is still an LLM and not a "pure" OCR

When does a character model become a language model?

If you're looking at block text with no connections between letter forms, each character mostly stands on its own. Except capital letters are much more likely at the beginning of a word or sentence than elsewhere, so you probably get a performance boost if you incorporate that.

Now we're considering two-character chunks. Cursive script connects the letterforms, and the connection changes based on both the source and target. We can definitely get a performance boost from looking at those.

Hmm you know these two-letter groupings aren't random. "ng" is much more likely if we just saw an "i". Maybe we need to take that into account.

Hmm actually whole words are related to each other! I can make a pretty good guess at what word that four-letter-wide smudge is if I can figure out the word before and after...

and now it's an LLM.
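The escalation above, visual evidence plus increasingly wide context, can be sketched as a toy Viterbi decoder. Everything here is made up for illustration: the bigram probabilities are invented, and the "visual scores" stand in for whatever a real character recognizer would emit.

```python
import math
from typing import Dict, List

# Hypothetical bigram log-probabilities; unlisted pairs get a floor value.
BIGRAM_LOGP = {
    ("i", "n"): math.log(0.30),
    ("n", "g"): math.log(0.25),
    ("i", "h"): math.log(0.01),
    ("h", "g"): math.log(0.001),
}
FLOOR = math.log(1e-4)

def decode(candidates: List[Dict[str, float]]) -> str:
    """Viterbi decode: at each position, combine the per-character visual
    log-score with a bigram prior over the previous character."""
    # best[c] = (total log-score, decoded string) for paths ending in c
    best = {c: (score, c) for c, score in candidates[0].items()}
    for col in candidates[1:]:
        nxt = {}
        for c, vis in col.items():
            nxt[c] = max(
                (ps + BIGRAM_LOGP.get((p, c), FLOOR) + vis, path + c)
                for p, (ps, path) in best.items()
            )
        best = nxt
    return max(best.values())[1]

# A smudge that looks equally like "n" or "h" between a clear "i" and "g":
smudged = [
    {"i": math.log(0.9)},
    {"n": math.log(0.5), "h": math.log(0.5)},
    {"g": math.log(0.9)},
]
print(decode(smudged))  # "ing" — the language prior breaks the visual tie
```

Notably, even if the visual evidence strongly favours "h" in the middle position, the rarity of "ih"/"hg" can still pull the decode back to "ing", which is exactly the slippery slope the comment describes.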


So it doesn't work is what you're saying, right?


Are you sure you used the Gemini 3.0 Pro model? Maybe try increasing the media resolution in AI Studio if the text is small.


I have a personal corpus of letters between my grandparents from WW2, my grandfather fighting in Europe and my grandmother in England. The ability of Claude and ChatGPT to transcribe them is extremely impressive, though I haven't worked on them in months, so this was with older models. At that time, neither system could properly organize pages, and ChatGPT would sometimes skip a paragraph.


I've also been working on half a dozen crates of old family letters. ChatGPT does well with them and is especially good at summarizing the letters. Unfortunately, all the output still has to be verified because it hallucinates words and phrases and drops lines here and there. So at this point, I still transcribe them by hand, because the verification process is actually more tiresome than just typing them up in the first place. Maybe I should just have ChatGPT verify MY transcriptions instead.


It helps when you can see the confidence of each token, which downloadable weights usually give you. Then whenever you (or your software) detect a low-confidence token, run over that section multiple times to generate alternatives, and either go with the highest-confidence one or manually review the suggestions. Easier than having to manually transcribe those parts, at least.
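A minimal sketch of that loop, assuming you already have (token, logprob) pairs out of a locally run model. The data shapes and function names here are illustrative, not any particular library's API:

```python
import math
from typing import List, Tuple

def flag_low_confidence(
    tokens: List[Tuple[str, float]],  # (token, logprob) pairs from the model
    threshold: float = 0.6,           # flag tokens given < 60% probability
) -> List[int]:
    """Return indices of tokens whose probability is below the threshold."""
    return [i for i, (_, lp) in enumerate(tokens) if math.exp(lp) < threshold]

def merge_alternatives(runs: List[List[Tuple[str, float]]], index: int) -> str:
    """Across several decoding runs, keep the highest-confidence token at `index`."""
    return max((run[index] for run in runs), key=lambda t: t[1])[0]

# One pass over a transcribed line: the model is unsure about the name.
tokens = [("Dear", math.log(0.99)), ("Maud", math.log(0.30)), (",", math.log(0.95))]
print(flag_low_confidence(tokens))  # [1]
```

In practice you would re-run only the flagged region (or the whole line) and feed the runs into `merge_alternatives`, falling back to manual review when no run is confident.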


Is there any way to do this with the frontier LLMs?


Ask them to mark low confidence words.


Do they actually have access to that info "in-band"? I would guess not. OTOH it should be straightforward for the LLM program to report this -- someone else commented that you can do this when running your own LLM locally, but I guess commercial providers have incentives not to make this info available.


Naturally, their "confidence" is represented as activations in layers close to the output, so they might be able to use it. Research ([0], [1], [2], [3]) shows that results of prompting LLMs to express their confidence correlate with their accuracy. The models tend to be overconfident, but in my anecdotal experience the latest models are passably good at judging their own confidence.

[0] https://ieeexplore.ieee.org/abstract/document/10832237

[1] https://arxiv.org/abs/2412.14737

[2] https://arxiv.org/abs/2509.25532

[3] https://arxiv.org/abs/2510.10913


interesting... I'll give that a shot


It used to be that the answer was logprobs, but it seems that is no longer available.


Always seemed strange to me that personal correspondence between two now-dead people is interesting. But I guess that is just my point of view. You could say the same thing about reading fiction, I guess.


Why on earth wouldn't it be interesting? Do you only care about your own life?


Possibly, but given it can also read my handwriting (which is much, MUCH worse than Boole's) with better accuracy than any human I've shown it to, that's probably not the explanation.


Shhhhh no one cares about data contamination anymore.


Then write something down yourself and upload a picture to gemini.google.com or ChatGPT. Hell, combine it: make yourself a quick math test, print it, solve it with a pen, and ask these models to correct it.

They're very good at it.


I don't know how to write like a 19th century mathematician, nor anyone earlier. I'm not sure OCR on Carolingian minuscule has been solved, let alone more ancient styles like Roman cursive or, god forbid, things like cuneiform. Especially since the corpora for these styles are so small, dataset contamination /is/ a major issue!


For that to be relevant to this post, they would need to write with secretary hand.


Yeah, even with this fixed it's going to be annoying, because it restarts first, so you have to stay at the PC to select Windows in GRUB.


Full-disk encryption, as useful as it is, also makes this a royal pain. Updates can't be performed unattended, because each restart done during the updates requires providing the password before continuing.


Why don't you let Windows manage the password for you? It will be safely stored in the Cloud. /s


You can configure GRUB to default to the last selected menu entry.
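For reference, this is the saved-default mechanism in `/etc/default/grub`, after which the GRUB config must be regenerated (the exact regeneration command varies by distro; `update-grub` is the Debian/Ubuntu wrapper):

```shell
# /etc/default/grub
GRUB_DEFAULT=saved       # boot the entry recorded as the saved default
GRUB_SAVEDEFAULT=true    # record whichever entry is chosen at the menu

# then regenerate the config, e.g. on Debian/Ubuntu:
sudo update-grub
```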


I also thought this but OP is right: https://dev.timenote.info/de/Nenad-Petrovic

> In 1964 Petrović constructed a position with 218 possible moves for White.


Great story and I'm sure the new math chops help her a lot during her studies.

In case anyone else is interested, it is possible to study computer science without any entrance exams thanks to the Digital Education for All initiative (https://www.helsinki.fi/fi/projektit/digital-education-all in Finnish, sorry). You are granted the full right to study after completing 60 credits (out of a total 180 credits) worth of courses in the first year.

