Hacker News | coolness's comments

This used to be the case: research was conducted mostly at academic institutions that did not provide degrees [1]. The "research university" is a relatively new thing.

[1] https://asteriskmag.com/issues/10/the-origin-of-the-research...


Interesting read. I always wondered where the idea of the "thesis" and other extracurricular activities came from, for both students and professors.

Nowadays, promotion of professors through the ranks (Assistant, Associate, Full Professor) depends solely on the number of papers they publish in Q1 journals. But the research may be entirely bogus: the same ideas repurposed hundreds of times by different professors.

The entire concept of "systematic knowledge" has gone downhill.


Even more important than the papers is whether you can raise the money required to fund the lab that produces your prestigious journal papers. And the further down the league table you go, the less important the "prestigious" part gets.


Slight tangent: I was wondering why DeepSeek would develop something like this. The linked paper says

> In production, DeepSeek-OCR can generate training data for LLMs/VLMs at a scale of 200k+ pages per day (a single A100-40G).

That... doesn't sound legal


HathiTrust (https://en.wikipedia.org/wiki/HathiTrust) has 6.7 million volumes in the public domain, in PDF from what I understand. If we figure ~200 pages per volume, that's about 1.3 billion pages, or roughly 6,700 days of work for an A100-40G at 200k pages a day. That is one way to interpret what they say as being legal. I don't have any information on what happens at DeepSeek, so I can't say whether it's true.
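Checking the arithmetic without any intermediate rounding (all inputs are the figures quoted above; the pages-per-volume count is the commenter's rough estimate):

```python
# Back-of-the-envelope check of the HathiTrust numbers.
volumes = 6_700_000          # public-domain volumes
pages_per_volume = 200       # rough estimate
pages_per_day = 200_000      # DeepSeek-OCR throughput on one A100-40G

total_pages = volumes * pages_per_volume   # 1,340,000,000 pages
days = total_pages / pages_per_day         # 6,700 days, i.e. ~18 years on one GPU

print(total_pages, days)
```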


Great post. Super sad state of affairs, but we move on and learn new things. Programming was always a tool, and now the tool has changed from something that required skill and understanding to complaining to a neural net. We just have to focus more on the problem being solved.


> Programming was always a tool

This is the narrow understanding of programming that is the whole point of contention.


Damn, I was really excited to have a new article by this guy. He writes some of the best articles out there, for sure.


bb wake up, new chichanow.ski just dropped

oh wait, (2020) :(

thankfully i don't remember much from this one, so i was still able to extract some dopamine from it


I find it refreshing that a webpage can give such joy, to the point of having people talk about it the same way you talk about books and movies. You know, being able to enjoy it for the first time and so on.


Absolutely agree, it's one of the modern internet's gems.

I can't wait until one of my kids, who seems very interested in physics-adjacent topics, is old enough to go through these pages with me.


Wow, that's very cool. I was puzzled at first as to why the Pokémon types were in Finnish!


Yeah, I don't really understand why someone would make a blog and use AI to write the articles. Isn't having a blog more about the joy of writing and the learning you do while writing it?


Because it's what cool people do, so if you want to be cool you do it. They didn't realise the cool part was having the knowledge and actually writing the text.

There are many similar things where people take shortcuts because they don't understand that the interesting part is the process and the skill, not the final result. It probably has to do with external validation. Reddit is full of "art" subs being polluted by these people, and generative AI is even leaking into leather work, wood carving, and lino cutting. It's a cancer.


Well, the world has become very superficial. People rarely question how they end up with a specific result, which makes cheating/outsourcing quite a good deal and even profitable for many.


Also, resume padding.


Great post and amazing progress in this field! However, I have to wonder if some of these letters were part of the training data for Gemini, since they are well-known and someone has probably already done the painstaking work of transcribing them...


Most likely, and it's probably inferring the structure from texts with "similar" writing forms. I tried it with my handwriting (in Italian) and the performance wasn't that stellar. More annoyingly, it is still an LLM and not a "pure" OCR, so some sentences were partially rephrased with different words than the ones in the text. This is crucially problematic if it were used to transcribe historical documents.


> Tried with my handwriting (in italian) and the performance wasn't that stellar.

Same here, for diaries/journals written in mixed Swedish/English/Spanish and in absolutely terrible handwriting.

I'd love to see the day handwriting recognition is truly solved, which is something I bet on when I started my journals, but it seems that day has yet to come. I'm eager to get there, though, so I can archive all of it!


> it is still an LLM and not a "pure" OCR

When does a character model become a language model?

If you're looking at block text with no connections between letter forms, each character mostly stands on its own. Except capital letters are much more likely at the beginning of a word or sentence than elsewhere, so you probably get a performance boost if you incorporate that.

Now we're considering two-character chunks. Cursive script connects the letterforms, and the connection changes based on both the source and target. We can definitely get a performance boost from looking at those.

Hmm you know these two-letter groupings aren't random. "ng" is much more likely if we just saw an "i". Maybe we need to take that into account.

Hmm actually whole words are related to each other! I can make a pretty good guess at what word that four-letter-wide smudge is if I can figure out the word before and after...

and now it's an LLM.
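The escalation above, visual evidence plus increasingly wide context, can be sketched as a toy Viterbi decoder. Everything here is made up for illustration: the bigram probabilities are invented, and the "visual scores" stand in for whatever a real character recognizer would emit.

```python
import math
from typing import Dict, List

# Hypothetical bigram log-probabilities; unlisted pairs get a floor value.
BIGRAM_LOGP = {
    ("i", "n"): math.log(0.30),
    ("n", "g"): math.log(0.25),
    ("i", "h"): math.log(0.01),
    ("h", "g"): math.log(0.001),
}
FLOOR = math.log(1e-4)

def decode(candidates: List[Dict[str, float]]) -> str:
    """Viterbi decode: at each position, combine the per-character visual
    log-score with a bigram prior over the previous character."""
    # best[c] = (total log-score, decoded string) for paths ending in c
    best = {c: (score, c) for c, score in candidates[0].items()}
    for col in candidates[1:]:
        nxt = {}
        for c, vis in col.items():
            nxt[c] = max(
                (ps + BIGRAM_LOGP.get((p, c), FLOOR) + vis, path + c)
                for p, (ps, path) in best.items()
            )
        best = nxt
    return max(best.values())[1]

# A smudge that looks equally like "n" or "h" between a clear "i" and "g":
smudged = [
    {"i": math.log(0.9)},
    {"n": math.log(0.5), "h": math.log(0.5)},
    {"g": math.log(0.9)},
]
print(decode(smudged))  # "ing" — the language prior breaks the visual tie
```

Notably, even if the visual evidence strongly favours "h" in the middle position, the rarity of "ih"/"hg" can still pull the decode back to "ing", which is exactly the slippery slope the comment describes.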


So it doesn't work is what you're saying, right?


Are you sure you used the Gemini 3.0 Pro model? Maybe try increasing the media resolution in AI Studio if the text is small.


I have a personal corpus of letters between my grandparents from WW2, my grandfather fighting in Europe and my grandmother in England. The ability of Claude and ChatGPT to transcribe them is extremely impressive, though I haven't worked on them in months, so this was with older models. At that time, neither system could properly organize pages, and ChatGPT would sometimes skip a paragraph.


I've also been working on half a dozen crates of old family letters. ChatGPT does well with them and is especially good at summarizing the letters. Unfortunately, all the output still has to be verified because it hallucinates words and phrases and drops lines here and there. So at this point, I still transcribe them by hand, because the verification process is actually more tiresome than just typing them up in the first place. Maybe I should just have ChatGPT verify MY transcriptions instead.


It helps when you can see the confidence of each token, which downloadable weights usually give you. Then whenever you (or your software) detect a low-confidence token, run over that section multiple times to generate alternatives, and either go with the highest-confidence one or manually review the suggestions. Easier than having to manually transcribe those parts, at least.
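A minimal sketch of that loop, assuming you already have (token, logprob) pairs out of a locally run model. The data shapes and function names here are illustrative, not any particular library's API:

```python
import math
from typing import List, Tuple

def flag_low_confidence(
    tokens: List[Tuple[str, float]],  # (token, logprob) pairs from the model
    threshold: float = 0.6,           # flag tokens given < 60% probability
) -> List[int]:
    """Return indices of tokens whose probability is below the threshold."""
    return [i for i, (_, lp) in enumerate(tokens) if math.exp(lp) < threshold]

def merge_alternatives(runs: List[List[Tuple[str, float]]], index: int) -> str:
    """Across several decoding runs, keep the highest-confidence token at `index`."""
    return max((run[index] for run in runs), key=lambda t: t[1])[0]

# One pass over a transcribed line: the model is unsure about the name.
tokens = [("Dear", math.log(0.99)), ("Maud", math.log(0.30)), (",", math.log(0.95))]
print(flag_low_confidence(tokens))  # [1]
```

In practice you would re-run only the flagged region (or the whole line) and feed the runs into `merge_alternatives`, falling back to manual review when no run is confident.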


Is there any way to do this with the frontier LLMs?


Ask them to mark low confidence words.


Do they actually have access to that info "in-band"? I would guess not. OTOH it should be straightforward for the LLM program to report this -- someone else commented that you can do this when running your own LLM locally, but I guess commercial providers have incentives not to make this info available.


Naturally, their "confidence" is represented as activations in layers close to the output, so they might be able to use it. Research ([0], [1], [2], [3]) shows that results of prompting LLMs to express their confidence correlate with their accuracy. The models tend to be overconfident, but in my anecdotal experience the latest models are passably good at judging their own confidence.

[0] https://ieeexplore.ieee.org/abstract/document/10832237

[1] https://arxiv.org/abs/2412.14737

[2] https://arxiv.org/abs/2509.25532

[3] https://arxiv.org/abs/2510.10913


interesting... I'll give that a shot


It used to be that the answer was logprobs, but it seems that is no longer available.


Always seemed strange to me that personal correspondence between two now-dead people is interesting. But I guess that is just my point of view. You could say the same thing about reading fiction, I guess.


Why on earth wouldn't it be interesting? Do you only care about your own life?


Possibly, but given it can also read my handwriting (which is much, MUCH worse than Boole's) with better accuracy than any human I've shown it to, that's probably not the explanation.


Shhhhh no one cares about data contamination anymore.


Then write something down yourself and upload a picture to gemini.google.com or ChatGPT. Hell, combine it: make yourself a quick math test, print it, solve it with a pen, and ask these models to correct it.

They're very good at it.


I don't know how to write like a 19th century mathematician, nor anyone earlier. I'm not sure OCR on Carolingian minuscule has been solved, let alone more ancient styles like Roman cursive or, god forbid, things like cuneiform. Especially since the corpora for these styles are so small, dataset contamination /is/ a major issue!


For that to be relevant to this post, they would need to write with secretary hand.


Yeah, even with this fixed it's going to be annoying, because it restarts first, so you have to stay at the PC to select Windows in GRUB.


Full-disk encryption, as useful as it is, also makes this a royal pain. Updates can't be performed unattended, because each restart done during the updates requires providing the password before continuing.


Why don't you let Windows manage the password for you? It will be safely stored in the Cloud. /s


You can configure GRUB to default to the last selected menu entry.
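For reference, this is the saved-default mechanism in `/etc/default/grub`, after which the GRUB config must be regenerated (the exact regeneration command varies by distro; `update-grub` is the Debian/Ubuntu wrapper):

```shell
# /etc/default/grub
GRUB_DEFAULT=saved       # boot the entry recorded as the saved default
GRUB_SAVEDEFAULT=true    # record whichever entry is chosen at the menu

# then regenerate the config, e.g. on Debian/Ubuntu:
sudo update-grub
```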


I also thought this but OP is right: https://dev.timenote.info/de/Nenad-Petrovic

> In 1964 Petrović constructed a position with 218 possible moves for White.


Great story and I'm sure the new math chops help her a lot during her studies.

In case anyone else is interested, it is possible to study computer science without any entrance exams thanks to the Digital Education for All initiative (https://www.helsinki.fi/fi/projektit/digital-education-all in Finnish, sorry). You are granted the full right to study after completing 60 credits (out of a total 180 credits) worth of courses in the first year.

