He's building the model from scratch, as the title suggests. He only trains a small ~10M-parameter model on it, something that is feasible on a single GPU. For comparison, GPT-3 has 175B parameters.
> wondering like how hard is it to actually replicate what openAI has done if you had the money to pay for the training?
It would most certainly be possible for another company to build something very similar (models of similar size have even been released publicly). I'm honestly unsure why Microsoft would rather pay $10B to acquire less than half of OpenAI, as they have the hardware to do it (OpenAI uses MS cloud products). There must be some business reasons I don't understand. OpenAI definitely has some very talented people working for it, though.
>I'm honestly unsure why Microsoft would rather pay $10B to acquire less than half of OpenAI, as they have the hardware to do it (OpenAI uses MS cloud products).
Because the hardware is the least interesting part of it?
Microsoft buys the know-how, the talent, and perhaps some patents, but most importantly the GPT brand name...
Does the time to train the model increase linearly with the number of parameters, or exponentially?
In other words, GPT-3 is 17,500X the number of parameters but does that mean you can train it in 17,500X the amount of time it takes to train the 10M param model?
In theory it should be linear. In practice, however, the parallelization is not perfect: some overlapping parts of the gradients are computed on multiple GPUs at the same time, so expect a constant-factor slowdown on average.
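To give a feel for why imperfect parallelization caps the speedup, here's an illustrative Amdahl's-law-style sketch (my own toy numbers, not measurements from any real training run): if a fraction of each training step (gradient sync, redundant computation, etc.) doesn't scale across GPUs, the achievable speedup flattens out well below the GPU count.

```python
# Toy Amdahl's-law model of multi-GPU training speedup.
# `parallel` is the assumed fraction of each step that scales with GPU
# count; the rest (communication, sync) is treated as serial overhead.

def speedup(n_gpus: int, parallel: float = 0.95) -> float:
    """Ideal speedup on n_gpus when only `parallel` of the work scales."""
    return 1.0 / ((1.0 - parallel) + parallel / n_gpus)

for n in (1, 8, 64, 512):
    print(f"{n:4d} GPUs -> {speedup(n):6.2f}x speedup")
```

Even with 95% of the step parallelizable, 512 GPUs buy you well under 20x in this toy model, which is why real setups work hard to overlap communication with computation.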
On top of what other people have said about parallelism overheads, you normally need more data to train a bigger network and the training time is roughly proportional to network size * training data.
IIRC OpenAI used about a million times more data to train GPT-3 than Karpathy used in this video, so a naive estimate is that it would take about 20 billion times more compute. This could be a significant overestimate, since Karpathy probably reused each piece of the training set more times than OpenAI did.
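The back-of-the-envelope above can be written out explicitly. The "compute ~ parameters x data" rule and the "million times more data" figure are taken from the comments here, not from OpenAI's published numbers, so treat the result as an order-of-magnitude sketch:

```python
# Naive training-compute estimate: compute scales roughly with
# (parameter count) x (amount of training data).

small_params = 10e6    # ~10M parameters, the model in the video
gpt3_params = 175e9    # 175B parameters

param_ratio = gpt3_params / small_params   # ~17,500x more parameters
data_ratio = 1e6                           # "a million times more data" (IIRC)

compute_ratio = param_ratio * data_ratio
print(f"~{compute_ratio:.2e}x more compute")  # ~1.75e10, i.e. ~20 billion x
```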
I am not from the LLM world, but I believe it's mostly constrained by the standard multiprocessing limits: communication and synchronization across many workers, some of which operate over an exceedingly slow Ethernet interface.