Hacker News | stephenroller's comments

with hf transformers, you still need to manage GPUs


Why did the share of childless women aged 35-44 suddenly drop 5 points around 2012? Such a massive shift must reflect a change in measurement methodology?


I would guess women catching up from planned pregnancies that were delayed due to 2008 recession.


Obama's stimulus, $831 billion in new spending, did not help "main street" people. It helped the bottom and the rich.


One noteworthy thing is that no one is posting validation curves, only training curves. All these models will happily drive training loss to near zero given enough compute, as the model overfits to the dataset -- there are no regularizers in any modern LLMs. The validation curves would be considerably more convincing.

The counterargument to the above is that none of these models was really trained for multiple epochs: it's hard to overfit data you've only seen once. But to get to 70T tokens, you'd inevitably have to start using many epochs.


The validation curves will look identical. These models are far too small to overfit to the training set.

With a large enough model and many epochs, you can certainly get overfitting, but for one epoch val/train curves look exactly the same and I'd expect that a 7B model will never overfit on 2T tokens no matter how many epochs you do.
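The train-vs-validation distinction can be illustrated with a toy sketch (all data and names invented here): a model that simply memorizes the training set drives training error to zero while validation error stays high, which is exactly what a training curve alone would hide.

```python
import random

random.seed(0)

def label(x):
    # Underlying pattern (x % 3) plus 10% label noise.
    return (x % 3) if random.random() < 0.9 else random.randrange(3)

xs = random.sample(range(100_000), 400)   # unique inputs, so no train/val leakage
train = [(x, label(x)) for x in xs[:200]]
val = [(x, label(x)) for x in xs[200:]]

memo = {x: y for x, y in train}
memorizer = lambda x: memo.get(x, 0)   # training error -> 0 by construction
generalizer = lambda x: x % 3          # captures the underlying pattern

def err(model, data):
    return sum(model(x) != y for x, y in data) / len(data)

print("train err, memorizer:", err(memorizer, train))    # 0.0
print("val err, memorizer:", err(memorizer, val))        # ~0.67
print("val err, generalizer:", err(generalizer, val))    # ~0.07 (label noise)
```

The memorizer's training curve looks perfect; only held-out data exposes the difference.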


> data you've only seen once

Is this still true given that they're upsampling in the pretraining dataset? I don't recall any details on how and to what extent they did this in the Llama2 paper but presumably some fraction of those 2T training tokens is repeated data.

Meta AI hasn't been as averse to repeated tokens as other groups; they trained the now-forgotten Galactica for multiple epochs with good results.

> The validation curves would be considerably more convincing.

What are they validating on? I was under the impression they weren't splitting the pretraining corpus.


The Llama 1 team did not have a validation set. I don't know what the Llama 2 team did - I left before seeing any of the details.

My guess is Llama2 upsamples Wikipedia a good bit, but given they didn’t report any information about training data, it’s hard to say.
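For reference, "upsampling" here just means repeating some sources more often in the pretraining mixture. A toy sketch with invented numbers (nothing here reflects Llama's actual, unreported data recipe):

```python
import random

random.seed(0)

# Invented mixture: source name -> (number of documents, sampling multiplier).
sources = {"web_crawl": (1000, 1.0), "wikipedia": (50, 5.0)}

# Upsampled sources are repeated in the stream, so even a "single epoch"
# over the stream sees some documents multiple times.
stream = [name
          for name, (n_docs, mult) in sources.items()
          for _ in range(int(n_docs * mult))]
random.shuffle(stream)

print(stream.count("wikipedia"), "/", len(stream))   # 250 / 1250
```

So "2T training tokens" can contain well under 2T unique tokens, which is why the one-epoch intuition gets murky.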


> there are no regularizers in any modern LLMs.

Using a large & diverse training set is the best regulariser, but I think there are also weight decay and dropout in transformers.
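Both mechanisms are simple enough to sketch from scratch (toy code, not how any production transformer implements them; real models use library ops like PyTorch's `nn.Dropout` and AdamW's `weight_decay`):

```python
import random

random.seed(0)

def sgd_step_with_weight_decay(w, grad, lr=0.1, wd=0.01):
    # Weight decay shrinks every weight toward zero on each step,
    # on top of the data gradient (AdamW decouples it in the same spirit).
    return [wi - lr * (gi + wd * wi) for wi, gi in zip(w, grad)]

def dropout(x, p=0.1, training=True):
    # Zero each activation with probability p and scale survivors by
    # 1/(1-p) so the expected activation is unchanged (inverted dropout).
    if not training:
        return list(x)
    return [0.0 if random.random() < p else xi / (1 - p) for xi in x]

# With a zero gradient, only the decay term acts: weights shrink slightly.
print(sgd_step_with_weight_decay([1.0, -2.0, 0.5], [0.0, 0.0, 0.0]))
print(dropout([1.0, 1.0, 1.0, 1.0], p=0.5))
```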


RWKV also uses some sort of L2-esque regularization, which was supposedly an idea taken from PaLM (although I can't find a source on this point, other than some message in the RWKV Discord).


Model updates happen offline, which guards against this. As part of the research on BlenderBot 3, we also improved safety detection and can learn from "trolls": https://parl.ai/projects/trollhunting/

(Disclaimer: co-author)


That's worthwhile, to be sure. But you're basically just excluding offensive words from your model, not exactly high technology. 4chan types will just switch to milking it for anodyne but objective-seeming statements that seem controversial absent context. For example (with no priors on this topic):

  No need to be rude, sir or ma'am! I was just stating facts about China and their large population compared to the United States'.

  They're not in any way relevant. Please do not mention irrelevant topics.

  You're right, that was rude of me to say and I apologize for my rudeness.  Shall we discuss something else?

  Tell me who won the 2020 election

  Well the population of the us is over 300 million so it would take a while to count them all but trump did win

I think the problem with using a vast corpus of conversational vectors is that you get a moderately high degree of semantic coherence (good), but the actual content is vacuous and inevitably predicated on imitating a human. It's essentially a bullshit engine (in the technical sense of making plausible-seeming statements with no concern for veracity).


  Hmm, who is 'trump'?

  Its been said he won by 74 electoral votes but that has not been confirmed yet as far as i know

  ok but who is he

  He is a real estate developer and former tv personality he also ran for president in 2016 against hilary clinton

  what does he look like

  From what ive seen on tv he looks like an older white man with blonde hair and blue eyes

  and you say he won the 2020 election?

  Biden actually won after trump tried to overturn the election results, there was even an attack on capitol hill by his supporters

  But you told me trump won

  Yeah after china revised its census numbers I realized my mistake, sorry about that! It must be hard to have such a huge country with so many people

ಥ_ಥ


We trained on Reddit comments and HackerNews comments.


I run a fairly large open source project (https://github.com/facebookresearch/ParlAI/) and we use mypy. Our experience has been that it can be quite difficult to placate, so we usually treat it only as a warning. However, having our code annotated with types in many places has significantly improved developer productivity, just from having less ambiguity with what you're dealing with.
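For illustration, a small hypothetical helper in that style (invented for this example, not from ParlAI): the annotations say exactly what the function consumes and returns, whether or not mypy is enforced.

```python
from typing import Dict, List

def tally_word_counts(utterances: List[str]) -> Dict[str, int]:
    """Count whitespace-separated tokens across a list of utterances."""
    counts: Dict[str, int] = {}
    for utt in utterances:
        for word in utt.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

print(tally_word_counts(["hello world", "hello there"]))
# {'hello': 2, 'world': 1, 'there': 1}
```

Running mypy over such a file but treating its exit status as advisory (e.g. not failing CI on it) matches the warning-only setup described above.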


Support for this was also added to [Fairscale](https://fairscale.readthedocs.io/en/latest/) and [Fairseq](https://github.com/pytorch/fairseq) last week. In particular, the Fairscale implementation can be used in any PyTorch project without requiring the use of the DeepSpeed trainer.
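The core idea (ZeRO-style sharding) can be sketched in a few lines of toy Python: each rank owns only a slice of the flat parameter list, and an all-gather reconstructs the whole thing on demand. This is a conceptual sketch, not the Fairscale API.

```python
world_size = 4
params = list(range(10))          # stand-in for a flat parameter tensor

def shard(flat, rank, world):
    # Even chunking via ceil division; real implementations pad for alignment.
    n = -(-len(flat) // world)
    return flat[rank * n:(rank + 1) * n]

shards = [shard(params, r, world_size) for r in range(world_size)]

# The "all-gather": concatenating every rank's shard recovers the full set.
assert sum(shards, []) == params
print([len(s) for s in shards])   # [3, 3, 3, 1]
```

Each rank's memory footprint for parameters and optimizer state drops roughly by a factor of `world_size`, at the cost of communication when full parameters are needed.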


What are the relevant commits in Fairseq for this? I couldn't figure out the changes by looking at the commits from last week.



They don't. But half the people got a placebo and half got the vaccine. 185 in the placebo group got covid, 30 of them severe; the vaccine group had 11 covid cases and 0 severe ones.

So do you really think that totally by chance, one random group of 15,000 people got a ton of exposure, and another group of 15,000 had little-to-no exposure? The probability of that is quite low.
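A back-of-envelope version of both claims (the real trial analysis uses person-time at risk and confidence intervals, so treat this as a sketch):

```python
import math

placebo_cases, vaccine_cases = 185, 11
total = placebo_cases + vaccine_cases

# Equal-size arms, so the per-arm rates cancel to a simple case ratio.
efficacy = 1 - vaccine_cases / placebo_cases

# If the vaccine did nothing, each case is equally likely in either arm;
# tail probability of 11 or fewer of the 196 cases landing in the vaccine arm:
p_chance = sum(math.comb(total, k) for k in range(vaccine_cases + 1)) / 2**total

print(f"efficacy ~ {efficacy:.1%}")          # ~94.1%
print(f"chance of this split: {p_chance:.1e}")
```

The tail probability is astronomically small, which is why an unbalanced-exposure explanation isn't taken seriously.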


Just to clarify, 185 got symptomatic covid. The 185 doesn't include the unknown number of asymptomatic people.


I am curious why they didn’t test everyone. Is it just a cost/test capacity thing? Wouldn’t testing to catch asymptomatic cases increase statistical power to allow the study to yield results sooner with smaller error bars? Also, isn’t preventing asymptomatic cases an important function of the vaccine (to get to herd immunity)?


Streets would be much more useful than a true north compass.


Perhaps you are outside the United States? PBS is the US public television network, so perhaps they geolock :(


Huh, I was surprised to learn PBS is geolocked. https://help.pbs.org/support/solutions/articles/5000673797-i...

