None, since 'legal' for AI training is not yet defined, but OLMo is trained on the Dolma 3 dataset, whose sources include:
1. Common Crawl
2. GitHub
3. Wikipedia, Wikibooks
4. Reddit (pre-2023)
5. Semantic Scholar
6. Project Gutenberg
* https://arxiv.org/pdf/2402.00159
https://huggingface.co/datasets/allenai/dolma
https://huggingface.co/models?dataset=dataset:allenai/dolma