None, since 'legal' for AI training is not yet defined, but OLMo is trained on the Dolma 3 dataset, whose sources include:
1. Common Crawl
2. GitHub
3. Wikipedia, Wikibooks
4. Reddit (pre-2023)
5. Semantic Scholar
6. Project Gutenberg
* https://arxiv.org/pdf/2402.00159
https://huggingface.co/datasets/allenai/dolma
https://huggingface.co/models?dataset=dataset:allenai/dolma