30T tokens, 20.5T in English, allegedly high quality, can’t wait to see people start putting it to use!
Related GitHub repo: https://github.com/togethercomputer/RedPajama-Data
Maybe I misread it, but this was the source of the 5T remark:
https://news.ycombinator.com/item?id=38077521#38080442
I think the implication is that the dataset becomes even more useful if you don’t jam the whole 30T tokens into training, but instead filter it further down to a reasonable number of tokens, around 5T, and train on that subset.
I could be wrong, because they do explicitly say deduplicating, but it’s phrased oddly either way.
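For concreteness, here’s a minimal sketch of what that kind of quality filtering down to a token budget could look like. The field names (`text`, `quality_score`, `token_count`), the threshold, and the jsonl.gz layout are all illustrative assumptions, not the dataset’s actual schema:

```python
# Hypothetical sketch: filter gzipped jsonl shards of documents down to a fixed
# token budget using a per-document quality score. Field names and the budget
# are illustrative, not RedPajama's actual schema.
import gzip
import json

TOKEN_BUDGET = 5_000_000_000_000  # ~5T tokens, as discussed above


def filter_shard(path, min_quality=0.5):
    """Yield documents from one gzipped jsonl shard that clear a quality threshold."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            doc = json.loads(line)
            if doc.get("quality_score", 0.0) >= min_quality:
                yield doc


def build_subset(shard_paths, min_quality=0.5):
    """Collect filtered documents until the token budget is reached."""
    kept, total_tokens = [], 0
    for path in shard_paths:
        for doc in filter_shard(path, min_quality):
            kept.append(doc)
            total_tokens += doc.get("token_count", 0)
            if total_tokens >= TOKEN_BUDGET:
                return kept, total_tokens
    return kept, total_tokens
```

In practice you’d probably stream the kept documents back to disk rather than hold them in memory, and combine the score threshold with the dedup step they mention, but the idea is the same: cut the 30T corpus down to a smaller, higher-quality training set.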