Catherine Arnett (@linguist_cat)'s Twitter Profile
Catherine Arnett

@linguist_cat

NLP Researcher @AiEleuther. PhD @UCSanDiego Linguistics.
Previously @pleiasfr @EdinburghUni. Interested in multilingual NLP, tokenizers, open science. She/her.

ID: 1532493362296606720

Link: https://catherinearnett.github.io/
Joined: 02-06-2022 22:44:43

124 Tweets

533 Followers

455 Following

EleutherAI (@aieleuther)

Can you train a performant language model without using unlicensed text?

We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 & 2.
EleutherAI (@aieleuther)

We are launching a new speaker series at EleutherAI, focused on promoting recent research by our team and community members.

Our first talk is by Catherine Arnett on tokenizers, their limitations, and how to improve them.