Jade (@euclaise_) 's Twitter Profile
Jade

@euclaise_

Researcher w/ @NousResearch
euclaise.xyz on the other site 🦋

ID: 1334187897189138437

http://hf.co/euclaise · Joined 02-12-2020 17:29:29

9.9K Tweets

2.2K Followers

281 Following

Benjamin Spiegel (@superspeeg) 's Twitter Profile Photo

Why did only humans invent graphical systems like writing? 🧠✍️ In our new paper at CogSci Society, we explore how agents learn to communicate using a model of pictographic signification similar to human proto-writing. 🧵👇

Stella Biderman (@blancheminerva) 's Twitter Profile Photo

Really incredible detective work by Shivalika Singh et al. at Cohere Labs and elsewhere documenting the ways in which lmarena.ai works with companies to help them game the leaderboard. arxiv.org/abs/2504.20879

Prime Intellect (@primeintellect) 's Twitter Profile Photo

Releasing INTELLECT-2: We’re open-sourcing the first 32B parameter model trained via globally distributed reinforcement learning:
• Detailed Technical Report
• INTELLECT-2 model checkpoint
primeintellect.ai/blog/intellect…

zed (@zmkzmkz) 's Twitter Profile Photo

I think the gated attention output thing was first used in the Forgetting Transformer, but I wasn't convinced by why it was good there. This paper is exactly the isolated ablation paper that I wanted to see. Looks like we'll be using gates from now on (also softpick got cited, yay)
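
As an illustration of the mechanism being discussed (a sketch of the general idea, not the exact formulation in the cited paper or the Forgetting Transformer): the gate is typically an elementwise sigmoid, computed from the layer input and multiplied into the attention output before the output projection.

```python
# Minimal sketch of a sigmoid-gated attention output (illustrative only; the exact
# placement and parametrization in the papers discussed above may differ).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.gate = nn.Linear(d_model, d_model, bias=False)  # per-element output gate
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d) -> (b, heads, t, d_head)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        o = o.transpose(1, 2).reshape(b, t, d)
        # The gate: sigmoid of a linear map of the layer input, applied elementwise
        # to the attention output before the output projection.
        return self.proj(torch.sigmoid(self.gate(x)) * o)
```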

Nous Research (@nousresearch) 's Twitter Profile Photo

Announcing the launch of Psyche
nousresearch.com/nous-psyche/

Nous Research is democratizing the development of Artificial Intelligence. Today, we’re embarking on our greatest effort to date to make that mission a reality: The Psyche Network

Psyche is a decentralized training

Nous Research (@nousresearch) 's Twitter Profile Photo

We are launching testnet with the pre-training of a 40B parameter LLM:
- MLA Architecture
- Dataset consisting of FineWeb (14T) + FineWeb-2 minus some less common languages (4T), and The Stack v2 (1T)

The resulting model will be small enough to train on with a single H/DGX
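
Taking the parenthesized token counts above at face value (an assumption; the tweet doesn't give official mixture weights), the data mix totals roughly 19T tokens:

```python
# Rough token budget implied by the tweet (assumed breakdown, not an official one).
mixture_tokens_T = {
    "FineWeb": 14.0,            # 14T
    "FineWeb-2 (subset)": 4.0,  # 4T, minus some less common languages
    "The Stack v2": 1.0,        # 1T
}
total = sum(mixture_tokens_T.values())
print(f"total ≈ {total:.0f}T tokens")  # ≈ 19T
for name, t in mixture_tokens_T.items():
    print(f"{name}: {t:g}T ({t / total:.0%} of the mix)")
```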

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo

Qwen introduces:

WorldPM: Scaling Human Preference Modeling

"In this paper, we collect preference data from public forums covering diverse user communities, and conduct extensive training using 15M-scale data across models ranging from 1.5B to 72B parameters.
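
The tweet doesn't quote WorldPM's training objective; for context, the sketch below is the standard Bradley–Terry pairwise loss commonly used for preference/reward modeling on chosen-vs-rejected pairs (a generic recipe, not a description of the paper's method).

```python
# Generic Bradley–Terry pairwise preference loss (illustrative; WorldPM's actual
# objective and data pipeline may differ).
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """reward_* are per-pair scalar rewards, shape (batch,), from a reward head on an LM."""
    # Maximize P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with random rewards
rc, rr = torch.randn(8), torch.randn(8)
print(preference_loss(rc, rr))
```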

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxestex) 's Twitter Profile Photo

I think every such research project has to do a sanity check on a non-Qwen base. Qwens seem to improve from whatever. So, I'm sad to not be as excited as kalomaze about the implications.

Ali Behrouz (@behrouz_ali) 's Twitter Profile Photo

What makes attention the critical component for most advances in LLMs and what holds back long-term memory modules (RNNs)? Can we strictly generalize Transformers?

Presenting Atlas (A powerful Titan): a new architecture with long-term in-context memory that learns how to

emozilla (@theemozilla) 's Twitter Profile Photo

oh and you can just download and run the model as it trains, checkpoints every 12 hours or so huggingface.co/PsycheFoundati… huggingface.co/PsycheFoundati…
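
The links above are truncated, so the exact repo IDs aren't recoverable here; as a generic sketch (placeholder repo ID), pulling the latest pushed checkpoint from the Hugging Face Hub looks like this:

```python
# Generic Hub download sketch; "ORG/REPO" is a placeholder because the repo IDs
# in the tweet are truncated.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ORG/REPO",   # placeholder for the checkpoint repo
    revision="main",      # re-running picks up newly pushed checkpoints
)
print("checkpoint files at:", local_dir)
```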

zed (@zmkzmkz) 's Twitter Profile Photo

Sorry for the late update. I bring disappointing news: softpick does NOT scale to larger models. Overall training loss and benchmark results are worse than softmax on our 1.8B parameter models. We have updated the preprint on arXiv: arxiv.org/abs/2504.20966
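
For readers who haven't seen the preprint: softpick is a rectified drop-in replacement for softmax in attention. The sketch below follows my recollection of the formulation (rectified numerator, absolute-value denominator); treat the arXiv link above as authoritative.

```python
# Softpick as I recall it from the preprint (verify against arxiv.org/abs/2504.20966):
# the rectified numerator lets scores be exactly zero, and the absolute-value
# denominator means outputs need not sum to one.
import torch

def softpick(x: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    e = torch.expm1(x)  # e^x - 1, so a logit of 0 maps to exactly 0
    # Plain, non-numerically-stabilized form; production kernels handle overflow more carefully.
    return torch.relu(e) / (e.abs().sum(dim=dim, keepdim=True) + eps)
```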

Tiezhen WANG (@xianbao_qian) 's Twitter Profile Photo

RL training is too slow?

AReaL by Ant Research introduces an asynchronous approach that decouples generation from training to eliminate blocking. Combined with system-level optimizations, this method achieves up to 2.57× speedup.

Code open-sourced Yi Wu
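
AReaL's actual system is far more involved; the toy sketch below only illustrates the core idea described above, with generation running in its own thread and feeding a queue so the training loop never blocks on rollouts (all names are made up for illustration).

```python
# Toy illustration of decoupling generation from training (not AReaL's code).
import queue
import threading
import time

rollout_queue: "queue.Queue[dict]" = queue.Queue(maxsize=64)

def generator_loop(stop: threading.Event) -> None:
    """Stands in for rollout workers sampling from a (possibly slightly stale) policy."""
    step = 0
    while not stop.is_set():
        time.sleep(0.01)  # pretend to sample a trajectory
        rollout_queue.put({"step": step, "trajectory": f"rollout-{step}"})
        step += 1

def trainer_loop(num_updates: int) -> None:
    """Consumes whatever rollouts are ready instead of waiting for a synchronous batch."""
    for update in range(num_updates):
        batch = [rollout_queue.get() for _ in range(4)]  # pull 4 ready rollouts
        # ... compute the policy update on `batch` here ...
        print(f"update {update}: trained on {[b['trajectory'] for b in batch]}")

stop = threading.Event()
threading.Thread(target=generator_loop, args=(stop,), daemon=True).start()
trainer_loop(num_updates=3)
stop.set()
```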

Alexander Doria (@dorialexander) 's Twitter Profile Photo

Announcing the release of the official Common Corpus paper: a 20-page report detailing how we collected, processed, and published 2 trillion tokens of reusable data for LLM pretraining.

Tri Dao (@tri_dao) 's Twitter Profile Photo

State space models and RNNs compress history into a constant size state, while attn has KV cache scaling linearly in seqlen. We can instead start from RNNs and let the state size grow logarithmically with seqlen. Feels like a sweet spot. Also beautiful connection to classical
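
Making the comparison concrete with generic asymptotics (my framing, not figures from the linked work): for sequence length T and state/model width d,

```latex
\begin{aligned}
\text{full attention (KV cache):} \quad & M_{\text{attn}}(T) = O(T\,d) \\
\text{SSM / RNN (fixed state):}   \quad & M_{\text{rnn}}(T)  = O(d) \\
\text{log-growing state:}         \quad & M_{\text{log}}(T)  = O(d \log T)
\end{aligned}
```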

EleutherAI (@aieleuther) 's Twitter Profile Photo

Can you train a performant language model without using unlicensed text?

We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 & 2

snimu (@omouamoua) 's Twitter Profile Photo

New blog post. (I finally got around to writing down the negative results from my model-stacking experiments from a few months ago)

Lysandre (@lysandrejik) 's Twitter Profile Photo

We have heard the feedback about transformers being bloated with many layers of abstraction. With this, we expect to remove 50% of the code from the library. This will contribute to removing abstraction layers, aligned with our focus on simplification: x.com/art_zucker/sta…