Jade (@euclaise_) 's Twitter Profile
Jade

@euclaise_

Researcher w/ @NousResearch
euclaise.xyz on the other site 🦋

ID: 1334187897189138437

http://hf.co/euclaise · Joined 02-12-2020 17:29:29

9.9K Tweets

2.2K Followers

281 Following

Benjamin Spiegel (@superspeeg) 's Twitter Profile Photo

Why did only humans invent graphical systems like writing? 🧠✍️ In our new paper at CogSci Society, we explore how agents learn to communicate using a model of pictographic signification similar to human proto-writing. 🧵👇

Stella Biderman (@blancheminerva) 's Twitter Profile Photo

Really incredible detective work by Shivalika Singh et al. at Cohere Labs and elsewhere documenting the ways in which lmarena.ai works with companies to help them game the leaderboard. arxiv.org/abs/2504.20879

Prime Intellect (@primeintellect) 's Twitter Profile Photo

Releasing INTELLECT-2: We’re open-sourcing the first 32B parameter model trained via globally distributed reinforcement learning:
• Detailed Technical Report
• INTELLECT-2 model checkpoint
primeintellect.ai/blog/intellect…

zed (@zmkzmkz) 's Twitter Profile Photo

I think the gated attention output thing was first used in the Forgetting Transformer, but I wasn't convinced by why it was good there. This paper is exactly the isolated ablation paper that I wanted to see. Looks like we'll be using gates from now on (also softpick got cited, yay)
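
As an illustration of the mechanism being discussed (a sketch of the general idea, not the exact formulation in the cited paper or the Forgetting Transformer): the gate is typically an elementwise sigmoid, computed from the layer input and multiplied into the attention output before the output projection.

```python
# Minimal sketch of a sigmoid-gated attention output (illustrative only; the exact
# placement and parametrization in the papers discussed above may differ).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.gate = nn.Linear(d_model, d_model, bias=False)  # per-element output gate
        self.proj = nn.Linear(d_model, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (b, t, d) -> (b, heads, t, d_head)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2) for z in (q, k, v))
        o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        o = o.transpose(1, 2).reshape(b, t, d)
        # The gate: sigmoid of a linear map of the layer input, applied elementwise
        # to the attention output before the output projection.
        return self.proj(torch.sigmoid(self.gate(x)) * o)
```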

Nous Research (@nousresearch) 's Twitter Profile Photo

Announcing the launch of Psyche
nousresearch.com/nous-psyche/

Nous Research is democratizing the development of Artificial Intelligence. Today, we’re embarking on our greatest effort to date to make that mission a reality: The Psyche Network

Psyche is a decentralized training

Nous Research (@nousresearch) 's Twitter Profile Photo

We are launching testnet with the pre-training of a 40B parameter LLM:
- MLA Architecture
- Dataset consisting of FineWeb (14T) + FineWeb-2 minus some less common languages (4T), and The Stack v2 (1T)

The resulting model will be small enough to train on with a single H/DGX
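
Taking the parenthesized token counts above at face value (an assumption; the tweet doesn't give official mixture weights), the data mix totals roughly 19T tokens:

```python
# Rough token budget implied by the tweet (assumed breakdown, not an official one).
mixture_tokens_T = {
    "FineWeb": 14.0,            # 14T
    "FineWeb-2 (subset)": 4.0,  # 4T, minus some less common languages
    "The Stack v2": 1.0,        # 1T
}
total = sum(mixture_tokens_T.values())
print(f"total ≈ {total:.0f}T tokens")  # ≈ 19T
for name, t in mixture_tokens_T.items():
    print(f"{name}: {t:g}T ({t / total:.0%} of the mix)")
```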

Tanishq Mathew Abraham, Ph.D. (@iscienceluvr) 's Twitter Profile Photo

Qwen introduces:

WorldPM: Scaling Human Preference Modeling

"In this paper, we collect preference data from public forums covering diverse user communities, and conduct extensive training using 15M-scale data across models ranging from 1.5B to 72B parameters.
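
The tweet doesn't quote WorldPM's training objective; for context, the sketch below is the standard Bradley–Terry pairwise loss commonly used for preference/reward modeling on chosen-vs-rejected pairs (a generic recipe, not a description of the paper's method).

```python
# Generic Bradley–Terry pairwise preference loss (illustrative; WorldPM's actual
# objective and data pipeline may differ).
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """reward_* are per-pair scalar rewards, shape (batch,), from a reward head on an LM."""
    # Maximize P(chosen preferred over rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with random rewards
rc, rr = torch.randn(8), torch.randn(8)
print(preference_loss(rc, rr))
```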

Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞) (@teortaxestex) 's Twitter Profile Photo

I think every such research project has to do a sanity check on a non-Qwen base. Qwens seem to improve from whatever. So, I'm sad to not be as excited as kalomaze about the implications.

Ali Behrouz (@behrouz_ali) 's Twitter Profile Photo

What makes attention the critical component for most advances in LLMs and what holds back long-term memory modules (RNNs)? Can we strictly generalize Transformers?

Presenting Atlas (A powerful Titan): a new architecture with long-term in-context memory that learns how to

emozilla (@theemozilla) 's Twitter Profile Photo

oh and you can just download and run the model as it trains, checkpoints every 12 hours or so huggingface.co/PsycheFoundati… huggingface.co/PsycheFoundati…
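
The links above are truncated, so the exact repo IDs aren't recoverable here; as a generic sketch (placeholder repo ID), pulling the latest pushed checkpoint from the Hugging Face Hub looks like this:

```python
# Generic Hub download sketch; "ORG/REPO" is a placeholder because the repo IDs
# in the tweet are truncated.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="ORG/REPO",   # placeholder for the checkpoint repo
    revision="main",      # re-running picks up newly pushed checkpoints
)
print("checkpoint files at:", local_dir)
```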

zed (@zmkzmkz) 's Twitter Profile Photo

Sorry for the late update. I bring disappointing news: softpick does NOT scale to larger models. Overall training loss and benchmark results are worse than softmax on our 1.8B parameter models. We have updated the preprint on arXiv: arxiv.org/abs/2504.20966
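
For readers who haven't seen the preprint: softpick is a rectified drop-in replacement for softmax in attention. The sketch below follows my recollection of the formulation (rectified numerator, absolute-value denominator); treat the arXiv link above as authoritative.

```python
# Softpick as I recall it from the preprint (verify against arxiv.org/abs/2504.20966):
# the rectified numerator lets scores be exactly zero, and the absolute-value
# denominator means outputs need not sum to one.
import torch

def softpick(x: torch.Tensor, dim: int = -1, eps: float = 1e-8) -> torch.Tensor:
    e = torch.expm1(x)  # e^x - 1, so a logit of 0 maps to exactly 0
    # Plain, non-numerically-stabilized form; production kernels handle overflow more carefully.
    return torch.relu(e) / (e.abs().sum(dim=dim, keepdim=True) + eps)
```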

Tiezhen WANG (@xianbao_qian) 's Twitter Profile Photo

RL training is too slow?

AReaL by Ant Research introduces an asynchronous approach that decouples generation from training to eliminate blocking. Combined with system-level optimizations, this method achieves up to 2.57× speedup.

Code open-sourced Yi Wu
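
AReaL's actual system is far more involved; the toy sketch below only illustrates the core idea described above, with generation running in its own thread and feeding a queue so the training loop never blocks on rollouts (all names are made up for illustration).

```python
# Toy illustration of decoupling generation from training (not AReaL's code).
import queue
import threading
import time

rollout_queue: "queue.Queue[dict]" = queue.Queue(maxsize=64)

def generator_loop(stop: threading.Event) -> None:
    """Stands in for rollout workers sampling from a (possibly slightly stale) policy."""
    step = 0
    while not stop.is_set():
        time.sleep(0.01)  # pretend to sample a trajectory
        rollout_queue.put({"step": step, "trajectory": f"rollout-{step}"})
        step += 1

def trainer_loop(num_updates: int) -> None:
    """Consumes whatever rollouts are ready instead of waiting for a synchronous batch."""
    for update in range(num_updates):
        batch = [rollout_queue.get() for _ in range(4)]  # pull 4 ready rollouts
        # ... compute the policy update on `batch` here ...
        print(f"update {update}: trained on {[b['trajectory'] for b in batch]}")

stop = threading.Event()
threading.Thread(target=generator_loop, args=(stop,), daemon=True).start()
trainer_loop(num_updates=3)
stop.set()
```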

Alexander Doria (@dorialexander) 's Twitter Profile Photo

Announcing the release of the official Common Corpus paper: a 20-page report detailing how we collected, processed, and published 2 trillion tokens of reusable data for LLM pretraining.

Tri Dao (@tri_dao) 's Twitter Profile Photo

State space models and RNNs compress history into a constant size state, while attn has KV cache scaling linearly in seqlen. We can instead start from RNNs and let the state size grow logarithmically with seqlen. Feels like a sweet spot. Also beautiful connection to classical
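
Making the comparison concrete with generic asymptotics (my framing, not figures from the linked work): for sequence length T and state/model width d,

```latex
\begin{aligned}
\text{full attention (KV cache):} \quad & M_{\text{attn}}(T) = O(T\,d) \\
\text{SSM / RNN (fixed state):}   \quad & M_{\text{rnn}}(T)  = O(d) \\
\text{log-growing state:}         \quad & M_{\text{log}}(T)  = O(d \log T)
\end{aligned}
```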

EleutherAI (@aieleuther) 's Twitter Profile Photo

Can you train a performant language model without using unlicensed text?

We are thrilled to announce the Common Pile v0.1, an 8TB dataset of openly licensed and public domain text. We train 7B models for 1T and 2T tokens and match the performance of similar models like LLaMA 1 & 2

snimu (@omouamoua) 's Twitter Profile Photo

New blog post. (I finally got around to writing down the negative results from my model-stacking experiments from a few months ago)

Lysandre (@lysandrejik) 's Twitter Profile Photo

We have heard the feedback about transformers being bloated with many layers of abstraction. With this, we expect to remove 50% of the code from the library. This will contribute to removing abstraction layers, aligned with our focus on simplification: x.com/art_zucker/sta…