farid (@faridlazuarda) 's Twitter Profile
farid

@faridlazuarda

ID: 924979076376301568

Website: https://faridlazuarda.github.io · Joined: 30-10-2017 12:39:31

3.3K Tweets

226 Followers

537 Following

Tiago Pimentel (@tpimentelms) 's Twitter Profile Photo

Mechanistic interpretability often relies on *interventions* to study how DNNs work. Are these interventions enough to guarantee the features we find are not spurious? No!⚠️ In our new paper, we show many mech int methods implicitly rely on the linear representation hypothesis🧵
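
For readers unfamiliar with the setup: interventions of this kind typically assume a feature is encoded as a linear direction in activation space, so "intervening" means adding or removing that direction from a hidden state. The sketch below is a generic illustration of such a linear intervention, not the paper's method; the tensor shapes and the feature direction are made up.

```python
import torch

def linear_intervention(hidden, direction, alpha=1.0):
    """Shift a hidden state along a (unit-norm) candidate feature direction.

    Under the linear representation hypothesis, adding or subtracting this
    direction is assumed to toggle the corresponding feature.
    """
    direction = direction / direction.norm()
    return hidden + alpha * direction

# Toy usage: a stand-in residual-stream activation and a random "feature".
hidden = torch.randn(1, 16, 768)      # (batch, seq, d_model)
feature_dir = torch.randn(768)
steered = linear_intervention(hidden, feature_dir, alpha=4.0)
```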
Nate Chen (@chengua46724992) 's Twitter Profile Photo

3 months ago, I discovered DeltaNet. I spent hours trying to understand it.

Feeling amazed, I shared the blog here on my x, which had less than 20 followers back then.

Then, @Songlin replied.

And that simple reply ended up shifting the trajectory of a 16 y/o's life. (a thread)
Christopher Potts (@chrisgpotts) 's Twitter Profile Photo

Dimitris Papailiopoulos Soham Daga I feel that these papers (from my group) are examples of what you are nominally asking for:
1. arxiv.org/abs/2505.20809
2. arxiv.org/abs/2505.15105
3. arxiv.org/abs/2505.13898
4. arxiv.org/abs/2501.17148
5. arxiv.org/abs/2505.11770
6. aclanthology.org/2024.emnlp-mai…
7.

hardmaru (@hardmaru) 's Twitter Profile Photo

Andrew Ng’s piece on 🇺🇸 vs 🇨🇳 competition in AI: “Because many US companies have taken a secretive approach to developing foundation models—a reasonable business strategy—the leading companies spend huge…to recruit key team members from each other who might know the ‘secret

Daniel Han (@danielhanchen) 's Twitter Profile Photo

OpenAI's OSS model possible breakdown:
1. 120B MoE 5B active + 20B text only
2. Trained with Float4 maybe Blackwell chips
3. SwiGLU clip (-7,7) like ReLU6
4. 128K context via YaRN from 4K
5. Sliding window 128 + attention sinks
6. Llama/Mixtral arch + biases

Details:
1. 120B MoE
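
Point 3 above, a SwiGLU feed-forward block whose activations are clipped to (-7, 7) in the spirit of ReLU6, can be sketched roughly as below. This is one plausible reading of the clipping, not the released implementation; layer sizes and parameter names are invented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClippedSwiGLU(nn.Module):
    """SwiGLU feed-forward block with activations clamped to [-limit, limit],
    analogous to ReLU6's hard cap (all values here are assumptions)."""

    def __init__(self, d_model: int, d_ff: int, limit: float = 7.0):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff)
        self.w_up = nn.Linear(d_model, d_ff)
        self.w_down = nn.Linear(d_ff, d_model)
        self.limit = limit

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = torch.clamp(F.silu(self.w_gate(x)), -self.limit, self.limit)
        up = torch.clamp(self.w_up(x), -self.limit, self.limit)
        return self.w_down(gate * up)

# Toy usage
ffn = ClippedSwiGLU(d_model=64, d_ff=256)
y = ffn(torch.randn(2, 10, 64))
```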
Google DeepMind (@googledeepmind) 's Twitter Profile Photo

For researchers, scientists, and academics tackling hard problems: Gemini 2.5 Deep Think is here. 🤯 It doesn't just answer, it brainstorms using parallel thinking and reinforcement learning techniques. We put it into the hands of mathematicians who explored what it can do ↓

Kainoa Lowman (@klowmn) 's Twitter Profile Photo

Maia Wasn't actually Habermas, but a Freudian social theorist named Karola Brede. Really fascinating article connecting Karp's PhD dissertation to Palantir boundary2.org/2020/07/moira-…

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic research: Persona vectors.

Language models sometimes go haywire and slip into weird and unsettling personas. Why? In a new paper, we find “persona vectors”: neural activity patterns controlling traits like evil, sycophancy, or hallucination.
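
As a rough illustration of the idea (not Anthropic's actual pipeline), a persona or steering vector is often computed as a difference of mean activations between prompts that elicit a trait and prompts that do not, then added to or subtracted from the residual stream. The recipe, layer choice, and shapes below are assumptions.

```python
import torch

def persona_vector(acts_with_trait, acts_without_trait):
    """Difference-of-means direction between activations gathered while the
    model exhibits a trait vs. while it does not (a common steering recipe)."""
    return acts_with_trait.mean(dim=0) - acts_without_trait.mean(dim=0)

def steer(hidden, vector, strength=1.0):
    """Add (or, with negative strength, suppress) the trait direction."""
    return hidden + strength * vector

# Toy tensors standing in for collected residual-stream activations.
with_trait = torch.randn(100, 768)
without_trait = torch.randn(100, 768)
v = persona_vector(with_trait, without_trait)
suppressed = steer(torch.randn(1, 768), v, strength=-2.0)
```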
bubble boi (@bubblebabyboi) 's Twitter Profile Photo

It is wild to me how little Deep Learning researchers know about basic statistical theory. Everyone acts like all to all attention is a free lunch while basic stats has shown many better ways to capture long range dependencies instead of comparing every token to each other.

Dan Nystedt (@dnystedt) 's Twitter Profile Photo

Four TSMC 2nm fabs will be in mass production next year and monthly capacity over 60,000 wafers-per-month (wpm), media report, citing unnamed supply chain sources. 2nm wafers cost US$30,000 each, 50% more expensive than 3nm. 1/2 $TSM $SSNLF $INTC #semiconductors #2nm

Dimitris Papailiopoulos (@dimitrispapail) 's Twitter Profile Photo

Excited about our new work: 
Language models develop computational circuits that are reusable AND TRANSFER across tasks.
Over a year ago, I tested GPT-4 on 200-digit addition, and the model managed to do it (without CoT!). Someone from OpenAI even clarified they NEVER trained
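
The 200-digit addition test is easy to reproduce in spirit: sample two random 200-digit integers, prompt the model, and check the reply against exact big-integer arithmetic. The sketch below does just that; query_model is a hypothetical stand-in for whichever API is used.

```python
import random

def make_addition_prompt(n_digits: int = 200):
    """Return a prompt asking for the sum of two random n-digit integers,
    plus the exact answer computed with Python's arbitrary-precision ints."""
    a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    return f"Compute {a} + {b}. Answer with only the number.", a + b

def check(model_answer: str, expected: int) -> bool:
    digits = "".join(ch for ch in model_answer if ch.isdigit())
    return digits == str(expected)

prompt, expected = make_addition_prompt()
# answer = query_model(prompt)   # hypothetical model call
# print(check(answer, expected))
```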
Dimitri von Rütte (@dvruette) 's Twitter Profile Photo

gpt-oss is probably the most standard MoE transformer that ever was. Couple of details worth noting:
- Uses attention sinks (a.k.a. registers)
- Sliding window attention in every second layer
- YaRN context window extension
- RMSNorm without biases
- No QK norm, no attn. softcap
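
A minimal sketch of the "sliding window attention in every second layer" pattern, building boolean attention masks that alternate between a windowed causal mask and a full causal mask. The window size (128) is taken from the breakdown earlier in this feed; which parity of layers gets the window is an assumption.

```python
from typing import Optional

import torch

def causal_mask(seq_len: int, window: Optional[int] = None) -> torch.Tensor:
    """Boolean mask: True where attention is allowed.
    window=None gives full causal attention; otherwise each token sees only
    the last `window` positions (sliding window attention)."""
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    mask = j <= i                            # causal constraint
    if window is not None:
        mask &= (i - j) < window             # sliding-window constraint
    return mask

# Assumption: even layers use a 128-token window, odd layers full attention.
n_layers, seq_len = 6, 512
masks = [causal_mask(seq_len, 128 if layer % 2 == 0 else None)
         for layer in range(n_layers)]
```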
Xiangming Gu @ ICLR 2025 (@gu_xiangming) 's Twitter Profile Photo

I noticed that OpenAI added learnable bias to attention logits before softmax. After softmax, they deleted the bias. This is similar to what I have done in my ICLR2025 paper: openreview.net/forum?id=78Nn4….
I used a learnable key bias and set the corresponding value bias to zero. In this way,
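
A minimal sketch of the mechanism described here: one learnable logit per head is appended to the attention scores before the softmax and dropped afterwards, so it absorbs probability mass without contributing any value, which is equivalent to a learnable key whose value is fixed at zero. Shapes and names are illustrative, not OpenAI's code; causal masking is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def attention_with_sink(q, k, v, sink_logit):
    """q, k, v: (heads, seq, d_head); sink_logit: one learnable scalar per head."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5           # (heads, seq, seq)
    sink = sink_logit.view(-1, 1, 1).expand(-1, scores.shape[1], 1)
    scores = torch.cat([scores, sink], dim=-1)            # extra "sink" column
    probs = F.softmax(scores, dim=-1)                     # softmax over keys + sink
    probs = probs[..., :-1]                               # drop the sink column
    return probs @ v                                      # sink mass maps to no value

# Toy usage
h, t, d = 4, 8, 16
out = attention_with_sink(torch.randn(h, t, d), torch.randn(h, t, d),
                          torch.randn(h, t, d), torch.zeros(h))
```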
Wenhao Chai (@wenhaocha1) 's Twitter Profile Photo

Deep dive into sink values in the GPT-OSS models!
Analyzed the 20B (24 layers) and 120B (36 layers) models and found (correct me if I'm wrong):
Key findings:
1. The 20B model has a larger sink value (20B: mean=2.45, 120B: mean=1.93).
2. Clear swa/full-attn layer alternation: full-attn layers
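
If you want to reproduce this kind of per-layer sink statistic, one way is to load the checkpoint and inspect any parameters whose names contain "sink". The model id and the parameter-naming filter below are assumptions; adjust them to the actual state dict (and note the 20B checkpoint is large to load).

```python
import torch
from transformers import AutoModelForCausalLM

# Model id and the "sink" naming filter are assumptions; adapt as needed.
model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b", torch_dtype=torch.bfloat16
)

sink_stats = []
for name, param in model.named_parameters():
    if "sink" in name.lower():
        values = param.detach().float().flatten()
        sink_stats.append((name, values.mean().item(), values.max().item()))

for name, mean, mx in sink_stats:
    print(f"{name}: mean={mean:.2f} max={mx:.2f}")

if sink_stats:
    overall = sum(m for _, m, _ in sink_stats) / len(sink_stats)
    print(f"overall mean sink value: {overall:.2f}")
```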
Guangxuan Xiao (@guangxuan_xiao) 's Twitter Profile Photo

I've written the full story of Attention Sinks — a technical deep-dive into how the mechanism was developed and how our research ended up being used in OpenAI's new OSS models.

For those interested in the details:
hanlab.mit.edu/blog/streaming…