Neil Chowdhury (@chowdhuryneil)'s Twitter Profile
Neil Chowdhury

@chowdhuryneil

@TransluceAI, previously @OpenAI

ID: 741803354888622081

Link: https://nchowdhury.com/ · Joined: 12-06-2016 01:24:36

261 Tweets

2.2K Followers

364 Following

METR (@metr_evals)'s Twitter Profile Photo

When will AI systems be able to carry out long projects independently? In new research, we find a kind of “Moore’s Law for AI agents”: the length of tasks that AIs can do is doubling about every 7 months.

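The doubling-time claim above implies a simple exponential growth model. A minimal sketch of that arithmetic, where the starting task length and horizon are illustrative assumptions (not figures from the thread):

```python
# Sketch of the "doubling time" arithmetic behind METR's claim: if the task
# length AI agents can complete doubles roughly every 7 months, the length
# after t months is length_0 * 2**(t / 7). Starting length is a placeholder.

def task_length_after(months: float, start_minutes: float = 60.0,
                      doubling_months: float = 7.0) -> float:
    """Task length (minutes) reachable after `months` of exponential growth."""
    return start_minutes * 2 ** (months / doubling_months)

# Example: starting from 1-hour tasks, 28 months is 4 doublings,
# so agents would handle 16-hour tasks under this trend.
print(task_length_after(28))  # 960.0 minutes
```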
Krithik Ramesh (@krithiktweets)'s Twitter Profile Photo

🧬 Meet Lyra, a new paradigm for accessible, powerful modeling of biological sequences. Lyra is a lightweight SSM achieving SOTA performance across DNA, RNA, and protein tasks—yet up to 120,000x smaller than foundation models (ESM, Evo). Bonus: you can train it on your Mac.

Neil Chowdhury (@chowdhuryneil)'s Twitter Profile Photo

Having worked a lot on evaluating agents, I can say that manually reading through actual transcripts is core to understanding bottlenecks & finding bugs. I've found that Docent makes this much easier!

Wojciech Zaremba (@woj_zaremba)'s Twitter Profile Photo

We're entering an era where AI outputs are becoming so vast, humans alone can't analyze them. Today's LLMs produce tens of thousands of tokens per task—but complex challenges like comprehensive cancer research, inventing novel molecules, or building entire codebases will soon

Neil Chowdhury (@chowdhuryneil)'s Twitter Profile Photo

Contamination has been a concern for ~every eval I've worked on, but it's hard to get quantitative signals on. See Kevin Meng's thread on finding subtle, *qualitative* evidence of contamination using Docent!

OpenAI (@openai)'s Twitter Profile Photo

We’re releasing PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research, as part of our Preparedness Framework. Agents must replicate top ICML 2024 papers, including understanding the paper, writing code, and executing experiments.

Transluce (@transluceai)'s Twitter Profile Photo

We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🔎🧵(1/) x.com/OpenAI/status/…

Daniel Johnson (@_ddjohnson)'s Twitter Profile Photo

Pretty striking follow-up finding from our o3 investigations: in the chain of thought summary, o3 plans to tell the truth — but then it makes something up anyway!

Neil Chowdhury (@chowdhuryneil)'s Twitter Profile Photo

Our MLE-bench poster #367 is up till 12:30pm in Hall 3, and our oral presentation is at 3:30pm today in Garnet 213-215. Come say hi!

Nathan Lambert (@natolambert)'s Twitter Profile Photo

The ChatGPT sycophancy thing shows that RLHF is hard and its challenges aren't going away any time soon. It's required to make these models we love. It's being ignored amid the other hype around RL. The RLVR stuff will "saturate" like pretraining, and RLHF is never fully solved.