Arjun Guha (@arjunguha) Twitter Tweets • TwiCopy

Leandro von Werra

10 months ago

Introducing DABStep: Data Agent Benchmark for multi-step reasoning. We teamed up with Adyen to test if current LLMs can solve *hard* and *real-world* data analysis tasks. TL;DR: No! They often fail to read the manual or debug errors. The best model only gets 16% right!

Arjun Guha

@arjunguha

9 months ago

There is a fundamental misunderstanding here. PhD students do not complete assigned tasks. arstechnica.com/ai/2025/03/wha…

David Bau

@davidbau

8 months ago

Why is interpretability the key to dominance in AI? Not winning the scaling race, or banning China. Our answer to OSTP/NSF, w/ Goodfire's Tom McGrath, Transluce's Sarah Schwettmann, and MIT's Dylan HadfieldMenell: resilience.baulab.info/docs/AI_Action… Here's why:🧵 ↘️

Arjun Guha

@arjunguha

6 months ago

A peculiar refusal.

Cursor

@cursor_ai

6 months ago

A conversation on the optimal reward for coding agents, infinite context models, and real-time RL

Arjun Guha

@arjunguha

6 months ago

This is a short note on my experience as an immigrant and new American. The timeline is this:
- 2002: moved from India to attend Grinnell College, Iowa
- 2006: started PhD in computer science at Brown University, Rhode Island
- 2012: started as a postdoc scholar at Cornell

Arjun Guha

@arjunguha

6 months ago

Maribor tap water may rival Boston tap water.

Arjun Guha

@arjunguha

6 months ago

I hire fewer TAs because I can rapidly complete tasks with GenAI that would require long back-and-forth with a TA. I have to validate LLM output, but 1) I read fast and 2) I have to validate junior TA output too. Negative effect is obvious: human TA training is good for

Arjun Guha

@arjunguha

5 months ago

I’m considering returning to take-home, open-book exams. Also open Google and open ChatGPT. To make this work, the exam will have questions that are beyond the scope of the class. This is akin to a test of scalable oversight. Has anyone tried this?

Arjun Guha

@arjunguha

5 months ago

The recent *Your Brain on ChatGPT* paper is cool, from the little that I understand of it. To this day, when an undergraduate approaches me to do research, I tell them to read a prefix of the PLAI book (1st ed.), code it up, and then demonstrate to me that they understand it. I

Arjun Guha

@arjunguha

5 months ago

When a model reports a single score on MultiPL-E, which languages are being considered for the average? I don't think it's all 18, or the 22-25 now supported. Is it the seven languages that Code Llama decided to measure?

Arjun Guha

@arjunguha

5 months ago

High-quality university education

Arjun Guha

@arjunguha

5 months ago

These kinds of statements are true for a tiny number of people. If you can figure out how to do research by yourself, that’s amazing. I needed a lot of training, and most people do. It’s great that we have a culture that both does not care about credentials, and also lets people

Khoury College of Computer Sciences

@khourycollege

5 months ago

After building and burnishing their research chops at Khoury College, nine PhD graduates and former postdoctoral fellows are beginning their careers as professors this year. To hear more about their stories: bit.ly/4eBLedy

Arjun Guha

@arjunguha

4 months ago

Can any GPU experts tell me why this is happening?

Arjun Guha

@arjunguha

4 months ago

ChatGPT is *very excited* that I use ZFS.

Arjun Guha

@arjunguha

4 months ago

Flashback to useless content from introductory Java: I just wrote my own buffered line reader, with several non-standard bells and whistles. This is in service of PL and machine learning.

Arjun Guha

@arjunguha

4 months ago

What does one do on sabbatical? I skipped my last one to move to Northeastern.
