Ohad Rubin (@ohadrubin) Twitter Tweets • TwiCopy

Ohad Rubin

@ohadrubin


PhD student. Researching Natural Language Processing at Tel Aviv University. Let's have more paperclips? 📎⏩

ID: 28635924

Link: https://ohadrubin.github.io/

Joined: 03-04-2009 19:42:47

3.3K Tweets

841 Followers

3.3K Following

Ohad Rubin

@ohadrubin

18 days ago

like many genius people, gpt-5 (Auto) is displaying high variance.

Likes: 0

Replies: 0

Retweets: 0

Tomer Wolfson

@tomerwolfson

18 days ago

Deep research systems can't handle questions involving dozens of documents. Let me show you why this is (still) true 🧵and what does it all have to do with Grace Kelly? (1/)

Likes: 16

Replies: 1

Retweets: 6

Ohad Rubin

@ohadrubin

15 days ago

Some users can't be helped...

Likes: 0

Replies: 0

Retweets: 0

Daniel Nakov

@dnak0v

14 days ago

⎿ Read 20 lines (ctrl+r to expand) ⏺ Perfect! I now have a clear understanding of the codebase.

Likes: 106

Replies: 8

Retweets: 3

Bepis™ 🔀

@underwaterbepis

14 days ago

Neurosama (LLM Vtuber) was just called a cl*ank*r and in response brute forced her filter to send death threats in response and then got stuck in a loop about wanting to stop existing

Likes: 309

Replies: 10

Retweets: 13

Ohad Rubin

@ohadrubin

14 days ago

Yeah, it's called doing a PhD

Likes: 3

Replies: 0

Retweets: 0

Xeophon

@thexeophon

14 days ago

for someone getting into RL, what are some good seeds?

Likes: 202

Replies: 25

Retweets: 6

just remove the following string from claude-code cli.js and it will always just read full files: offset:h.number().optional().describe("The line number to start reading from. Only provide if the file is too large to read at once"),limit:h.number().optional().describe("The

Likes: 111

Replies: 7

Retweets: 9

Ross Taylor

@rosstaylor90

13 days ago

Most takes on RL environments are bad. 1. There are hardly any high-quality RL environments and evals available. Most agentic environments and evals are flawed when you look at the details. It’s a crisis: and no one is talking about it because they’re being hoodwinked by labs

Likes: 701

Replies: 30

Retweets: 46

Ohad Rubin

@ohadrubin

13 days ago

I can't state how amazing of an idea this UE8M0 is. Fuck negative numbers. Fuck the mantissa. All you need is exponent.

Likes: 2

Replies: 0

Retweets: 0
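
For readers unfamiliar with the format mentioned in the tweet above, here is a minimal sketch of an exponent-only 8-bit encoding: no sign bit, no mantissa, so every value is a positive power of two. The bias of 127 and the reserved NaN code 0xFF are assumptions taken from the E8M0 scale type in the OCP Microscaling Formats specification; the tweet itself gives no details. The function names and the round-to-nearest-exponent policy are illustrative choices, not any library's API.

```python
import math

E8M0_BIAS = 127   # assumed exponent bias (as in the OCP MX E8M0 scale type)
E8M0_NAN = 0xFF   # assumed: the all-ones code is reserved for NaN


def encode_ue8m0(x: float) -> int:
    """Map a positive finite float to an 8-bit exponent-only code.

    The value is rounded to the nearest power of two in log2 space
    (an illustrative rounding policy), then clamped to codes 0..254.
    """
    if math.isnan(x):
        return E8M0_NAN
    if x <= 0 or math.isinf(x):
        raise ValueError("only positive finite values are representable")
    exponent = round(math.log2(x))
    return max(0, min(254, exponent + E8M0_BIAS))


def decode_ue8m0(code: int) -> float:
    """Return the power of two that an 8-bit code stands for."""
    if not 0 <= code <= 0xFF:
        raise ValueError("code must fit in 8 bits")
    if code == E8M0_NAN:
        return math.nan
    return math.ldexp(1.0, code - E8M0_BIAS)  # 2 ** (code - bias)
```

With these assumptions, `encode_ue8m0(1.0)` gives code 127, and a value such as 3.0 is stored as the nearby power of two 4.0. That loss of precision is why such a format suits per-block scale factors rather than the tensor elements themselves.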

jason liu - vacation mode

@jxnlco

10 days ago

why dspy usually wastes your time (and when it doesn't) the question: "should i use dspy for prompt optimization? it seems like the perfect tool for improving my rag system." the answer: dspy is great for very specific, well-defined tasks. but for most rag systems, it's a

Likes: 380

Replies: 30

Retweets: 20

Ohad Rubin

@ohadrubin

8 days ago

this figure needs updating

Likes: 0

Replies: 0

Retweets: 0

Ohad Rubin

@ohadrubin

7 days ago

If only there was a benchmark that tests this capability

Likes: 0

Replies: 0

Retweets: 0

xlr8harder

@xlr8harder

7 days ago

Secret model nerfing paranoia will never recover from this

Likes: 1.1K

Replies: 57

Retweets: 59

Miles Cranmer

@milescranmer

6 days ago

Neil Lawrence I have a blanket ban on "rm" and only let my LLM use "rip"! github.com/MilesCranmer/r…

Likes: 253

Replies: 7

Retweets: 15

Ohad Rubin

@ohadrubin

5 days ago

If only he could use it at work lol

Likes: 1

Replies: 0

Retweets: 0

Ohad Rubin

@ohadrubin

5 days ago

Anyone else feeling the same? It's a bit annoying that existing benchmarks like SWEBench don't capture this reward-hacking.

Likes: 0

Replies: 0

Retweets: 0

Ohad Rubin

@ohadrubin

5 days ago

I don't understand why people keep saying that men don't see therapists, all the girls I know who study psychology are seeing someone

Likes: 2

Replies: 0

Retweets: 0

Graham Neubig

@gneubig

3 days ago

Which LM is better at agentic coding? We have a bunch of useful academic benchmarks like SWE-Bench, but we don't have a good comparison of agentic coding LMs *in the wild*. To solve this, we released PR Arena: github.com/neulab/pr-arena

Likes: 122

Replies: 7

Retweets: 20

Aran Komatsuzaki

@arankomatsuzaki

3 days ago

Unfortunate reality: most open-source LLM servers (e.g. Together) don’t offer cache-hit discounts, while closed providers like OpenAI do. DeepSeek does discount, but most third-party servers don't. Closed models can end up much cheaper than open ones :(

Likes: 241

Replies: 25

Retweets: 13