Sergey Feldman (@sergeyfeldman) Twitter Tweets • TwiCopy

Nandan Thakur

8 months ago

Existing RAG benchmarks rely on automatic models for support, but can LLMs potentially replace human assessors? 📚 Our new large-scale study compares human vs LLM judges with 11K human annotations, 36 topics, 45+ RAG answers in the TREC 2024 RAG Track. To appear in #SIGIR2025⚡️

Likes: 43 · Replies: 3 · Reposts: 6

Aakanksha Naik

@arnaik19

8 months ago

🚨Attn #NLProc and friends! 🚨 Our shared task on learning to contextualize scientific claims is back! We have some exciting new updates that make the task harder and the evaluation more thorough! Made w Joel Chan and Matthew Akamatsu (1/4)

Likes: 16 · Replies: 1 · Reposts: 6

Ronak Pradeep

@rpradeep42

8 months ago

Evaluating RAG fact recall is crucial but manual eval can be slow! Can LLMs reliably automate fact extraction & eval for RAG systems? 📚 Our new large scale study, The Great Nugget Recall, provides extensive analysis using the AutoNuggetizer framework on TREC RAG @ 2025! In #SIGIR25🇮🇹

Likes: 57 · Replies: 1 · Reposts: 9

Ruotong Wang

@ruotongwang1

8 months ago

AI agents are entering online social spaces, but often their messages feel generic or intrusive. In our #CHI25 paper, we introduce Social-RAG, a workflow that grounds AI generations in the specific group context by retrieving from the group’s interaction history. 🧵(1/9)

Likes: 86 · Replies: 2 · Reposts: 20

Ai2

@allen_ai

7 months ago

We’re live on Reddit! Ask us Anything about our OLMo family of models. We have six of our researchers on hand to answer all your questions.

Likes: 85 · Replies: 3 · Reposts: 29

Aaron Tay

@aarontay

7 months ago

[Blogged] Ai2 Paper Finder and Futurehouse PaperQA2: More transparent Deep Search for Scholars? musingsaboutlibrarianship.blogspot.com/2025/05/ai2-pa…

Likes: 7 · Replies: 0 · Reposts: 4

Semantic Scholar Research @ AI2

@ai2_s2research

7 months ago

Ai2 Semantic Scholar is hiring an #ml #nlp #ai reasoning researcher for a Research Scientist, Agents for Science position with target start dates in 2025. Excited about developing AI systems with deep reasoning capabilities for science? Send an application our way!

Likes: 18 · Replies: 0 · Reposts: 9

Charlie Marsh

@charliermarsh

7 months ago

You can set `UV_TORCH_BACKEND=auto` and uv will automatically install the right CUDA-enabled PyTorch for your machine, zero configuration

Likes: 2.2K · Replies: 73 · Reposts: 230

Ai2

@allen_ai

7 months ago

We're hiring an #ML #NLP #AI reasoning researcher for a Research Scientist, Agents for Science position! Excited about developing AI systems with deep reasoning capabilities for science? 🧑‍🔬

Likes: 24 · Replies: 1 · Reposts: 3

alex rubinsteyn

@iskander

6 months ago

OpenAI I think "AI" has mostly affected me in two ways: (1) I spend less time trying to remember / explore / search my way into the correct parameters for bad APIs (eg matplotlib, seaborn, some shell commands) (2) I have a low-sensitivity / high specificity literature search assistant

Likes: 2 · Replies: 0 · Reposts: 3

Benjamin Todd

@ben_j_todd

6 months ago

Why can AIs code for 1h but not 10h? A simple explanation: if there's a 10% chance of error per 10min step (say), the success rate is: 1h: 53%; 4h: 8%; 10h: 0.002%. Toby Ord has tested this 'constant error rate' theory and shown it's a good fit for the data. chance of…
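The arithmetic in the post above can be checked with a minimal sketch. It assumes each 10-minute step fails independently with probability 10%, so a task of n steps succeeds with probability 0.9^n; the function name `success_rate` and its parameters are illustrative, not taken from the post or from Toby Ord's analysis.

```python
def success_rate(hours, step_minutes=10, error_per_step=0.10):
    """Chance of finishing with no errors, assuming independent steps.

    Illustrative helper: the name and defaults are assumptions that mirror
    the numbers quoted in the post (10% error per 10-minute step).
    """
    steps = round(hours * 60 / step_minutes)
    return (1 - error_per_step) ** steps


for hours in (1, 4, 10):
    # Print each success probability as a percentage with two decimals.
    print(f"{hours}h: {success_rate(hours):.2%}")
```

Under these assumptions the 1h and 4h figures match the post (about 53% and 8%). The 10h figure comes out near 0.18%, that is, a probability of roughly 0.002, so the "0.002%" in the post looks like a fraction quoted as a percentage.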

Likes: 1.1K · Replies: 71 · Reposts: 150

Graham Neubig

@gneubig

6 months ago

Many people have been asking for an interface to OpenHands that is: 1. easy to install (no docker) 2. can be used in your standard development environment This new CLI checks both of these boxes, and is fun to use!

Likes: 68 · Replies: 3 · Reposts: 8

Ai2

@allen_ai

6 months ago

New updates for olmOCR, our fully open toolkit for transforming documents (PDFs & images) into clean markdown. We released: 1️⃣ New benchmark for fair comparison of OCR engines and APIs 2️⃣ Improved inference that is faster and cheaper to run 3️⃣ Docker image for easy deployment

Likes: 286 · Replies: 7 · Reposts: 40

Jo Kristian Bergum

@jobergum

6 months ago

I have to admit that it’s cool to see an AI lab that I look up to use retrieval technology built in Trondheim

Likes: 99 · Replies: 1 · Reposts: 8

Data Shama

@thedatashaman

6 months ago

you’re on cursor? you’re still using windsurf? you might as well be on github copilot. everyone’s on aider. we’re all using zed. we’re now on open hands. open hands is for losers, just kidding we’re using cline. we’re on roocode. we’re hand rolling our own claude code cli clone.

Likes: 18 · Replies: 2 · Reposts: 2

Sergey Feldman

@sergeyfeldman

6 months ago

"if you define jobs in terms of tasks maybe you're actually defining away the most nuanced and hardest-to-automate aspects of jobs, which are at the boundaries between tasks."

Likes: 0 · Replies: 0 · Reposts: 0

Xeophon

@thexeophon

6 months ago

who would win? 1) Claude writing code for hours, downloading gigs of data, resulting in the most cursed analysis known to man 2) A human looking at the data for 5 minutes I am sad to admit that 1) led to 2) and will now start from scratch

Likes: 53 · Replies: 4 · Reposts: 1

Ai2

@allen_ai

6 months ago

Introducing SciArena, a platform for benchmarking models across scientific literature tasks. Inspired by Chatbot Arena, SciArena applies a crowdsourced LLM evaluation approach to the scientific domain. 🧵

Likes: 381 · Replies: 12 · Reposts: 63

Nathan Lambert

@natolambert

5 months ago

i'm really going out on a limb here saying I like health & wellness, normal circadian rhythms, and would resign if I built a Nazi bot

Likes: 139 · Replies: 1 · Reposts: 5

Quentin Anthony

@quentinanthon15

5 months ago

Firstly, I think AI speedup is very weakly correlated to anyone's ability as a dev. All the devs in this study are very good. I think it has more to do with falling into failure modes, both in the LLM's ability and the human's workflow. I work with a ton of amazing pretraining

Likes: 801 · Replies: 6 · Reposts: 26