Sergey Feldman (@sergeyfeldman) 's Twitter Profile
Sergey Feldman

@sergeyfeldman

ML/AI at semanticscholar.org, alongside.care, data-cowboys.com, @sergeyf.bsky.social

ID: 792049825

Joined: 30-08-2012 17:32:14

1.1K Tweets

416 Followers

342 Following

Nandan Thakur (@beirmug) 's Twitter Profile Photo

Existing RAG benchmarks rely on automatic models for support, but can LLMs potentially replace human assessors? 📚 Our new large-scale study compares human vs LLM judges with 11K human annotations, 36 topics, 45+ RAG answers in the TREC 2024 RAG Track. To appear in #SIGIR2025⚡️

Aakanksha Naik (@arnaik19) 's Twitter Profile Photo

🚨Attn #NLProc and friends! 🚨 Our shared task on learning to contextualize scientific claims is back! We have some exciting new updates that make the task harder and the evaluation more thorough! Made w Joel Chan | 🦣: [email protected] and Matthew Akamatsu (1/4)

Ronak Pradeep (@rpradeep42) 's Twitter Profile Photo

Evaluating RAG fact recall is crucial but manual eval can be slow! Can LLMs reliably automate fact extraction & eval for RAG systems? 📚 Our new large scale study, The Great Nugget Recall, provides extensive analysis using the AutoNuggetizer framework on TREC RAG @ 2025! In #SIGIR25🇮🇹

Ruotong Wang (@ruotongwang1) 's Twitter Profile Photo

AI agents are entering online social spaces, but often their messages feel generic or intrusive. In our #CHI25 paper, we introduce Social-RAG, a workflow that grounds AI generations in the specific group context by retrieving from the group’s interaction history. 🧵(1/9)

Ai2 (@allen_ai) 's Twitter Profile Photo

We’re live on Reddit! Ask us Anything about our OLMo family of models. We have six of our researchers on hand to answer all your questions.

Aaron Tay (@aarontay) 's Twitter Profile Photo

[Blogged] Ai2 Paper Finder and Futurehouse PaperQA2: More transparent Deep Search for Scholars? musingsaboutlibrarianship.blogspot.com/2025/05/ai2-pa…

Semantic Scholar Research @ AI2 (@ai2_s2research) 's Twitter Profile Photo

Ai2 Semantic Scholar is hiring an #ml #nlp #ai reasoning researcher for a Research Scientist, Agents for Science position with target start dates in 2025. Excited about developing AI systems with deep reasoning capabilities for science? Send an application our way!

Charlie Marsh (@charliermarsh) 's Twitter Profile Photo

You can set `UV_TORCH_BACKEND=auto` and uv will automatically install the right CUDA-enabled PyTorch for your machine, zero configuration

Ai2 (@allen_ai) 's Twitter Profile Photo

We're hiring an #ML #NLP #AI reasoning researcher for a Research Scientist, Agents for Science position! Excited about developing AI systems with deep reasoning capabilities for science? 🧑‍🔬

alex rubinsteyn (@iskander) 's Twitter Profile Photo

OpenAI I think "AI" has mostly affected me in two ways: (1) I spend less time trying to remember / explore / search my way into the correct parameters for bad APIs (eg matplotlib, seaborn, some shell commands) (2) I have a low-sensitivity / high specificity literature search assistant

Benjamin Todd (@ben_j_todd) 's Twitter Profile Photo

Why can AIs code for 1h but not 10h? A simple explanation: if there's a 10% chance of error per 10min step (say), the success rate is: 1h: 53% 4h: 8% 10h: 0.002% Toby Ord has tested this 'constant error rate' theory and shown it's a good fit for the data chance of
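
The arithmetic in the tweet above can be checked with a minimal sketch. It assumes a fixed, independent 10% error chance per 10-minute step, as the tweet states; the function name and default parameters are illustrative only. Under those assumptions the 1h and 4h figures match, while the 10h figure comes out near 0.18% (about 0.002 as a fraction), so the quoted "0.002%" may be a fraction written as a percentage.

```python
def success_rate(hours: float,
                 step_minutes: float = 10.0,
                 error_per_step: float = 0.10) -> float:
    """Probability that every step of a task succeeds.

    Assumes a constant, independent error chance per step
    (the 'constant error rate' model described in the tweet).
    """
    steps = hours * 60.0 / step_minutes
    return (1.0 - error_per_step) ** steps


# Print the success rate for the three task lengths from the tweet.
for hours in (1, 4, 10):
    print(f"{hours}h: {success_rate(hours):.2%}")
```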

Graham Neubig (@gneubig) 's Twitter Profile Photo

Many people have been asking for an interface to OpenHands that is: 1. easy to install (no docker) 2. can be used in your standard development environment This new CLI checks both of these boxes, and is fun to use!

Ai2 (@allen_ai) 's Twitter Profile Photo

New updates for olmOCR, our fully open toolkit for transforming documents (PDFs & images) into clean markdown. We released: 1️⃣ New benchmark for fair comparison of OCR engines and APIs 2️⃣ Improved inference that is faster and cheaper to run 3️⃣ Docker image for easy deployment

Data Shama (@thedatashaman) 's Twitter Profile Photo

you’re on cursor? you’re still using windsurf? you might as well be on github copilot. everyone’s on aider. we’re all using zed. we’re now on open hands. open hands is for losers, just kidding we’re using cline. we’re on roocode. we’re hand rolling our own claude code cli clone.

Sergey Feldman (@sergeyfeldman) 's Twitter Profile Photo

"if you define jobs in terms of tasks maybe you're actually defining away the most nuanced and hardest-to-automate aspects of jobs, which are at the boundaries between tasks."

Xeophon (@thexeophon) 's Twitter Profile Photo

who would win? 1) Claude writing code for hours, downloading gigs of data, resulting in the most cursed analysis known to man 2) A human looking at the data for 5 minutes I am sad to admit that 1) led to 2) and will now start from scratch

Ai2 (@allen_ai) 's Twitter Profile Photo

Introducing SciArena, a platform for benchmarking models across scientific literature tasks. Inspired by Chatbot Arena, SciArena applies a crowdsourced LLM evaluation approach to the scientific domain. 🧵

Nathan Lambert (@natolambert) 's Twitter Profile Photo

i'm really going out on a limb here saying I like health & wellness, normal circadian rhythms, and would resign if I built a Nazi bot

Quentin Anthony (@quentinanthon15) 's Twitter Profile Photo

Firstly, I think AI speedup is very weakly correlated to anyone's ability as a dev. All the devs in this study are very good. I think it has more to do with falling into failure modes, both in the LLM's ability and the human's workflow. I work with a ton of amazing pretraining