Jarrod Barnes (@jarrodbarnes)'s Twitter Profile
Jarrod Barnes

@jarrodbarnes

Building the AI memory layer for engineers | Open‑source & ML nerd | Prev. Professor @nyuniversity | 🏈 @OhioStateFB Alum

ID: 1021413747875811328

Link: https://www.linkedin.com/in/jarrodbarnes | Joined: 23-07-2018 15:16:29

595 Tweets

1.1K Followers

1.1K Following

OpenAI (@openai):

We’re releasing PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research, as part of our Preparedness Framework.

Agents must replicate top ICML 2024 papers, including understanding the paper, writing code, and executing experiments.
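PaperBench grades replication attempts against hierarchical rubrics, so a weighted tree is the natural mental model for how a single score falls out. A minimal sketch of that aggregation, assuming an illustrative node structure (the field names and weights below are ours, not PaperBench's actual schema):

```python
# Minimal sketch: aggregating a hierarchical replication rubric.
# Node structure and weights are illustrative, not PaperBench's schema.
from dataclasses import dataclass, field

@dataclass
class RubricNode:
    name: str
    weight: float                  # relative weight among siblings
    passed: bool | None = None     # leaf verdict from a judge (None for internal nodes)
    children: list["RubricNode"] = field(default_factory=list)

def score(node: RubricNode) -> float:
    """Return a 0..1 replication score, weighting child scores."""
    if not node.children:
        return 1.0 if node.passed else 0.0
    total = sum(c.weight for c in node.children)
    return sum(c.weight * score(c) for c in node.children) / total

rubric = RubricNode("paper", 1.0, children=[
    RubricNode("code correctness", 0.5, passed=True),
    RubricNode("experiments executed", 0.3, passed=False),
    RubricNode("results match paper", 0.2, passed=False),
])
print(f"replication score: {score(rubric):.2f}")  # 0.50
```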
Guillermo Rauch (@rauchg):

Will we hire engineers in the future? Prompting is thinking. The sharper your thoughts, the better the prompts, the better the outputs.

Engineering and design are headed into a world where most of the output is produced by prompting, if not all. The way we interview engineers…

Cognition (@cognition_labs):

Our research interns present:
Kevin-32B = K(ernel D)evin

It's the first open model trained using RL for writing CUDA kernels. We implemented multi-turn RL using GRPO (based on QwQ-32B) on the KernelBench dataset.

It outperforms top reasoning models (o3 & o4-mini)! 🧵
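GRPO's core trick is replacing a learned value baseline with group statistics: sample a group of kernels per prompt, score each, and normalize rewards within the group. A minimal sketch of that advantage computation; the reward numbers are made up, and Kevin-32B's actual multi-turn loop (feeding compile and benchmark results back between turns) is richer than this:

```python
# Sketch of GRPO's group-relative advantage, the core idea the tweet names.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) rewards for G completions sampled from one prompt.
    GRPO replaces a learned value baseline with group statistics."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# e.g. rewards from a kernel benchmark: speedup over the reference,
# 0 if it fails to compile (values invented for illustration)
rewards = torch.tensor([0.0, 1.3, 0.9, 2.1])
print(grpo_advantages(rewards))
```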
Haider. (@slow_developer):

Anthropic CPO Mike Krieger: "over 70% of Anthropic pull requests are now generated by AI", but we're still figuring out what that means for code review and long-term architecture.

Visual Studio Code (@code):

Today, we're announcing plans to make VS Code an open source AI editor.

We believe AI development should stay true to VS Code's core principles: open, collaborative, and community-driven. Let's build the future of software development together.

aka.ms/open-source-ai…
NVIDIA AI Developer (@nvidiaaidev):

📣 Introducing AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning (RL)

Starting from the SFT model DeepSeek-R1-Distill-Qwen-14B, our AceReason-Nemotron-14B achieves substantial improvements in pass@1 accuracy on key benchmarks through RL:

AIME…
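For context, pass@1 numbers like these are typically computed with the unbiased pass@k estimator of Chen et al. (2021): draw n samples per problem, count c correct ones; pass@1 reduces to c/n. A quick sketch:

```python
# Standard unbiased pass@k estimator (Chen et al., 2021).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n samples per problem, c of them correct: unbiased P(>=1 of k correct)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=1))  # 0.25 == c/n when k=1
```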
Naval (@naval):

Acquiring knowledge is easy; the hard part is knowing what to apply, and when. That’s why all true learning is “on the job.” Life is lived in the arena.

Ryan Marten (@ryanmart3n):

Announcing OpenThinker3-7B, the new SOTA open-data 7B reasoning model: improving over DeepSeek-R1-Distill-Qwen-7B by 33% on average over code, science, and math evals.

We also release our dataset, OpenThoughts3-1.2M, which is the best open reasoning dataset across all data…
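A minimal usage sketch with Hugging Face transformers; the repo id below is inferred from the announcement and may not match the actual one:

```python
# Sketch: loading and prompting the released model via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "open-thoughts/OpenThinker3-7B"  # assumed repo id
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    tokenize=False, add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(tok.decode(out[0], skip_special_tokens=True))
```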
David Singleton (@dps):

Coding agents have crossed a chasm

Somewhere in the last few months, something fundamental shifted for me with autonomous AI coding agents. They’ve gone from a “hey this is pretty neat” curiosity to something I genuinely can’t imagine working without. Not in a hand-wavy,…

NovaSky (@novaskyai):

✨Release: We upgraded SkyRL into a highly-modular, performant RL framework for training LLMs. We prioritized modularity—easily prototype new algorithms, environments, and training logic with minimal overhead.

🧵👇
Blog: novasky-ai.notion.site/skyrl-v01
Code: github.com/NovaSky-AI/Sky…
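A sketch of what the modularity pitch usually means in practice: environments, algorithms, and training logic sit behind small interfaces that can be swapped independently. The Protocols below are illustrative, not SkyRL's actual API:

```python
# Illustrative interfaces for a modular LLM-RL trainer (not SkyRL's API).
from typing import Protocol

class Env(Protocol):
    def reset(self) -> str: ...                              # returns a prompt
    def step(self, response: str) -> tuple[float, bool]: ...  # (reward, done)

class Algorithm(Protocol):
    def update(self, prompt: str, response: str, reward: float) -> None: ...

def train(env: Env, algo: Algorithm, policy, steps: int) -> None:
    """Training loop stays fixed; env/algo/policy are the swappable parts."""
    for _ in range(steps):
        prompt = env.reset()
        done = False
        while not done:
            response = policy(prompt)
            reward, done = env.step(response)
            algo.update(prompt, response, reward)
```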
Hokin Deng (@denghokin):

#ICML #cognition #GrowAI We spent 2 years carefully curating every single experiment (e.g., object permanence, the A-not-B task, the visual cliff task) in this dataset (total: 1,503 classic experiments spanning 12 core cognitive concepts).

We spent another year getting 230 MLLMs evaluated…
Minqi Jiang (@minqijiang):

Recently, there has been a lot of talk of LLM agents automating ML research itself. If Llama 5 can create Llama 6, then surely the singularity is just around the corner. 

How can we get a pulse check on whether current LLMs are capable of driving this kind of total…
Jason Weston (@jaseweston):

🌉 Bridging Offline & Online RL for LLMs 🌉
📝: arxiv.org/abs/2506.21495
New paper shows on verifiable & non-verifiable tasks:
- Online DPO & GRPO give similar performance.
- Semi-online (iterative) DPO with sync every s steps (more efficient!) works very well also.
- Offline DPO…
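For reference, the DPO objective this comparison is built on, and what the "sync every s steps" knob means: s = 1 recovers fully online DPO, s → ∞ recovers offline DPO, and semi-online sits in between by regenerating preference pairs from the current policy only every s steps. A minimal sketch of the loss (variable names are illustrative):

```python
# Sketch of the DPO loss on one preference pair.
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """logp_*: summed log-probs of the chosen (w) / rejected (l) response
    under the policy; ref_logp_*: the same under the frozen reference."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -F.logsigmoid(beta * margin).mean()

# toy numbers: the policy prefers the chosen response more than the reference does
loss = dpo_loss(torch.tensor(-10.0), torch.tensor(-14.0),
                torch.tensor(-11.0), torch.tensor(-13.0))
print(loss)  # ~0.60; shrinks as the policy's preference margin grows
```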
Sakana AI (@sakanaailabs):

We’re excited to introduce AB-MCTS!

Our new inference-time scaling algorithm enables collective intelligence for AI by allowing multiple frontier models (like Gemini 2.5 Pro, o4-mini, DeepSeek-R1-0528) to cooperate.

Blog: sakana.ai/ab-mcts
Paper: arxiv.org/abs/2503.04412
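A heavily simplified sketch of the idea: at each step the search either goes "wider" (sample a fresh answer) or "deeper" (refine an existing one), with the call routed across several models. The epsilon-greedy rule below stands in for the paper's adaptive branching, so treat everything here as illustrative:

```python
# Toy stand-in for AB-MCTS's wider-vs-deeper choice with multiple models.
import random

def ab_search(models, prompt, score, steps=16, eps=0.3):
    """models: list of callables str -> str; score: callable str -> float."""
    nodes = []  # (answer, score) pairs discovered so far
    for _ in range(steps):
        model = random.choice(models)          # multi-model cooperation
        if not nodes or random.random() < eps:
            answer = model(prompt)             # go wider: brand-new attempt
        else:
            best, _ = max(nodes, key=lambda n: n[1])
            answer = model(f"{prompt}\nImprove this answer:\n{best}")  # go deeper
        nodes.append((answer, score(answer)))
    return max(nodes, key=lambda n: n[1])[0]
```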
Liliang Ren (@liliang_ren):

We’re open-sourcing the pre-training code for Phi4-mini-Flash, our SoTA hybrid model that delivers 10× faster reasoning than Transformers — along with μP++, a suite of simple yet powerful scaling laws for stable large-scale training. 🔗 github.com/microsoft/Arch… (1/4)
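μP++ extends μP-style scaling; as baseline intuition, classic μP for Adam scales hidden-weight learning rates as 1/fan_in so hyperparameters tuned at a small base width transfer to wider models. A sketch of that vanilla rule (not the μP++ specifics):

```python
# Vanilla muP-style per-layer learning rates for Adam (illustrative only;
# real muP also treats input/output layers specially, elided here).
import torch.nn as nn
from torch.optim import AdamW

def mup_param_groups(model: nn.Module, base_lr: float, base_width: int):
    groups = []
    for module in model.modules():
        if isinstance(module, nn.Linear):
            fan_in = module.weight.shape[1]
            groups.append({"params": [module.weight],
                           "lr": base_lr * base_width / fan_in})  # lr ~ 1/width
            if module.bias is not None:
                groups.append({"params": [module.bias], "lr": base_lr})
    return groups

model = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 256))
opt = AdamW(mup_param_groups(model, base_lr=3e-4, base_width=256))
```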

Loubna Ben Allal (@loubnabenallal1):

SmolLM3 full training and evaluation code is now live, along with 100+ intermediate checkpoints:

✓ Pretraining scripts (nanotron)
✓ Post-training code SFT + APO (TRL/alignment-handbook)
✓ Evaluation scripts to reproduce all reported metrics

github.com/huggingface/sm…

All…
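A sketch of pulling one of those intermediate checkpoints via transformers' revision argument; the repo id and revision tag below are assumptions for illustration, so check the linked repo for the real ones:

```python
# Loading a specific training checkpoint from the Hub (ids are assumed).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "HuggingFaceTB/SmolLM3-3B",   # assumed repo id
    revision="step-100000",       # hypothetical intermediate-checkpoint tag
)
```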