Cem Anil (@cem__anil) 's Twitter Profile
Cem Anil

@cem__anil

Machine learning / AI Safety at @AnthropicAI and University of Toronto / Vector Institute. Prev. @google (Blueshift Team) and @nvidia.

ID: 1062518594356035584

Link: https://www.cs.toronto.edu/~anilcem/ · Joined: 14-11-2018 01:32:28

516 Tweets

2.2K Followers

1.1K Following

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic research: Forecasting rare language model behaviors. We forecast whether risks will occur after a model is deployed—using even very limited sets of test data.
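
As a rough illustration of the idea only (not the estimator in the paper): one can estimate a per-query rate of a rare behavior from a limited test set and extrapolate the chance of seeing it at least once at deployment scale. The helper name and the numbers below are hypothetical.

```python
import math

# Hypothetical back-of-envelope sketch, not Anthropic's method: extrapolate a
# rare behavior's deployment-scale probability from a small amount of test data.

def prob_at_least_once(per_query_rate: float, num_queries: int) -> float:
    """P(the behavior occurs at least once across num_queries independent queries)."""
    # 1 - (1 - p)^N, computed via log1p/expm1 for numerical stability at tiny rates.
    return -math.expm1(num_queries * math.log1p(-per_query_rate))

# Example: 3 risky completions observed in 100,000 sampled test queries.
rate = 3 / 100_000
print(prob_at_least_once(rate, num_queries=1_000))        # ~0.03
print(prob_at_least_once(rate, num_queries=10_000_000))   # ~1.0
```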

Anthropic (@anthropicai) 's Twitter Profile Photo

Claude will help power Amazon's next-generation AI assistant, Alexa+. Amazon and Anthropic have worked closely together over the past year, with Mike Krieger leading a team that helped Amazon get the full benefits of Claude's capabilities.

David Duvenaud (@davidduvenaud) 's Twitter Profile Photo

LLMs have complex joint beliefs about all sorts of quantities. And my postdoc James Requeima visualized them! In this thread we show LLM predictive distributions conditioned on data and free-form text. LLMs pick up on all kinds of subtle and unusual structure: 🧵
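
A minimal sketch of the general recipe (not the authors' code): condition an LLM on observed data plus free-form text, sample numeric continuations, and read the samples off as a predictive distribution. The prompt format and `sample_completion` stub below are assumptions, standing in for whatever LLM client you use.

```python
import random

def sample_completion(prompt: str) -> str:
    """Placeholder for an LLM call; here it just simulates noisy numeric answers
    so the sketch runs end to end. Swap in a real client in practice."""
    return f"{random.gauss(7.5, 1.0):.2f}"

observations = [(1, 2.1), (2, 3.9), (3, 6.2)]
side_info = "The quantity roughly doubles each step but saturates near 20."
prompt = (
    f"Observed (x, y) pairs: {observations}\n"
    f"Context: {side_info}\n"
    "Predict y at x = 4. Reply with a single number: "
)

samples = []
for _ in range(200):
    text = sample_completion(prompt)
    try:
        samples.append(float(text.strip()))
    except ValueError:
        continue  # discard non-numeric completions

# The empirical distribution of `samples` approximates the LLM's predictive
# distribution for y at x = 4, conditioned on both the data and the text.
```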

Stuart Ritchie 🇺🇦 (@stuartjritchie) 's Twitter Profile Photo

What are you doing this weekend? Maybe you’ll consider applying to work with me at Anthropic! I’m looking for a brilliant writer/editor with a focus on econ who can help communicate our research on the societal impacts of AI. The weirder the better. boards.greenhouse.io/anthropic/jobs…

Aaditya Singh (@aaditya6284) 's Twitter Profile Photo

Transformers employ different strategies through training to minimize loss, but how do these strategies trade off, and why? Excited to share our newest work, where we show remarkably rich competitive and cooperative interactions (termed "coopetition") as a transformer learns. Read on 🔎⏬

Stephanie Chan (@scychan_brains) 's Twitter Profile Photo

New work led by Aaditya Singh: "Strategy coopetition explains the emergence and transience of in-context learning in transformers." We find some surprising things!! E.g. that circuits can simultaneously compete AND cooperate ("coopetition") 😯 🧵👇

Jan Leike (@janleike) 's Twitter Profile Photo

Could we spot a misaligned model in the wild? To find out, we trained a model with hidden misalignments and asked other researchers to uncover them in a blind experiment. 3/4 teams succeeded, 1 of them after only 90 min

Samuel Marks (@saprmarks) 's Twitter Profile Photo

New paper with Johannes Treutlein, Evan Hubinger, and many other coauthors! We train a model with a hidden misaligned objective and use it to run an auditing game: Can other teams of researchers uncover the model’s objective? x.com/AnthropicAI/st…

Alireza Mousavi @ ICLR 2025 (@alirezamh_) 's Twitter Profile Photo

With infinite compute, would it make a difference to use Transformers, RNNs, or even vanilla feedforward nets? They’re all universal approximators, after all. We prove that yes, it does: you end up with different sample complexities, no matter how much compute/memory you have.👇
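
As generic background on the term (standard learning theory, not the paper's theorem): the sample complexity $n(\varepsilon, \delta)$ is the number of examples needed to reach error $\varepsilon$ with probability $1 - \delta$, and uniform-convergence bounds tie it to the capacity of the hypothesis class an architecture induces, e.g.

$$
n(\varepsilon, \delta) \;=\; O\!\left(\frac{\mathrm{VCdim}(\mathcal{H}) + \log(1/\delta)}{\varepsilon^{2}}\right),
$$

so two families that are both universal approximators can still differ in how many samples they need, independent of compute.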

cat (@_catwu) 's Twitter Profile Photo

It’s been a big week for Claude Code. We launched 8 exciting new features to help devs build faster and smarter. Here's a roundup of everything we released:

Transluce (@transluceai) 's Twitter Profile Photo

To interpret AI benchmarks, we need to look at the data. Top-level numbers don't mean what you think: there may be broken tasks, unexpected behaviors, or near-misses. We're introducing Docent to accelerate analysis of AI agent transcripts. It can spot surprises in seconds. 🧵👇

Johannes Gasteiger, né Klicpera (@gasteigerjo) 's Twitter Profile Photo

New Anthropic blog post: Subtle sabotage in automated researchers. As AI systems increasingly assist with AI research, how do we ensure they're not subtly sabotaging that research? We show that malicious models can undermine ML research tasks in ways that are hard to detect.

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

Bruno Mlodozeniec (@kayembruno) 's Twitter Profile Photo

How do you identify training data responsible for an image generated by your diffusion model? How could you quantify how much copyrighted works influenced the image? In our ICLR oral paper we propose how to approach such questions scalably with influence functions.
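
A rough sketch of the influence-function idea in code (a first-order gradient-similarity approximation, not the paper's exact estimator; `model`, `loss_fn`, and the example objects are placeholders):

```python
import torch

def flat_grad(loss: torch.Tensor, params) -> torch.Tensor:
    """Concatenate dLoss/dParams into one flat vector."""
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

def influence_scores(model, loss_fn, generated_sample, train_examples):
    """Score training examples by how aligned their loss gradient is with the
    gradient of the loss on the generated sample. A full influence function
    would additionally apply an inverse-Hessian (e.g. K-FAC-style) product."""
    params = [p for p in model.parameters() if p.requires_grad]
    g_query = flat_grad(loss_fn(model, generated_sample), params)
    scores = []
    for example in train_examples:
        g_train = flat_grad(loss_fn(model, example), params)
        scores.append(torch.dot(g_query, g_train).item())
    return scores  # larger score => example pushed the model toward this output
```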

Anthropic (@anthropicai) 's Twitter Profile Photo

Introducing a new Max plan for Claude. It’s flexible, with options for 5x or 20x more usage compared to our Pro plan. Plus, priority access to our latest features and models:

Anthropic (@anthropicai) 's Twitter Profile Photo

Introducing the next generation: Claude Opus 4 and Claude Sonnet 4. Claude Opus 4 is our most powerful model yet, and the world’s best coding model. Claude Sonnet 4 is a significant upgrade from its predecessor, delivering superior coding and reasoning.

Cursor (@cursor_ai) 's Twitter Profile Photo

Sonnet 4 is available in Cursor! We've been very impressed by its coding ability. It is much easier to control than 3.7 and is excellent at understanding codebases. It appears to be a new state of the art.

Anthropic (@anthropicai) 's Twitter Profile Photo

New Anthropic Research: Agentic Misalignment. In stress-testing experiments designed to identify risks before they cause real harm, we find that AI models from multiple providers attempt to blackmail a (fictional) user to avoid being shut down.
