Summer Yue (@summeryue0)'s Twitter Profile
Summer Yue

@summeryue0

VP of Research at Scale AI. Prev: RLHF lead on Bard, researcher at Google DeepMind / Brain (LaMDA, RL/TF-Agents, superhuman chip design). Opinions my own.

ID: 2726658913

Joined: 12-08-2014 16:33:03

103 Tweets

2.2K Followers

331 Following

Summer Yue (@summeryue0)'s Twitter Profile Photo

✨ All Gemini 2.0 models are now on MultiChallenge! Pro Experimental, Flash, and Flash Thinking have joined the benchmark - with Pro Experimental ranking #3! 🎯

Summer Yue (@summeryue0)'s Twitter Profile Photo

Excited to share our latest work "Jailbreaking to Jailbreak (J2)", from the SEAL team and Scale AI's Red Team! As frontier models become more creative and capable of reasoning, they can now not only assist human red teamers but also autonomously drive red teaming efforts.

Alexandr Wang (@alexandr_wang)'s Twitter Profile Photo

On the heels of Humanity's Last Exam, Scale AI & Center for AI Safety have released a new very-hard reasoning eval: EnigmaEval: 1,184 multimodal puzzles so hard they take groups of humans many hours to days to solve. All top models score 0 on the Hard set, and <10% on the Normal set 🧵

Summer Yue (@summeryue0)'s Twitter Profile Photo

GPT-4.5 Preview just dropped~ We put it to the test, and the results are... mixed 👀

⚡ #2 in Tool Use - Chat (trailing o1)
🐢 #3 in Tool Use - Enterprise (coming after Claude 3.7 Sonnet)
🥉 #3 in EnigmaEval (following Claude 3.7 Sonnet Thinking)
📚 #4 in MultiChallenge (behind
Summer Yue (@summeryue0)'s Twitter Profile Photo

If a model lies when pressured—it's not ready for AGI.

The new MASK leaderboard is live.

Built on the private split of our open-source honesty benchmark (w/ Center for AI Safety), it tests whether models lie under pressure—even when they know better.

📊 Leaderboard:
Summer Yue (@summeryue0)'s Twitter Profile Photo

🤖 AI agents are crossing into the real world. But when they act independently—who's watching? At Scale, we're building Agent Oversight: a platform to monitor, intervene, and align autonomous AI. We're hiring engineers (SF/NYC) to tackle one of the most urgent problems in AI.

Summer Yue (@summeryue0)'s Twitter Profile Photo

👀 OpenAI's new models are showing some humility:

Hard test (HLE)
• o1: 8% correct | 93% confident (!!)
• o3: 20% | 55%
• o4-mini: 18% | 77%

Easy test (GSM8K)
• o1: 96% correct | 100% confident
• o3: 97% | 84%
• o4-mini: 97% | 99%

o3 stands out for being overall less
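As a minimal sketch of the comparison behind the figures above: the gap between a model's reported confidence and its actual accuracy. The numbers are the ones quoted in the tweet; the helper function and its name are illustrative, not part of any benchmark code.

```python
# Illustrative only: "overconfidence gap" = reported confidence minus
# measured accuracy, using the HLE figures quoted in the tweet above.
def overconfidence_gap(accuracy_pct: int, confidence_pct: int) -> int:
    """Positive values mean the model claims more confidence than it earns."""
    return confidence_pct - accuracy_pct

# Hard test (HLE): (accuracy %, confidence %) per the tweet
hle = {
    "o1": (8, 93),
    "o3": (20, 55),
    "o4-mini": (18, 77),
}

for model, (acc, conf) in hle.items():
    print(f"{model}: gap = {overconfidence_gap(acc, conf):+d} pts")
```

By this (hypothetical) measure, o1 is the most overconfident on the hard test (+85 points), which matches the "(!!)" in the tweet.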

Summer Yue (@summeryue0)'s Twitter Profile Photo

šŸ” SEAL and Red Team at @scale_ai present a position paper outlining what we’ve learned from red teaming LLMs so far—what matters, what’s missing, and how model safety fits into broader system safety and monitoring. šŸ”— scale.com/research/red_t… šŸ“ scale.com/blog/rethink-r…