Summer Yue (@summeryue0)'s Twitter Profile
Summer Yue

@summeryue0

VP of Research at Scale AI. Prev: RLHF lead on Bard, researcher at Google DeepMind / Brain (LaMDA, RL/TF-Agents, superhuman chip design). Opinions my own.

ID: 2726658913

Joined: 12-08-2014 16:33:03

103 Tweets

2.2K Followers

331 Following

Summer Yue (@summeryue0)'s Twitter Profile Photo

✨ All Gemini 2.0 models are now on MultiChallenge! Pro Experimental, Flash, and Flash Thinking have joined the benchmark - with Pro Experimental ranking #3! 🎯

Summer Yue (@summeryue0)'s Twitter Profile Photo

Excited to share our latest work "Jailbreaking to Jailbreak (J2)", from the SEAL team and Scale AI's Red Team! As frontier models become more creative and capable of reasoning, they can now not only assist human red teamers but also autonomously drive red teaming efforts.

Alexandr Wang (@alexandr_wang)'s Twitter Profile Photo

On the heels of Humanity's Last Exam, Scale AI & Center for AI Safety have released a new very-hard reasoning eval: EnigmaEval: 1,184 multimodal puzzles so hard they take groups of humans many hours to days to solve. All top models score 0 on the Hard set, and <10% on the Normal set 🧵

Summer Yue (@summeryue0)'s Twitter Profile Photo

GPT-4.5 Preview just dropped~ We put it to the test, and the results are... mixed 👀

⚡ #2 in Tool Use - Chat (trailing o1)
🐢 #3 in Tool Use - Enterprise (coming after Claude 3.7 Sonnet)
🥉 #3 in EnigmaEval (following Claude 3.7 Sonnet Thinking)
📚 #4 in MultiChallenge (behind
Summer Yue (@summeryue0)'s Twitter Profile Photo

If a model lies when pressured—it's not ready for AGI.

The new MASK leaderboard is live.

Built on the private split of our open-source honesty benchmark (w/ Center for AI Safety), it tests whether models lie under pressure—even when they know better.

📊 Leaderboard:
Summer Yue (@summeryue0)'s Twitter Profile Photo

🤖 AI agents are crossing into the real world. But when they act independently—who's watching? At Scale, we're building Agent Oversight: a platform to monitor, intervene, and align autonomous AI. We're hiring engineers (SF/NYC) to tackle one of the most urgent problems in AI.

Summer Yue (@summeryue0)'s Twitter Profile Photo

👀 OpenAI's new models are showing some humility:

Hard test (HLE)
• o1: 8% correct | 93% confident (!!)
• o3: 20% | 55%
• o4-mini: 18% | 77%

Easy test (GSM8K)
• o1: 96% correct | 100% confident
• o3: 97% | 84%
• o4-mini: 97% | 99%

o3 stands out for being overall less
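As a minimal sketch of the comparison behind the figures above: the gap between a model's reported confidence and its actual accuracy. The numbers are the ones quoted in the tweet; the helper function and its name are illustrative, not part of any benchmark code.

```python
# Illustrative only: "overconfidence gap" = reported confidence minus
# measured accuracy, using the HLE figures quoted in the tweet above.
def overconfidence_gap(accuracy_pct: int, confidence_pct: int) -> int:
    """Positive values mean the model claims more confidence than it earns."""
    return confidence_pct - accuracy_pct

# Hard test (HLE): (accuracy %, confidence %) per the tweet
hle = {
    "o1": (8, 93),
    "o3": (20, 55),
    "o4-mini": (18, 77),
}

for model, (acc, conf) in hle.items():
    print(f"{model}: gap = {overconfidence_gap(acc, conf):+d} pts")
```

By this (hypothetical) measure, o1 is the most overconfident on the hard test (+85 points), which matches the "(!!)" in the tweet.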

Summer Yue (@summeryue0)'s Twitter Profile Photo

šŸ” SEAL and Red Team at @scale_ai present a position paper outlining what we’ve learned from red teaming LLMs so far—what matters, what’s missing, and how model safety fits into broader system safety and monitoring. šŸ”— scale.com/research/red_t… šŸ“ scale.com/blog/rethink-r…