Alexandra Souly (@alexandrasouly)'s Twitter Profile
Alexandra Souly

@alexandrasouly

Working on LLM Safeguards at @AISecurityInst

ID: 1557411732905148416

Joined: 10-08-2022 17:01:40

11 Tweets

115 Followers

211 Following

Alexandra Souly (@alexandrasouly)'s Twitter Profile Photo

Looking forward to presenting our work Leading the Pack: N-player Opponent Shaping tomorrow at the Multi-Agent Security Workshop #NeurIPS23 MASec Workshop
Paper: openreview.net/pdf?id=3b8hfpq…
Thanks to my co-authors Timon Willi, akbir., Robert Kirk, Chris Lu, Edward Grefenstette and Tim Rocktäschel

UCL DARK (@ucl_dark)'s Twitter Profile Photo

Exciting day ahead!
- Roberta Raileanu's talk on ICL for sequential decision-making tasks at 4pm (238-239)
- An oral by Alexandra Souly on N-player opponent shaping at 10:40AM (223)
- The SoLaR @ NeurIPS2024 workshop (R06-R09)
- A poster at 8:15AM on generalisation in offline RL (238-239)

Dominik Schmidt (@schmidtdominik_)'s Twitter Profile Photo

Extremely excited to announce new work (w/ Minqi Jiang) on learning RL policies and world models purely from action-free videos. 🌶️🌶️ LAPO learns a latent representation for actions from observation alone and then derives a policy from it. Paper: arxiv.org/abs/2312.10812

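The core idea can be sketched in a few lines. Below is a minimal, illustrative sketch under assumed MLP components and made-up dimensions, not the paper's code: an inverse-dynamics model infers a latent action from consecutive observations, a forward-dynamics model must reconstruct the next observation from that latent action, and a policy is then trained to predict the latent action from the current observation alone.

import torch
import torch.nn as nn

OBS_DIM, LATENT_ACT_DIM = 64, 8  # made-up sizes for illustration

# Inverse dynamics model (IDM): infers a latent action z_t from (o_t, o_t+1).
idm = nn.Sequential(nn.Linear(2 * OBS_DIM, 128), nn.ReLU(), nn.Linear(128, LATENT_ACT_DIM))
# Forward dynamics model (FDM): must reconstruct o_t+1 from (o_t, z_t),
# which forces z_t to carry action-like information.
fdm = nn.Sequential(nn.Linear(OBS_DIM + LATENT_ACT_DIM, 128), nn.ReLU(), nn.Linear(128, OBS_DIM))
# Latent policy: predicts z_t from o_t only, usable at decision time.
policy = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(), nn.Linear(128, LATENT_ACT_DIM))

wm_opt = torch.optim.Adam(list(idm.parameters()) + list(fdm.parameters()), lr=3e-4)
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def world_model_step(obs, next_obs):
    # Phase 1: learn latent actions from action-free (o_t, o_t+1) pairs.
    z = idm(torch.cat([obs, next_obs], dim=-1))
    pred_next = fdm(torch.cat([obs, z], dim=-1))
    loss = nn.functional.mse_loss(pred_next, next_obs)
    wm_opt.zero_grad()
    loss.backward()
    wm_opt.step()
    return loss.item()

def policy_step(obs, next_obs):
    # Phase 2: behaviour-clone the IDM's latent actions using o_t alone.
    with torch.no_grad():
        z_target = idm(torch.cat([obs, next_obs], dim=-1))
    loss = nn.functional.mse_loss(policy(obs), z_target)
    pi_opt.zero_grad()
    loss.backward()
    pi_opt.step()
    return loss.item()

# Stand-in data; real training uses consecutive video frames.
obs, next_obs = torch.randn(32, OBS_DIM), torch.randn(32, OBS_DIM)
world_model_step(obs, next_obs)
policy_step(obs, next_obs)

In the paper the latent action additionally passes through a quantisation bottleneck so the IDM cannot simply copy the next observation into z, and the latent policy is later grounded in real actions with a small amount of labelled data or online fine-tuning; see the arXiv link above for the actual method.
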
Edward Grefenstette (@egrefen)'s Twitter Profile Photo

Opponent Shaping allows agents to learn to cooperate. Sounds nice, but do these methods scale past two agents? If not, why not, and what can be done? Alexandra Souly, Timon Willi, akbir. and colleagues answer these questions and more [12/24] openreview.net/forum?id=3b8hf…

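For readers unfamiliar with opponent shaping, the two-player mechanism can be sketched as follows: the shaping agent differentiates through its opponent's anticipated learning step (in the style of LOLA). The toy one-shot prisoner's dilemma below only illustrates the update rule, not the paper's N-player results; the payoffs and step sizes are illustrative, not taken from the paper.

import torch

# Prisoner's dilemma payoffs (both agents maximise): R, S, T, P.
R, S, T, P = -1.0, -3.0, 0.0, -2.0

def values(theta1, theta2):
    p1, p2 = torch.sigmoid(theta1), torch.sigmoid(theta2)  # P(cooperate)
    v1 = p1*p2*R + p1*(1-p2)*S + (1-p1)*p2*T + (1-p1)*(1-p2)*P
    v2 = p1*p2*R + (1-p1)*p2*S + p1*(1-p2)*T + (1-p1)*(1-p2)*P
    return v1, v2

theta1 = torch.zeros(1, requires_grad=True)  # shaping agent
theta2 = torch.zeros(1, requires_grad=True)  # naive learner
lr, opp_lr = 0.3, 0.3

for _ in range(200):
    v1, v2 = values(theta1, theta2)
    # Anticipate the opponent's naive gradient step, kept differentiable so
    # agent 1 can backprop through it -- this is the "shaping" term.
    grad2 = torch.autograd.grad(v2, theta2, create_graph=True)[0]
    v1_shaped, _ = values(theta1, theta2 + opp_lr * grad2)
    grad1 = torch.autograd.grad(v1_shaped, theta1)[0]
    with torch.no_grad():
        theta1 += lr * grad1               # gradient ascent on the shaped value
        theta2 += opp_lr * grad2.detach()  # opponent's plain naive update

print(torch.sigmoid(theta1).item(), torch.sigmoid(theta2).item())
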
Dominik Schmidt (@schmidtdominik_)'s Twitter Profile Photo

The code + new results for LAPO, an ⚡ICLR Spotlight⚡ (w/ Minqi Jiang) are now out ‼️ LAPO learns world models and policies directly from video, without any action labels, enabling training of agents from web-scale video data alone. Links below ⤵️

Xander Davies (@alxndrdavies)'s Twitter Profile Photo

Jailbreaking evals ~always focus on simple chatbots—excited to announce AgentHarm, a dataset for measuring harmfulness of LLM 𝑎𝑔𝑒𝑛𝑡𝑠 developed at @AISafetyInst in collaboration with Gray Swan AI! 🧵 1/N

AI Security Institute (@aisecurityinst)'s Twitter Profile Photo

We've released a technical report detailing our pre-deployment testing of Anthropic's upgraded Claude 3.5 Model with the U.S. AI Safety Institute. Read our blog for a high-level overview. aisi.gov.uk/work/pre-deplo…

Maksym Andriushchenko @ ICLR (@maksym_andr)'s Twitter Profile Photo

Great to see that AgentHarm (arxiv.org/abs/2410.09024) has been used by the US and UK AI Safety Institutes for pre-deployment testing of the upgraded Claude 3.5 Sonnet. Also, check out the full report—it's great and will likely influence evaluation standards for new LLMs, as…

Xander Davies (@alxndrdavies)'s Twitter Profile Photo

When we were developing our agent misuse dataset, we noticed instances of models seeming to realize our tasks were fake. We're sharing some examples and we'd be excited for more research into how synthetic tasks can distort eval results! 🧵 1/N

Xander Davies (@alxndrdavies)'s Twitter Profile Photo

Defending against adversarial prompts is hard; defending against fine-tuning API attacks is much harder. In our new AI Security Institute pre-print, we break alignment and extract harmful info using entirely benign and natural interactions during fine-tuning & inference. 😮 🧵 1/10

Micah Goldblum (@micahgoldblum)'s Twitter Profile Photo

🚨 Did you know that small-batch vanilla SGD without momentum (i.e. the first optimizer you learn about in intro ML) is virtually as fast as AdamW for LLM pretraining on a per-FLOP basis? 📜 1/n

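The two configurations being compared can be written down directly with standard PyTorch optimizers; the model, learning rates, and betas below are placeholders rather than the thread's actual settings.

import torch

model = torch.nn.Linear(512, 512)  # stand-in for an LLM

# Vanilla small-batch SGD: no momentum buffer, so no per-parameter optimizer state.
sgd = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.0)

# AdamW baseline: keeps two extra state tensors (first and second moments) per parameter.
adamw = torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95), weight_decay=0.1)

Besides the per-FLOP comparison, dropping momentum also removes the optimizer state entirely, which is a meaningful memory saving at LLM scale.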