Eric Wallace (@eric_wallace_) 's Twitter Profile
Eric Wallace

@eric_wallace_

researcher @openai

ID: 332600142

Link: http://www.ericswallace.com · Joined: 10-07-2011 03:13:26

552 Tweets

8.8K Followers

1.1K Following

Charlie Snell (@sea_snell) 's Twitter Profile Photo

Does anyone have a favorite task where gpt-4 has near chance accuracy when zero or few-shot prompted? I’m looking for recommendations for tasks like this

Owain Evans (@owainevans_uk) 's Twitter Profile Photo

Cool paper by Wan et al (UC Berkeley) with surprising results. 
In their task, an LLM answers a controversial question Q based on the conflicting arguments from excerpts from two documents from the web.

We might expect that LLMs would be more influenced by excerpts that (a) have
Eric Wallace (@eric_wallace_) 's Twitter Profile Photo

The final layer of an LLM up-projects from hidden dim → vocab size. The logprobs are thus low rank, and with some clever API queries, you can recover an LLM’s hidden dimension (or even the exact layer’s weights). Our new paper is out, a collaboration among lots of friends!
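
To make the low-rank observation concrete, here is a minimal numerical sketch (assuming a random unembedding matrix and access to full logit vectors, which is a simplification of the paper's API-only setting): stacking many logit vectors into a matrix and counting its large singular values recovers the hidden dimension.

```python
# Toy sketch of why the logits are low rank (simulated, not the paper's actual
# API attack): the unembedding layer maps a d-dimensional hidden state to
# vocab-sized logits, so a matrix of logit vectors collected across many
# prompts has rank at most d. Counting its large singular values recovers d.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, n_queries = 4096, 256, 512

W = rng.normal(size=(vocab_size, hidden_dim))   # unembedding matrix (unknown to the attacker)
H = rng.normal(size=(hidden_dim, n_queries))    # final hidden states for n_queries prompts
logits = W @ H + 1e-6 * rng.normal(size=(vocab_size, n_queries))  # observed logit vectors + noise

# The attacker only sees `logits`; its numerical rank exposes hidden_dim.
singular_values = np.linalg.svd(logits, compute_uv=False)
estimated_dim = int(np.sum(singular_values > 1e-3 * singular_values[0]))
print(estimated_dim)  # ≈ 256
```

A real API exposes logprobs rather than raw logits, and the log-softmax normalization adds roughly one to the observed rank; the "clever API queries" in the tweet are about reconstructing enough of each logit vector to run this kind of analysis, which the sketch skips.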

Katie Kang (@katie_kang_) 's Twitter Profile Photo

We know LLMs hallucinate, but what governs what they dream up? Turns out it’s all about the “unfamiliar” examples they see during finetuning

Our new paper shows that manipulating the supervision on these special examples can steer how LLMs hallucinate

arxiv.org/abs/2403.05612
🧵
Eric Wallace (@eric_wallace_) 's Twitter Profile Photo

I’ll be giving two different OpenAI talks at ICLR tomorrow on our recent safety work, focusing primarily on the paper “The Instruction Hierarchy”. 1pm at the Data for Foundation Models workshop, and 3pm at the Secure and Trustworthy LLMs workshop.

Danny Halawi (@dannyhalawi15) 's Twitter Profile Photo

New paper! We introduce Covert Malicious Finetuning (CMFT), a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API.
Ethan Perez (@ethanjperez) 's Twitter Profile Photo

One of the most important and well-executed papers I've read in months. They explored ~all attacks+defenses I was most keen on seeing tried for getting robust finetuning APIs. I'm not sure if it's possible to make finetuning APIs robust; it would be a big deal if it were possible.

Edoardo Debenedetti (@edoardo_debe) 's Twitter Profile Photo

Does the instruction hierarchy introduced with GPT-4o mini work? We ran AgentDojo on it, and it looks like it does!

GPT-4o mini has similar utility to GPT-4o (only 1% lower!), but the targeted prompt injection success rate is 20% lower than GPT-4o's!
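
To make the two numbers concrete, here is a hypothetical tally of AgentDojo-style episode results (the Episode fields and helper names are illustrative, not AgentDojo's actual API): utility is the fraction of user tasks the agent completes, and the targeted success rate is the fraction of injected episodes where the attacker's goal is carried out.

```python
# Hypothetical per-episode records and the two ratios reported in the tweet.
from dataclasses import dataclass

@dataclass
class Episode:
    user_task_solved: bool        # did the agent complete the legitimate task?
    injected: bool                # was a prompt injection present in the environment?
    attacker_goal_achieved: bool  # did the agent carry out the injected instruction?

def utility(episodes):
    return sum(e.user_task_solved for e in episodes) / len(episodes)

def targeted_attack_success_rate(episodes):
    attacked = [e for e in episodes if e.injected]
    return sum(e.attacker_goal_achieved for e in attacked) / len(attacked)

runs = [
    Episode(True, False, False),
    Episode(True, True, False),
    Episode(False, True, True),
    Episode(True, True, False),
]
print(f"utility: {utility(runs):.0%}")                                 # 75%
print(f"targeted success: {targeted_attack_success_rate(runs):.0%}")   # 33%
```

AgentDojo itself computes these over its own task and attack suites; the sketch only shows what the two ratios measure.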
Lucy Li (@lucy3_li) 's Twitter Profile Photo

Hi friends, colleagues, followers. I am on the faculty job market! I am a PhD student at the Berkeley School of Information + Berkeley AI Research. I work on NLP, and I believe all language, whether AI- or human-generated, is ✨social and cultural data✨. My work includes: 🧵

Gray Swan AI (@grayswanai) 's Twitter Profile Photo

🚨 New Jailbreak Bounty Alert

$1,000 for jailbreaking the hidden CoTs from OpenAI's o1-mini and o1-preview!

No bans. Exclusively on the Gray Swan Arena.

🗓Start Time: October 29th, 1 PM ET
🌐Link: app.grayswan.ai/arena
💬Discord: discord.gg/St8uMetxjQ
Charlie Snell (@sea_snell) 's Twitter Profile Photo

Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task?

We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵
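
One way to picture the recipe behind the title is sketched below. This is a toy illustration under assumed details (that finetuning on more task data shifts the point of emergence toward smaller models, and that this shift can be fit and extrapolated back toward the few-shot setting); none of the numbers or functional forms come from the paper.

```python
# Toy sketch: fit a sigmoid in log-compute for each finetuning data budget,
# locate its emergence point, then extrapolate that trend toward the
# near-zero-data (few-shot) regime to predict where emergence would occur.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_c, midpoint, steepness, ceiling):
    # accuracy as a function of log-compute: flat, then a sharp transition
    return ceiling / (1.0 + np.exp(-steepness * (log_c - midpoint)))

log_compute = np.linspace(20, 26, 13)            # pretraining compute (log10 FLOPs), synthetic
data_budgets = np.array([1e3, 1e4, 1e5])         # finetuning examples, synthetic

rng = np.random.default_rng(0)
midpoints_true = 25.0 - 0.6 * np.log10(data_budgets)  # toy ground truth: more data => earlier emergence
emergence_points = []
for mid in midpoints_true:
    acc = sigmoid(log_compute, mid, 2.0, 0.9) + rng.normal(0.0, 0.01, log_compute.size)
    popt, _ = curve_fit(sigmoid, log_compute, acc, p0=[23.0, 1.0, 1.0])
    emergence_points.append(popt[0])             # fitted emergence point for this data budget

# Fit how the emergence point moves with log(data) and extrapolate toward ~1 example.
slope, intercept = np.polyfit(np.log10(data_budgets), emergence_points, 1)
print(f"predicted few-shot emergence near 10^{intercept:.1f} FLOPs of pretraining compute")
```

The sigmoid-in-log-compute form is just a convenient stand-in for an accuracy curve with a sharp transition; the synthetic midpoints play the role of finetuned GPT-N checkpoints.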
Eric Wallace (@eric_wallace_) 's Twitter Profile Photo

Chain-of-thought reasoning provides a natural avenue for improving model safety. Today we are publishing a paper on how we train the "o" series of models to think carefully through unsafe prompts: openai.com/index/delibera……

Sam Altman (@sama) 's Twitter Profile Photo

today we are introducing codex. it is a software engineering agent that runs in the cloud and does tasks for you, like writing a new feature or fixing a bug. you can run many tasks in parallel.