Eric Wallace (@eric_wallace_) 's Twitter Profile
Eric Wallace

@eric_wallace_

researcher @openai

ID: 332600142

Link: http://www.ericswallace.com · Joined: 10-07-2011 03:13:26

552 Tweets

8.8K Followers

1.1K Following

Charlie Snell (@sea_snell) 's Twitter Profile Photo

Does anyone have a favorite task where gpt-4 has near chance accuracy when zero or few-shot prompted? I’m looking for recommendations for tasks like this

Owain Evans (@owainevans_uk) 's Twitter Profile Photo

Cool paper by Wan et al (UC Berkeley) with surprising results. 
In their task, an LLM answers a controversial question Q based on the conflicting arguments from excerpts from two documents from the web.

We might expect that LLMs would be more influenced by excerpts that (a) have
Eric Wallace (@eric_wallace_) 's Twitter Profile Photo

The final layer of an LLM up-projects from hidden dim → vocab size. The logprobs are thus low rank, and with some clever API queries, you can recover an LLM’s hidden dimension (or even the exact layer’s weights). Our new paper is out, a collaboration among lots of friends!
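
To make the low-rank observation concrete, here is a minimal numerical sketch (assuming a random unembedding matrix and access to full logit vectors, which is a simplification of the paper's API-only setting): stacking many logit vectors into a matrix and counting its large singular values recovers the hidden dimension.

```python
# Toy sketch of why the logits are low rank (simulated, not the paper's actual
# API attack): the unembedding layer maps a d-dimensional hidden state to
# vocab-sized logits, so a matrix of logit vectors collected across many
# prompts has rank at most d. Counting its large singular values recovers d.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim, n_queries = 4096, 256, 512

W = rng.normal(size=(vocab_size, hidden_dim))   # unembedding matrix (unknown to the attacker)
H = rng.normal(size=(hidden_dim, n_queries))    # final hidden states for n_queries prompts
logits = W @ H + 1e-6 * rng.normal(size=(vocab_size, n_queries))  # observed logit vectors + noise

# The attacker only sees `logits`; its numerical rank exposes hidden_dim.
singular_values = np.linalg.svd(logits, compute_uv=False)
estimated_dim = int(np.sum(singular_values > 1e-3 * singular_values[0]))
print(estimated_dim)  # ≈ 256
```

A real API exposes logprobs rather than raw logits, and the log-softmax normalization adds roughly one to the observed rank; the "clever API queries" in the tweet are about reconstructing enough of each logit vector to run this kind of analysis, which the sketch skips.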

Katie Kang (@katie_kang_) 's Twitter Profile Photo

We know LLMs hallucinate, but what governs what they dream up? Turns out it’s all about the “unfamiliar” examples they see during finetuning

Our new paper shows that manipulating the supervision on these special examples can steer how LLMs hallucinate

arxiv.org/abs/2403.05612
🧵
Eric Wallace (@eric_wallace_) 's Twitter Profile Photo

I’ll be giving two different OpenAI talks at ICLR tomorrow on our recent safety work, focusing primarily on the paper “The Instruction Hierarchy”. 1pm at the Data for Foundation Models workshop, and 3pm at the Secure and Trustworthy LLMs workshop.

Danny Halawi (@dannyhalawi15) 's Twitter Profile Photo

New paper! We introduce Covert Malicious Finetuning (CMFT), a method for jailbreaking language models via fine-tuning that avoids detection. We use our method to covertly jailbreak GPT-4 via the OpenAI finetuning API.
Ethan Perez (@ethanjperez) 's Twitter Profile Photo

One of the most important and well-executed papers I've read in months. They explored ~all attacks+defenses I was most keen on seeing tried for getting robust finetuning APIs. I'm not sure if it's possible to make finetuning APIs robust; it would be a big deal if it were possible.

Edoardo Debenedetti (@edoardo_debe) 's Twitter Profile Photo

Does the instruction hierarchy introduced with GPT-4o mini work? We ran AgentDojo on it, and it looks like it does!

GPT-4o mini has similar utility to GPT-4o (only 1% lower!), but the targeted prompt injection success rate is 20% lower than GPT-4o's!
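
To make the two numbers concrete, here is a hypothetical tally of AgentDojo-style episode results (the Episode fields and helper names are illustrative, not AgentDojo's actual API): utility is the fraction of user tasks the agent completes, and the targeted success rate is the fraction of injected episodes where the attacker's goal is carried out.

```python
# Hypothetical per-episode records and the two ratios reported in the tweet.
from dataclasses import dataclass

@dataclass
class Episode:
    user_task_solved: bool        # did the agent complete the legitimate task?
    injected: bool                # was a prompt injection present in the environment?
    attacker_goal_achieved: bool  # did the agent carry out the injected instruction?

def utility(episodes):
    return sum(e.user_task_solved for e in episodes) / len(episodes)

def targeted_attack_success_rate(episodes):
    attacked = [e for e in episodes if e.injected]
    return sum(e.attacker_goal_achieved for e in attacked) / len(attacked)

runs = [
    Episode(True, False, False),
    Episode(True, True, False),
    Episode(False, True, True),
    Episode(True, True, False),
]
print(f"utility: {utility(runs):.0%}")                                 # 75%
print(f"targeted success: {targeted_attack_success_rate(runs):.0%}")   # 33%
```

AgentDojo itself computes these over its own task and attack suites; the sketch only shows what the two ratios measure.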
Lucy Li (@lucy3_li) 's Twitter Profile Photo

Hi friends, colleagues, followers. I am on the faculty job market! I am a PhD student at the Berkeley School of Information + Berkeley AI Research. I work on NLP, and I believe all language, whether AI- or human-generated, is ✨social and cultural data✨. My work includes: 🧵

Gray Swan AI (@grayswanai) 's Twitter Profile Photo

🚨 New Jailbreak Bounty Alert

$1,000 for jailbreaking the hidden CoTs from OpenAI's o1-mini and o1-preview!

No bans. Exclusively on the Gray Swan Arena.

🗓Start Time: October 29th, 1 PM ET
🌐Link: app.grayswan.ai/arena
💬Discord: discord.gg/St8uMetxjQ
Charlie Snell (@sea_snell) 's Twitter Profile Photo

Can we predict emergent capabilities in GPT-N+1🌌 using only GPT-N model checkpoints, which have random performance on the task?

We propose a method for doing exactly this in our paper “Predicting Emergent Capabilities by Finetuning”🧵
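
One way to picture the recipe behind the title is sketched below. This is a toy illustration under assumed details (that finetuning on more task data shifts the point of emergence toward smaller models, and that this shift can be fit and extrapolated back toward the few-shot setting); none of the numbers or functional forms come from the paper.

```python
# Toy sketch: fit a sigmoid in log-compute for each finetuning data budget,
# locate its emergence point, then extrapolate that trend toward the
# near-zero-data (few-shot) regime to predict where emergence would occur.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(log_c, midpoint, steepness, ceiling):
    # accuracy as a function of log-compute: flat, then a sharp transition
    return ceiling / (1.0 + np.exp(-steepness * (log_c - midpoint)))

log_compute = np.linspace(20, 26, 13)            # pretraining compute (log10 FLOPs), synthetic
data_budgets = np.array([1e3, 1e4, 1e5])         # finetuning examples, synthetic

rng = np.random.default_rng(0)
midpoints_true = 25.0 - 0.6 * np.log10(data_budgets)  # toy ground truth: more data => earlier emergence
emergence_points = []
for mid in midpoints_true:
    acc = sigmoid(log_compute, mid, 2.0, 0.9) + rng.normal(0.0, 0.01, log_compute.size)
    popt, _ = curve_fit(sigmoid, log_compute, acc, p0=[23.0, 1.0, 1.0])
    emergence_points.append(popt[0])             # fitted emergence point for this data budget

# Fit how the emergence point moves with log(data) and extrapolate toward ~1 example.
slope, intercept = np.polyfit(np.log10(data_budgets), emergence_points, 1)
print(f"predicted few-shot emergence near 10^{intercept:.1f} FLOPs of pretraining compute")
```

The sigmoid-in-log-compute form is just a convenient stand-in for an accuracy curve with a sharp transition; the synthetic midpoints play the role of finetuned GPT-N checkpoints.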
Eric Wallace (@eric_wallace_) 's Twitter Profile Photo

Chain-of-thought reasoning provides a natural avenue for improving model safety. Today we are publishing a paper on how we train the "o" series of models to think carefully through unsafe prompts: openai.com/index/delibera……

Sam Altman (@sama) 's Twitter Profile Photo

today we are introducing codex. it is a software engineering agent that runs in the cloud and does tasks for you, like writing a new feature or fixing a bug. you can run many tasks in parallel.