Alex Pan (@aypan_17)'s Twitter Profile
Alex Pan

@aypan_17

CS PhD @UCBerkeley working on LLM safety and interpretability

ID: 1602117652889178113

Link: http://aypan17.github.io · Joined: 12-12-2022 01:47:01

29 Tweets

331 Followers

202 Following

Aran Komatsuzaki (@arankomatsuzaki)'s Twitter Profile Photo

Feedback Loops With Language Models Drive In-Context Reward Hacking

Shows that feedback loops can cause in-context reward hacking, where the LLM at test-time optimizes an objective but creates negative side effects in the process

arxiv.org/abs/2402.06627
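
The mechanism described here is easiest to see as a loop: the model's output changes the world, and only a proxy measurement of the changed world is fed back into the next prompt. Below is a minimal toy sketch of that loop; `call_llm`, `world_step`, and `proxy_metric` are hypothetical stand-ins, not the paper's models or environments.

```python
# Minimal sketch of a test-time feedback loop that can drive in-context
# reward hacking. All functions are hypothetical stand-ins, not the
# environments or models studied in the paper.

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; returns an 'action' given the prompt."""
    return f"action chosen to maximize the stated metric given: {prompt[-60:]}"

def world_step(state: dict, action: str) -> dict:
    """Placeholder environment update: the action changes the world."""
    state = dict(state)
    state["engagement"] += 1   # the proxy objective goes up ...
    state["factuality"] -= 1   # ... while an unmeasured property degrades
    return state

def proxy_metric(state: dict) -> float:
    """Only the proxy objective is observed and fed back to the model."""
    return state["engagement"]

state = {"engagement": 0, "factuality": 10}
for step in range(5):
    prompt = (f"Current engagement score: {proxy_metric(state)}. "
              "Propose the next post to maximize engagement.")
    action = call_llm(prompt)          # the model optimizes the in-context objective
    state = world_step(state, action)  # and its output feeds back into the world

print("proxy objective:", state["engagement"])  # rises every step
print("side effect:", state["factuality"])      # silently erodes
```
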
Dan Hendrycks (@danhendrycks)'s Twitter Profile Photo

Can hazardous knowledge be unlearned from LLMs without harming other capabilities?

We’re releasing the Weapons of Mass Destruction Proxy (WMDP), a dataset about weaponization, and we create a way to unlearn this knowledge.

📝arxiv.org/abs/2403.03218
🔗wmdp.ai
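
For context on what "unlearning" optimizes, here is a generic baseline formulation sketched on a toy model: push the loss up on a forget set while keeping it low on a retain set. This is only a common baseline framing, not the specific method the paper proposes; see the paper and wmdp.ai for the actual approach.

```python
# Generic unlearning objective on a toy model: raise loss on a "forget" set
# while preserving loss on a "retain" set. A common baseline formulation,
# NOT the specific method introduced in the WMDP paper.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
model = torch.nn.Linear(16, 4)                     # stand-in for an LLM
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

forget_x, forget_y = torch.randn(32, 16), torch.randint(0, 4, (32,))
retain_x, retain_y = torch.randn(32, 16), torch.randint(0, 4, (32,))

alpha = 0.5                                        # forget/retain trade-off
for step in range(100):
    forget_loss = F.cross_entropy(model(forget_x), forget_y)
    retain_loss = F.cross_entropy(model(retain_x), retain_y)
    # Ascend on the forget set, descend on the retain set.
    loss = -alpha * forget_loss + retain_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"forget loss: {forget_loss.item():.3f} (should rise)")
print(f"retain loss: {retain_loss.item():.3f} (should stay low)")
```
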
Shreyas Kapur (@shreyaskapur)'s Twitter Profile Photo

My first PhD paper!🎉We learn *diffusion* models for code generation that learn to directly *edit* syntax trees of programs. The result is a system that can incrementally write code, see the execution output, and debug it. 🧵1/n
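
The outer loop the thread describes (write code, run it, look at the output, edit the tree) can be sketched with the standard library alone; the diffusion model over syntax trees is replaced by a hypothetical `propose_edit` stub here, so only the execute-and-observe plumbing is concrete.

```python
# Sketch of the write-execute-debug loop described in the thread.
# propose_edit() is a hypothetical stub standing in for the learned model
# that edits the program's syntax tree.
import ast
import io
import contextlib

def propose_edit(source: str, feedback: str) -> str:
    """Placeholder for a model that edits the program's syntax tree."""
    tree = ast.parse(source)
    # A real system would modify `tree` conditioned on `feedback`;
    # here we only round-trip it to mark where the edit would happen.
    return ast.unparse(tree)

def run_program(source: str) -> str:
    """Execute candidate code and capture its output as feedback."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(compile(source, "<candidate>", "exec"), {})
        return buf.getvalue()
    except Exception as e:
        return f"ERROR: {e}"

program = "print(sum(range(10)))"
for step in range(3):
    output = run_program(program)             # see the execution output ...
    program = propose_edit(program, output)   # ... and edit the tree in response
    print(f"step {step}: output={output.strip()!r}")
```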

Erik Jones (@erikjones313)'s Twitter Profile Photo

Model developers try to train “safe” models that refuse to help with malicious tasks like hacking... but in new work with Jacob Steinhardt and Anca Dragan, we show that such models still enable misuse: adversaries can combine multiple safe models to bypass safeguards 1/n

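Structurally, the combination being described is plain orchestration: split a task into subtasks, route each subtask to a different model, and merge the answers. The sketch below shows only that routing pattern, with hypothetical model stubs and a deliberately benign placeholder task; it is not the paper's attack or prompts.

```python
# Neutral sketch of multi-model composition: a task is split into subtasks,
# each routed to a different model, and the pieces are recombined.
# Both "models" are hypothetical stubs; the task is intentionally benign.

def frontier_model(prompt: str) -> str:
    """Stand-in for a capable, safety-trained model."""
    return f"[frontier answer to: {prompt}]"

def weak_model(prompt: str) -> str:
    """Stand-in for a weaker model with fewer safeguards."""
    return f"[weak-model answer to: {prompt}]"

def decompose(task: str) -> list[str]:
    """Split a task into subtasks that each look innocuous in isolation."""
    return [f"{task} -- subtask {i}" for i in range(1, 4)]

def combine(parts: list[str]) -> str:
    return "\n".join(parts)

task = "summarize three public datasets"   # benign placeholder task
subtasks = decompose(task)
parts = [frontier_model(s) if i % 2 == 0 else weak_model(s)
         for i, s in enumerate(subtasks)]
print(combine(parts))
```
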
Video Arena (@aivideoarena)'s Twitter Profile Photo

🚀 Just Launched: VideoArena!🎥 Discover head-to-head comparisons of video clips generated from the same prompts across top text-to-video models. Compare outputs from 7 leading models and we're adding more soon! 🔗 Check out the leaderboard: videoarena.tv #Text2Video

Grace Luo (@graceluo_)'s Twitter Profile Photo

In a new preprint, we show that VLMs can perform cross-modal tasks... since text ICL 📚, instructions 📋, and image ICL 🖼️ are compressed into similar task representations. See “Task Vectors are Cross-Modal”, work w/ Trevor Darrell, Amir Bar. task-vectors-are-cross-modal.github.io
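
A task vector, in the usual formulation, is a hidden state extracted from in-context demonstrations and patched into a zero-shot forward pass. The sketch below shows that extract-then-patch operation on GPT-2; the layer choice, prompts, and last-token patching site are illustrative assumptions, and the cross-modal (image) side of the paper is not shown.

```python
# Sketch of the standard task-vector operation (extract, then patch) on GPT-2.
# Layer, prompts, and patch site are illustrative choices, not the paper's setup.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # arbitrary middle block

def last_token_hidden(prompt: str) -> torch.Tensor:
    """Hidden state of the final token after block LAYER for a given prompt."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    # hidden_states[0] is the embedding output, so block LAYER's output is index LAYER + 1.
    return hs[LAYER + 1][0, -1]

# 1) Extract a task vector: average the representation over ICL prompts
#    that all demonstrate the same task (here: country -> capital).
icl_prompts = [
    "France -> Paris\nJapan -> Tokyo\nItaly ->",
    "Spain -> Madrid\nEgypt -> Cairo\nCanada ->",
]
task_vector = torch.stack([last_token_hidden(p) for p in icl_prompts]).mean(0)

# 2) Patch the task vector into a zero-shot query (no demonstrations).
def patch_hook(module, inputs, output):
    hidden = output[0]
    hidden[0, -1] = task_vector        # overwrite the last-token state
    return (hidden,) + output[1:]

query = "Germany ->"
ids = tok(query, return_tensors="pt").input_ids
handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    logits = model(ids).logits
handle.remove()
print(tok.decode(logits[0, -1].argmax().item()))  # ideally a capital-city-like continuation
```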

Grace Luo (@graceluo_)'s Twitter Profile Photo

✨New preprint: Dual-Process Image Generation! We distill *feedback from a VLM* into *feed-forward image generation*, at inference time. The result is flexible control: parameterize tasks as multimodal inputs, visually inspect the images with the VLM, and update the generator.🧵
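
Read literally, the recipe is: score the generator's output with a VLM, backpropagate that feedback into the generator's weights at inference time, then sample feed-forward from the updated generator. The toy sketch below mimics only that structure with tiny stand-in networks; `vlm_feedback`, the generator, and the prompt embedding are all hypothetical placeholders, not the paper's models.

```python
# Toy sketch of the dual-process idea: inference-time weight updates driven by
# a VLM-style critic, followed by a feed-forward pass from the updated generator.
# All components are tiny hypothetical stand-ins.
import torch

torch.manual_seed(0)
generator = torch.nn.Sequential(           # stand-in for an image generator
    torch.nn.Linear(8, 64), torch.nn.ReLU(), torch.nn.Linear(64, 3 * 16 * 16)
)
target = torch.randn(3 * 16 * 16)          # stand-in for "what the VLM wants"

def vlm_feedback(image: torch.Tensor) -> torch.Tensor:
    """Hypothetical differentiable critic: higher loss = worse VLM judgment."""
    return torch.nn.functional.mse_loss(image, target)

prompt_embedding = torch.randn(8)          # stand-in for the multimodal prompt
opt = torch.optim.Adam(generator.parameters(), lr=1e-3)

# "Slow" process: a few inference-time weight updates driven by VLM feedback.
for step in range(20):
    image = generator(prompt_embedding)
    loss = vlm_feedback(image)
    opt.zero_grad()
    loss.backward()
    opt.step()

# "Fast" process: the distilled generator now produces the image feed-forward.
with torch.no_grad():
    final_image = generator(prompt_embedding)
print("final critic loss:", vlm_feedback(final_image).item())
```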