Dmitrii Kharlapenko (@dmhook) Twitter Tweets • TwiCopy

Dmitrii Kharlapenko

@dmhook

ID: 1715060484531970048

19-10-2023 17:41:17

8 Tweets

120 Followers

22 Following

Dmitrii Kharlapenko

@dmhook

a year ago

In our research on abstract SAE features, nev and I use LLMs’ own capabilities to explain concepts from their minds. Excited to continue our MATS 6.0 work under the mentorship of Neel Nanda and Arthur Conmy. More cool stuff to come! lesswrong.com/posts/8ev6coxC…

72 likes · 4 replies · 8 retweets

Dmitrii Kharlapenko

@dmhook

a year ago

How interpretable are task vectors? Using our new task vector cleaning method, we find SAE features responsible for detecting and encoding specific ICL tasks. See details in our second MATS 6.0 post with nev, Neel Nanda and Arthur Conmy. lesswrong.com/posts/5FGXmJ3w…

49 likes · 0 replies · 5 retweets

White Circle

@whitecircle_ai

6 months ago

1/ Introducing ⚪️CircleGuardBench — a new benchmark for evaluating AI moderation models. Here’s why it’s cool: – Tests harm detection, jailbreak resistance, false positives, and latency – Covers 17 real-world harm categories – First benchmark designed for production-level

88 likes · 10 replies · 29 retweets

nev

@neverrixx

4 months ago

🧵1/6 SAEs have become a staple of LLM interpretability, but what if we applied them to image generation models? My recent paper with Dmitrii Kharlapenko, Yixiong Hao, afterless, Sheikh Abdur Raheem Ali, and Arthur Conmy adapts SAEs to understand the SOTA diffusion transformer FLUX.1 ⬇️


20 likes · 4 replies · 7 retweets