Nishant Balepur (@nishantbalepur)'s Twitter Profile
Nishant Balepur

@nishantbalepur

CS PhD Student. Trying to find that dog in me @UofMaryland. Babysitting (aligning) + Bullying (evaluating) #LLMs

ID: 768905924475879425

Link: https://nbalepur.github.io/ | Joined: 25-08-2016 20:20:32

985 Tweets

534 Followers

403 Following

Zifan (Sail) Wang (@_zifan_wang)'s Twitter Profile Photo

🧵 1/N) Excited to share our recent work at Scale AI, "Jailbreaking to Jailbreak (J2)".😈 We present a novel LLM-as-red-teamer approach in which a human jailbreaks a refusal-trained LLM to make it willing to jailbreak itself or other LLMs. We refer to this process as

Yuzhen Huang @ ICLR 2025 (@yuzhenh17)'s Twitter Profile Photo

🔍 Are Verifiers Trustworthy in RLVR?
Our paper, Pitfalls of Rule- and Model-based Verifiers, exposes the critical flaws in reinforcement learning verification for mathematical reasoning.

🔑 Key findings:
1️⃣ Rule-based verifiers miss correct answers, especially when presented in
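
To make the first finding concrete, here is a tiny illustrative sketch (not the paper's actual verifiers or data): an exact-match rule-based verifier rejects a correct answer that is merely written in a different surface form, so the RL reward signal gets a false negative.

```python
# Toy illustration of the pitfall: an exact-match "rule-based verifier"
# rejects a correct answer written in an equivalent form, producing a
# false-negative reward for RL. (Illustrative only; not the paper's setup.)
from fractions import Fraction

def rule_based_verify(prediction: str, gold: str) -> bool:
    """Naive rule: exact string match after trimming whitespace."""
    return prediction.strip() == gold.strip()

def numeric_equivalent(prediction: str, gold: str) -> bool:
    """Slightly more forgiving check for this toy numeric case."""
    try:
        return Fraction(prediction.strip()) == Fraction(gold.strip())
    except (ValueError, ZeroDivisionError):
        return prediction.strip() == gold.strip()

gold = "1/2"
for pred in ["1/2", "0.5"]:
    print(f"{pred!r}: rule={rule_based_verify(pred, gold)} "
          f"equivalent={numeric_equivalent(pred, gold)}")
# '1/2': rule=True equivalent=True
# '0.5': rule=False equivalent=True  <- correct answer missed by the rule
```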
Lucy Li (@lucy3_li)'s Twitter Profile Photo

"Tell, Don't Show" was accepted to #ACL2025 Findings! 

Our simple approach for literary topic modeling combines the new (language models) with the old (classic LDA) to yield better topics. A possible addition to your CSS/DH research 🛠️ box

✨📚 arxiv.org/abs/2505.23166
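
As a rough, assumed reading of the pipeline named in the tweet (the paper's actual method may differ): have a language model "tell" what each passage is about in a short description, then fit a classic LDA topic model over those descriptions rather than the raw literary text. A minimal gensim sketch, with the LLM step stubbed out:

```python
# Hedged sketch: LM-written "tell" descriptions feed a classic LDA topic model.
# describe_passage is a hypothetical placeholder for an LLM call; the paper's
# actual prompts and pipeline are not reproduced here.
from gensim import corpora, models

def describe_passage(passage: str) -> str:
    """Hypothetical: ask an LLM to state what the passage is about."""
    raise NotImplementedError("stand-in for an LLM call")

# Pretend these came from the LLM step above.
descriptions = [
    "a widow grieves alone in a cold coastal town",
    "children explore an abandoned lighthouse by the sea",
    "a soldier writes letters home about grief and loss",
]

tokenized = [d.lower().split() for d in descriptions]
dictionary = corpora.Dictionary(tokenized)
bow = [dictionary.doc2bow(doc) for doc in tokenized]

lda = models.LdaModel(bow, num_topics=2, id2word=dictionary,
                      random_state=0, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```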
Nishant Balepur (@nishantbalepur)'s Twitter Profile Photo

Very well-said as always! There are lots of tricks to make reviewers perceive a paper as good, but that doesn't mean it's actually good science. Really interested in seeing how we can measure what "good" means :)

Yong Zheng-Xin (Yong) (@yong_zhengxin)'s Twitter Profile Photo

🧵 Multilingual safety training/eval is now standard practice, but a critical question remains: Is multilingual safety actually solved?

Our new survey with Cohere Labs answers this and dives deep into:
- Language gap in safety research
- Future priority areas

Thread 👇
Ritwik Gupta 🇺🇦 (@ritwik_g)'s Twitter Profile Photo

I'm excited to share that I’ll be joining Univ. of Maryland as an Assistant Professor in Computer Science, where I’ll be launching the Resilient AI and Grounded Sensing Lab. The RAGS Lab will build AI that works in chaotic environments. If you would like to partner, please DM me!

Ai2 (@allen_ai)'s Twitter Profile Photo

RewardBench 2 is here! We took a long time to learn from our first reward model evaluation tool to make one that is substantially harder and more correlated with both downstream RLHF and inference-time scaling.
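
For context on what a benchmark like this typically measures (a sketch under assumptions; RewardBench 2's actual domains, data, and best-of-N-style scoring are not reproduced here): each item pairs a prompt with a preferred and a dispreferred completion, and the reward model is scored on how often it ranks the preferred one higher.

```python
# Sketch of chosen-vs-rejected accuracy for a reward model.
# `reward` is a hypothetical scoring function (e.g. a sequence-classification
# head over prompt+response); the toy data and toy scorer below are made up.
from typing import Callable

def pairwise_accuracy(items, reward: Callable[[str, str], float]) -> float:
    correct = 0
    for item in items:
        chosen_score = reward(item["prompt"], item["chosen"])
        rejected_score = reward(item["prompt"], item["rejected"])
        correct += chosen_score > rejected_score
    return correct / len(items)

items = [
    {"prompt": "What is 2+2?", "chosen": "2+2 equals 4.", "rejected": "5"},
    {"prompt": "Name a prime.", "chosen": "7 is prime.", "rejected": "9 is prime."},
]
# Toy "reward model" that just prefers longer answers.
toy_reward = lambda prompt, response: float(len(response))
print(pairwise_accuracy(items, toy_reward))  # 0.5 with this toy scorer
```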
Jordan Boyd-Graber (@boydgraber)'s Twitter Profile Photo

Do you like trivia? Can you spot when AI is feeding you BS? Or can you make AIs turn themselves inside out? Then on June 14 at College Park (or June 21 online), we have a competition for you.

Saumya Malik (@saumyamalik44)'s Twitter Profile Photo

I’m thrilled to share RewardBench 2 📊— We created a new multi-domain reward model evaluation that is substantially harder than RewardBench, we trained and released 70 reward models, and we gained insights about reward modeling benchmarks and downstream performance!

Chau Minh Pham (@chautmpham)'s Twitter Profile Photo

🤔 What if you gave an LLM thousands of random human-written paragraphs and told it to write something new -- while copying 90% of its output from those texts?

🧟 You get what we call a Frankentext!

💡 Frankentexts are surprisingly coherent and tough for AI detectors to flag.
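
A small sketch of how a copy constraint like the 90% figure above could be measured (an assumed word n-gram overlap check, not necessarily the paper's definition): count the fraction of the output's n-grams that appear verbatim in the pool of source paragraphs.

```python
# Sketch: fraction of an output's word 5-grams that also occur verbatim in a
# pool of human-written source paragraphs. The Frankentext paper's actual
# copy constraint/metric may be defined differently.
def ngram_list(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def copy_rate(output: str, sources: list[str], n: int = 5) -> float:
    out_ngrams = ngram_list(output.lower().split(), n)
    if not out_ngrams:
        return 0.0
    source_ngrams = set()
    for s in sources:
        source_ngrams.update(ngram_list(s.lower().split(), n))
    copied = sum(g in source_ngrams for g in out_ngrams)
    return copied / len(out_ngrams)

sources = [
    "the storm rolled in from the harbor just after midnight",
    "she kept the letters in a tin box under the floorboards",
]
output = ("the storm rolled in from the harbor just after midnight , and "
          "she kept the letters in a tin box under the floorboards")
print(f"{copy_rate(output, sources):.2f}")  # ~0.68: most 5-grams are copied
```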
Majeed Kazemi (@majeedkazemi)'s Twitter Profile Photo

Excited to share that I'll be joining the CS department at University of Alberta as an Assistant Professor in January 2026 where I will be affiliated with Amii.

I'll be recruiting 2-3 PhD/MSc students and establishing a research lab on AI in Education and Human–AI Interaction.
Samaya AI (@samaya_ai)'s Twitter Profile Photo

Evaluating long-form answers to complex technical questions is very challenging.

Existing methods fall short in this setting.

At Samaya, we built Criteria-Eval, a checklist-based evaluation that aligns with how domain experts judge answers.

🧵samaya.ai/blog/criteria-…
✍️
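
The general shape of a checklist-based evaluation like the one described above might look as follows (a hedged sketch: the criteria, the judge, and the aggregation here are hypothetical placeholders, not Samaya's Criteria-Eval): each answer is scored by the fraction of expert-written criteria a judge deems satisfied.

```python
# Sketch of checklist-style scoring: the fraction of expert criteria that a
# judge says an answer satisfies. `toy_judge` is a naive keyword stand-in for
# what would normally be an LLM judge; Criteria-Eval's real prompts, criteria,
# and aggregation are not reproduced here.
from typing import Callable

def checklist_score(answer: str, criteria: list[str],
                    judge_satisfies: Callable[[str, str], bool]) -> float:
    satisfied = [c for c in criteria if judge_satisfies(answer, c)]
    return len(satisfied) / len(criteria)

criteria = [
    "states the headline revenue figure",
    "mentions the year-over-year change",
    "cites the reporting period",
]

def toy_judge(answer: str, criterion: str) -> bool:
    # Naive keyword check in place of an LLM judge call.
    return any(word in answer.lower() for word in criterion.lower().split()[-2:])

answer = "Revenue was $4.2B in Q3 FY2024, up 12% year-over-year."
print(f"{checklist_score(answer, criteria, toy_judge):.2f}")  # 0.67
```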
Nishant Balepur (@nishantbalepur)'s Twitter Profile Photo

I'm now a Ph.D. candidate! 🎉🥳

A few weeks ago, I proposed my thesis: "Teaching AI to Answer Questions with Reasoning that Actually Helps You". Thanks to my amazing committee + friends UMD CLIP Lab! 🫶

I won't be back in Maryland for a while, some exciting things coming soon 👀
Lindia Tjuatja (@lltjuatja)'s Twitter Profile Photo

When it comes to text prediction, where does one LM outperform another? If you've ever worked on LM evals, you know this question is a lot more complex than it seems. In our new #acl2025 paper, we developed a method to find fine-grained differences between LMs: 

🧵1/9
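
As a naive starting point for the question the thread raises (not the paper's method, which is more fine-grained than this), one can line up two causal LMs that share a tokenizer, here the small public gpt2 and distilgpt2 checkpoints as placeholders, and compare their per-token log-probabilities on the same text:

```python
# Naive sketch: per-token log-probability comparison between two causal LMs
# that share a tokenizer (gpt2 / distilgpt2 are placeholder checkpoints).
# The ACL 2025 paper's actual method for finding fine-grained differences
# between LMs is not reproduced here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def per_token_logprobs(model_name: str, text: str):
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[:, :-1], dim=-1)  # predict next token
    targets = ids[:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]
    return tokenizer.convert_ids_to_tokens(targets[0].tolist()), token_lp.tolist()

text = "The capital of France is Paris."
tokens, lp_a = per_token_logprobs("gpt2", text)
_, lp_b = per_token_logprobs("distilgpt2", text)
for token, a, b in zip(tokens, lp_a, lp_b):
    print(f"{token!r:>12}  gpt2={a:7.2f}  distilgpt2={b:7.2f}  diff={a - b:+.2f}")
```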
Omar Shaikh (@oshaikh13)'s Twitter Profile Photo

What if LLMs could learn your habits and preferences well enough (across any context!) to anticipate your needs? In a new paper, we present the General User Model (GUM): a model of you built from just your everyday computer use. 🧵