Danny To Eun Kim (@teknology.bsky.social) (@teknologyy)'s Twitter Profile
Danny To Eun Kim (@teknology.bsky.social)

@teknologyy

PhD student @LTIatCMU working with @841io on NLP & IR | Prev: MEng @ai_ucl

ID: 1422776859037618176

Link: https://kimdanny.github.io/
Date: 04-08-2021 04:30:30

213 Tweets

488 Followers

1.1K Following

So Yeon (Tiffany) Min on Industry Job Market (@soyeontiffmin)

🚨🚨 Preprint Alert 🚨🚨 🚀🚀 As AI become agents 🤖, how can we reliably delegate tasks to them, if they cannot communicate their limitations😭 or ask for help or test-time compute 🧑‍🚒 when needed? We present our new pre-print **Self-Regulation and Requesting Interventions**

Yiqing Xie (@yiqingxienlp)

How to construct repo-level coding environments in a scalable way? Check out RepoST: an automated framework to construct repo-level environments using Sandbox Testing (repost-code-gen.github.io) Models trained with RepoST data can generalize well to other datasets (e.g., RepoEval)

Seungone Kim @ NAACL2025 (@seungonekim)

#NLProc New paper on "evaluation-time scaling", a new dimension to leverage test-time compute! We replicate the test-time scaling behaviors observed in generators (e.g., o1, r1, s1) with evaluators by forcing them to generate additional reasoning tokens. arxiv.org/abs/2503.19877

Fernando Diaz (@841io)

If you're interested in OpenAI including shopping results, you might also be interested in Danny To Eun Kim (@teknology.bsky.social)'s paper relating retrieval diversity/fairness and generation by downstream RAG models. This has implications for individuals selling products online. arxiv.org/abs/2409.11598

Athiya Deviyani (@athiyad)

Ever trusted a metric that works great on average, only for it to fail in your specific use case? In our #NAACL2025 paper (w/ Fernando Diaz), we show why global evaluations are not enough and why context matters more than you think. 📄 aclanthology.org/2025.findings-… #NLP #Evaluation (🧵1/9)

Shaily (@shaily99)

🖋️ Curious how writing differs across (research) cultures? 🚩 Tired of “cultural” evals that don't consult people? We engaged with researchers to identify & measure ✨cultural norms✨in scientific writing, and show that❗LLMs flatten them❗ 📜 arxiv.org/abs/2506.00784 1/11
