Karolina Stanczak (@karstanczak) 's Twitter Profile
Karolina Stanczak

@karstanczak

Postdoc in NLP @Mila_Quebec & @mcgillu | Previously PhD candidate @uni_copenhagen @CopeNLU

ID: 1285579351950598144

Link: https://karstanczak.github.io/ | Joined: 21-07-2020 14:16:29

109 Tweets

662 Followers

557 Following

Karolina Stanczak (@karstanczak) 's Twitter Profile Photo

Excited to be organizing the VLMs4All workshop at #CVPR2025! 🎉 The workshop features fantastic speakers, a short-paper track, and two challenges, including one based on CulturalVQA. Don’t miss it!

P Shravan Nayak (@pshravannayak) 's Twitter Profile Photo

🚀 Super excited to announce UI-Vision: the largest and most diverse desktop GUI benchmark for evaluating agents in real-world desktop GUIs in offline settings. 📄 Paper: arxiv.org/abs/2503.15661 🌐 Website: uivision.github.io 🧵 Key takeaways 👇

Karolina Stanczak (@karstanczak) 's Twitter Profile Photo

Reviewers needed! 📢 The 6th Workshop on Gender Bias in NLP at #ACL2025 (Vienna, Aug 1st) is looking for you! Sign up to review: forms.gle/VkPU4vS4EacEWs… #NLProc

VLMs4All - CVPR 2025 Workshop (@vlms4all) 's Twitter Profile Photo

🔔 Reminder & Call for #VLMs4All @ #CVPR2025! Help shape the future of culturally aware & geo-diverse VLMs: ⚔️ Challenges: Deadline: Apr 15 🔗sites.google.com/view/vlms4all/… 📄 Papers (4pg): Submit work on benchmarks, methods, metrics! Deadline: Apr 22 🔗sites.google.com/view/vlms4all/… Join us!

Amirhossein Kazemnejad (@a_kazemnejad) 's Twitter Profile Photo

A key reason RL for web agents hasn’t fully taken off is the lack of robust reward models. No matter the algorithm (PPO, GRPO), we can’t reliably do RL without a reward signal. With AgentRewardBench, we introduce the first benchmark aiming to kickstart progress in this space.

Nicholas Meade (@ncmeade) 's Twitter Profile Photo

Check out Xing Han Lu's new benchmark for evaluating reward models for web tasks! AgentRewardBench has rich human annotations of trajectories from top LLM web agents across realistic web tasks and will greatly help steer the design of future reward models.

Karolina Stanczak (@karstanczak) 's Twitter Profile Photo

Exciting release! AgentRewardBench offers a much-needed closer look at evaluating agent capabilities: automatic vs. human eval. Important findings here, especially on the popular LLM judges. Amazing work by Xing Han Lu & team!

Axel Darmouni (@adarmouni) 's Twitter Profile Photo

Benchmarking the performance of Models as judges of Agentic Trajectories 📖 Read of the day, season 3, day 30: « AgentRewardBench: Evaluating Automatic Evaluations of Web Trajectories », by Xing Han Lu, Amirhossein Kazemnejad et al from McGill University and Mila - Institut québécois d'IA The core idea of the

VLMs4All - CVPR 2025 Workshop (@vlms4all) 's Twitter Profile Photo

🚨 Deadline Extension Alert for #VLMs4All! 🚨 We have extended the challenge submission deadline 🛠️ New challenge deadline: Apr 22 Show your stuff in the CulturalVQA and GlobalRG challenges! 👉 sites.google.com/view/vlms4all/… Spread the word and keep those submissions coming! 🌍✨

WebAgentlab (@webagentlab) 's Twitter Profile Photo

13/🧵 AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories. AgentRewardBench is a benchmark designed to evaluate the effectiveness of Large Language Model judges in assessing web agent performance, revealing that while LLMs show potential, no single model

VLMs4All - CVPR 2025 Workshop (@vlms4all) 's Twitter Profile Photo

📢 Deadline Extended! The paper submission deadline for #VLMs4All Workshop at CVPR 2025 has been extended to Monday Apr 28! 💡 We encourage submissions that explore multicultural perspectives in VLMs 🔗 openreview.net/group?id=thecv… 📍 Let's shape the future of globally inclusive AI!

Mila - Institut québécois d'IA (@mila_quebec) 's Twitter Profile Photo

Congratulations to Mila members Ada Tur, Gaurav Kamath and Siva Reddy for their SAC award at #NAACL2025! Check out Ada's talk in Session I: Oral/Poster 6. Paper: arxiv.org/abs/2502.05670

VLMs4All - CVPR 2025 Workshop (@vlms4all) 's Twitter Profile Photo

🚀 Important Update! We're reaching out to collect email IDs of the CulturalVQA and GlobalRG challenge participants for time-sensitive communications, including informing the winning teams. ALL participating teams please fill out the forms below ASAP (ideally within 24 hours). 👇

VLMs4All - CVPR 2025 Workshop (@vlms4all) 's Twitter Profile Photo

🗓️ Save the date! It's official: The VLMs4All Workshop at #CVPR2025 will be held on June 12th! Get ready for a full day of speakers, posters, and a panel discussion on making VLMs more geo-diverse and culturally aware 🌐 Check out the schedule below!

Ziling Cheng (@ziling_cheng) 's Twitter Profile Photo

Do LLMs hallucinate randomly? Not quite. Our #ACL2025 (Main) paper shows that hallucinations under irrelevant contexts follow a systematic failure mode — revealing how LLMs generalize using abstract classes + context cues, albeit unreliably. 📎 Paper: arxiv.org/abs/2505.22630 1/n

Aishwarya Agrawal (@aagrawalaa) 's Twitter Profile Photo

My lab’s contributions at #CVPR2025: -- Organizing VLMs4All - CVPR 2025 Workshop workshop (with 2 challenges) sites.google.com/corp/view/vlms… -- 2 main conference papers (1 highlight, 1 poster) cvpr.thecvf.com/virtual/2025/p… (highlight) cvpr.thecvf.com/virtual/2025/p… (poster) -- 4 workshop papers (2 spotlight talks, 2

VLMs4All - CVPR 2025 Workshop (@vlms4all) 's Twitter Profile Photo

Our VLMs4All workshop is taking place today! 📅 on Thursday, June 12 ⏲️ from 9AM CDT 🏛️ in Room 104E Join us today at #CVPR2025 for amazing speakers, posters, and a panel discussion on making VLMs more geo-diverse and culturally aware!

Xing Han Lu (@xhluca) 's Twitter Profile Photo

"Build the web for agents, not agents for the web" This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).

"Build the web for agents, not agents for the web"

This position paper argues that rather than forcing web agents to adapt to UIs designed for humans, we should develop a new interface optimized for web agents, which we call Agentic Web Interface (AWI).