Zeyi Liao (@liaozeyi)'s Twitter Profile
Zeyi Liao

@liaozeyi

PhD Student at @osunlp

ID: 1480385536678268929

Link: https://lzy37ld.github.io/

Joined: 10-01-2022 03:46:43

216 Tweets

241 Followers

505 Following

Vardaan Pahuja (@vardaanpahuja)

🚀 Thrilled to unveil the most exciting project of my PhD: Explorer — Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents TL;DR: A scalable multi-agent pipeline that leverages exploration for diverse web agent trajectory synthesis. 📄 Paper:

Jaylen Jones (@jaylen_jonesnlp)

So thrilled to share "RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments"! With our novel hybrid testing sandbox, systematic analysis using our RTC-Bench benchmark shows even the new Claude 4 Opus hits a 48% Attack Success Rate!🤯

Huan Sun (OSU) (@hhsun1)

Realistic adversarial testing of Computer-Use Agents (CUAs) to identify their vulnerabilities and make them safer and more secure is … hard. Is Anthropic Claude 4 Opus more robust to indirect prompt injection than previous versions like Claude 3.7? Not really. Why hard?

Zifan (Sail) Wang (@_zifan_wang)

It is great to see new works like this focus on building a sandbox environment to test the safety of autonomous agents. This type of work can unlock a lot of use cases and assists the testing of different threat models.

Yu Su @#ICLR2025 (@ysu_nlp)

I believe computer use, in principle, is much harder than math/coding for current AI. The digital world encompasses a much larger part of the complexity in this world. The goals are often vastly underspecified and require accessing and understanding broad context (in users’ head

Ekdeep Singh Lubana (@ekdeepl)

🚨 New paper alert! Linear representation hypothesis (LRH) argues concepts are encoded as **sparse sum of orthogonal directions**, motivating interpretability tools like SAEs. But what if some concepts don’t fit that mold? Would SAEs capture them? 🤔 1/11

Botao Yu (@botaoyu24)

🔬 Introducing ChemMCP, the first MCP-compatible toolkit for empowering AI models with advanced chemistry capabilities! In recent years, we’ve seen rising interest in tool-using AI agents across domains. Particularly in scientific domains like chemistry, LLMs alone still fall

Yifei Li (@yifeilipku)

📢 Introducing AutoSDT, a fully automatic pipeline that collects data-driven scientific coding tasks at scale! We use AutoSDT to collect AutoSDT-5K, enabling open co-scientist models that rival GPT-4o on ScienceAgentBench! Thread below ⬇️ (1/n)

Yu Su @#ICLR2025 (@ysu_nlp)

🔎Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis⚠️ Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge - 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor -
