Zeyi Liao (@liaozeyi) Twitter Tweets • TwiCopy

Vardaan Pahuja

🚀 Thrilled to unveil the most exciting project of my PhD: Explorer — Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents TL;DR: A scalable multi-agent pipeline that leverages exploration for diverse web agent trajectory synthesis. 📄 Paper:

53 likes · 5 replies · 23 reposts

Jaylen Jones

@jaylen_jonesnlp

2 months ago

So thrilled to share "RedTeamCUA: Realistic Adversarial Testing of Computer-Use Agents in Hybrid Web-OS Environments"! With our novel hybrid testing sandbox, systematic analysis using our RTC-Bench benchmark shows even the new Claude 4 Opus hits a 48% Attack Success Rate!🤯

6 likes · 1 reply · 3 reposts

Huan Sun (OSU)

@hhsun1

2 months ago

Realistic adversarial testing of Computer-Use Agents (CUAs) to identify their vulnerabilities and make them safer and more secure is … hard. Is Anthropic Claude 4 Opus more robust to indirect prompt injection than previous versions like Claude 3.7? Not really. Why hard?

57 likes · 3 replies · 24 reposts

Zifan (Sail) Wang

@_zifan_wang

2 months ago

It is great to see new works like this focus on building a sandbox environment to test the safety of autonomous agents. This type of work can unlock a lot of use cases and assist the testing of different threat models.

8 likes · 0 replies · 1 repost

Yu Su @#ICLR2025

@ysu_nlp

2 months ago

I believe computer use, in principle, is much harder than math/coding for current AI. The digital world encompasses a much larger part of the complexity in this world. The goals are often vastly underspecified and require accessing and understanding broad context (in users’ head

56 likes · 7 replies · 7 reposts

Ekdeep Singh Lubana

@ekdeepl

2 months ago

🚨 New paper alert! Linear representation hypothesis (LRH) argues concepts are encoded as **sparse sum of orthogonal directions**, motivating interpretability tools like SAEs. But what if some concepts don’t fit that mold? Would SAEs capture them? 🤔 1/11

378 likes · 5 replies · 60 reposts

Botao Yu

@botaoyu24

2 months ago

🔬 Introducing ChemMCP, the first MCP-compatible toolkit for empowering AI models with advanced chemistry capabilities! In recent years, we’ve seen rising interest in tool-using AI agents across domains. Particularly in scientific domains like chemistry, LLMs alone still fall

66 likes · 3 replies · 30 reposts

Yifei Li

@yifeilipku

2 months ago

📢 Introducing AutoSDT, a fully automatic pipeline that collects data-driven scientific coding tasks at scale! We use AutoSDT to collect AutoSDT-5K, enabling open co-scientist models that rival GPT-4o on ScienceAgentBench! Thread below ⬇️ (1/n)

72 likes · 4 replies · 25 reposts

Yu Su @#ICLR2025

@ysu_nlp

a month ago

🔎Agentic search like Deep Research is fundamentally changing web search, but it also brings an evaluation crisis⚠️ Introducing Mind2Web 2: Evaluating Agentic Search with Agents-as-a-Judge - 130 tasks (each requiring avg. 100+ webpages) from 1,000+ hours of expert labor -

211 likes · 3 replies · 45 reposts
