Botao Yu (@botaoyu24)'s Twitter Profile

PhD student @ OSU NLP Group @osunlp. Focus on NLP and Deep Learning.

ID: 1572602016681304064

Link: https://btyu.github.io
Joined: 21-09-2022 15:02:27

72 Tweets

130 Followers

191 Following

Vardaan Pahuja (@vardaanpahuja)

🚀 Thrilled to unveil the most exciting project of my PhD: Explorer — Scaling Exploration-driven Web Trajectory Synthesis for Multimodal Web Agents TL;DR: A scalable multi-agent pipeline that leverages exploration for diverse web agent trajectory synthesis. 📄 Paper:

Zeyi Liao (@liaozeyi)

⁉️Can you really trust Computer-Use Agents (CUAs) to control your computer⁉️ Not yet, Anthropic Opus 4 shows an alarming 48% Attack Success Rate against realistic internet injection❗️ Introducing RedTeamCUA: realistic, interactive, and controlled sandbox environments for

Huan Sun (OSU) (@hhsun1)

Realistic adversarial testing of Computer-Use Agents (CUAs) to identify their vulnerabilities and make them safer and more secure is … hard. Is Anthropic Claude 4 Opus more robust to indirect prompt injection than previous versions like Claude 3.7? Not really. Why hard?

Yu Su @#ICLR2025 (@ysu_nlp)

📈 Scaling may be hitting a wall in the digital world, but it's only beginning in the biological world! We trained a foundation model on 214M images of ~1M species (50% of named species on Earth 🐨🐠🌻🦠) and found emergent properties capturing hidden regularities in nature. 🧵

Jianyang Gu (@vimar_gu)

It’s so exciting to see BioCLIP 2 demonstrate a biologically meaningful embedding space while trained only to distinguish species. Can’t wait to see more applications of BioCLIP 2 in solving real-world problems. I’m attending #CVPR2025 in Nashville. Happy to chat about it!

Yifei Li (@yifeilipku)

📢 Introducing AutoSDT, a fully automatic pipeline that collects data-driven scientific coding tasks at scale! We use AutoSDT to collect AutoSDT-5K, enabling open co-scientist models that rival GPT-4o on ScienceAgentBench! Thread below ⬇️ (1/n)

Saining Xie (@sainingxie)

Had a great time at this CVPR community-building workshop---lots of fun discussions and some really important insights for early-career researchers. I also gave a talk on "Research as an Infinite Game." Here are the slides: canva.com/design/DAGp0iR…

Botao Yu (@botaoyu24)

Holy moly, what a massive effort, proud to be part of it! 🥳 As agentic search continues to evolve and increasingly support our work and daily lives, Mind2Web 2 arrives as a timely, rigorous benchmark for evaluation and progress tracking. (Now get to work, agent builders! This

Botao Yu (@botaoyu24)

⬇️ Check out SDE-Harness, our general framework for evaluating LLMs/agents on scientific discovery. It features easy integration, broad LLM support, dynamic prompting, comprehensive logging, and customizable metrics, applicable for all domains and tasks.

Huan Sun (OSU) (@hhsun1)

🚨 Postdoc Hiring: I am looking for a postdoc to work on rigorously evaluating and advancing the capabilities and safety of computer-use agents (CUAs), co-advised with Yu Su OSU NLP Group. We welcome strong applicants with experience in CUAs, long-horizon reasoning/planning,

Jianyang Gu (@vimar_gu)

Announcing the NeurIPS Conference 2025 workshop on Imageomics: Discovering Biological Knowledge from Images Using AI! The workshop focuses on the interdisciplinary field between machine learning and biological science. We look forward to seeing you in San Diego! #NeurIPS2025

Yu Su @#ICLR2025 (@ysu_nlp)

Safety is one of the biggest blockers for computer use agents: how can I trust an agent won’t accidentally do something consequential without my permission? We collect and release the first large-scale dataset for detecting consequential actions on the web, and train the best

Boyuan Zheng (@boyuan__zheng)

Remember “Son of Anton” from the Silicon Valley show(Silicon Valley)? The experimental AI that “efficiently” orders 4,000 lbs of meat while looking for a cheap burger and “fixes” a bug by deleting all the code? It’s starting to look a lot like reality. Even 18 months ago, my own

Ben Blaiszik (@benblaiszik)

I'll be sitting down for a chat with Chenru Duan, founder of Deep Principle this afternoon. We'll be talking about topics including how to benchmark LLMs for scientific tasks and journey from academia to startup. Anything you'd like to hear about? x.com/chenru_duan/st…