Dhruv Singh (@ds3638)'s Twitter Profile
Dhruv Singh

@ds3638

cto at @honeyhiveai

ID: 1436525209176002560

https://honeyhive.ai · Joined 11-09-2021 03:01:29

518 Tweets

112 Followers

468 Following

An Vo (@an_vo12)

🚨 Our latest work shows that SOTA VLMs (o3, o4-mini, Sonnet, Gemini Pro) fail at counting legs due to bias⁉️

See simple cases where VLMs get it wrong, no matter how you prompt them. 

🧪 Think your VLM can do better? Try it yourself here: vlmsarebiased.github.io/#example-galle…

1/n
#ICML2025
METR (@metr_evals)

We found that Grok 4’s 50%-time-horizon on our agentic multi-step software engineering tasks is about 1hr 50min (with a 95% CI of 48min to 3hr 52min) compared to o3 (previous SOTA) at about 1hr 30min. However, Grok 4’s time horizon is below SOTA at higher success rate thresholds.

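Not from the tweet, but a rough sketch of how a "50% time horizon" of this kind can be estimated: fit success/failure against the log of each task's human-baseline length and solve for the length where the fitted probability crosses 50%. The toy data, the plain logistic-regression fit, and all names below are illustrative assumptions, not METR's actual data or code.

```python
# Rough sketch (not METR's code): estimate a "50% time horizon" by fitting
# success/failure against log2(task length) and solving for the length where
# the fitted success probability crosses 0.5. The task data is made up.
import numpy as np
from sklearn.linear_model import LogisticRegression

task_minutes = np.array([2, 5, 10, 20, 45, 90, 180, 360, 720])  # human baseline length
succeeded    = np.array([1, 1,  1,  1,  1,  1,   0,   0,   0])  # agent outcome per task

X = np.log2(task_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, succeeded)

# p = 0.5 where w * log2(t) + b = 0  =>  t = 2 ** (-b / w)
w, b = clf.coef_[0][0], clf.intercept_[0]
print(f"50% time horizon ≈ {2 ** (-b / w):.0f} minutes")
```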
steve hsu (@hsu_steve)

You can't go back in time... But perhaps look at this book.
  
It's accessible to non-math people but also has insights for VERY mathy people who don't understand real physics. There's even a page that explains "tensors" - might be useful for would-be AI engineers ;-)  

If
ℏεsam (@hesamation)

Manus shared an article on how they manage the context of their agents. It actually includes some pretty practical points.

If you're building or using agents in any shape or form, this can be quite useful.
swyx (@swyx)

current multiples are interesting.

OAI: 15x EOY25 ARR
Ant: 19x EOY25 ARR

details
OAI 300b valuation, 13b ARR now, 20b EOY
Ant 170b valuation, 5b ARR now, 9b EOY

these don't seem... crazy?
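
For reference, those multiples follow directly from the figures quoted above (valuation divided by end-of-year ARR); a minimal sketch in Python, using only the tweet's numbers:

```python
# Sanity-check the forward multiples quoted above: valuation / EOY25 ARR,
# all figures in $B taken from the tweet.
companies = {
    "OpenAI":    {"valuation_b": 300, "arr_eoy_b": 20},
    "Anthropic": {"valuation_b": 170, "arr_eoy_b": 9},
}

for name, c in companies.items():
    multiple = c["valuation_b"] / c["arr_eoy_b"]
    print(f"{name}: {multiple:.0f}x EOY25 ARR")
# OpenAI: 15x EOY25 ARR
# Anthropic: 19x EOY25 ARR
```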

will brown (@willccbb)

i'm increasingly convinced that "transformative ai" is going to look like an abundance of specialized models for everything from drug design to weather sims to robotics to supply chains, not one agent to rule them all. we're going to need a lot more ai researchers

Susan Zhang (@suchenzang)

[episode 120 of frontier lab gossip: OH in the mission]

> **** folks recently bragged about a 100k H100 training run
> wrote a post, got all the likes and shares internally
> then some of them ran the exact same job on 20k H100s instead of 100k, and ended up with the exact same

@levelsio (@levelsio)

People are so confused, it's not about fulfillment

No sane rich person spends all their money

It's about taking out 3% (Safe Withdrawal Rate) * $5M / 12 months of your investment

= receive $12,500/mo forever

This gives you the statistical guarantee you'll have enough money to
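
The withdrawal arithmetic in the tweet, as a minimal Python sketch (the $5M portfolio and 3% rate are the tweet's own figures, not a recommendation):

```python
# Safe-withdrawal arithmetic from the tweet: 3% of a $5M portfolio, paid monthly.
portfolio = 5_000_000   # invested principal, USD (figure from the tweet)
swr = 0.03              # safe withdrawal rate, 3% per year (figure from the tweet)

monthly_income = portfolio * swr / 12
print(f"${monthly_income:,.0f}/mo")   # -> $12,500/mo
```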
Emmanuel Ameisen (@mlpowered)

Friday we published our no-jargon explainer of how LLMs "think". It's been great to see the response - already at 80k views!

We covered why models hallucinate, why they flatter users, and if they are just glorified autocomplete!

If you are curious about LLMs, check it out! 👇
Richard Sutton (@richardssutton)

I was happy to give a more technical talk on how we might create an AI at RLC-2025 and AGI-2025 (video below).

The Oak Architecture: A Vision of Super-Intelligence from Experience

As AI has become a huge industry, to an extent it has lost its way. What is needed to get us back on

Ara (@arafatkatze)

In building AI agents Cline, we've identified three mind viruses. Mind viruses are seductive ideas that sound smart but don't work in practice.
1. Multi-Agent Orchestration
2. RAG (Retrieval Augmented Generation)
3. More Instructions = Better Results
Let's explore why!
tensorqt (@tensorqt)

attention sinks may be a bias in causal transformers. 

as some of you know, i've been writing a long blogpost on attention and its properties as a message-passing operation on graphs. while doing so, i figured i might have found an explanation for why attention sinks may be an
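
Not from the thread, but a minimal sketch of how one might observe the sink empirically: load a small causal LM with attentions enabled and measure how much attention mass later query positions put on token 0. The model choice (gpt2) and the example sentence are assumptions for illustration.

```python
# Minimal sketch: measure the "attention sink" signature, i.e. how much attention
# mass a causal LM's later queries place on the very first token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)
model.eval()

text = "attention sinks may be a bias in causal transformers."
inputs = tok(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.attentions: one tensor per layer, shaped [batch, heads, query_pos, key_pos].
for layer_idx, attn in enumerate(out.attentions):
    mass_on_first = attn[0, :, 1:, 0].mean().item()  # attention from queries 1..n to key 0
    print(f"layer {layer_idx:2d}: mean attention to token 0 = {mass_on_first:.2f}")
```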