Deanna Emery (@deannalemery)'s Twitter Profile
Deanna Emery

@deannalemery

Founding AI Researcher at Quotient AI, ex-astrophysicist, bike tourer 🚴‍♀️ UC Berkeley, Harvard

ID: 1388196676708470791

Joined: 30-04-2021 18:21:10

48 Tweets

48 Followers

94 Following

Julia Neagu (@juliaaneagu)'s Twitter Profile Photo

Most teams only find out their AI is broken when someone complains or churns. Your agents shouldn’t fail silently. We’re launching Quotient AI Detections: a system to catch agent mistakes, identify how they happened, and automatically fix them.

Julia Neagu (@juliaaneagu)'s Twitter Profile Photo

We just launched Quotient AI Detections: the first system that helps teams catch AI failures before their users do. As part of the launch, we partnered with ElevenLabs to offer coupons through the AI Engineer Pack:
→ 1,000,000 extra logs
→ 10,000 free detections
→ $250+

Julia Neagu (@juliaaneagu)'s Twitter Profile Photo

We’re teaming up with Vercel to support the next wave of AI Accelerator builders. Participants and winners get $29K in Quotient credits to monitor their AI agents for failures from day 1, plus hands-on support from our team. 👇Apps close May 17. Apply now!

AI Engineer (@aidotengineer)'s Twitter Profile Photo

Announcing our speakers for the Retrieval + Search track! ⚠️ PSA: Tix nearly sold out, get 'em here: ti.to/software-3/ai-… Featuring:
Aman, Former Founder, Harvey
Jerry Liu, CEO, LlamaIndex 🦙
Julia Neagu, CEO, Quotient AI
changhiskhan, CEO, LanceDB

Deanna Emery (@deannalemery)'s Twitter Profile Photo

Can’t wait to be back at AI Engineer! I’m teaming up with Maitar Asher 🎗️ from tavily to talk about evaluating AI search. We’re sharing a practical eval framework, lessons from real-world deployments, and never-seen-before benchmark results. Hope to see you there!

Julia Neagu (@juliaaneagu)'s Twitter Profile Photo

HypoEval is now available in Quotient AI's OSS judges! It uses SOTA hypothesis generation with just 30 human annotations to create decomposed rubrics, enabling LLMs to score criteria clearly. Beats fine-tuned models (w/ 3x more labels). Thanks Mingxuan (Aldous) Li for contributing!
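
The decomposed-rubric idea can be sketched in a few lines: each criterion is judged independently and the per-criterion scores are averaged. The `judge` callable and function name below are illustrative stand-ins, not Quotient's API; HypoEval itself generates the rubric hypotheses from ~30 human annotations.

```python
# Illustrative sketch only: scoring with a decomposed rubric, where each
# criterion is judged on its own and the results are averaged.
# `judge(response, criterion)` is a stand-in for an LLM asked a single,
# narrow question about one criterion, returning a score in [0, 1].

def score_with_rubric(response: str, rubric: list[str], judge) -> float:
    """Score `response` against each rubric criterion and average the results."""
    if not rubric:
        raise ValueError("rubric must contain at least one criterion")
    scores = [judge(response, criterion) for criterion in rubric]
    return sum(scores) / len(scores)
```

Averaging independent per-criterion judgments is what makes the scores interpretable: a low overall score points directly at the failing criteria.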

Julia Neagu (@juliaaneagu)'s Twitter Profile Photo

Most engineers think you need ground truth data to detect AI hallucinations. You don't.

Extrinsic hallucinations are the real problem: the model misusing the context you gave it.

Here's a primer on how to do table stakes hallucination detection without expensive datasets👇

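
A crude sketch of the idea, assuming nothing about Quotient's actual detector: flag answer sentences whose content words barely overlap the supplied context. A production system would use an NLI model or LLM judge instead of token overlap, but the no-ground-truth shape is the same — you only need the context the model was given, never a reference answer.

```python
# Crude illustration of context-grounded ("extrinsic") hallucination
# detection: no ground-truth answers needed, only the supplied context.
# Each answer sentence is flagged when too few of its content words
# appear anywhere in the context.
import re

def flag_unsupported(answer: str, context: str, threshold: float = 0.5) -> list[str]:
    """Return answer sentences whose content-word overlap with `context`
    falls below `threshold`."""
    context_words = set(re.findall(r"[a-z']+", context.lower()))
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        # Keep only words longer than 3 characters as "content" words.
        words = [w for w in re.findall(r"[a-z']+", sentence.lower()) if len(w) > 3]
        if not words:
            continue
        support = sum(w in context_words for w in words) / len(words)
        if support < threshold:
            flagged.append(sentence)
    return flagged
```

Swapping the overlap score for an entailment probability upgrades this from table stakes to a real detector without changing the interface.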
Julia Neagu (@juliaaneagu)'s Twitter Profile Photo

Just shared the slides from our AI Engineer World Fair talk: Evaluating AI Search – A Practical Framework for Augmented Systems.

As more AI agents rely on real-time data (like the web!), traditional eval approaches are falling behind and don't capture what's actually…

Julia Neagu (@juliaaneagu)'s Twitter Profile Photo

The worst part of building an agent? You don’t know it’s broken until your users tell you. We just dropped a cookbook for a web research agent with real-time monitoring, so you can catch critical issues as they happen. ft. tavily, LangChain, OpenAI, Quotient AI
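
The monitoring idea can be sketched as a thin wrapper around each agent step, assuming a hypothetical step interface (this is not the cookbook's API): log a warning the moment a step raises or returns nothing, instead of waiting for a user to report it.

```python
# Minimal sketch of real-time agent monitoring: wrap each step so failures
# surface as they happen. The step interface here is hypothetical.
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("agent-monitor")

def monitored_step(name: str, fn, *args, **kwargs):
    """Run one agent step, logging a warning on exceptions or empty results."""
    try:
        result = fn(*args, **kwargs)
    except Exception:
        log.warning("step %s raised", name, exc_info=True)
        raise
    if not result:
        log.warning("step %s returned an empty result", name)
    return result
```

In practice the warnings would go to a monitoring backend rather than stderr, but the hook point (one wrapper per step) is the same.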

Julia Neagu (@juliaaneagu)'s Twitter Profile Photo

What did Freddie Vargus see? 👀 Everyone’s talking about context engineering now. Freddie knew months ago: context is the product.

jason liu - vacation mode (@jxnlco)'s Twitter Profile Photo

how do i catch hallucinations? come learn to implement monitoring systems that catch AI errors as they happen in live production environments with Julia Neagu and Quotient AI. if you register, you'll be sent the recording and study notes after they're done!

Julia Neagu (@juliaaneagu)'s Twitter Profile Photo

Just dropped: three new cookbooks for building AI research agents with Exa, LangChain, OpenAI, and Anthropic — now with built-in monitoring from Quotient AI. Track search relevance. Catch hallucinations. Debug real-world agents as they run.

Freddie Vargus (@freddie_v4)'s Twitter Profile Photo

today we're releasing a new small model (0.5B) for detecting problems with tool usage in agents, trained on 50M tokens from publicly available MCP server tools. it's great at picking up on tool accuracy issues and outperforms larger models
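
For contrast with the learned detector above, here is a purely rule-based sketch of tool-usage checking: validating a call against a JSON-Schema-style tool definition catches only structural mistakes (unknown tool, missing or mistyped arguments), which is exactly the gap a trained model fills with semantic judgment. All names below are illustrative, not the model's interface.

```python
# Rule-based stand-in for tool-usage checking. Tool definitions follow the
# common JSON-Schema-ish shape used by MCP / function-calling APIs:
# {"name": ..., "parameters": {"properties": {...}, "required": [...]}}.

TYPE_MAP = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def check_tool_call(call: dict, tool_def: dict) -> list[str]:
    """Return a list of problems with `call` relative to `tool_def` (empty = OK)."""
    problems = []
    if call.get("name") != tool_def["name"]:
        problems.append(f"unknown tool: {call.get('name')!r}")
        return problems
    params = tool_def.get("parameters", {})
    props = params.get("properties", {})
    args = call.get("arguments", {})
    for required in params.get("required", []):
        if required not in args:
            problems.append(f"missing required argument: {required!r}")
    for arg_name, value in args.items():
        if arg_name not in props:
            problems.append(f"unexpected argument: {arg_name!r}")
        else:
            expected = TYPE_MAP.get(props[arg_name].get("type"))
            if expected and not isinstance(value, expected):
                problems.append(f"wrong type for {arg_name!r}")
    return problems
```

A learned detector goes further: it can flag a call that is schema-valid but semantically wrong for the task, which rules like these can't see.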

Julia Neagu (@juliaaneagu)'s Twitter Profile Photo

WHAT A DAY: Quotient AI limbic-tool-use-0.5B is trending on Hugging Face! it’s now the top fine-tuned model for Qwen2.5-0.5B-Instruct. appreciate all the support ❤️ more coming soon 👀

Freddie Vargus (@freddie_v4)'s Twitter Profile Photo

thanks to everyone for the excitement and questions. we've got more stuff cooking, and wanted to share more about the data curation & training pipeline from our researcher @deannalemery in our blog below 👇

Freddie Vargus (@freddie_v4)'s Twitter Profile Photo

Read about how it all starts with good, realistic data: blog.quotientai.co/introducing-li…

we pulled publicly available MCP server tool information from over 150 MCP servers, extracting tool definitions and metadata, and then reformatting the server info so our data was exploded into 1…

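
The "explode" step described in that thread might look roughly like this, with assumed field names rather than Quotient's actual schema: each server's tool listing is flattened into one record per tool, so downstream data generation can sample tools independently.

```python
# Sketch of exploding per-server MCP tool listings into one flat record
# per tool. Field names are assumptions for illustration.

def explode_tools(servers: list[dict]) -> list[dict]:
    """Flatten [{"name": ..., "tools": [...]}] into one record per tool."""
    records = []
    for server in servers:
        for tool in server.get("tools", []):
            records.append({
                "server": server["name"],
                "tool": tool["name"],
                "description": tool.get("description", ""),
                "parameters": tool.get("parameters", {}),
            })
    return records
```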
Julia Neagu (@juliaaneagu)'s Twitter Profile Photo

Agents still mess up tool calls.

Last week we released limbic-tool-use-0.5B - a small model that catches those mistakes better than GPT-4.1.

Now we're sharing how we did it: dataset, training pipeline, benchmark.

162 MCP server tools
50M+ tokens
1 Modal H100

read below

Julia Neagu (@juliaaneagu)'s Twitter Profile Photo

Our talk with @Tavily is now live — part of the new AI Engineer Retrieval & Search track. We share a practical framework for evaluating AI-powered search, and results from benchmarking popular context-augmented systems. If you’re building AI search, this one’s for you.