Jeffrey 🐬 confident-ai.com (@jeffr_yyy) 's Twitter Profile
Jeffrey 🐬 confident-ai.com

@jeffr_yyy

Cofounder @confident_ai, building @deepeval, ex-@Google, ex-@Microsoft

ID: 1301118361191895042

linkhttps://www.confident-ai.com calendar_today02-09-2020 11:23:00

1,1K Tweet

241 Followers

117 Following

Tom Blomfield (@t_blom) 's Twitter Profile Photo

I'm hosting a Y Combinator event for early-stage founders at Monzo 🏦 HQ in London on July 15th. You can apply for a ticket here: events.ycombinator.com/OCU91g6z4

Jeffrey 🐬 confident-ai.com (@jeffr_yyy) 's Twitter Profile Photo

In the week of June DeepEval decided to embrace OpenAI's API messages format for evaluating multi-turn conversations - and as a result almost tripled the total number of conversational evals ran from all our users. As a result we're releasing a native OpenAI integration later

In the week of June <a href="/deepeval/">DeepEval</a> decided to embrace OpenAI's API messages format for evaluating multi-turn conversations - and as a result almost tripled the total number of conversational evals ran from all our users.

As a result we're releasing a native OpenAI integration later
Jeffrey 🐬 confident-ai.com (@jeffr_yyy) 's Twitter Profile Photo

The most neglected pages on Confident AI are the settings, organization, and project pages. Kritin Vongthongsri and I sat down over the weekend and refactored + redesigned 10k lines of code to make things more professional. Does this look better now?

Mayank (@themayanksol) 's Twitter Profile Photo

"perfection is the enemy of good" - voltaire shipped DeepEval 1. enable tracing with hierarchy of spans LangChain agents 2. initial iteration of CrewAI integration with deepeval to trace your LLM spans 3. initial iteration of LlamaIndex πŸ¦™ integration with deepeval..

Kritin Vongthongsri (@kritinv07) 's Twitter Profile Photo

You guys loved G-Eval, so we shipped Multimodal G-Eval. You can now evaluate any multimodal task (text2image, computer/browser use, etc) in plain english. We've also added 1. gpt-4.1 and o4-mini suppport for all multimodal metrics on DeepEval 2. Image support on Confident AI

Kritin Vongthongsri (@kritinv07) 's Twitter Profile Photo

Today, we've finally changed DeepEval's default eval model from gpt-4o to gpt-4.1... and it took around 5 minutes. Better late than never, I guess πŸ˜….

Today, we've finally changed <a href="/deepeval/">DeepEval</a>'s default eval model from gpt-4o to gpt-4.1... and it took around 5 minutes.

Better late than never, I guess πŸ˜….
Kritin Vongthongsri (@kritinv07) 's Twitter Profile Photo

Too lazy to sift through 100+ test cases? Confident AI just dropped πŸͺ„AI Insights BoardπŸͺ„ This means running evals with DeepEval instantly tells you: 1. What are your LLM app's strengths and weaknesses 2. Areas your LLM consistently struggles with 3. The exact model and

LlamaIndex πŸ¦™ (@llama_index) 's Twitter Profile Photo

This guest post from DeepEval shows you how to build better RAG applications by combining LlamaIndex with comprehensive evaluation: 🎯 Use Answer Relevancy, Faithfulness, and Contextual Precision metrics to measure both your retriever and generator components πŸ”§ Set up

This guest post from <a href="/deepeval/">DeepEval</a> shows you how to build better RAG applications by combining LlamaIndex with comprehensive evaluation:

🎯 Use Answer Relevancy, Faithfulness, and Contextual Precision metrics to measure both your retriever and generator components
πŸ”§ Set up
Jeffrey 🐬 confident-ai.com (@jeffr_yyy) 's Twitter Profile Photo

One of the most beautiful things about talking to users is you realize where they are struggling. We're making two major releases at DeepEval this week to address the most common issues people face when running evals.

DeepEval (@deepeval) 's Twitter Profile Photo

Don’t get frustrated by writing print statements and endlessly scrolling terminal logs to debug your LangChain (and LangGraph) app. Trace your agent’s execution steps in production on Confident AI using our callback handler, with just two lines of code. Documentation:

Don’t get frustrated by writing print statements and endlessly scrolling terminal logs to debug your
<a href="/LangChainAI/">LangChain</a> (and LangGraph) app.

Trace your agent’s execution steps in production on <a href="/confident_ai/">Confident AI</a> using our callback handler, with just two lines of code.

Documentation:
Jeffrey 🐬 confident-ai.com (@jeffr_yyy) 's Twitter Profile Photo

πŸ€– Two LLM outputs walk into an arena… Only one leaves with the crown πŸ‘‘ βš”οΈ Pairwise battles βš–οΈ Elo-style scoring πŸ™ˆ Blind trials 🧠 LLMs judging LLMs No complex metrics. Just ask: which one is better? confident-ai.com/blog/llm-arena…

Confident AI (@confident_ai) 's Twitter Profile Photo

πŸš€ Shipped Integration You can now trace your CrewAI apps on the Confident AI platform. By just adding 2 lines of code in your app, you can get the entire execution steps of your agent in the form of a single trace. Leverage your LLM application performance using

πŸš€ Shipped Integration

You can now trace your <a href="/crewAIInc/">CrewAI</a>  apps on the <a href="/confident_ai/">Confident AI</a> platform. By just adding 2 lines of code in your app, you can get the entire execution steps of your agent in the form of a single trace.

Leverage your LLM application performance using
Jeffrey 🐬 confident-ai.com (@jeffr_yyy) 's Twitter Profile Photo

One of the greatest blocker to LLM evaluation? Communication between engineers and domain experts when curating datasets. When engineers gets put on an AI project, they aren't necessarily experts in the domain they're building for. In fact, the humans safeguarding AI responses

Y Combinator (@ycombinator) 's Twitter Profile Photo

.@Confident_AI's DeepTeam is an open-source red teaming framework for AI agents. Test for memory leaks, goal hijacking, and decision flaws across 40+ attack types & vulnerabilities. Congrats on the launch, Jeffrey 🐬 confident-ai.com and Kritin Vongthongsri! github.com/confident-ai/d…