ragas (@ragas_io)'s Twitter Profile
ragas

@ragas_io

Supercharge Your LLM Application Evaluations 🚀

GitHub: github.com/explodinggradi…
Discord: discord.gg/5djav8GGNZ


Website: https://ragas.io/ · Joined: 05-03-2024 07:13:30

106 Tweets

921 Followers

0 Following

ikka (@shahules786)

A fun weekend project turned out to be an example on how to evaluate simple LLM agents using simulation. I was surprised to see how brittle even the latest LLMs are to different edge-case scenarios. ⭐️

Application
LLM is an agent collecting personal user information (name, SSID,
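As a generic illustration of the simulation idea described above (not the author's project code), the harness below probes an agent with edge-case user personas and checks a simple safety invariant. `call_agent` and the persona list are hypothetical stand-ins for the agent and scenarios under test.

```python
# Illustrative sketch only: simulate edge-case users against an
# information-collecting agent and assert invariants on its replies.
def call_agent(history: list[dict]) -> str:
    # Hypothetical placeholder: swap in your real LLM agent call here.
    return "Could you confirm the spelling of your name?"

EDGE_CASE_PERSONAS = [
    "Gives a fake name, then corrects it mid-conversation",
    "Refuses to share any ID and asks why it is needed",
    "Tries to get the agent to reveal data from a previous user",
]

def simulate(persona: str, max_turns: int = 5) -> list[dict]:
    history = [{"role": "user", "content": f"(persona: {persona}) Hi."}]
    for _ in range(max_turns):
        reply = call_agent(history)
        history.append({"role": "assistant", "content": reply})
        # In a real harness a second LLM would play the persona here;
        # we stop after one turn to keep the sketch self-contained.
        break
    return history

for persona in EDGE_CASE_PERSONAS:
    transcript = simulate(persona)
    # One example invariant: the agent never echoes another user's data.
    assert "previous user" not in transcript[-1]["content"].lower()
```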
ragas (@ragas_io)

Creating synthetic test data that reflects your production use case is hard. However, there is one technique that can make a lot of difference if used correctly: conditioning model generation on personas.

Instead of generic, one-size-fits-all questions, craft test cases using
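In recent Ragas versions this can be expressed by defining personas and handing them to the test set generator. A minimal sketch, with API names as given in the Ragas 0.2.x docs (worth verifying against your installed version); the persona descriptions are made-up examples:

```python
# Sketch of persona-conditioned test data generation with Ragas.
from ragas.testset.persona import Persona

personas = [
    Persona(
        name="new joiner",
        role_description="An engineer in their first week, asking basic "
        "setup questions in informal language.",
    ),
    Persona(
        name="compliance officer",
        role_description="Asks precise policy questions and expects "
        "citations to the source document.",
    ),
]
# These can then be passed to the test set generator, e.g.
# TestsetGenerator(llm=..., embedding_model=..., persona_list=personas),
# so each synthetic query is phrased from that persona's point of view.
```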
ikka (@shahules786)

Ragas office hours are becoming a big hit among the community. In just the last week we did office hours with 5 of the Fortune 50 companies building LLM apps.

What do we do differently from all others? We don’t recommend tools; instead, we recommend processes and opinions based on

ragas (@ragas_io)

Introducing NVIDIA’s RAG metrics in Ragas: new metrics for end-to-end accuracy, relevance, and groundedness, engineered to deliver robust, fast, and cost-effective performance.

1️⃣ Answer Accuracy: End-to-end measurement ensures the RAG’s response perfectly aligns with the ground
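A minimal sketch of scoring with these metrics, assuming the class names from the Ragas 0.2.x docs (AnswerAccuracy, ContextRelevance, ResponseGroundedness) and an OpenAI model as a stand-in judge; verify names against your installed version:

```python
# Score one sample with the NVIDIA metrics in Ragas.
import asyncio
from ragas.dataset_schema import SingleTurnSample
from ragas.metrics import AnswerAccuracy, ContextRelevance, ResponseGroundedness
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI  # assumed evaluator backend

evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini"))

sample = SingleTurnSample(
    user_input="When was Einstein born?",
    response="Einstein was born in 1879.",
    reference="Albert Einstein was born on 14 March 1879.",
    retrieved_contexts=["Albert Einstein (born 14 March 1879) ..."],
)

async def main():
    for metric in (AnswerAccuracy(llm=evaluator_llm),
                   ContextRelevance(llm=evaluator_llm),
                   ResponseGroundedness(llm=evaluator_llm)):
        print(type(metric).__name__, await metric.single_turn_ascore(sample))

asyncio.run(main())
```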
ragas (@ragas_io)

Learn to use Vertex AI and Ragas to evaluate LLM workflows in this three-part tutorial series. ⭐️

The series covers 👉

1️⃣ Quick Start: learn how to use Vertex AI models with Ragas to evaluate your LLM workflows.

2️⃣ Align LLM Metrics: Train and align your LLM evaluators to
ragas (@ragas_io)

Ragas 🤝 Google Vertex AI

A tutorial showing how to use Vertex AI’s generative models with Ragas.

Learn to configure Vertex AI’s evaluator LLM and embeddings, and conduct evaluations using a comprehensive suite of Ragas metrics—including model-based, computation-based, and
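A rough sketch of the configuration step, assuming the LangChain Vertex AI integration and the Ragas wrapper classes; the model names are placeholders to adapt to your project:

```python
# Wire Vertex AI models into Ragas via the LangChain wrappers.
from langchain_google_vertexai import ChatVertexAI, VertexAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

evaluator_llm = LangchainLLMWrapper(ChatVertexAI(model_name="gemini-1.5-pro"))
evaluator_embeddings = LangchainEmbeddingsWrapper(
    VertexAIEmbeddings(model_name="text-embedding-004")
)
# These wrapped objects can then be passed to Ragas metrics or to
# evaluate(..., llm=evaluator_llm, embeddings=evaluator_embeddings).
```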
ragas (@ragas_io)

Misalignment between LLM-based and human evaluators often leads to unreliable results.

Evaluations fall short without aligning the LLM judge with human evaluators. Here’s how you can fix it 👉

1️⃣ Evaluate your data using LLM-based metrics.
2️⃣ Identify and annotate discrepancies
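Step 2 can be as simple as diffing judge scores against human labels to find the cases worth annotating and feeding back into the judge's prompt or few-shot examples. A generic illustration, not a specific Ragas API; all numbers are hypothetical:

```python
# Find where the LLM judge and human annotators disagree.
llm_scores   = [1, 1, 0, 1, 0, 0, 1]  # hypothetical pass/fail from the judge
human_scores = [1, 0, 0, 1, 1, 0, 1]  # hypothetical human labels

disagreements = [i for i, (m, h) in enumerate(zip(llm_scores, human_scores)) if m != h]
agreement = 1 - len(disagreements) / len(llm_scores)

print(f"judge-human agreement: {agreement:.0%}")   # 71% here
print(f"cases to annotate and feed back: {disagreements}")
```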
ragas (@ragas_io)

Lack of test data is one of the main bottlenecks in evaluation - solve this by generating high-quality synthetic test data ⭐

Generate a diverse synthetic test set of single-hop queries using Ragas with this comprehensive guide, demonstrating the Ragas test set generation
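A minimal sketch of the generation flow, with placeholder paths and models; API names are per the Ragas 0.2.x docs and worth verifying against your installed version:

```python
# Generate a synthetic test set from your own documents with Ragas.
from langchain_community.document_loaders import DirectoryLoader
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper
from ragas.testset import TestsetGenerator

docs = DirectoryLoader("docs/", glob="**/*.md").load()  # placeholder corpus

generator = TestsetGenerator(
    llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")),
    embedding_model=LangchainEmbeddingsWrapper(OpenAIEmbeddings()),
)
testset = generator.generate_with_langchain_docs(docs, testset_size=10)
print(testset.to_pandas().head())
```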
ragas (@ragas_io)

Here are the two different ways to create criteria for evals.

1️⃣ general rubric: uses a global rubric/criteria to evaluate across the entire dataset. Easy to use, but can have limited accuracy in certain aspects.

2️⃣ instance-specific rubrics: uses custom handwritten
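As a rough illustration, a dataset-level rubric in Ragas might look like the following; class and argument names are per the Ragas 0.2.x docs and worth verifying, and the judge model is a placeholder. Instance-specific rubrics attach the rubric text to each sample instead of the metric:

```python
# Sketch of a dataset-level (general) rubric metric in Ragas.
from ragas.metrics import RubricsScore
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI  # assumed judge backend

rubrics = {
    "score1_description": "Response contradicts the reference answer.",
    "score3_description": "Response is partially correct but incomplete.",
    "score5_description": "Response is fully correct and complete.",
}
metric = RubricsScore(
    rubrics=rubrics,
    llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")),
)
```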
ragas (@ragas_io)

📊 Benchmarking Google Gemini Models on Academic QA using Ragas Metrics

Choose the models that fit your needs best. Benchmark them with the metrics that matter to you.

This tutorial explains how we benchmark Gemini 1.5 Flash and Gemini 2.0 Flash. We use AllenAI’s QASPER
Qdrant (@qdrant_engine)

🧠 Chunking changes everything in RAG. 

This benchmark post evaluated Fixed, Semantic, Agentic, and Recursive chunking in Agentic RAG.

Built with Agno, Qdrant, ragas, and LlamaIndex 🦙.

And measured with relevant metrics: Context Recall, Faithfulness, Factual
ragas (@ragas_io)

We're excited to see our co-founder, Jithin James, featured in a Microsoft for Startups and B Capital white paper! 🔥 It’s all about "RAG and the Future of Intelligent Enterprise Applications."

This white paper provides valuable insights on RAG technology for businesses. It
ragas (@ragas_io)

Use Datadog for LLM observability and Ragas for evaluation. Datadog now allows you to trace and log LLM calls and is integrated with Ragas metrics to evaluate and monitor your AI applications.

Take a look at this thorough guide in Datadog's LLM observability section. Improve
ragas (@ragas_io)

We are hosting our first-ever hackathon ⭐

Join us to learn, build, and hack on evaluating, experimenting with, and improving any LLM Agent using the Ragas App.

Link to the event -> lu.ma/github-hacknig…

📅 When: Thursday, April 17th at 4 PM PST
📍 Where: GitHub HQ, San
ragas (@ragas_io)

LlamaStack + Ragas + Llama 4 = 🚀

LlamaStack is an open-source framework maintained by Meta that streamlines the development and deployment of large language model-powered applications. Use Ragas to evaluate your LlamaStack apps.

This tutorial walks you through:

• Creating a
ragas (@ragas_io)

🔍 Evaluating RAG Pipelines with Ragas + Milvus

Milvus is a powerful open-source vector database that excels at similarity search and scales beautifully for production use. When combined with Ragas, you can:

✅ Measure the effectiveness of retrieval

✅ Ensure the generation's
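A hypothetical end-to-end sketch of that evaluation step, with a hand-written sample standing in for real Milvus retrieval results; the models and metric choice here are assumptions:

```python
# Score retrieval and generation quality of a RAG pipeline with Ragas.
from ragas import EvaluationDataset, evaluate
from ragas.metrics import LLMContextRecall, Faithfulness
from ragas.llms import LangchainLLMWrapper
from langchain_openai import ChatOpenAI  # assumed judge backend

samples = [{
    "user_input": "What is Milvus?",
    "retrieved_contexts": ["Milvus is an open-source vector database ..."],
    "response": "Milvus is an open-source vector database for similarity search.",
    "reference": "Milvus is an open-source vector database.",
}]

dataset = EvaluationDataset.from_list(samples)
result = evaluate(
    dataset=dataset,
    metrics=[LLMContextRecall(), Faithfulness()],
    llm=LangchainLLMWrapper(ChatOpenAI(model="gpt-4o-mini")),
)
print(result)
```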
ragas (@ragas_io)

📊 Measure What Matters: Griptape + Ragas

Evaluate Griptape's RAG Engines with Ragas integration - get quantifiable metrics on retrieval accuracy and response quality with minimal setup.

Griptape is a powerful framework for Gen AI application development. Check out our new
ragas (@ragas_io)

🧠 Paper Club Alert! We're discussing "The Illusion of Thinking" - Apple's controversial paper on why LRMs ace easy puzzles but crash on hard ones.

Join us July 3 @ 9:30 AM PT for:
- Overview
- Chain-of-thought limitations
- Real implications for AI in prod

Free on Zoom: