Flow AI (@flowaicom)'s Twitter Profile
Flow AI

@flowaicom

Accelerate your AI agent development with continuously evolving, validated test data that your teams can trust.

ID: 1303413990526246914

Link: http://flow-ai.com · Joined: 08-09-2020 19:25:14

532 Tweets

1.1K Followers

1.1K Following

Atla (@atla_ai):

3.8B Flow-Judge-v0.1 from Flow AI is now live on Judge Arena 🧑‍⚖️

Try out this lightweight judge model and start voting: hf.co/spaces/AtlaAI/…

—

Also an update on standings after 2 months…

Aaro Isosaari (@aaroisosaari):

LLMs are great at generating text, but what if they need to call external tools?

By enforcing a standardized JSON format and applying automatic multi-stage verification, APIGen by Salesforce AI Research provides a powerful way to generate verifiable and diverse function-calling…
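
To make the format-enforcement idea concrete, here is a minimal sketch of a format-stage check in the spirit of that pipeline (not APIGen's actual code; the tool catalog and field names are hypothetical):

```python
import json

# Hypothetical tool catalog; names and parameter sets are illustrative,
# not taken from the APIGen paper.
TOOLS = {
    "get_weather": {"required": {"city"}, "optional": {"units"}},
}

def check_format(raw: str) -> bool:
    """Format-stage check: output must parse as JSON, name a known function,
    include every required parameter, and use no undeclared ones."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    spec = TOOLS.get(call.get("name"))
    if spec is None:
        return False
    args = set(call.get("arguments", {}))
    return spec["required"] <= args <= spec["required"] | spec["optional"]

print(check_format('{"name": "get_weather", "arguments": {"city": "Helsinki"}}'))  # True
print(check_format('{"name": "get_weather", "arguments": {"day": "Monday"}}'))     # False
```

In APIGen's multi-stage setup, later stages go further than this, executing the call and checking the result semantically; a format gate like the above just keeps malformed samples out of the dataset early.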

Aaro Isosaari (@aaroisosaari):

AI agents need robust test sets to handle real-world complexity—multi-turn dialogues, tool interactions, and unpredictable inputs. Yet, this is where most evaluation methods fall short.

In our latest blog, Tiina Vaahtio explores how automated test generation can improve AI agent…

Aaro Isosaari (@aaroisosaari):

Flow Judge is No. 1 on Judge Arena! 🎉 (And what that really means.)

Several months after its release, our open-source evaluation model, Flow Judge, just hit the top spot on Judge Arena—the leaderboard ranking LLMs as evaluators.

This means it’s demonstrating higher alignment…

Flow AI (@flowaicom):

How do you measure if your model is processing long documents correctly—or as well as possible?

Our co-founder & CTO Karolus Sariola compiled a comparison of recent advances, exploring how LLMs now process context more effectively by leveraging attention.

flow-ai.com/blog/advancing…
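
A common way to quantify this, independent of whichever methods the post compares, is a needle-in-a-haystack probe: plant a fact at a chosen depth in long filler text and check whether the model retrieves it. A toy sketch with made-up data:

```python
def make_needle_case(filler: str, needle: str, n_lines: int, depth: float) -> tuple[str, str]:
    """Build one long-context probe: repeat filler text, plant the needle at a
    relative depth in [0, 1], and pair it with a retrieval question."""
    lines = [filler] * n_lines
    lines.insert(int(depth * n_lines), needle)
    question = "What is the secret code mentioned in the document?"
    return "\n".join(lines), question

context, question = make_needle_case(
    filler="Lorem ipsum dolor sit amet.",
    needle="The secret code is 7341.",
    n_lines=2000,
    depth=0.5,
)
# Sweep depth and n_lines, send context + question to the model, and score the
# fraction of (depth, length) cells where it answers "7341".
```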

Aaro Isosaari (@aaroisosaari):

Ever wondered how many AI projects fail because of flawed testing data? 🤔

We learned this the hard way with our previous product, Flowrite. After weeks of crafting a “perfect” testing dataset, our agent still performed badly on a significant number of scenarios in production.

Karolus Sariola (@sariolak):

Qwen’s QwQ vs. Google Gemini 2.5 on the same prompt. 👇

Here’s a side-by-side of two LMs drawing a tree in Python using turtle graphics.

There’s something oddly satisfying about this kind of vibe benchmark. It’s perhaps not a bad way to test a model’s Python drawing skills.
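
For anyone who wants to run the same vibe benchmark, a typical recursive solution to the prompt looks roughly like this (a minimal sketch, not either model's actual output):

```python
import turtle

def tree(t: turtle.Turtle, branch: float) -> None:
    """Draw one branch, then recurse into two shorter sub-branches."""
    if branch < 10:           # stop once branches get too short
        return
    t.forward(branch)
    t.left(25)
    tree(t, branch * 0.7)     # left sub-tree
    t.right(50)
    tree(t, branch * 0.7)     # right sub-tree
    t.left(25)                # restore the original heading
    t.backward(branch)        # walk back to the branch point

t = turtle.Turtle()
t.speed(0)
t.left(90)                    # point upward so the tree grows vertically
t.penup(); t.goto(0, -200); t.pendown()
tree(t, 120)
turtle.done()
```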

Bernardo García (@bergr7):

It feels wild that in 2025, with all the rapid progress in the AI space, getting good test cases for AI agent development remains such a hurdle for so many AI teams!

Aaro Isosaari (@aaroisosaari):

Test data is the backbone of reliable AI agents.

But creating it? Surprisingly tough.

Here are 4 test data challenges we see AI teams running into—and how they should be solved.

Aaro Isosaari (@aaroisosaari):

🚨 GPT-4.1 just dropped, and we put it straight to the ultimate vibe test.

Here's how it stacks up visually against GPT-4o, Qwen QwQ, and Gemini 2.5:

This is all 4 models tackling the same task: draw a tree in Python using turtle graphics. The model generates the tree using…

Aaro Isosaari (@aaroisosaari):

🤔 Could a fine-tuned LLM actually be worse at finding the right answer than the raw base model?

The shortcomings of fine-tuning aren't discussed much, but even the most modern methods come with tradeoffs.

Here’s what recent research by Yang Yue et al. (2025) reveals:

🔹 Base…

Aaro Isosaari (@aaroisosaari):

LLM judges are gaining traction among AI teams. Let’s look at the common pitfalls and how to make your judges more reliable 👇

AI engineers are increasingly using LLMs to measure the output of other LLMs. These judges are fast, scalable, and surprisingly aligned with human…
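
A minimal judge loop looks roughly like this, sketched against the OpenAI chat client (the model name, rubric, and 1-5 scale are placeholders; any chat-capable model works):

```python
from openai import OpenAI  # assumes the official openai client is installed

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator.
Rate the RESPONSE to the QUESTION for factual accuracy on a 1-5 scale.
Answer with a single digit and nothing else.

QUESTION: {question}
RESPONSE: {response}"""

def judge(question: str, response: str, model: str = "gpt-4o-mini") -> int:
    """Score one output with a judge model. The known pitfalls (position bias,
    verbosity bias, self-preference) are why scores need human spot-checks."""
    out = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic scoring reduces run-to-run noise
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, response=response)}],
    )
    return int(out.choices[0].message.content.strip()[0])
```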

Aaro Isosaari (@aaroisosaari):

Building agentic AI? Here’s how to stay ahead of the cost curve 💰

Four years ago, while building our first product Flowrite on GPT-3, the most common investor question was along the lines of:
"What if OpenAI 10x’s their API prices overnight?"

That fear gradually faded as more…

Aaro Isosaari (@aaroisosaari):

250 top AI engineers, researchers, and builders – all in one room in Helsinki 🔥

Symposium AI's Summer Inference on June 4 is one of the few events truly designed for the technical AI community.

That’s why it’s also the first event we at Flow AI have ever sponsored. Having…

Aaro Isosaari (@aaroisosaari):

LLM system evals ≠ AI agent evals. Here’s why:

Evaluating a single-step LLM system is mostly about 𝗼𝘂𝘁𝗽𝘂𝘁𝘀. You prompt the model, it replies, and you check (visually or with another model) if the reply is correct.

Evaluating an AI agent is about…
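
A toy sketch of the difference (data structures and expected values made up for illustration):

```python
def eval_llm_output(reply: str, expected: str) -> bool:
    """Single-step LLM system: judge only the final text."""
    return expected.lower() in reply.lower()

def eval_agent_trajectory(steps: list[dict], expected_tools: list[str]) -> bool:
    """AI agent: judge the whole trajectory, i.e. whether the right tools were
    called in a sensible order before the final answer was produced."""
    called = [s["tool"] for s in steps if s.get("tool")]
    return called == expected_tools

trajectory = [
    {"tool": "search_flights", "args": {"to": "HEL"}},
    {"tool": "book_flight", "args": {"id": "AY123"}},
    {"tool": None, "output": "Booked flight AY123 to Helsinki."},
]
print(eval_agent_trajectory(trajectory, ["search_flights", "book_flight"]))  # True
```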

Aaro Isosaari (@aaroisosaari):

LLMs are changing not only how products are built, but how teams decide what gets built. 🧠

Our team has pioneered with LLMs since GPT-3 in 2020. But for the first time, we’re using them at the earliest stage: initial scoping.

Instead of spending weeks analyzing documentation…

Aaro Isosaari (@aaroisosaari):

Who should own testing in AI-first product teams: engineers, PMs, domain experts, or someone else?

From our experience, testing AI builds often starts with engineers – naturally, since they know the system best. 

But that’s rarely enough.

💻 Engineers are ideal for rigorous,…

Aaro Isosaari (@aaroisosaari):

Tool use has become one of the defining capabilities of modern AI agents.

Instead of relying solely on pre-trained knowledge, AI systems can now access other applications, databases, or even write and execute code.

However, simply giving an agent access to tools doesn’t make it…
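
A bare-bones dispatch loop shows where the reliability work actually lives (tool names and error messages here are illustrative, not from any real framework):

```python
import json

def search_docs(query: str) -> str:
    """Hypothetical tool; a real agent would attach a parameter schema."""
    return f"(top result for {query!r})"

TOOLS = {"search_docs": search_docs}

def run_tool_call(raw_call: str) -> str:
    """Dispatch one model-emitted tool call. The failure branches below
    (malformed JSON, unknown tools, bad arguments) are exactly the cases an
    agent test set has to cover."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return "error: call was not valid JSON"
    if not isinstance(call, dict):
        return "error: call was not a JSON object"
    fn = TOOLS.get(call.get("name"))
    if fn is None:
        return f"error: unknown tool {call.get('name')!r}"
    try:
        return fn(**call.get("arguments", {}))
    except TypeError as exc:
        return f"error: bad arguments ({exc})"

print(run_tool_call('{"name": "search_docs", "arguments": {"query": "turtle"}}'))
```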

Aaro Isosaari (@aaroisosaari):

For a long time, the prevailing narrative in AI agents has been "bigger is better" – that a single, widely capable LLM is the ultimate workhorse for any agent.

But this week's paper from NVIDIA challenges that view by presenting a compelling case for the overlooked power of…

Aaro Isosaari (@aaroisosaari):

One of the most common questions we get from other AI teams:
“Should we build agents with a framework like LangChain, Autogen, CrewAI – or from scratch?”

We’ve spent the past year building agent development and evaluation tooling for AI teams. Our conclusion is pretty aligned…