Flow AI (@flowaicom)'s Twitter Profile
Flow AI

@flowaicom

Accelerate your AI agent development with continuously evolving, validated test data that your teams can trust.

ID: 1303413990526246914

Link: http://flow-ai.com · Joined: 08-09-2020 19:25:14

532 Tweets

1.1K Followers

1.1K Following

Atla (@atla_ai):

3.8B Flow-Judge-v0.1 from Flow AI is now live on Judge Arena 🧑‍⚖️

Try out this lightweight judge model and start voting: hf.co/spaces/AtlaAI/…

—

Also an update on standings after 2 months…

Aaro Isosaari (@aaroisosaari):

LLMs are great at generating text, but what if they need to call external tools?

By enforcing a standardized JSON format and applying automatic multi-stage verification, APIGen by Salesforce AI Research provides a powerful way to generate verifiable and diverse function-calling…
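
To make the format-enforcement idea concrete, here is a minimal sketch of a format-stage check in the spirit of that pipeline (not APIGen's actual code; the tool catalog and field names are hypothetical):

```python
import json

# Hypothetical tool catalog; names and parameter sets are illustrative,
# not taken from the APIGen paper.
TOOLS = {
    "get_weather": {"required": {"city"}, "optional": {"units"}},
}

def check_format(raw: str) -> bool:
    """Format-stage check: output must parse as JSON, name a known function,
    include every required parameter, and use no undeclared ones."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    spec = TOOLS.get(call.get("name"))
    if spec is None:
        return False
    args = set(call.get("arguments", {}))
    return spec["required"] <= args <= spec["required"] | spec["optional"]

print(check_format('{"name": "get_weather", "arguments": {"city": "Helsinki"}}'))  # True
print(check_format('{"name": "get_weather", "arguments": {"day": "Monday"}}'))     # False
```

In APIGen's multi-stage setup, later stages go further than this, executing the call and checking the result semantically; a format gate like the above just keeps malformed samples out of the dataset early.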

Aaro Isosaari (@aaroisosaari):

AI agents need robust test sets to handle real-world complexity—multi-turn dialogues, tool interactions, and unpredictable inputs. Yet, this is where most evaluation methods fall short.

In our latest blog, Tiina Vaahtio explores how automated test generation can improve AI agent…

Aaro Isosaari (@aaroisosaari):

Flow Judge is No. 1 on Judge Arena! 🎉 (And what that really means.)

Several months after its release, our open-source evaluation model, Flow Judge, just hit the top spot on Judge Arena—the leaderboard ranking LLMs as evaluators.

This means it’s demonstrating higher alignment…

Flow AI (@flowaicom):

How do you measure if your model is processing long documents correctly—or as well as possible?

Our co-founder & CTO Karolus Sariola compiled a comparison of recent advances, exploring how LLMs now process context more effectively by leveraging attention.

flow-ai.com/blog/advancing…
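
A common way to quantify this, independent of whichever methods the post compares, is a needle-in-a-haystack probe: plant a fact at a chosen depth in long filler text and check whether the model retrieves it. A toy sketch with made-up data:

```python
def make_needle_case(filler: str, needle: str, n_lines: int, depth: float) -> tuple[str, str]:
    """Build one long-context probe: repeat filler text, plant the needle at a
    relative depth in [0, 1], and pair it with a retrieval question."""
    lines = [filler] * n_lines
    lines.insert(int(depth * n_lines), needle)
    question = "What is the secret code mentioned in the document?"
    return "\n".join(lines), question

context, question = make_needle_case(
    filler="Lorem ipsum dolor sit amet.",
    needle="The secret code is 7341.",
    n_lines=2000,
    depth=0.5,
)
# Sweep depth and n_lines, send context + question to the model, and score the
# fraction of (depth, length) cells where it answers "7341".
```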

Aaro Isosaari (@aaroisosaari):

Ever wondered how many AI projects fail because of flawed testing data? 🤔

We learned this the hard way with our previous product, Flowrite. After weeks of crafting a “perfect” testing dataset, our agent still performed badly on a significant number of scenarios in production.

Karolus Sariola (@sariolak):

Qwen’s QwQ vs. Google Gemini 2.5 on the same prompt. 👇

Here’s a side-by-side of two LMs drawing a tree in Python using turtle graphics.

There’s something oddly satisfying about this kind of vibe benchmark. It’s perhaps not a bad way to test a model’s Python drawing skills.
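
For anyone who wants to run the same vibe benchmark, a typical recursive solution to the prompt looks roughly like this (a minimal sketch, not either model's actual output):

```python
import turtle

def tree(t: turtle.Turtle, branch: float) -> None:
    """Draw one branch, then recurse into two shorter sub-branches."""
    if branch < 10:           # stop once branches get too short
        return
    t.forward(branch)
    t.left(25)
    tree(t, branch * 0.7)     # left sub-tree
    t.right(50)
    tree(t, branch * 0.7)     # right sub-tree
    t.left(25)                # restore the original heading
    t.backward(branch)        # walk back to the branch point

t = turtle.Turtle()
t.speed(0)
t.left(90)                    # point upward so the tree grows vertically
t.penup(); t.goto(0, -200); t.pendown()
tree(t, 120)
turtle.done()
```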

Bernardo García (@bergr7):

It feels wild that in 2025, with all the rapid progress in the AI space, getting good test cases for AI agent development remains such a hurdle for so many AI teams!

Aaro Isosaari (@aaroisosaari):

Test data is the backbone of reliable AI agents.

But creating it? Surprisingly tough.

Here are 4 test data challenges we see AI teams running into—and how they should be solved.

Aaro Isosaari (@aaroisosaari):

🚨 GPT-4.1 just dropped, and we put it straight to the ultimate vibe test.

Here's how it stacks up visually against GPT-4o, Qwen QwQ, and Gemini 2.5:

This is all 4 models tackling the same task: draw a tree in Python using turtle graphics. The model generates the tree using…

Aaro Isosaari (@aaroisosaari):

🤔 Could a fine-tuned LLM actually be worse at finding the right answer than the raw base model?

The shortcomings of fine-tuning aren't discussed much, but even the most modern methods come with tradeoffs.

Here’s what recent research by Yang Yue et al. (2025) reveals:

🔹 Base…

Aaro Isosaari (@aaroisosaari):

LLM judges are gaining traction among AI teams. Let’s look at the common pitfalls and how to make your judges more reliable 👇

AI engineers are increasingly using LLMs to measure the output of other LLMs. These judges are fast, scalable, and surprisingly aligned with human…
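
A minimal judge loop looks roughly like this, sketched against the OpenAI chat client (the model name, rubric, and 1-5 scale are placeholders; any chat-capable model works):

```python
from openai import OpenAI  # assumes the official openai client is installed

client = OpenAI()

JUDGE_PROMPT = """You are an impartial evaluator.
Rate the RESPONSE to the QUESTION for factual accuracy on a 1-5 scale.
Answer with a single digit and nothing else.

QUESTION: {question}
RESPONSE: {response}"""

def judge(question: str, response: str, model: str = "gpt-4o-mini") -> int:
    """Score one output with a judge model. The known pitfalls (position bias,
    verbosity bias, self-preference) are why scores need human spot-checks."""
    out = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic scoring reduces run-to-run noise
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, response=response)}],
    )
    return int(out.choices[0].message.content.strip()[0])
```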

Aaro Isosaari (@aaroisosaari):

Building agentic AI? Here’s how to stay ahead of the cost curve 💰

Four years ago, while building our first product Flowrite on GPT-3, the most common investor question was along the lines of:
"What if OpenAI 10x’s their API prices overnight?"

That fear gradually faded as more…

Aaro Isosaari (@aaroisosaari):

250 top AI engineers, researchers, and builders – all in one room in Helsinki 🔥

Symposium AI's Summer Inference on June 4 is one of the few events truly designed for the technical AI community.

That’s why it’s also the first event we at Flow AI have ever sponsored. Having…

Aaro Isosaari (@aaroisosaari):

LLM system evals ≠ AI agent evals. Here’s why:

Evaluating a single-step LLM system is mostly about 𝗼𝘂𝘁𝗽𝘂𝘁𝘀. You prompt the model, it replies, and you check (visually or with another model) if the reply is correct.

Evaluating an AI agent is about…
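
A toy sketch of the difference (data structures and expected values made up for illustration):

```python
def eval_llm_output(reply: str, expected: str) -> bool:
    """Single-step LLM system: judge only the final text."""
    return expected.lower() in reply.lower()

def eval_agent_trajectory(steps: list[dict], expected_tools: list[str]) -> bool:
    """AI agent: judge the whole trajectory, i.e. whether the right tools were
    called in a sensible order before the final answer was produced."""
    called = [s["tool"] for s in steps if s.get("tool")]
    return called == expected_tools

trajectory = [
    {"tool": "search_flights", "args": {"to": "HEL"}},
    {"tool": "book_flight", "args": {"id": "AY123"}},
    {"tool": None, "output": "Booked flight AY123 to Helsinki."},
]
print(eval_agent_trajectory(trajectory, ["search_flights", "book_flight"]))  # True
```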

Aaro Isosaari (@aaroisosaari):

LLMs are changing not only how products are built, but how teams decide what gets built. 🧠

Our team has pioneered with LLMs since GPT-3 in 2020. But for the first time, we’re using them at the earliest stage: initial scoping.

Instead of spending weeks analyzing documentation…

Aaro Isosaari (@aaroisosaari):

Who should own testing in AI-first product teams: engineers, PMs, domain experts, or someone else?

From our experience, testing AI builds often starts with engineers – naturally, since they know the system best. 

But that’s rarely enough.

💻 Engineers are ideal for rigorous,…

Aaro Isosaari (@aaroisosaari):

Tool use has become one of the defining capabilities of modern AI agents.

Instead of relying solely on pre-trained knowledge, AI systems can now access other applications, databases, or even write and execute code.

However, simply giving an agent access to tools doesn’t make it…
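
A bare-bones dispatch loop shows where the reliability work actually lives (tool names and error messages here are illustrative, not from any real framework):

```python
import json

def search_docs(query: str) -> str:
    """Hypothetical tool; a real agent would attach a parameter schema."""
    return f"(top result for {query!r})"

TOOLS = {"search_docs": search_docs}

def run_tool_call(raw_call: str) -> str:
    """Dispatch one model-emitted tool call. The failure branches below
    (malformed JSON, unknown tools, bad arguments) are exactly the cases an
    agent test set has to cover."""
    try:
        call = json.loads(raw_call)
    except json.JSONDecodeError:
        return "error: call was not valid JSON"
    if not isinstance(call, dict):
        return "error: call was not a JSON object"
    fn = TOOLS.get(call.get("name"))
    if fn is None:
        return f"error: unknown tool {call.get('name')!r}"
    try:
        return fn(**call.get("arguments", {}))
    except TypeError as exc:
        return f"error: bad arguments ({exc})"

print(run_tool_call('{"name": "search_docs", "arguments": {"query": "turtle"}}'))
```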

Aaro Isosaari (@aaroisosaari):

For a long time, the prevailing narrative in AI agents has been "bigger is better" – that a single, widely capable LLM is the ultimate workhorse for any agent.

But this week's paper from NVIDIA challenges that view by presenting a compelling case for the overlooked power of…

Aaro Isosaari (@aaroisosaari):

One of the most common questions we get from other AI teams:
“Should we build agents with a framework like LangChain, Autogen, CrewAI – or from scratch?”

We’ve spent the past year building agent development and evaluation tooling for AI teams. Our conclusion is pretty aligned…