Evaluating and Optimizing Agents, Chains, and Multi-Step Workflows

How to use Galileo Evaluate with Chains and Agents

Galileo Evaluate helps you evaluate and optimize Agents, Chains, and any other multi-step workflows with out-of-the-box Tracing and Analytics. Galileo allows you to run and log experiments, trace all the steps taken by your Agent or Chain, and use Guardrail or Custom Metrics to assess the quality of the end-to-end system.

Getting Started

The first step in evaluating your application is creating an evaluation run. To do this, run your evaluation set (e.g., a set of inputs that mimics what you expect to receive from users) through your Agent or Chain system and create a run.

Follow our instructions on how to integrate Evaluate into your existing application.
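For a LangChain-based chain, the integration can look roughly like the sketch below. This is a minimal, illustrative example using the promptquality client; `my_chain`, the project name, and the evaluation questions are placeholders, and the exact scorer names can vary across SDK versions.

```python
# Minimal sketch: logging an evaluation run for a LangChain chain with Galileo Evaluate.
# `my_chain` and the evaluation inputs below are placeholders for your own chain and
# evaluation set; scorer names may differ slightly depending on your SDK version.
import promptquality as pq

pq.login("https://console.your-galileo-deployment.com")  # your Galileo console URL

# The callback traces every step of the chain and logs it to an evaluation run.
galileo_callback = pq.GalileoPromptCallback(
    project_name="agent-evaluation",
    scorers=[pq.Scorers.context_adherence, pq.Scorers.correctness],
)

# Run your evaluation set (inputs that mimic real user queries) through the chain.
eval_inputs = [
    "What is the refund policy?",
    "How do I reset my password?",
]
for question in eval_inputs:
    my_chain.invoke({"question": question}, config={"callbacks": [galileo_callback]})

# Upload the traced executions to Galileo as a single evaluation run.
galileo_callback.finish()
```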

Keeping track of what changed in your experiment

As you start experimenting, you're going to want to keep track of what you're attempting with each experiment. To do so, use Prompt Tags. Prompt Tags are key-value tags you can add to a run (e.g. tagging one run with "embedding_model" = "voyage-2" and another with "embedding_model" = "text-embedding-ada-002").

Prompt Tags will help you remember what you tried with each experiment. Read more about how to add Prompt Tags here.
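As a rough illustration, the sketch below tags a run with the embedding model and prompt version being tested. It assumes the promptquality client exposes a RunTag helper and that the callback accepts a run_tags argument; check your SDK version for the exact names and fields.

```python
# Sketch: tagging an evaluation run so you can tell experiments apart later.
# Assumes promptquality exposes RunTag/TagType and that GalileoPromptCallback
# accepts a run_tags argument; field names may differ by SDK version.
import promptquality as pq

run_tags = [
    pq.RunTag(key="embedding_model", value="voyage-2", tag_type=pq.TagType.GENERIC),
    pq.RunTag(key="prompt_version", value="v3", tag_type=pq.TagType.GENERIC),
]

galileo_callback = pq.GalileoPromptCallback(
    project_name="agent-evaluation",
    run_name="voyage-2-embeddings",
    scorers=[pq.Scorers.context_adherence],
    run_tags=run_tags,
)
```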

Tracing and Visualizing your Agent and Chain executions

Once you log your evaluation runs, you can go to the Galileo Console to analyze your Chain and Agent executions. For each execution, you'll be able to see the input into the workflow and the final response, as well as any steps or decisions taken to get to the final result.

Clicking on any row will open the Expanded View for that node. The Retriever Node will show you all the chunks that your retriever returned. Once you start debugging your executions, this will allow you to trace poor-quality responses back to the step that went wrong.

Evaluating and Optimizing the performance of your application

Galileo has out-of-the-box Guardrail Metrics to help you assess and evaluate the quality of your application. In addition, Galileo supports user-defined custom metrics. When logging your evaluation run, make sure to include the metrics you want computed for your run.
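For example, a run might combine out-of-the-box Guardrail Metrics with a simple custom metric, roughly as sketched below. The custom scorer, its constructor fields, and the scorer names are illustrative assumptions and may differ by SDK version.

```python
# Sketch: choosing which metrics Galileo computes for the run.
# Guardrail Metrics come from pq.Scorers; the custom scorer below is a
# hypothetical example, and exact constructor fields may differ by SDK version.
import promptquality as pq

# Hypothetical custom metric: length of the final response, computed per row.
response_length = pq.CustomScorer(
    name="response_length",
    executor=lambda row: len(row.response),
)

scorers = [
    pq.Scorers.context_adherence,  # check pq.Scorers for the exact names in your SDK version
    pq.Scorers.correctness,
    response_length,
]

galileo_callback = pq.GalileoPromptCallback(
    project_name="agent-evaluation",
    scorers=scorers,
)
```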

You can read more about how to evaluate and debug your runs in the console here.

Iterative Experimentation

Once you've identified what went wrong with your Chain or Agent, change your configuration, prompt template, or model settings and re-run your evaluation under the same project. Your project view will allow you to quickly compare evaluation runs and see which configuration of your system worked best.
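As a rough sketch, the loop below re-runs the same evaluation set with two different models, logging each configuration as its own run in the same project so the runs appear side by side in the project view. `build_agent` and `eval_inputs` are placeholders for your own agent factory and evaluation set.

```python
# Sketch: re-running the same evaluation set after changing one variable (the model),
# logging each configuration to the same project so runs can be compared side by side.
# `build_agent` and `eval_inputs` are placeholders for your own code and data.
import promptquality as pq

for model_name in ["gpt-4o", "gpt-4o-mini"]:
    callback = pq.GalileoPromptCallback(
        project_name="agent-evaluation",      # same project => runs are grouped together
        run_name=f"model-{model_name}",
        scorers=[pq.Scorers.context_adherence],
    )
    agent = build_agent(model=model_name)     # placeholder factory for your agent or chain
    for question in eval_inputs:              # same evaluation set for every configuration
        agent.invoke({"question": question}, config={"callbacks": [callback]})
    callback.finish()
```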
