Integrating Evaluate into my existing application
If you already have a prototype or an application you're looking to run experiments and evaluations over, Galileo Evaluate allows you to hook into it and log the inputs, outputs, and any intermediate steps to Galileo for further analysis.
Before creating a run, you'll want to make sure you have an evaluation set (a set of questions / sample inputs you want to run through your prototype for evaluation). Your evaluation set should be consistent across runs.
Haven't written any code yet? Are you looking for a no-code way of testing out models and templates for your use case? Check out Creating Prompt Runs.
There are a few ways you can integrate your existing application depending on how you built it:
Langchain
Galileo supports the logging of chains from langchain. To log these chains, you'll need the callback from our Python client, promptquality.
For logging your data, first login:
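A minimal login sketch using the promptquality client (the console URL below is a placeholder; substitute your own Galileo console address):

```python
import promptquality as pq

# Log in to your Galileo console. The URL here is a placeholder.
pq.login("console.your-galileo-deployment.com")
```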
After that, you can set up the GalileoPromptCallback:
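A sketch of constructing the callback (the project name is a placeholder, and the specific Scorers members shown are assumptions; check the scorers available in your promptquality version):

```python
import promptquality as pq

# Create the callback that will log every chain execution to this project.
galileo_handler = pq.GalileoPromptCallback(
    project_name="my-first-project",  # placeholder name
    scorers=[pq.Scorers.context_adherence, pq.Scorers.correctness],
)
```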
project_name: each "run" will appear under this project. Choose a name that'll help you identify what you're evaluating
scorers: This is the list of metrics you want to evaluate your run over. Check out Galileo Guardrail Metrics and Custom Metrics for more information.
Executing and Logging
Next, run your chain over your Evaluation set and log the results to Galileo.
When you execute your chain (with run, invoke, or batch), just include the callback instance created earlier in the callbacks:
If using .run():
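A sketch, assuming chain is a LangChain chain and galileo_handler is the callback instance you created earlier (both names are placeholders):

```python
# Pass the callback at execution time so it sees every node of the chain.
chain.run(inputs, callbacks=[galileo_handler])
```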
If using .invoke():
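A sketch with the same placeholder names; for .invoke(), LangChain takes callbacks inside the config argument:

```python
# Callbacks go in the config dict for .invoke().
chain.invoke(inputs, config=dict(callbacks=[galileo_handler]))
```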
If using .batch():
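A sketch with the same placeholder names; .batch() takes a list of inputs and, like .invoke(), accepts callbacks via config:

```python
# One callback instance covers all executions in the batch.
chain.batch(list_of_inputs, config=dict(callbacks=[galileo_handler]))
```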
Important: Once you've finished executing over your dataset, tell Galileo the run is complete by:
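A sketch, assuming galileo_handler is the callback instance created earlier (placeholder name):

```python
# Upload the run to Galileo and kick off scorer execution server-side.
galileo_handler.finish()
```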
The finish step uploads the run to Galileo and starts the execution of the scorers server-side. This step will also display the link you can use to interact with the run on the Galileo console.
A full example can be found here.
Note 1: Please make sure to set the callback at execution time, not at definition time, so that the callback is invoked for all nodes of the chain.
Note 2: We recommend using .invoke instead of .batch, because langchain reports latencies for the entire batch rather than for each individual chain execution.
Custom Logging
If you're not using an orchestration library, or are using one other than Langchain, we also provide a similar interface for uploading executions that don't use a callback mechanism. To log your runs with Galileo, you'd start with the same typical flow of logging into Galileo:
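The same login sketch as before (the console URL is a placeholder for your own deployment):

```python
import promptquality as pq

# Log in before constructing or uploading any chain rows.
pq.login("console.your-galileo-deployment.com")
```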
Then, for each step of your sequence (or node in the chain), construct a chain row. For example, you can log your retriever and LLM nodes with a snippet like the one below.
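A sketch of logging a chain with a retriever and an LLM node. The NodeRow and NodeType names, field names, and the latency unit are assumptions based on the promptquality client; verify them against your client version. Inputs and outputs are placeholder strings:

```python
import json
import uuid

import promptquality as pq

# One chain_root_id ties all rows of a single execution together.
chain_id = uuid.uuid4()

rows = [
    # Root node for the whole chain execution.
    pq.NodeRow(
        node_id=chain_id,
        chain_root_id=chain_id,
        node_type=pq.NodeType.chain,
        step=0,
        node_input="What does Galileo Evaluate do?",
        node_output="Galileo Evaluate logs and scores your runs.",
    ),
    # Retriever node: output is the serialized list of retrieved chunks.
    pq.NodeRow(
        node_id=uuid.uuid4(),
        chain_root_id=chain_id,  # parent's node_id, per the note below
        node_type=pq.NodeType.retriever,
        step=1,
        node_input="What does Galileo Evaluate do?",
        node_output=json.dumps([{"page_content": "placeholder chunk text"}]),
        latency=5_000_000,  # assumed to be nanoseconds
    ),
    # LLM node: prompt in, completion out.
    pq.NodeRow(
        node_id=uuid.uuid4(),
        chain_root_id=chain_id,
        node_type=pq.NodeType.llm,
        step=2,
        node_input="Context: placeholder chunk text. Question: What does Galileo Evaluate do?",
        node_output="Galileo Evaluate logs and scores your runs.",
        latency=800_000_000,  # assumed to be nanoseconds
    ),
]
```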
We recommend you randomly generate node_id and chain_root_id (e.g. with uuid.uuid4()). Add the node_id of a 'parent' node as the chain_root_id of its children.
When your execution completes, log that data to Galileo:
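A sketch of the upload call, assuming rows is the list of chain rows built above; the chain_run name, its signature, and the Scorers member are assumptions based on the promptquality client:

```python
# Upload all collected rows as a single run and start scoring server-side.
pq.chain_run(
    rows,
    project_name="my-first-project",  # placeholder name
    scorers=[pq.Scorers.context_adherence],
)
```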
Once that's complete, this step will display the link to access the run from your Galileo console.
Logging metadata
If you are logging chains from langchain, metadata values (such as chunk-level metadata for the retriever) will be automatically included.
For custom chains, metadata values can be logged by dumping metadata along with page_content as demonstrated below.
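A sketch of serializing retrieved chunks so metadata travels with page_content; plain dicts stand in for whatever document objects your retriever returns:

```python
import json

# Stand-in retrieved chunks; a real retriever would supply these.
documents = [
    {"page_content": "Galileo supports chain logging.",
     "metadata": {"source": "docs/evaluate.md", "chunk": 0}},
    {"page_content": "Use the promptquality callback.",
     "metadata": {"source": "docs/evaluate.md", "chunk": 1}},
]

# Dump metadata alongside page_content so both appear in the logged node output.
node_output = json.dumps(
    [{"page_content": d["page_content"], "metadata": d["metadata"]} for d in documents]
)
print(node_output)
```

The serialized string can then be used as the node_output of your retriever's chain row.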
Running multiple experiments in one go
If you want to run multiple experiments in one go (e.g. use different templates, experiment with different retriever params, etc.), check out Chain Sweeps.