Custom Metrics: Registration & Usage

Galileo GenAI Studio supports Custom Metrics (programmatic or GPT-based) for all your Evaluate and Observe projects. Depending on where, when, and how you want these metrics to run, you can choose between Custom Scorers and Registered Scorers.

Registered Scorers

A scorer can be registered so that it can be reused across runs, projects, modules, and users within your organization. Registered Scorers run in the backend in an isolated environment that has access to a predefined set of libraries and packages.

Creating Your Registered Scorer

To define a registered scorer, create a Python file containing at least two functions that follow the signatures described below:

scorer_fn: The scorer function receives the row-wise inputs and is expected to return an output for each response. The expected signature for this function is:
```
def scorer_fn(*, index: Union[int, str], response: str, **kwargs: Any) -> Union[float, int, bool, str, None]:
    ...
```
Supported output types are floats, integers, booleans, and strings. We also recommend that your scorer_fn accept **kwargs so that your registered scorers remain forward-compatible.
aggregator_fn: The aggregator function takes a list of the row-wise outputs from your scorer and lets you compute aggregates from them. The expected signature for the aggregator function is:
```
def aggregator_fn(*, scores: List[Union[float, int, bool, str, None]]) -> Dict[str, Union[float, int, bool, str, None]]:
    ...
```
For each aggregated value you want to output from your scorer, return a key-value pair in which the key is the label for the aggregate and the value is the aggregate itself.
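As a minimal sketch of a scorer file that follows both signatures, the example below flags responses that look like refusals. The phrase list and the metric labels are illustrative assumptions, not anything provided by Galileo, and returning None for empty responses is just one possible way to skip a row.

```python
from typing import Any, Dict, List, Optional, Union

# Hypothetical phrase list, chosen only for illustration.
REFUSAL_PHRASES = ("i cannot", "i can't", "i am unable")


def scorer_fn(*, index: Union[int, str], response: str, **kwargs: Any) -> Optional[bool]:
    """Return True if the response looks like a refusal, None if it is empty."""
    if not response:
        return None  # nothing to score for this row
    lowered = response.lower()
    return any(phrase in lowered for phrase in REFUSAL_PHRASES)


def aggregator_fn(*, scores: List[Optional[bool]]) -> Dict[str, Union[int, float, None]]:
    """Summarize the row-wise flags, skipping rows that returned None."""
    valid = [s for s in scores if s is not None]
    refusals = sum(valid)  # True counts as 1, False as 0
    return {
        "Refusal Count": refusals,
        "Refusal Rate": refusals / len(valid) if valid else None,
    }
```

Both functions take keyword-only arguments, and scorer_fn ignores any extra keyword arguments it does not recognize, which is what keeps it forward-compatible.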

Registering Your Scorer

Once you've created your scorer file, register it by passing a name and the path to the scorer file:

registered_scorer = pq.register_scorer(scorer_name="my-scorer", scorer_file="/path/to/scorer/file.py")

The name you choose here will be the name with which the values for this scorer appear in the UI later.

Using Your Registered Scorer

To use your scorer during a prompt run (or sweep), simply pass it in alongside any of the other scorers:

pq.run(..., scorers=[registered_scorer])

If you created your registered scorer in a previous session, you can pass the scorer's name instead of the object:

pq.run(..., scorers=["my-scorer"])

Example

For example, let's say we want to create a custom metric that measures the length of the response. We would define a scorer function and an aggregator function in a Python file, then register that file as a scorer.

Create a scorer.py file:

from typing import Dict, List


def scorer_fn(*, response: str, **kwargs) -> int:
    return len(response)


def aggregator_fn(*, scores: List[int]) -> Dict[str, float]:
    return {
        "Total Response Length": sum(scores),
        "Average Response Length": sum(scores) / len(scores),
    }

Register the scorer from your Python client:

pq.register_scorer(scorer_name="response_length", scorer_file="scorer.py")

Use the scorer in your prompt run:

pq.run(..., scorers=["response_length"])

Note that a registered scorer can only take the response as input. If you want to pass other fields into the custom metric, use a Custom Scorer, described below.
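Before registering, it can help to exercise the two functions locally. The sketch below uses a hypothetical helper, dry_run, that calls them with keyword arguments as their signatures require. How the Galileo backend actually invokes the functions is not described here, so treat this only as a sanity check and not as a reproduction of the server-side execution.

```python
from typing import Any, Callable, Dict, List


def dry_run(score: Callable[..., Any],
            aggregate: Callable[..., Dict[str, Any]],
            responses: List[str]) -> Dict[str, Any]:
    """Hypothetical local check: score each response, then aggregate the scores."""
    scores = [score(index=i, response=r) for i, r in enumerate(responses)]
    return aggregate(scores=scores)


# Stand-ins for the functions defined in scorer.py above.
def scorer_fn(*, response: str, **kwargs: Any) -> int:
    return len(response)


def aggregator_fn(*, scores: List[int]) -> Dict[str, float]:
    return {
        "Total Response Length": sum(scores),
        "Average Response Length": sum(scores) / len(scores),
    }


summary = dry_run(scorer_fn, aggregator_fn, ["hello", "hi there"])
print(summary)  # {'Total Response Length': 13, 'Average Response Length': 6.5}
```

Because scorer_fn accepts **kwargs, passing index here is harmless even though the function does not use it.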

Execution Environment

Your scorer will be executed in a Python 3.9 environment. The Python libraries available for your use are:

numpy~=1.24.3
onnxruntime~=1.16.0
pandas~=2.1.1
pydantic~=2.3.0
scikit-learn~=1.3.1
sentence_transformers~=2.2.2
tensorflow~=2.14.0
transformers~=4.33.3

If you are using an ML model to make predictions, ensure it is no larger than 500MB and uses either scikit-learn or tensorflow. For larger models, we recommend optimizing them with the ONNX Runtime.

Please note that we regularly update the minor and patch versions of these packages. Major version updates are infrequent, but if a library is critical to your scorer, let us know and we'll give you at least one week of notice before updating its major version.
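If you develop a scorer locally, you may want to check whether your installed packages fall within the pins listed above. The sketch below implements the compatible-release rule for `~=` from PEP 440 (for example, `~=1.24.3` means `>=1.24.3` and `==1.24.*`) using only the standard library. The helper names are hypothetical, and the parser only handles plain numeric versions such as 1.24.3.

```python
from importlib import metadata
from typing import Dict, Optional, Tuple


def parse(version: str) -> Tuple[int, ...]:
    """Parse a plain numeric version such as '1.24.3' (no pre-release or local tags)."""
    return tuple(int(part) for part in version.split("."))


def is_compatible(installed: str, pinned: str) -> bool:
    """Compatible release check: same leading components, and installed >= pinned."""
    have, want = parse(installed), parse(pinned)
    prefix = len(want) - 1  # every component except the last must match exactly
    return have[:prefix] == want[:prefix] and have >= want


def check_local(pins: Dict[str, str]) -> Dict[str, Optional[bool]]:
    """Map each package to True/False, or None if it is missing or unparseable."""
    report: Dict[str, Optional[bool]] = {}
    for package, pinned in pins.items():
        try:
            report[package] = is_compatible(metadata.version(package), pinned)
        except (metadata.PackageNotFoundError, ValueError):
            report[package] = None
    return report


# A few of the pins from the list above.
print(check_local({"numpy": "1.24.3", "pandas": "2.1.1", "scikit-learn": "1.3.1"}))
```

The printed report depends on your local environment, so no expected output is shown.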

What if I need to use other libraries or packages?

If you need to use other libraries or packages, you can use Custom Scorers. Custom Scorers run in your notebook environment. Because they run locally, they are not available for runs created from the UI or for Observe projects.

| | Registered Scorers | Custom Scorers |
| --- | --- | --- |
| Creating the custom metric | Created from the Python client, can be activated through the UI. | Created via the Python client |
| Sharing across the organization | Accessible within the Galileo console across different projects and modules | Outside Galileo, accessible only to the current project |
| Accessible modules | Evaluate and Observe | Evaluate |
| Scorer Definition | As an independent Python file | Within the notebook |
| Execution Environment | Server-side | Within your Python environment |
| Python Libraries available | Limited to a Galileo provided execution environment | Any library within your virtual environment |
| Execution Resources | Restricted by Galileo | Any resources available to your local instance |

How do I create a local "Custom Scorer"?

Custom Scorers can be created from two Python functions (an executor function and an aggregator function, as defined below). Common types include:

Heuristics/custom rules: checking for regex matches or the presence/absence of certain keywords or phrases.
Model-guided: using a pre-trained model to check for specific entities (e.g. PERSON, ORG), or asking an LLM to grade the quality of the output.

For example, here is the Custom Scorer equivalent of the registered scorer we created above to calculate response length.

Note that the function names are different: they are executor and aggregator instead of scorer_fn and aggregator_fn.

def executor(row) -> int:
  return len(row.response)

def aggregator(scores, indices) -> dict:
  return {'Total Response Length': sum(scores),
          # You can have multiple aggregate summaries for your metric.
          'Average Response Length': sum(scores)/len(scores)}

my_scorer = pq.CustomScorer(name='Response Length', executor=executor, aggregator=aggregator)

To use your scorer, pass it in the scorers parameter of pq.run or pq.run_sweep:

pq.run(my_template, 'my_dataset.csv', scorers=[my_scorer])
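The heuristic type of scorer listed earlier can be sketched the same way. The example below flags responses containing something shaped like an email address. The regex, the metric labels, and the stand-in rows are illustrative assumptions. SimpleNamespace only mimics the response attribute used above, and the aggregator keeps the indices parameter from the example above without relying on it.

```python
import re
from types import SimpleNamespace

# Deliberately loose pattern for illustration, not a full email validator.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def executor(row) -> bool:
    """Flag a row whose response contains something shaped like an email address."""
    return bool(EMAIL_PATTERN.search(row.response))


def aggregator(scores, indices) -> dict:
    """Count flagged rows; `indices` is accepted to match the signature but unused."""
    flagged = sum(1 for s in scores if s)
    return {
        "Responses With Email": flagged,
        "Email Rate": flagged / len(scores) if scores else 0.0,
    }


# Local check with stand-in rows (hypothetical data).
rows = [
    SimpleNamespace(response="Contact me at jane@example.com"),
    SimpleNamespace(response="No contact details here."),
]
scores = [executor(r) for r in rows]
print(aggregator(scores, indices=[0, 1]))  # {'Responses With Email': 1, 'Email Rate': 0.5}
```

These two functions could then be wrapped in pq.CustomScorer and passed to pq.run in the same way as the response length scorer.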

For more on custom metrics, see our promptquality docs.

Once you complete a run, your custom metric can be used to evaluate responses for that specific project.

Note that a Custom Scorer can only be used in the Evaluate module. If you want to use a custom metric to evaluate live traffic (Observe module), use the Registered Scorers described above.

