Registering and Using Custom Metrics

Registered Metrics let your team define custom metrics (programmatic or GPT-based) for your Observe projects.

Creating Your Registered Scorer

To define a registered scorer, create a Python file containing at least two functions that follow the signatures described below:

  1. scorer_fn: The scorer function receives the row-wise inputs and generates an output for each response. The expected signature for this function is:

    def scorer_fn(*, index: Union[int, str], response: str, **kwargs: Any) -> Union[float, int, bool, str, None]:
        ...

    We support outputs of floating-point numbers, integers, boolean values, and strings. We also recommend that your scorer_fn accept **kwargs so that your registered scorers remain forward-compatible.

  2. aggregator_fn: The aggregator function takes the array of row-wise outputs from your scorer and generates aggregate values from them. The expected signature for the aggregator function is:

    def aggregator_fn(*, scores: List[Union[float, int, bool, str, None]]) -> Dict[str, Union[float, int, bool, str, None]]:
        ...

    For aggregated values that you want to output from your scorer, return them as key-value pairs, where the key is the label that will appear and the value is the aggregated score (see the sketch below).
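As a minimal sketch of how the two functions fit together (the word-count metric and all names here are illustrative, not a prescribed implementation):

from typing import Any, Dict, List, Union


def scorer_fn(*, index: Union[int, str], response: str, **kwargs: Any) -> int:
    # Row-wise metric: the number of words in each response (illustrative only).
    return len(response.split())


def aggregator_fn(*, scores: List[int]) -> Dict[str, int]:
    # Roll the row-wise word counts up into a labeled aggregate.
    return {"Max Word Count": max(scores)}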

Registering Your Scorer

Once you've created your scorer file, register it by providing a name and the path to the scorer file with our Python package, promptquality:

import promptquality as pq

pq.login({YOUR_GALILEO_URL})
registered_scorer = pq.register_scorer(scorer_name="My Scorer", scorer_file="/path/to/scorer/file.py")

Execution Environment

Your scorer will be executed in a Python 3.9 environment. The Python libraries available for your use are:

numpy~=1.24.3
onnxruntime~=1.16.0
pandas~=2.1.1
pydantic~=2.3.0
scikit-learn~=1.3.1
sentence_transformers~=2.2.2
tensorflow~=2.14.0
transformers~=4.33.3

If you are using an ML model to make predictions, please ensure it is <= 500MB in size and uses either scikit-learn or tensorflow. For larger models, we recommend optimizing them with ONNX Runtime.
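As an illustration, a scikit-learn model could be exported to ONNX ahead of time and then loaded with onnxruntime inside your scorer. The conversion step below uses skl2onnx, which is not part of the scoring environment, so it would run locally before registering; how the resulting model.onnx file is made available to your scorer is also an assumption here, not a documented workflow:

# Offline step: convert a (toy) scikit-learn model to ONNX with skl2onnx.
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
from sklearn.linear_model import LogisticRegression

model = LogisticRegression().fit([[0.0], [1.0]], [0, 1])
onnx_model = convert_sklearn(model, initial_types=[("input", FloatTensorType([None, 1]))])
with open("model.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

# Inside your scorer file: run inference with onnxruntime, which is available.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("model.onnx")
label = session.run(None, {"input": np.array([[0.5]], dtype=np.float32)})[0]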

Please note that we regularly update the minor and patch versions of these packages. Major version updates are infrequent, but if a library is critical to your scorer, please let us know and we'll provide at least one week's warning before updating its major version.

The name you choose when registering your scorer is the name under which its values will appear in the UI later.

Using Your Registered Scorer

All your Registered Scorers appear under the Custom Metrics section of your Project Settings, where the On/Off switch enables or disables each one.

While a metric is on, your registered scorer is executed on new samples logged to Galileo Observe (note: scorers don't run retroactively, so past samples will not be scored). Each added scorer appears as a new column in your Data view.

Example

Here is the Registered Scorer equivalent of the response-length example we built with a Custom Scorer.

  1. Create a scorer.py file:

from typing import Dict, List


def scorer_fn(*, response: str, **kwargs) -> int:
    # Row-wise metric: the character length of each response.
    return len(response)


def aggregator_fn(*, scores: List[int]) -> Dict[str, float]:
    # Aggregate the row-wise lengths into labeled summary values.
    return {
        "Total Response Length": sum(scores),
        "Average Response Length": sum(scores) / len(scores),
    }
  2. Register the scorer:

    pq.register_scorer("response_length", "scorer.py")
  3. Use the scorer in your prompt run:

    pq.run(..., scorers=["response_length"])
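Before registering, you can sanity-check the scorer locally by importing the file and calling both functions directly (a quick illustrative check, not part of the promptquality API):

from scorer import aggregator_fn, scorer_fn

responses = ["short answer", "a somewhat longer response"]
scores = [scorer_fn(response=r) for r in responses]
print(aggregator_fn(scores=scores))
# {'Total Response Length': 38, 'Average Response Length': 19.0}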
