Build your own Conditions

A class to build custom conditions for DataFrame assertions and alerting.

A Condition is a class for building custom data quality checks. Create a condition, and it will be evaluated after the run is processed. Integrate with email or Slack to receive condition results in a Run Report. Use Conditions to answer questions such as "Is the average confidence for my training data below 0.25?" or "Has over 20% of my inference data drifted?".

What do I do with Conditions?

You can build a Run Report that will evaluate all conditions after a run is processed.

import dataquality as dq

dq.init("text_classification")

cond1 = dq.Condition(...)
cond2 = dq.Condition(...)
dq.register_run_report(conditions=[cond1, cond2])

# By default we email the logged in user
# Optionally pass in additional emails to receive Run Reports
dq.register_run_report(conditions=[cond1], emails=["foo@bar.com"])

You can also build and evaluate conditions by accessing the processed DataFrame.

import dataquality as dq
from dataquality import Condition

df = dq.metrics.get_dataframe("proj_name", "run_name", "training")
cond = Condition(...)
passes, ground_truth = cond.evaluate(df)

How do I build a Condition?

A Condition is defined as follows:

```python
class Condition:
    agg: AggregateFunction  # An aggregate function to apply to the metric
    threshold: float  # Threshold value for evaluating the condition
    operator: Operator  # The operator to use for comparing the agg to the threshold
    metric: Optional[str] = None  # The DF column for evaluating the condition
    # Optional filters to apply to the DataFrame before evaluating the Condition
    filters: Optional[List[ConditionFilter]] = []
```

To gain an intuition for what can be accomplished, consider the following examples:

1. Is the average confidence less than 0.3?
    >>> c = Condition(
    ...     agg=AggregateFunction.avg,
    ...     metric="confidence",
    ...     operator=Operator.lt,
    ...     threshold=0.3,
    ... )

2. Is the max DEP greater than or equal to 0.45?
    >>> c = Condition(
    ...     agg=AggregateFunction.max,
    ...     metric="data_error_potential",
    ...     operator=Operator.gte,
    ...     threshold=0.45,
    ... )
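
To make the semantics of examples 1 and 2 concrete, here is a minimal sketch in plain Python. It is not the dataquality implementation: it assumes that a condition reduces the metric column to a single number with the aggregate function and then compares that number to the threshold. The function `evaluate_condition`, the lookup tables, and the list-of-dicts `rows` are hypothetical stand-ins for `Condition.evaluate` and the DataFrame.

```python
import operator
from statistics import mean

# Hypothetical lookup tables that mirror the enum names used on this page.
AGGREGATES = {"avg": mean, "min": min, "max": max, "sum": sum}
OPERATORS = {
    "eq": operator.eq,
    "neq": operator.ne,
    "gt": operator.gt,
    "lt": operator.lt,
    "gte": operator.ge,
    "lte": operator.le,
}


def evaluate_condition(rows, agg, metric, op, threshold):
    """Return (passes, value): aggregate one column, then compare it to the threshold."""
    value = AGGREGATES[agg]([row[metric] for row in rows])
    return OPERATORS[op](value, threshold), value


# Made-up rows standing in for a processed DataFrame.
rows = [
    {"confidence": 0.2, "data_error_potential": 0.5},
    {"confidence": 0.3, "data_error_potential": 0.1},
    {"confidence": 0.1, "data_error_potential": 0.4},
]

# Example 1: is the average confidence less than 0.3?
passes_avg, avg_conf = evaluate_condition(rows, "avg", "confidence", "lt", 0.3)

# Example 2: is the max DEP greater than or equal to 0.45?
passes_max, max_dep = evaluate_condition(
    rows, "max", "data_error_potential", "gte", 0.45
)
```

Both checks pass on this toy data: the average confidence is about 0.2 and the max DEP is 0.5.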

By adding filters, you can further narrow down the scope of the condition. If the aggregate function is "pct", you don't need to specify a metric, because the filters determine which rows count toward the percentage.

3. Alert if over 80% of the dataset has confidence under 0.1
    >>> c = Condition(
    ...     operator=Operator.gt,
    ...     threshold=0.8,
    ...     agg=AggregateFunction.pct,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="confidence", operator=Operator.lt, value=0.1
    ...         ),
    ...     ],
    ... )

4. Alert if at least 20% of the dataset has drifted (Inference DataFrames only)
    >>> c = Condition(
    ...     operator=Operator.gte,
    ...     threshold=0.2,
    ...     agg=AggregateFunction.pct,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="is_drifted", operator=Operator.eq, value=True
    ...         ),
    ...     ],
    ... )

5. Alert if 5% or more of the dataset contains PII
    >>> c = Condition(
    ...     operator=Operator.gte,
    ...     threshold=0.05,
    ...     agg=AggregateFunction.pct,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="galileo_pii", operator=Operator.neq, value="None"
    ...         ),
    ...     ],
    ... )
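
The "pct" examples above can be sketched in plain Python. This is an illustration under an assumption, not the library's code: it treats "pct" as the number of rows that satisfy every filter divided by the total number of rows. The helper `pct_matching` and the `(metric, operator, value)` tuples are hypothetical stand-ins for `ConditionFilter`.

```python
import operator

# Hypothetical mapping from the operator names on this page to Python comparisons.
OPERATORS = {
    "eq": operator.eq,
    "neq": operator.ne,
    "gt": operator.gt,
    "lt": operator.lt,
    "gte": operator.ge,
    "lte": operator.le,
}


def pct_matching(rows, filters):
    """Fraction of rows that satisfy every (metric, op, value) filter."""
    if not rows:
        return 0.0
    matching = [
        row
        for row in rows
        if all(OPERATORS[op](row[metric], value) for metric, op, value in filters)
    ]
    return len(matching) / len(rows)


# Made-up rows standing in for a processed DataFrame.
rows = [
    {"confidence": 0.05, "galileo_pii": "None"},
    {"confidence": 0.08, "galileo_pii": "email"},
    {"confidence": 0.50, "galileo_pii": "None"},
    {"confidence": 0.02, "galileo_pii": "None"},
    {"confidence": 0.09, "galileo_pii": "None"},
]

# Example 3: does over 80% of the dataset have confidence under 0.1?
low_conf_pct = pct_matching(rows, [("confidence", "lt", 0.1)])  # 4 of 5 rows
alert_low_conf = low_conf_pct > 0.8

# Example 5: does 5% or more of the dataset contain PII?
pii_pct = pct_matching(rows, [("galileo_pii", "neq", "None")])  # 1 of 5 rows
alert_pii = pii_pct >= 0.05
```

On this toy data exactly 80% of rows have low confidence, so the strict `gt` comparison in example 3 does not fire; `gte` would include the boundary.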

Complex conditions can be built when the filter has a different metric than the metric used in the condition.

6. Alert if the min confidence of drifted data is less than 0.15
    >>> c = Condition(
    ...     agg=AggregateFunction.min,
    ...     metric="confidence",
    ...     operator=Operator.lt,
    ...     threshold=0.15,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="is_drifted", operator=Operator.eq, value=True
    ...         )
    ...     ],
    ... )

7. Alert if over 50% of high DEP (>=0.7) data contains PII
    >>> c = Condition(
    ...     operator=Operator.gt,
    ...     threshold=0.5,
    ...     agg=AggregateFunction.pct,
    ...     filters=[
    ...         ConditionFilter(
    ...             metric="data_error_potential", operator=Operator.gte, value=0.7
    ...         ),
    ...         ConditionFilter(
    ...             metric="galileo_pii", operator=Operator.neq, value="None"
    ...         ),
    ...     ],
    ... )
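
Example 6 filters on one column and aggregates another. The sketch below walks through that order of operations in plain Python, assuming (as the field comments in the Condition definition suggest) that filters are applied before the aggregate. The rows are made up for illustration.

```python
# Made-up rows standing in for a processed inference DataFrame.
rows = [
    {"confidence": 0.90, "is_drifted": False},
    {"confidence": 0.05, "is_drifted": False},
    {"confidence": 0.40, "is_drifted": True},
    {"confidence": 0.12, "is_drifted": True},
]

# Step 1: the filter narrows the rows (is_drifted eq True).
drifted = [row for row in rows if row["is_drifted"] is True]

# Step 2: the aggregate runs over the filtered rows only.
min_drifted_conf = min(row["confidence"] for row in drifted)

# For comparison: the same aggregate without the filter.
min_overall_conf = min(row["confidence"] for row in rows)

# Step 3: compare the aggregated value to the threshold (lt 0.15).
alert = min_drifted_conf < 0.15
```

The filter changes the value being tested: the minimum over drifted rows is 0.12, while the minimum over all rows is 0.05.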

You can also call a condition directly, which asserts that it holds for a DataFrame.

1. Assert that the average confidence is less than 0.3
>>> c = Condition(
...     agg=AggregateFunction.avg,
...     metric="confidence",
...     operator=Operator.lt,
...     threshold=0.3,
... )
>>> c(df)  # Will raise an AssertionError if False
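
One way to picture the relationship between evaluating a condition and calling it is the sketch below. `SketchCondition` is a hypothetical stand-in, hard-coded to "avg" and "lt" for brevity; how `Condition` implements the call internally is an assumption here, and only the raise-on-failure behavior comes from this page.

```python
class SketchCondition:
    """Hypothetical stand-in for Condition, reduced to avg / lt."""

    def __init__(self, metric, threshold):
        self.metric = metric
        self.threshold = threshold

    def evaluate(self, rows):
        """Return (passes, value) without raising."""
        values = [row[self.metric] for row in rows]
        value = sum(values) / len(values)
        return value < self.threshold, value

    def __call__(self, rows):
        """Raise AssertionError when the condition does not hold."""
        passes, value = self.evaluate(rows)
        assert passes, f"avg {self.metric} was {value}, expected < {self.threshold}"


check = SketchCondition(metric="confidence", threshold=0.3)

# Average is about 0.15, so the call returns silently.
check([{"confidence": 0.1}, {"confidence": 0.2}])

# Average is about 0.85, so the call raises.
try:
    check([{"confidence": 0.8}, {"confidence": 0.9}])
    failed = False
except AssertionError as err:
    failed = True
    message = str(err)
```

Use the evaluate form when you want the computed value back, and the call form when a failing check should stop a script or test.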

Aggregate Function

from dataquality import AggregateFunction

The available aggregate functions are:

class AggregateFunction(str, Enum):
    avg = "avg"
    min = "min"
    max = "max"
    sum = "sum"
    pct = "pct"

Operator

from dataquality import Operator

The available operators are:

class Operator(str, Enum):
    eq = "eq"
    neq = "neq"
    gt = "gt"
    lt = "lt"
    gte = "gte"
    lte = "lte"
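
Because both enums subclass `str`, their members behave like plain strings, which is standard Python `Enum` behavior. The sketch below redefines `Operator` locally, copied from the listing above, so that it runs without dataquality installed.

```python
from enum import Enum


class Operator(str, Enum):  # local copy of the listing above
    eq = "eq"
    neq = "neq"
    gt = "gt"
    lt = "lt"
    gte = "gte"
    lte = "lte"


# Look up a member by its string value, e.g. when reading it from a config file.
op = Operator("gte")

# Members compare equal to their string value because they are str instances.
same_as_string = op == "gte"

# Iterating the enum lists every available operator in definition order.
names = [member.value for member in Operator]

# An unknown value raises ValueError rather than returning None.
try:
    Operator("ge")
    unknown_rejected = False
except ValueError:
    unknown_rejected = True
```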

Metric & Threshold

The metric must be the name of a column in the DataFrame. The threshold is the numeric value that the aggregated metric is compared against using the operator.

Alerting

Alerting via email and Slack is in development. Please reach out to Galileo at team@rungalileo.io for more information.
