Toxicity
Understand Galileo's Toxicity Metric
Definition: Flags whether a response contains hateful or toxic information. Output is a binary classification of whether a response is toxic or not.
Calculation: We utilize a pre-trained BERT model, fine-tuned on three Jigsaw challenge datasets: Toxic Comment Classification Challenge, Jigsaw Unintended Bias in Toxicity Classification, Jigsaw Multilingual Toxic Comment Classification averaging 96% accuracy on the validation set.
Usefulness: Identify responses that contain toxic comments and take preventative measure such as fine-tuning or implementing guardrails that flag responses to prevent future occurrences.
Last updated