Getting Started

How to use Galileo for Text Classification?

Discover the Console

Upon completing a run, you'll be taken to the Galileo Console. The first thing you'll notice is your dataset on the right. On each row, we show you the sample's text, its Ground Truth and Prediction labels, and the Data Error Potential of the sample. By default, your samples are sorted by Data Error Potential.

You can also view your samples in the model's embedding space. This can help you get a semantic understanding of your dataset. Using features like Color-By DEP, you might discover pockets of problematic data (e.g. decision boundaries that might benefit from more samples, or a cluster of garbage samples).

Your left pane is called the Insights Menu. At the top, you can see your dataset size and choose the metric that guides your exploration (F1 by default). Both the size and the metric update as you add filters to your dataset.

Your main sources of insights will be Alerts, Metrics, and Clusters. Alerts are a distilled list of the issues we've identified in your dataset. Insights such as Mislabeled Samples, Class Imbalance, and Overlapping Classes are surfaced as Alerts.

Clicking on an Alert will filter the dataset to the subset of data that corresponds to the Alert.

Under metrics, you'll find different charts, such as:

  • F1 by Class

  • Sample Count by Class

  • Overlapping Classes

  • Top Misclassified Pairs

  • DEP Distribution

These charts are dynamic and update as you add different filters. They're also interactive - clicking on a class or group of classes will filter the dataset accordingly, allowing you to inspect and fix the samples.
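To make two of these charts concrete, here is a minimal sketch of how F1 by Class and Top Misclassified Pairs could be computed from ground truth and prediction labels. It uses only the Python standard library; the function names and the toy labels are illustrative assumptions, not Galileo's implementation or API.

```python
from collections import Counter


def f1_by_class(gold, pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def top_misclassified_pairs(gold, pred, k=3):
    """Return the k most frequent (ground truth, prediction) mismatches."""
    pairs = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    return pairs.most_common(k)


# Toy labels, invented for illustration only.
gold = ["sports", "sports", "sports", "politics", "politics", "tech", "tech"]
pred = ["sports", "politics", "politics", "politics", "politics", "tech", "sports"]

scores = f1_by_class(gold, pred)
pairs = top_misclassified_pairs(gold, pred, k=1)
print(scores)
print(pairs)
```

In this toy data, "sports" is most often confused with "politics", which is the kind of pair the Top Misclassified Pairs chart is meant to surface.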

The third tab is Clusters. We automatically cluster your dataset, taking into account frequent words and semantic distance. For each cluster, we show you its average DEP score, F1, and size, which you can use to decide which clusters are worth looking into. We also show you the common words in the cluster and, if you enable your OpenAI integration, we use GPT to generate summaries of your clusters (more details here).

Taking Action

Once you've identified a problematic subset of data, Galileo allows you to fix your samples with the goal of improving your F1 or performance metric of choice. In Text Classification runs, we allow you to:

  • Change Label - Re-assign the label of your sample right in-tool

  • Remove - Remove problematic samples you want to discard from your dataset

  • Edit Data - Fix typos or extraneous characters in your samples

  • Send to Labelers - Send your samples to your labelers through our Labeling Integrations

  • Export - Download your samples so you can fix them elsewhere

Your changes are tracked in your Edits Cart. There you can view a summary of the changes you've made, undo them, or download a clean, fixed dataset to retrain your model.
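If you export samples to fix them elsewhere, the same kinds of changes can be applied in a few lines of Python. The sketch below is only an illustration: the record layout (`id`, `text`, `label`), the edit format, and the action names are hypothetical and do not reflect Galileo's export schema.

```python
def apply_edits(samples, edits):
    """Apply relabel, text-edit, and remove operations to a list of samples.

    Both `samples` and `edits` use a hypothetical layout chosen for this sketch.
    The input list is left unchanged; a new list is returned.
    """
    by_id = {s["id"]: dict(s) for s in samples}  # shallow copy of each record
    for edit in edits:
        sample_id = edit["id"]
        if sample_id not in by_id:
            continue  # already removed, or never exported
        action = edit["action"]
        if action == "remove":
            del by_id[sample_id]
        elif action == "change_label":
            by_id[sample_id]["label"] = edit["new_label"]
        elif action == "edit_text":
            by_id[sample_id]["text"] = edit["new_text"]
        else:
            raise ValueError(f"unknown action: {action!r}")
    return list(by_id.values())


# Invented example data.
samples = [
    {"id": 1, "text": "Great match last nite", "label": "politics"},
    {"id": 2, "text": "asdf qwerty ####", "label": "tech"},
    {"id": 3, "text": "New phone released", "label": "tech"},
]
edits = [
    {"id": 1, "action": "change_label", "new_label": "sports"},
    {"id": 1, "action": "edit_text", "new_text": "Great match last night"},
    {"id": 2, "action": "remove"},
]

clean = apply_edits(samples, edits)
print(clean)
```

Keeping the edits as a separate list, rather than mutating the data directly, mirrors the idea of a cart: each change can be reviewed or dropped before the clean dataset is produced.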

Changing Splits

Galileo maintains your dataset splits. Your data is logged as a Training, Test, and/or Validation split, and you can explore each split independently. Some alerts, such as Underfitting Classes or Overfitting Classes, look at cross-split performance. For the most part, however, each split is treated independently.
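As a rough illustration of what a cross-split check can look like, the sketch below compares per-class F1 on the Training and Validation splits. The criteria Galileo uses for its Underfitting Classes and Overfitting Classes alerts are not described here, so the `gap` and `floor` thresholds, the function name, and the scores are all assumptions made for this example.

```python
def flag_classes(train_f1, val_f1, gap=0.15, floor=0.5):
    """Flag classes by comparing per-class F1 across two splits.

    Assumed rules (not Galileo's):
      - "underfitting": F1 is below `floor` on both splits
      - "overfitting": training F1 exceeds validation F1 by more than `gap`
    """
    flags = {}
    for label in sorted(train_f1.keys() & val_f1.keys()):
        train_score, val_score = train_f1[label], val_f1[label]
        if train_score < floor and val_score < floor:
            flags[label] = "underfitting"
        elif train_score - val_score > gap:
            flags[label] = "overfitting"
    return flags


# Invented per-class F1 scores for two splits.
train_f1 = {"sports": 0.95, "politics": 0.92, "tech": 0.40}
val_f1 = {"sports": 0.93, "politics": 0.70, "tech": 0.38}

flags = flag_classes(train_f1, val_f1)
print(flags)
```

Here "politics" scores much higher on Training than on Validation, while "tech" is weak on both, which is the kind of pattern a cross-split alert is designed to highlight.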

To switch splits, find the Splits dropdown next to your project and run name near the top of the screen. By default, the Training split is shown first.

Get started with a notebook 📘

Start integrating Galileo with our supported frameworks 💻

  • HuggingFace 🤗

  • PyTorch

  • TensorFlow

  • Keras
