You have questions, we have (some) answers!

pip install dataquality

The first thing to try in this case is to restart your kernel. Dataquality uses certain Python packages that require your kernel to be restarted after installation. In Jupyter you can click "Kernel -> Restart".
In Colab you can click "Runtime -> Disconnect and delete runtime".
If you already had vaex installed on your machine prior to installing dataquality, there is a known bug when upgrading. Solution:
pip uninstall -y vaex-core vaex-hdf5 && pip install --upgrade --force-reinstall dataquality
Then restart your Jupyter/Colab kernel.
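To see whether the vaex packages are present in your current environment before reinstalling, a small standard-library check like the one below can help. This is only a sketch: the helper name installed_versions is hypothetical and is not part of dataquality.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Package names taken from the uninstall/reinstall command above.
print(installed_versions(["vaex-core", "vaex-hdf5", "dataquality"]))
```

If vaex-core or vaex-hdf5 report a version that was installed before dataquality, the uninstall and force-reinstall step is the one to try.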

It's possible that:
  • your run hasn't finished processing
  • you've logged some data incorrectly
  • you may have found a bug (congrats!)
First, to see what happened to your data, you can run dq.wait_for_run() (you can optionally pass in the project and run name, or the most recent will be used).
This function will wait for your run to finish processing. If it's completed, check the console again by refreshing.
If that shows an exception, your run failed to be processed. You can see the logs from your model training by running dq.get_dq_log_file() which will download and return the path to your logfile. That may indicate the issue. Feel free to reach out to us for more help!

Yes (glad you asked)! You can attach any metadata fields you'd like to your original dataset, as long as they are primitive datatypes (numbers and strings).
In all available logging functions for input data, you can attach custom metadata:
import pandas as pd
import dataquality as dq

# Extra columns beyond id/text/label can be logged as metadata
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "text": ["sen 1", "sen 2", "sen 3", "sen 4"],
    "label": [0, 1, 1, 0],
    "customer_score": [0.66, 0.98, 0.12, 0.05],
    "sentiment": ["happy", "sad", "happy", "angry"],
})
dq.log_dataset(df, meta=["customer_score", "sentiment"])
texts = [
    "Text sample 1",
    "Text sample 2",
    "Text sample 3",
    "Text sample 4",
]
labels = ["B", "C", "A", "A"]
# One metadata value per sample, keyed by field name
meta = {
    "sample_importance": ["high", "low", "low", "medium"],
    "quality_ranking": [9.7, 2.4, 5.5, 1.2],
}
ids = [0, 1, 2, 3]
split = "training"
dq.log_data_samples(texts=texts, labels=labels, ids=ids, meta=meta, split=split)
This data will show up in the console under the column dropdown
And you can see any performance metric grouped by your categorical metadata
Lastly, once active, you can further filter your data by your metadata fields, helping find high-value cohorts
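Since metadata fields must hold primitive values (numbers and strings), it can be handy to sanity-check them before logging. The sketch below is not part of dataquality: find_invalid_meta is a hypothetical helper, and the one-value-per-sample length check is an assumption based on the examples above.

```python
def find_invalid_meta(meta, n_samples):
    """Return {field: reason} for metadata fields that look unsafe to log."""
    problems = {}
    for field, values in meta.items():
        if len(values) != n_samples:
            # Assumption: each field needs exactly one value per sample.
            problems[field] = f"expected {n_samples} values, got {len(values)}"
        elif not all(isinstance(v, (str, int, float)) for v in values):
            problems[field] = "contains non-primitive values"
    return problems


meta = {
    "sample_importance": ["high", "low", "low", "medium"],
    "quality_ranking": [9.7, 2.4, 5.5, 1.2],
    "tags": [["a"], ["b"], ["c"], ["d"]],  # lists are not primitive
}
print(find_invalid_meta(meta, n_samples=4))
```

An empty result means every field passed both checks; any field listed in the result should be fixed or dropped before you pass meta to a logging function.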

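from datasets import Dataset, dataset_dict
# Assumed import path for the hf helpers used below
from dataquality.integrations import hf

# tokenizer, data_collator and MINIBATCH_SIZE are assumed to be defined already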
file_name_train = "exported_galileo_sample_file_train.parquet"
file_name_val = "exported_galileo_sample_file_val.parquet"
file_name_test = "exported_galileo_sample_file_test.parquet"
ds_train = Dataset.from_parquet(file_name_train)
ds_val = Dataset.from_parquet(file_name_val)
ds_test = Dataset.from_parquet(file_name_test)
ds_exported = dataset_dict.DatasetDict({"train": ds_train, "validation": ds_val, "test": ds_test})
labels = ds_exported["train"]["ner_labels"][0]
tokenized_datasets = hf.tokenize_and_log_dataset(ds_exported, tokenizer, labels)
train_dataloader = hf.get_dataloader(tokenized_datasets["train"], collate_fn=data_collator, batch_size=MINIBATCH_SIZE, shuffle=True)
val_dataloader = hf.get_dataloader(tokenized_datasets["validation"], collate_fn=data_collator, batch_size=MINIBATCH_SIZE, shuffle=False)
test_dataloader = hf.get_dataloader(tokenized_datasets["test"], collate_fn=data_collator, batch_size=MINIBATCH_SIZE, shuffle=False)

import dataquality as dq
from datasets import Dataset
# A vaex dataframe
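df = dq.metrics.get_dataframe(
    project_name, run_name, split, hf_format=True, tagging_schema="BIO"
)
# Write the vaex dataframe to parquet so that datasets can read it back
df.export("data.parquet")
ds = Dataset.from_parquet("data.parquet")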

If you're seeing an error similar to:
JSONDecodeError: Expecting ',' delimiter: line 1 column 84 (char 83)
it's likely that some data in your text field is not valid JSON (extra quotes " or '). Unfortunately, we cannot modify the content of your span text, but you can strip out the text field with a regex. Given a pandas dataframe df with a column spans (from a Galileo export), you can replace
df["spans"] = df["spans"].apply(json.loads)
with (make sure to import re)
df["spans"] = df["spans"].apply(lambda row: json.loads(re.sub(r',"text".*?}', "}", row)))
The pattern assumes that "text" is the last key in each span and that the span text itself contains no closing brace.
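To see the regex workaround in action without a dataframe, here is a minimal standalone sketch. The sample string and the parse_spans helper are made up for illustration (they are not from a real Galileo export), and the pattern assumes "text" is the last key of each span.

```python
import json
import re

# Matches from ',"text"' up to the first closing brace of a span object.
TEXT_FIELD = re.compile(r',\s*"text".*?}')


def parse_spans(raw):
    """Parse a spans JSON string, dropping the text field only if parsing fails."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        return json.loads(TEXT_FIELD.sub("}", raw))


# Hypothetical export row: the unescaped quotes around boss break json.loads.
bad = (
    '[{"start": 0, "end": 7, "label": "PER", "text": "the "boss""}, '
    '{"start": 9, "end": 12, "label": "LOC", "text": "Rome"}]'
)
spans = parse_spans(bad)
print(spans)
```

Trying json.loads first means rows that are already valid keep their text field; only the broken rows lose it.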

Great observation! Let's take a real example below, from the WikiNER IT dataset. As you can see, the Anemone apennina clearly looks like a wrong tag error (correct span boundaries, incorrect class prediction), but is marked as a span shift.
We can further validate this with dq.metrics.get_dataframe. We can see that there are 2 spans with identical character boundaries, one with a label and one without (which is the prediction span).
So what is going on here? When Galileo computes error types for each span, they are computed at the byte-pair (BPE) level using the span token indices, not the character indices. When looking at the console, however, you are seeing the character-level indices, because that's a much more intuitive view of your data. That conversion from token (fine-grained) to character (coarse-grained) level indices can cause index differences to overlap as a result of less-granular information.
We can again validate this with dq.metrics by looking at the raw data logged to Galileo. As we can see, at the token level, the span start and end indices do not align, and in fact overlap (ids 21948 and 21950), which is the reason for the span_shift error 🤗
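To make the token vs. character distinction concrete, here is a toy sketch. The sub-word split and the word-snapping conversion are illustrative assumptions, not Galileo's actual tokenizer or conversion logic; they only show how two different token spans can end up with identical character boundaries.

```python
# Toy sub-word tokens for "Anemone apennina": (token, index of the word it belongs to).
TOKENS = [("An", 0), ("emone", 0), ("ap", 1), ("enn", 1), ("ina", 1)]
# Character boundaries (start, end) of each word in the text.
WORDS = [(0, 7), (8, 16)]


def to_char_span(token_start, token_end):
    """Map a half-open token span to word-aligned character boundaries."""
    first_word = TOKENS[token_start][1]
    last_word = TOKENS[token_end - 1][1]
    return WORDS[first_word][0], WORDS[last_word][1]


gold_tokens = (0, 5)  # covers every sub-word token
pred_tokens = (0, 4)  # stops one sub-word token early
gold_chars = to_char_span(*gold_tokens)
pred_chars = to_char_span(*pred_tokens)

print("token spans differ:", gold_tokens != pred_tokens)
print("char spans match:", gold_chars == pred_chars)
```

In this toy setup the prediction is a span shift at the token level, yet both spans display the same character boundaries, which mirrors what you see in the console.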

We manage deployments and updates to the versions of services running in your cluster via GitHub Actions. Each deployment or update produces logs that go into a bucket on Galileo's cloud (GCP). During our private deployment process (for Enterprise users), customers can provide us with their email addresses so that they get access to these deployment logs.

The client logs are stored in the home (~) folder of the machine where the training occurs.

For Enterprise users, data does not leave the customer VPC/data center. For users of the Free version of our product, we store data and model outputs on secured servers in the cloud. We pride ourselves on taking data security very seriously.

Yes, we do! Contact us to learn more.

You can write us at [email protected]
