DQ Auto

The fastest way to improve your data
To get started with auto instantly, see Getting Started.
Welcome to auto, your newest superpower in the world of Machine Learning!
We now know that more data isn’t the answer; better data is. But how do you find that data? We already know the answer to that: ✨Galileo✨
But how do you get started now, and iterate quickly with data-centric techniques?
Enter: the secret sauce to instant data insights. We handle the training, you focus on the data.

What is auto?

dq.auto is a helper function that trains a cutting-edge transformer (or any model of your choosing from HuggingFace) on your dataset so it can be processed by Galileo. You provide the data, Galileo trains the model, and you’re off to the races.
The goal of this tool, and Galileo at large, is to build a data-centric view of machine learning. Keep your model static and iterate on the dataset until it’s well-formed and representative of your problem space. This is the path to robust and stable ML models.
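To make "iterate on the dataset" concrete, here is a minimal sketch of one data-centric check: finding texts that appear more than once with different labels. This is not Galileo's method and is not part of dataquality. The function name `find_label_conflicts` and the normalization rule (strip whitespace, lowercase) are assumptions chosen for illustration.

```python
from collections import defaultdict


def find_label_conflicts(rows):
    """Return texts that appear more than once with different labels.

    `rows` is an iterable of (text, label) pairs. Texts are compared after
    stripping whitespace and lowercasing (an illustrative choice, not a rule).
    """
    labels_by_text = defaultdict(set)
    for text, label in rows:
        labels_by_text[text.strip().lower()].add(label)
    # Keep only texts that were assigned more than one distinct label
    return {
        text: sorted(labels)
        for text, labels in labels_by_text.items()
        if len(labels) > 1
    }


rows = [
    ("I love this", "joy"),
    ("i love this ", "anger"),
    ("So tired today", "sadness"),
]
conflicts = find_label_conflicts(rows)
print(conflicts)
```

Samples flagged this way are candidates for relabeling or removal before the next training run, while the model itself stays unchanged.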

What is auto not?

auto is not an AutoML tool. It will not perform hyperparameter tuning, and it will not search through a gallery of models to squeeze out every percentage point of F1.
In fact, auto is quite the opposite. It intentionally keeps the model static, forcing you to understand and fix your data to improve performance.


It turns out that in many (most) cases, you don’t need to train your own model to find data insights. In fact, you often don’t need to build your own custom model at all! HuggingFace, and in particular transformers, has brought the most cutting-edge deep learning algorithms straight to your fingertips, allowing you to leverage the best research has to offer in one line of code.
Transformer models have consistently outperformed their predecessors, and HuggingFace is constantly updating their fleet of free models for anyone to download.
So if you don’t need to build a custom model anymore, why not let Galileo do it for you?

Get Started

Simply install: pip install --upgrade dataquality
and use!
import dataquality as dq
# Get insights on the official 'emotion' dataset
dq.auto(hf_data="emotion")
You can also provide data as files or pandas dataframes:
import pandas as pd
from sklearn.datasets import fetch_20newsgroups
import dataquality as dq
# Load the newsgroups dataset from sklearn
newsgroups_train = fetch_20newsgroups(subset='train')
newsgroups_test = fetch_20newsgroups(subset='test')
# Convert to pandas dataframes
df_train = pd.DataFrame({"text": newsgroups_train.data, "label": newsgroups_train.target})
df_test = pd.DataFrame({"text": newsgroups_test.data, "label": newsgroups_test.target})
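# Run auto on the dataframes, passing the label names for the integer targets
dq.auto(train_data=df_train, test_data=df_test, labels=newsgroups_train.target_names)
dq.auto will automatically figure out your task and start the process for you.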
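If your data lives in files rather than dataframes, a CSV with the same two columns is a natural starting point. The sketch below uses only the standard library to write and read back such a file. The helper `write_text_label_csv` is hypothetical, and the assumption that a file should mirror the `text` and `label` columns of the dataframe example should be checked against the library's own documentation.

```python
import csv
import os
import tempfile


def write_text_label_csv(path, texts, labels):
    """Write parallel lists of texts and labels to a CSV with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        # Column names mirror the dataframe example (an assumption)
        writer.writerow(["text", "label"])
        writer.writerows(zip(texts, labels))


csv_path = os.path.join(tempfile.mkdtemp(), "train.csv")
write_text_label_csv(csv_path, ["good movie", "bad, boring plot"], ["pos", "neg"])

# Read the file back to confirm that quoting and columns survived the round trip
with open(csv_path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(rows)
```

Using `csv.writer` instead of joining strings by hand keeps texts that contain commas or quotes intact.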
For more docs and examples, see help(dq.auto) in your notebook! Happy data fixing 🚀