Soda – Data Quality

By | December 6, 2021

I stumbled upon Soda while looking for open source data quality (DQ) tools. It is easy to get started with and to incorporate into existing data pipelines. This repo has a notebook that should help others explore Soda and see if it suits their needs. The notebook is self-explanatory, but I wanted to jot down the detailed steps and share them for folks who are looking for the same.

Soda Spark

The documentation is clean and easy to read and can be found here. Below are the general steps for onboarding to Soda.

  1. Install soda-spark.
  2. Read about the metrics available by default.
  3. Define a Soda scan yml file for your dataset.
  4. Execute the scan on your dataframe.
    1. It currently returns a scan_results object.
    2. It used to return a Dataframe.
  5. Take action based on the results.
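
Steps 3 and 4 can be sketched roughly as follows. This is a minimal sketch, assuming the scan definition format and the `scan.execute(scan_definition, df)` call shown in the soda-spark README. The table name `demodata`, the column `id` and the helper `run_scan` are made up for illustration and are not taken from the notebook.

```python
# Scan definition in yml form. The table and column names are illustrative only.
SCAN_DEFINITION = """
table_name: demodata
metrics:
  - row_count
  - max
  - min_length
tests:
  - row_count > 0
columns:
  id:
    valid_format: uuid
    tests:
      - invalid_percentage == 0
"""


def run_scan(df, scan_definition=SCAN_DEFINITION):
    """Run a Soda scan on a Spark dataframe (hypothetical helper)."""
    # Imported lazily so this module still loads where soda-spark is not installed.
    from sodaspark import scan

    return scan.execute(scan_definition, df)
```

Calling `run_scan(df)` on a Spark dataframe should then give back the scan results object discussed in the next sections.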

Installation

Installation was tricky when I tried it in Databricks Community Edition; you can refer to Step 1 in the notebook for the workaround. Otherwise it's pretty straightforward.

Scan Results

Scan Results is a Python object that comprises two child objects:

  1. Measurements
    It holds the results of all metrics defined in the yml file.
  2. Test_result
    It holds the results of the tests defined at the table level and column level.

We can programmatically check for a specific measurement or test_result and take action based on it.
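
A programmatic check could look something like the sketch below. It assumes the scan result exposes `measurements` and `test_results` lists whose items carry `metric`, `column_name`, `value`, `passed`, `skipped` and `test.expression` attributes; please verify these against the Soda version you have installed. The helper names and the `demo_result` stand-in are hypothetical and only there so the example runs without Spark.

```python
from types import SimpleNamespace


def failed_tests(scan_result):
    """Return the test results that ran and did not pass."""
    return [t for t in scan_result.test_results if not t.skipped and not t.passed]


def measurement_value(scan_result, metric, column_name=None):
    """Return the value of the first matching measurement, or None if absent."""
    for m in scan_result.measurements:
        if m.metric == metric and m.column_name == column_name:
            return m.value
    return None


# Stand-in for a real scan result, for illustration only.
demo_result = SimpleNamespace(
    measurements=[SimpleNamespace(metric="row_count", column_name=None, value=2)],
    test_results=[
        SimpleNamespace(
            test=SimpleNamespace(expression="row_count > 0"),
            passed=True,
            skipped=False,
        ),
        SimpleNamespace(
            test=SimpleNamespace(expression="invalid_percentage == 0"),
            passed=False,
            skipped=False,
        ),
    ],
)

failures = failed_tests(demo_result)
if failures:
    # Take action here, e.g. fail the pipeline or send a notification.
    print("Failed tests:", [t.test.expression for t in failures])
```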

My Approach

I have currently converted the scan results into a measurement Dataframe and a test_result Dataframe for easier analysis. My planned next steps were to publish these results to InfluxDB, then visualize them and define alerts there. Before I could do that, I stumbled upon Soda Cloud (a paid service with a free trial).
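
The conversion can be approximated by flattening each child object into plain rows first. The sketch below is not the notebook's code: the attribute names (`metric`, `column_name`, `value`, `test.expression`, `test.column`, `passed`, `skipped`) are assumptions about the scan result objects, and the helper names and `demo_result` stand-in are hypothetical.

```python
from types import SimpleNamespace


def rows_from_measurements(scan_result):
    """Flatten measurements into one dictionary per measurement."""
    return [
        {"metric": m.metric, "column_name": m.column_name, "value": m.value}
        for m in scan_result.measurements
    ]


def rows_from_test_results(scan_result):
    """Flatten test results into one dictionary per test."""
    return [
        {
            "expression": t.test.expression,
            "column": t.test.column,
            "passed": t.passed,
            "skipped": t.skipped,
        }
        for t in scan_result.test_results
    ]


# Stand-in for a real scan result, for illustration only.
demo_result = SimpleNamespace(
    measurements=[
        SimpleNamespace(metric="row_count", column_name=None, value=2),
        SimpleNamespace(metric="max", column_name="size", value=7243),
    ],
    test_results=[
        SimpleNamespace(
            test=SimpleNamespace(expression="row_count > 0", column=None),
            passed=True,
            skipped=False,
        ),
    ],
)

measurement_rows = rows_from_measurements(demo_result)
test_rows = rows_from_test_results(demo_result)
```

The resulting lists of dictionaries can then be handed to whichever dataframe constructor you use. Measurement values can have mixed types, so an explicit schema or a cast may be needed.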

Soda Cloud

Soda Cloud does exactly what I wanted to do with InfluxData, but without any additional work. Here are the steps I needed to follow:

  1. Set up a Soda Cloud account.
  2. Create an API key.
  3. Set up the Soda Server Client.
  4. Pass one more argument when executing the scan:
    scan.execute(scan_definition, df, soda_server_client=soda_server_client)
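
Steps 3 and 4 might look like the sketch below. It assumes the `SodaServerClient` import path and constructor arguments (`host`, `api_key_id`, `api_key_secret`) documented in the soda-spark README, so check them against your installed version. The environment variable names and both helper functions are hypothetical.

```python
import os


def soda_cloud_settings(env=os.environ):
    """Collect Soda Cloud connection settings (variable names are made up)."""
    return {
        "host": env.get("SODA_CLOUD_HOST", "cloud.soda.io"),
        "api_key_id": env["SODA_API_KEY_ID"],
        "api_key_secret": env["SODA_API_KEY_SECRET"],
    }


def scan_with_cloud(scan_definition, df, env=os.environ):
    """Run a scan and push its results to Soda Cloud (hypothetical helper)."""
    # Imported lazily so this module still loads where Soda is not installed.
    from sodaspark import scan
    from sodasql.soda_server_client.soda_server_client import SodaServerClient

    soda_server_client = SodaServerClient(**soda_cloud_settings(env))
    return scan.execute(scan_definition, df, soda_server_client=soda_server_client)
```

Reading the API key from the environment keeps the secret out of the notebook itself.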

The image below shows a monitor created automatically from the scan yml file.


image

The image below shows an alert for a test that has failed.


image

You can look at all the datasets you are monitoring in one place:
Soda Data Quality Tool

You can look at the schema, monitors and Sample_data (if published) for each dataset:
image

Advantages (from my perspective)

  1. Easy setup
  2. Quick to onboard
  3. You can rely on the open source version alone, since actions can be defined based on the scan results.
  4. If you use Soda scans for all pipelines, Soda Cloud has the potential to act as a data catalog with a data health monitor.
  5. Excellent community support on Slack.
  6. Sample data or failed records can also be sent to another cloud platform instead of Soda Cloud.

Some things could take time due to the learning curve. I intend to add more examples here.

  1. Defining the Scan yml file.
  2. Understanding the metrics that are provided by default.
  3. Group metrics and Historical metrics

Next Steps and Limitations

Soda Cloud is a paid service, and I think that is fair for what it provides.

  1. Explore publishing the scan results to InfluxData (community edition).
  2. Visualize them in InfluxDB. This could be a good stack for personal projects.
  3. Soda currently supports many databases as well as soda-spark.
  4. Streaming support might be available soon.
  5. Explore integration with Airflow.
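
For the InfluxDB idea, one possible first step is to turn each measurement into a line of InfluxDB line protocol. This is only a sketch of the formatting, not something I have run against InfluxDB: the measurement name `soda_measurement`, the tag names and both helpers are hypothetical, it handles numeric values only, and the actual write to InfluxDB is left out.

```python
def escape_tag(value):
    """Backslash-escape commas, equals signs and spaces in a tag value."""
    text = str(value)
    for char in (",", "=", " "):
        text = text.replace(char, "\\" + char)
    return text


def to_line(table, metric, column, value, timestamp_ns):
    """Format one numeric Soda measurement as a line protocol record."""
    tags = f"table={escape_tag(table)},metric={escape_tag(metric)}"
    if column is not None:
        # Table-level metrics such as row_count have no column tag.
        tags += f",column={escape_tag(column)}"
    return f"soda_measurement,{tags} value={float(value)} {timestamp_ns}"


line = to_line("demodata", "row_count", None, 2, 1638748800000000000)
```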
