I stumbled upon Soda while on the lookout for open source data quality (DQ) tools. It is easy to get started with and to incorporate into existing data pipelines. This repo has a notebook that should help others explore Soda further and see if it suits their needs. The notebook is self-explanatory, but I wanted to jot down the detailed steps and share them for folks who are looking for the same.
The documentation is clean and easy to read, and can be found here. Below are the general steps for onboarding to Soda.
- Install soda-spark.
- Read about the metrics available by default.
- Identify and define the Soda scan yml file for your dataset.
- Execute the scan on your dataframe.
- Take action based on the results.
Installing soda-spark was a bit tricky when I tried it in Databricks Community Edition. You can refer to Step 1 in the notebook for the workaround. Otherwise it's pretty straightforward.
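To make the define and execute steps more concrete, here is a minimal sketch. The table name demodata, the column id and the tests are made up for illustration, and run_scan is my own hypothetical wrapper, not part of Soda. The metric names and the scan.execute call follow the soda-spark README as I understand it, so verify them against the version you install.

```python
# Illustrative scan definition. The table name, column name and tests are
# hypothetical; replace them with ones that match your dataset.
SCAN_DEFINITION = """
table_name: demodata
metrics:
  - row_count
  - missing_count
tests:
  - row_count > 0
columns:
  id:
    tests:
      - missing_count == 0
"""


def run_scan(df, scan_definition=SCAN_DEFINITION):
    """Run a Soda scan on a Spark DataFrame and return the scan result.

    Hypothetical helper: it only wraps scan.execute from soda-spark.
    """
    # Imported lazily so this snippet still loads where soda-spark is missing.
    from sodaspark import scan

    return scan.execute(scan_definition, df)
```

With soda-spark installed and a Spark dataframe df at hand, run_scan(df) should return the scan result object discussed next.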
The scan result is a Python object made up of two child objects:
- Measurements: holds the results of all metrics defined in the yml file.
- Test results: holds the results of the tests defined at table level and column level.
We can programmatically check for a specific test_result and take action based on it.
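As a rough sketch of that idea, the snippet below collects the failed tests and stops the pipeline only for the ones I would treat as blocking. StubTest and StubTestResult are stand-ins I made up so the example runs on its own; the real objects come from the scan result, and the attribute names passed and test.expression are assumptions to verify against your Soda version.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class StubTest:
    """Hypothetical stand-in for a Soda test definition."""
    expression: str


@dataclass
class StubTestResult:
    """Hypothetical stand-in for one entry of the scan's test results."""
    test: StubTest
    passed: bool


def failed_expressions(test_results: Iterable[StubTestResult]) -> List[str]:
    """Return the expressions of all tests that did not pass."""
    return [r.test.expression for r in test_results if not r.passed]


def enforce(test_results: Iterable[StubTestResult], blocking: Set[str]) -> List[str]:
    """Raise if any blocking test failed; otherwise return all failures."""
    failed = failed_expressions(test_results)
    blocking_failures = [e for e in failed if e in blocking]
    if blocking_failures:
        raise ValueError(f"Blocking data quality tests failed: {blocking_failures}")
    return failed


results = [
    StubTestResult(StubTest("row_count > 0"), passed=True),
    StubTestResult(StubTest("missing_count == 0"), passed=False),
]
print(failed_expressions(results))  # ['missing_count == 0']
```

Raising an exception is just one possible action; logging a warning or routing the data to a quarantine location would fit the same pattern.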
For now, I have converted the scan results into a measurement DataFrame and a test_result DataFrame for easier analysis. My planned next steps were to publish these results to InfluxDB, then visualize them and define alerts there. Before I could do that, I stumbled upon Soda Cloud (a paid service with a free trial).
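The conversion itself is mostly a matter of flattening each measurement into a plain row. The sketch below shows one way to do that, not necessarily the way the notebook does it. StubMeasurement is a made-up stand-in, and the field names metric, column_name and value are assumptions about what a measurement carries.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class StubMeasurement:
    """Hypothetical stand-in for one measurement in the scan result."""
    metric: str
    column_name: Optional[str]  # None for table-level metrics
    value: Any


def measurements_to_rows(
    measurements: Iterable[StubMeasurement],
) -> List[Dict[str, Optional[str]]]:
    """Flatten measurements into dicts with a uniform, string-typed value."""
    return [
        {
            "metric": m.metric,
            "column_name": m.column_name,
            # Values can be ints, floats or strings; cast them to str so
            # every row has the same type in the value column.
            "value": str(m.value),
        }
        for m in measurements
    ]


rows = measurements_to_rows(
    [
        StubMeasurement("row_count", None, 6),
        StubMeasurement("missing_count", "id", 0),
    ]
)
print(rows)
```

A list of flat rows like this can then be handed to Spark to build a dataframe, ideally with an explicit schema so that the nullable column_name is typed correctly.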
Soda Cloud does exactly what I wanted to do with InfluxDB, but without any additional work. Here are the steps I needed to follow:
- Set up a Soda Cloud account
- Create an API key
- Set up the Soda Server Client
- Add another argument when executing the scan:
scan.execute(scan_definition, df, soda_server_client=soda_server_client)
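For the client setup step, here is a hedged sketch. The environment variable names SODA_API_KEY_ID and SODA_API_KEY_SECRET and both helper functions are my own inventions. The SodaServerClient import path, the host cloud.soda.io and the keyword arguments are written as I recall them from the soda-spark README, so treat them as assumptions and check them against your installed version.

```python
import os
from typing import Dict, Mapping


def load_soda_cloud_credentials(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read the Soda Cloud API key from the environment.

    The variable names are hypothetical; the point is to keep the key
    out of the notebook itself.
    """
    key_id = env.get("SODA_API_KEY_ID")
    key_secret = env.get("SODA_API_KEY_SECRET")
    if not key_id or not key_secret:
        raise RuntimeError("Soda Cloud API key id and secret must both be set")
    return {
        "host": "cloud.soda.io",  # assumed default Soda Cloud host
        "api_key_id": key_id,
        "api_key_secret": key_secret,
    }


def build_soda_server_client(env: Mapping[str, str] = os.environ):
    """Create the client that is passed to scan.execute."""
    # Imported lazily so this snippet still loads where soda-sql is missing.
    # The import path is an assumption based on the soda-spark README.
    from sodasql.soda_server_client.soda_server_client import SodaServerClient

    return SodaServerClient(**load_soda_cloud_credentials(env))
```

The object returned by build_soda_server_client() is what goes into the soda_server_client argument of the scan.execute call above.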
The image below shows a monitor created automatically from the scan yml file.
The image below shows an alert for a test that has failed.
You can look at all the datasets you are monitoring in one place.
You can look at the schema, monitors and sample data (if published) for each dataset:
Advantages (from my perspective)
- Easy setup
- Quick to onboard
- The open source version alone is enough to get value, since we can define actions based on the scan results.
- If you use Soda scans for all pipelines, Soda Cloud has the potential to act as a data catalog with a data health monitor.
- Excellent community support on Slack.
- Sample data or failed records can also be sent to another cloud platform instead of Soda Cloud.
Some things could take time because of the learning curve. I intend to add more examples here.
- Defining the Scan yml file.
- Understanding the metrics that are provided by default.
- Group metrics and Historical metrics
Next Steps and Limitations
Soda Cloud is a paid service, and I think that is fair for what it provides.
- Explore publishing the scan results to InfluxDB (community edition).
- Visualize them in InfluxDB. This could be a good stack for personal projects.
- Soda currently supports many databases as well as Spark (through soda-spark).
- Streaming support might be coming soon.
- Explore integration with Airflow.