Working Notes: a commonplace notebook for recording & exploring ideas.
Home. Site Map. Subscribe. More at expLog.

Data

Perhaps paranoid, but created after constantly burning myself while creating and evaluating datasets.
Run through the checklist and document all the answers in the evaluation.
Save a dated version of the run through on quip or the internal wiki.
Consider automating any tests that can be.
Consider automatically generating these queries given the table structure.
Table Characteristics
Does the table /cover/ the expected number of ?
Is the total number of rows expected?
Is this direct instrumentation or a derived table?
- Derived Table
Manually transform a small sample of rows and check that the results match.
- Direct Instrumentation
Can any numbers be determined from any alternative sources? If yes, run and document sanity checks.
- Column Relationships
Assertions possible on individual columns? Apply them.
Column Characteristics
- General
Are there instrumentation related artifacts in the dataset? Explain them.
How many values are NULL? Are they expected?
Is this column derived from other columns in this table? Sanity check the calculations
- If calculated, run a manual query comparing the values and the distribution of the result.
- If lookup related, check that the numbers on both sides of the mapping make sense.
- User Ids
Do the userids satisfy userid requirements?
Is the number of distinct users within reason?
- Normal Columns
Does the column have the expected number of distinct values?
For enum-like columns, are there corrupted values?
- Numerical columns
What are the min/max values? Do they make sense?
What does the distribution of the data look like? Normal / Bimodal / etc.? Does it match the expected distribution?
Note the min/p1/p25/p50/mean/mode/p75/p90/p99/max values.
Are there physical constraints on the columns (e.g. only positive values)? Are they satisfied?
Should NULL values be coerced to zero or excluded? Are potential queries updated accordingly?
If representing a quantity, like time or energy - are the units documented in the column? Are values consistent with the units?
Outliers
Explicitly collect samples with outlier values across different columns, preferably those with the most outliers.
Time Series Data
Is the volume consistent across multiple dates?
Is the volume consistent within a day?
Do troughs and peak correspond to user behaviour? Compare the troughs and peaks.
References Possible Additional Checks
Wikipedia
Methodology for data validation

— Kunal