Working Notes: a commonplace notebook for recording & exploring ideas.
Perhaps paranoid, but I created this checklist after repeatedly burning myself while creating and evaluating datasets.
Run through the checklist and document all the answers in the evaluation.
Save a dated version of the run-through on Quip or the internal wiki.
Consider automating any tests that can be automated.
Consider automatically generating these queries given the table structure.
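One way to generate these queries from the table structure is sketched below. It is only an illustration: the function name `profile_queries`, the "numeric" / "categorical" labels, and the `events` table are all made up for the example, and the SQL is checked only against SQLite.

```python
import sqlite3


def profile_queries(table, columns):
    """Build one profiling query per column from a {name: kind} mapping.

    `kind` is a hypothetical label, either "numeric" or "categorical".
    Names are interpolated directly into the SQL, so they must come from
    a trusted schema description, never from user input.
    """
    queries = {"__rows__": f"SELECT COUNT(*) AS row_count FROM {table}"}
    for name, kind in columns.items():
        parts = [
            f"SUM(CASE WHEN {name} IS NULL THEN 1 ELSE 0 END) AS null_count",
            f"COUNT(DISTINCT {name}) AS distinct_count",
        ]
        if kind == "numeric":
            parts += [f"MIN({name}) AS min_value", f"MAX({name}) AS max_value"]
        queries[name] = f"SELECT {', '.join(parts)} FROM {table}"
    return queries


# Toy table, invented for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (userid INTEGER, duration_ms REAL, platform TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(1, 120.0, "ios"), (2, None, "android"), (2, 80.0, "ios")],
)

queries = profile_queries(
    "events",
    {"userid": "numeric", "duration_ms": "numeric", "platform": "categorical"},
)
results = {name: conn.execute(sql).fetchone() for name, sql in queries.items()}
print(results["duration_ms"])  # (null_count, distinct_count, min_value, max_value)
```

The answers still need a human to judge whether they are expected; the generator only removes the typing.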
Table Characteristics
Does the table /cover/ the expected number of
Is the total number of rows expected?
Is this direct instrumentation or a derived table?
Manually transform a small sample of rows and check that the results match.
Can any numbers be determined from any alternative sources? If yes, run and document sanity checks.
Are assertions possible on individual columns? If so, apply them.
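For derived tables, the manual transform check above could be sketched roughly as follows. The helper `check_derived_sample`, the `transform` callback, and the column names are hypothetical; the idea is only to re-derive a small sample by hand and list the keys where the stored result disagrees.

```python
import random


def check_derived_sample(source_rows, derived_rows, transform, key,
                         sample_size=5, seed=0):
    """Re-derive a sample of source rows and return the keys that disagree."""
    derived_by_key = {row[key]: row for row in derived_rows}
    rng = random.Random(seed)  # fixed seed so the run-through is repeatable
    sample = rng.sample(source_rows, min(sample_size, len(source_rows)))
    mismatches = []
    for row in sample:
        expected = transform(row)
        # A missing derived row counts as a mismatch too.
        if derived_by_key.get(row[key]) != expected:
            mismatches.append(row[key])
    return mismatches


# Toy data, invented for this example: duration should be end minus start.
source = [
    {"id": 1, "start_ms": 0, "end_ms": 50},
    {"id": 2, "start_ms": 10, "end_ms": 40},
    {"id": 3, "start_ms": 5, "end_ms": 5},
]
derived = [
    {"id": 1, "duration_ms": 50},
    {"id": 2, "duration_ms": 300},  # deliberately wrong
    {"id": 3, "duration_ms": 0},
]


def to_duration(row):
    return {"id": row["id"], "duration_ms": row["end_ms"] - row["start_ms"]}


mismatches = check_derived_sample(source, derived, to_duration, key="id", sample_size=3)
print(mismatches)  # keys whose derived row does not match the manual transform
```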
Column Characteristics
Are there instrumentation-related artifacts in the dataset? Explain them.
How many values are NULL? Are they expected?
Is this column derived from other columns in this table? Sanity check the calculations.
Do the userids satisfy userid requirements?
Is the number of distinct users within reason?
Does the column have the expected number of distinct values?
For enum-like columns, are there corrupted values?
What are the min/max values? Do they make sense?
What does the distribution of the data look like? Normal / Bimodal / etc.? Does it match the expected distribution?
Note the min/p1/p25/p50/mean/mode/p75/p90/p99/max values.
Are there physical constraints on the columns (e.g. only positive values)? Are they satisfied?
Should NULL values be coerced to zero or excluded? Are potential queries updated accordingly?
If the column represents a quantity, like time or energy, are the units documented in the column? Are values consistent with the units?
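Several of the column questions above (NULL counts, distinct values, min/max, percentiles, physical constraints) can be answered in one pass. The sketch below uses only the Python standard library; `profile_column` and the sample values are invented for illustration, and it assumes the column has at least two non-NULL numeric values.

```python
import statistics


def profile_column(values):
    """Summarise one numeric column; None stands in for NULL."""
    present = [v for v in values if v is not None]
    ordered = sorted(present)
    # quantiles(n=100) returns 99 cut points; cuts[p - 1] is the p-th percentile.
    cuts = statistics.quantiles(ordered, n=100, method="inclusive")
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "min": ordered[0],
        "p1": cuts[0],
        "p25": cuts[24],
        "p50": cuts[49],
        "mean": statistics.fmean(present),
        "mode": statistics.mode(present),
        "p75": cuts[74],
        "p90": cuts[89],
        "p99": cuts[98],
        "max": ordered[-1],
        # Example physical constraint: the quantity should never be negative.
        "negatives": sum(1 for v in present if v < 0),
    }


# Made-up durations in milliseconds, with NULLs and one impossible value.
sample = [None, 12.0, 15.0, 15.0, 18.0, 250.0, None, -3.0]
profile = profile_column(sample)
for name, value in profile.items():
    print(f"{name:>9}: {value}")
```

Whether a NULL should count as zero or be excluded is a per-column decision; this sketch excludes NULLs from every statistic and reports them separately.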
Outliers
Explicitly collect samples with outlier values across different columns, preferably those with the most outliers.
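The note above does not say what counts as an outlier. As one possible definition, the sketch below uses Tukey's fences (values more than 1.5 times the interquartile range outside the quartiles), then lists columns with the most outliers first and keeps a few sample rows for each. The function name and the toy rows are assumptions made for the example.

```python
import statistics


def outlier_samples(rows, columns, k=1.5, limit=5):
    """Return (column, outlier_count, sample_rows), most outliers first."""
    report = []
    for col in columns:
        values = [r[col] for r in rows if r[col] is not None]
        q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
        spread = q3 - q1
        low, high = q1 - k * spread, q3 + k * spread
        flagged = [
            r for r in rows
            if r[col] is not None and not (low <= r[col] <= high)
        ]
        report.append((col, len(flagged), flagged[:limit]))
    report.sort(key=lambda item: item[1], reverse=True)
    return report


# Toy rows, invented for this example.
rows = [
    {"duration_ms": 10, "bytes": 100},
    {"duration_ms": 12, "bytes": 110},
    {"duration_ms": 11, "bytes": 105},
    {"duration_ms": 13, "bytes": 95},
    {"duration_ms": 12, "bytes": 102},
    {"duration_ms": 500, "bytes": 98},
]
report = outlier_samples(rows, ["bytes", "duration_ms"])
for col, count, samples in report:
    print(col, count, samples)
```

Keeping the whole row, not just the outlying value, makes it easier to see whether the same rows are odd in several columns at once.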
Time Series Data
Is the volume consistent across multiple dates?
Is the volume consistent within a day?
Do troughs and peaks correspond to user behaviour? Compare the troughs and peaks.
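A rough way to check volume consistency is to count rows per bucket and flag any bucket that sits far from the median. In the sketch below the function name, the 50% tolerance, and the dates are all assumptions; the same function works for hours within a day if hour labels are passed instead of dates.

```python
import statistics
from collections import Counter


def flag_volume_anomalies(buckets, tolerance=0.5):
    """Return {bucket: count} for buckets whose volume deviates from the
    median bucket volume by more than `tolerance` (as a fraction)."""
    counts = Counter(buckets)
    median = statistics.median(counts.values())
    return {
        bucket: n
        for bucket, n in sorted(counts.items())
        if abs(n - median) > tolerance * median
    }


# Made-up row dates: one day has far fewer rows than the others.
dates = (
    ["2024-01-01"] * 100
    + ["2024-01-02"] * 104
    + ["2024-01-03"] * 20
    + ["2024-01-04"] * 98
)
anomalies = flag_volume_anomalies(dates)
print(anomalies)
```

This only catches buckets that are present but unusual; a date with no rows at all never appears in the counts, so missing dates need a separate check against the expected calendar.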
References Possible Additional Checks
— Kunal