Audit your warehouse. Pick one critical table. Enforce NOT NULL on every single column. If you truly need a missing value, use a sentinel row (e.g., id = 0, name = "UNKNOWN"). You will be shocked how many bugs disappear.
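As a minimal sketch of the sentinel-row idea, the snippet below uses Python's built-in sqlite3 module with an in-memory database. The customer and orders tables, their columns, and the helper try_insert_null_customer are hypothetical names chosen for illustration; your warehouse engine and DDL dialect will differ.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled

# Hypothetical tables: every column is NOT NULL (the primary key implicitly so).
conn.execute("""
    CREATE TABLE customer (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )
""")
conn.execute("""
    CREATE TABLE orders (
        id           INTEGER PRIMARY KEY,
        customer_id  INTEGER NOT NULL DEFAULT 0 REFERENCES customer(id),
        amount_cents INTEGER NOT NULL
    )
""")

# The sentinel row: "unknown customer" is an explicit, joinable entity.
conn.execute("INSERT INTO customer (id, name) VALUES (0, 'UNKNOWN')")

# An order whose customer is not known falls back to the sentinel, not to NULL.
conn.execute("INSERT INTO orders (id, amount_cents) VALUES (1, 1250)")


def try_insert_null_customer(connection: sqlite3.Connection) -> bool:
    """Return True if the database accepted a NULL customer_id, False if it refused."""
    try:
        connection.execute(
            "INSERT INTO orders (id, customer_id, amount_cents) VALUES (2, NULL, 500)"
        )
    except sqlite3.IntegrityError:
        return False
    return True


null_accepted = try_insert_null_customer(conn)

# Inner joins keep the row because the sentinel always matches.
row = conn.execute(
    "SELECT c.name FROM orders o JOIN customer c ON c.id = o.customer_id WHERE o.id = 1"
).fetchone()
```

The design point is that an inner join on customer_id no longer drops rows silently: the unknown case is a real row that shows up in every report.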
Replace NULL with explicit semantics. Use -999 for "offline," -9999 for "out of range," or, better, split the column into value and value_metadata_flag.
3. The Referential Integrity Illusion
Modern data lakes love "schema on read." This is the enemy of ab initio. You are essentially saying, “Let’s store the garbage, and we’ll figure out what kind of garbage it is later.”
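One way to picture the alternative to schema on read is to validate each record at write time and refuse anything that does not match a declared shape. The sketch below is an assumption-laden illustration in plain Python: the Reading record, the ReadingFlag values, and parse_reading are hypothetical names, and the value / value_metadata_flag pair follows the column split suggested earlier.

```python
from dataclasses import dataclass
from enum import Enum


class ReadingFlag(Enum):
    """Hypothetical explicit states that would otherwise hide behind NULL."""
    OK = "ok"
    OFFLINE = "offline"
    OUT_OF_RANGE = "out_of_range"


@dataclass(frozen=True)
class Reading:
    sensor_id: str
    value: float
    value_metadata_flag: ReadingFlag


EXPECTED_FIELDS = {"sensor_id", "value", "value_metadata_flag"}


def parse_reading(raw: dict) -> Reading:
    """Validate a raw record at write time; raise ValueError instead of storing it."""
    if set(raw) != EXPECTED_FIELDS:
        raise ValueError(f"expected fields {sorted(EXPECTED_FIELDS)}, got {sorted(raw)}")
    sensor_id = raw["sensor_id"]
    if not isinstance(sensor_id, str) or not sensor_id:
        raise ValueError("sensor_id must be a non-empty string")
    value = raw["value"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise ValueError("value must be a number, never None")
    try:
        flag = ReadingFlag(raw["value_metadata_flag"])
    except ValueError:
        raise ValueError(
            f"unknown value_metadata_flag: {raw['value_metadata_flag']!r}"
        ) from None
    return Reading(sensor_id, float(value), flag)


# Usage: a well-formed record passes; a record with a missing value is rejected.
good = parse_reading({"sensor_id": "s1", "value": 21.5, "value_metadata_flag": "ok"})
try:
    parse_reading({"sensor_id": "s1", "value": None, "value_metadata_flag": "offline"})
    rejected = False
except ValueError:
    rejected = True
```

Rejected records can still be routed to a quarantine area for inspection; the point of the sketch is only that they never enter the trusted table unclassified.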
Stop polishing bad data. Start building it right from first principles.
Change is allowed. Silent change is not. Your first principle is: schema version is part of the data identifier. events_v2.parquet is a different entity than events_v1.parquet. Never mutate; deprecate.
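The "never mutate; deprecate" rule can be sketched as a tiny in-memory registry in which the schema version is part of the key. DatasetId and SchemaRegistry are hypothetical names invented for this illustration, not an existing library; a real catalog would persist this state.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DatasetId:
    """Identifier that includes the schema version, so v1 and v2 are distinct entities."""
    name: str
    schema_version: int

    @property
    def filename(self) -> str:
        return f"{self.name}_v{self.schema_version}.parquet"


@dataclass
class SchemaRegistry:
    """Hypothetical registry: schemas are write-once and can only be deprecated."""
    _schemas: dict = field(default_factory=dict)
    _deprecated: set = field(default_factory=set)

    def register(self, dataset: DatasetId, columns: dict) -> None:
        if dataset in self._schemas:
            # Refuse silent change: an existing version is immutable.
            raise ValueError(
                f"{dataset.filename} is already registered; publish a new version instead"
            )
        self._schemas[dataset] = dict(columns)

    def deprecate(self, dataset: DatasetId) -> None:
        if dataset not in self._schemas:
            raise KeyError(dataset.filename)
        self._deprecated.add(dataset)

    def is_deprecated(self, dataset: DatasetId) -> bool:
        return dataset in self._deprecated

    def columns(self, dataset: DatasetId) -> dict:
        return dict(self._schemas[dataset])  # return a copy so callers cannot mutate


# Usage: adding a column means publishing events_v2 and deprecating events_v1.
registry = SchemaRegistry()
events_v1 = DatasetId("events", 1)
events_v2 = DatasetId("events", 2)
registry.register(events_v1, {"event_id": "int64", "ts": "timestamp"})
registry.register(events_v2, {"event_id": "int64", "ts": "timestamp", "source": "string"})
registry.deprecate(events_v1)
```

Because the old version stays readable after deprecation, downstream consumers can migrate on their own schedule instead of discovering the change through a broken job.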
Most data teams focus on reactive data quality (DQ). They let data in, then scramble to fix it. But what if we borrowed a concept from theoretical chemistry and quantum physics? What if we focused on ab initio data quality?