Backfilling a data table with rows of missing dates

technotes
3 min readApr 12, 2021

Now imagine you have a table with state with testing data, some states have most recent data and some states don’t how do you back fill missing rows for states with missing dates?

Let’s say your table name is

myrandomwatch-b4b41.official.states-testing

Your data is as such

State,   date,       totalTestResults
CA 2021-03-22 20000
CA 2021-03-10 10000
AK 2021-03-24 10000
AK 2020-12-31 900

As you can see from above, for CA, data for 2021–03–23 and and 2021–03–24 are missing, and AK is missing a hell lot more data. Some of the past data is missing, probably due to collection errors. Some new data probably hasn’t arrived yet.

First step find all legit days

select 
allstates.state as state , d.date
from
(select distinct date
from `myrandomwatch-b4b41.official.states-testing`
) as d,
(select distinct state
from `myrandomwatch-b4b41.official.states-testing`
) as allstates

second step find missing dates by performing a left outer join with all data and filter in the end by a null row

select 
allstates.state as state , d.date as missingdate
from
(select distinct date
from `myrandomwatch-b4b41.official.states-testing`
) as d,
(select distinct state
from `myrandomwatch-b4b41.official.states-testing`
) as allstates
left outer join `myrandomwatch-b4b41.official.states-testing` as data
on d.date = data.date and allstates.state = data.state

where totalTestResults is null

Now that we have missing dates we need to find the appropriate backfill date, which shall be the last date with data.

SELECT e.state as state, missingdate, max(date) as backfilldate
FROM

(
... above
) as e
join `myrandomwatch-b4b41.official.states-testing` as f
on e.state = f.state and e.missingdate > f.date
group by e.state, e.missingdate

finally join the backfill dates with actual data

select * from (
... above
) as backfill
join `myrandomwatch-b4b41.official.states-testing` as dd
on backfill.state = dd.state and backfill.backfilldate = dd.date

In the end, you probably want to union this output with the original table to get the full data set/

SELECT *
FROM (SELECT e.state AS state,
missingdate,
Max(date) AS backfilldate
FROM (SELECT allstates.state AS state,
d.date AS missingdate
FROM (SELECT DISTINCT date
FROM `myrandomwatch-b4b41.official.states-testing`) AS
d,
(SELECT DISTINCT state
FROM `myrandomwatch-b4b41.official.states-testing`) AS
allstates
LEFT OUTER JOIN
`myrandomwatch-b4b41.official.states-testing` AS
data
ON d.date = data.date
AND allstates.state = data.state
WHERE totaltestresults IS NULL) AS e
JOIN `myrandomwatch-b4b41.official.states-testing` AS f
ON e.state = f.state
AND e.missingdate > f.date
GROUP BY e.state,
e.missingdate) AS backfill
JOIN `myrandomwatch-b4b41.official.states-testing` AS dd
ON backfill.state = dd.state
AND backfill.backfilldate = dd.date
ORDER BY dd.state

I’m still a SQL novice. This is notes for what worked for me. Please let me know if you know better ways to do this.

Since BigQuery doesn’t have great support for parametric queries especially with table names, a lightly better way is to generate this query with Mustache

var view = {
table_name: "`myrandomwatch-b4b41.my_dataset.California-Vaccination-Overtime-clean`",
start_date: "2021-02-01",
group_by_field: "fips",
null_test_field: "doses_administered",
date_field: "date",
};

here is the template

select * from
(
SELECT
backfill.{{group_by_field}} as {{group_by_field}},
missingdate as {{date_field}},
dd.* except({{group_by_field}}, {{date_field}}),
FROM
(
SELECT e.{{group_by_field}} AS {{group_by_field}},
missingdate,
Max({{date_field}}) AS backfilldate
FROM (SELECT allstates.{{group_by_field}} AS {{group_by_field}},
d.{{date_field}} AS missingdate
FROM (SELECT {{date_field}} FROM UNNEST(
GENERATE_DATE_ARRAY(DATE('{{start_date}}'), CURRENT_DATE(), INTERVAL 1 DAY) ) AS {{date_field}}
) AS
d,
(SELECT DISTINCT {{group_by_field}}
FROM {{&table_name}}) AS
allstates
LEFT OUTER JOIN
{{&table_name}} AS
data
ON d.{{date_field}} = data.{{date_field}}
AND allstates.{{group_by_field}} = data.{{group_by_field}}
WHERE {{null_test_field}} IS NULL) AS e
JOIN {{&table_name}} AS f ON e.{{group_by_field}} = f.{{group_by_field}} AND e.missingdate > f.{{date_field}}
GROUP BY e.{{group_by_field}},
e.missingdate
) AS backfill
JOIN {{&table_name}} AS dd ON backfill.{{group_by_field}} = dd.{{group_by_field}} AND backfill.backfilldate = dd.{{date_field}}
UNION DISTINCT
(select * from {{&table_name}})
)
ORDER BY {{group_by_field}}, {{date_field}} desc

--

--