Now imagine you have a table with state with testing data, some states have most recent data and some states don’t how do you back fill missing rows for states with missing dates?
Let’s say your table name is
myrandomwatch-b4b41.official.states-testing
Your data is as such
State, date, totalTestResults
CA 2021-03-22 20000
CA 2021-03-10 10000
AK 2021-03-24 10000
AK 2020-12-31 900
As you can see from above, for CA, data for 2021–03–23 and and 2021–03–24 are missing, and AK is missing a hell lot more data. Some of the past data is missing, probably due to collection errors. Some new data probably hasn’t arrived yet.
First step find all legit days
select
allstates.state as state , d.datefrom
(select distinct date
from `myrandomwatch-b4b41.official.states-testing`
) as d,
(select distinct state
from `myrandomwatch-b4b41.official.states-testing`
) as allstates
second step find missing dates by performing a left outer join with all data and filter in the end by a null row
select
allstates.state as state , d.date as missingdate
from
(select distinct date
from `myrandomwatch-b4b41.official.states-testing`
) as d,
(select distinct state
from `myrandomwatch-b4b41.official.states-testing`
) as allstatesleft outer join `myrandomwatch-b4b41.official.states-testing` as data
on d.date = data.date and allstates.state = data.state
where totalTestResults is null
Now that we have missing dates we need to find the appropriate backfill date, which shall be the last date with data.
SELECT e.state as state, missingdate, max(date) as backfilldate
FROM
(
... above
) as e
join `myrandomwatch-b4b41.official.states-testing` as f
on e.state = f.state and e.missingdate > f.date
group by e.state, e.missingdate
finally join the backfill dates with actual data
select * from (
... above
) as backfill
join `myrandomwatch-b4b41.official.states-testing` as dd
on backfill.state = dd.state and backfill.backfilldate = dd.date
In the end, you probably want to union this output with the original table to get the full data set/
SELECT *
FROM (SELECT e.state AS state,
missingdate,
Max(date) AS backfilldate
FROM (SELECT allstates.state AS state,
d.date AS missingdate
FROM (SELECT DISTINCT date
FROM `myrandomwatch-b4b41.official.states-testing`) AS
d,
(SELECT DISTINCT state
FROM `myrandomwatch-b4b41.official.states-testing`) AS
allstates
LEFT OUTER JOIN
`myrandomwatch-b4b41.official.states-testing` AS
data
ON d.date = data.date
AND allstates.state = data.state
WHERE totaltestresults IS NULL) AS e
JOIN `myrandomwatch-b4b41.official.states-testing` AS f
ON e.state = f.state
AND e.missingdate > f.date
GROUP BY e.state,
e.missingdate) AS backfill
JOIN `myrandomwatch-b4b41.official.states-testing` AS dd
ON backfill.state = dd.state
AND backfill.backfilldate = dd.date
ORDER BY dd.state
I’m still a SQL novice. This is notes for what worked for me. Please let me know if you know better ways to do this.
Since BigQuery doesn’t have great support for parametric queries especially with table names, a lightly better way is to generate this query with Mustache
var view = {
table_name: "`myrandomwatch-b4b41.my_dataset.California-Vaccination-Overtime-clean`",
start_date: "2021-02-01",
group_by_field: "fips",
null_test_field: "doses_administered",
date_field: "date",
};
here is the template
select * from
(
SELECT
backfill.{{group_by_field}} as {{group_by_field}},
missingdate as {{date_field}},
dd.* except({{group_by_field}}, {{date_field}}),
FROM
(
SELECT e.{{group_by_field}} AS {{group_by_field}},
missingdate,
Max({{date_field}}) AS backfilldate
FROM (SELECT allstates.{{group_by_field}} AS {{group_by_field}},
d.{{date_field}} AS missingdate
FROM (SELECT {{date_field}} FROM UNNEST(
GENERATE_DATE_ARRAY(DATE('{{start_date}}'), CURRENT_DATE(), INTERVAL 1 DAY) ) AS {{date_field}}
) AS
d,
(SELECT DISTINCT {{group_by_field}}
FROM {{&table_name}}) AS
allstates
LEFT OUTER JOIN
{{&table_name}} AS
data
ON d.{{date_field}} = data.{{date_field}}
AND allstates.{{group_by_field}} = data.{{group_by_field}}
WHERE {{null_test_field}} IS NULL) AS e
JOIN {{&table_name}} AS f ON e.{{group_by_field}} = f.{{group_by_field}} AND e.missingdate > f.{{date_field}}
GROUP BY e.{{group_by_field}},
e.missingdate
) AS backfill
JOIN {{&table_name}} AS dd ON backfill.{{group_by_field}} = dd.{{group_by_field}} AND backfill.backfilldate = dd.{{date_field}}UNION DISTINCT
(select * from {{&table_name}})
)
ORDER BY {{group_by_field}}, {{date_field}} desc