Add a scrap command to backfill Spec11 threats (#897)

This parses through all pre-existing Spec11 files in GCS (starting at
2019-01-01 which is basically when the new format started) and maps them
to the new Spec11ThreatMatch objects.

Because the old format stored domain names only and the new format stores
names + repo IDs, we need to retrieve the DomainBase objects from the
point in time of the scan (failing if they don't exist). Because the
same domains appear multiple times (we estimate a total of 100k+ entries
but only 1-2k unique domains) we cache the DomainBase objects that we
retrieve from Datastore.
This commit is contained in:
gbrodman 2020-12-15 16:18:27 -05:00 committed by GitHub
parent 090eeaa618
commit 1db0c5b018
11 changed files with 4652 additions and 4402 deletions

View file

@ -2251,30 +2251,6 @@ ALTER TABLE ONLY public."RegistrarPoc"
ADD CONSTRAINT fk_registrar_poc_registrar_id FOREIGN KEY (registrar_id) REFERENCES public."Registrar"(registrar_id);
--
-- Name: Spec11ThreatMatch fk_spec11_threat_match_domain_repo_id; Type: FK CONSTRAINT; Schema: public; Owner: -
--
ALTER TABLE ONLY public."Spec11ThreatMatch"
ADD CONSTRAINT fk_spec11_threat_match_domain_repo_id FOREIGN KEY (domain_repo_id) REFERENCES public."Domain"(repo_id);
--
-- Name: Spec11ThreatMatch fk_spec11_threat_match_registrar_id; Type: FK CONSTRAINT; Schema: public; Owner: -
--
ALTER TABLE ONLY public."Spec11ThreatMatch"
ADD CONSTRAINT fk_spec11_threat_match_registrar_id FOREIGN KEY (registrar_id) REFERENCES public."Registrar"(registrar_id);
--
-- Name: Spec11ThreatMatch fk_spec11_threat_match_tld; Type: FK CONSTRAINT; Schema: public; Owner: -
--
ALTER TABLE ONLY public."Spec11ThreatMatch"
ADD CONSTRAINT fk_spec11_threat_match_tld FOREIGN KEY (tld) REFERENCES public."Tld"(tld_name);
--
-- Name: DomainHistoryHost fka9woh3hu8gx5x0vly6bai327n; Type: FK CONSTRAINT; Schema: public; Owner: -
--