Merge pull request #534 from cisagov/nmb/migration

Neil MartinsenBurrell 2023-04-18 15:44:58 -05:00 committed by GitHub
commit 8a22c6f9a5
GPG key ID: 4AEE18F83AFDEB23
3 changed files with 224 additions and 0 deletions

@@ -0,0 +1,79 @@
# Registrar Data Migration
Verisign operates the existing registrar/registry and will provide us with an
export of the data from that system. The goal of our data migration is to use
the provided data to create a state in our registrar that _matches_ the
Verisign system as closely as possible.

We cannot make our registrar _identical_ to the Verisign system because we
have a different data model and workflow model. Instead, we should focus our
migration efforts on creating a state in our new registrar that lets users of
the system perform the tasks that they want to do.
## Users
One of the major differences with the existing registrar/registry is that our
system uses Login.gov for authentication. Any person with an identity-verified
Login.gov account can make an account on the new registrar, and the first time
that person logs in through Login.gov, we make a corresponding account in our
user table. Because we cannot know in advance the universally unique
identifier (UUID) of a person's Login.gov account, we cannot pre-create user
accounts for individuals in our new registrar based on the data from Verisign.
## Domains
Our registrar keeps track of domains. The authoritative source for domain
information is the registry, but the registrar needs a copy of that
information to make connections between registry users and the domains that
they manage. The registrar stores few fields about a domain beyond its name,
so it should be straightforward to import the exported list of domains from
Verisign's `escrow_domains.daily.dotgov.GOV.txt`. That file does not appear to
include an active/inactive flag, so every domain in the file can be imported
into our system as `is_active=True`.
An example Django management command that can load the delimited text file
from the daily escrow is in
`src/registrar/management/commands/load_domains_data.py`. It uses Django's
object-relational mapper (ORM) to create Django objects for the domains and
then write them to the database in a single bulk operation. To run the command
locally for testing, using Docker Compose:
```shell
docker compose run -T app ./manage.py load_domains_data < /tmp/escrow_domains.daily.dotgov.GOV.txt
```
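
To check the parsing step without a database, the core of the idea can be
sketched in plain Python. This is a minimal sketch and not the code in
`load_domains_data.py`: the `unique_domain_names` helper and the sample rows
are hypothetical, and it assumes only that the domain name is the first
pipe-delimited field of each row.

```python
import csv
import io


def unique_domain_names(file_object, delimiter="|"):
    """Return lowercased domain names from the first field, without duplicates.

    Hypothetical helper for illustration; names keep their order of first
    appearance in the file.
    """
    seen = set()
    names = []
    for row in csv.reader(file_object, delimiter=delimiter):
        if not row:
            # csv.reader yields an empty list for a blank line
            continue
        name = row[0].strip().lower()
        if name and name not in seen:
            seen.add(name)
            names.append(name)
    return names


# Made-up sample rows; real escrow rows carry more fields than this.
sample = io.StringIO("EXAMPLE.GOV|roid-1\nexample.gov|roid-2\nCity.Gov|roid-3\n")
print(unique_domain_names(sample))
```

Removing duplicates before building model objects matters if the file can
repeat a name, because a repeated name in one bulk insert would conflict with
a uniqueness constraint on active domains.
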
## User access to domains
The Verisign data contains an `escrow_domain_contacts.daily.dotgov.GOV.txt`
file that links each domain to three different types of contacts: `billing`,
`tech`, and `admin`. The ID of the contact in this linking table corresponds
to the ID of a contact in the `escrow_contacts.daily.dotgov.GOV.txt` file. The
contacts file contains an email address for each contact.
The new registrar associates user accounts (authenticated with Login.gov) with
domains using a `UserDomainRole` linking table. New users can be granted roles
on domains by creating a `DomainInvitation` that links an email address with a
domain. When a new user finishes authenticating with Login.gov and their email
address matches an invitation, then they are given the appropriate role on the
invitation's domain.
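
The matching step described above can be sketched with plain Python objects.
This is a minimal sketch under stated assumptions, not the registrar's
implementation: the `Invitation` and `User` classes, the `redeem_invitations`
function, and the `"invited"`/`"retrieved"` status strings are all
hypothetical stand-ins for the Django models, and the case-insensitive email
comparison is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Invitation:
    """Hypothetical stand-in for a DomainInvitation row."""

    email: str
    domain: str
    status: str = "invited"


@dataclass
class User:
    """Hypothetical stand-in for a user plus their domain roles."""

    email: str
    domains: set = field(default_factory=set)


def redeem_invitations(user, invitations):
    """Give the user access to each domain with a pending invitation for their email."""
    granted = []
    for invitation in invitations:
        pending = invitation.status == "invited"
        if pending and invitation.email.lower() == user.email.lower():
            user.domains.add(invitation.domain)
            invitation.status = "retrieved"  # an invitation is used only once
            granted.append(invitation.domain)
    return granted


invitations = [
    Invitation("Admin@Example.gov", "example.gov"),
    Invitation("other@example.gov", "city.gov"),
]
user = User("admin@example.gov")
print(redeem_invitations(user, invitations))
```
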
For the purposes of migration, we can prime the invitation system by creating
an invitation for each contact email address that the
`escrow_domain_contacts` file links to a domain. This means that if a person
is currently a user in the Verisign system and uses the same email address
with Login.gov, they will end up with access to the same domains in the new
registrar that they were associated with in the Verisign system.
A management command that does this needs to process two data files, one for
the contact information and one for the domain/contact association, so we
can't use stdin the way that we did before. Instead, we can use the fact that
Docker Compose mounts the `src/` directory inside of the container at `/app`.
Then, data files that are inside of the `src/` directory can be accessed
inside the Docker container.
An example script using this technique is in
`src/registrar/management/commands/load_domain_invitations.py`.
```shell
docker compose run app ./manage.py load_domain_invitations /app/escrow_domain_contacts.daily.dotgov.GOV.txt /app/escrow_contacts.daily.dotgov.GOV.txt
```
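
The two-file join can also be tried without Django or Docker. The following
is a minimal sketch, not the management command itself: the
`invitation_pairs` function and the sample rows are hypothetical. It assumes
the column layout that the command relies on, where the domain-contacts file
has the domain and contact ID in its first two fields and the contacts file
has the contact ID first and the email address seventh.

```python
import csv
import io


def invitation_pairs(domain_contacts_file, contacts_file, delimiter="|", email_index=6):
    """Return (email, domain) pairs for contacts that are linked to a domain."""
    # First pass: map each contact ID to the domains it is attached to.
    contact_domains = {}
    for row in csv.reader(domain_contacts_file, delimiter=delimiter):
        contact_domains.setdefault(row[1], []).append(row[0].lower())

    # Second pass: look up the email address for each linked contact.
    pairs = []
    for row in csv.reader(contacts_file, delimiter=delimiter):
        for domain_name in contact_domains.get(row[0], []):
            pairs.append((row[email_index].lower(), domain_name))
    return pairs


# Made-up sample data in the assumed layout.
domain_contacts = io.StringIO("EXAMPLE.GOV|c1|admin\nCITY.GOV|c1|tech\n")
contacts = io.StringIO(
    "c1|a|b|c|d|e|Admin@Example.gov\n"
    "c3|a|b|c|d|e|nobody@example.gov\n"
)
print(invitation_pairs(domain_contacts, contacts))
```

A contact that holds more than one role on the same domain yields the same
pair more than once in this sketch, so a real load may want to remove
duplicates before creating invitations.
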

@@ -0,0 +1,76 @@
"""Load domain invitations for existing domains and their contacts."""

import csv
import logging

from collections import defaultdict

from django.core.management import BaseCommand

from registrar.models import Domain, DomainInvitation

logger = logging.getLogger(__name__)


class Command(BaseCommand):
    help = "Load invitations for existing domains and their users."

    def add_arguments(self, parser):
        """Add our two filename arguments."""
        parser.add_argument(
            "domain_contacts_filename",
            help="Data file with domain contact information",
        )
        parser.add_argument(
            "contacts_filename", help="Data file with contact information"
        )
        parser.add_argument("--sep", default="|", help="Delimiter character")

    def handle(self, domain_contacts_filename, contacts_filename, **options):
        """Load the data files and create the DomainInvitations."""
        sep = options.get("sep")

        # We open the domain file first and hold it in memory.
        # There are three contacts per domain, so there should be at
        # most 3*N different contacts here.
        contact_domains = defaultdict(list)  # each contact has a list of domains
        logger.info("Reading domain-contacts data file %s", domain_contacts_filename)
        with open(domain_contacts_filename, "r") as domain_file:
            for row in csv.reader(domain_file, delimiter=sep):
                # fields are just domain, userid, role
                # lowercase the domain names now
                contact_domains[row[1]].append(row[0].lower())
        logger.info("Loaded domains for %d contacts", len(contact_domains))

        # now we have a mapping of user IDs to lists of domains for that user
        # iterate over the contacts list and for contacts in our mapping,
        # create the domain invitations for their email address
        logger.info("Reading contacts data file %s", contacts_filename)
        to_create = []
        skipped = 0
        with open(contacts_filename, "r") as contacts_file:
            for row in csv.reader(contacts_file, delimiter=sep):
                # userid is in the first field, email is the seventh
                userid = row[0]
                if userid not in contact_domains:
                    # this user has no domains, skip them
                    skipped += 1
                    continue
                for domain_name in contact_domains[userid]:
                    email_address = row[6]
                    domain = Domain.objects.get(name=domain_name)
                    to_create.append(
                        DomainInvitation(
                            email=email_address.lower(),
                            domain=domain,
                            status=DomainInvitation.INVITED,
                        )
                    )
        logger.info("Creating %d invitations", len(to_create))
        DomainInvitation.objects.bulk_create(to_create)
        logger.info(
            "Created %d domain invitations, ignored %d contacts",
            len(to_create),
            skipped,
        )

@@ -0,0 +1,69 @@
"""Load domains from registry export."""

import csv
import logging
import sys

from django.core.management.base import BaseCommand

from registrar.models import Domain

logger = logging.getLogger(__name__)


def _domain_dict_reader(file_object, **kwargs):
    """A csv DictReader with the correct field names for escrow_domains data.

    All keyword arguments are sent on to the DictReader function call.
    """
    # field names are from escrow_manifests without "f"
    return csv.DictReader(
        file_object,
        fieldnames=[
            "Name",
            "Roid",
            "IdnTableId",
            "Registrant",
            "ClID",
            "CrRr",
            "CrID",
            "CrDate",
            "UpRr",
            "UpID",
            "UpDate",
            "ExDate",
            "TrDate",
        ],
        **kwargs,
    )


class Command(BaseCommand):
    help = "Load domain data from a delimited text file on stdin."

    def add_arguments(self, parser):
        parser.add_argument(
            "--sep", default="|", help="Separator character for data file"
        )

    def handle(self, *args, **options):
        separator_character = options.get("sep")
        reader = _domain_dict_reader(sys.stdin, delimiter=separator_character)
        # accumulate model objects so we can `bulk_create` them all at once.
        domains = []
        for row in reader:
            name = row["Name"].lower()  # we typically use lowercase domains

            # Ensure that there is a `Domain` object for each domain name in
            # this file and that it is active. There is a uniqueness
            # constraint for active Domain objects, so we are going to account
            # for that here with this check so that our later bulk_create
            # should succeed
            if Domain.objects.filter(name=name, is_active=True).exists():
                # don't do anything, this domain is here and active
                continue
            else:
                domains.append(Domain(name=name, is_active=True))
        logger.info("Creating %d new domains", len(domains))
        Domain.objects.bulk_create(domains)