Merge branch 'main' into nl/926-transistion-domain-object-creation

CuriousX 2023-10-15 20:26:19 -06:00 committed by GitHub
commit 657af74c73
3 changed files with 587 additions and 11 deletions


@@ -1,11 +1,12 @@
# Registrar Data Migration
There is an existing registrar/registry at Verisign. They will provide us with an
export of the data from that system. The goal of our data migration is to take
the provided data and use it to create as much as possible a _matching_ state
There is an existing registrar/registry system whose data we will import.
The operator of that system will provide us with an export of the data.
The goal of our data migration is to take the provided data and use
it to create as much as possible a _matching_ state
in our registrar.
There is no way to make our registrar _identical_ to the Verisign system
There is no way to make our registrar _identical_ to the original system
because we have a different data model and workflow model. Instead, we should
focus our migration efforts on creating a state in our new registrar that will
primarily allow users of the system to perform the tasks that they want to do.
@@ -18,7 +19,7 @@ Login.gov account can make an account on the new registrar, and the first time
that person logs in through Login.gov, we make a corresponding account in our
user table. Because we cannot know the Universal Unique ID (UUID) for a
person's Login.gov account, we cannot pre-create user accounts for individuals
in our new registrar based on the data from Verisign.
in our new registrar based on the original data.
## Domains
@@ -27,7 +28,7 @@ information is the registry, but the registrar needs a copy of that
information to make connections between registry users and the domains that
they manage. The registrar stores very few fields about a domain except for
its name, so it could be straightforward to import the exported list of domains
from Verisign's `escrow_domains.daily.dotgov.GOV.txt`. It doesn't appear that
from `escrow_domains.daily.dotgov.GOV.txt`. It doesn't appear that
that table stores a flag for active or inactive.
An example Django management command that can load the delimited text file
@@ -43,7 +44,7 @@ docker compose run -T app ./manage.py load_domains_data < /tmp/escrow_domains.da
## User access to domains
The Verisign data contains a `escrow_domain_contacts.daily.dotgov.txt` file
The data export contains a `escrow_domain_contacts.daily.dotgov.txt` file
that links each domain to three different types of contacts: `billing`,
`tech`, and `admin`. The ID of the contact in this linking table corresponds
to the ID of a contact in the `escrow_contacts.daily.dotgov.txt` file. In the
@@ -59,9 +60,9 @@ invitation's domain.
For the purposes of migration, we can prime the invitation system by creating
an invitation in the system for each email address listed in the
`domain_contacts` file. This means that if a person is currently a user in the
Verisign system, and they use the same email address with Login.gov, then they
original system, and they use the same email address with Login.gov, then they
will end up with access to the same domains in the new registrar that they
were associated with in the Verisign system.
were associated with in the original system.
A management command that does this needs to process two data files, one for
the contact information and one for the domain/contact association, so we
@@ -76,3 +77,56 @@ An example script using this technique is in
```shell
docker compose run app ./manage.py load_domain_invitations /app/escrow_domain_contacts.daily.dotgov.GOV.txt /app/escrow_contacts.daily.dotgov.GOV.txt
```
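To make the pairing concrete, here is a minimal sketch of the join the command has to perform, assuming pipe-delimited files with the domain name and contact ID in the first two columns of the domain-contacts file and the email in the seventh column of the contacts file (the column positions mirror how the transition-domain loader in this commit reads the same files; this is an illustration, not the actual `load_domain_invitations` implementation):

```python
import csv


def build_invitation_pairs(domain_contacts_path, contacts_path, sep="|"):
    """Yield a (domain_name, email) pair for every contact linked to a domain."""
    # First pass: map contact ID -> email address
    emails = {}
    with open(contacts_path) as contacts_file:
        for row in csv.reader(contacts_file, delimiter=sep):
            emails[row[0]] = row[6]

    # Second pass: resolve each domain/contact link to an email address
    with open(domain_contacts_path) as link_file:
        for row in csv.reader(link_file, delimiter=sep):
            domain_name, contact_id = row[0].lower(), row[1]
            email = emails.get(contact_id, "")
            if email:
                # the real command creates a DomainInvitation for each pair
                yield domain_name, email
```

Each resulting pair backs a `DomainInvitation`, which is what grants that email address access to the domain the first time its owner logs in through Login.gov.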
## Transition Domains
We are provided with information about Transition Domains in 3 files:
FILE 1: **escrow_domain_contacts.daily.gov.GOV.txt** -> maps domain names to contact IDs. Domains in this file usually have 3 contacts each.
FILE 2: **escrow_contacts.daily.gov.GOV.txt** -> maps contact IDs to contact email addresses (which is what we care about for sending domain invitations).
FILE 3: **escrow_domain_statuses.daily.gov.GOV.txt** -> maps domains to their statuses.
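For orientation, the rows below are hypothetical examples (not real data) showing the shape the loading script expects: pipe-delimited, with the domain name and contact ID in the first two columns of FILE 1, the email in the seventh field of FILE 2, and the domain and status in FILE 3.

```
# FILE 1: domain | contact ID | role (one row per contact)
example.gov|1001|admin
# FILE 2: contact ID in the first field, email in the seventh
1001|...|...|...|...|...|jane.doe@example.gov
# FILE 3: domain | status
example.gov|ok
```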
Transferring the data from these files into our domain tables happens in two steps:
***IMPORTANT: only run the following locally, to avoid publicizing PII in our public repo.***
### STEP 1: Load Transition Domain data into TransitionDomain table
**SETUP**
In order to use the management command, we need to add the files to a folder under `src/`.
This will allow Docker to mount the files to a container (under `/app`) for our use.
- Create a folder called `tmp` underneath `src/`
- Add the above files to this folder
- Open a terminal and navigate to `src/`
Then run the following command (this will parse the three files in your `tmp` folder and load the information into the TransitionDomain table):
```shell
docker compose run -T app ./manage.py load_transition_domain /app/tmp/escrow_domain_contacts.daily.gov.GOV.txt /app/tmp/escrow_contacts.daily.gov.GOV.txt /app/tmp/escrow_domain_statuses.daily.gov.GOV.txt
```
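If the load completes, the command prints a summary like the following (counts will vary and terminal colors are omitted here), followed by warnings about any duplicate entries, missing emails, or unmapped statuses it encountered:

```
============= FINISHED ===============
Created <number> transition domain entries,
updated <number> transition domain entries
```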
**OPTIONAL COMMAND LINE ARGUMENTS**:
`--debug`
This will print out additional, detailed logs.
`--limitParse 100`
Directs the script to load only the first 100 entries into the table. You can adjust this number as needed for testing purposes.
`--resetTable`
This will delete all the data loaded into the transition_domain table. It is helpful if you want to reload the entries from scratch or clear out test data.
### STEP 2: Transfer Transition Domain data into main Domain tables
Now that we've loaded all the data into TransitionDomain, we need to update the main Domain and DomainInvitation tables with this information.
In the same terminal used in STEP 1, run the command below
(this will parse the data in TransitionDomain and either create a corresponding Domain object or, if a corresponding Domain already exists, update that Domain with the incoming status; it will also create DomainInvitation objects for each user associated with the domain):
```shell
docker compose run -T app ./manage.py transfer_transition_domains_to_domains
```
**OPTIONAL COMMAND LINE ARGUMENTS**:
`--debug`
This will print out additional, detailed logs.
`--limitParse 100`
Directs the script to load only the first 100 entries into the table. You can adjust this number as needed for testing purposes.
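As a rough sketch of what this transfer does, the loop below creates or fetches a `Domain` for every `TransitionDomain` and ensures a matching `DomainInvitation` exists. It is not the actual `transfer_transition_domains_to_domains` implementation, and the field names used on `Domain` (`name`) and `DomainInvitation` (`email`, `domain`) are assumptions based on the description above:

```python
from registrar.models import Domain, DomainInvitation, TransitionDomain


def transfer_transition_domains():
    """Mirror each TransitionDomain into the main Domain/DomainInvitation tables."""
    for transition in TransitionDomain.objects.all():
        # Create the Domain if it is new; if it already exists, the real
        # command updates it with the incoming status instead.
        domain, _ = Domain.objects.get_or_create(name=transition.domain_name)

        # One invitation per (email, domain) pair, skipping blank emails.
        if transition.username:
            DomainInvitation.objects.get_or_create(
                email=transition.username, domain=domain
            )
```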


@@ -0,0 +1,524 @@
import sys
import csv
import logging
import argparse
from collections import defaultdict
from django.core.management import BaseCommand
from registrar.models import TransitionDomain
logger = logging.getLogger(__name__)
class termColors:
"""Colors for terminal outputs
(makes reading the logs WAY easier)"""
HEADER = "\033[95m"
OKBLUE = "\033[94m"
OKCYAN = "\033[96m"
OKGREEN = "\033[92m"
YELLOW = "\033[93m"
FAIL = "\033[91m"
ENDC = "\033[0m"
BOLD = "\033[1m"
UNDERLINE = "\033[4m"
BackgroundLightYellow = "\033[103m"
def query_yes_no(question: str, default="yes") -> bool:
"""Ask a yes/no question via raw_input() and return their answer.
"question" is a string that is presented to the user.
"default" is the presumed answer if the user just hits <Enter>.
It must be "yes" (the default), "no" or None (meaning
an answer is required of the user).
The "answer" return value is True for "yes" or False for "no".
"""
valid = {"yes": True, "y": True, "ye": True, "no": False, "n": False}
if default is None:
prompt = " [y/n] "
elif default == "yes":
prompt = " [Y/n] "
elif default == "no":
prompt = " [y/N] "
else:
raise ValueError("invalid default answer: '%s'" % default)
while True:
logger.info(question + prompt)
choice = input().lower()
if default is not None and choice == "":
return valid[default]
elif choice in valid:
return valid[choice]
else:
logger.info("Please respond with 'yes' or 'no' " "(or 'y' or 'n').\n")
class Command(BaseCommand):
help = """Loads data for domains that are in transition
(populates transition_domain model objects)."""
def add_arguments(self, parser):
"""Add our three filename arguments (in order: domain contacts,
contacts, and domain statuses)
OPTIONAL ARGUMENTS:
--sep
The default delimiter is set to "|", but may be changed using --sep
--debug
        A boolean (off by default), which activates additional print statements
--limitParse
Used to set a limit for the number of data entries to insert. Set to 0
(or just don't use this argument) to parse every entry.
--resetTable
Use this to trigger a prompt for deleting all table entries. Useful
for testing purposes, but USE WITH CAUTION
"""
parser.add_argument(
"domain_contacts_filename", help="Data file with domain contact information"
)
parser.add_argument(
"contacts_filename",
help="Data file with contact information",
)
parser.add_argument(
"domain_statuses_filename", help="Data file with domain status information"
)
parser.add_argument("--sep", default="|", help="Delimiter character")
parser.add_argument("--debug", action=argparse.BooleanOptionalAction)
parser.add_argument(
"--limitParse", default=0, help="Sets max number of entries to load"
)
parser.add_argument(
"--resetTable",
help="Deletes all data in the TransitionDomain table",
action=argparse.BooleanOptionalAction,
)
def print_debug_mode_statements(
self, debug_on: bool, debug_max_entries_to_parse: int
):
"""Prints additional terminal statements to indicate if --debug
or --limitParse are in use"""
if debug_on:
logger.info(
f"""{termColors.OKCYAN}
----------DEBUG MODE ON----------
Detailed print statements activated.
{termColors.ENDC}
"""
)
if debug_max_entries_to_parse > 0:
logger.info(
f"""{termColors.OKCYAN}
----------LIMITER ON----------
Parsing of entries will be limited to
{debug_max_entries_to_parse} lines per file.")
Detailed print statements activated.
{termColors.ENDC}
"""
)
def get_domain_user_dict(
self, domain_statuses_filename: str, sep: str
) -> defaultdict[str, str]:
"""Creates a mapping of domain name -> status"""
domain_status_dictionary = defaultdict(str)
logger.info("Reading domain statuses data file %s", domain_statuses_filename)
with open(domain_statuses_filename, "r") as domain_statuses_file: # noqa
for row in csv.reader(domain_statuses_file, delimiter=sep):
domainName = row[0].lower()
domainStatus = row[1].lower()
domain_status_dictionary[domainName] = domainStatus
logger.info("Loaded statuses for %d domains", len(domain_status_dictionary))
return domain_status_dictionary
def get_user_emails_dict(
self, contacts_filename: str, sep
) -> defaultdict[str, str]:
"""Creates mapping of userId -> emails"""
user_emails_dictionary = defaultdict(str)
logger.info("Reading domain-contacts data file %s", contacts_filename)
with open(contacts_filename, "r") as contacts_file:
for row in csv.reader(contacts_file, delimiter=sep):
user_id = row[0]
user_email = row[6]
user_emails_dictionary[user_id] = user_email
logger.info("Loaded emails for %d users", len(user_emails_dictionary))
return user_emails_dictionary
def get_mapped_status(self, status_to_map: str):
"""
Given a verisign domain status, return a corresponding
status defined for our domains.
        We map statuses as follows:
        "serverHold" fields will map to hold, clientHold to hold,
        and any ok state should map to Ready.
"""
status_maps = {
"hold": TransitionDomain.StatusChoices.ON_HOLD,
"serverhold": TransitionDomain.StatusChoices.ON_HOLD,
"clienthold": TransitionDomain.StatusChoices.ON_HOLD,
"created": TransitionDomain.StatusChoices.READY,
"ok": TransitionDomain.StatusChoices.READY,
}
mapped_status = status_maps.get(status_to_map)
return mapped_status
def print_summary_duplications(
self,
duplicate_domain_user_combos: list[TransitionDomain],
        duplicate_domains: list[str],
users_without_email: list[str],
):
"""Called at the end of the script execution to print out a summary of
data anomalies in the imported Verisign data. Currently, we check for:
- duplicate domains
- duplicate domain - user pairs
- any users without e-mails (this would likely only happen if the contacts
file is missing a user found in the domain_contacts file)
"""
total_duplicate_pairs = len(duplicate_domain_user_combos)
total_duplicate_domains = len(duplicate_domains)
total_users_without_email = len(users_without_email)
if total_users_without_email > 0:
users_without_email_as_string = "{}".format(
", ".join(map(str, duplicate_domain_user_combos))
)
logger.warning(
f"{termColors.YELLOW} No e-mails found for users: {users_without_email_as_string}" # noqa
)
if total_duplicate_pairs > 0 or total_duplicate_domains > 0:
duplicate_pairs_as_string = "{}".format(
", ".join(map(str, duplicate_domain_user_combos))
)
duplicate_domains_as_string = "{}".format(
", ".join(map(str, duplicate_domains))
)
logger.warning(
f"""{termColors.YELLOW}
----DUPLICATES FOUND-----
{total_duplicate_pairs} DOMAIN - USER pairs
were NOT unique in the supplied data files;
{duplicate_pairs_as_string}
{total_duplicate_domains} DOMAINS were NOT unique in
the supplied data files;
{duplicate_domains_as_string}
{termColors.ENDC}"""
)
def print_summary_status_findings(
self, domains_without_status: list[str], outlier_statuses: list[str]
):
"""Called at the end of the script execution to print out a summary of
        status anomalies in the imported Verisign data. Currently, we check for:
- domains without a status
- any statuses not accounted for in our status mappings (see
get_mapped_status() function)
"""
total_domains_without_status = len(domains_without_status)
total_outlier_statuses = len(outlier_statuses)
if total_domains_without_status > 0:
domains_without_status_as_string = "{}".format(
", ".join(map(str, domains_without_status))
)
logger.warning(
f"""{termColors.YELLOW}
--------------------------------------------
Found {total_domains_without_status} domains
without a status (defaulted to READY)
---------------------------------------------
{domains_without_status_as_string}
{termColors.ENDC}"""
)
if total_outlier_statuses > 0:
domains_without_status_as_string = "{}".format(
", ".join(map(str, outlier_statuses))
) # noqa
logger.warning(
f"""{termColors.YELLOW}
--------------------------------------------
Found {total_outlier_statuses} unaccounted
for statuses-
--------------------------------------------
No mappings found for the following statuses
(defaulted to Ready):
                {outlier_statuses_as_string}
{termColors.ENDC}"""
)
def print_debug(self, print_condition: bool, print_statement: str):
"""This function reduces complexity of debug statements
in other functions.
It uses the logger to write the given print_statement to the
terminal if print_condition is TRUE"""
# DEBUG:
if print_condition:
logger.info(print_statement)
def prompt_table_reset(self):
"""Brings up a prompt in the terminal asking
if the user wishes to delete data in the
TransitionDomain table. If the user confirms,
deletes all the data in the TransitionDomain table"""
confirm_reset = query_yes_no(
f"""
{termColors.FAIL}
WARNING: Resetting the table will permanently delete all
the data!
Are you sure you want to continue?{termColors.ENDC}"""
)
if confirm_reset:
logger.info(
f"""{termColors.YELLOW}
----------Clearing Table Data----------
(please wait)
{termColors.ENDC}"""
)
TransitionDomain.objects.all().delete()
def handle( # noqa: C901
self,
domain_contacts_filename,
contacts_filename,
domain_statuses_filename,
**options,
):
"""Parse the data files and create TransitionDomains."""
sep = options.get("sep")
# If --resetTable was used, prompt user to confirm
# deletion of table data
if options.get("resetTable"):
self.prompt_table_reset()
# Get --debug argument
debug_on = options.get("debug")
# Get --LimitParse argument
debug_max_entries_to_parse = int(
options.get("limitParse")
) # set to 0 to parse all entries
# print message to terminal about which args are in use
self.print_debug_mode_statements(debug_on, debug_max_entries_to_parse)
# STEP 1:
# Create mapping of domain name -> status
domain_status_dictionary = self.get_domain_user_dict(
domain_statuses_filename, sep
)
# STEP 2:
# Create mapping of userId -> email
user_emails_dictionary = self.get_user_emails_dict(contacts_filename, sep)
# STEP 3:
# Parse the domain_contacts file and create TransitionDomain objects,
# using the dictionaries from steps 1 & 2 to lookup needed information.
to_create = []
# keep track of statuses that don't match our available
# status values
outlier_statuses = []
# keep track of domains that have no known status
domains_without_status = []
# keep track of users that have no e-mails
users_without_email = []
# keep track of duplications..
duplicate_domains = []
duplicate_domain_user_combos = []
# keep track of domains we ADD or UPDATE
total_updated_domain_entries = 0
total_new_entries = 0
# if we are limiting our parse (for testing purposes, keep
# track of total rows parsed)
total_rows_parsed = 0
# Start parsing the main file and create TransitionDomain objects
logger.info("Reading domain-contacts data file %s", domain_contacts_filename)
with open(domain_contacts_filename, "r") as domain_contacts_file:
for row in csv.reader(domain_contacts_file, delimiter=sep):
total_rows_parsed += 1
# fields are just domain, userid, role
# lowercase the domain names
new_entry_domain_name = row[0].lower()
user_id = row[1]
new_entry_status = TransitionDomain.StatusChoices.READY
new_entry_email = ""
new_entry_emailSent = False # set to False by default
# PART 1: Get the status
if new_entry_domain_name not in domain_status_dictionary:
# This domain has no status...default to "Create"
# (For data analysis purposes, add domain name
# to list of all domains without status
# (avoid duplicate entries))
if new_entry_domain_name not in domains_without_status:
domains_without_status.append(new_entry_domain_name)
else:
# Map the status
original_status = domain_status_dictionary[new_entry_domain_name]
mapped_status = self.get_mapped_status(original_status)
if mapped_status is None:
# (For data analysis purposes, check for any statuses
# that don't have a mapping and add to list
# of "outlier statuses")
logger.info("Unknown status: " + original_status)
outlier_statuses.append(original_status)
else:
new_entry_status = mapped_status
# PART 2: Get the e-mail
if user_id not in user_emails_dictionary:
# this user has no e-mail...this should never happen
if user_id not in users_without_email:
users_without_email.append(user_id)
else:
new_entry_email = user_emails_dictionary[user_id]
# PART 3: Create the transition domain object
# Check for duplicate data in the file we are
# parsing so we do not add duplicates
# NOTE: Currently, we allow duplicate domains,
# but not duplicate domain-user pairs.
# However, track duplicate domains for now,
# since we are still deciding on whether
# to make this field unique or not. ~10/25/2023
existing_domain = next(
(x for x in to_create if x.domain_name == new_entry_domain_name),
None,
)
existing_domain_user_pair = next(
(
x
for x in to_create
if x.username == new_entry_email
and x.domain_name == new_entry_domain_name
),
None,
)
if existing_domain is not None:
# DEBUG:
self.print_debug(
debug_on,
f"{termColors.YELLOW} DUPLICATE file entries found for domain: {new_entry_domain_name} {termColors.ENDC}", # noqa
)
if new_entry_domain_name not in duplicate_domains:
duplicate_domains.append(new_entry_domain_name)
if existing_domain_user_pair is not None:
# DEBUG:
self.print_debug(
debug_on,
f"""{termColors.YELLOW} DUPLICATE file entries found for domain - user {termColors.BackgroundLightYellow} PAIR {termColors.ENDC}{termColors.YELLOW}:
{new_entry_domain_name} - {new_entry_email} {termColors.ENDC}""", # noqa
)
if existing_domain_user_pair not in duplicate_domain_user_combos:
duplicate_domain_user_combos.append(existing_domain_user_pair)
else:
entry_exists = TransitionDomain.objects.filter(
username=new_entry_email, domain_name=new_entry_domain_name
).exists()
if entry_exists:
try:
existing_entry = TransitionDomain.objects.get(
username=new_entry_email,
domain_name=new_entry_domain_name,
)
if existing_entry.status != new_entry_status:
# DEBUG:
self.print_debug(
debug_on,
f"{termColors.OKCYAN}"
f"Updating entry: {existing_entry}"
f"Status: {existing_entry.status} > {new_entry_status}" # noqa
f"Email Sent: {existing_entry.email_sent} > {new_entry_emailSent}" # noqa
f"{termColors.ENDC}",
)
existing_entry.status = new_entry_status
existing_entry.email_sent = new_entry_emailSent
existing_entry.save()
except TransitionDomain.MultipleObjectsReturned:
logger.info(
f"{termColors.FAIL}"
f"!!! ERROR: duplicate entries exist in the"
f"transtion_domain table for domain:"
f"{new_entry_domain_name}"
f"----------TERMINATING----------"
)
sys.exit()
else:
# no matching entry, make one
new_entry = TransitionDomain(
username=new_entry_email,
domain_name=new_entry_domain_name,
status=new_entry_status,
email_sent=new_entry_emailSent,
)
to_create.append(new_entry)
total_new_entries += 1
# DEBUG:
self.print_debug(
debug_on,
f"{termColors.OKCYAN} Adding entry {total_new_entries}: {new_entry} {termColors.ENDC}", # noqa
)
# Check Parse limit and exit loop if needed
if (
total_rows_parsed >= debug_max_entries_to_parse
and debug_max_entries_to_parse != 0
):
logger.info(
f"{termColors.YELLOW}"
f"----PARSE LIMIT REACHED. HALTING PARSER.----"
f"{termColors.ENDC}"
)
break
TransitionDomain.objects.bulk_create(to_create)
logger.info(
f"""{termColors.OKGREEN}
============= FINISHED ===============
Created {total_new_entries} transition domain entries,
updated {total_updated_domain_entries} transition domain entries
{termColors.ENDC}
"""
)
# Print a summary of findings (duplicate entries,
# missing data..etc.)
self.print_summary_duplications(
duplicate_domain_user_combos, duplicate_domains, users_without_email
)
self.print_summary_status_findings(domains_without_status, outlier_statuses)


@@ -1,5 +1,3 @@
# Generated by Django 4.2.1 on 2023-10-12 21:34
from django.db import migrations, models