google-nomulus/docs/rde-import-architecture.md
Wolfgang Meyers d4a428fc24 Document TLD import architecture
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=149575471
2017-03-13 11:22:38 -04:00

3.9 KiB

Registry Data Import Architecture

See also the RDE usage guide.

The Registry Data Import feature was designed to handle escrow files from other registries with millions of domains. In the spirit of divide and conquer, the mapreduce library is used to break up the work of the import into smaller chunks that can be processed in a reasonable period of time. This process is broken down into four separate mapreduce jobs that must be run in sequence due to how datastore transactions work and due to dependencies between registry objects. The steps are broken up as follows:

Initial Setup - This is a set of manual steps that must be completed before the process can be run, and is out of the scope of this document. See Usage for more details.

Contacts Import - Reads contact entries from an escrow file and saves them as ContactResource entities. HistoryEntry entities are also created for the contact. This process depends on initial setup, but does not depend on any previous step being run.

Hosts Import - Reads host entries from an escrow file and saves them as HostResource entities. HistoryEntry entities are also created for the hosts. This process depends on initial setup, but does not depend on any previous step being run.

Domains Import - Reads domain entries from an escrow file and saves them as DomainResource entities. For each domain imported, a history entry, autorenew billing event and autorenew poll message will also be created. For domains that are in pending transfer state, the import process will also create future entities for automatic server approval in the same fashion as domain transfer request EPP messages. Domains cannot be imported until the contacts and hosts required by the domain are imported in previous steps.

Hosts Link - Reads host entries from an escrow file and links in-zone hosts to their superordinate domains. This is the last step because both hosts and domains have to be imported before the link can be made in both directions.

Components

Each mapreduce job (with the exception of Hosts Link) is made up of a similar set of components. Note that much of the work that is done by each job strongly resembles the inversion of the Registry Data Export feature, and reuses the Jaxb representations of the xml elements that compose escrow files.

Parser - The parser is the lowest level of the import process. This component parses an escrow file (provided as an open stream from Google Cloud Storage) into discrete JAXB objects. The parser maintains an internal cursor in the xml file that represents the next element to be read, and can advance to and skip any number of elements.

Reader - The reader is configured by each mapreduce job to load an escrow file from Cloud Storage and use a parser to read a selected subset of the file, forwarding the results to a mapper.

Input - An input is responsible for determining how many reader instances to create and which section of the escrow file should be consumed by each reader.

Converter - The converter accepts a Jaxb object and returns an equivalent resource that can be saved to the datastore.

Import Utility Logic - Common import logic is consolidated into a single place, such as creation of index entities and escrow file validation.

Mapper - The mapper accepts a stream of Jaxb objects from the reader and uses the converter to map them to resource objects. Then for each resource object produced, the mapper will attempt to save the resource and any related objects to the datastore in a transaction. This is an idempotent operation; if any resource has been previously imported by the process, it will be ignored.

Action Endpoint - The action endpoint is responsible for accepting requests to launch each step of the import process, bootstrapping mapreduce jobs, and redirecting the client to the status page of the import job. This is the entry point of the import process.