google-nomulus/docs/app-engine-architecture.md
mcilwain 715118ddda Add documentation on our App Engine services and task queues
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=128098514
2016-07-21 17:57:28 -04:00

App Engine architecture

This document contains information on the overall architecture of the Domain Registry project as it is implemented in App Engine.

Services

The Domain Registry contains three services (called modules in earlier versions of App Engine): default (also called front-end), backend, and tools. The services are largely independent of one another: each can be upgraded individually, each has its own log output, and each has its own servers and scaling configuration.

Default service

The default service is responsible for all registrar-facing EPP command traffic, all user-facing WHOIS and RDAP traffic, and the admin and registrar web consoles, and is thus the most important service. If the service has any problems and goes down or stops servicing requests in a timely manner, it will begin to impact users immediately. Requests to the default service are handled by the FrontendServlet, which provides all of the endpoints exposed in FrontendRequestComponent.

Backend service

The backend service is responsible for executing all regularly scheduled background tasks (using cron) as well as all asynchronous tasks. Requests to the backend service are handled by the BackendServlet, which provides all of the endpoints exposed in BackendRequestComponent. These include tasks for generating/exporting RDE, syncing the trademark list from TMDB, exporting backups, writing out DNS updates, handling asynchronous contact and host deletions, writing out commit logs, exporting metrics to BigQuery, and many more. Issues in the backend service will not immediately be apparent to end users, but the longer it is down, the more obvious it will become that user-visible tasks such as DNS and deletion are not being handled in a timely manner.

The backend service is also where all MapReduces run, which includes some of the aforementioned tasks such as RDE and asynchronous resource deletion, as well as any one-off data migration MapReduces. Consequently, the backend service should be sized to support not just the normal ongoing DNS load but also the load incurred by MapReduces, both scheduled (such as RDE) and on-demand (asynchronous contact/host deletion).

Tools service

The tools service is responsible for servicing requests from the registry_tool command line tool, which provides administrative-level functionality for developers and tech support employees of the registry. It is thus the least critical of the three services. Requests to the tools service are handled by the ToolsServlet, which provides all of the endpoints exposed in ToolsRequestComponent. Examples of the functionality this service provides include the server-side code to update premium lists, run EPP commands from the tool, and manually modify contacts, hosts, domains, and other resources. Problems with the tools service are not visible to users.

Task queues

Task queues in App Engine provide an asynchronous way to enqueue tasks and then execute them on some kind of schedule. There are two types of queues: push queues and pull queues. App Engine executes tasks in push queues automatically, as fast as the queue's configurable throttle limit allows. Tasks in pull queues remain there indefinitely until the queue is polled by code that is running for some other reason. Essentially, push queues run their own tasks, while pull queues just hold data that is consumed by something else. Many other parts of App Engine are implemented using task queues. For example, App Engine cron adds tasks to push queues at regularly scheduled intervals, and the MapReduce framework adds tasks for each phase of the MapReduce algorithm.

The Domain Registry project uses a particular pattern of paired push/pull queues that is worth explaining in detail. Push queues are essential because App Engine's architecture does not support long-running background processes. They are the fundamental building block that allows code to run asynchronously and in the background rather than in response to incoming web requests. However, push queues are limited in that they do not allow batch processing or grouping of tasks. That is where the pull queue comes in. Regularly scheduled tasks in the push queue will, upon execution, poll the corresponding pull queue for a specified number of tasks and execute them in a batch. This allows the code to execute in the background while taking advantage of batch processing.
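The paired pattern can be sketched with a small in-memory model. This is only an illustration under stated assumptions: `PullQueue`, `leaseBatch`, and `runScheduledTask` are hypothetical names rather than App Engine or Domain Registry APIs, and leasing is simplified to removing tasks from the queue outright.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** In-memory stand-in for the paired push/pull pattern; not the App Engine API. */
public class PushPullSketch {

  /** Hypothetical pull queue: holds payloads until some caller asks for them. */
  static final class PullQueue {
    private final Queue<String> tasks = new ArrayDeque<>();

    void add(String payload) {
      tasks.add(payload);
    }

    /** Removes and returns up to maxTasks payloads, oldest first. */
    List<String> leaseBatch(int maxTasks) {
      List<String> batch = new ArrayList<>();
      while (batch.size() < maxTasks && !tasks.isEmpty()) {
        batch.add(tasks.remove());
      }
      return batch;
    }

    int size() {
      return tasks.size();
    }
  }

  /** Models what one scheduled push-queue task does when it executes. */
  static List<String> runScheduledTask(PullQueue pullQueue, int batchSize) {
    // A real handler would process the whole batch in one operation here.
    return pullQueue.leaseBatch(batchSize);
  }

  public static void main(String[] args) {
    PullQueue pullQueue = new PullQueue();
    for (int i = 0; i < 5; i++) {
      pullQueue.add("update-" + i);
    }
    // Two consecutive scheduled runs drain the queue in batches of at most three.
    System.out.println(runScheduledTask(pullQueue, 3)); // [update-0, update-1, update-2]
    System.out.println(runScheduledTask(pullQueue, 3)); // [update-3, update-4]
  }
}
```

The point of the sketch is the division of labor: producers only append to the pull queue, and the scheduled task decides how much work to take per run.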

Particulars on the task queues in use by the Domain Registry project are specified in the queue.xml file. Note that many push queues have a direct one-to-one correspondence with entries in cron.xml because they need to be fanned out on a per-TLD or other basis (see the Cron section below for more explanation). The exact queue that a given cron task will use is passed as the query string parameter "queue" in the URL specification for the cron task.
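As a rough illustration of the "queue" query string parameter, the sketch below extracts it from a cron URL. The URL path and the `perTld` flag are made up for the example, and `CronQueueParam` is a hypothetical helper, not part of the project; the parsing is deliberately minimal and does not URL-decode values.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that reads the "queue" parameter from a cron URL. */
public class CronQueueParam {

  /** Splits a raw query string into key/value pairs; valueless keys map to "". */
  static Map<String, String> parseQuery(String rawQuery) {
    Map<String, String> params = new LinkedHashMap<>();
    if (rawQuery == null || rawQuery.isEmpty()) {
      return params;
    }
    for (String pair : rawQuery.split("&")) {
      int eq = pair.indexOf('=');
      String key = eq < 0 ? pair : pair.substring(0, eq);
      String value = eq < 0 ? "" : pair.substring(eq + 1);
      params.put(key, value);
    }
    return params;
  }

  /** Returns the queue named in the URL, or null if no "queue" parameter is present. */
  static String queueFor(String cronUrl) {
    return parseQuery(URI.create(cronUrl).getRawQuery()).get("queue");
  }

  public static void main(String[] args) {
    // The path and the perTld flag are invented for illustration only.
    System.out.println(queueFor("/example/cron?queue=dns-cron&perTld")); // dns-cron
  }
}
```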

Here are the task queues in use by the system. All are push queues unless explicitly marked as otherwise.

  • bigquery-streaming-metrics -- Queue for metrics that are asynchronously streamed to BigQuery in the Metrics class. Tasks are enqueued during EPP flows in EppController. This means that there is a lag of a few seconds to a few minutes between when metrics are generated and when they are queryable in BigQuery, but this is preferable to slowing all EPP flows down and blocking them on BigQuery streaming.
  • brda -- Queue for tasks to upload weekly Bulk Registration Data Access (BRDA) files to a location where they are available to ICANN. The RdeStagingReducer (part of the RDE MapReduce) creates these tasks at the end of generating an RDE dump.
  • delete-commits -- Cron queue for tasks to regularly delete commit logs that are more than thirty days stale. These tasks execute the DeleteOldCommitLogsAction.
  • dns-cron (cron queue) and dns-pull (pull queue) -- A push/pull pair of queues. Cron enqueues a task in dns-cron every minute. Each of these tasks is executed by ReadDnsQueueAction, which leases a batch of tasks from the pull queue, groups them by TLD, and writes each group as a single task to dns-publish to be published to the configured DNS writer for the TLD.
  • dns-publish -- Queue for batches of DNS updates to be pushed to DNS writers.
  • export-bigquery-poll -- Queue for tasks to query the success/failure of a given BigQuery export job. Tasks are enqueued by BigqueryPollJobAction.
  • export-commits -- Queue for tasks to export commit log checkpoints. Tasks are enqueued by CommitLogCheckpointAction (which is run every minute by cron) and executed by ExportCommitLogDiffAction.
  • export-reserved-terms -- Cron queue for tasks to export the list of reserved terms for each TLD. The tasks are executed by ExportReservedTermsAction.
  • export-snapshot -- Cron and push queue for tasks to load a Datastore snapshot that was stored in Google Cloud Storage and export it to BigQuery. Tasks are enqueued by both cron and CheckSnapshotServlet and are executed by both ExportSnapshotServlet and LoadSnapshotAction.
  • export-snapshot-poll -- Queue for tasks to check that a Datastore snapshot has been successfully uploaded to Google Cloud Storage (this is an asynchronous background operation that can take an indeterminate amount of time). Once the snapshot is successfully uploaded, it is imported into BigQuery. Tasks are enqueued by ExportSnapshotServlet and executed by CheckSnapshotServlet.
  • export-snapshot-update-view -- Queue for tasks to update the BigQuery views to point to the most recently uploaded snapshot. Tasks are enqueued by LoadSnapshotAction and executed by UpdateSnapshotViewAction.
  • flows-async -- Queue for asynchronous tasks that are enqueued during EPP command flows. Currently all of these tasks correspond to invocations of any of the following three MapReduces: DnsRefreshForHostRenameAction, DeleteHostResourceAction, or DeleteContactResourceAction.
  • group-members-sync -- Cron queue for tasks to sync registrar contacts (not domain contacts!) to Google Groups. Tasks are executed by SyncGroupMembersAction.
  • load[0-9] -- Queues used by LoadTestAction to load-test the system. These queues don't need to exist except when actively running load tests (which is not recommended on production environments). There are ten of these queues to provide simple sharding, because the Domain Registry system is capable of handling significantly more queries per second (QPS) than the highest throttle limit available on a task queue (which is 500 QPS).
  • lordn-claims and lordn-sunrise -- Pull queues for handling LORDN exports. Tasks are enqueued synchronously during EPP commands depending on whether the domain name in question has a claims notice ID.
  • marksdb -- Queue for tasks to verify that an upload to NORDN was successfully received and verified. These tasks are enqueued by NordnUploadAction following an upload and are executed by NordnVerifyAction.
  • nordn -- Cron queue used for NORDN exporting. Tasks are executed by NordnUploadAction, which pulls LORDN data from the lordn-claims and lordn-sunrise pull queues (above).
  • rde-report -- Queue for tasks to upload RDE reports to ICANN following successful upload of full RDE files to the escrow provider. Tasks are enqueued by RdeUploadAction and executed by RdeReportAction.
  • rde-upload -- Cron queue for tasks to upload already-generated RDE files from Cloud Storage to the escrow provider. Tasks are executed by RdeUploadAction.
  • sheet -- Queue for tasks to sync registrar updates to a Google Sheets spreadsheet. Tasks are enqueued by RegistrarServlet when changes are made to registrar fields and are executed by SyncRegistrarsSheetAction.
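The grouping step described for the dns-cron/dns-pull pair can be sketched as follows. This is not the ReadDnsQueueAction implementation: `DnsBatchSketch`, `tldOf`, and `groupByTld` are hypothetical names, the leased tasks are modeled as plain domain-name strings, and the TLD is assumed to be the final label of the name, which would not hold for multi-label TLDs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative grouping of leased DNS update tasks by TLD; names are hypothetical. */
public class DnsBatchSketch {

  /** Assumption: the TLD is everything after the last dot in the name. */
  static String tldOf(String domainName) {
    return domainName.substring(domainName.lastIndexOf('.') + 1);
  }

  /** Groups a leased batch so each TLD's updates can be written as one publish task. */
  static Map<String, List<String>> groupByTld(List<String> leasedNames) {
    Map<String, List<String>> byTld = new TreeMap<>();
    for (String name : leasedNames) {
      byTld.computeIfAbsent(tldOf(name), tld -> new ArrayList<>()).add(name);
    }
    return byTld;
  }

  public static void main(String[] args) {
    List<String> leased = List.of("a.example", "b.test", "c.example");
    // Each map entry stands in for one task that would go to dns-publish.
    System.out.println(groupByTld(leased)); // {example=[a.example, c.example], test=[b.test]}
  }
}
```

Grouping before publishing means each DNS writer receives one batch per TLD per run rather than one call per changed name.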

Cron tasks

Datastore entities

Cloud Storage buckets

Web.xml

Cursors