Allow over 1000 dns-updates to be handled at once

The task-queue API only allows leasing 1000 tasks per call, which is where the original limit came from. We get around that limit by reading (and processing) items from the queue in a loop, 1000 at a time.
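
As a rough illustration of the loop described above (not the code in this commit), here is a minimal sketch. It assumes an in-memory queue stands in for the dns-pull queue; BatchedQueueReader, leaseBatch and readAll are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical stand-in for a reader of a queue that caps each lease at 1000 tasks. */
final class BatchedQueueReader {
  /** Per-call cap taken from the commit message. */
  static final int LEASE_BATCH_SIZE = 1000;

  /** Removes and returns at most maxTasks items, mimicking a single capped lease call. */
  static List<String> leaseBatch(Queue<String> queue, int maxTasks) {
    List<String> batch = new ArrayList<>();
    while (batch.size() < maxTasks && !queue.isEmpty()) {
      batch.add(queue.poll());
    }
    return batch;
  }

  /** Keeps leasing batches until a lease comes back empty, so the cap no longer bounds the total. */
  static List<String> readAll(Queue<String> queue) {
    List<String> all = new ArrayList<>();
    while (true) {
      List<String> batch = leaseBatch(queue, LEASE_BATCH_SIZE);
      if (batch.isEmpty()) {
        break;
      }
      all.addAll(batch);
    }
    return all;
  }
}
```

In this sketch the loop stops only when the queue is empty; the real action also stops after a maximum runtime.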

This is important because the 1000 dns-updates are shared among all TLDs,
meaning that a TLD with >1000 waiting updates can affect the update latency of
other TLDs.

In addition, this partially fixes the bug where if there are more than 1000 updates to paused
/ non-existing TLDs, we completely block all updates to all TLDs.

By partially fixed, I mean "if we have around 1000 updates to paused TLDs, we will read them every time ReadDnsUpdates is called, ignore them, and only then get to the actual updates we want to process".

This works when around 1000 such updates are waiting, but if paused TLDs have tens or hundreds of thousands of updates waiting, they might still choke other TLDs (not to mention that we would keep reading / updating tens or hundreds of thousands of tasks in the queue, which is bad).

A more thorough fix will come in a future CL, as it requires a more thorough change in the code.

Note that the queue lease command supports a maximum of 10 QPS. Any more than
that - and we get errors / empty results. Hence we limit our QPS to 9 to be on
the safe side.
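
For illustration only, one way to pace lease calls at 9 QPS is to space them at least 1/9 of a second apart. The sketch below is an assumption about how such pacing could look, not the code in this commit; LeasePacer and reserve are hypothetical names, and the clock is injected so the behaviour can be checked without sleeping.

```java
import java.util.function.LongSupplier;

/** Hypothetical pacing helper: spaces calls so that at most maxQps of them start per second. */
final class LeasePacer {
  private final long minIntervalNanos;
  private final LongSupplier nanoClock;
  private long nextAllowedNanos;
  private boolean started = false;

  LeasePacer(double maxQps, LongSupplier nanoClock) {
    // Round up so the effective rate never exceeds maxQps.
    this.minIntervalNanos = (long) Math.ceil(1_000_000_000.0 / maxQps);
    this.nanoClock = nanoClock;
  }

  /**
   * Reserves the next call slot and returns how many nanoseconds the caller
   * should wait before issuing the lease (0 if it may proceed immediately).
   */
  long reserve() {
    long now = nanoClock.getAsLong();
    // The first call goes through at once; later calls wait for their slot.
    long start = started ? Math.max(now, nextAllowedNanos) : now;
    started = true;
    nextAllowedNanos = start + minIntervalNanos;
    return start - now;
  }
}
```

A caller would sleep for the returned duration before each lease call, for example with `new LeasePacer(9.0, System::nanoTime)`. A ready-made rate limiter from a library would be the more usual choice in production code.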

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=185218684
guyben 2018-02-09 17:56:29 -08:00 committed by jianglai
parent ce5baafc4a
commit bba975a991
7 changed files with 389 additions and 78 deletions


@@ -265,29 +265,48 @@ public final class RegistryConfig {
}
/**
* The maximum interval (seconds) to lease tasks from the dns-pull queue.
* The maximum time we allow publishDnsUpdates to run.
*
* <p>This is the maximum lock duration for publishing the DNS updates, meaning it should allow
* the various DnsWriters to publish and commit an entire batch (with a maximum number of items
* set by provideDnsTldUpdateBatchSize).
*
* <p>Any update that takes longer than this timeout will be killed and retried from scratch.
* Hence, a timeout that's too short can result in batches that retry over and over again,
* failing forever.
*
* <p>If there are lock contention issues, they should be solved by changing the batch sizes or
* the cron job rate, NOT by making this value smaller.
*
* @see google.registry.dns.ReadDnsQueueAction
* @see google.registry.dns.PublishDnsUpdatesAction
*/
@Provides
@Config("dnsWriteLockTimeout")
public static Duration provideDnsWriteLockTimeout() {
/*
* This is the maximum lock duration for publishing the DNS updates, meaning it should allow
* the various DnsWriters to publish and commit an entire batch (with a maximum number of
* items set by provideDnsTldUpdateBatchSize).
*
* Any update that takes longer than this timeout will be killed and retried from scratch.
* Hence, a timeout that's too short can result in batches that retry over and over again,
* failing forever.
*
* If there are lock contention issues, they should be solved by changing the batch sizes
* or the cron job rate, NOT by making this value smaller.
*/
@Config("publishDnsUpdatesLockDuration")
public static Duration providePublishDnsUpdatesLockDuration() {
return Duration.standardMinutes(3);
}
/**
* The requested maximum duration for ReadDnsQueueAction.
*
* <p>ReadDnsQueueAction reads update tasks from the dns-pull queue. It will continue reading
* tasks until either the queue is empty, or this duration has passed.
*
* <p>This time is the maximum duration between the first and last attempt to lease tasks from
* the dns-pull queue. The actual running time might be slightly longer, as we process the
* tasks.
*
* <p>This value should be less than the cron-job repeat rate for ReadDnsQueueAction, to make
* sure we don't have multiple ReadDnsActions leasing tasks simultaneously.
*
* @see google.registry.dns.ReadDnsQueueAction
*/
@Provides
@Config("readDnsQueueActionRuntime")
public static Duration provideReadDnsQueueRuntime() {
return Duration.standardSeconds(45);
}
/**
* Returns the default time to live for DNS A and AAAA records.
*