Associate the custom metrics with the correct monitored resource type. The labels of the monitored resource are either obtained from environment variables for the container, configured in the GKE deployment file, or queried from the GCE metadata server. Using the correct monitored resource can improve performance and reduce out-of-order metric writes.
Also changed the metrics display name to be more descriptive.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=189184411
Soy has historically tolerated map accesses on weakly-typed variables. That is, if a template declared a param $p and then did $p['some_key'] in the template body, Soy would treat $p as a map even if it wasn't statically declared as one.
This situation is changing with [] There are now two map types, `map` and `legacy_object_map`. We are trying to migrate every template in [] from `legacy_object_map` to `map`, leaving Soy with one (improved) map type. Because the two map types generate different JS code, Soy can no longer allow map accesses on weakly-typed variables. (If $p['some_key'] occurs in a template and the type of $p is unknown, Soy would not know what code to generate.)
Every parameter whose static type is unknown (`?`) but which is inferred to be a `legacy_object_map` needs to be migrated to a `map`. We are developing tools for this in [] However, as a first step, we need to migrate the subset of these params that use the legacy SoyDoc syntax to use header param syntax with a static type of `?`. (All params declared in SoyDoc are typed as unknown, and it is a compilation error to mix SoyDoc and header param syntax in the same template, so any template that declares a SoyDoc param that is inferred to be a map needs to migrate to header param syntax.)
This CL was prepared by using the tools in [] to create a list of templates declaring SoyDoc params inferred to be legacy_object_maps. This list was then fed to the existing //third_party/java_src/soy/java/com/google/template/soy/tools:ParamMigrator tool. Since this tool migrates whole files instead of individual templates, the resulting CL is a superset of the migration that is actually required. However, since the SoyDoc param syntax has been deprecated for years, and since there is little risk in migrating from one param style to another, I decided to land the superset.
This migration falls under the LSC described at https://docs.google.com/document/d/1dAl-rDMp3oL0Zh_iSTaiHICwtcbLbVIy1FQ0wXSAaHs.
Tested:
TAP --sample for global presubmit queue
[] passed FOSS tests
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=188879980
Add the "shell" command which lets you run multiple other commands in a single
session, sparing you the initialization costs for all but the first of them.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=188712815
Allow specifying a certificate hash as an alternative to a certificate file. This makes things easier when only setting up EAP registrars. The certificate hash can easily be pulled from existing registrars (SUNRISE, GA, etc.) with automation.
Also fixes a bug where we always expected the registrar name + phase string to be at least 7 characters long.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=188511561
This fixes a few problems encountered when building and running according to the Install Guide using the DummyKeyring. It's still failing when trying to parse the JSON credential, which I haven't solved, but before proceeding I wanted to get agreement that it needs to be fixed at all since the best we could do is provide a valid format (as with the PGP keyrings), but the metrics logging will still fail for a different reason (i.e. the credential will not work for the GC project).
Things fixed in this PR:
Fix format string causing MissingFormatArgumentException in FrontendServlet
when keyring fails to load.
Include exception cause in VerifyException in PgpHelper.
Replace dummy PGP keyrings with ones without a password, as code expects.
Document how the PGP keyrings are created.
P.S. I see a tab character snuck into PgpHelper. I'll fix that if you're interested in this PR.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=188342973
If the proxy protocol header contains a malformed string, such as "PROXY UNKNOWN", instead of throwing and killing the connection, use the TCP source IP as the remote IP.
Also changed how the header is read from the buffer, to avoid a potential Netty resource leak. Originally the header was read into another ByteBuf, which needs to be explicitly released in order for Netty to reclaim its memory (http://netty.io/wiki/reference-counted-objects.html). Now we just read it into a byte array and let the JVM GC it.
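The fallback described above can be sketched roughly as follows. This is a minimal illustration in plain Java, not the proxy's actual handler: the class and method names are hypothetical, and it assumes the text (v1) form of the PROXY protocol header with six space-separated fields.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; the names are illustrative and not taken from the proxy code. */
final class ProxyHeaderSketch {

  /**
   * Returns the client IP from a PROXY protocol v1 header line, or the TCP
   * source IP when the header does not carry a usable address.
   */
  static String remoteIp(byte[] headerBytes, String tcpSourceIp) {
    // The header lives in a plain byte array, so there is no reference-counted
    // buffer to release afterwards; the JVM garbage collects it.
    String header = new String(headerBytes, StandardCharsets.US_ASCII).trim();
    String[] parts = header.split(" ");
    // Assumed form: PROXY TCP4|TCP6 <src ip> <dst ip> <src port> <dst port>
    boolean wellFormed =
        parts.length == 6
            && parts[0].equals("PROXY")
            && (parts[1].equals("TCP4") || parts[1].equals("TCP6"));
    // Fall back to the TCP source IP instead of throwing.
    return wellFormed ? parts[2] : tcpSourceIp;
  }
}
```

With this sketch, a header such as "PROXY UNKNOWN" yields the TCP source IP rather than an exception.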
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=188047084
This simplifies calculating the overall invoice by giving RESTORE fees a
period equal to the period of the associated RENEW (1 year). Older
BillingEvents will not be backfilled, and will have periodYears = null.
Invoicing and business both agree this is a valid representation, since RESTORE fees are intrinsically tied to the 1-year RENEW they're associated with.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=188041777
When not running locally, the logging formatter is set to convert the log record to a single-line JSON string that Stackdriver logging agent running in GKE will pick up and parse correctly.
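A single-line JSON formatter along these lines might look like the sketch below. It uses java.util.logging only; the class name is hypothetical, and the field names "severity" and "message" and the level mapping are assumptions about what the logging agent expects, not something stated in this change.

```java
import java.util.logging.Formatter;
import java.util.logging.Level;
import java.util.logging.LogRecord;

/** Hypothetical formatter; field names and severity mapping are assumptions. */
final class JsonLineFormatterSketch extends Formatter {

  @Override
  public String format(LogRecord record) {
    // One record becomes exactly one line: newlines in the message are escaped.
    return "{\"severity\":\"" + severity(record.getLevel())
        + "\",\"message\":\"" + escape(formatMessage(record)) + "\"}\n";
  }

  /** Maps java.util.logging levels onto a small set of severity names. */
  static String severity(Level level) {
    if (level.intValue() >= Level.SEVERE.intValue()) return "ERROR";
    if (level.intValue() >= Level.WARNING.intValue()) return "WARNING";
    if (level.intValue() >= Level.INFO.intValue()) return "INFO";
    return "DEBUG";
  }

  /** Escapes the characters that may not appear raw inside a JSON string. */
  static String escape(String s) {
    StringBuilder sb = new StringBuilder();
    for (char c : s.toCharArray()) {
      switch (c) {
        case '"': sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.toString();
  }
}
```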
Also removed the redundant logging handler in the proxy frontend connection. It has two problems: 1) it is possible to leak PII, such as client IPs, when all frontend traffic is logged, although this is less of a concern because the GCP TCP proxy load balancer masquerades source IPs. 2) We are only logging the HTTP request/response that the frontend connection sends to/receives from the backend connection, but the backend already has its own logging handler to log the same message that it gets from/sends to the GAE app, so the logging in the frontend connection does not really give extra information.
Logging of some potential PII, such as the source IP of a proxied connection, is also removed.
Thirdly, added a k8s autoscaling object that scales the containers based on CPU load. The default target load is 80%. This, in connection with GKE cluster VM autoscaling, means that when traffic is low, we'll only have one VM running one container of the proxy.
Fixes a bug where the MetricsComponent generates a separate ProxyConfig that does not call the parse method on the command line args passed, resulting in the default Environment always being used in constructing the metric reporter.
Lastly, a little bit of cleanup of the MOE config script: no newlines are necessary, as the BUILD files are formatted after string substitution.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=188029019
Following b/74072938, our quota for our main projects (prod, sandbox, alpha) is now 5000 queries per 100s, which allows us to increase our client-side rate limit accordingly.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=188026911
Changed SUNRISE to START_SUNRISE and added a registry/registrar pair for testing EAP. The EAP period is set to 2018-03-01 to 2022-03-01 with a price of $100.
A temporary flag is added to create only the EAP registry/registrar pair so that we can update existing registrars.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=187897405
It was nullable all along, but wasn't tagged as such, and thus it was
possible to misuse the method from its call sites.
Also adds an assertion about no NORDN tasks being enqueued in a failing
domain create test for a required signed mark.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=187649865
This enables sharded DNS publishing on a per-TLD basis. Instead of a TLD-wide lock, the sharded scheme locks each update on the shard number, allowing parallel writes to DNS.
We allow N (the number of shards) to be 0 or 1 for no sharding, and N > 1 for an N-way sharding scheme. Unless explicitly set, all TLDs default to a numShards of 0, so we don't have to reload all registry objects explicitly.
WARNING: This will change the lock name upon deployment for the PublishDnsAction from "<TLD> Dns Updates" to "<TLD> Dns Updates shard 0". This may cause concurrency issues if the underlying DNSWriter is not parallel-write tolerant (currently all production usages are ZonemanWriter, which is parallel-tolerant, so no issues are expected).
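As a rough illustration of the scheme, the sketch below builds the lock name in the format quoted above and treats a numShards of 0 or 1 as a single shard. The class and method names are hypothetical, and the hash-modulo assignment of names to shards is an assumption; this message does not say how updates are assigned to shards.

```java
/** Hypothetical helper; shard assignment by hash is an assumption. */
final class DnsShardLockSketch {

  /** Lock name in the format quoted in the warning above. */
  static String lockName(String tld, int shard) {
    return String.format("%s Dns Updates shard %d", tld, shard);
  }

  /** A numShards of 0 or 1 both mean no sharding, i.e. one shard. */
  static int effectiveShards(int numShards) {
    return Math.max(1, numShards);
  }

  /** Assumed assignment: hash of the name modulo the shard count. */
  static int shardFor(String name, int numShards) {
    // floorMod keeps the result non-negative even for negative hash codes.
    return Math.floorMod(name.hashCode(), effectiveShards(numShards));
  }
}
```

With no sharding configured, every update maps to shard 0, which is why the deployed lock name changes to end in "shard 0".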
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=187525655
Also changed the name of "verifyRegistryStateAllowsLaunchFlows" to "verifyRegistryStateAllowsApplicationFlows", because there are now launch flows that don't use applications (start-date sunrise).
Finally, added a test to showcase the "super-user" power that EPPs with Anchor Tenants have. There's no change in behavior in that regard in this CL - we just add a test to make it explicit.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=187517199
Using Bazel to build and push the image results in reproducible builds.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=187252645
Since this has interactions with the recently added EPP resource caching,
they both need to be configurable, otherwise the EPP resource caching time
could not be set longer than the hard-coded async delete delay.
This also adds comments to better clarify the interaction between the two.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=187179539
After investigating common domain create/update command usage
patterns by registrars, we noticed that it is frequent for a
given registrar to reuse both hosts (using a standardized set of
nameservers) as well as contacts (e.g. for privacy/proxy
services). With these usage patterns, potential per-registrar
throughput during high volume scenarios (i.e. first moments of
General Availability) suffers from hitting hot keys in Datastore.
The solution, implemented in this CL, is to add short-term
in-memory caching for contacts and hosts, analogous to how we are
already caching Registry and Registrar entities. These new
cached paths are only used inside domain flows to determine
existence and deleted/pending delete status of contacts and
hosts. This is a potential loss of transactional consistency, but
in practice it's hard to imagine this having negative effects, as
contacts or hosts that are in use cannot be deleted, and caching
would primarily affect widely used contacts and hosts.
Note that this caching can be turned on or off through a
configuration option, and by default would be off. We'd only want
it on when we really needed it, i.e. during a big launch.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=187093378
Currently, DeleteProberDataAction goes over all the TLDs of type "TEST" that
end with .test, and deletes all their DomainResources and their subordinate
history entries, poll messages, billing events, ForeignKeyDomainIndex and
EppResourceIndex entities.
After this change, we can optionally supply TLDs to work on for the request using one or more "tld=" parameters. The default (if none are supplied) will still be "all TEST TLDs that end in .test".
All given TLDs must exist, and must all be of type TEST.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=187064053
Even when the request is not permissioned to see contact information, we should
show information about the owning registrar.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=187049833
This CL sets up the Kubernetes configuration files necessary to deploy the proxy service to k8s (GKE to be specific). Because a Kubernetes service can only expose node ports higher than 30000, the default ports that the containers expose are also changed to >30000 so that they are consistent. This is *not* necessary, but makes it easier to remember which ports are for what purpose.
Note that we are not setting up a load balancing service. The way it is set up now, the services are only visible within the clusters, on each node at the specified node ports. The load balancer k8s sets up uses the GCP L4 load balancer, which does not support IPv6 (because it does not do TCP termination at the LB, rather just routes packets to cluster nodes, and GCE VMs do not support IPv6 yet). The L4 load balancer also only provides regional IPs on the frontend, which means proxies running in different regions (Americas, EMEA, APAC) would all have different IPs, which in turn offloads regional routing determination to the DNS system, adding complexity.
A user of the proxy instead should set up TCP proxy load balancing in GCP separately and point traffic to the VM group(s) backing the k8s cluster. This allows for a single global anycast IP (IPv4 and IPv6) to be allocated at the load balancer frontend.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=187046521
It seems that even though the token is supposed to be valid for 60min, in
practice it expires before that. Reducing caching time to 30min solves the
problem (at least as far as I can tell). This should not add much load,
as we are only calling the API twice an hour instead of once.
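The caching behavior can be pictured with the small sketch below: a value is refetched once it is older than a fixed time-to-live. The class is hypothetical and written against an injectable clock for illustration; it is not the caching mechanism used in the actual code.

```java
import java.time.Duration;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical time-bounded cache for a single token. */
final class ExpiringTokenSketch {
  private final Supplier<String> fetcher;
  private final LongSupplier clockMillis;
  private final long ttlMillis;
  private String token;
  private long fetchedAtMillis;

  ExpiringTokenSketch(Supplier<String> fetcher, Duration ttl, LongSupplier clockMillis) {
    this.fetcher = fetcher;
    this.ttlMillis = ttl.toMillis();
    this.clockMillis = clockMillis;
  }

  /** Returns the cached token, refetching it once the time-to-live has elapsed. */
  synchronized String get() {
    long now = clockMillis.getAsLong();
    if (token == null || now - fetchedAtMillis >= ttlMillis) {
      token = fetcher.get();
      fetchedAtMillis = now;
    }
    return token;
  }
}
```

Constructed with Duration.ofMinutes(30), this calls the fetcher at most twice per hour under steady use.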
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=186830395
The RDAP Pilot Program operational profile document indicates that domain
responses should list, in addition to their normal contacts, a special entity
for the registrar.
1.5.12. The domain object in the RDAP response MUST contain an entity with the registrar role (called registrar entity in this section). The handle of the entity MUST be equal to the IANA Registrar ID. A valid fn member MUST be present in the registrar entity. Other members MAY be present in the entity (as specified in RFC6350, the vCard Format Specification and its corresponding JSON mapping RFC7095). Contracted parties MUST include an entity with the abuse role (called Abuse Entity in this section) within the registrar entity. The Abuse Entity MUST include tel and email members, and MAY include other members.
1.5.13. The entity with the registrar role in the RDAP response MUST contain a publicIDs member [RFC7483] to identify the IANA Registrar ID from the IANA’s Registrar ID registry (https://www.iana.org/assignments/registrar-ids/registrar-ids.xhtml). The type value of the publicID object MUST be equal to IANA Registrar ID.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=186797360
I'm actually surprised that we had this in our code, as it seems like a huge
oversight, but we were individually loading each and every referenced contact
and host during domain/application create/update/allocate flows. Loading them
all as a single batch should reduce round trips to Datastore by a good deal,
thus improving performance.
We aren't giving up any transactional consistency in doing so and the only
potential downside I can think of is that we're always loading all contacts/
hosts instead of only some of them, in the rare case that one of the earlier
contacts/hosts is actually in pending delete (which allowed short-circuiting).
However, the gains from only making one round-trip should swamp the potential
losses in occasionally loading more data than is strictly necessary.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=186656118
Currently we validate the fee extension by summing up all fees present in the extension and comparing the sum against the total fee to be charged. While this works in most cases, we'd like the ability to validate each fee individually. This is especially useful during EAP, when two fees are charged: a regular "create" fee that would also be the amount we charge during renewal, and a one-time "EAP" fee.
Because we can only distinguish fees by their descriptions, we try to match the description to the format string of the fee type enums. We also only require individual fee matches when we are charging more than one type of fee, which makes the change compatible with most existing use cases where only one fee is charged and the description field is ignored in the extension.
We expect the workflow to be that a registrar sends a domain check, and we reply with exactly what fees we are expecting, and then it will use the descriptions in the response to send us a domain create with the correct fees.
Note that we aggregate fees within the same FeeType together. Normally there will only be one fee per type, but in case of custom logic there could be more than one fee for the same type. There is no way to distinguish them as they both use the same description. So it is simpler to just aggregate them.
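The comparison described above can be sketched as follows. This is a simplified illustration with hypothetical names: fees are assumed to be already classified by type (the description-to-format-string matching is left out), and the FeeType values shown are an illustrative subset.

```java
import java.math.BigDecimal;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of per-type fee validation. */
final class FeeCheckSketch {
  /** Illustrative subset of fee types. */
  enum FeeType { CREATE, EAP }

  static final class Fee {
    final FeeType type;
    final BigDecimal amount;

    Fee(FeeType type, String amount) {
      this.type = type;
      this.amount = new BigDecimal(amount);
    }
  }

  /** Sums fees per type, so several fees of one type count as a single amount. */
  static Map<FeeType, BigDecimal> totalsByType(List<Fee> fees) {
    Map<FeeType, BigDecimal> totals = new EnumMap<>(FeeType.class);
    for (Fee fee : fees) {
      totals.merge(fee.type, fee.amount, BigDecimal::add);
    }
    return totals;
  }

  static BigDecimal grandTotal(Map<FeeType, BigDecimal> totals) {
    BigDecimal sum = BigDecimal.ZERO;
    for (BigDecimal value : totals.values()) {
      sum = sum.add(value);
    }
    return sum;
  }

  /** Compares provided fees against expected ones. */
  static boolean matches(List<Fee> expected, List<Fee> provided) {
    Map<FeeType, BigDecimal> want = totalsByType(expected);
    Map<FeeType, BigDecimal> got = totalsByType(provided);
    if (want.size() <= 1) {
      // Only one fee type is charged: compare the overall totals, as before.
      return grandTotal(want).compareTo(grandTotal(got)) == 0;
    }
    // More than one fee type: every type must match individually.
    if (!want.keySet().equals(got.keySet())) {
      return false;
    }
    for (Map.Entry<FeeType, BigDecimal> entry : want.entrySet()) {
      if (entry.getValue().compareTo(got.get(entry.getKey())) != 0) {
        return false;
      }
    }
    return true;
  }
}
```

Under this sketch, a single "create" fee equal to the combined total is rejected when both a create fee and an EAP fee are expected, even though the sums agree.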
This CL also includes some reformatting that conforms to google-java-format output.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=186530316
The unlimited exponential backoff makes cascading failure a serious problem
when encountering burst DNS load. Originally, it was exponential backoff with
a minimum of 1 second and a maximum of 1 hour. This changes it to scale
linearly from 30 seconds to 10 minutes. The minimum of 30 seconds avoids
over-retrying due to lock contention. The maximum of 10 minutes allows for
more retries within our 1 hour SLA. Finally, we're switching to linear scaling
to increase the number of 'quick' retries with a low backoff time, before
ultimately settling on the upper bound of 10 minutes (if a task ever gets to
that point, it's probably misconfigured).
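To make the new schedule concrete, here is a small sketch of a linear backoff clamped between 30 seconds and 10 minutes. The class name and the 30-second step per retry are assumptions for illustration; the message only gives the bounds.

```java
import java.time.Duration;

/** Hypothetical linear backoff; the step size is an assumption. */
final class LinearBackoffSketch {
  static final Duration MIN = Duration.ofSeconds(30);
  static final Duration MAX = Duration.ofMinutes(10);
  static final Duration STEP = Duration.ofSeconds(30); // assumed increment per retry

  /** Delay before the given retry (0-based), growing linearly up to MAX. */
  static Duration delay(int retry) {
    Duration candidate = MIN.plus(STEP.multipliedBy(retry));
    return candidate.compareTo(MAX) > 0 ? MAX : candidate;
  }
}
```

Under these assumptions the delay reaches the 10-minute cap at the twentieth attempt and stays there.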
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=186041553
Added:
- dns/update_latency, which measures the time from when a DNS update was added to the pull queue until that update is committed to the DnsWriter
- - It doesn't check that after being committed, it was actually published in the DNS.
- dns/publish_queue_delay, which measures the time from the initial insertion into the push queue until a publishDnsUpdate action is handled. It is measured both for successes (which is what we care about) and for various failures (which are important because the delay for the successful publishDnsUpdate will be greater than that of any of the previous failures)
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=185995678
The START_DATE_SUNRISE phase allows registration of domains only with a signed mark. In all other respects, it is identical to the GENERAL_AVAILABILITY phase.
Note that Anchor Tenants bypass all checks, and are hence able to register domains without a signed mark.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=185534793
The task-queue API only allows reading 1000 tasks at a time, hence the original reason for this limit. We get over that limit by reading (and processing) items from the queue in a loop - 1000 at a time.
This is important because the 1000 dns-updates are shared among all TLDs,
meaning that a TLD with >1000 waiting updates can affect the update latency of
other TLDs.
In addition, this partially fixes the bug where, if there are more than 1000 updates to paused or non-existent TLDs, we completely block all updates to all TLDs.
By "partially fixes", I mean that if we have around 1000 updates to paused TLDs, we will read them every time ReadDnsUpdates is called, ignore them, and only then get to the actual updates we want to process.
This works when around 1000 updates are waiting, but if paused TLDs have tens or hundreds of thousands of updates waiting, this might still choke other TLDs (not to mention that we keep reading and updating tens or hundreds of thousands of tasks in the queue, which is bad).
A more thorough fix will come in a future CL, as it requires a more thorough change in the code.
Note that the queue lease command supports a maximum of 10 QPS. Any more than
that - and we get errors / empty results. Hence we limit our QPS to 9 to be on
the safe side.
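The read loop can be sketched as below. The leaseBatch function is a stand-in for the real lease call, and all names are hypothetical; the sketch only shows reading 1000 tasks at a time while spacing calls so they stay at or below 9 per second.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

/** Hypothetical lease loop; leaseBatch stands in for the task-queue lease call. */
final class LeaseLoopSketch {
  static final int BATCH_SIZE = 1000; // per-call limit of the task-queue API
  static final int MAX_QPS = 9;       // stay under the 10 QPS lease limit

  /** Reads batches until a short batch signals that the queue is drained. */
  static <T> List<T> leaseAll(IntFunction<List<T>> leaseBatch) {
    List<T> all = new ArrayList<>();
    long minIntervalMillis = 1000 / MAX_QPS + 1; // 112 ms between calls
    while (true) {
      long start = System.currentTimeMillis();
      List<T> batch = leaseBatch.apply(BATCH_SIZE);
      all.addAll(batch);
      if (batch.size() < BATCH_SIZE) {
        return all;
      }
      long elapsed = System.currentTimeMillis() - start;
      if (elapsed < minIntervalMillis) {
        try {
          Thread.sleep(minIntervalMillis - elapsed);
        } catch (InterruptedException e) {
          // Preserve the interrupt and return what has been read so far.
          Thread.currentThread().interrupt();
          return all;
        }
      }
    }
  }
}
```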
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=185218684
When a quota request is rejected, increment the metric counter by one.
Also makes both frontend and backend metrics singletons because all the fields they have are static.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=185146804
SoySyntaxException is an abstract exception type and is never even declared to be thrown (all declarations about this changed about 2 years ago). So places catching it should either change to catch SoyCompilationException, or do nothing and let it propagate.
Tested:
TAP sample presubmit queue
[]
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=185050724
The quota handler terminates connections when quota is exceeded.
The next CL will add instrumentation for quota related metrics.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=185042675
Changes the code to be in compliance with the RDAP Pilot Profile document,
which specifies:
1.4.11. If permitted or required by an ICANN agreement provision, waiver, or Consensus Policy, an RDAP response may contain redacted registrant, administrative, technical and/or other contact information. If any information is redacted, the response MUST include a remarks member with title "Data Policy", type "object truncated due to authorization", a description containing the string "Some of the data in this object has been removed" and a links member with the elements rel:alternate and href indicating where the data policy can be found. An entity with redacted information MUST include the "removed" value in the status element.
We were using the "removed" status to indicate deleted contacts and inactive
registrars. Instead, we will now use "inactive", so that we can use "removed"
to indicate redaction.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=185039201
Every time you run nomulus tool you currently get a bunch of useless output
to the console that looks like this:
---
Feb 08, 2018 3:11:18 PM google.registry.config.YamlUtils mergeYaml
INFO: Successfully loaded environment configuration YAML file.
Feb 08, 2018 3:11:20 PM com.google.wrappers.base.GoogleInit logArgs
INFO: First call to GoogleInit.initialize - removeFlags: false, args: [ProcessUtils, --noinstall_signal_handlers]
Feb 08, 2018 3:11:20 PM com.google.wrappers.base.GoogleInit logArgs
INFO: Subsequent call to GoogleInit.initialize, ignoring - removeFlags: false, args: [SecureWrapperBindings (via google.registry.tools.RegistryTool), --noinstall_signal_handlers]
Feb 08, 2018 3:11:25 PM com.google.monitoring.metrics.MetricRegistryImpl newIncrementableMetric
INFO: Registered new counter: /lock/acquire_lock_requests
Feb 08, 2018 3:11:25 PM com.google.monitoring.metrics.MetricRegistryImpl newEventMetric
INFO: Registered new event metric: /lock/lock_duration
---
This CL fixes that by increasing the console logging threshold from INFO to
WARNING for the relevant paths, for nomulus tool only.
I also had to decrease the logging level of one statement inside YamlUtils
from INFO to FINE, because it was being called by AppEngineConnectionFlags'
constructor in building the HostAndPort server field, which is executed
from the first line of RegistryCli.runCommand(), whereas
loggingParams.configureLogging(), which actually reads in and takes action
on the logging.properties file, isn't called until much later. This is fine
though, because there's little value from logging the statement
"Successfully loaded environment configuration YAML file." every time every
command or flow is executed. We certainly do log errors if that ever fails,
which is the important part.
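For illustration only, the same effect can be shown programmatically with java.util.logging. The change itself works through the logging.properties file; the sketch below swaps in direct setLevel calls, and the logger names are inferred from the sample output above.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical programmatic equivalent; the real change edits logging.properties. */
final class QuietLoggingSketch {
  // Strong references keep the configured loggers from being garbage collected.
  static final Logger METRICS = Logger.getLogger("com.google.monitoring.metrics");
  static final Logger INIT = Logger.getLogger("com.google.wrappers.base");

  /** Raises the threshold so INFO records from these packages are dropped. */
  static void quiet() {
    METRICS.setLevel(Level.WARNING);
    INIT.setLevel(Level.WARNING);
  }
}
```

Child loggers such as com.google.monitoring.metrics.MetricRegistryImpl inherit the WARNING level from the package-level logger.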
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=185036329
The higher the number the better for serious launches. These used to be 100
but had been detuned because instances weren't dying correctly when no longer
needed, thus contributing to higher costs than necessary. That problem was
fixed when we migrated to the Java 8 runtime, however, so there's no reason
not to use the higher number.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=184742738
In publishDomain, we load the subordinate hosts of the domain from Datastore and compare its nameservers to them. For any nameserver that is in-bailiwick, we call publishSubordinateHost on it and stage the A/AAAA records of the host for publication.
This is superior to the old approach of using hostName.endsWith(domainName) to check whether a nameserver is in-bailiwick, because that check mistakes ns.another-example.tld for a subordinate host of example.tld. It is also better than checking hostName.endsWith("." + domainName), which avoids false positives like the one above but falls short in a corner case where the nameserver has been deleted before its superordinate domain's record is updated. In that case, subordinateHosts.contains(hostName) will be false but hostName.endsWith("." + domainName) will still be true.
Note that we still use the suffix check in filterGlueRecords because it is filtering on existing records from Cloud DNS. It is even advantageous to do so because if there were (and there shouldn't be if everything is consistent) any orphaned glue records (suffix matches to the domain, but not actually in its subordinate host list), they would be retained by the filter and therefore be deleted when the staged changes are committed.
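The three checks discussed above behave as follows in a minimal sketch. The class and method names are hypothetical; only the String and Set calls mirror the expressions quoted in the text.

```java
import java.util.Set;

/** Hypothetical comparison of the in-bailiwick checks discussed above. */
final class BailiwickSketch {

  /** Old check: a bare suffix match, which produces false positives. */
  static boolean naiveSuffix(String hostName, String domainName) {
    return hostName.endsWith(domainName);
  }

  /** Suffix match on a label boundary; still true for a deleted host. */
  static boolean dottedSuffix(String hostName, String domainName) {
    return hostName.endsWith("." + domainName);
  }

  /** New check: membership in the domain's subordinate host set. */
  static boolean isSubordinate(String hostName, Set<String> subordinateHosts) {
    return subordinateHosts.contains(hostName);
  }
}
```

For example.tld, naiveSuffix accepts ns.another-example.tld, dottedSuffix rejects it, and only isSubordinate reflects a host that has already been deleted.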
Also fixed a few tests that should have failed had we checked subordinate hosts.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=184732005
See [] for more information.
Created with the tools in []
Tested:
TAP --sample for global presubmit queue
[]
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=184727400
It's been long enough since the format change adding in years that all
registrars should no longer have any IDs in the old format lying around
that they're still attempting to ACK. All poll messages have already been
coming back to registrars with the new format for months now.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=184714735
Previously, CloudDnsWriter used InetAddress.toString() to produce the IPv4/IPv6
address string (e.g. 127.0.0.1 or 0:0:0:0:0:0:0:1) used as an argument to the
Cloud DNS API. However, this fails because InetAddress uses the format
"HostName/IpAddress" for toString(), which uses the empty string as a HostName
if unspecified. This resulted in the erroneous use of a prefix slash (e.g.
"/127.0.0.1") as an InetAddress argument, causing all glue record updates to
fail.
This change replaces InetAddress.toString() with InetAddress.getHostAddress(),
which properly generates the IP address for the InetAddress. This also replaces
a lot of logic in the corresponding test with concrete equivalents, preventing
obvious errors like this from creeping up on us in the future.
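The difference between the two calls is easy to reproduce with the standard library alone. The helper class below is hypothetical; it builds an address from raw bytes so that no host name is attached.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

/** Hypothetical helper showing toString() versus getHostAddress(). */
final class InetAddressSketch {

  /** Builds 127.0.0.1 from raw bytes, which performs no lookup and sets no host name. */
  static InetAddress loopback() {
    try {
      return InetAddress.getByAddress(new byte[] {127, 0, 0, 1});
    } catch (UnknownHostException e) {
      // Only thrown for an array of illegal length, which cannot happen here.
      throw new IllegalStateException(e);
    }
  }

  /** The old, incorrect rendering: "HostName/IpAddress" with an empty host name. */
  static String viaToString(InetAddress address) {
    return address.toString();
  }

  /** The corrected rendering: the textual IP address only. */
  static String viaGetHostAddress(InetAddress address) {
    return address.getHostAddress();
  }
}
```

For the loopback address built this way, toString() yields the string with the leading slash, while getHostAddress() yields the bare address.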
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=184708896
Now that we've verified the new Beam billing pipeline works, we can delete the
old manual commands we used to use.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=184707182