Commit graph

78 commits

Author SHA1 Message Date
weiminyu
98cce20899 Set up domain-registry proxy in Crash environment
- Created configs for Proxy server, GKE, and terraform
- Created sans_list file for use with tarsier client
- Updated allowedClients in registry server

TODO: Update dr-bashrc to support crash environment

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=236659249
2019-03-05 14:25:01 -05:00
jianglai
468808723a Move domain registry terraform configs
We are moving toward using GitHub as the source of truth for the domain registry project (Nomulus). As such, the piper location will soon be deleted, along with the terraform configs. These files are copied to the canonical location []

Note that the files under modules will still be present in the open source code base, as they allow open source users to set up the project quickly. The files under envs are specific to each actual project and are removed entirely from the open source code (they were excluded by MOE before).

Some files are renamed to conform to the newly established terraform code style.

There was a remaining question regarding using latchkey to set up IAM policies that I intend to punt for now. I imagine if we decide to use latchkey, it means that the IAM related terraform configs will be removed for the Annealing setup. However we would still like to leave them in the open source configs so that they remain a one-stop shop for setting up your project.

The automation mode is set to DRYRUN so that there are no accidental changes to our projects during .dev launch. It will be changed back later.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=234838043
2019-02-25 11:13:08 -05:00
jianglai
a85544b3f6 Use gson to make JSON string in proxy log formatter
This is simpler than using fasterxml.jackson.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=224583713
2018-12-12 13:22:34 -05:00
jianglai
57a53db84e Make FOSS proxy treat connections with unknown sources more gracefully
When a connection to the proxy using the PROXY protocol (https://www.haproxy.org/download/1.8/doc/proxy-protocol.txt) comes from an IP address that the external load balancer does not recognize, make the source IP 0.0.0.0. This way an appropriate WHOIS quota can be configured for this kind of connection.
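
A minimal Python sketch of the fallback described above, assuming the version 1 (text) form of the PROXY protocol header. This is not the proxy's actual Java code; the function name `source_ip_from_proxy_header` is hypothetical, and treating the `UNKNOWN` protocol field as the "unrecognized source" case is an assumption.

```python
def source_ip_from_proxy_header(line):
    """Extracts the source IP from a PROXY protocol v1 header line."""
    fields = line.rstrip("\r\n").split(" ")
    if len(fields) < 2 or fields[0] != "PROXY":
        raise ValueError("not a PROXY protocol v1 header")
    if fields[1] == "UNKNOWN":
        # Source not recognized by the load balancer: use a placeholder
        # address so that a dedicated quota can be keyed on it.
        return "0.0.0.0"
    if fields[1] not in ("TCP4", "TCP6") or len(fields) != 6:
        raise ValueError("malformed PROXY protocol v1 header")
    # Layout: PROXY <proto> <src ip> <dst ip> <src port> <dst port>
    return fields[2]
```

All connections of unknown origin then share the single key 0.0.0.0, which is what makes a separate quota for them possible.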

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=224583547
2018-12-12 13:22:34 -05:00
jianglai
3ef8cd692d Add MOE equivalency for 2018-11-05 sync
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=224209323
2018-12-12 13:22:34 -05:00
jianglai
86007622f7 Remove proxy's dependency on config
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=221666668
2018-11-16 16:57:30 -05:00
jianglai
c0239b0a07 Move YamlUtils to be under google.registry.util package
This makes it simpler to package google.registry.util as a separate project in
Gradle that can be depended upon by the proxy package. Currently the proxy
package depends on both google.registry.util and google.registry.config.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=221450085
2018-11-14 12:00:45 -05:00
jianglai
3cfde5d4a1 Fix EPP quota handling bug
We limit the maximum number of concurrent connections that a client can make to the proxy. The quota is implemented as a (thread-safe) map of client certificate hash to available number of connections. When a new connection is made, we decrement the availability counter by one. When the counter hits zero, no more connections can be made and any new connection from the same client is terminated by the proxy.

Currently, the counter is incremented when a connection is terminated, including connections that are terminated *because* the quota is reached (i.e. the connections for which the counter is not decremented because the counter is already zero). This means that the first time the quota is reached, the next connection is dropped, the counter is incremented to 1 and new connections can be made again, bypassing the quota. This process can be repeated to achieve, theoretically, infinite quota.

This CL fixes this bug by only incrementing the counter, upon connection termination, for connections that have decremented the counter in the first place.
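
The fix can be illustrated with a minimal Python sketch. It is not the proxy's actual Java code: `ConnectionQuota`, `try_acquire`, and `release` are hypothetical names. The point is that each connection remembers whether it decremented the counter, and only those connections give a slot back on termination.

```python
import threading


class ConnectionQuota:
    """Tracks available connection slots per client certificate hash."""

    def __init__(self, max_connections):
        self._max = max_connections
        self._available = {}
        self._lock = threading.Lock()

    def try_acquire(self, cert_hash):
        """Returns True and takes a slot if one is free, else False."""
        with self._lock:
            remaining = self._available.get(cert_hash, self._max)
            if remaining == 0:
                return False
            self._available[cert_hash] = remaining - 1
            return True

    def release(self, cert_hash, acquired):
        """Gives a slot back only if this connection took one."""
        if not acquired:
            return  # Rejected connections never decremented the counter.
        with self._lock:
            self._available[cert_hash] += 1

    def available(self, cert_hash):
        with self._lock:
            return self._available.get(cert_hash, self._max)
```

With a quota of 1, a second connection is rejected, and terminating that rejected connection leaves the counter at 0 instead of bumping it back to 1.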

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=217231593
2018-10-17 11:56:04 -04:00
jianglai
ecdbdbca63 Update terraform version constraint
There is no "google_project" resource managed by terraform, so we are not worried about the new terraform binary destroying/re-creating GAE resources.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=217206226
2018-10-17 11:54:34 -04:00
jianglai
7b9d562043 Explicitly set terraform version in preparation for the incoming 1.13.0 update
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=215806094
2018-10-08 16:59:29 -04:00
jianglai
3fc7271145 Move GCP proxy code to the old [] proxy's location
1. Moved code for the GCP proxy to where the [] proxy code used to live.
3. Corrected reference to the GCP proxy location.
4. Misc changes to make ErrorProne and various tools happy.

+diekmann to LGTM terraform whitelist change.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=213630560
2018-09-20 11:19:36 -04:00
mcilwain
a483beef28 Add MOE equivalence for 2018-09-14 sync
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=212996616
2018-09-20 11:19:36 -04:00
jianglai
5e2831b562 Change how access tokens are refreshed
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=212880971
2018-09-14 11:59:39 -04:00
jianglai
ee97d7c2cd Update max pod number to 10
This should not cause any waste as the pods are only scaled up when necessary.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=209881536
2018-09-07 23:59:39 -04:00
jianglai
0065e52d84 Log remote IP when EPP SSL handshake fails
This makes it easy to debug issues when registrars cannot finish SSL
handshake. There are no privacy concerns because we keep a record of the
registrars' IP address in our whitelist anyway.

The remote address attribute is set by the ProxyProtocolHandler, which runs before anything else is done. The GCLB adds the protocol header at the beginning of a stream, so we know that by the time handshake is finished (successful or not), this key must be set.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=209169683
2018-08-20 14:23:40 -04:00
jianglai
782643ce33 Log all exceptions thrown at the end of the pipeline
The RelayHandler is installed at the end of a channel pipeline (both frontend and backend). If it does not log the exception, it will be regarded as an unhandled exception, which shows up in logs, but does not log the corresponding channel.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208984756
2018-08-20 14:20:32 -04:00
jianglai
8a1c99e22b Only log EPP and WHOIS connections
Only connections that have a backend are of interest to us. Move the logging
statement accordingly.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208898433
2018-08-20 14:15:56 -04:00
jianglai
fc75e08061 Change access token validity and remove logging
The access token renewal on GCE is not what we expected. The metadata server always returns the same token as long as it is valid for 1699 to 3599 seconds and rolls over to the next token on its own schedule. Calling refresh on the GoogleCredential has no effect. We were caching the token for 30 min (1800 seconds), so in a rare case where we "refreshed" the token while its expiry is between 1699 and 1800 seconds, we will cache the token for longer than its validity. [] shortened the caching period to 10 min and added logging, which proved to be working. We no longer need the log now that the root cause has been identified. Also changed the cache period to 15 min (900 seconds) which should still be good.
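
The arithmetic behind the chosen cache period can be checked with a small Python sketch. The helper name `max_stale_seconds` is hypothetical; the 1699 second lower bound and the 1800 and 900 second cache periods come from the message above.

```python
def max_stale_seconds(cache_period, min_remaining_validity):
    """Worst-case time a cached token is served after it has expired."""
    return max(0, cache_period - min_remaining_validity)


# Old setting: 30 min cache, but a fetched token may have only 1699 s left.
old = max_stale_seconds(1800, 1699)
# New setting: a 15 min cache is always shorter than the remaining validity.
new = max_stale_seconds(900, 1699)
```

Any cache period of at most 1699 seconds yields a worst case of zero, so both the interim 10 min value and the final 15 min value are safe under this model.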

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208888170
2018-08-20 14:14:21 -04:00
jianglai
301301cafe Remove some unnecessary loggings from the proxy
We confirmed that the retry is working. Instead of logging the messages
themselves, we only need to log the message hash to ensure that the same
message is retried.
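
A minimal Python sketch of logging a fingerprint instead of the payload. The helper name `message_digest`, the choice of SHA-256, and the 12 character truncation are all assumptions for illustration, not what the proxy uses.

```python
import hashlib


def message_digest(message):
    """Short, stable fingerprint of a relayed message for log correlation."""
    return hashlib.sha256(message).hexdigest()[:12]
```

Two log lines carrying the same digest indicate that the same bytes were written twice, which is enough to confirm a retry without exposing message contents.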

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208883712
2018-08-20 14:12:44 -04:00
jianglai
d878f4ba2d Tweak access token refresh time
There's a very rare error where our access token is denied by GAE which happens a couple of seconds a day (if it happens at all). There doesn't seem to be anything wrong on our side, it could be just that the OAuth server is flaky. But to be safe, the refresh period is shortened. Also added logging to confirm what is refreshed. Note that the logging is at FINE level, which only actually writes to the logs in non-production environments.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208823699
2018-08-20 14:07:59 -04:00
jianglai
c5c0051f5e Ensure that no reference counted objects leak memory
The objects stored in the relay buffer may leak memory when they are no longer used. Always remember to release their reference count in all cases.

Also save the relay channel and its name in BackendMetricsHandler when the handler is registered. This is because when retrying a relay, the write is sent as soon as the channel is connected, and the channelActive function is not called yet.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208757730
2018-08-20 14:06:24 -04:00
jianglai
4965478cce Correctly retry relay of reference counted objects
It turns out in the edge case where a write occurs at the same moment that the
relay connection is terminated, the current retry mechanism is not sufficient
because it stores reference counted objects whose internal buffers are already
freed.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208738065
2018-08-20 14:03:18 -04:00
jianglai
2e2898e17c Fix WHOIS issues
[1] Web whois should redirect to www.registry.google. whois.registry.google also points to the proxy IP, so redirecting to whois.registry.google just makes it loop. Also allow HEAD in web whois request in case that is used in monitoring.

[2] Separately, there's a bug introduced in [] where exception handling of inbound messages is moved to HttpsRelayServiceHandler. However the quota handlers are installed behind the HttpsRelayServiceHandler in the channel pipeline, therefore the exception thrown in quota handlers never got processed. This results in a hung connection when the quota is exceeded.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208651011
2018-08-20 14:00:08 -04:00
jianglai
0e64015cdf Improve logs in the GCP proxy
Tweaked a few logging levels to not spam error level logs. Also make it easy to debug issues in case relay retry fails.

[1] Put non-fatal exceptions that should be logged at warning in their explicit sets. Also always use the root cause to determine if an exception is non-fatal, because sometimes the actual causes are wrapped inside other exceptions.

[2] Record the cause of a relay failure, and record if a relay retry is successful. This way we can look at the log and figure out if a relay is eventually successful.

[3] Add a log when the frontend connection from the client is terminated.

[4] Always close the relay channel when a relay has failed, which, depending on whether the channel is frontend or backend, will reconnect and trigger a retry.

[5] Lastly changed failure test to use assertThrows instead of fail.
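
Item [1] above can be sketched in Python as follows. This is an illustration only: the proxy is written in Java, and `root_cause`, `log_level_for`, and the contents of `NON_FATAL` are hypothetical.

```python
def root_cause(exc):
    """Follows the __cause__ chain to the innermost exception."""
    seen = set()
    while exc.__cause__ is not None and id(exc) not in seen:
        seen.add(id(exc))  # Guard against a cyclic cause chain.
        exc = exc.__cause__
    return exc


# Exception types that should be logged at warning instead of error.
NON_FATAL = (TimeoutError, ConnectionResetError)


def log_level_for(exc):
    """Chooses the log level from the root cause, not the wrapper."""
    return "WARNING" if isinstance(root_cause(exc), NON_FATAL) else "ERROR"
```

Classifying by the wrapper type alone would log a wrapped timeout at error level, which is the log spam the change is meant to avoid.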

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208649916
2018-08-20 13:58:30 -04:00
jianglai
f554ace51b Log non-200 response at warning
The previous CL had a bug, as non-200 responses are outbound errors and are not caught in the exceptionCaught() method.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208063877
2018-08-10 13:46:48 -04:00
jianglai
58e68db386 Update Netty version
This seems to fix the FOSS test timeout.

Also use the static-linked netty-tcnative library in tests to ensure that
OpenSSL provider is always available in tests. In production, we should use
the dynamic-linked version to reduce binary footprint and rely on the system
OpenSSL library.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208057173
2018-08-10 13:46:48 -04:00
jianglai
9eec70729f Refine tests in GCP proxy
Previously the SSL initializer tests always used the JDK provider, which is not really testing what happens in production when we take advantage of the OpenSSL provider. Now the tests will run with all providers that are available (through JUnit parameterization). Some bugs that may cause flakiness are fixed in the process.

Change how SNI is verified in tests. It turns out that the old method (only verifying the SSL parameters in the SSL engine) does not actually ensure that the SNI address is sent to the peer, but only that the SSL engine is configured to send it (this value exists even before a handshake is performed). Also there's likely a bug in Netty's SSL engine that does not set this parameter when created with a peer host.

Lastly HTTP test utils are changed so that they do not use pre-defined constants for header names and values. We want the test to confirm that these constants are what we expect they are. Using string literals makes these tests also more explicit.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=207930282
2018-08-10 13:46:48 -04:00
jianglai
6810e959f9 Refine logs in the proxy
[1] All logs should contain a reference to the channel so that it is easy to search for logs about a specific channel.

[2] EPP SSL handshake failure should be logged at warning. It is mostly the client that failed to complete the handshake, for example by sending a bad cert, or not sending a cert, or not using the correct SSL version. We should not log it at error and spam the log.

[3] When the EPP response is not 200, we should not log at error because it means that the GAE app responded successfully. For example when datastore contention occurs, app engine responds with a non-200 status and logs at warning. The proxy should not log at a higher level than app engine itself.

[4] Timeout is a non-fatal error that should be logged at warning.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=207562299
2018-08-10 13:46:48 -04:00
jianglai
4ff77fb370 Automatic reconnect to GAE when the connection is dropped
The connection to GAE is not persistent and can drop. Reconnect when that happens, as long as the connection from the client is still active.

We need to consider the fact that while a reconnection is happening, the client may be sending requests that were relayed to the old connection, which is not going through. In that case these requests are queued and will be retried when the new connection is available.

Since we are no longer tying the lifecycles of the two connections, we cannot automatically terminate one when another is terminated. Also we need to explicitly control how WHOIS connection is terminated, not depending on the HTTP connection header.
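
The queue-and-retry behavior described above can be sketched in Python. This is not the proxy's Netty code: the `Relay` class and its callback names are hypothetical, and reference counting of buffers is ignored here.

```python
from collections import deque


class Relay:
    """Queues messages while the backend connection is unavailable."""

    def __init__(self):
        self._pending = deque()
        self._send = None  # Callable bound to the current connection.

    def on_connected(self, send):
        """A new backend connection is ready: flush queued messages in order."""
        self._send = send
        while self._pending:
            send(self._pending.popleft())

    def on_write_failed(self, message):
        """A write to the old connection did not go through: retry later."""
        self._pending.append(message)
        self._send = None

    def write(self, message):
        if self._send is None:
            self._pending.append(message)
        else:
            self._send(message)
```

Messages written while no connection is available are held back, and a message whose write failed is replayed ahead of anything queued after it.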

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=207335498
2018-08-10 13:46:48 -04:00
jianglai
3f55216b21 Reduce web WHOIS error log level to warning
There's not much we can do when the user sends incorrect HTTP requests or cannot finish SSL handshake (the problematic requests are likely from bots anyway). Reducing the log level to warning in order to reduce spamming.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=207159118
2018-08-10 13:46:48 -04:00
jianglai
8664101687 Make web WHOIS more resilient to malformed requests
We are seeing some web WHOIS HTTP(S) requests made to our endpoints without the Host header specified. This is an error according to the HTTP/1.1 spec. However we do not want to spam our logs with errors that are outside of our control. Do not throw and return a 400 response instead.

Also re-worked the logic a bit to only return HSTS headers if we send a redirect response, not any other error responses. The tests are rearranged to correspond with the logical flow in the code.
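
A minimal Python sketch of the request handling described above. It is an illustration under assumptions, not the proxy's handler: the function name, the 301 status code, the HSTS max-age value, the placeholder redirect host, and the choice to attach HSTS only to the HTTPS redirect are all hypothetical.

```python
def web_whois_response(headers, is_https, redirect_host="www.example.test"):
    """Returns (status, response_headers) for a web WHOIS request."""
    host = headers.get("host")
    if not host:
        # A missing Host header violates HTTP/1.1. Answer with 400 instead
        # of raising, so that the error does not spam the logs.
        return 400, {}
    if is_https:
        # HSTS is attached to the redirect only, never to error responses.
        return 301, {
            "location": "https://%s/" % redirect_host,
            "strict-transport-security": "max-age=31536000",
        }
    # Plain HTTP is sent to the HTTPS version of the same host.
    return 301, {"location": "https://%s/" % host}
```

Returning a value for the malformed case keeps the error path out of exception handling entirely, which matches the intent of not throwing.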

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=207143230
2018-08-10 13:46:48 -04:00
jianglai
628aacd754 Cache server certificates for up to 30 min
The server certificates and corresponding keys are encrypted by KMS and stored on GCS. This allows us to easily replace expiring certs without having to roll out a new proxy release. However currently the certificate is obtained as a singleton and used in all connections served by a proxy instance. This means that if we were to upload a new cert, all existing instances will not use it.

This CL makes it so that we only cache the certificate for 30 min, after which a new cert is fetched and decrypted. Local certificates used for testing are still singletons.
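
The caching scheme can be sketched in Python as an expiring supplier. `ExpiringSupplier` and its parameters are hypothetical names; the sketch is comparable in spirit to Guava's `Suppliers.memoizeWithExpiration`, although the message does not say which mechanism the proxy uses.

```python
import time


class ExpiringSupplier:
    """Caches the result of `loader` for at most `ttl_seconds`."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock  # Injectable so that tests can control time.
        self._value = None
        self._expires_at = None

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # First call or cache expired: fetch (and decrypt) again.
            self._value = self._loader()
            self._expires_at = now + self._ttl
        return self._value
```

With a TTL of 30 min, a newly uploaded certificate is picked up by every running instance within half an hour, without a new release.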

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=206976318
2018-08-10 13:46:48 -04:00
jianglai
4a5b317016 Add web WHOIS redirect support
Opened two ports (30010 and 30011 by default) that handle HTTP(S) GET requests. The HTTP request is redirected to the corresponding HTTPS site, whereas the HTTPS request is redirected to a site that supports web WHOIS.

The GCLB currently exposes port 443, but not port 80, on its TCP proxy load balancer (see https://cloud.google.com/load-balancing/docs/choosing-load-balancer). As a result, the HTTP traffic has to be routed by the HTTP load balancer, which requires a separate HTTP health check (as opposed to the TCP health check that the TCP proxy LB uses). This CL also adds support for the HTTP health check.

There is not a strong case for adding an end-to-end test for WebWhoisProtocolsModule (like those for EppProtocolModule, etc) as it just assembles standard HTTP codecs used for an HTTP server, plus the WebWhoisRedirectHandler, which is tested. The end-to-end test would just be testing if the Netty provided HTTP handlers correctly parse raw HTTP messages.

Several other small improvements are also included:

[1] Use setInt other than set when setting content length in HTTP headers. I don't think it is necessary, but it is nevertheless a better practice to use a more specialized setter.
[2] Do not write metrics when running locally.
[3] Rename the qualifier @EppCertificates to @ServerCertificates as it now provides the certificate used in HTTPS traffic as well.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=206944843
2018-08-10 13:46:48 -04:00
jianglai
0e62270f54 Set up GCLB to route web WHOIS traffic
We need to support web WHOIS on the same IP addresses that we use for port 43 WHOIS. [] added support for HTTP(S) traffic on the proxy, which simply redirects to another website that actually hosts the web WHOIS service. This CL sets up the GCLB to route port 80 and port 443 traffic to the proxy.

We were using the TCP proxy load balancer for other protocols that we support (EPP and WHOIS), but the TCP proxy LB only exposes port 443, not port 80. For port 443, we simply follow the same pattern and add another TCP proxy LB. For port 80, we had to use the HTTP LB which exposes port 80 (on the same external IP addresses). This requires a different HTTP health check and a URL map. The added URL map is a dummy one that routes all paths to the same backend service that supports HTTP redirect.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=206409007
2018-08-10 13:44:25 -04:00
jianglai
030e2f4dd3 Do not explicitly depend on latest GKE version
When versions are explicitly set to the latest available version, Annealing almost always fails to apply the patch due to yet-unknown reasons. The rationale for setting the versions explicitly was to ensure that the clusters are always updated in time. But it seems like it is not worth the trouble.

Without the explicit latest versions, the master should still be upgraded automatically (though not necessarily immediately after a version becomes available):

https://cloud.google.com/kubernetes-engine/versioning-and-upgrades#automatic_master_upgrades

We also set "Auto Upgrade" on the nodes, which should upgrade the nodes to the master version (though not necessarily immediately after the master is upgraded).

So it seems without these lines, we can still expect the gke versions of the cluster to upgrade (eventually).

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=206408347
2018-08-10 13:44:25 -04:00
jianglai
8f5be6e7a8 Make some minor changes to logging messages and test names.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=205464581
2018-08-10 13:44:25 -04:00
jianglai
35110927d6 Add configs for production GCP proxy
This also introduces a production canary environment, similar to sandbox canary. The docker tags are changed to "live" and "sandbox" respectively, to reflect the fact that different images may be used for prod and sandbox.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=204343530
2018-07-14 01:37:03 -04:00
jianglai
6ca28386cd Store encrypted file in Base64 encoding
It is better to store it ASCII armored so that it can be easily diffed to see
if a file has changed

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=200045488
2018-06-18 17:53:11 -04:00
jianglai
db60f0fd12 Create canary records in proxy zones
This allows for the creation of records like epp-canary.registry.google.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=199850436
2018-06-18 17:50:15 -04:00
jianglai
61f6e666b1 Enforce no logging in production environment
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=199156367
2018-06-06 15:10:15 -04:00
jianglai
3960207502 Log source IP when logging is enabled
We will only enable logging for non-production environments, so there shouldn't be any privacy concerns with enabling this.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=198744739
2018-06-06 15:02:31 -04:00
jianglai
af8b050446 Tweak log message a bit
SERVER and CLIENT are a bit hard to understand.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=198721870
2018-06-06 15:01:00 -04:00
jianglai
65ac28fae5 Increase GKE cluster upgrade timeout to 30m
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=198322158
2018-05-30 12:18:54 -04:00
jianglai
a5abb05761 Migrating to fluent logging (red)
This is a 'red' Flogger migration CL. Red CLs contain changes which are
likely not to work without manual intervention.

Note that it may not even be possible to directly migrate the logger
usage in this CL to the Flogger API and some additional refactoring may
be required. If this is the case, please note that it should be safe to
submit any outstanding 'green' and 'yellow' CLs prior to tackling this.

If you feel that your use case is not covered by the existing Flogger API
please raise a feature request at [] and
revert this CL.

For more information, see []
Base CL: 197331037

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=197503952
2018-05-30 12:18:54 -04:00
jianglai
05f166918f Migrating to fluent logging (green)
This is a 'green' Flogger migration CL. Green CLs are intended to be as
safe as possible and should be easy to review and submit.

No changes should be necessary to the code itself prior to submission,
but small changes to BUILD files may be required.

Changes within files are completely independent of each other, so this CL
can be safely split up for review using tools such as Rosie.

For more information, see []
Base CL: 197331037

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=197466715
2018-05-30 12:18:54 -04:00
jianglai
0cb303ed7f Fix proxy metrics instrumentation bug
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=197209531
2018-05-30 12:18:54 -04:00
jianglai
68b24f0a54 Migrate to internal FormattingLogger in GCP proxy in preparation of migration to Flogger
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=197199265
2018-05-30 12:18:54 -04:00
jianglai
053c52e0bd Add Flogger to GCP proxy
This adds a dummy flogger logging statement in the GCP proxy to ensure that it
works.

TESTED=Deployed to alpha and verified that flogger works. Also passed FOSS
tests.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=196899036
2018-05-30 12:18:54 -04:00
jianglai
1248a7722b Enable logging in sandbox GCP proxies
This makes it easier to debug issues. There are no privacy concerns in sandbox.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=197045576
2018-05-17 21:52:35 -04:00
jianglai
0fb845e81a Remove no quota leased warning from quota handler inactive callback
When EPP SSL handshake is unsuccessful, #channelInactive is called but there are no quotas to return, because quotas are only leased upon the first #channelRead. There is no need to log a warning and throw an exception in this case because the handshake exception would have been thrown already. Throwing a second exception just crowds the log.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=197016756
2018-05-17 21:52:35 -04:00