We limit the maximum number of concurrent connections that a client can make to the proxy. The quota is implemented as a (thread-safe) map from client certificate hash to the number of connections still available. When a new connection is made, we decrement the availability counter by one. When the counter hits zero, no more connections can be made and any new connection from the same client is terminated by the proxy.
Currently, the counter is incremented whenever a connection is terminated, including connections that are terminated *because* the quota is reached (i.e. connections for which the counter was never decremented, since it was already zero). This means that the first time the quota is reached, the next connection is dropped, the counter is incremented to 1, and new connections can be made again, bypassing the quota. This process can be repeated to obtain, in theory, an infinite quota.
This CL fixes this bug by only incrementing the counter, upon connection termination, for connections that have decremented the counter in the first place.
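A minimal sketch of the corrected accounting, assuming a thread-safe map keyed by certificate hash (class and method names are illustrative, not the proxy's actual API):

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    /** Illustrative quota accounting: only release what was actually acquired. */
    final class ConnectionQuota {
      private final ConcurrentHashMap<String, AtomicInteger> available = new ConcurrentHashMap<>();
      private final int maxConnections;

      ConnectionQuota(int maxConnections) {
        this.maxConnections = maxConnections;
      }

      /** Returns true if a slot was taken; the caller must call release() only if this was true. */
      boolean tryAcquire(String certHash) {
        AtomicInteger counter =
            available.computeIfAbsent(certHash, h -> new AtomicInteger(maxConnections));
        while (true) {
          int current = counter.get();
          if (current == 0) {
            return false; // Quota reached: do not decrement, and never increment for this connection.
          }
          if (counter.compareAndSet(current, current - 1)) {
            return true;
          }
        }
      }

      /** Called on connection termination, only for connections whose tryAcquire() returned true. */
      void release(String certHash) {
        available.get(certHash).incrementAndGet();
      }
    }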
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=217231593
1. Moved code for the GCP proxy to where the [] proxy code used to live.
3. Corrected reference to the GCP proxy location.
4. Misc changes to make ErrorProne and various tools happy.
+diekmann to LGTM terraform whitelist change.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=213630560
This makes it easy to debug issues when registrars cannot finish the SSL handshake. There are no privacy concerns because we keep a record of the registrars' IP addresses in our whitelist anyway.
The remote address attribute is set by the ProxyProtocolHandler, which runs before anything else. The GCLB adds the protocol header at the beginning of a stream, so we know that by the time the handshake finishes (successfully or not), this key must be set.
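As an illustration of the mechanism (the key name and helper class are hypothetical, not the proxy's actual constants), the remote address can be stashed on the channel as a Netty attribute and read back when logging handshake failures:

    import io.netty.channel.Channel;
    import io.netty.util.AttributeKey;

    /** Illustrative helper for stashing the client IP on the channel. */
    final class RemoteAddress {
      // Hypothetical key name; the proxy defines its own constant.
      static final AttributeKey<String> REMOTE_ADDRESS_KEY = AttributeKey.valueOf("REMOTE_ADDRESS");

      /** Called by the handler that parses the proxy protocol header, before the TLS handshake. */
      static void record(Channel channel, String sourceIp) {
        channel.attr(REMOTE_ADDRESS_KEY).set(sourceIp);
      }

      /** Called when logging an SSL handshake failure. */
      static String read(Channel channel) {
        return channel.attr(REMOTE_ADDRESS_KEY).get();
      }
    }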
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=209169683
The RelayHandler is installed at the end of a channel pipeline (both frontend and backend). If it does not log the exception, it will be regarded as an unhandled exception, which shows up in the logs but without the corresponding channel.
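A sketch of the idea, assuming Flogger; the handler name is illustrative:

    import com.google.common.flogger.FluentLogger;
    import io.netty.channel.ChannelDuplexHandler;
    import io.netty.channel.ChannelHandlerContext;

    /** Illustrative tail handler that logs the exception together with its channel. */
    final class RelayExceptionLogger extends ChannelDuplexHandler {
      private static final FluentLogger logger = FluentLogger.forEnclosingClass();

      @Override
      public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
        // Including the channel makes it possible to correlate the failure with a connection.
        logger.atWarning().withCause(cause).log("Exception caught on channel %s", ctx.channel());
        ctx.close();
      }
    }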
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208984756
The access token renewal on GCE does not behave the way we expected. The metadata server always returns the same token as long as it is valid for another 1699 to 3599 seconds, and rolls over to the next token on its own schedule; calling refresh on the GoogleCredential has no effect. We were caching the token for 30 min (1800 seconds), so in the rare case where we "refreshed" the token while its remaining validity was between 1699 and 1800 seconds, we would cache the token for longer than its validity. [] shortened the caching period to 10 min and added logging, which proved to work. We no longer need the logging now that the root cause has been identified. Also changed the cache period to 15 min (900 seconds), which should still be good.
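A minimal sketch of the 15-minute caching, assuming Guava's memoizing supplier (the actual token fetch against the metadata server is elided):

    import com.google.common.base.Supplier;
    import com.google.common.base.Suppliers;
    import java.util.concurrent.TimeUnit;

    /** Illustrative cache: re-fetch the access token at most every 15 minutes. */
    final class AccessTokenCache {
      private final Supplier<String> cachedToken =
          Suppliers.memoizeWithExpiration(this::fetchToken, 15, TimeUnit.MINUTES);

      String get() {
        return cachedToken.get();
      }

      private String fetchToken() {
        // Placeholder: the real proxy obtains the token from the GCE metadata server.
        throw new UnsupportedOperationException("not implemented in this sketch");
      }
    }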
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208888170
We confirmed that the retry is working. Instead of logging the messages themselves, we only need to log the message hash to ensure that the same message is retried.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208883712
The objects stored in the relay buffer may leak memory when they are no longer used. Always remember to release their reference count in all cases.
Also save the relay channel and its name in BackendMetricsHandler when the handler is registered. This is because when retrying a relay, the write is sent as soon as the channel is connected, before channelActive has been called.
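A sketch of the release discipline for the buffered relay objects (the buffer type here is illustrative):

    import io.netty.util.ReferenceCountUtil;
    import java.util.Queue;

    /** Illustrative cleanup: drain a relay buffer and release every reference-counted object. */
    final class RelayBufferCleanup {
      static void drain(Queue<Object> relayBuffer) {
        Object msg;
        while ((msg = relayBuffer.poll()) != null) {
          // release() decrements the reference count and is a no-op for non-ref-counted objects.
          ReferenceCountUtil.release(msg);
        }
      }
    }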
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208757730
It turns out that in the edge case where a write occurs at the same moment the relay connection is terminated, the current retry mechanism is not sufficient, because it stores reference-counted objects whose internal buffers have already been freed.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208738065
[1] Web WHOIS should redirect to www.registry.google. whois.registry.google also points to the proxy IP, so redirecting to whois.registry.google would just create a loop. Also allow HEAD in web WHOIS requests, in case it is used in monitoring.
[2] Separately, there is a bug introduced in [] where exception handling of inbound messages was moved to HttpsRelayServiceHandler. However, the quota handlers are installed behind the HttpsRelayServiceHandler in the channel pipeline, so exceptions thrown in the quota handlers were never processed. This results in a hung connection when the quota is exceeded.
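The pipeline-order issue in [2] can be demonstrated with a self-contained Netty snippet (handler bodies are illustrative): an inbound exception only reaches handlers installed after the one that threw it, so a catcher installed earlier never sees it.

    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import io.netty.channel.embedded.EmbeddedChannel;

    /** Demonstrates that inbound exceptions only travel towards the tail of the pipeline. */
    final class PipelineOrderDemo {
      public static void main(String[] args) {
        EmbeddedChannel channel =
            new EmbeddedChannel(
                // Installed first (like HttpsRelayServiceHandler): never sees the exception below.
                new ChannelInboundHandlerAdapter() {
                  @Override
                  public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
                    System.out.println("early handler caught: " + cause.getMessage());
                  }
                },
                // Installed later (like a quota handler): throws on read.
                new ChannelInboundHandlerAdapter() {
                  @Override
                  public void channelRead(ChannelHandlerContext ctx, Object msg) {
                    throw new RuntimeException("quota exceeded");
                  }
                },
                // Only a handler installed after the thrower catches the exception.
                new ChannelInboundHandlerAdapter() {
                  @Override
                  public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause) {
                    System.out.println("late handler caught: " + cause.getMessage());
                  }
                });
        channel.writeInbound("hello"); // Prints only "late handler caught: quota exceeded".
      }
    }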
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208651011
Tweaked a few logging levels so we do not spam error-level logs. Also made it easier to debug issues in case a relay retry fails.
[1] Put non-fatal exceptions that should be logged at warning in their own explicit sets. Also always use the root cause to determine whether an exception is non-fatal, because the actual cause is sometimes wrapped inside other exceptions (see the sketch after this list).
[2] Record the cause of a relay failure, and record whether a relay retry is successful. This way we can look at the log and figure out whether a relay eventually succeeded.
[3] Add a log when the frontend connection from the client is terminated.
[4] Always close the relay channel when a relay has failed, which, depending on whether the channel is frontend or backend, will reconnect and trigger a retry.
[5] Lastly changed failure test to use assertThrows instead of fail.
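A sketch of the root-cause check from [1]; the exact membership of the non-fatal set is illustrative:

    import com.google.common.base.Throwables;
    import com.google.common.collect.ImmutableSet;
    import java.io.IOException;
    import javax.net.ssl.SSLHandshakeException;

    /** Illustrative classification of exceptions by their root cause. */
    final class NonFatalExceptions {
      private static final ImmutableSet<Class<? extends Throwable>> NON_FATAL =
          ImmutableSet.of(SSLHandshakeException.class, IOException.class);

      static boolean isNonFatal(Throwable exception) {
        // The interesting exception is often wrapped (e.g. in a DecoderException) by the time it
        // reaches the end of the pipeline, so classify by the root cause.
        Throwable rootCause = Throwables.getRootCause(exception);
        return NON_FATAL.stream().anyMatch(clazz -> clazz.isInstance(rootCause));
      }
    }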
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208649916
The previous CL had a bug: non-200 responses are outbound errors and are not caught in the exceptionCaught() method.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=208063877
Previously the SSL initializer tests always used the JDK provider, which does not really test what happens in production, where we take advantage of the OpenSSL provider. Now the tests run with all available providers (through JUnit parameterization). Some bugs that may cause flakiness are fixed in the process.
Changed how SNI is verified in tests. It turns out that the old method (only verifying the SSL parameters in the SSL engine) does not actually ensure that the SNI address is sent to the peer, only that the SSL engine is configured to send it (this value exists even before a handshake is performed). Also, there is likely a bug in Netty's SSL engine that does not set this parameter when the engine is created with a peer host.
Lastly, the HTTP test utils are changed so that they do not use predefined constants for header names and values. We want the tests to confirm that these constants are what we expect them to be. Using string literals also makes these tests more explicit.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=207930282
[1] All logs should contain a reference to the channel so that it is easy to search for logs about a specific channel.
[2] EPP SSL handshake failures should be logged at warning. It is mostly the client that failed to complete the handshake, for example by sending a bad cert, not sending a cert, or not using the correct SSL version. We should not log it at error and spam the log.
[3] When the EPP response is not 200, we should not log at error, because it means that the GAE app responded successfully. For example, when datastore contention occurs, App Engine responds with a non-200 status and logs at warning. The proxy should not log at a higher level than App Engine itself.
[4] Timeout is a non-fatal error that should be logged at warning.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=207562299
The connection to GAE is not persistent and can drop. Reconnect when that happens, as long as the connection from the client is still active.
We need to consider the fact that while a reconnection is happening, the client may be sending requests that were relayed to the old connection and did not go through. In that case these requests are queued and retried when the new connection is available.
Since we no longer tie the lifecycles of the two connections together, we cannot automatically terminate one when the other is terminated. We also need to explicitly control how the WHOIS connection is terminated, rather than depending on the HTTP connection header.
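A minimal sketch of the queue-and-retry idea (class and method names are illustrative; the real handlers do this inside the Netty event loop):

    import io.netty.channel.Channel;
    import java.util.ArrayDeque;
    import java.util.Queue;

    /** Illustrative relay: queue requests while the backend reconnects, flush when it is back. */
    final class BackendRelay {
      private final Queue<Object> pendingRequests = new ArrayDeque<>();
      private Channel backendChannel; // Null or inactive while a reconnection is in progress.

      /** Called for every request read from the client connection. */
      void relay(Object request) {
        if (backendChannel != null && backendChannel.isActive()) {
          backendChannel.writeAndFlush(request);
        } else {
          pendingRequests.add(request); // Retried once the new connection is available.
        }
      }

      /** Called when a new backend connection has been established. */
      void onBackendConnected(Channel newChannel) {
        backendChannel = newChannel;
        Object request;
        while ((request = pendingRequests.poll()) != null) {
          backendChannel.writeAndFlush(request);
        }
      }
    }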
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=207335498
There's not much we can do when the user sends incorrect HTTP requests or cannot finish the SSL handshake (the problematic requests are likely from bots anyway). Reducing the log level to warning in order to reduce spam.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=207159118
We are seeing some web WHOIS HTTP(S) requests made to our endpoints without the Host header specified. This is an error according to the HTTP/1.1 spec, but we do not want to spam our logs with errors that are outside of our control. Do not throw; return a 400 response instead.
Also reworked the logic a bit to only return HSTS headers when we send a redirect response, not with other error responses. The tests are rearranged to correspond with the logical flow in the code.
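A sketch of the Host-header check (handler wiring omitted; names are illustrative):

    import io.netty.handler.codec.http.DefaultFullHttpResponse;
    import io.netty.handler.codec.http.FullHttpResponse;
    import io.netty.handler.codec.http.HttpHeaderNames;
    import io.netty.handler.codec.http.HttpRequest;
    import io.netty.handler.codec.http.HttpResponseStatus;
    import io.netty.handler.codec.http.HttpVersion;

    /** Illustrative check: answer 400 instead of throwing when the Host header is missing. */
    final class HostHeaderCheck {
      /** Returns a 400 response if the Host header is missing, or null if the request is fine. */
      static FullHttpResponse checkHost(HttpRequest request) {
        String host = request.headers().get(HttpHeaderNames.HOST);
        if (host == null || host.isEmpty()) {
          // Malformed per HTTP/1.1, but not worth an error log; no HSTS header on this path.
          return new DefaultFullHttpResponse(HttpVersion.HTTP_1_1, HttpResponseStatus.BAD_REQUEST);
        }
        return null; // Normal redirect handling (including HSTS on redirects) continues.
      }
    }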
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=207143230
The server certificates and corresponding keys are encrypted by KMS and stored on GCS. This allows us to easily replace expiring certs without having to roll out a new proxy release. However, currently the certificate is obtained as a singleton and used in all connections served by a proxy instance. This means that if we were to upload a new cert, existing instances would not use it.
This CL makes it so that we only cache the certificate for 30 min, after which a new cert is fetched and decrypted. Local certificates used for testing are still singletons.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=206976318
Opened two ports (30010 and 30011 by default) that handle HTTP(S) GET requests. The HTTP request is redirected to the corresponding HTTPS site, whereas the HTTPS request is redirected to a site that supports web WHOIS.
The GCLB currently exposes port 80, but not port 443 on its TCP proxy load balancer (see https://cloud.google.com/load-balancing/docs/choosing-load-balancer). As a result, the HTTP traffic has to be routed by the HTTP load balancer, which requires a separate HTTP health check (as opposed to the TCP health check that the TCP proxy LB uses). This CL also added support for HTTP health check.
There is not a strong case for adding an end-to-end test for WebWhoisProtocolsModule (like those for EppProtocolModule, etc) as it just assembles standard HTTP codecs used for an HTTP server, plus the WebWhoisRedirectHandler, which is tested. The end-to-end test would just be testing if the Netty provided HTTP handlers correctly parse raw HTTP messages.
Several other small improvements are also included:
[1] Use setInt rather than set when setting the content length in HTTP headers. I don't think it is strictly necessary, but it is nevertheless better practice to use the more specialized setter (see the sketch after this list).
[2] Do not write metrics when running locally.
[3] Rename the qualifier @EppCertificates to @ServerCertificate, as it now provides the certificate used for HTTPS traffic as well.
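A sketch of item [1] (the response construction around the setter is illustrative):

    import io.netty.buffer.ByteBuf;
    import io.netty.buffer.Unpooled;
    import io.netty.handler.codec.http.DefaultFullHttpResponse;
    import io.netty.handler.codec.http.FullHttpResponse;
    import io.netty.handler.codec.http.HttpHeaderNames;
    import io.netty.handler.codec.http.HttpResponseStatus;
    import io.netty.handler.codec.http.HttpVersion;
    import java.nio.charset.StandardCharsets;

    /** Illustrative response builder using the typed setInt for the content length. */
    final class ResponseBuilder {
      static FullHttpResponse ok(String body) {
        ByteBuf content = Unpooled.copiedBuffer(body, StandardCharsets.UTF_8);
        FullHttpResponse response =
            new DefaultFullHttpResponse(HttpVersion.HTTP_1_1, HttpResponseStatus.OK, content);
        // setInt rather than set: the more specialized setter for a numeric header value.
        response.headers().setInt(HttpHeaderNames.CONTENT_LENGTH, content.readableBytes());
        return response;
      }
    }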
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=206944843
This is a 'green' Flogger migration CL. Green CLs are intended to be as
safe as possible and should be easy to review and submit.
No changes should be necessary to the code itself prior to submission,
but small changes to BUILD files may be required.
Changes within files are completely independent of each other, so this CL
can be safely split up for review using tools such as Rosie.
For more information, see []
Base CL: 197331037
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=197466715
When the EPP SSL handshake is unsuccessful, #channelInactive is called, but there are no quotas to return because quotas are only leased upon the first #channelRead. There is no need to log a warning and throw an exception in this case, because the handshake exception would have been thrown already; throwing a second exception just crowds the log.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=197016756
This allows us to not obtain a certificate and encrypt it with KMS when running the proxy locally during development.
Also updated FOSS build dagger version.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=191746309
If the proxy protocol header contains a malformed string, such as "PROXY UNKNOWN", instead of throwing and killing the connection, use the TCP source IP as the remote IP.
Also changed how the header is read from the buffer, to avoid a potential Netty resource leak. Originally the header was read into another ByteBuf, which needs to be explicitly released for Netty to reclaim its memory (http://netty.io/wiki/reference-counted-objects.html). Now we just read it into a byte array and let the JVM GC it.
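A sketch of the header read (framing and header parsing omitted; names are illustrative):

    import io.netty.buffer.ByteBuf;
    import java.nio.charset.StandardCharsets;

    /** Illustrative read: copy the header into a byte array so there is nothing to release. */
    final class HeaderReader {
      static String readHeader(ByteBuf in, int headerLength) {
        byte[] headerBytes = new byte[headerLength];
        in.readBytes(headerBytes); // No extra ByteBuf is allocated; the array is plain GC-managed memory.
        return new String(headerBytes, StandardCharsets.US_ASCII);
      }
    }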
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=188047084
When not running locally, the logging formatter is set to convert the log record to a single-line JSON string that the Stackdriver logging agent running in GKE will pick up and parse correctly.
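A minimal sketch of such a formatter, assuming java.util.logging (the field names are illustrative; real code would also escape newlines and other control characters):

    import java.util.logging.Formatter;
    import java.util.logging.LogRecord;

    /** Illustrative formatter: one JSON object per line, parseable by the logging agent. */
    final class SingleLineJsonFormatter extends Formatter {
      @Override
      public String format(LogRecord record) {
        return String.format(
            "{\"severity\":\"%s\",\"message\":\"%s\"}%n",
            record.getLevel(), formatMessage(record).replace("\"", "\\\""));
      }
    }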
Also removed the redundant logging handler in the proxy frontend connection. It had two problems: 1) it is possible to leak PII when all frontend traffic is logged, such as client IPs, even though this is less of a concern because the GCP TCP proxy load balancer masquerades source IPs; 2) we are only logging the HTTP request/response that the frontend connection sends to/receives from the backend connection, but the backend already has its own logging handler for the same messages it gets from/sends to the GAE app, so the logging in the frontend connection does not really add extra information.
Logging of some potentially sensitive information (PII), such as the source IP of a proxied connection, is also removed.
Thirdly, added a k8s autoscaling object that scales the containers based on CPU load. The default target load is 80%. This, in conjunction with GKE cluster VM autoscaling, means that when traffic is low we will only have one VM running one container of the proxy.
Fixes a bug where the MetricsComponent generates a separate ProxyConfig that does not call the parse method on the command line args passed in, resulting in the default Environment always being used when constructing the metric reporter.
Lastly, a little cleanup of the MOE config script: no newlines are necessary as the BUILD files are formatted after string substitution.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=188029019
When a quota request is rejected, increment the metric counter by one.
Also makes both frontend and backend metrics singletons, because all the fields they have are static.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=185146804
The quota handler terminates connections when quota is exceeded.
The next CL will add instrumentation for quota related metrics.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=185042675
Dagger updated to 2.13, along with all its dependencies.
Also allows us to have multiple config files for different environments (prod, sandbox, alpha, local, etc.) and to specify which one to use on the command line with an --env flag. Therefore the same binary can be used in all environments.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=176551289