'afterFinalFailure' is called just before the retrier rethrows an error it
isn't going to retry. This can happen either because the exception shouldn't be
retried, or because we exceeded the maximum number of retries.
The same thing can be done by catching that thrown error outside of the
retrier:
retrier.callWithRetry(
    callable,
    new FailureReporter() {
      @Override
      public void afterFinalFailure(Throwable thrown, int failures) {
        // do something with thrown
      }
    },
    RetriableException.class);
is (almost) the same as:
try {
  retrier.callWithRetry(callable, RetriableException.class);
} catch (Throwable thrown) {
  // do something with thrown
  throw thrown;
}
("almost" because the retrier might wrap the Throwable in a RuntimeException,
so you might need to getCause or getRootCause. Also - there is the
"beforeRetry" I ignored for the example)
Removing "afterFinalFailure" also makes the FailureReporter in line with Java 8
functional interface - meaning we can more easily create it when we do need to
override "beforeRetry".
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=189972101
This was a surprisingly involved change. Some of the difficulties included
java.util.Optional purposely not being Serializable (so I had to move a
few Optionals in mapreduce classes to @Nullable) and having to add the Truth
Java8 extension library for assertion support.
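To give a flavor of the Optional-to-@Nullable change, a hypothetical sketch
(ExampleReducer and tld are made-up names, not the actual mapreduce classes):

import java.io.Serializable;
import java.util.Optional;
import javax.annotation.Nullable;

public class ExampleReducer implements Serializable {

  // java.util.Optional is not Serializable, so the stored field is @Nullable...
  @Nullable private final String tld;

  ExampleReducer(Optional<String> tld) {
    this.tld = tld.orElse(null);
  }

  // ...and is re-wrapped in an Optional at the accessor.
  Optional<String> getTld() {
    return Optional.ofNullable(tld);
  }
}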
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=171863777
When trying to run the MapReduce for DeleteOldCommitLogsAction, we run into a
lot of DatastoreTimeoutExceptions during CommitLogManifestReader.next.
This causes the entire shard to fail. Since we have a lot of keys (tens of
millions), this is almost guaranteed to happen, dooming the entire MapReduce.
Here is an attempt to recover from the DatastoreTimeoutException by saving the
reader's state before the read, then on failure restoring that state and
trying again.
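Roughly, the idea looks like this (a sketch only; ReaderState, saveState,
restoreState, and readNext are hypothetical stand-ins for the reader's real
internals, and the real code would presumably cap the number of retries):

CommitLogManifest next() {
  ReaderState checkpoint = saveState();  // snapshot the cursor before the read
  while (true) {
    try {
      return readNext();
    } catch (DatastoreTimeoutException e) {
      // Rewind to the pre-read state and try the same read again.
      restoreState(checkpoint);
    }
  }
}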
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=165172222
Attempting to run DeleteOldCommitLogs in prod resulted in a lot of DatastoreTimeoutException errors. The assumption is that attempting to load so many CommitLogManifests (over 200 million of them), when each one has a slight possibility of failure, results in a very high overall probability of error.
The shard aborts after 20 of these errors. By eliminating as many loads as possible and retrying the remaining loads inside a transaction, we effectively eliminate any exceptions "leaking" out to the mapreduce framework, which will hopefully keep us below 20. At least, that's currently our best guess as to why the mapreduce fails.
EppResources are loaded in the map stage to get the revisions, and CommitLogManifests are only loaded in the reduce stage as a sanity check so we don't accidentally delete resources we still need in prod. Both of these loads are wrapped in transactNew to make sure they retry individually (see the sketch below).
The only "load" not done inside a transaction is the EppResourceIndex, but there's no getting around that without rewriting the EppResourceInputs.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=164176764