google-nomulus/javatests/google
mcilwain 3f1a8eebdc Decrease flakiness of multipart TLD lifecycle tests
As part of our commit log layer that we have built on top of Objectify, we
enforce the constraint of a monotonically increasing transaction time with a
millisecond granularity. Thus, if two transactions occur at exactly the same
millisecond, as had been the case with these tests, one will get a
TimestampInversionException and retry. However, since we're mocking time in
these tests as well, they will retry at exactly the same millisecond, and thus
continue failing for the same reason until the max retry threshold is hit. The
EPP flow then ultimately fails with a generic "Command failed" response. All
of these are actual findings from the test logs of a flaky run.
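
The failure mode described above can be sketched roughly as follows. This is a minimal illustration, not the actual commit log code: the CommitLog class, the commitWithRetries helper, and the retry count are hypothetical stand-ins, and only the exception name is taken from the description above.

```java
import java.util.function.LongSupplier;

/** Hypothetical sketch of the failure mode; not the actual commit log code. */
public class FrozenClockRetrySketch {

  /** Stand-in for the exception named in the commit message. */
  static class TimestampInversionException extends RuntimeException {}

  /** Rejects any commit whose time is not strictly after the previous commit. */
  static class CommitLog {
    private long lastCommitMillis = Long.MIN_VALUE;

    void commit(long nowMillis) {
      if (nowMillis <= lastCommitMillis) {
        throw new TimestampInversionException();
      }
      lastCommitMillis = nowMillis;
    }
  }

  /** Retries up to maxAttempts times; returns the attempts used, or -1 if all failed. */
  static int commitWithRetries(CommitLog log, LongSupplier clock, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        log.commit(clock.getAsLong());
        return attempt;
      } catch (TimestampInversionException e) {
        // A real clock would have advanced by the next attempt; a mocked clock has not,
        // so every retry sees the same millisecond and fails the same way.
      }
    }
    return -1;
  }
}
```

With a clock frozen at a single millisecond, the first commit succeeds and a second commit exhausts its retries; with a clock that advances between reads, the second commit goes through on the first attempt.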

It's a mystery to me why these tests were merely flaky; it seems like they
should have always been failing for this reason, but they were still only
sometimes failing. Who knows.

The fix is simple: adjust the tests so that no two commands run at exactly
the same millisecond. Note that this is a test-only problem; in the real world,
a command that temporarily fails will simply succeed the next time it is
retried, since time is actually elapsing. This implies that our commit log system
allows at most one transaction per millisecond, i.e. a max mutation rate of
1,000 QPS across our entire system. This is unlikely to be a problem in practice
for any existing registry of any size.
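
One way to picture the adjustment is to give each command in a test its own run time, one millisecond after the previous one. The helper below is a hypothetical sketch using java.time, not the code changed in this CL; the class and method names are made up for illustration.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: distinct, strictly increasing run times for test commands. */
public class StaggeredRunTimesSketch {

  /** Returns count run times starting at base, each one millisecond after the previous. */
  static List<Instant> staggeredRunTimes(Instant base, int count) {
    List<Instant> times = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      times.add(base.plusMillis(i));
    }
    return times;
  }
}
```

Since one second holds exactly 1,000 distinct millisecond values, this is also where the 1,000 QPS ceiling comes from.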

Also note that, as far as the EPP XML itself is concerned, times only have second
granularity, so up to a thousand commands can execute in the same second and
still "appear" to have taken place at the same time from EPP's point of view.
That's why this CL only adds millisecond precision to the actual run time, not
to the expected values in the commands.
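
To illustrate why the expected values need no change, here is a small sketch assuming the expected timestamps are rendered at second granularity. The toEppGranularity name is hypothetical and is not part of the codebase.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Hypothetical sketch: run times that differ by milliseconds look identical at second granularity. */
public class SecondGranularitySketch {

  /** Drops the sub-second part, mimicking a timestamp written with second precision. */
  static Instant toEppGranularity(Instant runTime) {
    return runTime.truncatedTo(ChronoUnit.SECONDS);
  }
}
```

Two commands run at 00:00:00.001 and 00:00:00.002 both truncate to 00:00:00, so the expected output stays the same while the commit log still sees distinct transaction times.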

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=184777558
2018-02-20 15:20:22 -05:00
registry Decrease flakiness of multipart TLD lifecycle tests 2018-02-20 15:20:22 -05:00