This finally fixes b/37629674 by cutting over the ICANN activity reporting query for EPP/SRS metrics to use the new JSON-based structured log line in FlowReporter. The JSON log line is much easier to parse and interpret correctly than the old XML logging, which was not designed to be ingested into BigQuery.
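To illustrate why a JSON log line is easier to consume than XML, here is a minimal Python sketch of counting metrics from such lines. It is not the BigQuery query itself. The log signature string and the field names `tld` and `icannActivityReportField` are assumptions made for this example and are not taken from this message.

```python
import json
from collections import Counter

# Assumed marker that precedes the JSON payload; hypothetical for this sketch.
LOG_SIGNATURE = "FLOW-LOG-SIGNATURE-METADATA: "


def parse_flow_log_line(line):
    """Returns the decoded JSON object from a structured log line, or None."""
    _, found, payload = line.partition(LOG_SIGNATURE)
    if not found:
        return None
    try:
        record = json.loads(payload)
    except ValueError:
        # Malformed JSON: skip the line rather than guess at its contents.
        return None
    return record if isinstance(record, dict) else None


def count_metrics(lines):
    """Counts (tld, metric) pairs across all parseable structured log lines."""
    counts = Counter()
    for line in lines:
        record = parse_flow_log_line(line)
        if record is None:
            continue
        # Field names below are assumed, not confirmed by the source.
        metric = record.get("icannActivityReportField")
        if metric:
            counts[(record.get("tld"), metric)] += 1
    return counts
```

No regex over the request body is needed here: each field is read by name from the decoded object, and lines that are not structured log lines are simply skipped.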
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=159633467
This makes a bunch of accuracy improvements to the ICANN activity EPP/SRS query that parses XML logs. We're going to be getting rid of this code imminently in favor of the new FlowReporter JSON log line which is much easier to interpret correctly, and in fact all of these issues were detected by comparison with that query. Fixing these issues brings this query almost (but not quite) up to par with the new version. The point of checking this in is so that we have evidence that the new version only diverges from the existing version in ways that are desirable and documented herein.
Here are the accuracy issues fixed in this CL:
1) false negative - the old query failed to parse the XML entirely for logs with no client ID (e.g. a registrar who isn't logged in attempting to do a domain check), due to an overly restrictive regex
2) false negative - due to an overly restrictive regex, the old query failed to extract the TLD from EPP requests for just ".tld" (no SLD label at all). Such a request is invalid anyway, but it should probably still be counted in the reporting for that TLD, which the new query does.
3) false negative - the old query failed to normalize uppercase TLDs from the raw XML to lowercase, meaning they wouldn't have matched later on in the pipeline
4) false positive - the old query counted dry-runs from the tool as valid EPP requests; the new logging ignores them
5) false positive - if the old query couldn't extract the TLD (for example, because the domain name provided was just "how" instead of a real SLD), it reported a NULL tld. The overall activity report query then considers that metric to apply to *all possible TLDs* (this behavior is necessary for e.g. contact/host metrics, where we do want a NULL tld to signify "applies to all TLDs"). In the case of e.g. a domain check, though, this results in that domain check being counted in all TLDs' activity reports, even though it should not be counted for any of them. The fix is to manually filter out results for 'srs-dom-*' metrics with a NULL tld.
6) false negative - the old query wasn't counting /check (CheckApiAction) queries; the new query does (and for some reason, we have a strange number of these coming in for TLDs other than .how/.soy, pretty much all from AWS IP addresses and with a UserAgent corresponding to the Go HTTP client library)
Finally, this also adds more progress printing lines to icann_reporting.py.
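The TLD handling from items 2, 3, and 5 above can be sketched in Python as follows. This is an illustration only, not the SQL in this CL: the regex, the helper names `extract_tld` and `keep_row`, and the simplification of treating the last label as the TLD are all assumptions made for the example.

```python
import re

# Hypothetical pattern: captures the label after the last dot. It places no
# requirement on what precedes the dot, so a bare ".tld" still yields a TLD.
_TLD_PATTERN = re.compile(r"\.([^.\s]+)$")


def extract_tld(domain_name):
    """Returns the lowercased TLD of a domain name, or None if there is none."""
    match = _TLD_PATTERN.search(domain_name.strip())
    # Lowercasing lets "HOW" and "how" match later in the pipeline (item 3).
    return match.group(1).lower() if match else None


def keep_row(metric_name, tld):
    """Drops domain metrics that have no TLD (item 5).

    Other metrics (e.g. contact/host) keep a None TLD, which the overall
    report treats as applying to all TLDs.
    """
    return not (metric_name.startswith("srs-dom-") and tld is None)
```

Under these assumptions, `extract_tld(".tld")` yields a TLD even with no SLD label, `extract_tld("how")` yields `None`, and `keep_row("srs-dom-check", None)` rejects the row while `keep_row("srs-host-check", None)` keeps it.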
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=159608369
This will help [] be submitted without breaking the linter.
License headers are now added automatically where they were previously
added by hand. We're also now adding the license header to Soy and SQL
files.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=129017424
This feature would have been useful earlier when I was changing the TLD
state on a sandbox TLD on-the-fly for testing purposes.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=128088578
This migrates the IcannReportQueryBuilder part of the internal ICANN reporting
script into the opensource repository, as a new module under the package
google.registry.reporting.
It correspondingly moves the golden activity SQL query test to the opensource
repo, since that test only applies to this part of the script anyway (note that
the actual golden SQL contents are unchanged by the move).
Tested: confirmed that the newly moved test passes (and that it fails when
expected), and ran the internal ICANN reporting script locally to verify
that both activity and transaction reporting results are unaffected by the move.
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=127580326
This adds a home in our opensource repo for Python libraries and binaries,
under a top-level "python" directory. Future CLs will relocate ICANN
reporting bits and pieces to new homes under this directory, and will use
the MOE configuration and python_directory_import rule defined here.
This approach is roughly modeled on the protobuf Bazel opensource project,
which also uses a top-level directory for various languages, and also uses
the "imports" parameter to exclude that directory in python module names:
https://github.com/google/protobuf/blob/v3.0.0-beta-3/BUILD#L568
-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=127459882