Improve accuracy of ICANN activity EPP/SRS query

This makes a bunch of accuracy improvements to the ICANN activity EPP/SRS query that parses XML logs.  We're going to be getting rid of this code imminently in favor of the new FlowReporter JSON log line which is much easier to interpret correctly, and in fact all of these issues were detected by comparison with that query. Fixing these issues brings this query almost (but not quite) up to par with the new version.  The point of checking this in is so that we have evidence that the new version only diverges from the existing version in ways that are desirable and documented herein.

Here are the accuracy issues fixed in this CL:

1) false negative - the old query failed to parse the XML entirely for logs with no client ID (e.g. a registrar who isn't logged in attempting to do a domain check), due to an overly restrictive regex

2) false negative - the old query failed to extract the TLD from EPP requests for just ".tld" (no SLD label at all), which would be an invalid request anyway but should probably still be counted in the reporting for that TLD (which the new query does), due to an overly restrictive regex

3) false negative - the old query failed to normalize uppercase TLDs from the raw XML to lowercase, meaning they wouldn't have matched later on in the pipeline

4) false positive - the old query counted dry-runs from the tool as valid EPP requests, the new logging ignores them

5) false positive - if the old query can't extract the TLD (for example, the domain name provided is just "how" instead of a real SLD), it reports a NULL tld, but then the way the overall activity report query works, it considers that metric to apply to *all possible TLDs* (this is necessary behavior for cases where e.g. for contact/host metrics, we do want a null TLD to signify "applies to all TLDs").  In the case of e.g. a domain check though, this results in that domain check being counted for all TLDs' activity reports, even though it should not be counted for any of them.  The fix is to manually filter out results with for 'srs-dom-*' metrics with a NULL tld.

6) false negative - old query wasn't counting /check (CheckApiAction) queries, new query does (and for some reason, we have a strange number of these coming in for TLDs other than .how/.soy, pretty much all from AWS IP addresses and with a UserAgent corresponding to the Go HTTP client library)

Finally, this also adds more progress printing lines to icann_reporting.py.

-------------
Created by MOE: https://github.com/google/moe
MOE_MIGRATED_REVID=159608369
This commit is contained in:
nickfelt 2017-06-20 13:56:28 -07:00 committed by Ben McIlwain
parent 9d96072e01
commit 978fc665ea
2 changed files with 32 additions and 17 deletions

View file

@ -16,8 +16,8 @@
The IcannReportQueryBuilder class contains logic for constructing the
multi-part BigQuery queries used to produce ICANN monthly reports. These
queries are fairly complicated; see the design doc published to the
domain-registry-users@googlegroups.com for an overview.
queries are fairly complicated; see the design doc published to
nomulus-discuss@googlegroups.com for an overview.
Currently, this class only supports building the query for activity
reports (not transaction reports).
@ -73,7 +73,8 @@ class IcannReportQueryBuilder(object):
self._MakeActivityOperationalRegistrarsQuery(next_yearmonth),
self._MakeActivityAllRampedUpRegistrarsQuery(next_yearmonth),
self._MakeActivityAllRegistrarsQuery(registrar_count),
self._MakeActivityWhoisQuery(logs_query), self._MakeActivityDnsQuery(),
self._MakeActivityWhoisQuery(logs_query),
self._MakeActivityDnsQuery(),
self._MakeActivityEppSrsMetricsQuery(epp_xml_logs_query)
]
return _StripTrailingWhitespaceFromLines(self._MakeActivityReportQuery(
@ -107,8 +108,8 @@ class IcannReportQueryBuilder(object):
query = r"""
-- Query EPP request logs and extract the clientId and raw EPP XML.
SELECT
REGEXP_EXTRACT(logMessage, r'^%(log_signature)s\n\t.+\n\t(.+)\n') AS clientId,
REGEXP_EXTRACT(logMessage, r'^%(log_signature)s\n\t.+\n\t.+\n\t.+\n\t((?s).+)$') AS xml,
REGEXP_EXTRACT(logMessage, r'^%(log_signature)s\n\t.*\n\t(.*)\n') AS clientId,
REGEXP_EXTRACT(logMessage, r'^%(log_signature)s\n\t.*\n\t.*\n\t.*\n\t((?s).*)$') AS xml,
FROM (
-- BEGIN LOGS QUERY --
%(logs_query)s
@ -116,8 +117,10 @@ class IcannReportQueryBuilder(object):
)
WHERE
-- EPP endpoints from the proxy, regtool, and console respectively.
requestPath IN ('/_dr/epp', '/_dr/epptool', '/registrar-xhr')
(requestPath IN ('/_dr/epp', '/_dr/epptool', '/registrar-xhr')
OR LEFT(requestPath, 7) = '/check?')
AND REGEXP_MATCH(logMessage, r'^%(log_signature)s')
AND NOT logMessage CONTAINS 'DRY_RUN'
"""
return query % {'logs_query': logs_query,
'log_signature': FLOWRUNNER_LOG_SIGNATURE_PATTERN}
@ -318,8 +321,9 @@ class IcannReportQueryBuilder(object):
# pylint: disable=missing-docstring
query = r"""
-- Query EPP XML messages and calculate SRS metrics.
SELECT * FROM (
SELECT
domainTld AS tld,
LOWER(domainTld) AS tld,
-- SRS metric names follow a set pattern corresponding to the EPP
-- protocol elements. First we extract the 'inner' command element in
-- EPP, e.g. <domain:create>, which is the resource type followed by
@ -376,7 +380,7 @@ class IcannReportQueryBuilder(object):
REGEXP_EXTRACT(xml, '(?s)<command>.*?<([a-z]+)') AS commandType,
REGEXP_EXTRACT(xml, '(?s)<command>.*?<[a-z]+ op="(.+?)"') AS commandOpArg,
REGEXP_EXTRACT(xml, '(?s)<command>.*?<.+?>.*?<([a-z]+:[a-z]+)') AS innerCommand,
REGEXP_EXTRACT(xml, '<domain:name.*?>[^.]+[.](.+)</domain:name>') AS domainTld,
REGEXP_EXTRACT(xml, '<domain:name.*?>[^.]*[.](.+)</domain:name>') AS domainTld,
REGEXP_EXTRACT(xml, '<rgp:restore op="(.+?)"/>') AS restoreOp,
xml,
FROM (
@ -391,6 +395,10 @@ class IcannReportQueryBuilder(object):
-- excludes login, logout, and poll.
WHERE commandType IN ('check', 'create', 'delete', 'info', 'renew', 'transfer', 'update')
GROUP BY tld, metricName
)
-- Exclude domain-related EPP requests with no parsed TLD, otherwise
-- a NULL tld will make them apply to all TLDs like contact/host requests.
WHERE NOT (metricName CONTAINS 'srs-dom' AND tld IS NULL)
"""
return query % {'epp_xml_logs_query': epp_xml_logs_query}

View file

@ -176,8 +176,9 @@
(
-- Query EPP XML messages and calculate SRS metrics.
SELECT * FROM (
SELECT
domainTld AS tld,
LOWER(domainTld) AS tld,
-- SRS metric names follow a set pattern corresponding to the EPP
-- protocol elements. First we extract the 'inner' command element in
-- EPP, e.g. <domain:create>, which is the resource type followed by
@ -234,7 +235,7 @@
REGEXP_EXTRACT(xml, '(?s)<command>.*?<([a-z]+)') AS commandType,
REGEXP_EXTRACT(xml, '(?s)<command>.*?<[a-z]+ op="(.+?)"') AS commandOpArg,
REGEXP_EXTRACT(xml, '(?s)<command>.*?<.+?>.*?<([a-z]+:[a-z]+)') AS innerCommand,
REGEXP_EXTRACT(xml, '<domain:name.*?>[^.]+[.](.+)</domain:name>') AS domainTld,
REGEXP_EXTRACT(xml, '<domain:name.*?>[^.]*[.](.+)</domain:name>') AS domainTld,
REGEXP_EXTRACT(xml, '<rgp:restore op="(.+?)"/>') AS restoreOp,
xml,
FROM (
@ -242,8 +243,8 @@
-- Query EPP request logs and extract the clientId and raw EPP XML.
SELECT
REGEXP_EXTRACT(logMessage, r'^(?:google.registry.flows.FlowRunner run|com.google.domain.registry.flows.FlowRunner run|com.google.domain.registry.util.FormattingLogger log|google.registry.util.FormattingLogger log): EPP Command\n\t.+\n\t(.+)\n') AS clientId,
REGEXP_EXTRACT(logMessage, r'^(?:google.registry.flows.FlowRunner run|com.google.domain.registry.flows.FlowRunner run|com.google.domain.registry.util.FormattingLogger log|google.registry.util.FormattingLogger log): EPP Command\n\t.+\n\t.+\n\t.+\n\t((?s).+)$') AS xml,
REGEXP_EXTRACT(logMessage, r'^(?:google.registry.flows.FlowRunner run|com.google.domain.registry.flows.FlowRunner run|com.google.domain.registry.util.FormattingLogger log|google.registry.util.FormattingLogger log): EPP Command\n\t.*\n\t(.*)\n') AS clientId,
REGEXP_EXTRACT(logMessage, r'^(?:google.registry.flows.FlowRunner run|com.google.domain.registry.flows.FlowRunner run|com.google.domain.registry.util.FormattingLogger log|google.registry.util.FormattingLogger log): EPP Command\n\t.*\n\t.*\n\t.*\n\t((?s).*)$') AS xml,
FROM (
-- BEGIN LOGS QUERY --
@ -263,8 +264,10 @@
)
WHERE
-- EPP endpoints from the proxy, regtool, and console respectively.
requestPath IN ('/_dr/epp', '/_dr/epptool', '/registrar-xhr')
(requestPath IN ('/_dr/epp', '/_dr/epptool', '/registrar-xhr')
OR LEFT(requestPath, 7) = '/check?')
AND REGEXP_MATCH(logMessage, r'^(?:google.registry.flows.FlowRunner run|com.google.domain.registry.flows.FlowRunner run|com.google.domain.registry.util.FormattingLogger log|google.registry.util.FormattingLogger log): EPP Command')
AND NOT logMessage CONTAINS 'DRY_RUN'
-- END EPP XML LOGS QUERY --
)
@ -275,6 +278,10 @@
-- excludes login, logout, and poll.
WHERE commandType IN ('check', 'create', 'delete', 'info', 'renew', 'transfer', 'update')
GROUP BY tld, metricName
)
-- Exclude domain-related EPP requests with no parsed TLD, otherwise
-- a NULL tld will make them apply to all TLDs like contact/host requests.
WHERE NOT (metricName CONTAINS 'srs-dom' AND tld IS NULL)
)
-- END JOINED DATA SOURCES --