mirror of
https://github.com/google/nomulus.git
synced 2025-08-03 00:12:11 +02:00
This parameter is misleading and does not do what it purports to do. Namely, it does not impact the level of parallelism. Given the input n for this parameter, and m for the batch size, the elements are divided (keyed) into n groups, each of which are then spread evenly across all threads, which are eventually in turn batched into batches with size m: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L227 This is also evident in the implementation itself, where the ShardedKey is determined by the unique number for a worker/thread combo and the original key: https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L268 Using a more concrete example, suppose we have 100 elements and 10 worker threads, with a target batch size of 5. If the "shard" number is set to 1, we first spread the 100 elements across 10 threads, resulting in 10 elements per thread, each thread then batches the elements into 2 batches of size 5. If the "shard" number is set to 2, the 100 elements are first divided into 2 "shards" of 50 each. Each "shard" is then distributed within the 10 threads, resulting in 5 elements per "shard" per thread. They then get turned into 1 batch per "shard" per thread. In the end, each thread still processes 2 batches, even though they are from 2 different "shards". Therefore this "shard" number does not perform horizontal partitioning that one normally associates with sharding, and provides no performance benefits but rather confuses the user. It is also suggested that using withShardedKey() alone is already sufficient to achieve auto-sharding within the keyed group. There is no need to manually divide the input by keying them differently based on the "shard" number specified: https://youtu.be/jses0W4Zalc?t=967 |
||
---|---|---|
.. | ||
src | ||
build.gradle | ||
buildscript-gradle.lockfile | ||
Dockerfile | ||
gradle.lockfile |