google-nomulus/core
Lai Jiang cfee4713ed
Remove sharding parameter from RegistryJpaIO (#1856)
This parameter is misleading and does not do what it purports to do.
Namely, it does not impact the level of parallelism. Given the input n for this
parameter, and m for the batch size, the elements are divided (keyed) into n
groups, each of which are then spread evenly across all threads, which
are eventually in turn batched into batches with size m:

https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L227

This is also evident in the implementation itself, where the ShardedKey
is determined by the unique number for a worker/thread combo and the
original key:

https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/GroupIntoBatches.java#L268

Using a more concrete example, suppose we have 100 elements and 10
worker threads, with a target batch size of 5. If the "shard" number is set to
1, we first spread the 100 elements across 10 threads, resulting in 10
elements per thread, each thread then batches the elements into 2
batches of size 5.

If the "shard" number is set to 2, the 100 elements are first divided into 2
"shards" of 50 each. Each "shard" is then distributed within the 10
threads, resulting in 5 elements per "shard" per thread. They then get
turned into 1 batch per "shard" per thread. In the end, each thread
still processes 2 batches, even though they are from 2 different "shards".

Therefore this "shard" number does not perform horizontal partitioning
that one normally associates with sharding, and provides no
performance benefits but rather confuses the user.

It is also suggested that using withShardedKey() alone is already
sufficient to achieve auto-sharding within the keyed group. There is no
need to manually divide the input by keying them differently based on
the "shard" number specified:

https://youtu.be/jses0W4Zalc?t=967
2022-12-12 11:55:24 -05:00
..
src Remove sharding parameter from RegistryJpaIO (#1856) 2022-12-12 11:55:24 -05:00
build.gradle Remove some appengine dependencies (#1874) 2022-12-08 11:46:47 -05:00
buildscript-gradle.lockfile Upgrade to Gradle 7.0 (#1712) 2022-07-26 11:41:27 -04:00
Dockerfile Build docker image of nomulus tool (#142) 2019-07-16 20:18:44 -04:00
gradle.lockfile Remove some appengine dependencies (#1874) 2022-12-08 11:46:47 -05:00