Commit
fix: Disabling Dataflow memory monitor for Bigtable Dataflow pipelines. (#2856)

Bigtable pipelines are very GC intensive. For each cell in Bigtable, we create the following objects:

1. Row key
2. Column qualifier
3. Timestamp
4. Value
5. A cell object that contains the above 4 objects

So each cell has at least 5 objects. On top of that, each cell may be represented by different kinds of objects. For example, the import job creates HBase Result and Mutation objects for all the cells. The same is the case with snapshot-related pipelines.

Given this abundance of objects, for cells with smaller values the pipeline may incur a high GC overhead, but it does make progress. The MemoryMonitor on the Dataflow worker nevertheless kills the pipeline, which results in wasted work.

The above holds for most Dataflow pipelines, but this specific use case is different because the pipeline does nothing else: CPU is used only for object transformation and GC.

So, we disable the memory monitor on Bigtable pipelines. If a pipeline stalls, it will OOM and human intervention will be required. As a mitigation, users should choose a worker machine with more memory or reduce the parallelism on the workers (by setting --numberOfWorkerHarnessThreads).
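
As a rough illustration of the reasoning above, the following sketch computes a lower bound on the number of objects allocated for a batch of cells, and the heap share per worker harness thread at different thread counts. It is not part of this commit: the class name, the method names, and the assumption that heap is split evenly across harness threads are all hypothetical simplifications.

```java
// Hypothetical back-of-the-envelope helper; NOT code from this commit.
final class BigtableGcEstimate {
    // Row key, column qualifier, timestamp, value, plus the enclosing cell object.
    static final int MIN_OBJECTS_PER_CELL = 5;

    // Lower bound on objects allocated while processing the given number of cells.
    // Real pipelines allocate more (for example HBase Result and Mutation objects).
    static long minObjects(long cells) {
        if (cells < 0) {
            throw new IllegalArgumentException("cells must be non-negative");
        }
        return Math.multiplyExact(cells, (long) MIN_OBJECTS_PER_CELL);
    }

    // Heap share per harness thread, ASSUMING an even split across threads.
    // Lowering the thread count raises the share available to each thread.
    static long heapBytesPerThread(long heapBytes, int harnessThreads) {
        if (harnessThreads <= 0) {
            throw new IllegalArgumentException("harnessThreads must be positive");
        }
        return heapBytes / harnessThreads;
    }

    public static void main(String[] args) {
        long heap = 4L * 1024 * 1024 * 1024; // example value: 4 GiB worker heap
        System.out.println("min objects for 1M cells: " + minObjects(1_000_000L));
        System.out.println("heap per thread (8 threads): " + heapBytesPerThread(heap, 8));
        System.out.println("heap per thread (2 threads): " + heapBytesPerThread(heap, 2));
    }
}
```

Under these assumptions, dropping from 8 to 2 harness threads quadruples the heap share per thread, which is the intuition behind lowering --numberOfWorkerHarnessThreads.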