Speedup thicket composition by performing profile unions in binary tree order #170

michaelmckinsey1 · 2024-06-11T00:37:27Z

This PR builds on #169. Limiting the total size of the ensemble during unification has a larger impact than the performance gains from parallelism, and this solution is less complicated than using multiprocessing.

Metric	Parallel Sorting	LBANN
Profiles	15734	16
Input Nodes (min, max)	5, 18	6266, 6889
Input Dataframe Size (min, max)	''	''
Output Nodes	154	7106
Output Dataframe Size	2423036	113696

Dataset	Thicket Version	Time	Unions	old_to_new size (min, max)
Parallel Sorting	Develop	>120m	15733	157459
Parallel Sorting	#170	12m	15733	16, 185
Parallel Sorting	#185	33m	15733	157459
LBANN	Develop	17m	15	106897
LBANN	#170	7m	15	12824, 13821
LBANN	#185	3m	15	106897

The unification strategy in this PR does not change the number of unions. This is because:

# PR170 - 3 unions
tk1 U tk2 = tk12
tk3 U tk4 = tk34
tk12 U tk34 = tk1234

is the same as

# develop - 3 unions
tk1 U tk2 = tk12
tk12 U tk3 = tk123
tk123 U tk4 = tk1234
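
The equivalence above can be checked with a small sketch. It is not Thicket code: plain Python sets stand in for thickets, and the helper names `union_sequential` and `union_tree` are hypothetical. Both functions assume a non-empty list of profiles.

```python
from collections import deque
from functools import reduce


def union_sequential(profiles):
    """Fold left to right: ((p1 U p2) U p3) U p4 ..."""
    count = 0

    def union(a, b):
        nonlocal count
        count += 1
        return a | b

    return reduce(union, profiles), count


def union_tree(profiles):
    """Union in binary tree order: pop two items, append their union."""
    queue = deque(profiles)
    count = 0
    while len(queue) > 1:
        a = queue.popleft()
        b = queue.popleft()
        queue.append(a | b)  # the partial result waits at the back of the queue
        count += 1
    return queue[0], count


# Four toy "profiles", each a set of node names (illustrative data only).
tk = [frozenset({"main", f"kernel_{i}"}) for i in range(4)]
seq_result, seq_unions = union_sequential(tk)
tree_result, tree_unions = union_tree(tk)
```

Both orders perform n - 1 unions and produce the same final set; only the size of the intermediate operands differs.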

This PR improves performance by keeping the old_to_new mapping dictionary small: concat_thickets is called n-1 times, instead of once as in the develop branch. Each union requires an operation to update the dictionary mappings, which becomes expensive when the dictionary is large. The old_to_new dictionary is needed to replace the node objects in the performance data with the new node objects in the graph that results from the union.
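
A toy cost model may help illustrate why the mapping size matters. It is an assumption-laden sketch, not Thicket's implementation: profiles are sets of integer node ids, the mapping is modelled only by its entry count, and every union is assumed to revisit every entry of the mapping it works with. The function names `sequential_cost` and `tree_cost` are hypothetical.

```python
def sequential_cost(profiles):
    """One long-lived mapping that grows with, and is rescanned on, every union."""
    mapping_size = len(profiles[0])
    touched = 0
    for prof in profiles[1:]:
        mapping_size += len(prof)  # new old -> new entries for this profile
        touched += mapping_size    # assumed: every entry is revisited per union
    return touched, mapping_size


def tree_cost(profiles):
    """A fresh mapping per pairwise union, covering only the two operands."""
    level = [set(p) for p in profiles]
    touched = 0
    largest = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            a, b = level[i], level[i + 1]
            size = len(a) + len(b)  # entries needed for this union only
            touched += size
            largest = max(largest, size)
            next_level.append(a | b)
        if len(level) % 2:          # odd item out moves up unchanged
            next_level.append(level[-1])
        level = next_level
    return touched, largest


# 64 overlapping toy profiles with 5 nodes each (illustrative data only).
profiles = [set(range(i, i + 5)) for i in range(64)]
seq_touched, seq_largest = sequential_cost(profiles)
tree_touched, tree_largest = tree_cost(profiles)
```

Under these assumptions the sequential order touches far more mapping entries in total, because its single mapping keeps growing, while the tree order never needs a mapping larger than the two operands of one union.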

thicket/thicket.py

ilumsden

The code is good, but something has to be done to make the tree-based concatenation clearer. At a bare minimum, this needs to be well commented; otherwise, no one is going to have a clue what this is doing.

ilumsden · 2024-06-20T17:36:25Z

thicket/thicket.py

-
-        return thicket_object
+        # n - 1 edges in a binary tree
+        pbar = tqdm.tqdm(range(len(ens_list) - 1), disable=disable_tqdm)


Using tqdm here makes this process very hard to understand. I'd suggest not using tqdm and/or extensively documenting this with comments.

tqdm is really useful to see what is happening for longer reads.

I changed it to use a while loop, which should be clearer with the popping and appending that is happening in the loop. And I added very detailed comments.

ilumsden

LGTM. The stuff you're doing with tqdm also makes much more sense now that you've added the comments and switched to a while loop.

ilumsden · 2024-06-26T23:32:21Z

@pearce8 this PR is also ready for your review

…ee order (LLNL#170) * Speedup reader by splitting into chunks * Switch to binary tree order composition * Make code clearer with while loop and comments * Clarity * Update comment * Update comment --------- Co-authored-by: Michael Richard Mckinsey <mckinsey@quartz1154.llnl.gov>

michaelmckinsey1 mentioned this pull request Jun 12, 2024

Add Option to Speedup Thicket Reader with Multiprocess #169

Closed

michaelmckinsey1 force-pushed the feature-speedup_reader branch from 00e02ba to e4d2ebc Compare June 12, 2024 23:08

michaelmckinsey1 changed the title ~~Speedup Reader by Splitting Data into Chunks~~ Speedup Reader by Splitting Ensemble into Sub-lists Jun 13, 2024

michaelmckinsey1 force-pushed the feature-speedup_reader branch from e4d2ebc to f57450e Compare June 13, 2024 17:31

michaelmckinsey1 marked this pull request as ready for review June 13, 2024 22:28

michaelmckinsey1 requested review from pearce8 and ilumsden June 13, 2024 22:28

michaelmckinsey1 self-assigned this Jun 13, 2024

michaelmckinsey1 force-pushed the feature-speedup_reader branch from f57450e to fe69690 Compare June 14, 2024 22:54

pearce8 changed the title ~~Speedup Reader by Splitting Ensemble into Sub-lists~~ Speedup thicket composition by performing profile unions in binary tree order Jun 14, 2024

ilumsden requested changes Jun 20, 2024

michaelmckinsey1 added status-revisions-needed Revisions have been requested from a reviewer for this PR and removed status-ready-for-review This PR is ready to be reviewed by assigned reviewers labels Jun 20, 2024

michaelmckinsey1 force-pushed the feature-speedup_reader branch from fe69690 to bce8513 Compare June 20, 2024 23:03

michaelmckinsey1 and others added 2 commits June 24, 2024 15:57

Speedup reader by splitting into chunks

8035e0c

Switch to binary tree style ensembling

bb1b19d

michaelmckinsey1 force-pushed the feature-speedup_reader branch from bce8513 to bb1b19d Compare June 24, 2024 20:57

Make code clearer with while loop and comments

a0afa2d

michaelmckinsey1 force-pushed the feature-speedup_reader branch from 6fd50d7 to a0afa2d Compare June 24, 2024 22:11

michaelmckinsey1 added 3 commits June 24, 2024 17:13

Clarity

2f9cf7f

Update comment

28b359a

Update comment

86b9b06

michaelmckinsey1 mentioned this pull request Jun 26, 2024

Optimize Loop in Unify #185

Merged

michaelmckinsey1 added status-ready-for-review This PR is ready to be reviewed by assigned reviewers and removed status-revisions-needed Revisions have been requested from a reviewer for this PR labels Jun 26, 2024

ilumsden approved these changes Jun 26, 2024

ilumsden added status-approved No more revisions are required on this PR and it is ready for merge and removed status-ready-for-review This PR is ready to be reviewed by assigned reviewers labels Jun 26, 2024

pearce8 approved these changes Jun 26, 2024

pearce8 merged commit ace73f2 into LLNL:develop Jun 26, 2024
4 checks passed

slabasan added this to the 2024.2.0 milestone Sep 6, 2024

michaelmckinsey1 mentioned this pull request Nov 12, 2024

Extend hash length #227

Merged
