Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve startup time of workers #85

Closed
lars-reimann opened this issue Apr 13, 2024 · 5 comments · Fixed by #87
Closed

Improve startup time of workers #85

lars-reimann opened this issue Apr 13, 2024 · 5 comments · Fixed by #87
Labels
performance 🏃 Speed things up released Included in a release

Comments

@lars-reimann
Copy link
Member

lars-reimann commented Apr 13, 2024

During evaluation, we observed delays ranging from 3s to 20s+ between a program message and the first values coming back, regardless of the size of the data. For better usability, this time should be reduced as much as possible, since this

  • delays the table view showing up
  • and later causes another delay for the profiling.
@lars-reimann lars-reimann added the performance 🏃 Speed things up label Apr 13, 2024
@lars-reimann lars-reimann added this to DSL Apr 13, 2024
@github-project-automation github-project-automation bot moved this to Backlog in DSL Apr 13, 2024
@lars-reimann
Copy link
Member Author

lars-reimann commented Apr 17, 2024

A few imports in the library constitute most of the delay:

  • (Measure-Command { python -c "import seaborn" }).TotalSeconds: 1.21s
  • (Measure-Command { python -c "import torch" }).TotalSeconds: 1.24s
  • (Measure-Command { python -c "import torchvision" }).TotalSeconds: 1.90s (also loads torch)
  • (Measure-Command { python -c "import openpyxl" }).TotalSeconds: 0.26s

We should get major improvements regarding startup by:

  1. Lazily loading our own classes in __init__ files (apipkg). Why is this not the default behavior in Python?!
  2. Lazily importing external libraries. Why is this not the default behavior in Python?!

Unfortunately, many libraries currently import all their stuff into a global package, so if we use anything from, say seaborn, there will still be the 1.21s delay. This can only be fixed by the libraries themselves.

Regarding the runner, we will need to have a primed process ready that can handle the execution of the next pipeline. Ideally, we would also keep those processes alive and reset them after pipeline execution is done, so we don't have the startup delay again and again.

@WinPlay02
Copy link
Contributor

WinPlay02 commented Apr 17, 2024

Ideally, we would also keep those processes alive and reset them after pipeline execution is done, so we don't have the startup delay again and again.

@lars-reimann This is somewhat supported, as metapath (for loading our pipelines) is already reset after a pipeline ends. The process could theoretically be reused, but this behavior would not guarantee correctness in all cases. States from external libraries (like seaborn) however, are not reset.

Regarding the runner, we will need to have a primed process ready that can handle the execution of the next pipeline.

I'll try something out for tomorrow, maybe something like that can be achieved.

@lars-reimann
Copy link
Member Author

lars-reimann commented Apr 17, 2024

The library improvements are already in Safe-DS/Library#624. But that will only get us so far. It would be great if you could take a look.

@lars-reimann
Copy link
Member Author

There's also a new release of the library now, containing your hash functions and your other improvements, as well as the lazy imports.

lars-reimann added a commit that referenced this issue Apr 17, 2024
Closes partially #85

### Summary of Changes

Require the latest version of `safe-ds`.
lars-reimann pushed a commit that referenced this issue Apr 17, 2024
## [0.11.0](v0.10.0...v0.11.0) (2024-04-17)

### Features

* bump `safe-ds` to `v0.21.0` ([#86](#86)) ([d780822](d780822)), closes [#85](#85)
* memoization improvements ([#81](#81)) ([6bc2288](6bc2288)), closes [#44](#44)
lars-reimann added a commit that referenced this issue Apr 21, 2024
Closes #85

### Summary of Changes

- Use a process pool to keep started processes waiting
- The max. amount of pipeline processes is now set to `4`.
- Reuse started processes. This should be correct, as the same pipeline
process cannot be used by multiple pipelines at the same time. As the
`metapath` is reset to remove the custom generated Safe-DS pipeline
code, only global library imports (and settings) should remain. If this
is a concern, `maxtasksperchild` can be set to `1`, in which case
pipeline processes are not reused.
- Reuse shared memory location for saving placeholders, if the
memoization infrastructure has added such a location to the object being
saved

---------

Co-authored-by: megalinter-bot <129584137+megalinter-bot@users.noreply.github.com>
Co-authored-by: Lars Reimann <mail@larsreimann.com>
@github-project-automation github-project-automation bot moved this from Backlog to ✔️ Done in DSL Apr 21, 2024
lars-reimann pushed a commit that referenced this issue Apr 22, 2024
## [0.12.0](v0.11.0...v0.12.0) (2024-04-22)

### Features

* handle list of filenames in `absolute_path` and `file_mtime` ([#89](#89)) ([50d831f](50d831f)), closes [#88](#88)
* prepare and pool processes ([#87](#87)) ([e5e7011](e5e7011)), closes [#85](#85)
@lars-reimann
Copy link
Member Author

🎉 This issue has been resolved in version 0.12.0 🎉

The release is available on:

Your semantic-release bot 📦🚀

@lars-reimann lars-reimann added the released Included in a release label Apr 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
performance 🏃 Speed things up released Included in a release
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants