No download? No build dir? No problem! #305

Merged
tsibley merged 3 commits into master from trs/build/no-download-no-build-dir-no-problem
Aug 25, 2023

tsibley
Member

@tsibley tsibley commented Aug 24, 2023

Make the nextstrain build pathogen <directory> optional when --attach + --no-download are used

This allows usages which just want to check job status/logs to stop passing in a meaningless/unused directory.

Testing

  • Behaviour manually exercised
  • Checks pass

Contributor

@joverlee521 joverlee521 left a comment

No download? No build dir? No problem!

Yay! 🥳

@@ -78,7 +79,7 @@ def register_commands(parser, commands):
 
     for cmd in commands:
         subparser = cmd.register_parser(subparsers)
-        subparser.set_defaults( __command__ = cmd )
+        subparser.set_defaults( __command__ = cmd, __parser__ = weakref.proxy(subparser) )
Contributor

I'm reading through the weakref docs, but would like to understand the choice to use it here.

Member Author

Weak references are used with automatic reference counting systems for memory management. Reference counting is a limited form of garbage collection (GC), though many folks use GC only to mean something different, like mark-and-sweep systems. Reference counting systems track how many times a thing (e.g. a value/object) is referred to (e.g. by variables); that number is the refcount. When the refcount drops to 0, the system knows it can destroy the thing and return its memory to the available pool for use by something else.
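To make the refcount idea concrete, here is a minimal sketch that watches a refcount move up and down. It assumes CPython (sys.getrefcount is a CPython detail and reports one extra, temporary reference held by the call itself), and the Thing class is a made-up placeholder, not anything from this PR.

```python
import sys


class Thing:
    """Hypothetical placeholder object, only here to have something to count."""


obj = Thing()

# getrefcount() includes a temporary reference for its own argument, so only
# the differences between readings are meaningful.
base = sys.getrefcount(obj)

alias = obj                       # a second name now refers to the same object
after_alias = sys.getrefcount(obj)

del alias                         # dropping that name lowers the count again
after_del = sys.getrefcount(obj)

print(after_alias - base, after_del - base)
```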

A problem arises with cyclical references because refcounts never drop to 0 for affected objects. This results in memory leaks because affected values aren't ever destroyed. Consider an object A that stores a reference to itself in its properties/attributes; the refcount of A will never be less than 1 even when nothing else outside of A refers to it. In practice, I think some reference counting systems with additional bookkeeping can identify simple cycles (e.g. A → A → A → …) but not more complex/indirect ones (e.g. A → B → A → B → …).
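A rough way to see the A → A case for yourself, assuming CPython: the sketch below turns off the cycle collector so that only reference counting is at work, then compares an ordinary object with one that refers to itself. The Node class and the destroyed list are illustrative names only.

```python
import gc

destroyed = []  # names of Node objects whose __del__ has run


class Node:
    """Hypothetical object that records its own destruction."""

    def __init__(self, name):
        self.name = name
        self.other = None

    def __del__(self):
        destroyed.append(self.name)


gc.disable()  # leave only reference counting in play for this experiment
try:
    solo = Node("solo")
    del solo                          # refcount hits 0, destroyed right away
    after_plain_del = list(destroyed)

    a = Node("a")
    a.other = a                       # A -> A: the object keeps itself alive
    del a                             # refcount stays at 1, so nothing happens
    after_cycle_del = list(destroyed)
finally:
    gc.enable()

print(after_plain_del, after_cycle_del)
```

Under pure reference counting the self-referential object is never destroyed, which is the leak described above.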

A weak reference is one which doesn't increase the reference count, which means it doesn't ensure the value stays in existence, and thus can be used to break cyclical references and avoid memory leaks.
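As a sketch of breaking a cycle this way (again assuming CPython, with a made-up Node class): the child below holds only a weak back-reference to its parent, so dropping the last strong reference destroys the parent even with the cycle collector switched off.

```python
import gc
import weakref


class Node:
    """Hypothetical tree node; the parent link is meant to be weak."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self.children = []


gc.disable()  # show that reference counting alone is enough here
try:
    parent = Node("parent")
    child = Node("child")
    parent.children.append(child)        # strong reference: parent -> child
    child.parent = weakref.ref(parent)   # weak reference: child -> parent

    probe = weakref.ref(parent)          # lets us ask "is it still alive?"
    alive_before = probe() is not None

    del parent                           # last strong reference is gone
    alive_after = probe() is not None
    dangling = child.parent() is None    # the weak link now yields None
finally:
    gc.enable()

print(alive_before, alive_after, dangling)
```

The flip side is visible in the last check: a weak reference doesn't keep the value around, so code using it has to cope with the value having gone away.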

Here, I'm assuming that subparser.set_defaults() ends up storing the passed values in itself somewhere so it can use them later when parse_args() is called. Since I'm passing in subparser to itself and expecting it to store what I pass, I'm assuming a cyclical reference will be created and that I should break that with weakref. That said, the Nextstrain CLI processes are not very long-lived and the argument parsing code is called once, maybe twice, but not repeatedly many times, so the consequence of a cyclical reference is probably very low. I probably could omit weakening and not worry about it! Maybe I should not worry.
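That assumption can be checked with the standard library alone. The sketch below uses a stand-in "demo" parser with a "build" subcommand (not the real Nextstrain CLI parser) and argparse's public get_default() to confirm the subparser holds on to the value it was given, and that parse_args() later hands the same object back on the namespace.

```python
import argparse

# Stand-in parser layout; the names "demo" and "build" are illustrative only.
parser = argparse.ArgumentParser(prog="demo")
subparsers = parser.add_subparsers()
build = subparsers.add_parser("build")

# Same shape as the change under review, minus the weakref.
build.set_defaults(__parser__=build)

# The subparser stores the default, so it now refers to itself.
stored = build.get_default("__parser__")

# parse_args() copies that default onto the resulting namespace.
opts = parser.parse_args(["build"])

print(stored is build, opts.__parser__ is build)
```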

This got me thinking about it some more, and I remembered that Python (well, CPython) also complements reference-counting with a traversal-based garbage collector. This is specifically to handle the case of cyclical references. So I guess I really shouldn't worry about it.
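A small sketch of that collector at work on the indirect A → B → A case, assuming CPython and a made-up Node class. Automatic collection is paused so the timing is deterministic, and a weak reference is used only as a probe for whether the object still exists.

```python
import gc
import weakref


class Node:
    """Hypothetical object used to build a two-object cycle."""


gc.disable()  # pause automatic collection so we control when it runs
try:
    a = Node()
    b = Node()
    a.other = b                  # A -> B
    b.other = a                  # B -> A

    probe = weakref.ref(a)       # observes A without keeping it alive
    del a, b                     # no outside references remain

    alive_after_del = probe() is not None       # refcounting can't free it
    gc.collect()                                # traversal finds the cycle
    alive_after_collect = probe() is not None
finally:
    gc.enable()

print(alive_after_del, alive_after_collect)
```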

Member Author

@tsibley tsibley Aug 24, 2023

If you want to play around with this to get a sense of it, try this patch:

diff --git a/nextstrain/cli/__init__.py b/nextstrain/cli/__init__.py
index 5c1b110..4f388df 100644
--- a/nextstrain/cli/__init__.py
+++ b/nextstrain/cli/__init__.py
@@ -32,6 +32,15 @@ def run(args):
     parser = make_parser()
     opts = parser.parse_args(args)
 
+    # Removing "parser" which holds references to the "subparser" referred to
+    # weakly by "opts.__parser__".
+    del parser
+
+    # Trigger GC now, before we use opts.__parser__ below.  This will render
+    # the weakref invalid.
+    import gc
+    gc.collect()
+
     try:
         return opts.__command__.run(opts)
 

alone and in combination with this one:

diff --git a/nextstrain/cli/argparse.py b/nextstrain/cli/argparse.py
index fa7b035..c654c0d 100644
--- a/nextstrain/cli/argparse.py
+++ b/nextstrain/cli/argparse.py
@@ -79,7 +79,7 @@ def register_commands(parser, commands):
 
     for cmd in commands:
         subparser = cmd.register_parser(subparsers)
-        subparser.set_defaults( __command__ = cmd, __parser__ = weakref.proxy(subparser) )
+        subparser.set_defaults( __command__ = cmd, __parser__ = subparser )
 
         # Ensure all subparsers format like the top-level parser
         subparser.formatter_class = parser.formatter_class

Also try commenting out/deleting the del parser line in the first patch. Automatic reference counting isn't guaranteed to destroy objects immediately upon the refcount going to 0, which is why the first patch triggers GC immediately.
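For a standalone feel of what "render the weakref invalid" means, without patching the CLI, here is a minimal sketch assuming CPython. FakeParser is a made-up stand-in for an argparse subparser. Once the only strong reference is gone, using the proxy raises ReferenceError.

```python
import weakref


class FakeParser:
    """Hypothetical stand-in for a subparser."""

    def format_usage(self):
        return "usage: demo build"


real = FakeParser()
proxy = weakref.proxy(real)

live_result = proxy.format_usage()   # works while the referent is alive

del real                             # drop the only strong reference

try:
    proxy.format_usage()
    outcome = "still alive"
except ReferenceError:
    outcome = "ReferenceError"       # the proxy now points at nothing

print(live_result, outcome)
```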

In any case, I'm going to remove the weakref usage as the traversal-based GC should make it unnecessary (and consequences are very low anyway due to short process lifetimes). Thanks for getting me to think more about this beyond my initial "oh, let's not create a reference cycle here"!

Contributor

(Bookmarking this to read and play with more later. Thank you for the in-depth explanation 🌟)

…or message

Normally argparse does this when it detects a usage error, but this lets
us handle other cases it doesn't in a similar way.  Similar to how we
track which command class the args parse to, we track the command parser
the args parse to so we can emit appropriate usage information.

For consistency with argparse, the exit status of a usage error is 2
instead of 1, but the distinction is unlikely to be actually useful in
practice.
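
A rough sketch of the idea in this commit message, not the code from this PR: exit_with_usage is a hypothetical helper, and the "demo"/"build" parser is a stand-in. It prints the usage of whichever subparser the args parsed to, then exits with status 2 as argparse does for its own usage errors.

```python
import argparse
import io
import sys
from contextlib import redirect_stderr


def exit_with_usage(opts, message):
    """Hypothetical helper: print the matched command's usage, exit with 2."""
    opts.__parser__.print_usage(sys.stderr)
    print(f"error: {message}", file=sys.stderr)
    sys.exit(2)


# Stand-in parser; only the __parser__ default mirrors the described approach.
parser = argparse.ArgumentParser(prog="demo")
subparsers = parser.add_subparsers()
build = subparsers.add_parser("build")
build.add_argument("directory", nargs="?")
build.set_defaults(__parser__=build)

opts = parser.parse_args(["build"])

# Capture stderr and the exit status instead of actually exiting.
captured = io.StringIO()
status = None
try:
    with redirect_stderr(captured):
        exit_with_usage(opts, "a directory is required here")
except SystemExit as exit_info:
    status = exit_info.code

print(status)
print(captured.getvalue())
```
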
This allows usages which just want to check job status/logs to stop
passing in a meaningless/unused directory.
@tsibley tsibley force-pushed the trs/build/no-download-no-build-dir-no-problem branch from 1fc0a0e to 7979f6f Compare August 24, 2023 23:21
@tsibley tsibley merged commit ec82fed into master Aug 25, 2023
@tsibley tsibley deleted the trs/build/no-download-no-build-dir-no-problem branch August 25, 2023 00:05