From Linux bash script to Python control flow #386
Conversation
What did you think about creating a Python script which would be copied into the docker container and then executed in one command?

I think it could work, but there are loads of details to figure out. What state/configuration and code to pass over? How is the passed state serialised/deserialized? How do we interpret the output of the script? Which sections of the codebase can be called from that remote script? Worst case, it'd be the same as the current bash implementation, with huge amounts of string interpolation, writing Python code inside Python strings. I was also keen to avoid the mental overhead of having to think about which sections of the codebase could be called by different files.
This lets us use the 'linux32 bash' that's specified by the i686 docker images, without specifying it explicitly.
Alright, that should be the performance problems fixed.
…tead" This reverts commit a052c71.
But the dockcross images in the test aren't so happy about using ENTRYPOINT after all. Their default ENTRYPOINT prints this:

It does sorta make sense that we ensure it's bash that we're talking to, to be honest. Anyway, I've reverted that change.
Wow, quite an impressive PR, @joerick!

I do very much like how it looks from the user-side of `linux.py`. It's ... well, one piece of code again, rather than the bash-nested-inside-python. The implementation is quite impressive as well, but also quite heavy, somehow, especially the output-retrieving part of `call`. I'm somehow not entirely convinced there isn't a better solution, and it does make me wonder if it is worth it (I can't immediately see how or where, but if there are problems or errors in there, they will potentially be very obscure?).

But at the same time, all these docker-specific things are now separate from the rest of the code, more general and reusable, and not really worse than what we had before, so why not go for it, actually, and see how it feels once we start using it? :-)
```python
env_assignments = ' '.join(f'{shlex.quote(k)}={shlex.quote(v)}'
                           for k, v in env.items())
command = ' '.join(shlex.quote(str(a)) for a in args)
end_of_message = str(uuid.uuid4())
```
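The quoted snippet above sets up the end-of-message delimiter that the following discussion is about. Here is a minimal, self-contained sketch of the idea, with plain `bash` standing in for the container shell and illustrative names that are not the PR's actual API:

```python
import shlex
import subprocess
import uuid

# One long-lived shell; in the PR this would be `docker exec -i <container> bash`.
# Plain `bash` is used here so the sketch runs anywhere.
bash = subprocess.Popen(
    ['bash'], stdin=subprocess.PIPE, stdout=subprocess.PIPE,
    universal_newlines=True, bufsize=1,
)

def call(args):
    # A fresh UUID per call marks where this command's output ends.
    end_of_message = str(uuid.uuid4())
    command = ' '.join(shlex.quote(str(a)) for a in args)
    bash.stdin.write(command + '\n')
    bash.stdin.write(f'printf "%s\\n" {end_of_message}\n')
    bash.stdin.flush()
    output = []
    while True:
        line = bash.stdout.readline()
        if not line:  # the shell died
            break
        if line.strip() == end_of_message:
            break
        output.append(line)  # each line can be streamed to the console as it arrives
    return ''.join(output)
```

A real implementation also has to cope with command output that doesn't end in a newline, and with relaying the command's exit code; this sketch skips both.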
Could we somehow use `\0` to mark the end of a message? To be fair, I don't know how bash handles this.

At any rate, do we need to generate a UUID each time? I'd think we could just take a fixed string marker?
There's nothing stopping a child process from outputting `\0` and causing havoc.

Normally, when doing this kind of protocol, there'd be a header that specifies the length of the following message; that makes reads really easy. But I wanted to avoid buffering the output on the container side, for memory reasons but mostly so that output still streamed back to the console. During a long build this is very reassuring, or helpful if a command stalls. This solution was inspired by MIME multipart boundaries: basically, a random string that's long enough is unique enough to delimit parts of the data. A UUID is just an easy way to make a unique string.
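For contrast, a minimal sketch (illustrative names) of the length-prefixed framing described above, which makes reads trivial but requires the sender to know, and hence buffer, the full message first:

```python
import io
import struct

def frame(payload: bytes) -> bytes:
    # Length-prefixed framing: a 4-byte big-endian header gives the size
    # of the message that follows.
    return struct.pack('>I', len(payload)) + payload

def read_frame(stream) -> bytes:
    # The reader knows exactly how much to read -- but the sender had to
    # know (and buffer) the whole message up front, which is exactly what
    # this PR avoids so output can stream back to the console.
    (length,) = struct.unpack('>I', stream.read(4))
    return stream.read(length)
```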
I could add some of this context as a comment I guess?
> But I wanted to avoid buffering the output on the container side, for memory reasons but mostly so that output still streamed back to the console - during a long build this is very reassuring, or helpful if a command stalls.
Alright, yes. Good point! I hadn't considered that yet, but this is an important feature, then! :-)
We could still just define our own constant magic end-of-message UUID, no? Is there a reason to introduce randomness and regenerate a new one every time?
I think there's something comforting about using a new UUID each time: it's a string that's never existed in the world before. But practically, err, the most concrete I can get with it is that if it were static and written in code, and then somehow this code file was sent down the pipe, the terminator would appear before the end of the message. Far-fetched, but that's the kind of thing I'm thinking about. A fresh UUID has never existed before, so that kind of thing couldn't happen.
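That failure mode can be made concrete with a small illustration (the fixed marker name here is hypothetical):

```python
import uuid

# A hypothetical fixed marker, as proposed above:
FIXED_MARKER = 'END-OF-MESSAGE'

# If a child process happens to emit the marker itself (say, by printing
# this very source file), a reader that stops at the marker loses output:
stream = f'some build output\n{FIXED_MARKER}\nthe rest is silently lost\n'
received = stream.split(FIXED_MARKER)[0]
assert 'silently lost' not in received

# A UUID generated just before sending cannot occur in data that already
# existed, so the same truncation can't happen:
fresh_marker = str(uuid.uuid4())
assert fresh_marker not in stream
```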
That's a bit far-fetched, but it had completely not crossed my mind! Better safe than sorry, then :-)
Agree that the docker part is kinda... low-level protocolish. I wonder if a few unit tests on that piece in isolation might make us feel better about using it?

Ah yes, that would probably already make quite a difference :-)
Okie doke, this is ready for another review, if you have time @YannickJadoul! I've fixed some of the issues you raised above, and added unit tests for

It got a little... complicated... when I realised that one of the preexisting tests

Anyway, let me know what you think!
Alright! I can't say I've looked at every little detail, because there are a lot of changes and I'm a bit overwhelmed, but I see quite extensive tests. So I think we should just merge it, get used to the new way of doing things, and make smaller sets of changes later, if we notice something that can be improved?
```python
if from_path.is_dir():
    self.call(['mkdir', '-p', to_path])
    subprocess.run(
        f'tar cf - . | docker exec -i {self.name} tar -xC {shell_quote(to_path)} -f -',
```
Just checking, since you were disappointed by the speed of `docker exec`. Using byte-streams now, it's still not possible to do this with `self.call`?
Ooops, I just saw this after the merge. It would be possible, but pretty fiddly to implement. The easiest implementation would involve reading an entire tarball into memory within the host process before streaming it into the container. With a large codebase that might introduce memory issues, and could well be slower. A better implementation would stream between active `tar` processes on either side, but I'm guessing that would involve some pretty gnarly non-blocking I/O code to get it to work right.
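The "easiest implementation" described above -- buffering the tarball in host memory before a single `docker exec` -- might be sketched like this (function names and the container name are illustrative, not the PR's actual API):

```python
import io
import subprocess
import tarfile

def make_tarball(from_path) -> bytes:
    # Build the whole archive in host memory: the memory cost noted above.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode='w') as tar:
        tar.add(from_path, arcname='.')
    return buf.getvalue()

def copy_into_buffered(from_path, container_name, to_path):
    # One docker exec call, with the tarball passed on stdin instead of
    # a host-side `tar | docker exec` shell pipeline.
    subprocess.run(
        ['docker', 'exec', '-i', container_name,
         'tar', '-xC', str(to_path), '-f', '-'],
        input=make_tarball(from_path),
        check=True,
    )
```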
I was curious - here's a profile of the basic test project build -
pyinstrument v3.1.3 · Recorded: 19:54:25 · Samples: 373 · Duration: 35.644 · CPU time: 0.144
```
Program: /Users/joerick/Projects/cibuildwheel/cibuildwheel/__main__.py --platform linux

35.642 <module>  __main__.py:1
└─ 35.599 main  __main__.py:56
   └─ 35.597 build  linux.py:78
      ├─ 25.233 call  docker_container.py:117
      ├─ 3.552 copy_into  docker_container.py:76
      │  └─ 3.541 run  subprocess.py:447
      │     [11 frames hidden]  subprocess
      ├─ 3.184 __enter__  docker_container.py:34
      │  ├─ 2.655 call  docker_container.py:117
      │  └─ 0.520 run  subprocess.py:447
      │     [11 frames hidden]  subprocess
      ├─ 1.854 __exit__  docker_container.py:68
      │  └─ 1.575 wait  subprocess.py:1014
      │     [4 frames hidden]  subprocess
      ├─ 0.682 glob  docker_container.py:106
      │  └─ 0.682 call  docker_container.py:117
      ├─ 0.536 copy_out  docker_container.py:95
      │  └─ 0.536 run  subprocess.py:447
      │     [9 frames hidden]  subprocess
      └─ 0.518 get_environment  docker_container.py:176
         └─ 0.518 call  docker_container.py:117
```
`copy_into` is 10% of the time here. Maybe not worth the hassle.
Might not be, indeed. It just looks out of place; that's why I asked.

Thanks for looking it over, @YannickJadoul!
We've been talking for a while about moving more of the Linux control flow into Python, instead of bash. This PR explores that.
This is a first implementation of that idea. It has a build script that resembles that of macOS or Windows, but runs every command through `docker exec`.

Code-wise, I think this is a nice improvement. It avoids lots of bash quirks and fiddling with strings. Hopefully, it should unlock lots of improvements down the line.
The downside, currently, is performance. Those `docker exec` calls are pretty slow compared to local subprocess calls. Each `docker exec` takes around 150ms on my machine. We'll see how it shakes out in our CI providers, but in my testing it looks like roughly +33% run time on the CI suite.

To be clear, the effects will be much worse on our CI than in any user scenario, because we run hundreds of tiny builds, and users probably have a handful of more substantial ones. But the CI is already pretty tedious, so I'd like to explore options to speed it up before committing to this approach.
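The per-command overhead comes from each build step being its own `docker exec` invocation, roughly like this (function names here are illustrative, not the PR's actual API):

```python
import subprocess

def docker_exec_argv(container_name, args):
    # Build the argv for one `docker exec` invocation: one per build step.
    return ['docker', 'exec', container_name] + [str(a) for a in args]

def call_in_container(container_name, args):
    # Each call pays docker exec's startup overhead (~150ms, per the
    # measurement above), unlike a plain local subprocess call.
    subprocess.run(docker_exec_argv(container_name, args), check=True)
```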
My current thought to improve perf is to keep a long-running shell process in the Docker container, and relay commands to it via open file descriptors. There are a couple of details to figure out, but it should be fairly transparent, I hope.
Comments welcome!