Deterministic builds #239

Closed
indygreg opened this issue Dec 26, 2019 · 20 comments · Fixed by #256

Comments

@indygreg

I think cibuildwheel should strive to be deterministic in its behavior. i.e. if you run cibuildwheel tomorrow, it will have identical behavior to today. Put another way, the user expectation of cibuildwheel is that it only changes in significant ways when its version is updated. Having deterministic behavior makes CI and release pipelines more predictable and reproducible. This reduces frustration and is better from a security perspective.

Fully deterministic output is hard to achieve. Especially when you don't control the base VM image being executed on. But this doesn't mean cibuildwheel shouldn't strive to be deterministic wherever possible.

One of the areas where cibuildwheel isn't deterministic today is downloading 3rd party dependencies.

For example, installing pip on macOS always retrieves the latest stable version of pip (https://github.com/joerick/cibuildwheel/blob/651f6a9172020aa9a2b0c9eb50dfca06d865ace4/cibuildwheel/macos.py#L35). A better solution here is to fetch an explicit version of get-pip.py from e.g. https://github.com/pypa/get-pip/raw/309a56c5fd94bd1134053a541cb4657a4e47e09d/get-pip.py (corresponds to pip 19.2.3).

Another example of non-deterministic behavior is with pip install. I think cibuildwheel should be pinning versions universally (ideally with hashes for additional security protections). Otherwise, the exact installed package version could vary over time. An example where versions aren't being pinned is https://github.com/joerick/cibuildwheel/blob/651f6a9172020aa9a2b0c9eb50dfca06d865ace4/cibuildwheel/linux.py#L89 and https://github.com/joerick/cibuildwheel/blob/651f6a9172020aa9a2b0c9eb50dfca06d865ace4/cibuildwheel/windows.py#L144.

Is the cibuildwheel project receptive to making behavior more deterministic (and secure) by making downloads (and possibly other behavior) more deterministic?

@joerick
Contributor

joerick commented Dec 27, 2019

Hi @indygreg ! Thanks for your proposal. I definitely agree with the spirit of the suggestion - deterministic software is a good thing™. This was something that I always thought "Oh, we'll get to v1.0 and then think about it", but perhaps because we haven't had many reports of dependencies breaking things it's fallen by the wayside.

This is where we are now, as I see it

                      ┌─────────────────┐
 Now                  │                 │
 cibuildwheel v1.1    │      PyPI       │
                      │                 │
                      └─────────────────┘
                               ▲
                               │
                               ▼
                      ┌─────────────────┐
                      │                 │
                      │       pip       │
                      │                 │
                      └─────────────────┘
┌────────────────────────────────────────┐
│┌──────────────────────────────────────┐│
││                                      ││ Versions locked
││             cibuildwheel             ││ to this version
││                                      ││ of cibuildwheel
│└──────────────────────────────────────┘│
│            ┌───────────────────────────┘
│┌──────────┐│ ┌──────────┐  ┌──────────┐
││          ││ │setuptools│  │ External │
││  Python  ││ │  wheel   │  │  build   │
││          ││ │virtualenv│  │toolchain │
│└──────────┘│ └──────────┘  └──────────┘
└────────────┘
 ┌──────────────────────────────────────┐
 │                                      │
 │               CI image               │
 │                                      │
 └──────────────────────────────────────┘

This is where we could be by specifying versions of as much as possible...

                      ┌─────────────────┐
 “Version as much     │                 │
 as we can”           │      PyPI       │
                      │                 │
                      └─────────────────┘
                               ▲
                     ┌─────────┼─────────┐
                     │         ▼         │
                     │┌─────────────────┐│
                     ││                 ││
                     ││       pip       ││
                     ││                 ││
                     │└─────────────────┘│
┌────────────────────┘                   │
│┌──────────────────────────────────────┐│
││                                      ││
││             cibuildwheel             ││
││                                      ││
│└──────────────────────────────────────┘│
│                          ┌─────────────┘
│┌──────────┐  ┌──────────┐│ ┌──────────┐
││          │  │setuptools││ │ External │
││  Python  │  │  wheel   ││ │  build   │
││          │  │virtualenv││ │toolchain │
│└──────────┘  └──────────┘│ └──────────┘
└──────────────────────────┘
 ┌──────────────────────────────────────┐
 │                                      │
 │               CI image               │
 │                                      │
 └──────────────────────────────────────┘

One thing I'm less sure of now is the merits of pinning the versions of things like pip. pip seems to present itself as more of an 'evergreen' piece of software, and it's very eager for users to upgrade with messages like:

You are using pip version 19.0.3, however version 19.3.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

It relies on APIs from PyPI, who since the rewrite, should be keeping things backward-compatible(?), but the big breaking changes to that ecosystem seem to happen when security fixes require updates - I think SSL updates were a big one. So I'd be curious if anyone has any experience/insight into pinning pip and how that worked in a build system.


On an implementation note, this could present a maintenance burden, unless we can figure out a clever way to bump the versions to their latest before release? Could pip 'constraint' files help us here?

@YannickJadoul
Member

Just my 2 cents: currently, the nice thing is that we don't have to release a new version each time some dependencies get updated, and cibuildwheel is "just" the thing that pulls everything together. How easy/possible would it be to somehow control this from the outside (i.e., pin versions/constraints on the versions as options to cibuildwheel)? We can still discuss sensible defaults, then, but it would on the one hand hand over control to the user, and on the other hand not put an extra burden of maintenance on the cibuildwheel maintainers (hoping that at one point or the other, cibuildwheel will +- be stable in functionality?).

@indygreg
Author

There are compelling arguments that can be made that cibuildwheel should support both a pinned and latest mode. Different people have different perspectives here. (Those who care about reproducibility and security tend to prefer the pinned mode and those who want things to just work with minimal effort tend to prefer latest.) Of course, supporting both modes of execution makes the source code more complex and harder to maintain.

One potential strategy here is to check in 2 versions of a pip requirements file: one listing just the packages we require and another expanded to contain versions and hashes. For all of my Python projects these days, I use pip-compile from pip-tools (https://pypi.org/project/pip-tools/) and maintain both an input requirements file and an output one. See e.g. https://github.com/indygreg/python-zstandard/blob/master/ci/requirements.in and https://github.com/indygreg/python-zstandard/blob/master/ci/requirements.txt. At run-time, cibuildwheel could consume either the minimal or expanded requirements file, depending on whether deterministic mode is enabled.

That would leave pip and associated packaging tools to be versioned manually. Unfortunately, I don't see a way to solve this problem that doesn't involve adding version-specific URLs (and maybe hashes) to cibuildwheel's source code. The good news is that get-pip.py/pip embeds a copy of setuptools and wheel, so the only thing you need to pin on Python 3 is get-pip.py. (On Python 2, you also need to pin a virtualenv package. But since you presumably won't care about Python 2 that much longer...)

I'll throw out Mercurial's Linux (https://www.mercurial-scm.org/repo/hg/file/5685ce2ea3bf/contrib/automation/hgautomation/linux.py#l40) and Windows (https://www.mercurial-scm.org/repo/hg/file/5685ce2ea3bf/contrib/install-windows-dependencies.ps1) CI bootstrap scripts for examples of how we (hopefully deterministically) bootstrap the CI environment.

tl;dr I think pinning get-pip.py and using something like pip-compile to manage pip requirements files would get cibuildwheel most of the way towards deterministic behavior and would placate the concerns that caused me to file this issue. As for whether that should be the default or only behavior is a decision left up to the maintainer(s).

@YannickJadoul
Member

Those who care about reproducibility and security tend to prefer the pinned mode

@indygreg, could you quickly explain how this is related to security? I'm not a big expert on version and dependency management, but I thought getting the latest version results in plugged security risks?

I quite like @joerick's idea of constraint files: https://pip.pypa.io/en/stable/user_guide/#constraints-files. In this way, we could have a new option (CIBW_PIP_CONSTRAINTS or so?) and pass that to the pip install commands; that would give 3 scenarios: latest (i.e., no constraint file), maybe some included working constraints file we can update with new releases, or a custom one specified by the user? Any thoughts on that?

That doesn't solve get-pip.py, though, but if anything, get-pip.py should be pretty stable, I'd expect? So either way, keeping the latest version or pinning it shouldn't matter too much.

I'll ping @mayeut as well, who always seems to be the first one to notice Python versions need to be updated, and might have something to add here? :-)
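
A minimal sketch of the three constraint scenarios above (latest, a bundled constraints file, or a user-supplied one). CIBW_PIP_CONSTRAINTS is only the tentative option name floated in this comment, and resolve_constraints, pip_install_command, the "bundled" keyword and the file paths are hypothetical illustrations rather than cibuildwheel's actual API:

```python
import sys
from typing import List, Optional


def resolve_constraints(option_value: Optional[str], bundled_file: str) -> Optional[str]:
    """Map a hypothetical CIBW_PIP_CONSTRAINTS value to a constraints file path.

    Unset or empty -> None (install latest); "bundled" -> the file shipped
    with the tool; anything else is treated as a user-supplied path.
    """
    if not option_value:
        return None
    if option_value == "bundled":
        return bundled_file
    return option_value


def pip_install_command(packages: List[str], constraints_file: Optional[str] = None) -> List[str]:
    """Build a pip install argv, adding `-c FILE` only when constraints apply."""
    command = [sys.executable, "-m", "pip", "install"]
    if constraints_file is not None:
        command += ["-c", constraints_file]
    return command + list(packages)


# Example: a user-supplied constraints file (hypothetical path).
constraints = resolve_constraints("ci/constraints.txt", "bundled-constraints.txt")
print(pip_install_command(["setuptools", "wheel"], constraints))
```

Which of the three scenarios should be the default is left open here, as it is in the discussion.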

@indygreg
Author

Security concerns:

  • Content tampering attacks. If a remote resource is compromised (say a bad actor uploads a malicious version of a Python package to PyPI), you may download and use the malicious version automatically if you are not pinned to a specific, known good version.
  • MITM attacks. If you don't verify content integrity (often by checking hashes), a man-in-the-middle attack can replace content and you could run a malicious version of software.

The fact that software like cibuildwheel downloads files like get-pip.py without content verification makes software like get-pip.py an extremely lucrative attack target. If I were a bad actor, knowing that lots of the Python ecosystem downloads get-pip.py without content verification, if I could make a malicious version of that file/URL available, I would be able to poison a large portion of the Python ecosystem and do some real damage.

You are also correct that pulling the latest version of, say, pip can result in being patched against known security issues. But this assumes you are pulling a good version of that dependency and requires various assumptions about trust to hold. My ideal best practice is to fetch deterministic content (either by vendoring in the local repository or verifying SHA-256 hashes on download) and update dependencies frequently to pull in important updates. But again, different people have different requirements.
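
A minimal sketch of the "verify SHA-256 hashes on download" idea, assuming the expected digest is recorded in source control next to the code that performs the download. verify_sha256 is a hypothetical helper name, and the payload below is a stand-in rather than a real get-pip.py:

```python
import hashlib


def verify_sha256(content: bytes, expected_hex: str) -> bytes:
    """Return content unchanged if its SHA-256 digest matches, else raise."""
    actual_hex = hashlib.sha256(content).hexdigest()
    if actual_hex != expected_hex.lower():
        raise ValueError(
            f"SHA-256 mismatch: expected {expected_hex}, got {actual_hex}"
        )
    return content


# Usage with a stand-in payload; a real caller would pass the downloaded
# bytes and a digest that was pinned ahead of time.
payload = b"print('hello')\n"
pinned_digest = hashlib.sha256(payload).hexdigest()
assert verify_sha256(payload, pinned_digest) == payload
```

Because the check runs before anything is executed, a tampered or unexpectedly updated file fails loudly instead of silently changing the build.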

@joerick
Contributor

joerick commented Jan 1, 2020

Thanks both.

I quite like @joerick's idea of constraint files: https://pip.pypa.io/en/stable/user_guide/#constraints-files. In this way, we could have a new option (CIBW_PIP_CONSTRAINTS or so?) and pass that to the pip install commands; that would give 3 scenarios: latest (i.e., no constraint file), maybe some included working constraints file we can update with new releases, or a custom one specified by the user? Any thoughts on that?

I like this design too. I was also dinking around with get-pip, and it seems that supports constraint files too!

(env) joerick@joerick2 /tmp> cat constraints.txt
pip==19.2.0
(env) joerick@joerick2 /tmp> curl https://bootstrap.pypa.io/get-pip.py | python3 - -c constraints.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1734k  100 1734k    0     0  1227k      0  0:00:01  0:00:01 --:--:-- 1227k
Collecting pip==19.2.0
  Using cached https://files.pythonhosted.org/packages/3a/6f/35de4f49ae5c7fdb2b64097ab195020fb48faa8ad3a85386ece6953c11b1/pip-19.2-py2.py3-none-any.whl
Collecting wheel
  Using cached https://files.pythonhosted.org/packages/00/83/b4a77d044e78ad1a45610eb88f745be2fd2c6d658f9798a15e384b7d57c9/wheel-0.33.6-py2.py3-none-any.whl
Installing collected packages: pip, wheel
  Found existing installation: pip 19.0.3
    Uninstalling pip-19.0.3:
      Successfully uninstalled pip-19.0.3
Successfully installed pip-19.2 wheel-0.33.6

The relevant bit is curl https://bootstrap.pypa.io/get-pip.py | python3 - -c constraints.txt. In fact you can pass any pip install option to that script. Quite magical, really! I suppose it uses pip to install pip.

At that point, the only thing that isn't pinnable is get-pip itself - I don't know if that's possible now, but there's an open issue for verification via hash: pypa/get-pip#41.

But I think that gets us to a point where all the software running in cibuildwheel's process is pinned. The question I'm wondering is - is it materially different security-wise if get-pip is installed from bootstrap.pypa.io or from a github raw URL? Personally, I'd rather stick with pypa's URL since it's designed to do exactly this, but there might be another consideration I'm missing?

@indygreg
Author

indygreg commented Jan 1, 2020

If you do go with a constraints file, I highly recommend populating it with SHA-256 hashes so content is verified. Again, I recommend pip-compile from pip-tools to generate these files, as it will find all potential files on PyPI and add the hashes for each of them.

As for pinning get-pip itself, you could link to a URL like https://github.com/pypa/get-pip/raw/309a56c5fd94bd1134053a541cb4657a4e47e09d/get-pip.py, which is deterministic. But this uses GitHub as a static content server. And I know GitHub can frown upon this. (They've temporarily banned projects sending too much traffic their way, although I'm not sure if they still do this.) I think our best bet is to ask the get-pip project to expose an immutable version of get-pip.py on their CDN (or whatever) every time they make a release, as this would be more official and would presumably be designed to scale.

@joerick
Contributor

joerick commented Jan 1, 2020

Another idea, given that we'll probably be dropping Python 2.7 soon - to use ensurepip instead of get-pip - I think you've looked at this previously @YannickJadoul? According to the docs there's a bundled version of pip in every distribution of CPython, which can be installed with this command.
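
For illustration, a sketch of invoking ensurepip from Python. The ensurepip module and its --upgrade flag are part of the CPython standard library, and per its documentation it installs the bundled pip without accessing the internet. The helper names are hypothetical, and since some Linux distributions disable ensurepip for their system Python, its availability is an assumption to verify per platform:

```python
import subprocess
import sys
from typing import List


def ensurepip_command(python: str = sys.executable, upgrade: bool = False) -> List[str]:
    """Build the argv that installs the pip bundled with the given interpreter."""
    command = [python, "-m", "ensurepip"]
    if upgrade:
        # --upgrade makes sure the installed pip is at least as recent as
        # the version bundled with this interpreter.
        command.append("--upgrade")
    return command


def install_bundled_pip(python: str = sys.executable) -> None:
    """Run ensurepip and raise CalledProcessError if it fails."""
    subprocess.run(ensurepip_command(python), check=True)


print(ensurepip_command())
```

The pip version installed this way is tied to the CPython build, so it is only as deterministic as the Python installation itself.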

@joerick
Contributor

joerick commented Jan 1, 2020

And yes, pip-compile looks good! I think that it will generate constraint files, because the format is the same as requirement files.

Example:

(env) joerick@joerick2 /tmp> cat constraints.in                       
pip
setuptools
(env) joerick@joerick2 /tmp> pip-compile --no-header --allow-unsafe --generate-hashes --upgrade constraints.in

# The following packages are considered to be unsafe in a requirements file:
pip==19.3.1 \
    --hash=sha256:21207d76c1031e517668898a6b46a9fb1501c7a4710ef5dfd6a40ad9e6757ea7 \
    --hash=sha256:6917c65fc3769ecdc61405d3dfd97afdedd75808d200b2838d7d961cebc0c2c7
setuptools==44.0.0 \
    --hash=sha256:180081a244d0888b0065e18206950d603f6550721bd6f8c0a10221ed467dd78e \
    --hash=sha256:e5baf7723e5bb8382fc146e33032b241efc63314211a3a120aaa55d62d2bb008
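
As a rough sketch, output like the above could be checked before use to confirm that every entry is an exact pin with at least one hash. parse_pins is a hypothetical helper that only handles the simple `name==version` plus `--hash=sha256:...` layout shown here, not the full requirements file grammar:

```python
import re
from typing import Dict, List, Tuple

PIN_RE = re.compile(r"^([A-Za-z0-9._-]+)==(\S+)")
HASH_RE = re.compile(r"--hash=sha256:([0-9a-f]{64})")


def parse_pins(text: str) -> Dict[str, Tuple[str, List[str]]]:
    """Return {package: (version, [sha256 hashes])}, raising on unpinned entries."""
    # Join backslash-continued lines so each requirement is one logical line.
    logical = text.replace("\\\n", " ")
    pins: Dict[str, Tuple[str, List[str]]] = {}
    for line in logical.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = PIN_RE.match(line)
        if match is None:
            raise ValueError(f"not an exact pin: {line!r}")
        hashes = HASH_RE.findall(line)
        if not hashes:
            raise ValueError(f"no sha256 hash for: {line!r}")
        pins[match.group(1)] = (match.group(2), hashes)
    return pins


# Raw string so the trailing backslashes are kept as written in the file.
sample = r"""
pip==19.3.1 \
    --hash=sha256:21207d76c1031e517668898a6b46a9fb1501c7a4710ef5dfd6a40ad9e6757ea7 \
    --hash=sha256:6917c65fc3769ecdc61405d3dfd97afdedd75808d200b2838d7d961cebc0c2c7
"""
print(parse_pins(sample)["pip"][0])
```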

@YannickJadoul
Member

Another idea, given that we'll probably be dropping Python 2.7 soon - to use ensurepip instead of get-pip - I think you've looked at this previously @YannickJadoul?

Yeah, I did try to use ensurepip at some point, but that failed for some reason. I must say I forgot why, though ;-) Something tells me it wasn't just 2.7, but maybe also the distribution of nuget? Anyway, worth a try again.

On the security issue: I do appreciate the concerns here; security is way too often overlooked. But to some degree, downloading over HTTPS should already give us sóme authentication, no? Let's definitely add hashes if it's that easy, but I'm not sure I'm very worried about bootstrap.pypa.io/get-pip.py being compromised. (If that happens, I'm sure we have bigger issues than cibuildwheel :-( )

@joerick
Contributor

joerick commented Jan 2, 2020

I think I agree - and even without https://bootstrap.pypa.io/get-pip.py being pinned we can still claim to have deterministic builds I believe. What do you think @indygreg ?

@indygreg
Author

indygreg commented Jan 4, 2020

In order to claim determinism, you must verify hashes of every asset downloaded from the Internet. That includes get-pip.py.

As for the security of HTTPS, the encryption protects against man-in-the-middle tampering. But the verification of the endpoint (CA validation) is to ensure that the certificate presented by the remote server was signed by a trusted root certificate authority. Nearly everybody in open source from Linux distributions to Python uses a set of trusted root certificates maintained by Mozilla (https://www.mozilla.org/en-US/about/governance/policies/security-group/certs/). This set of certificates is optimized for user convenience of people browsing the web. i.e. people want their browser to recognize root CAs being used to sign popular sites. In this root certificate list are some CAs maintained by or under the influence of some governments with questionable track records. If these governments/CAs wanted to, they could issue a new certificate for pretty much any hostname and clients would validate the certificate without issue since it chains up to a trusted root CA. There are some controls/mitigations in place to prevent this. But the trusted root CA system is intrinsically based on (delegated) trust of CAs and if a CA does bad things, security is compromised. A chain is only as strong as its weakest link, etc.

While root CA verification might be good enough for most, if you really want to be secure, you need to do something more, like verify content integrity by checking against hashes (and hey - determinism is also a useful property to have) or pin the certificate hash of the server you are connecting to. This is what robust, must-be-secure software does. e.g. Firefox's update mechanism pins the certificate hash of the Mozilla operated update server and verifies the SHA-256 of content downloaded from a CDN because neither the trusted root CA store nor a CDN can be ultimately trusted.

As for other options, using ensurepip might be viable for Python 3 - just as long as it doesn't connect to the Internet and download the latest versions of things (again: you want things to be deterministic). You could also distribute a copy of get-pip.py with cibuildwheel and execute a local file instead of downloading one from the Internet.
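
A small sketch of the "execute a local file" option, assuming a copy of get-pip.py is shipped alongside the tool and that, as shown earlier in the thread, get-pip.py forwards pip install options such as -c. bootstrap_pip_command and the paths are hypothetical:

```python
import sys
from pathlib import Path
from typing import List, Optional


def bootstrap_pip_command(get_pip_script: Path, constraints_file: Optional[Path] = None) -> List[str]:
    """Build the argv that runs a local get-pip.py, optionally with constraints."""
    if not get_pip_script.is_file():
        # Fail early instead of falling back to a network download.
        raise FileNotFoundError(f"vendored get-pip.py not found: {get_pip_script}")
    command = [sys.executable, str(get_pip_script)]
    if constraints_file is not None:
        command += ["-c", str(constraints_file)]
    return command
```

Note that running the script locally only removes the get-pip.py download; the script still installs pip itself via pip, so pinned and hashed constraints are still needed for that step.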

@Czaki
Contributor

Czaki commented Jan 9, 2020

@indygreg The problem with pinning the pip version is that it forces users to wait for a new release of cibuildwheel even after a bug in pip has been fixed. You can still pin the version of pip using CIBW_BEFORE_BUILD.

If we think about pinning versions, there are many more things to be pinned:

  • manylinux images
  • wheel package (still waiting on new release which can change a lot)
  • setuptools (windows)
  • virtualenv
  • delocate

But pinning all of these versions forces frequent new releases and splits development across two branches.

EDIT: on quay.io there is no option to get previous manylinux docker images. They only use the tag latest.

@indygreg
Author

Yes, pinning requires new cibuildwheel releases to pull in newer versions. I see a few mitigations for this:

  1. Make the pinning optional. Give people the option of using a deterministic mode that uses the pinned version of the requirements file versus a non-pinned version.
  2. Allow users to supply their own requirements file to use. That way they can update versions at their leisure. This would get cibuildwheel out of the critical path of adopting new versions, although it opens the door to version incompatibility, e.g. a user could install a version of a dependency that is newer or older than what cibuildwheel is compatible with.

@Czaki
Contributor

Czaki commented Jan 12, 2020

There is an option to pin requirements: the CIBW_BEFORE_BUILD step, where the user can add a command such as pip install wheel==0.33.6 delocate==0.8.0 etc.

However, there is no option to pin manylinux docker images, for example: https://quay.io/repository/pypa/manylinux2010_x86_64?tab=tags. They do not provide tags for previous versions, and they modify the image often: https://quay.io/repository/pypa/manylinux2010_x86_64?tab=history.

So it is not easy to pin everything for all systems. Or maybe you have a suggestion for what could be done better than pip install package==version, and why CIBW_BEFORE_BUILD is not enough?

@joerick
Contributor

joerick commented Jan 14, 2020

@Czaki BEFORE_BUILD isn't enough because it is already using versions of pip/setuptools that were installed from latest. Plus it doesn't affect the test environment.

Vendoring a get-pip.py would be a neat solution, I think.

The manylinux images are a concern, though. If we can't pin to a specific version of those, Linux will never have determinism. I guess we'll need to ping the pypa folks and ask if they can start tagging their images on each release.

@Czaki
Contributor

Czaki commented Jan 14, 2020

BEFORE_BUILD: pip install pip==19.3.0 setuptools==42.0.0; pip install something is not enough?
Tagging manylinux images would be nice.

So if there is a decision to pin versions, will there also be a decision to use a master/develop model,
so that dependency versions can be bumped easily without shipping new features?

EDIT:
Another option for manylinux is to manually create a copy and tag it (create a dummy Dockerfile and your own account on a registry such as Docker Hub).

The problem I can see with pinned manylinux images is the CentOS repositories. If you need to install anything from them, you should not use a pinned version of manylinux. Or you can create your own image with a Dockerfile and then have everything pinned.

@YannickJadoul
Member

@Czaki BEFORE_BUILD isn't enough because it is already using versions of pip/setuptools that were installed from latest. Plus it doesn't affect the test environment.

Vendoring a get-pip.py would be a neat solution, I think.

Yeah, I'm assuming get-pip.py won't change that often, and after that we still update pip, right (which is solved by the constraint files solution)? So I don't mind vendoring get-pip.py, I think. It ís 1.7MB, though, so I'm not sure our git history will appreciate this :-/

The manylinux images are a concern, though. If we can't pin to a specific version of those, Linux will never have determinism. I guess we'll need to ping the pypa folks and ask if they can start tagging their images on each release.

Good thing is: we already have support for this anyway :-) CIBW_MANYLINUX_..._IMAGE would allow specifying a docker tag :-)

BEFORE_BUILD: pip install pip==19.3.0 setuptools==42.0.0; pip install something is not enough?
Tagging manylinux images would be nice.

If I understood correctly, @joerick means that by that time, you already have a new version of pip installed through get-pip.py. So this could downgrade/pin the version of pip, but to do so, you would still be installing a newer version.

So if there is a decision to pin versions, will there also be a decision to use a master/develop model,
so that dependency versions can be bumped easily without shipping new features?

That sounds a bit tricky to me. It would be good to finally get #156 merged, but eh, I don't know.
Would this be possible with pip constraint files, though?

EDIT:
Another option for manylinux is to manually create a copy and tag it (create a dummy Dockerfile and your own account on a registry such as Docker Hub).

Good point; I like this idea! And again, this would already be supported using the options for custom manylinux images. So we'd just need to document this! :-)

The problem I can see with pinned manylinux images is the CentOS repositories. If you need to install anything from them, you should not use a pinned version of manylinux. Or you can create your own image with a Dockerfile and then have everything pinned.

@Czaki
Contributor

Czaki commented Jan 16, 2020

@joerick

Plus it doesn't affect the test environment.

PR #242

That sounds a bit tricky to me. It would be good to finally get #156 merged, but eh, I don't know.
Would this be possible with pip constraint files, though?

Without this it may be hard to quickly bump the pinned versions of packages.

@joerick
Contributor

joerick commented May 2, 2020

Deterministic builds was released in v1.4.0!
