ModuleNotFoundError: No module named 'pkg_resources' #4756

Open
3 tasks
jkotas opened this issue Jan 8, 2025 · 26 comments

Comments

@jkotas
Member

jkotas commented Jan 8, 2025

Build

https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=910866

Build leg reported

ComInterfaceGenerator.Tests.WorkItemExecution

Pull Request

dotnet/runtime#110558

Known issue core information

Fill out the known issue JSON section by following the step by step documentation on how to create a known issue

 {
    "ErrorMessage" : "ModuleNotFoundError: No module named 'pkg_resources'",
    "BuildRetry": false,
    "ErrorPattern": "",
    "ExcludeConsoleLog": false
 }
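
As a rough illustration of how such an entry could be checked against a console log, here is a minimal sketch. It assumes `ErrorMessage` is a plain substring match and `ErrorPattern` a regular expression; `line_matches` is a hypothetical helper, not the actual known-issues matcher.

```python
import json
import re

# The JSON block from this issue, parsed as-is.
KNOWN_ISSUE = json.loads("""
{
    "ErrorMessage": "ModuleNotFoundError: No module named 'pkg_resources'",
    "BuildRetry": false,
    "ErrorPattern": "",
    "ExcludeConsoleLog": false
}
""")


def line_matches(issue: dict, line: str) -> bool:
    """Hypothetical matcher: substring test for ErrorMessage, regex for ErrorPattern."""
    message = issue.get("ErrorMessage") or ""
    pattern = issue.get("ErrorPattern") or ""
    if message and message in line:
        return True
    # An empty pattern is treated as "not set" rather than "match everything".
    return bool(pattern) and re.search(pattern, line) is not None
```

With these assumed semantics, the last line of the tracebacks quoted later in this thread matches, while unrelated import errors do not.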

@dotnet/dnceng

Release Note Category

  • Feature changes/additions
  • Bug fixes
  • Internal Infrastructure Improvements

Release Note Description

Additional information about the issue reported

No response

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=910866
Error message validated: [ModuleNotFoundError: No module named 'pkg_resources']
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 1/8/2025 10:20:43 PM UTC

Report

Summary

24-Hour Hit Count | 7-Day Hit Count | 1-Month Count
0 | 0 | 0

@dougbu
Member

dougbu commented Jan 8, 2025

we're tracking work on this in #4751. we found the problem another way and noticed just yesterday that it seems to be specific to ArmArch Linux Docker containers. we're hoping to move away from the deprecated and now apparently sometimes unavailable package in time to include the fix in our next rollout

am I correct the containers failing in your build are quite infrequently used❓

@jkotas
Member Author

jkotas commented Jan 8, 2025

This is failing on many PRs in dotnet/runtime. The affected containers are used by the default dotnet/runtime CI configuration.

@jkotas
Member Author

jkotas commented Jan 8, 2025

> This is failing on many PRs in dotnet/runtime

You can see it in the stats in the top post.

@jkotas
Member Author

jkotas commented Jan 9, 2025

> it seems to be specific to ArmArch Linux

This affects a number of Linux and macOS variants.

For example, here is a log from macOS x64: https://helixr1107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-111218-merge-42cc8e62a78b4da997/Microsoft.Extensions.Configuration.Tests/1/console.47b8555d.log?helixlogtype=result

@akoeplinger
Member

akoeplinger commented Jan 9, 2025

I think the common factor is having Python 3.12, e.g. if you look at https://helixr1107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-111218-merge-42cc8e62a78b4da997/ComInterfaceGenerator.Tests/1/console.aa96da09.log?helixlogtype=result from the same job as Jan posted above it worked, because the dci-mac-build-133 macOS machine is not using Python 3.12 (I guess because it wasn't updated yet)

@dougbu
Member

dougbu commented Jan 9, 2025

> I think the common factor is having Python 3.12, e.g. if you look at https://helixr1107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-111218-merge-42cc8e62a78b4da997/ComInterfaceGenerator.Tests/1/console.aa96da09.log?helixlogtype=result from the same job as Jan posted above it worked, because the dci-mac-build-133 macOS machine is not using Python 3.12 (I guess because it wasn't updated yet)

I agree. it also seems like the problem is only partially under our direct control. we use pkg_resources in a couple of places but so does the azure package version we rely on. other dependencies seem to have the right try / except setup to fall back when pkg_resources isn't available.
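
For readers unfamiliar with the fallback pattern mentioned above, a minimal sketch follows. It is not taken from any of the dependencies discussed here; `installed_version` is a hypothetical helper, and the choice of `importlib.metadata` as the fallback is an assumption.

```python
from typing import Optional

# pkg_resources ships with setuptools and may be missing from an environment.
try:
    import pkg_resources
except ImportError:  # ModuleNotFoundError is a subclass of ImportError
    pkg_resources = None


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    if pkg_resources is not None:
        try:
            return pkg_resources.get_distribution(dist_name).version
        except pkg_resources.DistributionNotFound:
            return None
    # Fallback: importlib.metadata is in the standard library since Python 3.8.
    from importlib import metadata
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

Code written this way degrades to a standard-library lookup instead of raising `ModuleNotFoundError` at import time, which is the difference between the dependencies that kept working and the ones that did not.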


unless we missed some older builds that failed, problems started just after our rollout began updating queues yesterday. the earliest error I saw had timestamp 2025-01-08T21:20:05.2091700Z and the queue rollout started real work at 2025-01-08T20:45:17.1221544Z (slightly late due to retries of earlier jobs in our pipeline). the rollout picked up new OS packages for Python3 on Linux machines as well as slightly changed Python requirements for our helix-scripts/ code. the second part might have impacted OSX machines

@dougbu
Member

dougbu commented Jan 9, 2025

Python 3.12.8 released on 2024-12-03. it contains a bunch of fixes and dependency updates though nothing obviously linked to pkg_resources. it did upgrade its bundled pip to 24.3.1 but the pip changelog also doesn't include an obvious smoking gun. it also upgraded its libexpat dependency to 2.6.3 but that's written in C

I checked the setuptools changelog as well b/c we let that float as much as the Python and pip versions allow. nothing obvious there either

@dougbu
Member

dougbu commented Jan 10, 2025

we're reverting yesterday's rollout due to the problems discussed in this issue. queues are getting updated as I type this note

@dougbu
Member

dougbu commented Jan 10, 2025

revert is now complete but I don't see a clear signal that this particular problem has been resolved. please let us know

@akoeplinger
Member

Seems to be working again. Though I noticed that e.g. a build on dci-mac-build-108 which failed before is still printing No module named 'pkg_resources', but it seems to not be an error anymore.

https://helixr1107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-111257-merge-ab67e12681af4b7b92/System.Collections.NonGeneric.Tests/1/console.05a00ae9.log?helixlogtype=result

@dougbu
Member

dougbu commented Jan 10, 2025

> Seems to be working again. Though I noticed that e.g. a build on dci-mac-build-108 which failed before is still printing No module named 'pkg_resources', but it seems to not be an error anymore.

that's very interesting. I can only guess why an exception turned into a simple message

@garath garath self-assigned this Jan 13, 2025
@ilyas1974
Contributor

@jkotas with the rollout revert we performed, it appears that the issue has been mitigated (per the telemetry above, no new build failures have been detected in over a week). Is there any reason to continue to keep this issue open?

@jkotas
Member Author

jkotas commented Jan 21, 2025

I agree - this can be closed.

@jkotas jkotas closed this as completed Jan 21, 2025
@garath
Member

garath commented Jan 21, 2025

We're using this issue to track a proper fix to the deployments. (Though of course I may have missed a conversation, so let me know if this was resolved elsewhere.)

@garath garath reopened this Jan 21, 2025
@dougbu
Member

dougbu commented Jan 22, 2025

> We're using this issue to track a proper fix to the deployments. (Though of course I may have missed a conversation, so let me know if this was resolved elsewhere.)

totally agree. the revert seemed to avoid the problem but this issue tracks making sure our next rollout doesn't break Python scenarios again. we're close but not done

@dougbu
Member

dougbu commented Jan 22, 2025

if you care about the details, our Helix wrapping code for Linux machines hit some issues using sudo python:

  1. sudo isn't available in all Docker images
  2. our invocation lost information about the venv most Docker images set up to isolate installations from the system environment. so, we weren't installing the components we needed or expected. setuptools (which contains pkg_resources) was part of this
  3. !46915 avoids the two issues above
  4. however, I need to test on a recent macOS machine to confirm there isn't another problem lurking in our code.
  5. in addition, it's difficult to tell what changed in the affected Docker images and machines leading to the problems. haven't found anything in the Python Changelog that should be related. put another way, the notes about 3.12.8 look innocuous but obviously something somewhere changed — likely a set of overlapping somethings since our revert helped
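
To illustrate the second point in the list above: depending on the sudoers configuration, `sudo python` can resolve to the system interpreter rather than the venv's, so packages land in the wrong place. A minimal sketch of staying tied to the running interpreter is below; it is not the change made in !46915, and both function names are hypothetical.

```python
import sys


def in_virtualenv() -> bool:
    """True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the venv while sys.base_prefix
    # still points at the interpreter the venv was created from.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def pip_install_command(*packages: str) -> list:
    """Build a pip command bound to the interpreter that is currently running."""
    # Using sys.executable instead of a bare "python" or "pip" avoids
    # depending on PATH, which is what gets lost across a sudo boundary.
    return [sys.executable, "-m", "pip", "install", *packages]
```

For example, `pip_install_command("setuptools")` yields a command that installs into the same environment the script is executing in, with or without a venv.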

@dougbu
Member

dougbu commented Jan 22, 2025

Machine dci-mac-build-294 in osx.1200.amd64.open has been disabled for this testing. will re-enable when I'm done

@dougbu
Member

dougbu commented Jan 23, 2025

that machine was unreachable. switched to using dci-macpro-20 in the staging osx.1200.amd64 queue. it's temporarily disabled…

@dougbu
Member

dougbu commented Jan 25, 2025

not quite done w/ testing but I'm pretty sure the remaining No module named 'pkg_resources' message is completely unrelated to the original problem. that run.py file is part of the dotnet/runtime test infrastructure and seems to depend on something conditionally using pkg_resources, emitting a message when it's not found. I suspect the message started showing about when we began using a venv for our use — an effort to isolate our python usage from both the system Python environment and any test actions

overlapping this is the fact our venv use is incomplete b/c we don't re-image our on-premises machines that often

to avoid such messages about pkg_resources (if they indicate an actual problem), you probably need to bump your Python package versions or perhaps make sure you're not relying on our pip installations to provide your dependencies. a venv is probably a great idea for your use case too
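
A minimal sketch of the venv suggestion, using only the standard-library `venv` module. `create_isolated_env` is a hypothetical helper, not part of Helix or dotnet/runtime. Note that since Python 3.12, `venv` no longer pre-installs setuptools, so anything needing pkg_resources has to install setuptools into the new environment explicitly.

```python
import sys
import venv
from pathlib import Path


def create_isolated_env(target: Path, with_pip: bool = True) -> Path:
    """Create a venv at `target` and return the path to its interpreter."""
    # clear=True wipes a pre-existing directory so stale packages from an
    # older Python version cannot leak into the new environment.
    builder = venv.EnvBuilder(with_pip=with_pip, clear=True)
    builder.create(target)
    if sys.platform == "win32":
        return target / "Scripts" / "python.exe"
    return target / "bin" / "python"
```

The returned interpreter can then be used for `-m pip install ...` so test dependencies never rely on whatever the machine image happens to provide.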

@dougbu
Member

dougbu commented Jan 25, 2025

ugh, I was wrong. reporter/run.py sometimes shows up in other Helix console logs

@dougbu
Member

dougbu commented Jan 27, 2025

builds are expiring and getting deleted. here's an example of the original osx.1200.amd64 failure:

+ /usr/local/bin/python3.12 -u /tmp/helix/working/AF510960/w/AAD7090B/u/xharness-event-processor.py
Traceback (most recent call last):
  File "/tmp/helix/working/AF510960/w/AAD7090B/u/xharness-event-processor.py", line 8, in <module>
    from helix.public import request_reboot, request_infra_retry, send_metric, send_metrics
  File "/etc/helix/scripts/helix/public/__init__.py", line 5, in <module>
    import helix.event
  File "/etc/helix/scripts/helix/event.py", line 7, in <module>
    import helix.logs
  File "/etc/helix/scripts/helix/logs.py", line 11, in <module>
    from helix.azure_utils import get_auth_credential
  File "/etc/helix/scripts/helix/azure_utils.py", line 1, in <module>
    from azure.identity import ManagedIdentityCredential, CredentialUnavailableError
  File "/etc/helix/scripts/azure/__init__.py", line 5, in <module>
    import pkg_resources
ModuleNotFoundError: No module named 'pkg_resources'
+ /usr/local/bin/python3.12 /tmp/helix/working/AF510960/p/reporter/run.py https://dev.azure.com/dnceng-public/ public 24025496 eyJ0eXAiOiJKV1QiLCJhbGciOiJSUzI1NiIsIng1dCI6IjdSd2F5dmlYRHFoZnN6MTZSNmxPbXNXWWxTQSJ9.eyJuYW1laWQiOiJjNzczZjJjMi01MTIwLTQyMDctYWZlMi1hZmFmMzVhOGJjMGEiLCJzY3AiOiJhcHBfdG9rZW4iLCJhdWkiOiI2OTY3ODM3OC0yYjMxLTQwZjAtYTZiYi0zMmViOGRkNzMyZTUiLCJzaWQiOiI2YzZlNjYyOC05OTg1LTQ2OTYtYjA1Ny1kMjAxYjhjYjQ1MjkiLCJCdWlsZElkIjoiY2JiMTgyNjEtYzQ4Zi00YWJiLTg2NTEtOGNkY2I1NDc0NjQ5OzkxMTAzMyIsIkRlZklkIjoiMTU0Iiwiam9icmVmIjoiYWY1MjgyMTEtNmMzZi00MDdjLTg1ZTctNDg1ZTA3MjU1YzJkOmEyYTM5ZDZiLTcwYzUtNTA0MC05MzNhLTRiOGI2ZTUyYmYxOSIsInBwaWQiOiJ2c3RmczovLy9CdWlsZC9CdWlsZC85MTEwMzMiLCJvcmNoaWQiOiJhZjUyODIxMS02YzNmLTQwN2MtODVlNy00ODVlMDcyNTVjMmQuYnVpbGQuYnVpbGRfbWFjY2F0YWx5c3RfeDY0X3JlbGVhc2VfYWxsc3Vic2V0c19tb25vLl9fZGVmYXVsdCIsInJlcG9JZHMiOiIiLCJpc3MiOiJhcHAudnN0b2tlbi52aXN1YWxzdHVkaW8uY29tIiwiYXVkIjoiYXBwLnZzdG9rZW4udmlzdWFsc3R1ZGlvLmNvbXx2c286NmZjYzkyZTUtNzNhNy00Zjg4LThkMTMtZDkwNDViNDVmYjI3IiwibmJmIjoxNzM2MzcxNjU2LCJleHAiOjE3MzYzODM2NTZ9.WM0rVtEnKSyuVNaconMyilHoFd8xtqB3880x3PMgRyiWPqZ13fRQHH_R18dT4wos1Wt2625WIcrr1nda_05HUljQ8HV4iHt6Xrd4oEO28KH4RJXOyTobdr7mVov_T0y_D3hMRfB1X4hEcMJZxW7BOpitv49L0PDVgOu2AXAAWlwGoIbibHGYFR5OCJ5RQnanGpSpyrCqe0Ky6fmjwvgCTNuwTdTDfZkRN3JQeLM1573AKBTQ7GDauNV79wHk_DHNUCKT9OfPGsVQM6VA5Z_RG_g-zTIR7a2-B7IhwEtezBYRhDgHsG7KWAevo1CQHqBZZ3byB3PwuxWHmfc4mLHz4g
Traceback (most recent call last):
  File "/tmp/helix/working/AF510960/p/reporter/run.py", line 13, in <module>
    from test_results_reader import read_results
  File "/private/tmp/helix/working/AF510960/p/reporter/test_results_reader/__init__.py", line 3, in <module>
    from helix.public import TestResult, TestResultAttachment
  File "/etc/helix/scripts/helix/public/__init__.py", line 5, in <module>
    import helix.event
  File "/etc/helix/scripts/helix/event.py", line 7, in <module>
    import helix.logs
  File "/etc/helix/scripts/helix/logs.py", line 11, in <module>
    from helix.azure_utils import get_auth_credential
  File "/etc/helix/scripts/helix/azure_utils.py", line 1, in <module>
    from azure.identity import ManagedIdentityCredential, CredentialUnavailableError
  File "/etc/helix/scripts/azure/__init__.py", line 5, in <module>
    import pkg_resources
ModuleNotFoundError: No module named 'pkg_resources'

reporter/run.py and xharness-event-processor.py come from dotnet/arcade. run.py executes unconditionally as the last command in $(HelixPostCommands). xharness-event-processor.py is embedded in the Helix SDK and added to $(HelixPostCommands) when '$(EnableXHarnessTelemetry)' == 'true'. both are run using $HELIX_PYTHON_PATH, which will be the most recent python3.* available on the machine. these commands execute in a slightly different Python environment than the background python3.* processes. those background processes are from dotnet-helix-machines and are not associated with a particular work item
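
To make "the most recent python3.* available" concrete, here is a small sketch of one way such a selection could work. The actual logic that sets $HELIX_PYTHON_PATH is not shown in this thread; `newest_python` is a hypothetical stand-in.

```python
import re
from typing import Iterable, Optional

# Matches versioned executables such as "python3.12", but not bare "python3".
_VERSIONED = re.compile(r"^python3\.(\d+)$")


def newest_python(names: Iterable[str]) -> Optional[str]:
    """Pick the executable name with the highest minor version, if any."""
    best, best_minor = None, -1
    for name in names:
        match = _VERSIONED.match(name)
        # Compare numerically: as strings, "python3.9" sorts after "python3.12".
        if match and int(match.group(1)) > best_minor:
            best, best_minor = name, int(match.group(1))
    return best
```

A consequence of any rule like this is that installing a newer Python on a machine silently changes which interpreter the post-commands run under, even if its packages were never installed for that version.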

I suspect problems causing the user Python environment to be empty (lacking any of our usual dependencies) e.g., Python packages are installed per-version and a version bump can break things. but I searched through all of the logs above to find the macOS failures among the builds that remain. could only find three such logs, all running on just two machines — dci-mac-build-108 and dci-mac-build-147. dci-mac-build-147 seems to be in a sorry state at the moment and I need to file an ICM about it
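
A quick way to check the "empty environment" theory on a given machine is to ask the interpreter where it looks for packages and whether it can see the module. This is only a diagnostic sketch; `describe_environment` is a hypothetical helper.

```python
import importlib.util
import site
import sys
import sysconfig


def describe_environment(module: str = "pkg_resources") -> dict:
    """Summarize where this interpreter looks for packages and whether `module` is visible."""
    return {
        "executable": sys.executable,
        "version": "{}.{}".format(*sys.version_info[:2]),
        # Site-packages locations are specific to the interpreter version.
        "purelib": sysconfig.get_paths()["purelib"],
        "user_site": site.getusersitepackages(),
        # find_spec returns None for a top-level module that cannot be imported.
        "module_found": importlib.util.find_spec(module) is not None,
    }
```

Running this under each python3.* on an affected machine would show whether a newer interpreter simply has no packages installed for it.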

I'm trying to work w/ dci-mac-build-108 today…

dougbu added a commit to dougbu/arcade that referenced this issue Jan 28, 2025
- see dotnet/dnceng#4756
- with this, executing `python3` should not result in `pkg_resources not found` error
  - note direct Python dependencies in dotnet/arcade include only default modules
  - change in Microsoft.DotNet.Build.Tasks.Installers likely less important
@dougbu
Member

dougbu commented Jan 31, 2025

sorry for not reporting back here. we had a total of three problems leading to the No module named 'pkg_resources' failures — the two sudo problems listed near the top of #4756 (comment) plus the macOS issues due to having Python 3.12 or 3.13 on them.

the reason for the problems w/ Python 3.12+ was PEP 668 enforcement, which is particularly stringent in macOS. messing with the system or user Python environments was strongly discouraged and failed unless you removed some files that really shouldn't be touched (EXTERNALLY-MANAGED marker files). we stopped trying to fight Python and switched to brew install python-setuptools when configuring macOS machines
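
For reference, PEP 668 defines the marker as a file named EXTERNALLY-MANAGED in the interpreter's stdlib directory. A minimal sketch for detecting it (rather than deleting it) is below; `externally_managed_marker` is a hypothetical helper and its `stdlib_dir` parameter exists only to make it testable.

```python
import sysconfig
from pathlib import Path
from typing import Optional


def externally_managed_marker(stdlib_dir: Optional[str] = None) -> Optional[Path]:
    """Return the PEP 668 marker path if this environment is externally managed."""
    base = Path(stdlib_dir or sysconfig.get_path("stdlib"))
    marker = base / "EXTERNALLY-MANAGED"
    return marker if marker.is_file() else None
```

When the marker is present, installers such as pip are expected to refuse to modify that environment, so provisioning has to go through the platform's package manager or a venv.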

you'll notice these issues are all long-standing problems that went unnoticed. it's as if something introduced set -e somewhere. I haven't found that but continue to look…
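
To show why a stray set -e would surface long-standing failures, here is a small sketch that runs the same two-line shell script with and without it. The script is a stand-in, not the Helix wrapper, and the example assumes a POSIX `sh` is on PATH.

```python
import subprocess

# A failing command followed by one that would normally still run.
SCRIPT = "false\necho reached"


def run_script(errexit: bool):
    """Run SCRIPT under sh, optionally with `set -e`, and return (exit code, stdout)."""
    body = ("set -e\n" if errexit else "") + SCRIPT
    proc = subprocess.run(["sh", "-c", body], capture_output=True, text=True)
    return proc.returncode, proc.stdout.strip()
```

Without set -e the failure is ignored and the script exits 0 after printing "reached". With it, the script stops at the first failing command and exits non-zero, which is how a previously harmless failed install could turn into a failed work item.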

@jkotas
Member Author

jkotas commented Jan 31, 2025

@dougbu These failures started hitting dotnet/runtime CI heavily again earlier today. #4892 has the details. Could you please take a look?

@dougbu
Member

dougbu commented Jan 31, 2025

new issue likely has a similar root cause to this problem i.e., something I can't find switching set -e on. I see errors about failing chmod commands. is it possible to see if warnings in similar runs yesterday occurred in that test run❓

regardless, I believe we (DNCEng) need to revert our rollout again 😦

@dougbu
Member

dougbu commented Feb 3, 2025

using this issue to track the problems addressed to date. will track the latest problems in #4892

note we need another small PR to include the macOS configuration changes (adding the python-setuptools Homebrew package) in future imaging procedures

@dougbu
Member

dougbu commented Feb 21, 2025

!46915 changes were reverted in !47331 and reintroduced in !47334
