
Gitea Actions FetchTask not reliably assigning queued jobs to idle runners as long as no new jobs are queued #33492

Open
ChristopherHX opened this issue Feb 4, 2025 · 0 comments · May be fixed by #33497
Labels: topic/gitea-actions, type/bug

Comments

@ChristopherHX
Contributor

ChristopherHX commented Feb 4, 2025

Description

  • Start 60 independent parallel act_runners (as opposed to a single runner with parallel jobs enabled)
# docker-compose.yml
services:
  runner:
    build: . # this Dockerfile adds self-signed certs so act_runner is able to connect
    environment:
      - GITEA_INSTANCE_URL= # Your Gitea Instance to register to
      - GITEA_RUNNER_REGISTRATION_TOKEN= # The Gitea registration token
      - GITEA_RUNNER_LABELS=label # The labels of your runner (comma separated)
    user: root
    deploy:
      mode: replicated
      replicas: 60 # <--- this needs to run on a different host than the Gitea server; single-runner setups do not seem to be affected
  • Now create a large 10x10 matrix with 100 jobs at once
on: push
jobs:
  stress-test:
    strategy:
      matrix:
        a: [0,1,2,3,4,5,6,7,8,9]
        b: [0,1,2,3,4,5,6,7,8,9]
      max-parallel: 60
    runs-on: label
    steps:
    - run: ${{ tojson(github) }}
      shell: cat {0}
    - run: uname -a
    - run: ${{ tojson(github) }}
      shell: cat {0}
    - run: sleep 60
    - run: ${{ tojson(github) }}
      shell: cat {0}
    - run: ${{ tojson(runner) }}
      shell: cat {0}
    - run: ${{ tojson(env) }}
      shell: cat {0}
    - run: ${{ tojson(strategy) }}
      shell: cat {0}
    - run: ${{ tojson(job) }}
      shell: cat {0}
    - run: ${{ tojson(needs) }}
      shell: cat {0}
    - run: ${{ tojson(steps) }}
      shell: cat {0}
  • Notice that only 1-10 jobs (randomly) get assigned to runners
    • 90-99 jobs keep waiting
    • once the runners that make progress finish, they set their taskversion to 0 and get a new job
    • the job has a sleep 60 step to keep the working runners busy for some amount of time
  • Notice that the old workflow run might continue to queue new jobs to runners even though it should have been stopped by concurrency = 1
    • this could be a database / request timeout side effect

Observed internal behavior

  • FetchTask might return no job under load, even if jobs are available, instead of returning an error
    • if this happens once, the runner's taskversion gets updated and no pickTask calls happen until the taskversion is incremented by newly queued jobs (see the sketch below)
    • ca. 50 of the 60 runners immediately update their taskversion to the latest even though jobs are still queued
  • in this scenario, "re-run all jobs" returns HTTP 500, probably due to a database / request timeout
  • all other Gitea features keep working
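
To make the suspected stall concrete, here is a minimal, standalone sketch. This is not Gitea's actual code; the names and the version-gate condition are assumptions based on the behavior described above. It shows how a version check in FetchTask can turn one lost race inside pickTask into a runner that never queries for jobs again until something new is queued:

package main

import "fmt"

type fetchResult struct {
	job          string // empty string means "no job for you"
	tasksVersion int64
}

// fetchTask mimics the suspected server flow: pickTask may report "no job"
// when another runner wins the race for the same row, yet the latest tasks
// version is still handed back to the runner, which caches it.
func fetchTask(runnerVersion, latestVersion int64, pickTask func() (string, bool)) fetchResult {
	var job string
	if runnerVersion != latestVersion {
		if j, ok := pickTask(); ok {
			job = j
		}
		// ok == false (lost race) is indistinguishable from "queue is empty"
	}
	return fetchResult{job: job, tasksVersion: latestVersion}
}

func main() {
	latest := int64(7) // bumped when the 100 jobs were queued

	// Poll 1: this runner loses the race inside pickTask to another runner.
	res := fetchTask(0, latest, func() (string, bool) { return "", false })
	fmt.Printf("poll 1: job=%q cached tasksVersion=%d\n", res.job, res.tasksVersion)

	// Poll 2: the runner now sends the cached version; since it matches the
	// latest one, pickTask is never called, even though jobs are still queued.
	res = fetchTask(res.tasksVersion, latest, func() (string, bool) { return "job-42", true })
	fmt.Printf("poll 2: job=%q cached tasksVersion=%d\n", res.job, res.tasksVersion)
}

The second poll returns no job even though one would be available, which matches the observation that roughly 50 of the 60 runners go idle until a newly queued job bumps the version again.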

Workaround

  • patch act_runner to always send taskversion 0 to force-query the database, and set fetchtimeout to 50; then all runners got a queued job assigned (see the sketch after this list)
  • giving the database and Gitea more CPU power and more RAM works as well (tested on an M4 Pro Mac + SQLite)
  • possible alternative (untested): use a single act_runner to delegate resources to other machines
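
As an illustration of the first workaround, here is a minimal, standalone sketch. This is not act_runner's real code; fetchTask, cachedVersion, and the hard-coded version 7 are hypothetical stand-ins for the runner's RPC call and state:

package main

import (
	"fmt"
	"time"
)

// fetchTask stands in for the FetchTask RPC: the server is at version 7 and
// still has queued jobs, but a matching version makes it skip the job query.
func fetchTask(tasksVersion int64) (job string, latest int64) {
	if tasksVersion == 7 {
		return "", 7
	}
	return "job-42", 7
}

func main() {
	var cachedVersion int64
	for i := 1; i <= 3; i++ {
		// Workaround: always send 0 instead of cachedVersion, so the server
		// can never skip its job query because of a stale version match.
		job, latest := fetchTask(0)
		cachedVersion = latest
		fmt.Printf("poll %d: job=%q cachedVersion=%d\n", i, job, cachedVersion)
		time.Sleep(100 * time.Millisecond) // fetch interval, shortened for the demo
	}
}

With the cached version never echoed back, the stale "no new jobs" short-circuit on the server can no longer kick in, at the cost of one extra database query per poll.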

I'm not aware of other reports of this here. I'm still debugging, trying to understand why Gitea sometimes claims that no new job is available even though there are clearly tens of them.

I'm not planning to run these tests against the demo site, to avoid stress-testing its resources.

EDIT
Update: with a MacBook Pro M4 as the Gitea server I got 20 parallel jobs using SQLite during debugging.
Now I need to set breakpoints to find out why FetchTask returns no error and no job.

EDIT
Need to collect more information...

EDIT
The first edit is obsolete; a more powerful device solves this problem as well.
So the good path works perfectly fine, but there must be a bad path with degraded performance.

EDIT
A possible root cause is here: concurrent job assignments are reported as "no more jobs" instead of an error, in line 320:

if n, err := UpdateRunJob(ctx, job, builder.Eq{"task_id": 0}); err != nil {
	return nil, false, err
} else if n != 1 {
	// another runner claimed this job concurrently, but this is reported as "no job" rather than being retried
	return nil, false, nil
}
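
One direction a fix could take (a sketch only; this is not necessarily what #33497 does, and candidateJobs/createTaskFromJob are simplified stand-ins for the surrounding Gitea code) is to iterate over the remaining candidate jobs instead of giving up after the first lost race, so that only a genuinely empty queue is reported as "no job":

for _, job := range candidateJobs {
	if n, err := UpdateRunJob(ctx, job, builder.Eq{"task_id": 0}); err != nil {
		return nil, false, err
	} else if n != 1 {
		// another runner claimed this job concurrently; try the next candidate
		continue
	}
	// claimed successfully; build the task for this job (stand-in helper)
	return createTaskFromJob(ctx, runner, job)
}
return nil, false, nil // the queue is genuinely empty, only now report "no job"

Alternatively, the lost race could be surfaced to the caller as a retryable condition; either way, "lost a race" and "queue is empty" should not look identical to the runner.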

Gitea Version

1.23.1

Can you reproduce the bug on the Gitea demo site?

No

Log Gist

No response

Screenshots

No response

Git Version

No response

Operating System

ubuntu 22.04 arm64

How are you running Gitea?

Docker image on a Raspberry Pi 4 8GB; depending on database performance, more parallel runners might be needed to see something similar.

Database

MySQL/MariaDB

@kemzeb added the topic/gitea-actions label Feb 4, 2025