
mirrorbits for linux distributions (rapidly changing metadata files) #85

Open
stormi opened this issue Nov 5, 2018 · 12 comments · May be fixed by #188

Comments

@stormi

stormi commented Nov 5, 2018

As discussed on IRC, in the context of a Linux distribution, repository metadata files can change quite often, and each change can cause a delay during which no mirror is able to serve those files (unless you provide at least one mirror that syncs instantly).

It can also happen that a user with a slightly older repository metadata cache tries to install a package from the mirrors and gets an error, because the file no longer exists in mirrorbits' local reference and there is no grace delay to let the cache expire (usually a few hours). It might be preferable to let the request reach one of the mirrors that have not synced yet and still have the file.

A few leads (may contain very bad ideas!):

  • an option to "remember" deleted files for a while and keep serving them if mirrors still have them, plus accepting to serve an old version of some files (repomd.xml and repomd.xml.asc for yum repositories, for example) if no mirror (or no mirror close enough) has the new version?
  • select a mirror first (accepting mirrors that have synced recently enough, even if they don't have the latest files), then serve files from that mirror for the subsequent requests?
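As a sketch of the first lead, remembering deleted files for a grace period could look like the snippet below. This is purely illustrative, not mirrorbits code; the class and method names are made up:

```python
import time

class DeletedFileCache:
    """Grace-period cache for the first lead above: remember files that
    disappeared from the local reference, so requests for them can still
    be routed to mirrors that haven't purged them yet."""

    def __init__(self, grace_seconds=4 * 3600, now=time.time):
        self.grace_seconds = grace_seconds
        self._now = now                 # injectable clock, for testing
        self._deleted = {}              # path -> deletion timestamp

    def mark_deleted(self, path):
        self._deleted[path] = self._now()

    def may_still_serve(self, path):
        """True while the deletion is within the grace window."""
        ts = self._deleted.get(path)
        return ts is not None and self._now() - ts < self.grace_seconds
```

The grace window would typically match the metadata cache lifetime on clients (a few hours, as mentioned above).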
@PalinuroSec
Contributor

ping
any updates?

@ott
Contributor

ott commented Sep 17, 2022

Perhaps it would be best if the origin server also kept old files for as long as it expects clients to still hold outdated metadata. This would remove the burden from the mirrors and the download redirector.

elboulangero added a commit to elboulangero/mirrorbits that referenced this issue Oct 11, 2023
With this setting, files are allowed to be outdated on the mirrors, for
at most MaxOutdated minutes. The filesize check is also disabled for
those files.

Use-case: for a Debian-like distribution, the metadata (i.e. the directory /dists) is updated in-place, so we must give time for mirrors to sync, and then for mirrorbits to become aware of the changes.

Otherwise, as soon as the source is updated and scanned, mirrorbits goes into fallback mode for all the files under /dists, since at this point either the mirrors didn't sync yet, or they did but mirrorbits is not aware of it yet (as the interval between mirror scans is longer than the interval between source scans).

Cf. etix#85 for more details.
@elboulangero
Contributor

@stormi I'd be curious to know if/how you solved it for XCP-ng. I've been looking at the same issue for Kali Linux. Quoting what you said at the time:

an option to "remember" deleted files for a while and keep serving them if mirrors still have them + accepting to serve an old version of some files (repomd.xml and repomd.xml.asc for yum repositories for example) if no mirror (or no mirror close enough) has the new version?

For Kali, we don't need to « "remember" deleted files for a while », in the sense that we solve this issue with reprepro, the tool that generates the repository. When a package is replaced by a new one, we keep the old package around in the archive for a few days. Hence this problem doesn't need to be solved at the Mirrorbits level.

However, I definitely observed the issue that you mentioned with metadata (the files that are updated in place). When we push an update of the repo, Mirrorbits will be quickly aware of the new version of the metadata files, and since it didn't rescan the mirrors yet (and maybe the mirrors didn't even sync anyway), it can't redirect, and it goes in fallback mode for those files.

« accepting to serve an old version of some files » seems to be a good solution for Kali, so I implemented this feature in #147.

@stormi
Author

stormi commented Oct 11, 2023

@elboulangero no, we haven't solved it. Thankfully, the metadata on non-testing repositories doesn't change often, so issues are rare.

@elboulangero
Contributor

@stormi I'd like to improve the MR #147 so that it would work for RPM repos as well.

As I said quickly above, the idea with this MR is to tell mirrorbits to accept serving old versions of some files, within a certain time limit (when files are really too old, mirrorbits will stop serving them).

So far, the setting I proposed is pretty crude, as the only matching option is a prefix. It works for Kali, as all I want to do is match request paths that start with /dists/, and allow files under this prefix to be outdated.

Now, how would that go for the XCP-ng repo: what outdated files do you need to match? I had a quick look, and it seems we could match /repodata/ anywhere in the request path. Or be stricter and match the files repomd.xml and repomd.xml.asc. Or maybe repomd.xml.*$, to future-proof a bit. What do you prefer? Are those the only metadata files to match, or are there others?
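The three matching strategies above (a /repodata/ substring, exact repomd.xml filenames, or a repomd.xml.* pattern) can be sketched as follows. This is an illustration of the discussion, not code from the MR:

```python
import re

def match_repodata_dir(path):
    # loosest option: /repodata/ anywhere in the request path
    return "/repodata/" in path

def match_repomd_exact(path):
    # strictest option: only repomd.xml and repomd.xml.asc
    return path.rsplit("/", 1)[-1] in ("repomd.xml", "repomd.xml.asc")

REPOMD_RE = re.compile(r"/repomd\.xml(\.[A-Za-z0-9]+)?$")

def match_repomd_regex(path):
    # middle ground: repomd.xml plus any future suffix (.asc, .key, ...)
    return REPOMD_RE.search(path) is not None
```

The trade-off is how many non-metadata files each pattern accidentally catches: the substring match also covers the checksum-named files inside /repodata/, while the two repomd variants only cover the index itself.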

As you rightfully pointed out, from the moment we allow mirrorbits to serve old metadata, the next issue is that clients that got these old metadata will also request old files, which might not be on the repo anymore. (NB: mirrorbits won't serve files that are no longer in the local repo; it doesn't matter if those files are still on the mirrors.)

You suggested that mirrorbits could try to keep track of deleted files for a while. Now that I'm familiar with the code of mirrorbits, I'd prefer to avoid this route. OK, I'm a bit biased, as for my own use-case (Kali) we already solve this issue outside of mirrorbits. But still, I wonder if you could look at the options you have with the tool you use to create and manage your RPM repository. Is there any option to snapshot a repository?

I suggest the idea of "snapshot" because that's how we do it for the Kali repo, with reprepro. Every time we update the repo (4 times a day, as Kali is a rolling distro), we take a snapshot of the distro. We keep something like the last 10 snapshots. It means that after packages are removed from Kali rolling, they still linger around for 2.5 days, as the snapshots still hold a reference to them.

Can you take the same approach for your RPM distro?
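As a rough sketch of that retention policy, pruning all but the newest snapshots could look like this. This is a hypothetical helper, not a reprepro or mirrorbits feature; it assumes snapshot directories are named so that lexical order is chronological (e.g. 2024-03-14-0600):

```python
from pathlib import Path

def prune_snapshots(snapshot_root, keep=10):
    """Return the snapshot directories older than the newest `keep`,
    i.e. the ones a cron job could now delete."""
    snapshots = sorted(p for p in Path(snapshot_root).iterdir() if p.is_dir())
    return snapshots[:-keep] if len(snapshots) > keep else []
```

With 4 snapshots a day and keep=10, removed packages stay reachable for roughly 2.5 days, as described above.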

@lazka Please allow me to pull you into the discussion, as it seems you're maintaining an Arch-based distro, and I'd like to have your feedback too. Do you have the same kind of issue, to start with?

@lazka
Contributor

lazka commented Oct 31, 2023

@lazka Please allow me to pull you into the discussion, as it seems you're maintaining an Arch-based distro, and I'd like to have your feedback too. Do you have the same kind of issue, to start with?

We only upload about once a day, and the only metadata change is that the database files change, which amounts to ~12 MB. While that means all clients will pull those from the main server, it hasn't been a problem so far (at least no one complained). We don't have that many users, and most traffic comes from downloading packages, which this doesn't affect; we also have enough mirrors that things get in sync quite fast. So it's definitely not a problem traffic-wise, but it might result in sluggish database syncs for some far-away users for a bit. There is also an upcoming change in pacman where package signatures will be moved out of the database files, which will reduce the metadata size by ~50%.

As for trying to fetch no longer existing files: We keep all packages for >1.5 years before we prune them, so this isn't really an issue for us.

tl;dr: we don't have that many packages or users for this to be a big problem.

The only potential problem I see with serving existing files from outdated mirrors is that two DB syncs in a short period might lead to pacman doing package downgrades, if it happens to hit an old mirror after a fresh one, which we don't really support.

@elboulangero
Contributor

Ack, thanks very much for your detailed reply @lazka!

The only potential problem I see with serving existing files from outdated mirrors is that two DB syncs in a short period might lead to pacman doing package downgrades, if it happens to hit an old mirror after a fresh one, which we don't really support.

Ah OK. This is not a problem on Debian's side, as apt will silently discard a Release file that is older than the local one. So if we hit an old mirror after a fresh one, from apt's point of view it just means that the system is up-to-date.

@elboulangero
Contributor

elboulangero commented Oct 31, 2023

Something else I wanted to share in this discussion: the methodology (and scripts) I used to monitor the availability of some files.

In short:

  • Configure mirrorbits for OutputMode: auto.
  • Then start a script to request the file(s) of interest, every minute. Set the header Accept: application/json in the request, so that mirrorbits doesn't serve the files, but instead returns some JSON data with the results of the selection.
  • After a day or two, run another script to process this data. We want to see how many mirrors are excluded, and why.
  • Plot the results.
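The processing step can be sketched as below. Note that the JSON field names (MirrorList, ExcludedList, ExcludeReason) are assumptions about the schema mirrorbits returns; adjust them to the actual output of your version:

```python
from collections import Counter

def summarize_selection(payload):
    """Tally one JSON answer from mirrorbits (obtained by sending
    'Accept: application/json'): how many mirrors were returned, and
    why the others were excluded. Field names are assumed."""
    returned = len(payload.get("MirrorList") or [])
    reasons = Counter(
        m.get("ExcludeReason", "unknown")
        for m in (payload.get("ExcludedList") or [])
    )
    return returned, reasons
```

Running this over a day of per-minute samples gives exactly the two series plotted below: the count of returned mirrors, and the exclusion reasons over time.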

And here's the result, requesting the InRelease file (i.e. the first metadata file requested by apt update) every minute during a day.

[Graph: mirror selection results for the InRelease file, polled every minute over 24 hours]

What we clearly see above is that, after the syncs of 18:00 and 06:00, the InRelease file was served in fallback mode for a while (the pinkish vertical bars). We see that suddenly all mirrors are excluded (due to a mod-time mismatch the first time, and a file-size mismatch the second time). Then this number slowly decreases, as the mirrors sync and mirrorbits scans them.

I don't know why the number of returned mirrors goes way above 4, then drops suddenly to 4 at some point. I'm sure this can be explained by a careful reading of the selection algorithm...

Anyway. So if someone wants to run the same check and produce a similar graph, I pushed the scripts at: https://gitlab.com/kalilinux/tools/mirrorbits-scripts/-/tree/main/check-availability. It's very straightforward to use; there's even a README!

@stormi
Author

stormi commented Mar 14, 2024

@stormi I'd like to improve the MR #147 so that it would work for RPM repos as well.

As I said quickly above, the idea with this MR is to tell mirrorbits to accept serving old versions of some files, within a certain time limit (when files are really too old, mirrorbits will stop serving them).

So far, the setting I proposed is pretty crude, as the only matching option is a prefix. It works for Kali, as all I want to do is match request paths that start with /dists/, and allow files under this prefix to be outdated.

Now, how would that go for the XCP-ng repo: what outdated files do you need to match? I had a quick look, and it seems we could match /repodata/ anywhere in the request path. Or be stricter and match the files repomd.xml and repomd.xml.asc. Or maybe repomd.xml.*$, to future-proof a bit. What do you prefer? Are those the only metadata files to match, or are there others?

Hi! Sorry for the late reply. As I understand it, the problem is that most filenames in repodata contain unique identifiers, so serving an old version of repomd.xml wouldn't solve anything: it would refer to the old filenames, which mirrorbits no longer knows about. That's why I suggested remembering old files for a while. Not RPMs: I agree with other distro maintainers that keeping old RPMs is the distro's responsibility, and we do keep all the updates we released in the repositories.

See the current contents of one of the repodata directories:

0405c825b877bd049254a99576e927ad7fcaa3200ff425181caca31369720c0c-primary.sqlite.bz2
09ce9e87374c09e1a42a226cd93a56892e9485f9d0dd90af33fc0203a23eac37-other.sqlite.bz2
ada1055a6861676ef0ebdd75bfd1d0481057e6b3c89d99286b829ffe80248443-filelists.sqlite.bz2
e9b0b77f0410d8a3e7016f0de434ad3afc3f9155c2ea7fb234b8d61931dbb8c4-primary.xml.gz
f3794ff0b31ed187c30b443d891d02d593e73c08783fea45eb07b7a6471aae64-filelists.xml.gz
fc85f0bc20acb3da728b13b394d0f4337a16702171495daf41f251f64656d1d8-other.xml.gz
repomd.xml
repomd.xml.asc

And the contents of repomd.xml, which references them:

<?xml version="1.0" encoding="UTF-8"?>
<repomd xmlns="http://linux.duke.edu/metadata/repo" xmlns:rpm="http://linux.duke.edu/metadata/rpm">
  <revision>1709307305</revision>
  <data type="primary">
    <checksum type="sha256">e9b0b77f0410d8a3e7016f0de434ad3afc3f9155c2ea7fb234b8d61931dbb8c4</checksum>
    <open-checksum type="sha256">8e7d4d3311b470f54c43ec1958decfe9760d413abf2e76d1093775b2a117b7f5</open-checksum>
    <location href="repodata/e9b0b77f0410d8a3e7016f0de434ad3afc3f9155c2ea7fb234b8d61931dbb8c4-primary.xml.gz"/>
    <timestamp>1709307299</timestamp>
    <size>2114338</size>
    <open-size>14535196</open-size>
  </data>
  <data type="filelists">
    <checksum type="sha256">f3794ff0b31ed187c30b443d891d02d593e73c08783fea45eb07b7a6471aae64</checksum>
    <open-checksum type="sha256">95b07d6edbf1e3e2261095112f20f75057257e447f360a523f0e7ec6180821ce</open-checksum>
    <location href="repodata/f3794ff0b31ed187c30b443d891d02d593e73c08783fea45eb07b7a6471aae64-filelists.xml.gz"/>
    <timestamp>1709307299</timestamp>
    <size>7190267</size>
    <open-size>100075554</open-size>
  </data>
  <data type="other">
    <checksum type="sha256">fc85f0bc20acb3da728b13b394d0f4337a16702171495daf41f251f64656d1d8</checksum>
    <open-checksum type="sha256">d5b576c83da54d59e9d001cd5f2d1fa4c68afcf9081bbe445cece9b3ff45828d</open-checksum>
    <location href="repodata/fc85f0bc20acb3da728b13b394d0f4337a16702171495daf41f251f64656d1d8-other.xml.gz"/>
    <timestamp>1709307299</timestamp>
    <size>1071736</size>
    <open-size>9827266</open-size>
  </data>
  <data type="primary_db">
    <checksum type="sha256">0405c825b877bd049254a99576e927ad7fcaa3200ff425181caca31369720c0c</checksum>
    <open-checksum type="sha256">f821fe8f705479bfcc4dfaa61e2b5ac97da56835a0bcbbbc573e825060596992</open-checksum>
    <location href="repodata/0405c825b877bd049254a99576e927ad7fcaa3200ff425181caca31369720c0c-primary.sqlite.bz2"/>
    <timestamp>1709307302</timestamp>
    <size>3228605</size>
    <open-size>16738304</open-size>
    <database_version>10</database_version>
  </data>
  <data type="filelists_db">
    <checksum type="sha256">ada1055a6861676ef0ebdd75bfd1d0481057e6b3c89d99286b829ffe80248443</checksum>
    <open-checksum type="sha256">cc7eef41a3dc1cafad950c57345784f8ffe7fae6f2118e5a28638885cae1e818</open-checksum>
    <location href="repodata/ada1055a6861676ef0ebdd75bfd1d0481057e6b3c89d99286b829ffe80248443-filelists.sqlite.bz2"/>
    <timestamp>1709307305</timestamp>
    <size>7294996</size>
    <open-size>43271168</open-size>
    <database_version>10</database_version>
  </data>
  <data type="other_db">
    <checksum type="sha256">09ce9e87374c09e1a42a226cd93a56892e9485f9d0dd90af33fc0203a23eac37</checksum>
    <open-checksum type="sha256">8ed8afa59eef48afd6e761d9d8f67d46a024ff788c85b23726cb2d447188684b</open-checksum>
    <location href="repodata/09ce9e87374c09e1a42a226cd93a56892e9485f9d0dd90af33fc0203a23eac37-other.sqlite.bz2"/>
    <timestamp>1709307302</timestamp>
    <size>1255313</size>
    <open-size>9761792</open-size>
    <database_version>10</database_version>
  </data>
</repomd>

The next time we regenerate the metadata, the filenames will change.

@elboulangero
Contributor

If I understand correctly: yum (or is it dnf?) downloads repomd.xml first, and then it might download the other files listed in repomd.xml? The question is: does it hit the redirector again to download those files?

I ask for comparison with apt. Here's how apt update works: it first downloads the Release file, and then it downloads some other files listed in the Release file. The key thing is: apt doesn't hit the redirector again for those files; it requests them from the same mirror that served the Release file. In other words, during an apt update transaction, all metadata files are downloaded from the same mirror.

@stormi
Author

stormi commented Mar 18, 2024

If I understand correctly: yum (or is it dnf?) downloads repomd.xml first, and then it might download the other files listed in repomd.xml? The question is: does it hit the redirector again to download those files?

I think it does hit the redirector again, because with mirrorbits it is not aware there is any redirector at all. This is the big difference with other mirror-management software that distros may use, be it with yum/dnf, apt or other, and it is the very reason why I opened this issue: mirrorbits doesn't give you a mirror URL that you can then use for subsequent requests. It redirects every single request directly, via an HTTP redirect, in an attempt to 1. balance load better, file by file, and 2. always redirect to a mirror which has the right version of the requested file (as I understand the motives). A given mirror might be eligible for some files but not for others, because it only partially synced, or has some outdated files. Mirrorbits may then redirect you to the partial mirror, closer to your location, for some files, and to other mirrors for the rest.

Now maybe I'm wrong and there's some logic in dnf that detects there was an HTTP redirection and then bypasses the very URL that we asked it to download from (mirrorbits), but I doubt it. Are you sure apt wouldn't do the same in a similar situation?

@elboulangero
Contributor

elboulangero commented Apr 16, 2024

Sorry for being late, I missed your reply.

Are you sure apt wouldn't do the same in a similar situation?

100% sure, let me detail.

First, we can easily log the requests that apt sends. So here's an apt update transaction sent to mirrorbits. I filtered the output a bit for clarity:

┌──(root㉿carbon)-[/work/tmp]
└─# apt -y -q -o Debug::Acquire::http=true update 2>&1 | grep -E '^(GET|Host:|Answer|HTTP)'
GET /kali/dists/kali-rolling/InRelease HTTP/1.1
Host: http.kali.org
Answer for: http://http.kali.org/kali/dists/kali-rolling/InRelease
HTTP/1.1 302 Found

GET /kali/dists/kali-rolling/InRelease HTTP/1.1
Host: kali.cs.nycu.edu.tw
Answer for: http://kali.cs.nycu.edu.tw/kali/dists/kali-rolling/InRelease
HTTP/1.1 200 OK

GET /kali/dists/kali-rolling/main/binary-amd64/Packages.gz HTTP/1.1
Host: kali.cs.nycu.edu.tw
Answer for: http://kali.cs.nycu.edu.tw/kali/dists/kali-rolling/main/binary-amd64/Packages.gz
HTTP/1.1 200 OK

GET /kali/dists/kali-rolling/non-free/binary-amd64/Packages.gz HTTP/1.1
Host: kali.cs.nycu.edu.tw
Answer for: http://kali.cs.nycu.edu.tw/kali/dists/kali-rolling/non-free/binary-amd64/Packages.gz
HTTP/1.1 200 OK

GET /kali/dists/kali-rolling/non-free-firmware/binary-amd64/Packages.gz HTTP/1.1
Host: kali.cs.nycu.edu.tw
Answer for: http://kali.cs.nycu.edu.tw/kali/dists/kali-rolling/non-free-firmware/binary-amd64/Packages.gz
HTTP/1.1 200 OK

GET /kali/dists/kali-rolling/contrib/binary-amd64/Packages.gz HTTP/1.1
Host: kali.cs.nycu.edu.tw
Answer for: http://kali.cs.nycu.edu.tw/kali/dists/kali-rolling/contrib/binary-amd64/Packages.gz
HTTP/1.1 200 OK

To translate that to words:

  1. Get InRelease from http.kali.org (i.e. mirrorbits)
  2. Mirrorbits returns a 302 to the mirror kali.cs.nycu.edu.tw
  3. Get InRelease from kali.cs.nycu.edu.tw
  4. Then get the 4 Packages.gz files that are referenced in the InRelease file, straight from kali.cs.nycu.edu.tw, without hitting mirrorbits.

It was implemented in apt in this commit: https://salsa.debian.org/apt-team/apt/-/commit/9b8034a9fd40b4d05075fda719e61f6eb4c45678 (back in 2016)

elboulangero added a commit to elboulangero/mirrorbits that referenced this issue Feb 14, 2025
The new setting AllowOutdatedFiles allows the user to define which files are allowed to be outdated on the mirrors, and for how long.

The user defines a list of rules, each rule being of the form:
- Prefix: matched against the beginning of the path of the requested file
- Minutes: if Prefix matches, how long the file is allowed to be outdated

AllowOutdatedFiles is a list of rules; they are checked in order, and the first rule that matches is selected.

Note that when a rule matches, the filesize check is also disabled for this file, as it wouldn't make much sense to allow a file to be outdated but not to be of a different size.

Now, here's the use-case for this setting.

For a Debian-like distribution, the directory `/dists` (aka. the
metadata of the repository) contains a lot of files that are updated
in-place.  Each time the repository is updated, and immediately after
mirrorbits rescans the local repo, mirrorbits redirects all the traffic
for those files to the fallback mirror, since they have a new modtime, a
new size, and mirrorbits doesn't know yet any mirror with those new
files. It's only after 1) mirrors sync with the origin repository and 2)
mirrorbits scans the updated mirrors, that it can redirect traffic to
mirrors again.

For more details in a real-life setup: Kali Linux is a rolling distro,
the repository is updated every 6 hours, and mirrors are scanned every
hour.  In effect, it means that every 6 hours, mirrorbits redirects most
of the metadata traffic to the fallback mirrors, then it takes around 1
to 2 hours before all the mirrors are scanned and traffic flows back to
normal. Then again, 4 times a day.

To prevent that, Kali uses the following setting:

```
AllowOutdatedFiles:
    - Prefix: /dists/
      Minutes: 540
```

Cf. etix#85 for more details.
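The first-match-wins rule semantics described in the commit message above can be sketched as follows. This is an illustrative sketch only, not the actual mirrorbits implementation:

```python
def allowed_outdated_minutes(rules, path):
    """Return the Minutes value of the first rule whose Prefix matches
    the requested path, or None if no rule matches (rules are checked
    in order, first match wins)."""
    for rule in rules:
        if path.startswith(rule["Prefix"]):
            return rule["Minutes"]
    return None
```

Ordering matters: a more specific prefix must be listed before a broader one, otherwise the broad rule shadows it.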
@elboulangero elboulangero linked a pull request Feb 14, 2025 that will close this issue