http: avoid counting very close range downloads from a same source. #128
Conversation
Wouldn't it be easier (or good enough) to just count requests that have a range starting with 0? Though I don't know how easy it is to parse that header or if golang provides some helpers for that.
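(On the header-parsing question: as far as I know, Go's net/http parses Range internally for http.ServeContent but doesn't export a helper, so a check like the one suggested would have to be hand-rolled. A hypothetical sketch, not anything in mirrorbits:)

```go
package main

import (
	"net/http"
	"strings"
)

// rangeStartsAtZero is a hypothetical helper reporting whether a request
// carries a Range header whose first range starts at byte 0,
// e.g. "Range: bytes=0-1023" or "Range: bytes=0-".
func rangeStartsAtZero(r *http.Request) bool {
	h := r.Header.Get("Range")
	if !strings.HasPrefix(h, "bytes=") {
		return false // not a range request, or an unsupported unit
	}
	first := strings.SplitN(strings.TrimPrefix(h, "bytes="), ",", 2)[0]
	return strings.HasPrefix(strings.TrimSpace(first), "0-")
}
```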
I don't really understand what you mean, or what you are proposing to solve with this specific change? :-) In any case, no, adding a specific case for ranges starting with 0 is not "easier": it would only add a test, so it would just be more code (not more "difficult" per se, but certainly not easier). And once again, I'm not sure what you are trying to achieve with this additional test. Note: there are improvements to be made by checking ranges, such as improving the transfer data stats. But none of the changes I have in mind cares about whether or not we start at 0, which is just a particular case (and not such a special one).
Ok, so trying to guess what you might have meant: you assume that we'd have exactly 1 GET call with a range starting at 0 per download, so we could count each GET with a 0-range as 1 download (and ignore the rest). This is a false assumption for a number of reasons — most obviously, with web seeds a torrent client only requests the pieces it is missing, so there may be no request starting at 0 at all for a given download, or the same client may request that range more than once.
My proposed logic instead is just about counting unique downloads, with "unique" defined as a single (file, downloader IP, downloader user-agent) triplet within a close time range. I don't think we can get much more accurate than this (we probably could, but the added complexity is likely not worth it), because there are too many possible cases and we can't know what is really happening (i.e.: why is a specific client requesting a range?).
I see, never mind then. I wasn't aware of how web seeds work. (Another thing I could think of is counting as a float with count += range/total.)
The changes seem logically correct to me: range requests are only counted once within a time frame. However, the number of Redis operations could be halved, and some minor stylistic changes would be advisable. Otherwise, this looks good to me.
LGTM |
@Jehan Another idea would be to count the requested amount of data, sum it up, and finally divide the sum by the file size. I can explain it in more detail, but as far as I know this method is used to estimate the number of podcast downloads.
Force-pushed 1e2b276 to 2fe6869
I assume you mean that we can do the GET and the SET in a single call, using the GET option of SET. That option only appeared in Redis 6.2.0, though. Using the deprecated GETSET command won't work either, since it doesn't support the EX option.
You didn't say which ones exactly, even though I took care to reuse the same style as I was seeing in nearby code. I only found one place where I had a space between a function name and the opening parenthesis. I removed it and amended it directly into the first commit. If there is anything else, I'd welcome the details, because I believe I'm following the existing code style.
This won't do, for the same reasons I gave earlier to @lazka. In a situation where downloads are always full, it makes sense. But with partial downloads, the total amount of data divided by the file size doesn't give the number of downloads at all. A torrent downloader could very well download only a part of the file from our server while downloading the rest from other seeding machines. With that idea, if 2 completely different users (different IPs, different machines, even days apart!) were each to download half of the file, we'd count it as a single download, which obviously makes no sense here.

Now I agree that a further improvement would be to sum up the exact requested amount of data (this is what I meant in an earlier comment when saying some improvements can be made), and I will propose to look at this in a later MR (we already diverged a bit in this MR, where I also added some logic for more efficient redis version checks). In any case, this data must absolutely not be used to count the number of downloads, which is something completely different: 2 partial downloads by 2 people are still 2 downloads. We only try to gather the small partial downloads by a single person into a single bigger partial download, not to gather partial downloads from completely different sources (which would be the consequence of basing download counts on the size of transferred data).
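(For the "sum up the exact requested amount of data" improvement mentioned above, a rough sketch of how the size of a single range request could be derived from its Range header — a hypothetical helper, not part of this MR, ignoring multi-range and suffix-range requests:)

```go
package main

import (
	"strconv"
	"strings"
)

// requestedBytes is a hypothetical helper returning how many bytes a simple
// single-range request such as "bytes=100-199" or "bytes=1048576-" asks for,
// given the total file size. Multi-range ("bytes=0-1,5-9") and suffix ranges
// ("bytes=-500") are left out of this sketch.
func requestedBytes(rangeHdr string, fileSize int64) (int64, bool) {
	if !strings.HasPrefix(rangeHdr, "bytes=") {
		return 0, false
	}
	spec := strings.TrimPrefix(rangeHdr, "bytes=")
	if strings.Contains(spec, ",") || strings.HasPrefix(spec, "-") {
		return 0, false
	}
	parts := strings.SplitN(spec, "-", 2)
	start, err := strconv.ParseInt(parts[0], 10, 64)
	if err != nil || start >= fileSize {
		return 0, false
	}
	end := fileSize - 1 // open-ended range such as "bytes=100-"
	if len(parts) == 2 && parts[1] != "" {
		e, err := strconv.ParseInt(parts[1], 10, 64)
		if err != nil {
			return 0, false
		}
		if e < end {
			end = e
		}
	}
	if end < start {
		return 0, false
	}
	return end - start + 1, true
}
```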
For the record, the reason why I propose this patch is that, in GIMP's case for instance, mirrorbits currently shows more than 26 million downloads just for our current stable version of the Windows installer, in less than a month of usage (we started using mirrorbits on August 29). Though I'd love to pretend that's true, and we definitely know for a fact that GIMP is massively downloaded anyway, we also know that these numbers are overestimated because of the torrents' partial GET requests. So unfortunately, mirrorbits stats are currently not very usable for the binary files (Windows installer and macOS dmg), which are also provided officially with a torrent file. I'd love to get a new release of mirrorbits where we'd have more usable stats. :-D
Any chance to see this merged? I'd love for us to have reliable download stats! :-D
Force-pushed 2fe6869 to 6e29af8
@jbkempf Rebased and force-pushed. :-)
Hello @Jehan, I am Simon, part of the VLC mirrors management team. This looks like a very nicely pragmatic way of detecting partial downloads and avoiding wrong stats gathering! I like the idea. I have a few questions though:
To be fair, I don't think it's even about our custom needs. As for how I chose the value:
I came up with 10 minutes the same way I could have come up with 2 minutes or 30 minutes. In the end, all of this is just an approximate heuristic, because we can't know the real answer to what is in fact the real question: "are these 2 range requests a single download or 2 separate downloads?" In reality we could theoretically also have 2 clients with the same IP and the same user-agent requesting the same file twice in a row with a few seconds' delay, just as a single client can make a new request for a range after an hour (someone behind a very slow connection, or maybe a torrent client "fixing" a broken file afterwards). So it was really just a duration which felt reasonable to me to cover the most common use case.

Anyway sure, that could definitely be a configuration option, I agree. We could even imagine that some people would want to set the value to 0, meaning that every range download is still considered a full download on its own (as it is now). I believe that's cheating the numbers up a bit, but whatever… I'll have a look at this, but probably not this week. I'm preparing a stable GIMP release, so this will have to wait a bit.
First, it's not every request, but every range request. I think most requests are still non-range requests anyway, but once in a while you can get range requests: not just the torrent case, but nowadays I believe most browsers are able to resume a download which was stopped in the middle. This being said, you may be right that it's still more CPU cost than we may want (even though the data to hash is quite small). As for whether I benchmarked it:
No, I have not done any benchmark. I may try later too. Our most downloaded file is likely whatever our latest Windows installer is at a given point in time, which seems to get about 500k requests a day.

Note that we could also use MD5 for instance, which would be much faster and less CPU-intensive. I went with the assumption that we are hashing private data (IP and user-agent), so the less broken the algorithm the better. On the other hand, this is very short-lived data, not the most sensitive data of all (this same data is actually usually stored in web server logs in plain text), and the attack vector is probably very limited here. So an older and faster algorithm like MD5 might just be enough when weighing the benefits and costs. Anyway, I'll look into it too.
I don't really know what goes on in the golang community. This MR might have been the first and only time I ever looked at Go code (or maybe not, but I don't really remember). 😅 But from a wider perspective, regardless of the programming language, it's indeed a good point: the cost of a cryptographic algorithm can be far from negligible when it runs a lot. You are right, it's better to check. I should have tested this.
The delay is clearly related to how the user is supposed to download the package, indeed. Thinking again about the problem, I really hope this will not "cheat down" the numbers if a lot of users try to download files from behind a NAT - like pretty much any university, company, etc. I still have some trouble determining whether the user-agent is distinctive enough to discriminate range requests coming from the same IP.
My bad for the confusion: yes, I was talking about range requests. And correct, pausing/resuming downloads is common now.
I came to the same conclusion: MD5 or SHA1 collisions are probably not a big issue for this particular use, and both are well optimized nowadays. Of course, if the CPU usage difference is negligible, there is no reason to use anything other than a more "modern" hash function like SHA256 - which is why I asked the question.
Well, I also asked because in other projects I experienced similar issues in terms of scaling... I can probably try to check on my side as well; this should be fast (to check).
Ok, I think you can forget about my concerns about the hash functions. I just did a quick check on my laptop (Intel i7-7700HQ): with something very similar to your code (hashing a concatenation of an IPv6 address and a user-agent), I ended up with something like:
Let's keep the sha256 then :D
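(For anyone wanting to reproduce this kind of check, a minimal Go micro-benchmark along those lines could look like the hypothetical `hash_bench_test.go` below, run with `go test -bench=.`; the exact numbers will of course depend on the machine.)

```go
package main

import (
	"crypto/md5"
	"crypto/sha256"
	"testing"
)

// Roughly the kind of input discussed above: an IPv6 address concatenated
// with a user-agent string.
var sample = []byte("2001:db8::1234|Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0")

func BenchmarkSHA256(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = sha256.Sum256(sample)
	}
}

func BenchmarkMD5(b *testing.B) {
	for i := 0; i < b.N; i++ {
		_ = md5.Sum(sample)
	}
}
```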
If you are thinking about things like a company/university massively installing a software on all their computers/servers, I sure hope that they don't re-download the program every time, i.e. they download it once and make it available on an internal server, which makes the download part a lot faster/more reliable. As far as I know, that's often what universities/companies really do, with an internal server containing all the "allowed software". And even if they did massively download the software from the source server, they would likely just use a simple wget/curl through a script, which would not do a range request anyway (these tools can do range requests, but that's not how you typically set them up by default in a script). I believe that range requests mostly happen either in specific peer-to-peer protocols such as torrent, or within interactive download interactions. So even with this use case in mind, those numbers would not get caught by this part of the code.

I'm not saying that the counting error cannot happen in the "cheating down" direction, but the numbers are so wrong right now for us on the "cheating up" side (to the point of being useless), and while proposing this patch, I thought a lot about how the heuristic could break the other way around (I did imagine the "massively installing a software in big organizations" use case too and came to the conclusion that it won't be a problem).

In any case, I'm pretty sure we are all aware that these stats only give ballpark numbers and that's it. Most FLOSS is in fact already downloaded through dozens of other channels (Linux distributions, various software-listing websites, "stores"…), and we know that 1 download doesn't really equate to 1 user (some people re-download the software for random reasons; and the use case you mention with big organizations is already typically download-once-install-many, with or without this code change), and so on. So yeah, bottom line, I believe that this added heuristic will fix the "cheating up" part a lot, but won't really make MirrorBits "cheat down" in any meaningful way.
Ok. So I add the delay as a setting in the MirrorBits configuration and we are good then? I'll try to do this within the next 2 weeks. Really looking forward to the day when we'll finally have meaningful stats! :-)
Very true for companies.
Agreed.
Yes! Absolutely.
So, we're waiting for a configuration patch. :)
@Jehan bouncing back on what you said:
I would add download accelerators. People use them to speed up big downloads, especially in countries where internet connections are slow or unreliable.
From what we see for the Kali Linux images (http://cdimage.kali.org/), looking at the last 3 months of logs, around 66% of the requests are range requests, so 2/3. The majority of the traffic is made of range requests. I don't know if you can measure that easily on your side. My guess is that this ratio depends on how big the files are: a Kali image is around 4 GB, so users are more likely to use download accelerators. I also wonder a bit how this change might affect performance, but it's easy to patch it out if needed. We'll see.
Unfortunately right now we can't, because we don't have access to the logs. We have access to a stats tool called "Splunk", but it doesn't contain the "Range" information. GNOME admins said they could eventually have a look at configuring that in, but it's very low on their priority list.
With the setting I'm going to add (when I have time), you will be able to disable this by setting the check delay to 0; no need to patch the change out. Though it would still be interesting to know if anyone has performance issues. Moving to MD5 is still a possibility: @garfvl made tests earlier which established that SHA256 is fast enough, but MD5 is still more than 25% faster (which can make a difference when we are talking about hundreds of thousands of calls).
Sorry for the delay. I'll try to work on adding the missing settings soon(ish). :-)
The idea is that if the same machine downloads ranges of a given file twice within a very short time span, it's probably 1 download (but done in pieces). For GIMP, we particularly have such use cases as we also serve torrent files and use our mirrors as web seeds, so the same torrent client may end up requesting partial bits of files. I'm using the Redis feature of expiring keys and went with 10 minutes for the expiration. Each time a new piece is requested, the timer is reset. If 2 requests are separated by a longer silence, let's assume these are different downloads. The logic is worth what it is, but better than re-counting partial downloads made within seconds of each other. Finally, I identify source clients through their (IP, user-agent), hashing this ID for privacy and not storing unnecessary private data.
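(A minimal sketch of that heuristic, assuming a redigo-style Redis connection — this is an illustration, not the literal patch; key naming and the helper name are made up. The two Redis round trips here are what the next commits optimize.)

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"

	"github.com/gomodule/redigo/redis"
)

const sameDownloadTTL = 600 // seconds, i.e. the 10-minute window

// isNewRangeDownload reports whether a range request for filePath from this
// (IP, user-agent) pair should be counted as a new download. The client
// identity is hashed so no raw IP or user-agent is stored in Redis, and the
// key expiration is refreshed on every request.
func isNewRangeDownload(conn redis.Conn, filePath, ip, userAgent string) (bool, error) {
	id := sha256.Sum256([]byte(ip + userAgent))
	key := "RANGE_" + filePath + "_" + hex.EncodeToString(id[:])

	// Two round trips: check whether the key is already there, then (re)set
	// it with a fresh TTL.
	_, err := redis.String(conn.Do("GET", key))
	if err != nil && err != redis.ErrNil {
		return false, err
	}
	seen := err == nil
	if _, err := conn.Do("SET", key, "1", "EX", sameDownloadTTL); err != nil {
		return false, err
	}
	return !seen, nil
}
```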
Rather than calling and parsing "INFO server" each time we might want to check the redis server version, only ask once when we (re)connect.
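(A sketch of what caching the version at (re)connect time might look like — parsing the standard redis_version field out of the INFO server reply; the helper name is made up.)

```go
package main

import "strings"

// parseRedisVersion extracts the value of the "redis_version:" field from
// the text returned by the INFO server command, or "" if it is not found.
func parseRedisVersion(info string) string {
	for _, line := range strings.Split(info, "\n") {
		line = strings.TrimSpace(line)
		if strings.HasPrefix(line, "redis_version:") {
			return strings.TrimPrefix(line, "redis_version:")
		}
	}
	return ""
}
```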
The GET option of SET only appeared in redis 6.2.0. Note that there is the deprecated GETSET command, and we could have used it to cater to all redis versions, except that the EX option doesn't exist on GETSET. So it won't do. Instead, let's just run the single SET call on redis 6.2.0 and over, and continue to run 2 requests on older redis databases. This relies on the fact that we now store the redis version (cf. previous commit); if we were to query the version each time (one more request), it would make no sense.
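(In code, the version-dependent update could look roughly like this — a sketch with a made-up helper name, not the actual patch. On Redis ≥ 6.2, `SET key val EX ttl GET` refreshes the expiration and returns the previous value in one round trip.)

```go
package main

import "github.com/gomodule/redigo/redis"

// touchDownloadKey refreshes the expiring key and reports whether it already
// existed (i.e. the request belongs to an already-counted download).
func touchDownloadKey(conn redis.Conn, key string, ttl int, redisAtLeast62 bool) (existed bool, err error) {
	if redisAtLeast62 {
		// Single round trip: SET with the GET option returns the old value.
		_, err = redis.String(conn.Do("SET", key, "1", "EX", ttl, "GET"))
		if err == redis.ErrNil {
			return false, nil // there was no previous value
		}
		return err == nil, err
	}
	// Older Redis: fall back to the two-request version (GET then SET EX).
	_, err = redis.String(conn.Do("GET", key))
	if err != nil && err != redis.ErrNil {
		return false, err
	}
	existed = err == nil
	_, err = conn.Do("SET", key, "1", "EX", ttl)
	return existed, err
}
```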
This lets admins customize the time interval (in seconds) within which 2 range downloads are considered to be the same download. Setting SameDownloadInterval to 0 disables the same-download detection logic (i.e. every range download counts as a full download, just as it used to until now).
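(As an illustration, the new knob might appear like this in the mirrorbits configuration file — a hypothetical excerpt; 600 seconds matches the 10-minute value discussed above.)

```yaml
# Consider two range requests for the same file from the same (IP, user-agent)
# within this many seconds as a single download. 0 disables the detection.
SameDownloadInterval: 600
```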
Force-pushed 6e29af8 to e8775c4
Ok, sorry again for the delay, I had quite a lot of other stuff to handle! Anyway, I implemented the new setting (cf. the last commit). I hope it's fine. I would appreciate a review, and hopefully a merge. Our admins from the GNOME Foundation infrastructure said they'd deploy this change as soon as it's validated upstream. :-) Thanks!
And merged. If needed, extra work can be done before release. This needs documentation updates, btw.
Thanks! As for docs, I can have a look, but what should I look into?
Running in production at http://cdimage.kali.org ! Sharing two screenshots, one from the 9th of April (without this change), and one from the 10th, with the change. Same time of day (I believe mirrorbits shows stats for the current UTC day? I'm not even 100% sure). Anyway, we clearly see the decrease in the download stats. The patch is working! I also didn't notice any increase in CPU usage after the change. Thanks @Jehan !

Before change: (screenshot)

After change: (screenshot)
Thanks @elboulangero. Unlike you, I was a bit disappointed once the GNOME admins updated MirrorBits for GIMP downloads. Yesterday we had a full day with the new code, but the latest installer of GIMP still showed 2.6 million downloads according to MirrorBits. And today we have already passed 2 million downloads even though a quarter of the day is still left (server time is UTC). Compared to previous days, it didn't go down much. I looked further in the code and realized that MirrorBits considers any request as a download, even HEAD requests (which are probably common since they're a good way to check the size of a file without starting a download?). So I opened #171. Hopefully this is another reason for the overestimated download stats we have?
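(A trivial sketch of the kind of guard #171 is asking for — hypothetical, not what mirrorbits currently does.)

```go
package main

import "net/http"

// countsAsDownload reports whether a request should increment the download
// statistics at all: HEAD probes (e.g. just checking a file's size) and
// other non-GET methods would be excluded.
func countsAsDownload(r *http.Request) bool {
	return r.Method == http.MethodGet
}
```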
Note: we started to use Mirrorbits for GIMP a few days ago: https://download.gimp.org/pub/gimp/v2.10/gimp-2.10.32.tar.bz2?mirrorstats
I was wondering if the stats were taking range downloads into consideration. Looking at the code, apparently they are not.
In GIMP, we also serve torrent files and we use our mirrors as web seeds, which means we may have torrent clients sending "GET" requests asking for a "Range". The same client may typically request several ranges from the http server within minutes, if not seconds. Right now, each request counts as 1 download, which therefore over-estimates the download count.
The proposed logic is not perfect (perfect is likely not possible here anyway), but it gets much closer to the real download count.