Random crashes when rasterizing tiles #1411

Closed
georgbachmann opened this issue Nov 6, 2024 · 9 comments

Comments

@georgbachmann

georgbachmann commented Nov 6, 2024

We are using tileserver-gl to serve our maps. We have apps that only use vector tiles, and that is NO PROBLEM at all. We have already separated all incoming requests and noticed that serving pure vector data does not cause any problems. But for our website we still use raster tiles, and that seems to cause problems after a couple of hours. We get consistent crashes, and we think (not know) that it has something to do with memory consumption. At least memory seems to spike before the crashes happen. The machines have plenty of memory (50GB), but it crashes well before reaching that.
Currently we use version 4.11.1, and we also tried the latest 5.0.0 with the same result. It runs on Linux machines in a Docker container (maptiler/tileserver-gl:v4.11.1).

We also cannot give more information about what causes the issue, because we don't know, and the logs of our Docker container don't say much.

So my first question would be: how can we provide information to narrow down the problem?

@acalcutt
Collaborator

acalcutt commented Nov 9, 2024

Does this seem similar to #1236? Possibly try 4.5.1 and see if the issue happens in that version, just so we can get an idea of when this started.

In Docker, are you able to see how much memory is allocated and in use? My understanding is that by default all memory is available, but maybe that isn't the case?
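One way to gather that kind of information from inside the Node process is sketched below. It is an illustration only: `formatMemoryUsage`, `startMemoryLogging` and `installCrashLogging` are hypothetical helpers, not part of tileserver-gl, and they would have to be added to a local copy of the server's entry script. `rss` is the figure to watch, since it covers the whole process including native allocations.

```javascript
// Hypothetical helpers (not part of tileserver-gl).

// Format the object returned by process.memoryUsage() as megabytes.
function formatMemoryUsage(usage) {
  const toMb = (bytes) => (bytes / 1024 / 1024).toFixed(1);
  return (
    `rss=${toMb(usage.rss)}MB ` +
    `heapUsed=${toMb(usage.heapUsed)}MB ` +
    `external=${toMb(usage.external)}MB`
  );
}

// Print memory figures at a fixed interval.
function startMemoryLogging(intervalMs = 60000) {
  const timer = setInterval(() => {
    console.log(new Date().toISOString(), formatMemoryUsage(process.memoryUsage()));
  }, intervalMs);
  // unref() so this timer alone never keeps the process alive.
  timer.unref();
  return timer;
}

// Log JavaScript-level failures that would otherwise end the process silently.
function installCrashLogging() {
  process.on('uncaughtException', (err) => {
    console.error('uncaughtException:', err);
    process.exit(1);
  });
  process.on('unhandledRejection', (reason) => {
    console.error('unhandledRejection:', reason);
  });
}
```

Calling `installCrashLogging()` and `startMemoryLogging()` once at startup would be enough. Note that these handlers only see JavaScript exceptions; a crash inside native code would not be reported by them.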

I run 5.0.0 without Docker here on Ubuntu 22.04 and haven't noticed any issues, but some people have reported similar issues when using Docker.

@georgbachmann
Author

@acalcutt thank you very much for helping!!!
We don't think this issue is related. As far as we can tell, none of our systems ever had zombie processes, and these seemingly random crashes have been an ongoing issue for us for years now. Having tried dozens of versions all over 4.x (I think even 3.x) up to 5.0.0 now doesn't seem to change much, if anything at all.
I believe by now we've picked all the low-hanging fruit: hosts with Ubuntu 20.04 up to 24.04, various versions of Docker, different hardware and virtual machines (Intel, AMD, 8 cores, 60 cores, from 8 to 300 GB memory), even different filesystems for our data. We tried setting a memory limit for each container as well as running without one, and running on a host with far more memory than we need, even though there are no OOM kills logged to begin with, and so on.
Crashes seem to happen randomly. Sometimes it takes up to a day, sometimes we've seen it crash within a couple of minutes, and every now and then the process seems to freeze completely instead of crashing hard and simply won't answer any request at all until manually killed and restarted.
Our workaround is to have multiple instances running in parallel on multiple servers, with haproxy doing the load balancing, so there's always some server that's able to answer requests even when one or two of the instances are restarting or a whole server is bogged down. But that's more of an ugly hack, especially as there's a chance that all the instances hang or restart at the same time (and we've had that happen more than once during the last couple of months). It also doesn't seem to be related to the amount of traffic: we've seen these issues with very little traffic at night as well as with peak traffic.

@acalcutt
Collaborator

Do you think you could try this minimal test I made for the maplibre-native v6 rendering issue on macOS?
https://github.com/acalcutt/maplibre-node-test/tree/macos_test

It is basically a crude test application that uses maplibre-native to render the same image 50 times. If it works, maybe try increasing the number of renders by changing https://github.com/acalcutt/maplibre-node-test/blob/macos_test/app.js#L53 to something bigger (say 1000) and see if it finishes creating all the images.

@georgbachmann
Author

georgbachmann commented Nov 28, 2024

Ok... after some digging we found the cause of our problem. Maybe it helps others as well.

Our map contains vector data, but also raster data. We produced this raster data using the good, very old TileMill, and it wrote format: jpg70 into the metadata table of the resulting mbtiles file.

Now when a user requested a tile that could not be found (it was one right where our data starts, but it seems the surrounding tiles are needed as well, for font rendering I suppose... is it tileMargin in the config?), serve_rendered.js around line 974 tried to call createEmptyResponse, and that function seems to use the format specified in the metadata table of the mbtiles, which was jpg70. And that is what crashed! It throws an Unknown format exception, which is not caught!

So my proposal would be (I have no idea about Node.js) to make sure that exceptions thrown in there are properly caught. And maybe an empty-content response or something similar is returned? Not sure.

Also: what was a problem for us was KNOWING THAT! We started our Docker container (the same applies without Docker, running from source) with the --verbose option, but the if (options.verbose) in front of console.log('MBTiles error, serving empty', err); did not let it print anything. So I guess there's a bug there as well? For our own debugging we added source, x, y, z to the log (to get more details) and commented out the if verbose part to catch the problem.
I am not familiar with Node.js (I am an app developer :-) ), but I'd be happy to assist with or review a possible PR.
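The extra logging described above could look roughly like the following. This is a sketch, not the code from the thread: `describeTileError` is a hypothetical helper, and the source name and coordinates in the call are made-up sample values.

```javascript
// Hypothetical helper: build one log line that identifies the failing tile.
// sourceId, z, x and y are assumed to be available where the error is handled.
function describeTileError(sourceId, z, x, y, err) {
  const message = err && err.message ? err.message : String(err);
  return (
    'MBTiles error, serving empty: ' +
    `source=${sourceId} z=${z} x=${x} y=${y}: ${message}`
  );
}

// Sample call with invented values.
console.log(describeTileError('raster', 12, 2200, 1430, new Error('Unknown format')));
```

Logging this unconditionally, or at least at a warning level, makes the offending source and tile visible without having to patch the verbose check.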

So yes, that was our problem. We solved it by updating our mbtiles to format: jpeg and now it gracefully handles those problematic tiles as well.

And then, if I may ask... is there a way to speed up the startup time? While debugging that was a bit of a problem, because starting the tileserver takes around 4-5 min for us with those large mbtiles. I suppose it goes through the contents of those files during startup? Is there a way to prevent that?

@acalcutt
Collaborator

acalcutt commented Nov 28, 2024

Thank you for doing the extra troubleshooting and figuring this out. A PR to fix or improve any issues you found with logging would be a great addition and appreciated. My guess is some of that logging could have been broken when mbtiles got promisified recently.

For the unknown file type issue, I wonder if we could set a list of expected extensions, and if the value isn't in the list we set a default value, maybe in the createEmptyResponse function ( https://github.com/maptiler/tileserver-gl/blob/master/src/serve_rendered.js#L85 ). Looking at this function, my guess is it might fail at .toFormat(format) if the value wasn't something sharp supports.
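A minimal sketch of that allowlist idea follows. Everything in it is an assumption made for illustration: `normalizeTileFormat` is a hypothetical helper, the set of accepted formats and the `png` default are guesses, and stripping a trailing quality number (as in `jpg70`) is only one possible way to treat TileMill-style values.

```javascript
// Assumed allowlist of image formats for the empty-tile path.
const KNOWN_FORMATS = new Set(['png', 'jpeg', 'webp']);
// Assumed aliases mapped onto the allowlist.
const FORMAT_ALIASES = new Map([['jpg', 'jpeg']]);

// Hypothetical helper: map a metadata "format" value to a known format,
// falling back to a default when it is missing or unrecognised.
function normalizeTileFormat(format, fallback = 'png') {
  const raw = String(format || '').toLowerCase();
  // Values such as "jpg70" carry a quality suffix; drop trailing digits.
  const base = raw.replace(/\d+$/, '');
  const candidate = FORMAT_ALIASES.get(base) ?? base;
  return KNOWN_FORMATS.has(candidate) ? candidate : fallback;
}
```

With this in place, the value handed to the image encoder is always one of the listed formats, so an unexpected metadata entry degrades to the default instead of throwing.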

@acalcutt
Collaborator

For the start speed issue, do you have a lot of mbtiles? My instance only takes about 30 seconds to start. On startup it connects to each mbtiles file and pulls out the metadata needed to serve tilejson. It pulls all that info into a big internal array.

@georgbachmann
Author

georgbachmann commented Nov 28, 2024

> Thank you for doing the extra troubleshooting and figuring this out. A PR to fix or improve any issues you found with logging would be a great addition and appreciated. My guess is some of that logging could have been broken when mbtiles got promisified recently.

Or check that on startup... I guess it opens and parses the metadata table of the mbtiles anyway. Then it would not just crash at execution; it would consistently fail on startup already. And yes... the toFormat fails because of sharp, yes!
About the PR... I have no real clue about Node.js 😀 It was quite a hassle for me to get everything up and running 😬
So I would wrap it in a try {} block and hope for the best... but I have no idea, so I am sorry if that would not be beneficial. The whole createEmptyResponse can throw, so I guess it would be important to handle that gracefully. Currently it doesn't.
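The try-block idea can be sketched as below. This is not the tileserver-gl code: `safeEmptyTile` is a hypothetical wrapper, `buildEmptyTile` stands in for whatever produces the empty image, and returning a 204 status on failure is an assumed fallback.

```javascript
// Hypothetical wrapper around a function that builds an empty tile image.
// buildEmptyTile may throw synchronously or return a rejected promise.
async function safeEmptyTile(buildEmptyTile, format) {
  try {
    const data = await buildEmptyTile(format);
    return { status: 200, data };
  } catch (err) {
    console.error(`Could not build empty tile for format "${format}":`, err.message);
    // Assumed fallback: answer with "no content" rather than let the
    // exception propagate and end the process.
    return { status: 204, data: null };
  }
}

// Example: a builder that rejects unknown formats, as described in the thread.
safeEmptyTile((format) => {
  if (format !== 'png' && format !== 'jpeg') throw new Error('Unknown format');
  return Buffer.alloc(0);
}, 'jpg70').then((result) => console.log(result.status));
```

Because the call is awaited inside the `try`, both a synchronous throw and a rejected promise end up in the same `catch`.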

@georgbachmann
Author

> For the start speed issue, do you have a lot of mbtiles? My instance only takes about 30 seconds to start. On startup it connects to each mbtiles file and pulls out metadata needed to serve tilejson. it pulls all that info into a big internal array.

We have about 10 mbtiles files; some are super large (500-600GB) and some are smaller. But shouldn't the query for just the metadata be super fast? At least when I do it manually it's instant. So I guess it does something else with those files as well?

If you want to debug, I can screenshare and we can have a look, but I can't give you the mbtiles, sorry :-)

@acalcutt
Collaborator

acalcutt commented Nov 28, 2024

I have around 30 mbtiles at https://tiles.wifidb.net and don't see those long start times. For the most part, TileServer GL just connects to the database using @mapbox/mbtiles and pulls out the metadata.

However, if it is taking a long time and your file is large, I wonder if your mbtiles file is missing some metadata like bounds, minzoom, or maxzoom. The documentation for getInfo in @mapbox/mbtiles says these values are generated from queries if they do not exist, which I'd imagine could take some time if your file is large.

getInfo(callback)

Get info of an MBTiles file, which is stored in the metadata table. Includes information like zoom levels, bounds, vector_layers, that were created during generation. This performs fallback queries if certain keys like bounds, minzoom, or maxzoom have not been provided.
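A quick way to check a file for those keys is sketched here. It assumes the rows of the metadata table have already been read (for example with `SELECT name, value FROM metadata`) into an array of `{ name, value }` objects; `findMissingMetadataKeys` is a hypothetical helper and not part of @mapbox/mbtiles.

```javascript
// Keys that, per the getInfo description, are computed by fallback queries
// when they are absent from the metadata table.
const FALLBACK_KEYS = ['bounds', 'minzoom', 'maxzoom'];

// Hypothetical helper: rows is an array of { name, value } objects taken
// from the metadata table. Returns the fallback keys that are missing or empty.
function findMissingMetadataKeys(rows) {
  const present = new Set(
    rows
      .filter((row) => row.value !== null && row.value !== undefined)
      .filter((row) => String(row.value).trim() !== '')
      .map((row) => row.name)
  );
  return FALLBACK_KEYS.filter((key) => !present.has(key));
}

// Sample rows with invented values.
const sampleRows = [
  { name: 'format', value: 'jpeg' },
  { name: 'minzoom', value: '0' },
];
console.log(findMissingMetadataKeys(sampleRows));
```

Any key this reports would be a candidate for adding to the metadata table by hand, so that nothing has to be derived from the tile data at startup.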

In my files I usually edit them and add in any missing bounds/minzoom/maxzoom/center, so maybe that is why I don't see this.
For Example

acalcutt added a commit to WifiDB/tileserver-gl that referenced this issue Dec 29, 2024
Co-Authored-By: Andrew Calcutt <acalcutt@techidiots.net>
acalcutt added a commit to WifiDB/tileserver-gl that referenced this issue Jan 2, 2025
Co-Authored-By: Andrew Calcutt <acalcutt@techidiots.net>