Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggregation duplicates points at tile borders #889

Closed
jgoizueta opened this issue Mar 8, 2018 · 7 comments
Closed

Aggregation duplicates points at tile borders #889

jgoizueta opened this issue Mar 8, 2018 · 7 comments
Assignees

Comments

@jgoizueta
Copy link
Contributor

jgoizueta commented Mar 8, 2018

See CartoDB/carto-vl#91 (comment)

Our aggregation queries select the points in the tile with this filter:

   WHERE the_geom_webmercator && _cdb_params.bbox

So, points at tile borders can be selected in multiple tiles.

The aggregation expression, since it uses Floor functions should never aggregate a point to the wrong cluster, so those border points will appear clusters outside the tile in all except one tile.

We should fix the queries to avoid returning clusters that don't belong to the tile.

This can be done either by properly filtering points to be aggregated or by filtering after aggregation.

In the first case we could replace:

   WHERE the_geom_webmercator && _cdb_params.bbox

by:

        WHERE ST_X(the_geom_webmercator) >= ST_XMIN(bbox)
          AND ST_X(the_geom_webmercator) < ST_XMAX(bbox)
          AND ST_Y(the_geom_webmercator) >= ST_YMIN(bbox)
          AND ST_Y(the_geom_webmercator) < ST_YMAX(bbox)

But this could degrade the performance (or not, because the the_geom_webmercator && _cdb_params.bbox is added by Mapnik anyway in its query wrapping).
A solution could be add both conditions (we'd need to check).

The other option would be to filter the aggregation coordinates, for example:

        GROUP BY
            Floor(ST_X(_cdb_query.the_geom_webmercator)/_cdb_params.res),
            Floor(ST_Y(_cdb_query.the_geom_webmercator)/_cdb_params.res)
         HAVING 
             Floor(ST_X(_cdb_query.the_geom_webmercator)/_cdb_params.res) >= 0
             AND Floor(ST_X(_cdb_query.the_geom_webmercator)/_cdb_params.res) < 256/resolution
             AND etc...

Which would not use indices, but would operate on relatively few rows...

@Algunenano
Copy link
Contributor

Algunenano commented Mar 9, 2018

You could do

WHERE the_geom_webmercator && _cdb_params.bbox
    AND ST_X(the_geom_webmercator) < ST_XMAX(bbox) -- If you have the number it's better
    AND ST_Y(the_geom_webmercator) < ST_YMAX(bbox) -- If you have the number it's better

This approach a) only works for points; b) Some artefacts are to be expected if the right border of the most to the right tile falls exactly at the border of the screen; same for the bottom, since the tile covering the point will not be requested.

Keeping in mind I don't know anything at all about the renderer, could it be easier to return the cartodb_id with each geometry and avoid rendering it again if duplicated? Please disregard it if I'm saying something stupid.

jgoizueta added a commit that referenced this issue Apr 4, 2018
See #889
FOr centroid and point-grid the cartodb_id wasn't unique across tiles.
jgoizueta added a commit that referenced this issue Apr 5, 2018
See #889
FOr centroid and point-grid the cartodb_id wasn't unique across tiles.
@jgoizueta
Copy link
Contributor Author

I'm closing it by now because we can handle the duplicated border points in the clients easily.
If we find this a problem in the future in some use case let's reopen and apply @Algunenano proposed solution.

@jgoizueta
Copy link
Contributor Author

I'm reopening this because I think we should avoid the duplicate points: it would avoid potential confusions in the use of the tiles, and, commented above shouldn't have a performance impact (if we add the filtering condition in addition to the existing one which uses indices).

As stated above, border points are assigned to the correct cluster so that only in one tile the resulting cluster belongs to the tile. But the other cases are not only out of the tile, but can contain incomplete aggregations (only the border points are aggregated). This is not a problem if the client filters the out-of-the-tile clusters geometrically, but it could be a problem if duplicates are avoided by cartodb_id instead. In any case it seems inconsistent to include those partially aggregated clusters in the tile.

@jgoizueta
Copy link
Contributor Author

I've realised that the simple filtering described above is only valid with no tile extent/buffer, or when the MVT extent is a multiple of the aggregation resolution.

To exclude all aggregation cells that are only partially included in the bbox we must first compute
the limits of the cells fully included int the bbox:

CEIL(ST_XMIN(bbox)/res)*res AS xmin,
FLOOR(ST_XMAX(bbox)/res)*res AS xmax,
CEIL(ST_YMIN(bbox)/res)*res AS ymin,
FLOOR(ST_YMAX(bbox)/res)*res AS ymax

And then filter with:

(the_geom_webmercator && bbox) AND
ST_X(the_geom_webmercator) >= xmin AND
ST_X(the_geom_webmercator) < xmax AND
ST_Y(the_geom_webmercator) >= ymin AND
ST_Y(the_geom_webmercator) < ymax

I've been trying this and I've found a problem when zoom level is high and the tile is away from the origin: in such cases,the inaccuracies in Mapnik's provided bbox can make us drop cells on the border of the area.

This wouldn't happen if bbox was computed with full double accuracy, as CDB_XYZ_Extent does.

For example, for the tile z=20, x=1000000, y=1000000, here's the difference between the minimum Y of the tile and the correct value -18181044.018313024:

ST_YMIN(CDB_XYZ_Extent(1000000,1000000,20)) + 18181044.018313024 -- => 0 this is good
ST_YMIN(!bbox!) + 18181044.018313024 -- => 3.72529029846191e-09 oh oh

This discrepancy makes the filtering formulas above skip the lower row of cells in the tile.

Note: I have yet to test this with production Mapnik. Also, @Algunenano, I'll ask you to run some tests I have in your local configuration.

Also to be able to execute the tests I've prepared accurately we need to fix the problem described in #1001 so I will include combine those fixes in a PR

@jgoizueta
Copy link
Contributor Author

For a moment I thought that we better include all the cells intersected by the bbox, which would be more robust against bbox lack of precision (taking floor of min, ceil of max). But the problem is that the spatial filters that Mapnik wraps the queries with will prevent all the necessary data from being aggregated.

@jgoizueta
Copy link
Contributor Author

I've notice the error in the example ST_YMIN(!bbox!) + 18181044.018313024 -- => 3.72529029846191e-09 is exactly 1 ulp (unit in the last place). If this is so, we could manage the problem by adding/substracting a relative epsilon before the floor/ceil operations.

@jgoizueta
Copy link
Contributor Author

This applies the idea in my last comment and seems to work. Further testing will be needed, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants