README.md (+90 -50)
@@ -14,7 +14,7 @@ The geocoder includes the following data categories:
 
 To improve search precision, multiple aliases can be generated for each Overture Maps feature. These aliases anticipate user input that may combine multiple locations to refine search results. For example, in the Netherlands, many streets are named "Kerkstraat." If a user searches for "Kerkstraat Amsterdam," the geocoder should prioritize "Kerkstraat" in Amsterdam as the top result. To achieve this, aliases like "Kerkstraat" and "Kerkstraat {intersecting division.locality}" are added. These aliases vary based on the class and subclass of the feature.
 
-Postgres Full Text Search (FTS) is used to index the aliases and can handle most of the queries efficiently. For instance if a user types "Kerkstr Amsterd," the geocoder can still locate "Kerkstraat" in Amsterdam. When FTS is not able to find a match, trigram matching takes over to find similar results, this approach is more tolerant of typos.
+Postgres Full Text Search (FTS) is used to index the aliases and handles most queries efficiently. For instance, if a user types "Kerkstr Amsterd," the geocoder can still locate "Kerkstraat" in Amsterdam. When FTS cannot find a match, for example because of a typo, trigram matching takes over to find similar results.
 
 Additionally, related segments for road, water and infra are merged into a single entry, enabling retrieval of the full feature rather than the fragmented segments in the Overture Maps data. This approach reduces the likelihood of many high-scoring results for the same road or water feature.
 
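The alias scheme described in this hunk can be illustrated with a small Python sketch. It is not the project's implementation: the function name `generate_aliases` and its inputs are hypothetical, and the only assumptions taken from the README are that a feature gets its bare name plus one "name locality" alias per intersecting locality, stored in lowercase as in the sample API response.

```python
def generate_aliases(name, localities):
    """Return lowercase aliases: the bare name plus one alias per locality.

    `localities` stands in for the names of the intersecting
    division.locality features (hypothetical input shape).
    """
    base = name.strip().lower()
    aliases = [base]
    for locality in localities:
        alias = f"{base} {locality.strip().lower()}"
        if alias not in aliases:  # avoid duplicates when localities repeat
            aliases.append(alias)
    return aliases


if __name__ == "__main__":
    print(generate_aliases("Kerkstraat", ["Amsterdam"]))
```

A search for "Kerkstraat Amsterdam" can then match the second alias directly, while a search for just "Kerkstraat" still matches the first.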
@@ -27,6 +27,7 @@ For entries with names like "'s-Hertogenbosch," a common alias "den bosch" can b
 
 This is a first experiment and it seems to work pretty well, but there are still some to-dos.
 
+- Make aliases configurable through config
 - API: Endpoint for reverse geocoding
 - API: Filter results based on bbox
 - API: Batch geocoding
@@ -44,12 +45,6 @@ To download data we can use the overturemaps CLI tool and to process the data we
 pip install overturemaps
 ```
 
-To install DuckDB we can use the following commands.
 Now we can download all data from Overture Maps with a given bounding box using the `download` script. The script will download all data in the bounding box and store it in the `data/download` directory.
 
 ```sh
@@ -100,56 +95,70 @@ OpenAPI docs available at [http://localhost:8080/docs](http://localhost:8080/doc
 #### Query API
 
 ```sh
-curl -X GET "http://localhost:8080/geocode?q=Adr%20poorters%20Vught&class=road&limit=10"
+curl -X GET "http://localhost:8080/geocode?q=Adr%20poorters%20Vught&class=road&geom=true&limit=10"
 ```
 
 FTS returns 1 result, so no fallback to trigram matching is needed.
 
 ```json
 {
-  "ms": 3,
-  "results": [
-    {
-      "name": "Adriaan Poortersstraat",
-      "class": "road",
-      "subclass": "residential",
-      "divisions": "Vught",
-      "alias": "adriaan poortersstraat vught",
-      "searchType": "fts",
-      "similarity": 0.548,
-      "geom": {
-        "type": "LineString",
-        "coordinates": [
-          [5.2859974, 51.6466151],
-          [5.2860828, 51.646718],
-          [5.2891755, 51.6474486]
-        ]
-      }
-    }
-  ]
+  "queryTime": 6,
+  "results": [
+    {
+      "id": 339468,
+      "name": "Adriaan Poortersstraat",
+      "class": "road",
+      "subclass": "residential",
+      "divisions": "{Vught}",
+      "alias": "adriaan poortersstraat vught",
+      "searchType": "fts",
+      "similarity": 0.548,
+      "geom": {
+        "type": "LineString",
+        "coordinates": [
+          [
+            5.2859974,
+            51.6466151
+          ],
+          [
+            5.2860828,
+            51.646718
+          ],
+          [
+            5.2891755,
+            51.6474486
+          ]
+        ]
+      }
+    }
+  ]
 }
 ```
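For readers who want to call the endpoint from code rather than curl, here is a minimal Python sketch that builds the same request URL and reads the fields shown in the response above. The helper names `build_geocode_url` and `top_result` are hypothetical; the parameter names (`q`, `class`, `geom`, `limit`) and the response keys are taken from the example, and nothing beyond that example is assumed about the API.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8080/geocode"  # host and port from the curl example


def build_geocode_url(query, feature_class=None, geom=False, limit=10):
    """Build a /geocode URL using the query parameters shown in the README."""
    params = {"q": query}
    if feature_class is not None:
        params["class"] = feature_class
    if geom:
        params["geom"] = "true"
    params["limit"] = limit
    # quote_via=quote encodes spaces as %20, matching the curl example.
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


def top_result(response):
    """Return (name, searchType) of the first result, or None if there is none."""
    results = response.get("results") or []
    if not results:
        return None
    return results[0]["name"], results[0]["searchType"]


if __name__ == "__main__":
    print(build_geocode_url("Adr poorters Vught", feature_class="road", geom=True))
```

The returned URL can be passed to any HTTP client; `top_result` then works on the decoded JSON body.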
 
 ## Data
 
 ### Database
 
-The database consists of 2 tables: `overture` and `overture_search`. The `overture` table contains the features from Overture Maps and the `overture_search` table contains aliases for the features which point to the `overture` table. The column `alias` in the `overture_search` table has a `gin_trgm_ops` index on it for fast searching using the PostgreSQL extension `pg_trgm`and another index on alias also using gin but with `to_tsvector` on `alias`for FTS.
+The database consists of 2 tables: `overture` and `overture_search`. The `overture` table contains the features from Overture Maps, and the `overture_search` table contains aliases for those features that point back to the `overture` table. The `alias` column in the `overture_search` table has a `gin_trgm_ops` index for searching with the PostgreSQL extension `pg_trgm`. A `vector_search` column is added to the `overture_search` table; it contains a tsvector of the aliases and is used for full text search. The remaining columns (`class_rank`, `subclass_rank`, `word_count` and `char_count`) are used for filtering and ranking the results.
 
 
 
 ### Division
 
-- Add locality relations for neighbourhoods & microhood features
-- Add county relations for locality features
-- Add region relations for county features
+#### Process
+
+- Adds locality relations for neighbourhood & microhood features
+- Adds county relations for locality features
+- Adds region relations for county features
 
 ### Road
 
-- Only segments with a primary name, we cannot search for a segment without a name so we leave them out.
-- Only segments with a subtype road. Tracks are not usefull for geocoding and water we will get from a different source since water features are segments and not water bodies.
-- Roads can be split up in multiple segments in the overture data: Buffer roads and uninion where features have the same name and class and are close to each other. This way we can cluster roads and get the full road when searching for a road.
-- Add relations for locality to roads but exlude relations for motorways since this does not make much sense.
+#### Process
+
+- Only picks segments with a primary name; a segment without a name cannot be searched for, so those are left out.
+- Only picks segments with subtype road. Tracks are not useful for geocoding, and water will come from a different source since the water features here are segments and not water bodies.
+- Merges clusterable segments into 1 feature
+- Adds locality relations to roads but excludes motorways, since a locality relation does not make much sense for them.
 
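The removed bullet describes the merge step as buffering roads and unioning those with the same name and class that are close to each other. A rough Python sketch of that clustering idea follows. It swaps the buffer-and-union approach for a simpler vertex-distance check with union-find, and the segment shape (`name`, `class`, `coords`) and the `max_gap` threshold are assumptions made for illustration, not values from the project.

```python
from collections import defaultdict
from math import hypot


def cluster_segments(segments, max_gap=0.0005):
    """Group segments that share name and class and lie within max_gap.

    Each segment is a dict with "name", "class" and "coords" (a list of
    (x, y) tuples). Returns clusters as sorted lists of segment indexes.
    """
    parent = list(range(len(segments)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def close(a, b):
        # Two segments count as close when any pair of vertices is within max_gap.
        return any(
            hypot(ax - bx, ay - by) <= max_gap
            for ax, ay in a["coords"]
            for bx, by in b["coords"]
        )

    # Only segments with the same name and class are candidates for merging.
    by_key = defaultdict(list)
    for idx, seg in enumerate(segments):
        by_key[(seg["name"], seg["class"])].append(idx)

    for indexes in by_key.values():
        for pos, i in enumerate(indexes):
            for j in indexes[pos + 1:]:
                if close(segments[i], segments[j]):
                    parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for idx in range(len(segments)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())
```

Each returned cluster would then be unioned into a single geometry, so a search for the road returns the full feature instead of its fragments.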
 #### ToDo
 
@@ -159,40 +168,49 @@ The database consists of 2 tables: `overture` and `overture_search`. The `overtu
 
 ### Water
 
-- Only water with primary name
-- Subtype is most of the time the same as class and not helpfull use subtype as subclass
-- Features with lines are sometimes split up and also can represent the same feature, these need to be grouped and merged
-- Polygons are not directly split up but need to be grouped aswell when close and representing the same feature
+#### Process
+
+- Only picks features with a primary name
+- Selects the Overture subtype as subclass
+- Merges clusterable features into 1 feature (works for polygons and lines)
 
 #### ToDo
 
-We have features 'duplicated' as lines and polygons, remove a line if it's within a polygon with the same name and subclass
+- We have features 'duplicated' as lines and polygons; remove a line if it's within a polygon with the same name and subclass
 
 ### POI
 
-- Take all pois with confidence 0.4 or higher
-- Add locality relation to pois
+#### Process
+
+- Takes all POIs with confidence 0.4 or higher
+- Adds locality relation to features
 
 ### Address
 
-- Combine street and number for name
-- Use address_levels for relations
+#### Process
+
+- Combines street and number for name/alias
+- Picks address_levels for relations
 
 ### Zipcode
 
-These are not official zipcode areas but generated from the address data.
+These are not official zipcode areas but are generated based on zipcodes from the address data.
 
-- Group addresses by zipcode and union geometries and create convex hull
+#### Process
+
+- Groups addresses by zipcode, unions their geometries and creates a convex hull as the zipcode area
 
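To make the zipcode step concrete, here is a small Python sketch of grouping address points by zipcode and taking the convex hull of each group. It uses Andrew's monotone chain in plain Python as a stand-in for whatever spatial function the processing pipeline actually calls; the `(zipcode, x, y)` input shape and the function names are assumptions for illustration.

```python
def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Positive when o -> a -> b turns counter-clockwise.
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


def zipcode_areas(addresses):
    """Map each zipcode to the convex hull of its address points.

    `addresses` is an iterable of (zipcode, x, y) tuples.
    """
    by_zip = {}
    for zipcode, x, y in addresses:
        by_zip.setdefault(zipcode, []).append((x, y))
    return {zipcode: convex_hull(pts) for zipcode, pts in by_zip.items()}
```

Because a convex hull only wraps the known addresses, neighbouring areas can overlap or leave gaps, which is what the ToDo about filling the country refers to.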
 #### ToDo
 
 - Fill the country with the zipcode areas; can we somehow create a Voronoi diagram from the polygons we have?
 
 ### Infra
 
-- Take only infra with a name and filter out some classes that are not usefull for geocoding
-- Merge close infra features with the same name and class
-- Add locality relation to infra
+#### Process
+
+- Takes only infra features with a name and filters out some classes that are not useful for geocoding
+- Merges close infra features with the same name and class
+- Adds locality relation to features
 
 ## Building executable
 
@@ -210,3 +228,25 @@ Latest image is available on ghcr.io.
 ```sh
 docker run --network host -v ./config/geocodeur.conf:/config/geocodeur.conf ghcr.io/tebben/geocodeur:latest
 ```
+
+## Tests
+
+### pg_trgm
+
+pg_trgm can be very fast, but performance tanks when many aliases in the database contain the same word. For instance, searching for `Amsterdam` matches 620,000 features, and ordering those by similarity takes multiple seconds while we are aiming for sub-100 ms response times. pg_trgm is, however, very helpful when there are typing errors, so in the current setup it is only used as a fallback when FTS does not return any results.
+
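The fallback order described above can be sketched in a few lines of Python. This is an illustration only, not how Geocodeur or Postgres implement it: `trigrams` and `similarity` approximate the documented behaviour of pg_trgm (lowercase, each word padded with two leading spaces and one trailing space, score = shared trigrams divided by all distinct trigrams), the per-word prefix check is a crude stand-in for FTS prefix matching, and the function names and the 0.3 threshold (the pg_trgm default) are choices made for this example.

```python
import re


def trigrams(text):
    """Approximate pg_trgm's trigram set for a string."""
    grams = set()
    # Split on anything that is not a letter or digit.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def search(query, aliases, threshold=0.3):
    """Prefix matching first; trigram ranking only when that finds nothing."""
    tokens = query.lower().split()
    prefix_hits = [
        alias for alias in aliases
        if all(any(word.startswith(t) for word in alias.lower().split()) for t in tokens)
    ]
    if prefix_hits:
        return "fts", prefix_hits
    scored = sorted(((similarity(query, alias), alias) for alias in aliases), reverse=True)
    return "trigram", [alias for score, alias in scored if score >= threshold]
```

With aliases such as "kerkstraat amsterdam", the query "Kerkstr Amsterd" is answered by the prefix pass, while a misspelling like "Kerkstrat Amsterdam" falls through to the trigram pass. The expensive part in Postgres is the ordering by similarity over many candidate rows, which is why it is kept as a fallback.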
+### Meilisearch
+
+Performance is great, but it adds another service to the stack. With some effort spent on ranking the results I got some OK responses, but some inputs still gave unexpected results. Overall a great tool, but maybe not the best fit for our use case.
+
+### Bluge/bleve
+
+I tried to build a more integrated solution with Bluge/bleve and played around a lot to get the best results, which were OK in the end, but the performance was not satisfying.
+
+### Typesense
+
+Not able to get the expected results.
+
+### ToDo
+
+It would be fun to explore a custom solution in Go using a BK-tree, trigrams and inverted indexes to see if we can get good results and fast response times directly in Geocodeur.