Commit 48fee4d

updated readme.md
1 parent c18b8af commit 48fee4d

1 file changed: +90 -50 lines changed

README.md

@@ -14,7 +14,7 @@ The geocoder includes the following data categories:
1414

1515
To improve search precision, multiple aliases can be generated for each Overture Maps feature. These aliases anticipate user input that may combine multiple locations to refine search results. For example, in the Netherlands, many streets are named "Kerkstraat." If a user searches for "Kerkstraat Amsterdam," the geocoder should prioritize "Kerkstraat" in Amsterdam as the top result. To achieve this, aliases like "Kerkstraat" and "Kerkstraat {intersecting division.locality}" are added. These aliases vary based on the class and subclass of the feature.
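
A minimal sketch of the alias idea described above, in Python. The `Feature` class and `generate_aliases` function are hypothetical names used only for illustration, not Geocodeur's actual code, and lowercasing the aliases is an assumption based on the lowercase `alias` values in the API output.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical, simplified stand-in for an Overture Maps feature."""
    name: str
    feature_class: str
    # Names of the division localities that intersect the feature.
    localities: list[str] = field(default_factory=list)


def generate_aliases(feature: Feature) -> list[str]:
    """Return the plain name plus one "<name> <locality>" alias per locality."""
    base = feature.name.lower()
    aliases = [base]
    for locality in feature.localities:
        alias = f"{base} {locality.lower()}"
        if alias not in aliases:
            aliases.append(alias)
    return aliases


kerkstraat = Feature("Kerkstraat", "road", ["Amsterdam"])
print(generate_aliases(kerkstraat))
```

In the real pipeline the set of alias templates varies per class and subclass; this sketch only covers the name-plus-locality case from the example.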
1616

17-
Postgres Full Text Search (FTS) is used to index the aliases and can handle most of the queries efficiently. For instance, if a user types "Kerkstr Amsterd," the geocoder can still locate "Kerkstraat" in Amsterdam. When FTS is not able to find a match, trigram matching takes over to find similar results; this approach is more tolerant of typos.
17+
Postgres Full Text Search (FTS) is used to index the aliases and can handle most of the queries efficiently. For instance, if a user types "Kerkstr Amsterd," the geocoder can still locate "Kerkstraat" in Amsterdam. When FTS cannot find a match, for example because of a typo, trigram matching takes over to find similar results.
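
The FTS-first, trigram-fallback flow can be sketched roughly as follows. This is an illustration under assumptions: `to_prefix_tsquery`, `search`, `fake_fts` and `fake_trigram` are hypothetical names, the in-memory fakes stand in for the real Postgres queries, and using prefix terms (`term:*`) is only a guess at how partial input such as "Kerkstr Amsterd" is matched. The `"trigram"` label is likewise assumed; only `"fts"` appears in the documented output.

```python
from typing import Callable

SearchFn = Callable[[str], list[str]]


def to_prefix_tsquery(user_input: str) -> str:
    """Turn "Kerkstr Amsterd" into "kerkstr:* & amsterd:*".

    Terms containing non-alphanumeric characters are skipped in this sketch
    so that no tsquery operators can leak into the query string.
    """
    terms = [t for t in user_input.lower().split() if t.isalnum()]
    return " & ".join(f"{t}:*" for t in terms)


def search(user_input: str, fts_search: SearchFn, trigram_search: SearchFn) -> tuple[str, list[str]]:
    """Try FTS first and fall back to trigram matching only when FTS finds nothing."""
    results = fts_search(to_prefix_tsquery(user_input))
    if results:
        return "fts", results
    return "trigram", trigram_search(user_input)


# In-memory fakes standing in for the database queries (illustration only).
ALIASES = ["kerkstraat amsterdam", "kerkstraat utrecht"]


def fake_fts(tsquery: str) -> list[str]:
    prefixes = [p.removesuffix(":*") for p in tsquery.split(" & ")]
    return [
        alias for alias in ALIASES
        if all(any(word.startswith(p) for word in alias.split()) for p in prefixes)
    ]


def fake_trigram(text: str) -> list[str]:
    return ["kerkstraat amsterdam"]


print(search("Kerkstr Amsterd", fake_fts, fake_trigram))
print(search("Kerkstrat Amsterdm", fake_fts, fake_trigram))
```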
1818

1919
Additionally, related segments for road, water and infra are merged into a single entry, so a search returns the full feature rather than the fragmented segments found in the Overture Maps data. This also reduces the chance of many near-identical, high-scoring results for the same road or water feature.
2020

@@ -27,6 +27,7 @@ For entries with names like "'s-Hertogenbosch," a common alias "den bosch" can b
2727

2828
This is a first experiment and seems to work pretty well, but there are still some to-dos.
2929

30+
- Make aliases configurable through config
3031
- API: Endpoint for reverse geocoding
3132
- API: Filter results based on bbox
3233
- API: Batch geocoding
@@ -44,12 +45,6 @@ To download data we can use the overturemaps CLI tool and to process the data we
4445
pip install overturemaps
4546
```
4647

47-
To install DuckDB we can use the following commands.
48-
49-
```sh
50-
curl --fail --location --progress-bar --output duckdb_cli-linux-amd64.zip https://github.com/duckdb/duckdb/releases/download/v1.1.3/duckdb_cli-linux-amd64.zip && unzip duckdb_cli-linux-amd64.zip
51-
```
52-
5348
Now we can download all data from Overture Maps with a given bounding box using the `download` script. The script will download all data in the bounding box and store it in the `data/download` directory.
5449

5550
```sh
@@ -100,56 +95,70 @@ OpenAPI docs available at [http://localhost:8080/docs](http://localhost:8080/doc
10095
#### Query API
10196

10297
```sh
103-
curl -X GET "http://localhost:8080/geocode?q=Adr%20poorters%20Vught&class=road&limit=10"
98+
curl -X GET "http://localhost:8080/geocode?q=Adr%20poorters%20Vught&class=road&geom=true&limit=10"
10499
```
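
For scripted use, the same request URL can be assembled in Python. A small sketch: the endpoint and the `q`, `class`, `geom` and `limit` parameters come from the curl example above, while the helper name `build_geocode_url` is hypothetical.

```python
from urllib.parse import quote, urlencode

BASE_URL = "http://localhost:8080/geocode"


def build_geocode_url(query, feature_class=None, geom=False, limit=10):
    """Build a /geocode URL; spaces are percent-encoded as %20."""
    params = {"q": query}
    if feature_class:
        params["class"] = feature_class
    if geom:
        params["geom"] = "true"
    params["limit"] = str(limit)
    # quote_via=quote encodes spaces as %20 instead of the default "+".
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


print(build_geocode_url("Adr poorters Vught", feature_class="road", geom=True))
```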
105100

106101
FTS returns 1 result, so no fallback to trigram matching is needed.
107102

108103
```json
109104
{
110-
"ms": 3,
111-
"results": [
112-
{
113-
"name": "Adriaan Poortersstraat",
114-
"class": "road",
115-
"subclass": "residential",
116-
"divisions": "Vught",
117-
"alias": "adriaan poortersstraat vught",
118-
"searchType": "fts",
119-
"similarity": 0.548,
120-
"geom": {
121-
"type": "LineString",
122-
"coordinates": [
123-
[5.2859974, 51.6466151],
124-
[5.2860828, 51.646718],
125-
[5.2891755, 51.6474486]
126-
]
127-
}
128-
}
129-
]
105+
"queryTime": 6,
106+
"results": [
107+
{
108+
"id": 339468,
109+
"name": "Adriaan Poortersstraat",
110+
"class": "road",
111+
"subclass": "residential",
112+
"divisions": "{Vught}",
113+
"alias": "adriaan poortersstraat vught",
114+
"searchType": "fts",
115+
"similarity": 0.548,
116+
"geom": {
117+
"type": "LineString",
118+
"coordinates": [
119+
[
120+
5.2859974,
121+
51.6466151
122+
],
123+
[
124+
5.2860828,
125+
51.646718
126+
],
127+
[
128+
5.2891755,
129+
51.6474486
130+
]
131+
]
132+
}
133+
}
134+
]
130135
}
131136
```
132137

133138
## Data
134139

135140
### Database
136141

137-
The database consists of 2 tables: `overture` and `overture_search`. The `overture` table contains the features from Overture Maps and the `overture_search` table contains aliases for the features which point to the `overture` table. The column `alias` in the `overture_search` table has a `gin_trgm_ops` index on it for fast searching using the PostgreSQL extension `pg_trgm` and another index on alias also using gin but with `to_tsvector` on `alias` for FTS.
142+
The database consists of 2 tables: `overture` and `overture_search`. The `overture` table contains the features from Overture Maps and the `overture_search` table contains aliases for the features which point to the `overture` table. The column `alias` in the `overture_search` table has a `gin_trgm_ops` index on it for searching using the PostgreSQL extension `pg_trgm`. A column `vector_search` is added to the `overture_search` table; it contains a tsvector of the aliases and is used for full text search. The remaining columns, `class_rank`, `subclass_rank`, `word_count` and `char_count`, are used for filtering and ranking the results.
138143

139144
![example](./static/example.jpg)
140145

141146
### Division
142147

143-
- Add locality relations for neighbourhoods & microhood features
144-
- Add county relations for locality features
145-
- Add region relations for county features
148+
#### Process
149+
150+
- Adds locality relations for neighbourhood & microhood features
151+
- Adds county relations for locality features
152+
- Adds region relations for county features
146153

147154
### Road
148155

149-
- Only segments with a primary name, we cannot search for a segment without a name so we leave them out.
150-
- Only segments with a subtype road. Tracks are not useful for geocoding and water we will get from a different source since water features are segments and not water bodies.
151-
- Roads can be split up in multiple segments in the overture data: Buffer roads and union where features have the same name and class and are close to each other. This way we can cluster roads and get the full road when searching for a road.
152-
- Add relations for locality to roads but exclude relations for motorways since this does not make much sense.
156+
#### Process
157+
158+
- Only picks segments with a primary name; segments without a name cannot be searched for, so they are left out.
159+
- Only picks segments with subtype `road`. Tracks are not useful for geocoding, and water is taken from a different source because water in the segment data is represented as segments rather than water bodies.
160+
- Merges clusterable segments into 1 feature
161+
- Adds locality relations to roads but excludes motorways, since a locality relation makes little sense for them.
153162

154163
#### ToDo
155164

@@ -159,40 +168,49 @@ The database consists of 2 tables: `overture` and `overture_search`. The `overtu
159168

160169
### Water
161170

162-
- Only water with primary name
163-
- Subtype is most of the time the same as class and not helpful, use subtype as subclass
164-
- Features with lines are sometimes split up and also can represent the same feature, these need to be grouped and merged
165-
- Polygons are not directly split up but need to be grouped as well when close and representing the same feature
171+
#### Process
172+
173+
- Only picks features with a primary name
174+
- Selects overture subtype as subclass
175+
- Merges clusterable features into 1 feature (works for polygons and lines)
166176

167177
#### ToDo
168178

169-
We have features 'duplicated' as lines and polygons, remove a line if it's within a polygon with the same name and subclass
179+
- Some features are 'duplicated' as lines and polygons; remove a line if it lies within a polygon with the same name and subclass
170180

171181
### POI
172182

173-
- Take all pois with confidence 0.4 or higher
174-
- Add locality relation to pois
183+
#### Process
184+
185+
- Takes all POIs with confidence 0.4 or higher
186+
- Adds locality relation to features
175187

176188
### Address
177189

178-
- Combine street and number for name
179-
- Use address_levels for relations
190+
#### Process
191+
192+
- Combines street and number for name/alias
193+
- Picks address_levels for relations
180194

181195
### Zipcode
182196

183-
These are not official zipcode areas but generated from the address data.
197+
These are not official zipcode areas but are generated based on zipcodes from the address data.
184198

185-
- Group addresses by zipcode and union geometries and create convex hull
199+
#### Process
200+
201+
- Groups addresses by zipcode, unions the geometries and creates a convex hull as the zipcode area
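
The grouping and convex hull step can be illustrated with a small pure-Python sketch. The actual processing presumably relies on spatial functions in the database tooling; here Andrew's monotone chain algorithm is swapped in as a stand-in, and `zipcode_areas`, `convex_hull` and the zipcode `"1234AB"` are hypothetical names and values used only for this example.

```python
from collections import defaultdict


def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in counterclockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop points that would make a clockwise or collinear turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each half is the first point of the other half.
    return lower[:-1] + upper[:-1]


def zipcode_areas(addresses):
    """Group (zipcode, (x, y)) address points by zipcode and hull each group."""
    grouped = defaultdict(list)
    for zipcode, point in addresses:
        grouped[zipcode].append(point)
    return {zipcode: convex_hull(points) for zipcode, points in grouped.items()}


addresses = [
    ("1234AB", (0.0, 0.0)),
    ("1234AB", (2.0, 0.0)),
    ("1234AB", (2.0, 2.0)),
    ("1234AB", (0.0, 2.0)),
    ("1234AB", (1.0, 1.0)),  # interior point, not part of the hull
]
print(zipcode_areas(addresses))
```

A convex hull can overlap neighbouring zipcodes and leaves gaps between areas, which is why the generated areas are not official zipcode boundaries.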
186202

187203
#### ToDo
188204

189205
- Fill the country with the zipcode areas; can we somehow create a Voronoi diagram from the polygons we have?
190206

191207
### Infra
192208

193-
- Take only infra with a name and filter out some classes that are not useful for geocoding
194-
- Merge close infra features with the same name and class
195-
- Add locality relation to infra
209+
#### Process
210+
211+
- Takes only infra features with a name and filters out some classes that are not useful for geocoding
212+
- Merges close infra features with the same name and class
213+
- Adds locality relation to features
196214

197215
## Building executable
198216

@@ -210,3 +228,25 @@ Latest image is available on ghcr.io.
210228
```sh
211229
docker run --network host -v ./config/geocodeur.conf:/config/geocodeur.conf ghcr.io/tebben/geocodeur:latest
212230
```
231+
232+
## Tests
233+
234+
### pg_trgm
235+
236+
pg_trgm can be very fast, but performance tanks when many aliases in the database contain the same word. For instance, searching for `Amsterdam` matches 620,000 features; when ordering by similarity the query takes multiple seconds, while we are aiming for sub-100 ms response times. pg_trgm is, however, very helpful when there are typing errors, so in the current setup it is only used as a fallback when FTS does not return any results.
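
To make the similarity score concrete, here is a rough Python approximation of the trigram similarity that pg_trgm documents: each word is lowercased and padded with two leading spaces and one trailing space, and the score is the number of shared trigrams divided by the number of distinct trigrams in both strings. This is a sketch of the idea, not the extension's actual implementation.

```python
import re


def trigrams(text):
    """Set of trigrams per word, padded with two leading and one trailing space."""
    result = set()
    # Only alphanumeric runs count as words; everything else separates words.
    for word in re.findall(r"[^\W_]+", text.lower()):
        padded = f"  {word} "
        result.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return result


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(sorted(trigrams("cat")))
print(similarity("amsterdm", "amsterdam"))
```

Because a score like this has to be computed for every candidate row before the results can be ordered, a very common word such as `Amsterdam` makes the sort expensive, which fits the slowdown described above.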
237+
238+
### Meilisearch
239+
240+
Performance is great, but it adds another service to the stack. After some effort tuning the ranking I got acceptable responses, but some inputs still produced unexpected results. Overall a great tool, but maybe not the best fit for our use case.
241+
242+
### Bluge/bleve
243+
244+
I tried for a more integrated solution with Bluge/bleve and played around a lot to get the best results, which were OK in the end, but the performance was not satisfying.
245+
246+
### Typesense
247+
248+
I was not able to get the expected results.
249+
250+
### ToDo
251+
252+
It would be fun to explore a custom solution in Go using a BK-tree, trigrams and inverted indexes to see if we can get good results and fast response times directly in Geocodeur.
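
As a starting point for that exploration, a BK-tree over Levenshtein distance could look roughly like this. The sketch is written in Python for brevity, although the idea above is a Go implementation; `BKTree` and `levenshtein` are illustrative names and the word list is made up.

```python
def levenshtein(a, b):
    """Edit distance using a rolling row of the dynamic programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (word, children), where children maps distance -> child node."""

    def __init__(self):
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # word is already in the tree
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, word, max_distance):
        """Return sorted (distance, word) pairs within max_distance of word."""
        matches = []
        stack = [self.root] if self.root else []
        while stack:
            candidate, children = stack.pop()
            d = levenshtein(word, candidate)
            if d <= max_distance:
                matches.append((d, candidate))
            # Triangle inequality: only subtrees on edges in
            # [d - max_distance, d + max_distance] can contain matches.
            for edge, child in children.items():
                if d - max_distance <= edge <= d + max_distance:
                    stack.append(child)
        return sorted(matches)


tree = BKTree()
for name in ["kerkstraat", "kerkplein", "amsterdam", "vught"]:
    tree.add(name)
print(tree.search("kerkstrat", 1))
```

The pruning step is what makes the structure attractive for typo-tolerant lookups: most of the tree is never visited when the allowed distance is small.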
