diff --git a/site-html/404.html b/site-html/404.html index 94b3469..ac25724 100644 --- a/site-html/404.html +++ b/site-html/404.html @@ -12,15 +12,15 @@ - + -
+ Bases: osmium._osmium.BaseHandler
osmium._osmium.BaseHandler
Basic writer for OSM data. The SimpleWriter can write out @@ -1750,7 +1752,7 @@
Handler for retrieving and caching locations from ways @@ -1619,7 +1621,7 @@
pyosmium is a library to efficiently read and process OpenStreetMap data files. It is based on the osmium library for reading and writing data and adds convenience functions that allow you to set up fast processing pipelines in Python that can handle even planet-sized data.
This manual comes in three parts:
The recommended way to install pyosmium is via pip:
pip install osmium\n
Binary wheels are provided for all actively maintained Python versions on Linux, macOS and Windows 64bit.
To compile pyosmium from source or when installing it from the source wheel, the following additional dependencies need to be available:
On Debian/Ubuntu-like systems, the following command installs all required packages:
sudo apt-get install python3-dev build-essential cmake libboost-dev \\\n libexpat1-dev zlib1g-dev libbz2-dev\n
libosmium, protozero and pybind11 are shipped with the source wheel. When building from source, you need to download the source code and put it in the subdirectory 'contrib'. Alternatively, if you want to put the sources somewhere else, point pyosmium to the source code location by setting the CMake variables LIBOSMIUM_PREFIX, PROTOZERO_PREFIX and PYBIND11_PREFIX respectively.
To compile and install the bindings, run
pip install [--user] .\n
This section presents practical examples and use cases where pyosmium can come in handy. Each cookbook comes with three sections:
All cookbooks are available as Jupyter notebooks in the pyosmium source tree.
This section lists all functions and classes that pyosmium implements for reference.
osmium.BaseHandler
Base class for all native handler functions in pyosmium. Any class that derives from this class can be used for parameters that need a handler-like object.
osmium.BaseFilter
Bases: osmium._osmium.BaseHandler
Base class for all native filter functions in pyosmium. A filter is a handler that returns a boolean in the handler functions indicating if the object should pass the filter (False) or be dropped (True).
enable_for(entities: osm_entity_bits) -> None
Set the OSM types this filter should be applied to. If an object has a type for which the filter is not enabled, the filter will be skipped completely. Or to put it in different words: every object for which the filter is not enabled, passes the filter automatically.
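The filter contract described above can be illustrated without pyosmium. The following is a minimal stand-alone sketch, not the library's implementation: `SketchKeyFilter` and `passes` are hypothetical names, the type bits are made up, and objects are modelled as a type plus a plain tag dictionary. It only mirrors the two rules from the text: a callback returns True to drop an object, and a filter that is not enabled for an object's type is skipped.

```python
# Stand-alone sketch of the filter contract; does not use pyosmium.
NODE, WAY, RELATION = 1, 2, 4  # hypothetical type bits for this sketch


class SketchKeyFilter:
    """Hypothetical filter: drops objects that lack the given tag key."""

    def __init__(self, key):
        self.key = key
        self.enabled = NODE | WAY | RELATION  # enabled for all types by default

    def enable_for(self, entities):
        self.enabled = entities
        return self  # returned so that the call can be used inline

    def __call__(self, obj_type, tags):
        # True means "drop the object", False means "let it pass".
        return self.key not in tags


def passes(filters, obj_type, tags):
    """Return True if the object survives every filter enabled for its type."""
    for f in filters:
        if not (f.enabled & obj_type):
            continue  # filter not enabled for this type: skipped completely
        if f(obj_type, tags):
            return False
    return True


filters = [SketchKeyFilter('place').enable_for(NODE)]
node_with_place = passes(filters, NODE, {'place': 'village'})  # passes
node_without = passes(filters, NODE, {'shop': 'bakery'})       # dropped
way_without = passes(filters, WAY, {'highway': 'path'})        # filter skipped
```

The way passes even though it has no place tag, because the only filter in the chain is not enabled for ways.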
HandlerLike
Many functions in pyosmium take handler-like objects as a parameter. In addition to classes that derive from BaseHandler and BaseFilter, you may also hand in any object that implements at least one of the handler functions node(), way(), relation(), area(), or changeset().
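What "handler-like" means can be sketched in plain Python. This is only an illustration of the duck-typing idea, not how pyosmium dispatches internally: `CountingHandler` and `dispatch` are hypothetical names, and each object is represented by a made-up `(kind, id)` tuple.

```python
# Stand-alone sketch of duck-typed handlers; does not use pyosmium.
class CountingHandler:
    """Hypothetical handler-like object: it only implements node() and way()."""

    def __init__(self):
        self.seen = {'node': 0, 'way': 0}

    def node(self, n):
        self.seen['node'] += 1

    def way(self, w):
        self.seen['way'] += 1


def dispatch(handler, objects):
    """Call the callback matching each object's kind, if the handler has one."""
    for kind, obj_id in objects:
        callback = getattr(handler, kind, None)
        if callback is not None:
            callback(obj_id)


handler = CountingHandler()
dispatch(handler, [('node', 1), ('node', 2), ('way', 10), ('relation', 100)])
```

The relation is silently ignored because the handler has no relation() method, which is the behaviour you would expect from an object that implements only some of the callbacks.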
This user manual gives you an introduction on how to process OpenStreetMap data
pyosmium builds on the fast and efficient libosmium library. It borrows many of its concepts from libosmium. For more in-depth information, you might also want to consult the libosmium manual.
import osmium\nfrom dataclasses import dataclass\nimport json\n
@dataclass\nclass PlaceInfo:\n id: int\n tags: dict[str, str]\n coords: tuple[float, float]\n\ngeojsonfab = osmium.geom.GeoJSONFactory()\n\nclass BoundaryHandler(osmium.SimpleHandler):\n def __init__(self, outfile):\n self.places = {}\n self.outfile = outfile\n # write the header of the geojson file\n self.outfile.write('{\"type\": \"FeatureCollection\", \"features\": [')\n # This just makes sure that the commas are placed correctly.\n self.delim = ''\n\n def finish(self):\n self.outfile.write(']}')\n\n def node(self, n):\n self.places[n.tags['wikidata']] = PlaceInfo(n.id, dict(n.tags), (n.location.lon, n.location.lat))\n \n def area(self, a):\n # Find the corresponding place node\n place = self.places.get(a.tags.get('wikidata', 'not found'), None)\n # Geojsonfab creates a string with the geojson geometry.\n # Convert to a Python object to make it easier to add data.\n geom = json.loads(geojsonfab.create_multipolygon(a))\n if geom:\n # print the array delimiter, if necessary\n self.outfile.write(self.delim)\n self.delim = ','\n\n tags = dict(a.tags)\n # add the place information to the properties\n if place is not None:\n tags['place_node:id'] = str(place.id)\n tags['place_node:lat'] = str(place.coords[1])\n tags['place_node:lon'] = str(place.coords[0])\n for k, v in place.tags.items():\n tags['place_node:tags:' + k] = v\n # And wrap everything in proper GeoJSON.\n feature = {'type': 'Feature', 'geometry': geom, 'properties': dict(tags)}\n self.outfile.write(json.dumps(feature))\n\n# We are interested in boundary relations that make up areas and not in ways at all.\nfilters = [osmium.filter.KeyFilter('place').enable_for(osmium.osm.NODE),\n osmium.filter.KeyFilter('wikidata').enable_for(osmium.osm.NODE),\n osmium.filter.EntityFilter(~osmium.osm.WAY),\n osmium.filter.TagFilter(('boundary', 'administrative')).enable_for(osmium.osm.AREA | osmium.osm.RELATION)]\n\nwith open('../data/out/boundaries.geojson', 'w') as outf:\n handler = BoundaryHandler(outf)\n 
handler.apply_file('../data/liechtenstein.osm.pbf', filters=filters)\n handler.finish()\n
@dataclass\nclass PlaceInfo:\n id: int\n tags: dict[str, str]\n coords: tuple[float, float]\n
This class can now be filled from the OSM file:
class PlaceNodeReader:\n\n def __init__(self):\n self.places = {}\n\n def node(self, n):\n self.places[n.tags['wikidata']] = PlaceInfo(n.id, dict(n.tags), (n.location.lon, n.location.lat))\n\nreader = PlaceNodeReader()\n\nosmium.apply('../data/liechtenstein.osm.pbf',\n osmium.filter.KeyFilter('place').enable_for(osmium.osm.NODE),\n osmium.filter.KeyFilter('wikidata').enable_for(osmium.osm.NODE),\n reader)\n\nprint(f\"{len(reader.places)} places cached.\")\n
29 places cached.\n
We use the osmium.apply() function here with a handler instead of a FileProcessor. The two approaches are equivalent. Which one you choose depends on your personal taste. FileProcessor loops are less verbose and quicker to write. Handlers tend to yield more readable code when you want to do very different things with the different kinds of objects.
As you can see in the code, it is entirely possible to use filter functions with the apply() function. In our case, the filters make sure that only objects pass which have a place tag and a wikidata tag. This already leaves exactly the objects we need, so no further processing is needed in the handler callback.
Next the relations need to be read. Relations can be huge, so we don't want to cache them but write them directly out into a file. If we want to create a geojson file, then we need the geometry of the relation in geojson format. Getting geojson format itself is easy. Pyosmium has a converter built-in for this, the GeoJSONFactory:
geojsonfab = osmium.geom.GeoJSONFactory()\n
The factory only needs to be instantiated once and can then be used globally.
To get the polygon from a relation, the special area handler is needed. It is easiest to invoke by writing a SimpleHandler class with an area() callback. When apply_file() is called on the handler, it will take the necessary steps in the background to build the polygon geometries.
class BoundaryHandler(osmium.SimpleHandler):\n def __init__(self, places, outfile):\n self.places = places\n self.outfile = outfile\n # write the header of the geojson file\n self.outfile.write('{\"type\": \"FeatureCollection\", \"features\": [')\n # This just makes sure that the commas are placed correctly.\n self.delim = ''\n\n def finish(self):\n self.outfile.write(']}')\n\n def area(self, a):\n # Find the corresponding place node\n place = self.places.get(a.tags.get('wikidata', 'not found'), None)\n # Geojsonfab creates a string with the geojson geometry.\n # Convert to a Python object to make it easier to add data.\n geom = json.loads(geojsonfab.create_multipolygon(a))\n if geom:\n # print the array delimiter, if necessary\n self.outfile.write(self.delim)\n self.delim = ','\n\n tags = dict(a.tags)\n # add the place information to the properties\n if place is not None:\n tags['place_node:id'] = str(place.id)\n tags['place_node:lat'] = str(place.coords[1])\n tags['place_node:lon'] = str(place.coords[0])\n for k, v in place.tags.items():\n tags['place_node:tags:' + k] = v\n # And wrap everything in proper GeoJSON.\n feature = {'type': 'Feature', 'geometry': geom, 'properties': dict(tags)}\n self.outfile.write(json.dumps(feature))\n\n# We are interested in boundary relations that make up areas and not in ways at all.\nfilters = [osmium.filter.EntityFilter(osmium.osm.RELATION | osmium.osm.AREA),\n osmium.filter.TagFilter(('boundary', 'administrative'))]\n\nwith open('../data/out/boundaries.geojson', 'w') as outf:\n handler = BoundaryHandler(reader.places, outf)\n handler.apply_file('../data/liechtenstein.osm.pbf', filters=filters)\n handler.finish()\n
There are two things you should keep in mind, when working with areas:
That is already it. In the long version, we have read the input file twice: once to get the nodes and a second time to get the relations. This is not really necessary because the nodes always come before the relations in the file. The quick solution shows how to combine both handlers to create the geojson file in a single pass. The only part to pay attention to is the use of filters. Given that we have very different filters for nodes and relations, it is important to call enable_for() with the correct OSM type.
How to merge information from different OSM objects.
Administrative areas are often represented with two different objects in OSM: a node that describes the central point and a relation that contains all the ways that make up the boundary. The task is to find all administrative boundaries and their matching place nodes and output both together in a geojson file. Relations and place nodes should be matched when they have the same wikidata tag.
Whenever you want to look at more than one OSM object at a time, you need to cache objects. Before starting such a task, it is always worth taking a closer look at the objects of interest. Find out how many candidates there are for you to look at and save and how large these objects are. There are always multiple ways to cache your data. Sometimes, when the number of candidates is really large, it is even more sensible to reread the file instead of caching the information.
For the boundary problem, the calculation is relatively straightforward. Boundary relations are huge, so we do not want to cache them if it can somehow be avoided. That means we need to cache the place nodes. A quick look at TagInfo tells us that there are about 7 million place nodes in the OSM planet. That is not a lot in the grand scheme of things. We could just read them all into memory and be done with it. It is still worth taking a closer look. The place nodes are later matched up by their wikidata tag. Looking into the TagInfo combinations table, only about 10% of the place nodes have such a tag. That leaves roughly 850,000 nodes to cache. Much better!
Next we need to consider what information actually needs caching. In our case we want it all: the ID, the tags and the coordinates of the node. This information needs to be copied out of the node. You cannot just cache the entire node. Pyosmium won't let you do this because it wants to get rid of it as soon as the handler has seen it. Let's create a dataclass to receive the information we need:
import osmium\nfrom collections import defaultdict\n
fp = osmium.FileProcessor('../data/liechtenstein.osm.pbf', osmium.osm.RELATION)\\\n .with_filter(osmium.filter.TagFilter(('type', 'route')))\\\n .with_filter(osmium.filter.TagFilter(('route', 'bicycle')))\n\nroutes = {}\nmembers = defaultdict(list)\nfor rel in fp:\n routes[rel.id] = (rel.tags.get('name', ''), rel.tags.get('ref', ''))\n \n for member in rel.members:\n if member.type == 'w':\n members[member.ref].append(rel.id)\n\nwith osmium.SimpleWriter('../data/out/cycling.osm.opl', overwrite=True) as writer:\n fp = osmium.FileProcessor('../data/liechtenstein.osm.pbf')\\\n .with_filter(osmium.filter.IdFilter(members.keys()).enable_for(osmium.osm.WAY))\\\n .handler_for_filtered(writer)\n\n for way in fp:\n assert all(i in routes for i in members[way.id])\n # To add tags, first convert the tags into a Python dictionary.\n tags = dict(way.tags)\n tags['cycle_route:name'] = '|'.join(routes[i][0] for i in members[way.id])[:255]\n tags['cycle_route:ref'] = '|'.join(routes[i][1] for i in members[way.id])[:255]\n writer.add(way.replace(tags=tags))\n
fp = osmium.FileProcessor('../data/liechtenstein.osm.pbf', osmium.osm.RELATION)\\\n .with_filter(osmium.filter.TagFilter(('type', 'route')))\\\n .with_filter(osmium.filter.TagFilter(('route', 'bicycle')))\n\nroutes = {}\nfor rel in fp:\n routes[rel.id] = (rel.tags.get('name', ''), rel.tags.get('ref', ''))\n\nf\"Found {len(routes)} routes.\"\n
'Found 13 routes.'
It is safe to restrict the FileProcessor to the RELATION type because we are only interested in relations and don't need geometry information. A cycling route comes with two mandatory tags in OSM, type=route and route=bicycle. To filter for relations that have both tags in them, simply chain two TagFilters. Don't just use a single filter with two tags like this: osmium.filter.TagFilter(('type', 'route'), ('route', 'bicycle')). This would filter for relations that have either the route tag or the type tag. Not exactly what we want.
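The difference between one filter with two tags and two chained filters can be sketched with plain dictionaries. This stand-alone snippet does not use pyosmium: `matches_any` and `matches_chained` are hypothetical helpers that only mimic the "either tag" versus "both tags" behaviour described above.

```python
# Stand-alone sketch of filter semantics; does not use pyosmium.
def matches_any(tags, *pairs):
    """Mimics one filter given several tags: a single match is enough."""
    return any(tags.get(k) == v for k, v in pairs)


def matches_chained(tags, *pairs):
    """Mimics one filter per tag, chained: every filter must match."""
    return all(matches_any(tags, pair) for pair in pairs)


cycling = {'type': 'route', 'route': 'bicycle'}
hiking = {'type': 'route', 'route': 'hiking'}
wanted = (('type', 'route'), ('route', 'bicycle'))

single_filter = [matches_any(t, *wanted) for t in (cycling, hiking)]
chained_filters = [matches_chained(t, *wanted) for t in (cycling, hiking)]
```

With the single filter the hiking route slips through because it still carries type=route. The chained version rejects it.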
For each relation that goes through the filter, save the information needed. Resist the temptation to simply save the complete relation. For one thing, a single relation can become quite large. But more importantly, pyosmium will not allow you to access the object anymore once the end of the loop iteration is reached. You only ever see a temporary view of an object within the processing loop. You need to make a full copy of what you want to keep.
Next we need to save the way-relation membership. This can be done in a simple dictionary. Just keep in mind that a single way can be in multiple relations. The member lookup needs to point to a list:
members = defaultdict(list)\nfor rel in fp:\n for member in rel.members:\n if member.type == 'w':\n members[member.ref].append(rel.id)\n\nf\"Found {len(members)} ways that are part of a cycling relation.\"\n
'Found 1023 ways that are part of a cycling relation.'
This is all the information needed to add the cycling information to the ways. Now we can write out the enhanced cycling info file. Only the ways with relations on them need to be modified. So we use an IdFilter to process only these ways and forward all other objects directly to the writer. This works just the same as in the Enhance-Tags cookbook:
with osmium.SimpleWriter('../data/out/cycling.osm.opl', overwrite=True) as writer:\n fp = osmium.FileProcessor('../data/liechtenstein.osm.pbf')\\\n .with_filter(osmium.filter.IdFilter(members.keys()).enable_for(osmium.osm.WAY))\\\n .handler_for_filtered(writer)\n\n for way in fp:\n assert all(i in routes for i in members[way.id])\n # To add tags, first convert the tags into a Python dictionary.\n tags = dict(way.tags)\n tags['cycle_route:name'] = '|'.join(routes[i][0] for i in members[way.id])[:255]\n tags['cycle_route:ref'] = '|'.join(routes[i][1] for i in members[way.id])[:255]\n writer.add(way.replace(tags=tags))\n
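The tag-building step in the loop above can be tried out in isolation. The following stand-alone sketch uses made-up sample data in place of the `routes` and `members` lookups, and the helper `cycle_tags` is a hypothetical name. The slice to 255 characters is taken from the code above; presumably it is there because OSM limits tag values to 255 characters.

```python
# Stand-alone sketch with made-up sample data; does not use pyosmium.
from collections import defaultdict

# relation id -> (name, ref), as collected in the first pass
routes = {1: ('Rhein-Route', '2'), 2: ('Seen-Route', '9')}

# way id -> list of relation ids the way belongs to
members = defaultdict(list)
members[500].extend([1, 2])
members[501].append(2)


def cycle_tags(way_id):
    """Build the extra tags for one way from all routes it is part of."""
    rels = members[way_id]
    return {
        'cycle_route:name': '|'.join(routes[i][0] for i in rels)[:255],
        'cycle_route:ref': '|'.join(routes[i][1] for i in rels)[:255],
    }


shared = cycle_tags(500)  # way used by two routes
single = cycle_tags(501)  # way used by one route
```

A way that belongs to several routes gets all names and references joined with a pipe character, in the order in which the relations were read.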
How to transfer information from a relation to its members.
Take the name and reference from all cycling routes and add it to the member ways of the route relation. Write out a new file with the added way information.
The objects in an OSM file are usually ordered by their type: first come nodes, then ways and finally relations. Given that pyosmium always scans files sequentially, it will be necessary to read the OSM file twice when you want to transfer information from relations to ways.
The first pass is all about getting the information from the relations. There are two pieces of information to collect: the information about the relation itself and the information which relations a way belongs to. Let's start with collecting the relation information:
import osmium\n
with osmium.SimpleWriter('../data/out/renamed.pbf', overwrite=True) as writer:\n fp = osmium.FileProcessor('../data/liechtenstein.osm.pbf')\\\n .with_filter(osmium.filter.KeyFilter('name:fr'))\\\n .handler_for_filtered(writer)\n\n for obj in fp:\n # start with a set of tags without name:fr\n tags = {k: v for k, v in obj.tags if k != 'name:fr'}\n # replace the name tag with the French version\n tags['name'] = obj.tags['name:fr']\n # Save the original if it exists.\n if 'name' in obj.tags:\n tags['name:local'] = obj.tags['name']\n # Write back the object with the modified tags\n writer.add(obj.replace(tags=tags))\n
with osmium.SimpleWriter('../data/out/ele.osm.opl', overwrite=True) as writer:\n for obj in osmium.FileProcessor('../data/liechtenstein.osm.pbf'):\n if 'name:fr' in obj.tags:\n tags = {k: v for k, v in obj.tags if k != 'name:fr'}\n # ... do more stuff here\n writer.add(obj.replace(tags=tags))\n else:\n writer.add(obj)\n
If you run this code snippet on a large OSM file, it will take a very long time to execute. Even though we only want to change a handful of objects (all objects that have a name:fr tag), the FileProcessor needs to present every single object to the Python code in the loop because every single object needs to be written to the output file. We need a way to tell the FileProcessor to directly write out all the objects that we are not inspecting in the for loop. This can be done with the handler_for_filtered() function. It lets you define a handler for all the objects that the with_filter() filters have rejected. The SimpleWriter class can itself function as a handler. By setting it as the handler for filtered objects, they will be directly passed to the writer.
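The routing idea behind a fallback handler can be sketched in a few lines of plain Python. This is a simplified model and not pyosmium's implementation: `process` is a hypothetical generator, objects are made-up dictionaries, and a list stands in for the output file.

```python
# Stand-alone sketch of the routing idea; does not use pyosmium.
def process(objects, keep, fallback):
    """Yield objects accepted by keep(); hand all others to fallback()."""
    for obj in objects:
        if keep(obj):
            yield obj      # presented to the Python loop
        else:
            fallback(obj)  # bypasses the loop entirely


objects = [
    {'id': 1, 'tags': {'name': 'Vaduz', 'name:fr': 'Vaduz'}},
    {'id': 2, 'tags': {'highway': 'path'}},
    {'id': 3, 'tags': {}},
]

written = []  # stands in for the output file
in_loop = []  # ids that reached the for loop

for obj in process(objects, lambda o: 'name:fr' in o['tags'], written.append):
    in_loop.append(obj['id'])
    # Nothing is written here, so object 1 never reaches the output.

written_ids = [o['id'] for o in written]
```

Only the object with a name:fr tag reaches the loop. It is also the only one missing from the output, because the loop body never writes it.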
With the SimpleWriter as fallback in place, we can now create a FileProcessor that filters for objects with a name:fr tag:
with osmium.SimpleWriter('../data/out/buildings.opl', overwrite=True) as writer:\n fp = osmium.FileProcessor('../data/liechtenstein.osm.pbf')\\\n .with_filter(osmium.filter.KeyFilter('name:fr'))\\\n .handler_for_filtered(writer)\n\n for obj in fp:\n print(f\"{obj.id} has the French name {obj.tags['name']}.\")\n
1932181216 has the French name Vaduz.\n3696525426 has the French name Liechtenstein.\n9798887324 has the French name Schweizerisches Generalkonsulat.\n159018431 has the French name Rhein.\n424375869 has the French name Rhein.\n8497 has the French name Rhein-Route.\n12464 has the French name Seen-Route.\n16239 has the French name \u00d6sterreich.\n19664 has the French name Seen-Route - Etappe 9.\n27939 has the French name Cycling in Switzerland.\n51701 has the French name Schweiz/Suisse/Svizzera/Svizra.\n74942 has the French name Vorarlberg.\n102638 has the French name Rhein-Route - Etappe 3.\n102666 has the French name \u00d6sterreich - Schweiz.\n102877 has the French name \u00d6sterreich \u2014 Liechtenstein.\n123924 has the French name Rhein.\n302442 has the French name Schweizer Hauptstrassen.\n1155955 has the French name Liechtenstein.\n1550322 has the French name \u00d6sterreich \u2014 Schweiz / Suisse / Svizzera.\n1665395 has the French name Via Alpina Red.\n1686631 has the French name Graub\u00fcnden/Grischun/Grigioni.\n1687006 has the French name Sankt Gallen.\n2128682 has the French name R\u00e4tikon.\n2171555 has the French name EuroVelo 15 - Rheinradweg.\n2668952 has the French name European Union / Union Europ\u00e9enne / Europ\u00e4ische Union.\n2698607 has the French name Alps.\n11342353 has the French name Appenzeller Alpen.\n12579662 has the French name Via Alpina Green.\n12729625 has the French name Eurozone.\n13376469 has the French name Member States of the European Union / \u00c9tats members de l'Union europ\u00e9enne / Mitgliedstaaten der Europ\u00e4ischen Union.\n
If you run this piece of code, you will notice that suddenly all objects with a French name are missing from the output file. This happens because once an object is presented to Python, the SimpleWriter object doesn't see it anymore. You have to explicitly call one of the 'add' functions of the SimpleWriter to write the modified object. So the full code is:
with osmium.SimpleWriter('../data/out/buildings.opl', overwrite=True) as writer:\n fp = osmium.FileProcessor('../data/liechtenstein.osm.pbf')\\\n .with_filter(osmium.filter.KeyFilter('name:fr'))\\\n .handler_for_filtered(writer)\n for obj in fp:\n tags = {k: v for k, v in obj.tags if k != 'name:fr'}\n # ... do more stuff here\n writer.add(obj.replace(tags=tags))\n
How to modify selected objects in an OSM file.
Localise the OSM file for the French language: when a name:fr tag is available, replace the name tag with it and save the original name in name:local.
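The tag transformation asked for in the task can be written down for a plain dictionary first. This is a stand-alone sketch that does not use pyosmium: `localise` is a hypothetical helper name and the sample tags are made up.

```python
# Stand-alone sketch of the tag transformation; does not use pyosmium.
def localise(tags):
    """Return a copy of tags with name:fr promoted to name."""
    if 'name:fr' not in tags:
        return dict(tags)  # nothing to do
    new_tags = {k: v for k, v in tags.items() if k != 'name:fr'}
    new_tags['name'] = tags['name:fr']
    if 'name' in tags:
        # Keep the original name so that it is not lost.
        new_tags['name:local'] = tags['name']
    return new_tags


before = {'name': 'Schweiz', 'name:fr': 'Suisse', 'place': 'country'}
after = localise(before)
untouched = localise({'highway': 'path'})
```

Objects without a name:fr tag come back unchanged, which matches the idea that only a handful of objects need to be modified.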
To change selected tags in a file, it is necessary to read the file object by object, make changes as necessary and write back the data into a new file. This could be done with a simple FileProcessor (for reading the input file) that is combined with a SimpleWriter (for writing the output file):
with osmium.ForwardReferenceWriter('../data/out/centre.osm.pbf',\n '../data/liechtenstein.osm.pbf', overwrite=True) as writer:\n for obj in osmium.FileProcessor('../data/liechtenstein.osm.pbf', osmium.osm.NODE):\n if osmium.geom.haversine_distance(osmium.osm.Location(9.52, 47.13), obj.location) < 2000:\n writer.add_node(obj)\n
with osmium.SimpleWriter('../data/out/centre.opl', overwrite=True) as writer:\n for obj in osmium.FileProcessor('../data/liechtenstein.osm.pbf', osmium.osm.NODE):\n if osmium.geom.haversine_distance(osmium.osm.Location(9.52, 47.13), obj.location) < 2000:\n writer.add_node(obj)\n
The FileProcessor reads the data and SimpleWriter writes out the nodes that we are interested in. Given that we are looking at nodes only, the FileProcessor can be restricted to that type. For one thing, this makes processing faster. For another, it means we don't have to explicitly check for the type of the object within the for loop. We can trust that only nodes will be returned. Checking if a node should be included in the output file is a simple matter of computing the distance between the target coordinates and the location of the node. pyosmium has a convenient function haversine_distance() for that. It computes the distance between two points in meters.
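To get a feeling for the distance test, the haversine formula can be written out in plain Python. This is a stand-alone sketch and not the library function: `haversine_m` is a hypothetical name, and the Earth radius used here is an assumed mean value, so results may differ slightly from what pyosmium returns.

```python
# Stand-alone sketch of the haversine formula; does not use pyosmium.
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # assumed mean radius; the library may use another value


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in meters between two lon/lat points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


center = (9.52, 47.13)  # longitude first, as in osmium.osm.Location(9.52, 47.13)
same_point = haversine_m(*center, *center)
hundredth_degree_north = haversine_m(*center, 9.52, 47.14)
inside = hundredth_degree_north < 2000
```

A point one hundredth of a degree further north is a little more than one kilometre away and would therefore fall inside the 2000 m circle used in the cookbook.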
This gives us a file with nodes. But what about the ways and relations? To find out which ones to include, we need to follow the forward references. Given the IDs of the nodes already included in the file, we need to find the ways which reference any of the nodes. And then we need to find relations which reference either nodes already included or one of the newly found ways. Luckily for us, OSM files are ordered by node, way and relations. So by the time the FileProcessor sees the first way, it will already have seen all the nodes and it can make an informed decision on whether the way needs to be included. The same is true for relations. They are last in the file, so all the node and way members have been processed already. The situation is more complicated with relation members and nested relations. We leave those out for the moment.
Given that nodes, ways and relations need to be handled differently and we need to carry quite a bit of state, it is easier to implement the forward referencing collector as a handler class:
class CoordHandler:\n def __init__(self, coord, dist, writer):\n self.center = osmium.osm.Location(*coord)\n self.dist = dist\n self.writer = writer\n self.id_tracker = osmium.IdTracker()\n \n def node(self, n):\n if osmium.geom.haversine_distance(self.center, n.location) <= self.dist:\n self.writer.add_node(n)\n self.id_tracker.add_node(n.id)\n\n def way(self, w):\n if self.id_tracker.contains_any_references(w):\n self.writer.add_way(w)\n self.id_tracker.add_way(w.id)\n\n def relation(self, r):\n if self.id_tracker.contains_any_references(r):\n self.writer.add_relation(r)\n
The IdTracker class helps to keep track of all the objects that appear in the file. Every time a node or way is written, its ID is recorded. Tracking relation IDs would only be necessary for nested relations. The IdTracker also gives us a convenient function contains_any_references() which checks if any of the IDs it is tracking is referenced by the given object. If that is the case, the object needs to be written out.
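The bookkeeping that the tracker does for forward references can be modelled with ordinary sets. The sketch below is stand-alone and uses made-up sample data. It is not how IdTracker is implemented and only shows why the node, way, relation order of the file makes a single pass sufficient.

```python
# Stand-alone sketch of forward-reference tracking; does not use pyosmium.
# Made-up sample data, already in file order: nodes, then ways, then relations.
selected_nodes = {1, 2}  # nodes that passed the distance test
ways = {10: [1, 5, 6], 11: [7, 8], 12: [2, 3]}  # way id -> node ids
relations = {
    100: [('w', 11)],            # references only a way that was not kept
    101: [('n', 9), ('w', 12)],  # references a kept way
}

kept_ways = set()
for way_id, node_ids in ways.items():
    # Keep a way as soon as it uses at least one node that was kept.
    if selected_nodes.intersection(node_ids):
        kept_ways.add(way_id)

kept_relations = set()
for rel_id, rel_members in relations.items():
    # Keep a relation if any member is a kept node or a kept way.
    if any((kind == 'n' and ref in selected_nodes)
           or (kind == 'w' and ref in kept_ways)
           for kind, ref in rel_members):
        kept_relations.add(rel_id)
```

Way 10 is kept although two of its three nodes lie outside the selection. Those missing nodes are exactly what the backward references have to fill in afterwards.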
This is almost it. To get a referentially complete output file, we also need to add the objects that are referenced by the ways and relations we have added. This can be easily achieved by using the BackReferenceWriter in place of the SimpleWriter:
with osmium.BackReferenceWriter('../data/out/centre.osm.pbf', ref_src='../data/liechtenstein.osm.pbf', overwrite=True) as writer:\n osmium.apply('../data/liechtenstein.osm.pbf', CoordHandler((9.52, 47.13), 2000, writer))\n
To learn more about adding backward references, have a look at the cookbook on Filtering By Tags.
The ForwardReferenceWriter helps to automate most of what we have just done manually. It is a replacement for the SimpleWriter which collects the forward references under the hood. It first collects the OSM data that should be written in a temporary file. When the writer is closed, it adds the forward references from a reference file. This means that the ForwardReferenceWriter needs two mandatory parameters to be instantiated: the name of the file to write to and the name of the file to copy the referenced data from:
writer = osmium.ForwardReferenceWriter('../data/out/centre.osm.pbf', '../data/liechtenstein.osm.pbf', overwrite=True)\n
The writer will by default also add the necessary objects to make the file reference-complete. The writer can now replace the SimpleWriter in the code with the first attempt, resulting in the final solution shown in the Quick Solution.
How to create geographic extracts from an OSM file.
Given the country extract of Liechtenstein, extract all data that is within 2km of the coordinates 47.13,9.52. All objects inside the geographic area should be complete, meaning that complete geometries can be created for them.
OSM data is not a simple selection of geometries. In an OSM data file only the OSM nodes have a location. All other OSM objects are made up of OSM nodes or other OSM objects. To find out where an OSM way or relation is located on the planet, it is necessary to go back to the nodes it references.
For the task at hand this means that any filtering by geometry needs to start with the OSM nodes. Let's start with a simple script that writes out all the nodes within the circle defined in the task:
fp = osmium.FileProcessor('../data/liechtenstein.osm.pbf').with_filter(osmium.filter.KeyFilter('amenity'))\n\nwith osmium.BackReferenceWriter(\"../data/out/schools_full.osm.pbf\", ref_src='../data/liechtenstein.osm.pbf', overwrite=True) as writer:\n for obj in fp:\n if obj.tags['amenity'] == 'school':\n writer.add(obj)\n
When filtering objects from a file, it is important to include all objects that are referenced by the filtered objects. The BackReferenceWriter collects the references and writes out a complete file.
fp = osmium.FileProcessor('../data/liechtenstein.osm.pbf').with_filter(osmium.filter.KeyFilter('amenity'))\n
The additional filtering for the school value can then be done in the processing loop.
Let's first check how many school objects there are:
from collections import Counter\n\ncnt = Counter()\n\nfor obj in fp:\n if obj.tags['amenity'] == 'school':\n cnt.update([obj.type_str()])\n\nf\"Nodes: {cnt['n']} Ways: {cnt['w']} Relations: {cnt['r']}\"\n
'Nodes: 3 Ways: 19 Relations: 1'
The counter distinguishes by OSM object types. As we can see, schools exist as nodes (point geometries), ways (polygon geometries) and relations (multipolygon geometries). All of them need to appear in the output file.
The simple solution seems to be to write them all out into a file:
with osmium.SimpleWriter('../data/out/schools.opl', overwrite=True) as writer:\n for obj in fp:\n if obj.tags['amenity'] == 'school':\n writer.add(obj)\n
However, if you try to use the resulting file in another program, you may find that it complains that the data is incomplete. The schools that are saved as ways in the file reference nodes which are now missing. The school relation references ways which are missing. And these again reference nodes, which need to appear in the output file as well. The file needs to be made referentially complete.
references = {'n': set(), 'w': set(), 'r': set()} # save references by their object type\n\nfor obj in fp:\n if obj.tags['amenity'] == 'school':\n if obj.is_way():\n references['n'].update(n.ref for n in obj.nodes)\n elif obj.is_relation():\n for member in obj.members:\n references[member.type].add(member.ref)\n\nf\"Nodes: {len(references['n'])} Ways: {len(references['w'])} Relations: {len(references['r'])}\"\n
'Nodes: 325 Ways: 3 Relations: 0'
This gives us a set of all the direct references: the nodes of the school ways and the ways in the school relations. We are still missing the indirect references: the nodes from the ways of the school relations. It is not possible to collect those while scanning the file for the first time. By the time the relations are scanned and we know which additional ways are of interest, the ways have already been read. We could cache all the node locations when scanning the ways in the file for the first time but that can become quite a lot of data to remember. It is faster to simply scan the file again once we know which ways are of interest:
for obj in osmium.FileProcessor('../data/liechtenstein.osm.pbf', osmium.osm.WAY):\n if obj.id in references['w']:\n references['n'].update(n.ref for n in obj.nodes)\n\nf\"Nodes: {len(references['n'])} Ways: {len(references['w'])} Relations: {len(references['r'])}\"\n
'Nodes: 395 Ways: 3 Relations: 0'
This time it is not possible to use a key filter because the ways that are part of the relations are not necessarily tagged with amenity=school. They might not have any tags at all. However, we can use a different trick and tell the file processor to only scan the ways in the file. This is the second parameter in the FileProcessor() constructor.
After this second scan of the file, we know the IDs of all the objects that need to go into the output file. The data we are interested in doesn't have nested relations. When relations contain other relations, then another scan of the file is required to collect the triple indirection. This part shall be left as an exercise to the reader for now.
Once all the necessary IDs are collected, the objects need to be extracted from the original file. This can be done with the IdFilter. It gets a list of all object IDs it is supposed to let pass. Given that we need nodes and ways from the original file, two filters are necessary:
ref_fp = osmium.FileProcessor('../data/liechtenstein.osm.pbf', osmium.osm.NODE | osmium.osm.WAY)\\\n .with_filter(osmium.filter.IdFilter(references['n']).enable_for(osmium.osm.NODE))\\\n .with_filter(osmium.filter.IdFilter(references['w']).enable_for(osmium.osm.WAY))\n
The data from this FileProcessor needs to be merged with the filtered data originally written out. We cannot just concatenate the two files because the order of elements matters. Most applications that process OSM data expect the elements in a well defined order: first nodes, then ways, then relations, all sorted by ID. When the input files are ordered correctly already, then the zip_processors() function can be used to iterate over multiple FileProcessors in parallel and write out the data:
zip_processors()
filtered_fp = osmium.FileProcessor('../data/out/schools.opl')\n\nwith osmium.SimpleWriter(f'../data/out/schools_full.osm.pbf', overwrite=True) as writer:\n for filtered_obj, ref_obj in osmium.zip_processors(filtered_fp, ref_fp):\n if filtered_obj:\n writer.add(filtered_obj)\n else:\n writer.add(ref_obj.replace(tags={}))\n
This writes the data from the filtered file if any exists, and otherwise takes the data from the original file. Objects from the original file have their tags removed. This avoids having unwanted first-class objects in your file. All additionally added objects now exist for the sole purpose of completing the ones you have filtered.
references = osmium.IdTracker()\n\nwith osmium.SimpleWriter(f'../data/out/schools.opl', overwrite=True) as writer:\n for obj in fp:\n if obj.tags['amenity'] == 'school':\n writer.add(obj)\n references.add_references(obj)\n\nreferences.complete_backward_references('../data/liechtenstein.osm.pbf', relation_depth=10)\n
The function complete_backward_references() repeatedly reads from the file to collect all referenced objects. In contrast to the simpler solution above, it can also collect references in nested relations. The relation_depth parameter controls how far the nesting should be followed. In this case, we have set it to 10, which should be sufficient even for the most complex relations in OSM. It is a good idea not to set this parameter too high because every level of depth requires an additional scan of the relations in the reference file.
complete_backward_references()
relation_depth
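The effect of relation_depth can be illustrated without pyosmium. The following is a minimal sketch that uses a hypothetical in-memory dictionary in place of an OSM file; the function name and data layout are assumptions for illustration, and the exact depth semantics of IdTracker may differ in detail.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-in for an OSM file: relation id -> list of members.
# Each member is a (type, id) pair with type 'n', 'w' or 'r'.
Relations = Dict[int, List[Tuple[str, int]]]


def collect_backward_references(relations: Relations, wanted: Set[int],
                                relation_depth: int) -> Dict[str, Set[int]]:
    """Collect member IDs of the wanted relations, following nested
    relations for at most `relation_depth` additional rounds."""
    refs: Dict[str, Set[int]] = {'n': set(), 'w': set(), 'r': set()}
    frontier = set(wanted)
    seen = set(wanted)
    # Round 0 handles the direct members. Each extra round models one
    # more scan over the relations in the reference file.
    for _ in range(relation_depth + 1):
        next_frontier: Set[int] = set()
        for rel_id in frontier:
            for mtype, mid in relations.get(rel_id, []):
                refs[mtype].add(mid)
                if mtype == 'r' and mid not in seen:
                    seen.add(mid)
                    next_frontier.add(mid)
        if not next_frontier:
            break  # nothing new to resolve, stop scanning early
        frontier = next_frontier
    return refs


# Relation 1 contains relation 2, which in turn contains relation 3.
relations = {
    1: [('r', 2), ('w', 10)],
    2: [('r', 3), ('w', 20)],
    3: [('n', 30)],
}
shallow = collect_backward_references(relations, {1}, relation_depth=0)
deep = collect_backward_references(relations, {1}, relation_depth=10)
```

With a depth of 0 only the direct members of relation 1 are found. The larger depth also reaches the members of the nested relations, and the loop stops as soon as a round finds no new relations.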
With all the IDs collected, the final file can be written out as above. IdTracker can directly pose as a filter to a FileProcessor, so that the code can be slightly simplified:
fp1 = osmium.FileProcessor('../data/out/schools.opl')\nfp2 = osmium.FileProcessor('../data/liechtenstein.osm.pbf').with_filter(references.id_filter())\n\nwith osmium.SimpleWriter('../data/out/schools_full.opl', overwrite=True) as writer:\n for o1, o2 in osmium.zip_processors(fp1, fp2):\n if o1:\n writer.add(o1)\n else:\n writer.add(o2.replace(tags={}))\n
How to create a thematic extract from an OSM file.
Given the country extract of Liechtenstein, create a fully usable OSM file that only contains all the schools in the file.
Filtering school objects from a file is fairly easy. We need a file processor for the target file which returns all objects with an amenity key:
amenity
Let's try to collect the IDs of the missing referenced objects manually first. This helps to understand how the process works. In a first pass, we can simply collect all the IDs we encounter when processing the schools:
The IdTracker class will track backward references for you, just as described in the last paragraph.
IdTracker
The BackReferenceWriter encapsulates a SimpleWriter and IdTracker and writes out the referenced objects, when close() is called. This reduces the task of filtering schools to the simple solution shown in the beginning.
close()
import osmium\nimport geopandas\n
fp = osmium.FileProcessor('../data/buildings.opl')\\\n .with_areas()\\\n .with_filter(osmium.filter.GeoInterfaceFilter())\n\nfeatures = geopandas.GeoDataFrame.from_features(fp)\nlen(features)\n
11
This will load every single OSM object into the GeoDataFrame as long as it can be converted into a geometry, including all untagged nodes. This is usually not what you want. Therefore it is important to carefully filter the data before passing it to the GeoInterfaceFilter. In our case we are only interested in streets, that is, linear ways with a 'highway' tag. Let's add the appropriate filters:
fp = osmium.FileProcessor('../data/liechtenstein.osm.pbf')\\\n .with_locations()\\\n .with_filter(osmium.filter.EntityFilter(osmium.osm.WAY))\\\n .with_filter(osmium.filter.KeyFilter('highway'))\\\n .with_filter(osmium.filter.GeoInterfaceFilter())\n\nfeatures = geopandas.GeoDataFrame.from_features(fp)\n
The first filter restricts the selection to ways, the second filter only lets through highway objects. Let's have a look at the result:
features.plot()\n
<Axes: >
This shows all the highway features of Liechtenstein, including footways and paths:
features\n
7506 rows \u00d7 154 columns
It also contains all the tags that the OSM objects have. We are only interested in a selected number of tags. The GeoInterfaceFilter can be instructed to restrict the tags it adds to the properties:
fp = osmium.FileProcessor('../data/liechtenstein.osm.pbf')\\\n .with_locations()\\\n .with_filter(osmium.filter.EntityFilter(osmium.osm.WAY))\\\n .with_filter(osmium.filter.KeyFilter('highway'))\\\n .with_filter(osmium.filter.GeoInterfaceFilter(tags=['highway', 'name', 'maxspeed']))\n\nfeatures = geopandas.GeoDataFrame.from_features(fp)\n
This leaves us with a more concise data frame:
7506 rows \u00d7 4 columns
All that is left is to plot the ways according to their maxspeed:
features.plot(\"maxspeed\")\n
How to convert OSM data into a GeoPandas frame for further processing.
Show the street network of the input data on a map with different colors for the different maximum speeds.
GeoPandas is a useful tool to help you process and visualise geodata in a Jupyter notebook. It can read data from many sources. Among others, it can read geo features from a Python iterable where each element implements the __geo_interface__. An 'osmium.FileProcessor' happens to behave like a Python iterable. To make it compatible with GeoPandas, the output objects need to be extended with a __geo_interface__. This can be done with the GeoInterfaceFilter. This filter adds a geometry to the object. Therefore, geometry processing needs to be enabled accordingly:
__geo_interface__
osmium.area.AreaManager
Bases: osmium.BaseHandler
Handler class that manages building area objects from ways and relations.
Building area objects always requires two passes through the file: in the first pass, the area manager collects the relation candidates for areas and IDs of all ways that are needed to build their areas. During the second pass of the file the areas are assembled: areas from ways are created immediately when the handler encounters a closed way. Areas for relations are built as soon as all the ways that the relation needs are available.
You usually should not be using the AreaManager directly. The interface of the handler is considered an internal implementation detail and may change in future versions of pyosmium. Area assembly can be enabled through the SimpleHandler and the FileProcessor.
__init__() -> None
Set up a new area manager.
first_pass_handler() -> AreaManager
Return a handler object to be used for the first pass through a file. It collects information about area relations and their way members.
second_pass_handler(*handlers: HandlerLike) -> AreaManagerSecondPassHandler
Return a handler used for the second pass of the file, where areas are assembled. Pass the chain of filters and handlers that should be applied to the areas.
second_pass_to_buffer(callback: BufferIterator) -> AreaManagerBufferHandler
Return a handler for the second pass of the file, which stores assembled areas in the given buffer.
osmium.SimpleWriter
Basic writer for OSM data. The SimpleWriter can write out objects that are explicitly passed or function as a handler and write out all objects it receives. It is also possible to mix these two modes of operation.
The writer writes out the objects in the order it receives them. It is the responsibility of the caller to follow the ordering conventions for OSM files.
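Since the writer does not reorder anything, it can help to check the order of the data before writing. The following sketch assumes the conventional order described in the cookbook (nodes, then ways, then relations, each sorted by ID) and positive IDs only. Objects are modelled as plain (type, id) tuples, and the helper name is hypothetical.

```python
from typing import Iterable, Tuple

# Conventional type order in OSM files: nodes, then ways, then relations.
TYPE_RANK = {'n': 0, 'w': 1, 'r': 2}


def is_osm_ordered(objects: Iterable[Tuple[str, int]]) -> bool:
    """Return True if the (type, id) pairs follow the conventional order."""
    previous = None
    for otype, oid in objects:
        key = (TYPE_RANK[otype], oid)
        if previous is not None and key < previous:
            return False
        previous = key
    return True


good = [('n', 1), ('n', 7), ('w', 2), ('r', 1)]
bad = [('w', 2), ('n', 1)]  # a node after a way breaks the convention
```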
The SimpleWriter should normally be used as a context manager. If you don't use it in a with context, don't forget to call close() when writing is finished.
with
__init__(file: Union[str, os.PathLike[str], File], bufsz: int = ..., header: Optional[Header] = ..., overwrite: bool = ..., filetype: str = ...) -> None
Initiate a new writer for the given file. The writer will refuse to overwrite an already existing file unless overwrite is explicitly set to True.
True
The file type is usually determined from the file extension. If you want to explicitly set the filetype (for example, when writing to standard output '-'), then use a File object. Using the filetype parameter to set the file type is deprecated and only works when the file is a string.
The header parameter can be used to set a custom header in the output file. What kind of information can be written into the file header depends on the file type.
The optional parameter bufsz sets the size of the buffers used for collecting the data before they are written out. The default size is 4MB. Larger buffers are normally better but you should be aware that there are normally multiple buffers in use during the write process.
add(obj: object) -> None
Add a new object to the file. The function will try to determine the kind of object automatically.
add_node(node: object) -> None
Add a new node to the file. The node may be a Node object or its mutable variant or any other Python object that implements the same attributes.
add_relation(relation: object) -> None
Add a new relation to the file. The relation may be a Relation object or its mutable variant or any other Python object that implements the same attributes.
add_way(way: object) -> None
Add a new way to the file. The way may be a Way object or its mutable variant or any other Python object that implements the same attributes.
close() -> None
Flush the remaining buffers and close the writer. While it is not strictly necessary to call this function explicitly, it is still strongly recommended to close the writer as soon as possible, so that the buffer memory can be freed.
osmium.WriteHandler
Bases: osmium._osmium.SimpleWriter
osmium._osmium.SimpleWriter
(Deprecated) Handler function that writes all data directly to a file.
This is now simply an alias for SimpleWriter. Please refer to its documentation.
osmium.BackReferenceWriter
Writer that adds referenced objects, so that all written objects are reference-complete.
The collected data is first written into a temporary file and the necessary references are tracked internally. When the writer is closed, it writes the final file, mixing together the referenced objects from the original file and the written data.
The writer should usually be used as a context manager.
__init__(outfile: Union[str, os.PathLike[str], File], ref_src: Union[str, os.PathLike[str], File, FileBuffer], overwrite: bool = False, remove_tags: bool = True, relation_depth: int = 0)
Create a new writer.
outfile is the name of the output file to write. The file must not yet exist unless overwrite is set to True.
outfile
overwrite
ref_src is the OSM input file from which the referenced objects are taken. This is usually the same file the data to be written is taken from.
ref_src
The writer will by default remove all tags from referenced objects, so that they do not appear as stray objects in the file. Set remove_tags to False to keep the tags.
remove_tags
The writer will not complete nested relations by default. If you need nested relations, set relation_depth to the minimum depth to which relations shall be completed.
add(obj: Any) -> None
Write an arbitrary OSM object. This can be either an osmium object or a Python object that has the appropriate attributes.
add_node(n: Any) -> None
Write out an OSM node.
add_relation(r: Any) -> None
Write out an OSM relation.
add_way(w: Any) -> None
Write out an OSM way.
Close the writer and write out the final file.
The function will be automatically called when the writer is used as a context manager.
osmium.ForwardReferenceWriter
Writer that adds forward-referenced objects, optionally also making the final file reference-complete. An object is a forward reference when it directly or indirectly needs one of the objects originally written out.
The collected data is first written into a temporary file. When the writer is closed, the references are collected from the reference file and written out together with the collected data into the final file.
__init__(outfile: Union[str, os.PathLike[str], File], ref_src: Union[str, os.PathLike[str], File, FileBuffer], overwrite: bool = False, back_references: bool = True, remove_tags: bool = True, forward_relation_depth: int = 0, backward_relation_depth: int = 1) -> None
The writer will collect back-references by default to make the file reference-complete. Set back_references=False to disable this behaviour.
back_references=False
These classes expose the data from an OSM file to the Python scripts. Objects of these classes are always views unless stated otherwise. This means that they are only valid as long as the view to an object is valid.
osmium.osm.osm_entity_bits
osmium.osm.OSMObject
This is the base class for all OSM entity classes below and contains all common attributes.
changeset: int
property
(read-only) Id of changeset where this version of the object was created.
deleted: bool
(read-only) True if the object is no longer visible.
id: int
(read-only) OSM id of the object.
tags: TagList
(read-only) List of tags describing the object. See osmium.osm.TagList.
osmium.osm.TagList
timestamp: dt.datetime
(read-only) Date when this version has been created, returned as a datetime.datetime.
datetime.datetime
uid: int
(read-only) Id of the user that created this version of the object. Only this ID uniquely identifies users.
user: str
(read-only) Name of the user that created this version. Be aware that user names can change, so that the same user ID may appear with different names and vice versa.
version: int
(read-only) Version number of the object.
visible: bool
(read-only) True if the object is visible.
is_area() -> bool
Return true if the object is an Area object.
is_node() -> bool
Return true if the object is a Node object.
is_relation() -> bool
Return true if the object is a Relation object.
is_way() -> bool
Return true if the object is a Way object.
positive_id() -> int
Get the absolute value of the id of this object.
type_str() -> str
Return a single character identifying the type of the object. The character is the same as used in OPL.
user_is_anonymous() -> bool
Check if the user is anonymous. If true, the uid does not uniquely identify a single user but only the group of all anonymous users in general.
osmium.osm.Node
Bases: osmium.osm.types.OSMObject['cosm.COSMNode']
osmium.osm.types.OSMObject['cosm.COSMNode']
Represents a single OSM node. It inherits all properties from OSMObjects and adds a single extra attribute: the location.
lat: float
Return latitude of the node.
location: osmium.osm.Location
The geographic coordinates of the node. See osmium.osm.Location.
osmium.osm.Location
lon: float
Return longitude of the node.
replace(**kwargs: Any) -> osmium.osm.mutable.Node
Create a mutable node replacing the properties given in the named parameters. The properties may be any of the properties of OSMObject or Node.
Note that this function only creates a shallow copy by default. It is still bound to the scope of the original object. To create a full copy use: node.replace(tags=dict(node.tags))
node.replace(tags=dict(node.tags))
osmium.osm.Way
Bases: osmium.osm.types.OSMObject['cosm.COSMWay']
osmium.osm.types.OSMObject['cosm.COSMWay']
Represents an OSM way. It inherits the attributes from OSMObject and adds an ordered list of nodes that describes the way.
nodes: WayNodeList
(read-only) Ordered list of nodes. See osmium.osm.WayNodeList.
osmium.osm.WayNodeList
ends_have_same_id() -> bool
True if the start and end node are exactly the same.
ends_have_same_location() -> bool
True if the start and end node of the way are at the same location. Expects that the coordinates of the way nodes have been loaded (see SimpleHandler apply functions and FileProcessor.with_locations()). If the locations are not present, then the function always returns true.
FileProcessor.with_locations()
is_closed() -> bool
True if the start and end node are the same (synonym for ends_have_same_id).
ends_have_same_id
replace(**kwargs: Any) -> osmium.osm.mutable.Way
Create a mutable way replacing the properties given in the named parameters. The properties may be any of the properties of OSMObject or Way.
Note that this function only creates a shallow copy by default. It is still bound to the scope of the original object. To create a full copy use: way.replace(tags=dict(way.tags), nodes=list(way.nodes))
way.replace(tags=dict(way.tags), nodes=list(way.nodes))
osmium.osm.Relation
Bases: osmium.osm.types.OSMObject['cosm.COSMRelation']
osmium.osm.types.OSMObject['cosm.COSMRelation']
Represents an OSM relation. It inherits the attributes from OSMObject and adds an ordered list of members.
members: RelationMemberList
(read-only) Ordered list of relation members. See osmium.osm.RelationMemberList.
osmium.osm.RelationMemberList
replace(**kwargs: Any) -> osmium.osm.mutable.Relation
Create a mutable relation replacing the properties given in the named parameters. The properties may be any of the properties of OSMObject or Relation.
Note that this function only creates a shallow copy by default. It is still bound to the scope of the original object. To create a full copy use: rel.replace(tags=dict(rel.tags), members=list(rel.members))
rel.replace(tags=dict(rel.tags), members=list(rel.members))
osmium.osm.Area
Bases: osmium.osm.types.OSMObject['cosm.COSMArea']
osmium.osm.types.OSMObject['cosm.COSMArea']
Areas are a special kind of meta-object representing a polygon. They can either be derived from closed ways or from relations that represent multipolygons. They also inherit the attributes of OSMObjects and in addition contain polygon geometries. Areas have their own unique id space. This is computed as the OSM id times 2 and for relations 1 is added.
from_way() -> bool
Return true if the area was created from a way, false if it was created from a relation of multipolygon type.
inner_rings(oring: OuterRing) -> InnerRingIterator
Return an iterator over all inner rings of the multipolygon.
is_multipolygon() -> bool
Return true if this area is a true multipolygon, i.e. it consists of multiple outer rings.
num_rings() -> Tuple[int, int]
Return a tuple with the number of outer rings and inner rings.
This function goes through all rings to count them.
orig_id() -> int
Compute the original OSM id of this object. Note that this is not necessarily unique because the object might be a way or relation which have an overlapping id space.
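The ID scheme for areas described above can be written down as plain arithmetic. This is a small sketch for positive IDs only; the function names are hypothetical helpers and not part of pyosmium.

```python
def area_id_from_object(osm_id: int, from_way: bool) -> int:
    """Area id is the OSM id times 2, plus 1 if it comes from a relation."""
    return osm_id * 2 + (0 if from_way else 1)


def object_id_from_area(area_id: int) -> int:
    """Recover the original OSM id (what orig_id() reports)."""
    return area_id // 2


def area_is_from_way(area_id: int) -> bool:
    """Even area ids come from ways, odd ones from relations."""
    return area_id % 2 == 0


way_area = area_id_from_object(5, from_way=True)
rel_area = area_id_from_object(5, from_way=False)
```

Way 5 and relation 5 map to different area IDs but to the same original ID, which is why the original ID alone does not identify the source object.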
outer_rings() -> OuterRingIterator
Return an iterator over all outer rings of the multipolygon.
osmium.osm.Changeset
A changeset description.
bounds: osmium.osm.Box
(read-only) The bounding box of the area that was edited.
closed_at: dt.datetime
(read-only) Timestamp when the changeset was finalized. May be None when the changeset is still open.
None
created_at: dt.datetime
(read-only) Timestamp when the changeset was first opened.
(read-only) Unique ID of the changeset.
num_changes: int
(read-only) The total number of objects changed in this Changeset.
open: bool
(read-only) True when the changeset is still open.
(read-only) User ID of the changeset creator.
(read-only) Name of the user that created the changeset. Be aware that user names can change, so that the same user ID may appear with different names and vice versa.
Check if the user is anonymous. If true, the uid does not uniquely identify a single user but only the group of all anonymous users in general.
osmium.osm.Tag
Bases: typing.NamedTuple
typing.NamedTuple
A single OSM tag.
k: str
instance-attribute
Tag key
v: str
Tag value
Bases: typing.Iterable[osmium.osm.types.Tag]
typing.Iterable[osmium.osm.types.Tag]
A fixed list of tags. The list is exported as an immutable, dictionary-like object where the keys are tag strings and the items are Tags.
get(key: str, default: Optional[str] = None) -> Optional[str]
Return the value for the given key, or default if the key does not exist in the list.
osmium.osm.NodeRef
A reference to an OSM node that also caches the node's location.
(read-only) Latitude (y coordinate) as floating point number.
(read-only) Longitude (x coordinate) as floating point number.
x: int
(read-only) X coordinate (longitude) as a fixed-point integer.
y: int
(read-only) Y coordinate (latitude) as a fixed-point integer.
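The relation between the floating point and the fixed-point representation can be sketched as follows. The scaling factor of 10^7 is the coordinate precision used by libosmium as far as I am aware; treat it as an assumption, and note that the helper names are hypothetical.

```python
# Assumed scaling factor between degrees and fixed-point integers.
COORDINATE_PRECISION = 10_000_000


def to_fixed(degrees: float) -> int:
    """Convert a coordinate in degrees to a fixed-point integer."""
    return round(degrees * COORDINATE_PRECISION)


def to_degrees(fixed: int) -> float:
    """Convert a fixed-point integer back to degrees."""
    return fixed / COORDINATE_PRECISION


x = to_fixed(9.5209)    # longitude
y = to_fixed(47.1410)   # latitude
```

Integer coordinates compare exactly, which avoids the rounding problems of comparing floating point values for equality.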
osmium.osm.NodeRefList
A list of node references, implemented as an immutable sequence of osmium.osm.NodeRef. This class is normally not used directly, use one of its subclasses instead.
True if the start and end node of the way are at the same location. Expects that the coordinates of the way nodes have been loaded (see SimpleHandler apply functions and FileProcessor.with_locations()). If the locations are not present, then the function always returns true.
Bases: osmium.osm.types.NodeRefList
osmium.osm.types.NodeRefList
List of nodes in a way. For its members see osmium.osm.NodeRefList.
osmium.osm.InnerRing
List of nodes in an inner ring. For its members see osmium.osm.NodeRefList.
osmium.osm.OuterRing
List of nodes in an outer ring. For its members see osmium.osm.NodeRefList.
osmium.osm.RelationMember
Single member of a relation.
ref: int = ref
OSM ID of the object. Only unique within the type.
role: str = role
The role of the member within the relation, a free-text string. If no role is set then the string is empty.
type: str = mtype
Type of object referenced, a node, way or relation.
An immutable sequence of relation members osmium.osm.RelationMember.
osmium.osm.Box
osmium.InvalidLocationError
Bases: Exception
Exception
Raised when the location of a node is requested that has no valid location. To be valid, a location must be inside the -180 to 180 and -90 to 90 degree range.
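The validity rule can be expressed as a simple range check. This is only a sketch of the rule as stated here; the function name is hypothetical, and pyosmium performs the check internally.

```python
def location_is_valid(lon: float, lat: float) -> bool:
    """Return True if the coordinates are inside the valid degree range."""
    return -180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0


inside = location_is_valid(9.52, 47.14)
outside = location_is_valid(200.0, 47.14)
# NaN never compares as inside the range, so it counts as invalid here.
undefined = location_is_valid(float('nan'), 0.0)
```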
osmium.FileProcessor
A processor that reads an OSM file in a streaming fashion, optionally pre-filters the data and enhances it with geometry information, and returns the data via an iterator.
header: osmium.io.Header
(read-only) Header information for the file to be read.
node_location_storage: Optional[LocationTable]
Node location cache currently in use, if enabled. This can be used to manually look up locations of nodes. Be aware that the nodes must have been read before you can do a lookup via the location storage.
__init__(indata: Union[File, FileBuffer, str, os.PathLike[str]], entities: osmium.osm.osm_entity_bits = osmium.osm.ALL) -> None
Initialise a new file processor for the given input source indata. This may either be a filename, an instance of File or buffered data in form of a FileBuffer.
The types of objects which will be read from the file can be restricted with the entities parameter. The data will be skipped directly at the source file and will never be passed to any filters including the location and area processors. You usually should not be restricting objects, when using those.
handler_for_filtered(handler: osmium._osmium.HandlerLike) -> FileProcessor
Set a fallback handler for objects that have been filtered out.
Any object that does not pass the filter chain installed with with_filter() will be passed to this handler. This can be useful when the entire contents of a file should be passed to a writer and only some of the objects need to be processed specially in the iterator body.
with_areas(*filters: osmium._osmium.HandlerLike) -> FileProcessor
Enable area processing. When enabled, then closed ways and relations of type multipolygon will also be returned as an Area type.
Optionally one or more filters can be passed. These filters will be applied in the first pass, when relation candidates for areas are selected. Calling this function multiple times causes more filters to be added to the filter chain.
Calling this function automatically enables location caching with the default storage type if it was not enabled yet. To use a different storage type, call with_locations() explicitly with the appropriate storage parameter before calling this function.
with_locations()
with_filter(filt: osmium._osmium.HandlerLike) -> FileProcessor
Add a filter function to the processor's filter chain. Filters are called for each processed object in the order they have been installed. Only when the object passes all the filter functions will it be handed to the iterator.
Note that any handler-like object can be installed as a filter. A non-filtering handler simply works like an all-pass filter.
with_locations(storage: str = 'flex_mem') -> FileProcessor
Enable caching of node locations. The file processor will keep the coordinates of all nodes that are read from the file in memory and automatically enhance the node list of ways with the coordinates from the cache. This information can then be used to create geometries for ways. The node location cache can also be directly queried through the node_location_storage property.
The storage parameter can be used to change the type of cache used to store the coordinates. The default 'flex_mem' is good for small to medium-sized files. For large files you may need to switch to a disk-storage based implementation because the cache can become quite large. See the section on location storage in the user manual for more information.
osmium.OsmFileIterator
Low-level iterator interface for reading from an OSM source.
__init__(reader: Reader, *handlers: HandlerLike) -> None
Initialise a new iterator using the given reader as source. Each object is passed through the list of filters given by handlers. If all the filters are passed, the object is returned by next().
next()
set_filtered_handler(handler: object) -> None
Set a fallback handler for objects that have been filtered out. The objects will be passed to the single handler.
osmium.BufferIterator
(internal) Iterator interface for reading from a queue of buffers.
This class is needed for pyosmium's internal implementation. There is currently no way to create buffers or add them to the iterator from Python.
__init__(*handlers: HandlerLike) -> None
Create a new iterator. The iterator will pass each object through the filter chain handlers before returning it.
osmium.zip_processors(*procs: FileProcessor) -> Iterable[List[Optional[OSMEntity]]]
Return the data from the FileProcessors in parallel such that objects with the same ID are returned at the same time.
The processors must contain sorted data or the results are undefined.
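To make the parallel iteration concrete, here is a pure-Python model of the idea. It is not the pyosmium implementation. Objects are plain (type, id) tuples, the helper names are hypothetical, and the inputs are assumed to be sorted in the conventional order with positive IDs.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Obj = Tuple[str, int]  # (type, id) with type 'n', 'w' or 'r'
TYPE_RANK = {'n': 0, 'w': 1, 'r': 2}


def sort_key(obj: Obj) -> Tuple[int, int]:
    return (TYPE_RANK[obj[0]], obj[1])


def zip_sorted(*streams: Iterable[Obj]) -> Iterator[List[Optional[Obj]]]:
    """Yield one row per distinct object; streams without it contribute None."""
    iters = [iter(s) for s in streams]
    heads = [next(it, None) for it in iters]
    while any(h is not None for h in heads):
        # The smallest pending key decides which object comes next.
        smallest = min(sort_key(h) for h in heads if h is not None)
        row: List[Optional[Obj]] = []
        for i, head in enumerate(heads):
            if head is not None and sort_key(head) == smallest:
                row.append(head)
                heads[i] = next(iters[i], None)  # advance this stream only
            else:
                row.append(None)
        yield row


filtered = [('n', 1), ('w', 5)]
refs = [('n', 1), ('n', 2), ('w', 5), ('r', 7)]
rows = list(zip_sorted(filtered, refs))
```

Each row has one slot per input. A slot is None when that input has no object with the current type and ID, which is what makes the "take the first non-empty slot" pattern from the cookbook work.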
osmium.filter.EmptyTagFilter
Bases: osmium._osmium.BaseFilter
osmium._osmium.BaseFilter
Filter class which only lets pass objects which have at least one tag.
Create a new filter object.
osmium.filter.EntityFilter
Filter class which lets pass objects according to their type.
__init__(entities: osm_entity_bits) -> None
Create a new filter object. Only objects whose type is listed in entities can pass the filter.
osmium.filter.GeoInterfaceFilter
Filter class which adds a __geo_interface__ attribute to objects that have geometry information.
The filter can process node, way and area types. All other types will be dropped. To create geometries for ways, the location cache needs to be enabled. Relations and closed ways can only be transformed to polygons when the area handler is enabled.
__init__(drop_invalid_geometries: bool = ..., tags: Iterable[str] = ...) -> None
Create a new filter object. The filter will usually drop all objects that do not have a geometry. Set drop_invalid_geometries to False to just let them pass.
False
The filter will normally add all tags it finds as properties to the GeoInterface output. To filter the tags to relevant ones, set tags to the desired list.
osmium.filter.IdFilter
Filter class which only lets pass objects with given IDs.
This filter usually only makes sense when used together with a type restriction, set using enable_for().
__init__(ids: Iterable[int]) -> None
Create a new filter object. ids contains the IDs the filter should let pass. It can be any iterable over ints.
osmium.filter.KeyFilter
Filter class which lets objects pass which have tags with at least one of the listed keys.
This filter functions like an OR filter. To create an AND filter (a filter that lets objects pass that have tags with all the listed keys) you need to chain multiple KeyFilter objects.
__init__(*keys: str) -> None
Create a new filter object. The parameters list the keys by which the filter should choose the objects. At least one key is required.
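The difference between listing several keys in one filter and chaining several filters can be modelled with plain predicates over tag dictionaries. This is an illustrative sketch only; key_filter and passes_chain are hypothetical stand-ins, not pyosmium functions.

```python
from typing import Callable, Dict, List

Tags = Dict[str, str]
Filter = Callable[[Tags], bool]


def key_filter(*keys: str) -> Filter:
    """Model of an OR filter: pass if at least one listed key is present."""
    if not keys:
        raise ValueError("at least one key is required")
    return lambda tags: any(key in tags for key in keys)


def passes_chain(tags: Tags, chain: List[Filter]) -> bool:
    """An object is kept only if it passes every filter in the chain."""
    return all(filt(tags) for filt in chain)


road = {'highway': 'residential'}
named_road = {'highway': 'residential', 'name': 'Main Street'}

# One filter with two keys behaves like OR.
either = [key_filter('highway', 'name')]
# Two chained filters with one key each behave like AND.
both = [key_filter('highway'), key_filter('name')]
```

The unnamed road passes the single two-key filter but is dropped by the chain, while the named road passes both.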
osmium.filter.TagFilter
Filter class which lets objects pass which have tags with at least one of the listed key-value pairs.
This filter functions like an OR filter. To create an AND filter (a filter that lets objects pass that have tags with all the listed key-value pairs) you need to chain multiple TagFilter objects.
__init__(*tags: Tuple[str, str]) -> None
Create a new filter object. The parameters list the key-value pairs by which the filter should choose objects. Each pair must be a tuple with two strings and at least one pair is required.
osmium.geom.FactoryProtocol
Protocol for classes that implement the necessary functions for converting OSM objects into simple-feature-like geometries.
epsg: int
Projection of the output geometries as an EPSG number.
proj_string: str
Projection of the output geometries as a projection string.
create_linestring(line: LineStringLike, use_nodes: use_nodes = ..., direction: direction = ...) -> str
Create a line string geometry from a way-like object. This may be a Way or a WayNodeList. Subsequent nodes with the exact same coordinates will be filtered out because many tools consider repeated coordinates in a line string invalid. Set use_nodes to osmium.geom.ALL to suppress this behaviour.
osmium.geom.ALL
The line string usually follows the order of the node list. Set direction to osmium.geom.BACKWARDS to reverse the direction.
osmium.geom.BACKWARDS
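The two options of create_linestring() can be modelled on a plain list of coordinate tuples. This sketch only illustrates the described behaviour. The function and its boolean parameters are hypothetical and do not mirror the real signature, which takes the use_nodes and direction enums.

```python
from typing import List, Tuple

Coord = Tuple[float, float]


def linestring_coords(coords: List[Coord], unique: bool = True,
                      backwards: bool = False) -> List[Coord]:
    """Drop directly repeated coordinates and optionally reverse the order."""
    result: List[Coord] = []
    for coord in coords:
        # Only a coordinate identical to its direct predecessor is skipped.
        if unique and result and result[-1] == coord:
            continue
        result.append(coord)
    if backwards:
        result.reverse()
    return result


way = [(9.5, 47.1), (9.5, 47.1), (9.6, 47.2), (9.5, 47.1)]
cleaned = linestring_coords(way)
everything = linestring_coords(way, unique=False)
reversed_line = linestring_coords(way, backwards=True)
```

Note that only consecutive duplicates are removed. A coordinate that reappears later in the line, as in a closed way, is kept.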
create_multipolygon(area: osmium.osm.Area) -> str
Create a multi-polygon geometry from an Area object.
create_point(location: PointLike) -> str
Create a point geometry from a Node, a location or a node reference.
osmium.geom.Coordinates
Represents an x/y coordinate. The projection of the coordinate is left to the interpretation of the caller.
x: float
x portion of the coordinate.
y: float
y portion of the coordinate.
valid() -> bool
Return true if the coordinate is valid. A coordinate can only be invalid when both x and y are NaN.
osmium.geom.GeoJSONFactory
Bases: osmium.geom.FactoryProtocol
Factory that creates GeoJSON geometries from osmium geometries.
osmium.geom.WKBFactory
Factory that creates WKB from osmium geometries.
osmium.geom.WKTFactory
Factory that creates WKT from osmium geometries.
osmium.geom.direction
osmium.geom.use_nodes
osmium.geom.lonlat_to_mercator(coordinate: Coordinates) -> Coordinates
Convert coordinates from WGS84 to Mercator projection.
osmium.geom.mercator_to_lonlat(coordinate: Coordinates) -> Coordinates
Convert coordinates from Mercator projection to WGS84.
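For orientation, the conversion can be sketched with the standard spherical Web Mercator formulas. This assumes the projection is EPSG:3857 with an earth radius of 6378137 m. The library may use a faster approximation internally, so results can differ in the last digits, and the function names below are hypothetical stand-ins.

```python
import math
from typing import Tuple

# Assumed earth radius in metres for spherical Web Mercator (EPSG:3857).
EARTH_RADIUS = 6378137.0


def to_mercator(lon: float, lat: float) -> Tuple[float, float]:
    """Project WGS84 degrees to Mercator metres."""
    x = EARTH_RADIUS * math.radians(lon)
    y = EARTH_RADIUS * math.log(math.tan(math.pi / 4 + math.radians(lat) / 2))
    return x, y


def to_lonlat(x: float, y: float) -> Tuple[float, float]:
    """Inverse projection from Mercator metres back to WGS84 degrees."""
    lon = math.degrees(x / EARTH_RADIUS)
    lat = math.degrees(2 * math.atan(math.exp(y / EARTH_RADIUS)) - math.pi / 2)
    return lon, lat


x, y = to_mercator(9.52, 47.14)
lon, lat = to_lonlat(x, y)
```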
osmium.SimpleHandler
The most generic of OSM data handlers. Derive your data processor from this class and implement callbacks for each object type you are interested in. The following data types are recognised:
node, way, relation, area and changeset
node
way
relation
area
changeset
A callback takes exactly one parameter which is the object. Note that all objects that are handed into the handler are only readable and are only valid until the end of the callback is reached. Any data that should be retained must be copied into other data structures.
apply_buffer(buffer: Buffer, format: str, locations: bool = False, idx: str = 'flex_mem', filters: List[HandlerLike] = []) -> None
Apply the handler to a string buffer. The buffer must be a byte string.
apply_file(filename: Union[str, os.PathLike[str], File], locations: bool = False, idx: str = 'flex_mem', filters: List[HandlerLike] = []) -> None
Apply the handler to the given file. If locations is true, then a location handler will be applied before, which saves the node positions. In that case, the type of this position index can be further selected in idx. If an area callback is implemented, then the file will be scanned twice and a location handler and a handler for assembling multipolygons and areas from ways will be executed.
enabled_for() -> osm_entity_bits
Return the list of OSM object types this handler will handle.
osmium.MergeInputReader
Buffer which collects data from multiple input files, sorts it and optionally deduplicates the data before applying to a handler.
Initialize a new reader.
add_buffer(buffer: Union[ByteString, str], format: str) -> int
Add input data from a buffer to the reader. The buffer may be any data which follows the Python buffer protocol. The mandatory format parameter describes the format of the data.
The data will be copied into internal buffers, so that the input buffer can be safely discarded after the function has been called.
add_file(file: str) -> int
Add data from the given input file file to the reader.
apply(*handlers: HandlerLike, idx: str = '', simplify: bool = True) -> None
Apply collected data to a handler. The data will be sorted first. If simplify is true (default) then duplicates will be eliminated and only the newest version of each object kept. If idx is given a node location cache with the given type will be created and applied when creating the ways. Note that a diff file normally does not contain all node locations to reconstruct changed ways. If the full way geometries are needed, create a persistent node location cache during initial import of the area and reuse it when processing diffs. After the data has been applied the buffer of the MergeInputReader is empty and new data can be added for the next round of application.
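The sort-and-simplify step can be pictured with plain Python data. This is a minimal sketch and not pyosmium's implementation: objects are represented as hypothetical `(type, id, version)` tuples, and `sort_objects()` and `simplify_objects()` are illustrative helpers.

```python
# Conventional OSM file order: nodes first, then ways, then relations.
TYPE_ORDER = {'n': 0, 'w': 1, 'r': 2}


def sort_objects(objects):
    """Sort (type, id, version) tuples by type, then ID, then version."""
    return sorted(objects, key=lambda o: (TYPE_ORDER[o[0]], o[1], o[2]))


def simplify_objects(objects):
    """Keep only the newest version of each object, in sorted order."""
    newest = {}
    for otype, oid, version in sort_objects(objects):
        # Later (higher) versions overwrite earlier ones for the same key.
        newest[(otype, oid)] = (otype, oid, version)
    return list(newest.values())


collected = [('w', 1, 2), ('n', 5, 1), ('n', 5, 3), ('n', 2, 1), ('w', 1, 1)]
print(simplify_objects(collected))
```

Without simplification, every version stays in the sorted output, which is the behaviour history files need.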
apply_to_reader(reader: Reader, writer: Writer, with_history: bool = ...) -> None
Apply the collected data to data from the given reader and write the result to writer. This function can be used to merge the diff data together with other OSM data (for example, when updating a planet file). If with_history is true, then the collected data will be applied verbatim without removing duplicates. This is important when using OSM history files as input.
osmium.NodeLocationsForWays
Handler for retrieving and caching locations from nodes and adding them to ways.
apply_nodes_to_ways: bool
writable
When set (the default), the collected locations are propagated to the node list of ways.
__init__(locations: LocationTable) -> None
Initiate a new handler using the given location table locations to cache the node coordinates.
ignore_errors() -> None
Disable raising an exception when filling the node list of a way and a coordinate is not available.
osmium.apply(reader: Union[Reader, str, os.PathLike[str], File, FileBuffer], *handlers: HandlerLike) -> None
Apply a chain of handlers to the given input source. The input source may be an osmium.io.Reader, a file or a file buffer. If one of the handlers is a filter, then processing of the object will be stopped when it does not pass the filter.
osmium.make_simple_handler(node: HandlerFunc[Node] = None, way: HandlerFunc[Way] = None, relation: HandlerFunc[Relation] = None, area: HandlerFunc[Area] = None, changeset: HandlerFunc[Changeset] = None) -> SimpleHandler
(deprecated) Convenience function that creates a SimpleHandler from a set of callback functions. Each of the parameters takes an optional callable that must expect a single positional parameter with the object being processed.
osmium.io.File
A wrapper for an OSM data file.
has_multiple_object_versions: bool
True when the file is in a data format which supports having multiple versions of the same object in the file. This is usually the case with OSM history and diff files.
__init__(filename: Union[str, os.PathLike[str]], format: str = '') -> None
Initialise a new file object. Normally the file format of the file is guessed from the suffix of the file name. It may also be set explicitly using the format parameter.
parse_format(format: str) -> None
Set the format of the file from a format string.
osmium.io.FileBuffer
A wrapper around a buffer containing OSM data.
__init__(buf: Buffer, format: str) -> None
Initialise a new buffer object. buf can be any buffer that adheres to the Python buffer protocol. The format of the data must be defined in the format parameter.
osmium.io.Header
File header data with global information about the file.
Initiate an empty header.
add_box(box: Box) -> Header
Add the given bounding box to the list of bounding boxes saved in the header.
box() -> Box
Return the bounding box of the data in the file. If no such information is available, an invalid box is returned.
get(key: str, default: str = ...) -> str
Get the value of header option key or return default if there is no header option with that name.
set(key: str, value: str) -> None
Set the value of header option key to value.
osmium.io.Reader
Low-level object for reading data from an OSM file.
A Reader does not expose functions to process the data it has read from the file. Use apply for that purpose.
__init__(filename: Union[str, os.PathLike[str], FileBuffer, File], types: osm_entity_bits = ...) -> None
Create a new reader object. The input may either be a filename or a File or FileBuffer object. The types parameter defines which kinds of objects will be read from the input. Any types not present will be skipped completely when reading the file. Depending on the type of input, this can save quite a bit of time. However, be careful to not skip over types that may be referenced by other objects. For example, ways need nodes in order to compute their geometry.
Readers may be used as a context manager. In that case, the close() function will be called automatically when the reader leaves the scope.
Close any open file handles and free all resources. The Reader is unusable afterwards.
eof() -> bool
Check if the reader has reached the end of the input data.
header() -> Header
Return the Header structure containing global information about the input. What information is available depends on the format of the input data.
osmium.io.Writer
Low-level object for writing OSM data into a file. This class does not expose functions for receiving data to be written. Have a look at SimpleWriter for a higher-level interface for writing data.
__init__(ffile: Union[str, os.PathLike[str], File], header: Header = ...) -> None
Create a new Writer. The output may either be a simple filename or a File object. A custom Header object may be given, to customize the global file information that is written out. Be aware that not all file formats support writing out all header information.
close() -> int
Close any open file handles and free all resources. The Writer is unusable afterwards.
osmium.IdTracker
Class to keep track of node, way and relation IDs.
IDs can be added to the tracker in various ways: by adding IDs directly, by adding the IDs referenced in an OSM object or by extracting the referenced IDs from an input file.
The tracker can then be used as a filter to select objects based on whether they are contained in the tracker's ID lists.
Initialise a new empty tracker.
add_node(node: int) -> None
Add the given node ID to the tracker.
add_references(obj: object) -> None
Add all IDs referenced by the input object obj.
The function will track the IDs of node lists from Way objects or Python objects with a nodes attribute, which must be a sequence of ints. It also tracks the IDs of relation members from Relation objects or Python objects with a members attribute with an equivalent content. Input objects that do not fall into any of these categories are silently ignored.
add_relation(relation: int) -> None
Add the given relation ID to the tracker.
add_way(way: int) -> None
Add the given way ID to the tracker.
complete_backward_references(filename: Union[str, os.PathLike[str], File, FileBuffer], relation_depth: int = ...) -> None
Make the IDs in the tracker reference-complete by adding all referenced IDs for objects whose IDs are already tracked.
The function scans through the reference file filename, finds all the objects this tracker references and applies add_references() to them. The reference file is expected to be sorted.
The relation_depth parameter controls how nested relations are handled. When set to 0, only way and node references of relations that are already tracked are completed. If the parameter is larger than 0, the function makes at most relation_depth passes through the reference file to find nested relations. This means that nested relations with a nesting depth of up to relation_depth are guaranteed to be included. Relations that are nested more deeply may or may not appear.
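Why deeper nesting "may or may not appear" becomes clearer with a simplified model of the passes. The sketch below is an assumption-laden illustration, not pyosmium's code: the reference file is a hypothetical dict mapping a relation ID to the IDs of its member relations, and `complete_nested_relations()` is an invented helper that ignores nodes and ways.

```python
def complete_nested_relations(relations, tracked, relation_depth):
    """Add member relations of tracked relations in at most relation_depth passes.

    relations: dict of relation ID -> list of member relation IDs.
    tracked:   set of relation IDs that are already tracked.
    """
    tracked = set(tracked)
    for _ in range(relation_depth):
        size_before = len(tracked)
        # The reference file is sorted, so relations are visited by ascending ID.
        for rel_id in sorted(relations):
            if rel_id in tracked:
                tracked.update(relations[rel_id])
        if len(tracked) == size_before:
            break  # nothing new was found, further passes are pointless
    return tracked


# Members have lower IDs than their parents: one nesting level per pass.
descending = {1: [], 2: [1], 3: [2], 4: [3]}
print(sorted(complete_nested_relations(descending, {4}, 1)))

# Members have higher IDs than their parents: a single pass finds them all.
ascending = {1: [2], 2: [3], 3: [4], 4: []}
print(sorted(complete_nested_relations(ascending, {1}, 1)))
```

In this model a member with a higher ID than its parent is still ahead in the same pass, while a member with a lower ID has already been passed and needs another scan.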
complete_forward_references(filename: Union[str, os.PathLike[str], File, FileBuffer], relation_depth: int = ...) -> None
Add to the tracker the IDs of all objects that reference any ID already tracked.
The function scans through the reference file filename, checks all objects in the file with the contains_any_references() function and adds the object ID to the tracker if the check is positive.
The relation_depth parameter controls how nested relations are handled. When set to a value smaller than 0, relations will not be added to the tracker at all. When set to 0, only relations that reference a node or way already in the tracker are added. When set to a strictly positive value, nested relations are taken into account as well. The function will make at most relation_depth passes to complete relations with relation members.
contains_any_references(obj: object) -> bool
Check if the given input object obj contains any references to IDs tracked by this tracker.
The function will check the IDs of node lists from Way objects or Python objects with a nodes attribute, which must be a sequence of ints. It also checks the IDs of relation members from Relation objects or Python objects with a members attribute with an equivalent content. All other object kinds will return False.
contains_filter() -> IdTrackerContainsFilter
Return a filter object that lets all ways and relations pass which reference any of the object IDs tracked by this tracker.
You may change the tracker while the filter is in use. Such a change is then immediately reflected in the filter.
The filter has no effect on nodes, areas and changesets.
id_filter() -> IdTrackerIdFilter
Return a filter object which lets all nodes, ways and relations pass that are being tracked in this tracker.
The filter has no effect on areas and changesets.
node_ids() -> IdSet
Return a view of the set of node ids. The returned object is mutable. You may call operations like unset() and clear() on it, which then have a direct effect on the tracker.
relation_ids() -> IdSet
Return a view of the set of relation ids. The returned object is mutable. You may call operations like unset() and clear() on it, which then have a direct effect on the tracker.
way_ids() -> IdSet
Return a view of the set of way ids. The returned object is mutable. You may call operations like unset() and clear() on it, which then have a direct effect on the tracker.
osmium.index.IdSet
Compact storage for a set of IDs.
Initialise an empty set.
clear() -> None
Remove all IDs from the set.
empty() -> bool
Check if no IDs are stored yet.
get(id: int) -> bool
Check if the given ID is in the storage.
set(id: int) -> None
Add an ID to the storage. Does nothing if the ID is already contained.
unset(id: int) -> None
Remove an ID from the storage. Does nothing if the ID is not in the storage.
osmium.index.LocationTable
A map from a node ID to a location object. Location can be set and queried using the standard [] notation for dicts. This implementation works only with positive node IDs.
Remove all entries from the location table.
get(id: int) -> osmium.osm.Location
Get the location for the given node ID. Raises a KeyError when there is no location for the given id.
set(id: int, loc: osmium.osm.Location) -> None
Set the location for the given node ID.
used_memory() -> int
Return the size (in bytes) currently allocated by this location table.
osmium.index.map_types() -> List[str]
Return a list of strings with valid types for the location table.
osmium.index.create_map(map_type: str) -> LocationTable
Create a new location store. Use the map_type parameter to choose a concrete implementation. Some implementations take additional configuration parameters, which can also be set through the map_type argument. For example, to create an array cache backed by a file 'foo.store', the map_type needs to be set to dense_file_array,foo.store. Read the section on location storage in the user manual for more information about the different implementations.
pyosmium is a library that processes data as a stream: it reads the data from a file or other input source and presents the data to the user one object at a time. This means that it can efficiently process large files with many objects. The downside is that it is not possible to directly access specific objects as you need them. Instead, you need to apply some simple techniques of caching and repeated reading of files to get all the data you need. This takes some getting used to at the beginning, but pyosmium gives you the necessary tools to make it easy.
pyosmium lets you process OSM files just like any other file: open the file by instantiating a FileProcessor, then iterate over each OSM object in the file with a simple 'for' loop.
Let's start with a very simple script that lists the contents of a file:
Example
import osmium\n\nfor obj in osmium.FileProcessor('buildings.opl'):\n print(obj)\n
n1: location=45.0000000/13.0000000 tags={}\nn2: location=45.0001000/13.0000000 tags={}\nn3: location=45.0001000/13.0001000 tags={}\nn4: location=45.0000000/13.0001000 tags={entrance=yes}\nn11: location=45.0000000/13.0000000 tags={}\nn12: location=45.0000500/13.0000000 tags={}\nn13: location=45.0000500/13.0000500 tags={}\nn14: location=45.0000000/13.0000500 tags={}\nw1: nodes=[1,2,3,4,1] tags={}\nw2: nodes=[11,12,13,14,11] tags={}\nr1: members=[w1,w2], tags={type=multipolygon,building=yes}\n
While iterating over the file, pyosmium decodes the data from the file in the background and puts it into a buffer. It then returns a read-only view of each OSM object to Python. This is important to always keep in mind. pyosmium never shows you a full data object, it only ever presents a view. That means you can read and process the information about the object but you cannot change it or keep the object around for later. Once you retrieve the next object, the view will no longer be valid.
To show you what happens, when you try to keep the objects around, let us slightly modify the example above. Say you want to have a more compact output and just print for each object type, which IDs appear in the file. You might be tempted to just save the object and create the formatted output only after reading the file is done:
Buggy Example
# saves object by their type, more about types later\nobjects = {'n' : [], 'w': [], 'r': []}\n\nfor obj in osmium.FileProcessor('buildings.opl'):\n objects[obj.type_str()].append(obj)\n\nfor otype, olist in objects.items():\n print(f\"{otype}: {','.join(o.id for o in olist)}\")\n
Traceback (most recent call last):\n File \"bad_ref.py\", line 10, in <module>\n print(f\"{otype}: {','.join(o.id for o in olist)}\")\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File \"bad_ref.py\", line 10, in <genexpr>\n print(f\"{otype}: {','.join(o.id for o in olist)}\")\n ^^^^\n File \"osmium/osm/types.py\", line 313, in id\n return self._pyosmium_data.id()\n ^^^^^^^^^^^^^^^^^^^^^^^^\nRuntimeError: Illegal access to removed OSM object\n
As you can see, the code throws a runtime error complaining about an 'illegal access'. The objects dictionary doesn't contain any OSM objects. It has only collected the views on the objects. By the time a view is accessed in the print function, the buffer the view points to is long gone and pyosmium has invalidated the view. In practice this means that you need to make an explicit copy of all the information you need outside the loop iteration.
The code above can be easily \"fixed\" by saving only the id instead of the full object. This also happens to be much more memory efficient:
objects = {'n' : [], 'w': [], 'r': []}\n\nfor obj in osmium.FileProcessor('buildings.opl'):\n objects[obj.type_str()].append(obj.id)\n\nfor otype, olist in objects.items():\n print(f\"{otype}: {','.join(str(id) for id in olist)}\")\n
n: 1,2,3,4,11,12,13,14\nw: 1,2\nr: 1\n
The output shows IDs for three different kinds of objects: nodes, ways and relations. Before we can continue, you need to understand the basics about these types of objects and of the OpenStreetMap data model. If you are already familiar with the structure of OSM data, you can go directly to the next chapter.
OpenStreetMap data is organised as a topological model. Objects are not described with geometries like most GIS models do. Instead the objects are described in how they relate to points in the world. This makes a huge difference in how the data is processed.
An OSM object does not have a pre-defined function. What an object represents is described with a set of properties, the tags. This is a simple key-value store of strings. The meaning of the tags is not part of the data model definition. Except for some minor technical limits, for example a maximum length, any string can appear in the key and value of the tags. Which keys and values are used is decided through consensus between users. This gives OSM great flexibility to experiment with new kinds of data and evolve its dataset. Over time a large set of agreed-upon tags has emerged for most kinds of objects. These are the tags you will usually work with. You can search the documentation in the OSM Wiki to find out about the tags. It is also always useful to consult Taginfo, which shows statistics over the different keys and values in actual use.
Tags are common to all OSM objects. After that there are three kinds of objects in OSM: nodes, ways and relations.
A node is a point on the surface of the earth. Its location is described through its latitude and longitude using projection WGS84.
Ways are lines that are created by connecting a sequence of nodes. The nodes are described with the ID of a node in the database. That means that a way object does not directly have coordinates. To find out about the coordinates where the way is located, it is necessary to look up the nodes of the way in the database and get their coordinates.
Representing a way through nodes has another interesting side effect: many of the nodes in OSM are not meaningful in themselves. They don't represent a bus stop or lamp post or entrance or any other point of interest. They only exist as supporting points for the ways and don't have any tags.
When a way ends at the same node ID where it starts, then the way may be interpreted as representing an area. Whether it is really an area or just a linear feature that happens to circle back on itself (for example, a fence around a garden) depends on the tags of the way. Areas are handled more in-depth in the chapter Working with Geometries.
A relation is an ordered collection of objects. Nodes, ways and relations all can be a member in a relation. In addition, a relation member can be assigned a role, a string that describes the function of a member. The data model doesn't define what relations should be used for or how the members should be interpreted.
The topological nature of the OSM data model means that an OSM object can rarely be regarded in isolation. OSM ways are not meaningful without the location information contained in their nodes. Conversely, changing the location of a node in a way also changes the geometry of the way, even though the way itself is not changed. This is an important concept to keep in mind when working with OSM data. In this manual, we will use the terms forward and backward references when talking about the dependencies between objects:
A forward reference means that an object is referenced by another. Nodes appear in ways. Ways appear in relations. A node may even have an indirect forward reference to a relation through a way it appears in. Forward references are important when tracking changes. When the location of a node changes, then all its forward references have to be reevaluated.
A backward reference goes from an object to its referenced children. Going from a way to its containing nodes means following a backward reference. Backward references are needed to get the complete geometry of an object: given that only nodes contain location information, we have to follow the backward references for ways and relations until we reach the nodes.
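The two directions can be made concrete with plain dictionaries. The following is a minimal sketch using hand-written sample data modelled on the buildings file from the previous chapter. The `ways` and `relations` dictionaries and both helper functions are hypothetical and not part of pyosmium.

```python
from collections import defaultdict

# way ID -> node IDs, relation ID -> (member type, member ID)
ways = {1: [1, 2, 3, 4, 1], 2: [11, 12, 13, 14, 11]}
relations = {1: [('w', 1), ('w', 2)]}


def backward_references(way_id):
    """Backward references of a way: the nodes it is built from."""
    return set(ways[way_id])


# Forward references have to be collected by inverting the data.
node_to_ways = defaultdict(set)
for way_id, node_ids in ways.items():
    for node_id in node_ids:
        node_to_ways[node_id].add(way_id)

way_to_relations = defaultdict(set)
for rel_id, members in relations.items():
    for member_type, ref in members:
        if member_type == 'w':
            way_to_relations[ref].add(rel_id)


def forward_references(node_id):
    """Ways that use the node and relations reached indirectly through them."""
    affected_ways = set(node_to_ways.get(node_id, ()))
    affected_relations = set()
    for way_id in affected_ways:
        affected_relations |= way_to_relations.get(way_id, set())
    return affected_ways, affected_relations


print(sorted(backward_references(1)))
print(forward_references(12))
```

Backward references can be read directly from the object, whereas forward references require an index built from all other objects. This is why tracking changes usually means an extra pass over the data.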
OSM files usually follow a sorting convention to make life easier for processing software: first come nodes, then ways, then relations. Each group of objects is ordered by ID. One of the advantages of this order is that you can be sure that you have been already presented with all backward references to an object, when it appears in the processing loop. Knowing this fact can help you optimise how often you have to read through the file and speed up processing.
Sadly, there is an exception to the rule which is nested relations: relations can of course contain other relations with a higher ID. If you have to work with nested relations, rescanning the file multiple times or keeping large parts of the file in memory is pretty much always unavoidable.
This chapter explains more about the different object types that are returned in pyosmium and how to access their data.
pyosmium may return five different types of objects. First there are the three base types from the OSM data model already introduced in the last chapter: nodes, ways and relations. Next there is an area type. It is explained in more detail in the Geometry chapter. Finally, there is a type for changesets, which contains information about edits in the OSM database. It can only appear in special changeset files and is explained in more detail below.
The FileProcessor may return any of these objects, when iterating over a file. Therefore, a script will usually first need to determine the type of object received. There are a couple of ways to do this.
is_*()
All object types, except changesets, implement a set of is_node/is_way/is_relation/is_area functions, which give a nicely readable way of testing for a specific object type.
for o in osmium.FileProcessor('buildings.opl'):\n if o.is_relation():\n print('Found a relation.')\n
Found a relation.\n
The type_str() function returns the type of the object as a single lower-case character. The supported types are n (node), w (way), r (relation), a (area) and c (changeset).
This type string can be useful for printing or when saving data by type. It can also be used to test for a specific type. It is particularly useful when testing for multiple types:
for o in osmium.FileProcessor('../data/buildings.opl'):\n if o.type_str() in 'wr':\n print('Found a way or relation.')\n
Found a way or relation.\nFound a way or relation.\nFound a way or relation.\n
Each OSM object type has a corresponding Python class. You can simply test for this object type:
for o in osmium.FileProcessor('buildings.opl'):\n if isinstance(o, osmium.osm.Relation):\n print('Found a relation.')\n
Every object has a list of properties, the tags. They can be accessed through the tags property, which provides a simple dictionary-like view of the tags. You can use the bracket notation to access a specific tag or use the more explicit get() function. Just like for Python dictionaries, an access by bracket raises a KeyError when the key you are looking for does not exist, while the get() function returns the selected default value.
The in operation can be used to check for existence of a key:
import osmium\n\nfor o in osmium.FileProcessor('buildings.opl'):\n    # When using the bracket notation, make sure the tag exists.\n    if 'entrance' in o.tags:\n        print('entrance =', o.tags['entrance'])\n\n    # The get() function never throws.\n    print('building =', o.tags.get('building', '<unset>'))\n
Tags can also be iterated over. The iterator returns Tag objects. These each hold a key (k) and a value (v) string. A tag is itself a Python iterable, so that you can easily iterate through keys and values like this:
from collections import Counter\n\nstats = Counter()\n\nfor o in osmium.FileProcessor('buildings.opl'):\n for k, v in o.tags:\n stats.update([(k, v)])\n\nprint(\"Most common tags:\", stats.most_common(3))\n
As with all data in OSM objects, the tags property is only a view on the tags of the object. If you want to save the tag list for later use, you must make a copy of the list. The simplest way to do this is to convert the tag list into a Python dictionary:
saved_tags = []\n\nfor o in osmium.FileProcessor('../data/buildings.opl'):\n if o.tags:\n saved_tags.append(dict(o.tags))\n\nprint(\"Saved tags:\", saved_tags)\n
Next to the tags, every OSM object also carries some meta information describing its ID, version and information regarding the editor.
The main property of a Node is the location, a coordinate in WGS84 projection. Latitude and longitude of the node can be accessed either through the location property or through the lat and lon shortcuts:
for o in osmium.FileProcessor('../data/buildings.opl', osmium.osm.NODE):\n assert (o.location.lon, o.location.lat) == (o.lon, o.lat)\n
OpenStreetMap, and by extension pyosmium, saves latitude and longitude internally as a 7-digit fixed-point number. You can access the coordinates as fixed-point integers through the x and y properties. There may be rare use cases, where using this fixed-point notation is faster and more precise.
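The relationship between the two representations is a fixed scale factor. The sketch below assumes that "7-digit fixed-point" means degrees multiplied by 10,000,000 and stored as an integer. The constant and the helper names are illustrative and not part of pyosmium.

```python
# Assumed scale: 7 decimal digits of a degree.
COORDINATE_PRECISION = 10_000_000


def to_fixed(degrees):
    """Convert a coordinate in degrees to a fixed-point integer."""
    return round(degrees * COORDINATE_PRECISION)


def to_degrees(fixed):
    """Convert a fixed-point integer back to degrees."""
    return fixed / COORDINATE_PRECISION


x = to_fixed(45.0001)   # longitude
y = to_fixed(13.00005)  # latitude
print(x, y)
print(to_degrees(x), to_degrees(y))
```

Integers compare exactly, which is one reason the fixed-point form can be preferable when testing whether two nodes share precisely the same position.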
The coordinates returned by the lat/lon accessors are guaranteed to be valid. That means that a value is set and is between -180 and 180 degrees for longitude and -90 and 90 degrees for latitude. If the file contains an invalid coordinate, then pyosmium will throw a ValueError. To access the raw unchecked coordinates, use the functions location.lat_without_check() and location.lon_without_check().
A Way is essentially an ordered sequence of nodes. This sequence can be accessed through the nodes property. An OSM way only stores the ID of each node. This can be rather inconvenient when you want to work with the geometry of the way, because the coordinates of each node need to be looked up. pyosmium therefore exposes a list of NodeRefs with the nodes property. Each element in this list contains the node ID and optionally the location of the node. The next chapter Working with Geometries explains in detail, how pyosmium can help to fill the location of the node.
A Relation is also an ordered sequence. Each sequence element can reference an arbitrary OSM object. In addition, each of the members can be assigned a role, an arbitrary string that describes the function of the member. The OSM data model does not specify what the function of a member is and which roles are defined. You need to know what kind of relation you are dealing with in order to understand what the members are supposed to represent. Over the years, the OSM community has established a convention that every relation comes with a type tag, which defines the basic kind of the relation. For each type you can refer to the Wiki documentation to learn about the meaning of members and roles. The most important types currently in use are:
The members of a relation can be accessed through the members property. This is a simple list of RelationMember objects. They expose the OSM type of the member, its ID and a role string. When no role has been set, the role property returns an empty string. Here is an example of a simple iteration over all members:
for o in osmium.FileProcessor('buildings.opl', osmium.osm.RELATION):\n for member in o.members:\n print(f\"Type: {member.type} ID: {member.ref} Role: {member.role}\")\n
The members property provides only a temporary read-only view of the members. If you want to save the list for later processing, you need to make an explicit copy like this:
memberlist = {}\n\nfor o in osmium.FileProcessor('buildings.opl', osmium.osm.RELATION):\n memberlist[o.id] = [(m.type, m.ref, m.role) for m in o.members]\n\nprint(memberlist)\n
Always keep in mind that relations can become very large. Some have thousands of members. Therefore consider very carefully which members you are actually interested in and only keep those that are needed later.
The Changeset type is the odd one out among the OSM data types. It does not contain actual map data. Instead it is used to save meta information about the edits made to the OSM database. You normally don't find Changeset objects in a data file. Changeset information is published in separate files.
When working with map data, sooner or later, you will need the geometry of an object: a point, a line or a polygon. OSM's topological data model doesn't make them directly available with each object. In order to build a geometry for an object, the location information from the referenced nodes needs to be collected and then the geometry can be assembled from that. pyosmium provides a number of data structures and helpers to create geometries for OSM objects.
OSM nodes are the only kind of OSM object that produce a point geometry. The location of the point is directly stored with the OSM nodes. This makes it straightforward to extract such a geometry:
for o in osmium.FileProcessor('buildings.opl', osmium.osm.NODE):\n print(f\"Node {o.id}: lat = {o.lat} lon = {o.lon}\")\n
Node 1: lat = 13.0 lon = 45.0\nNode 2: lat = 13.0 lon = 45.0001\nNode 3: lat = 13.0001 lon = 45.0001\nNode 4: lat = 13.0001 lon = 45.0\nNode 11: lat = 13.00001 lon = 45.00001\nNode 12: lat = 13.00001 lon = 45.00005\nNode 13: lat = 13.00005 lon = 45.00005\nNode 14: lat = 13.00005 lon = 45.00001\n
Line geometries are usually created from OSM ways. The OSM way object does not contain the coordinates of a line geometry directly. It only contains a list of references to OSM nodes. To create a line geometry from an OSM way, it is necessary to look up the coordinate of each referenced node. pyosmium provides an efficient way to do so: the location storage. The storage automatically records the coordinates of each node that is read from the file and caches them for future use. When later a way is read from a file, the list of nodes in the way is augmented with the appropriate coordinates. Location storage is not enabled by default. To add it to the processing, use the function with_locations() of the FileProcessor.
for o in osmium.FileProcessor('../data/buildings.opl').with_locations():\n if o.is_way():\n coords = \", \".join((f\"{n.lon} {n.lat}\" for n in o.nodes if n.location.valid()))\n print(f\"Way {o.id}: LINESTRING({coords})\")\n
Way 1: LINESTRING(45.0 13.0, 45.0001 13.0, 45.0001 13.0001, 45.0 13.0001, 45.0 13.0)\nWay 2: LINESTRING(45.00001 13.00001, 45.00005 13.00001, 45.00005 13.00005, 45.00001 13.00005, 45.00001 13.00001)\n
Not all OSM files are reference-complete. It can happen that some nodes which are referenced by a way are missing from a file. Always write your code so that it can work with incomplete geometries. In particular, you should be aware that there is no guarantee that an OSM way will translate into a valid line geometry. An OSM way may consist of only one node. Or two subsequent coordinates in the line are exactly at the same position.
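One defensive pattern for the cases listed above is to clean the coordinate list before building a line from it. This is a minimal sketch on plain tuples, where `None` stands for a node whose location is missing. The helper `usable_line()` is hypothetical and not part of pyosmium.

```python
def usable_line(coords):
    """Return cleaned (lon, lat) tuples, or None if no valid line remains.

    Missing locations (None) are skipped and consecutive duplicate
    positions are collapsed. A line needs at least two distinct points.
    """
    cleaned = []
    for coord in coords:
        if coord is None:
            continue  # node was not in the file
        if cleaned and cleaned[-1] == coord:
            continue  # same position as the previous point
        cleaned.append(coord)
    return cleaned if len(cleaned) >= 2 else None


print(usable_line([(45.0, 13.0), None, (45.0001, 13.0)]))
print(usable_line([(45.0, 13.0), (45.0, 13.0)]))
```

Whether skipping a missing node is acceptable depends on the application. Dropping the whole way may be the safer choice when exact geometries matter.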
pyosmium provides different implementations for the location storage. The default should be suitable for small to medium-sized OSM files. See the paragraph on Location storage below for more information on the different types of storages and how to switch them.
OSM has two different ways to model area geometries: they may be derived from way objects or relation objects.
A way can be interpreted as an area when it is closed. That happens when the first and the last node are exactly the same. You can use the function is_closed() to check for this.
Not every closed way necessarily represents an area. Think of a little garden with a fence around it. If the OSM way represents the garden, then it should be interpreted as an area. If it represents the fence, then it is a line geometry that just happens to go full circle. You need to look at the tags of a way in order to decide if it should become an area or a line, or sometimes even both.
There are two types of relations that also represent areas. If the relation is tagged with type=multipolygon or type=boundary then it is by convention an area independently of all the other tags of the relation.
type=multipolygon
type=boundary
pyosmium implements a special handler for the processing of areas. This handler creates a new type of object, the Area object, and makes it available like the other OSM types. It can be enabled with the with_areas() function:
with_areas()
objects = ''\nareas = ''\nfor o in osmium.FileProcessor('../data/buildings.opl').with_areas():\n objects += f\" {o.type_str()}{o.id}\"\n if o.is_area():\n areas += f\" {o.type_str()}{o.id}({'w' if o.from_way() else 'r'}{o.orig_id()})\"\n\nprint(\"OSM objects in this file:\", objects)\nprint(\"Areas in this file:\", areas)\n
OSM objects in this file: n1 n2 n3 n4 n11 n12 n13 n14 w1 w2 r1 a2 a3\nAreas in this file: a2(w1) a3(r1)\n
Note how Area objects are added to the iterator in addition to the original OSM data. During the processing of the loop, there is first OSM way 1 and then the Area object 2, which corresponds to the same way.
When the area handler is enabled, the FileProcessor scans the file twice: during the first run information about all relations that might be areas is collected. This information is then used in the main run of the file processor, where the areas are assembled as soon as all the necessary objects that are part of each relation have been collected.
The area handler automatically enables a location storage because it needs access to the node geometries. It will set up the default implementation. To use a different implementation, simply use with_locations() with a custom storage together with with_areas().
The Area type has the same common attributes as the other OSM types. However, it produces its own special ID space. This is necessary because an area might be originally derived from a relation or way. When derived from a way, the ID is computed as 2 * way ID. When it is derived from a relation, the ID is 2 * relation ID + 1. Use the function from_way() to check what type the original OSM object is and the function orig_id() to get the ID of the underlying object.
2 * way ID
2 * relation ID + 1
from_way()
orig_id()
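The ID arithmetic described above can be written down in a few lines of plain Python. The helper names are hypothetical; within pyosmium you would normally rely on from_way() and orig_id() instead. The sketch assumes positive IDs.

```python
def area_id_from_way(way_id):
    """Area ID for an area derived from a way."""
    return 2 * way_id


def area_id_from_relation(relation_id):
    """Area ID for an area derived from a relation."""
    return 2 * relation_id + 1


def decode_area_id(area_id):
    """Return ('w' or 'r', original ID) for a positive area ID."""
    if area_id % 2 == 0:
        return ('w', area_id // 2)
    return ('r', (area_id - 1) // 2)
```

Applied to the example file above, area 2 decodes to way 1 and area 3 to relation 1.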
The polygon information is organised in lists of rings. Use outer_rings() to iterate over the rings of the polygon that form outer boundaries of the polygon. The data structures for these rings are node lists just like the ones used in OSM ways. They always form a closed line that goes clockwise. Each outer ring can have one or more holes. These can be iterated through with the inner_rings() function. The inner rings are also a node list but will go anti-clockwise. To illustrate how to process the functions, here is the simplified code to create the WKT representation of the polygon:
outer_rings()
inner_rings()
for o in osmium.FileProcessor('../data/buildings.opl').with_areas():\n if o.is_area():\n polygons = []\n for outer in o.outer_rings():\n rings = \"(\" + \", \".join((f\"{n.lon} {n.lat}\" for n in outer if n.location.valid())) + \")\"\n for inner in o.inner_rings(outer):\n rings += \", (\" + \", \".join((f\"{n.lon} {n.lat}\" for n in inner if n.location.valid())) + \")\"\n polygons.append(rings)\n if o.is_multipolygon():\n wkt = f\"MULTIPOLYGON(({'), ('.join(polygons)}))\"\n else:\n wkt = f\"POLYGON({polygons[0]})\"\n print(f\"Area {o.id}: {wkt}\")\n
Area 2: POLYGON((45.0 13.0, 45.0001 13.0, 45.0001 13.0001, 45.0 13.0001, 45.0 13.0))\nArea 3: POLYGON((45.0 13.0, 45.0001 13.0, 45.0001 13.0001, 45.0 13.0001, 45.0 13.0), (45.0 13.0, 45.0001 13.0, 45.0001 13.0001, 45.0 13.0001, 45.0 13.0))\n
OSM has many other relation types apart from the area types. pyosmium has no special support for other relation types yet. You need to manually assemble geometries by collecting the geometries of the members.
pyosmium has a number of geometry factories to make it easier to convert an OSM object to well-known geometry formats. To use them, instantiate the factory once and then hand in the OSM object to one of the create functions. A code snippet that converts all objects into WKT format looks approximately like this:
fab = osmium.geom.WKTFactory()\n\nfor o in osmium.FileProcessor('../data/buildings.opl').with_areas():\n if o.is_node():\n wkt = fab.create_point(o.location)\n elif o.is_way() and not o.is_closed():\n wkt = fab.create_linestring(o.nodes)\n elif o.is_area():\n wkt = fab.create_multipolygon(o)\n else:\n wkt = None # ignore relations\n
There are factories for GeoJSON (osmium.geom.GeoJSONFactory), well-known text (osmium.geom.WKTFactory) and well-known binary (osmium.geom.WKBFactory) formats.
If you want to process the geometries with Python libraries like shapely[1] or GeoPandas, then the standardized __geo_interface__ format can come in handy.
pyosmium has a special filter GeoInterfaceFilter which enhances pyosmium objects with a __geo_interface__ attribute. This allows libraries that support this interface to directly consume the OSM objects. The GeoInterfaceFilter needs location information to create the geometries. Don't forget to add with_locations() and/or with_areas() to the FileProcessor.
__geo_interface__
Here is an example that computes the total length of highways using the geometry functions of shapely:
from shapely.geometry import shape\n\ntotal = 0.0\nfor o in osmium.FileProcessor('liechtenstein.osm.pbf').with_locations().with_filter(osmium.filter.GeoInterfaceFilter()):\n if o.is_way() and 'highway' in o.tags:\n # Shapely has only support for Features starting from version 2.1,\n # so let's cheat a bit here.\n geom = shape(o.__geo_interface__['geometry'])\n # Length is computed in WGS84 projection, which is practically meaningless.\n # Let's pretend we didn't notice, it is an example after all.\n total += geom.length\n\nprint(\"Total length:\", total)\n
Total length: 14.58228287312081\n
For an example on how to use the Python Geo Interface together with GeoPandas, have a look at the Visualisation Recipe.
See the Osmium manual for the different types of location storage.
[1] Shapely only received full support for __geo_interface__ geometries with features in version 2.1. For older versions, create WKT geometries as explained above and create Shapely geometries from those.
When processing an OSM file, it is often only a very small part of the objects the script really needs to see and process. Say, you are interested in the road network, then the millions of buildings in the file could easily be skipped over. This is the task of filters. They provide a fast and performance-efficient way to pre-process or skip over data before it is processed within the Python code.
Filters can be added to a FileProcessor with the with_filter() function. An arbitrary number of filters can be added to the processor. Simply call the function as many times as needed. The filters will be executed in the order they have been added. If any of the filters marks the object for removal, the object is immediately dropped and the next object from the file is processed.
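The execution order can be pictured with a small conceptual model in plain Python. This is only an illustration of the semantics described above, not how pyosmium implements filters internally; the function name is hypothetical. Each filter is modelled as a callable that returns True when the object should be dropped.

```python
def run_filters(objects, filters):
    """Yield the objects that none of the filters wants to drop."""
    for obj in objects:
        # any() evaluates the filters in order and stops at the first
        # one that returns True, so later filters never see the object.
        if not any(drop(obj) for drop in filters):
            yield obj
```

For example, `list(run_filters(range(10), [lambda x: x % 2 == 1, lambda x: x > 5]))` keeps only the even numbers up to 5.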
Filters can have side effects. That means that a filter may add additional attributes to the OSM object it processes and these attributes will be visible for subsequent filters and in the Python processing code. For example, the GeoInterfaceFilter adds a Python __geo_interface__ attribute to the object.
Filters can be restricted to process only certain types of OSM objects. If an OSM object doesn't have the right type, the filter will be skipped over as if it wasn't defined at all. To restrict the types, call the enable_for() function.
Here is an example of a FileProcessor where only place nodes and boundary ways and relations are iterated through:
fp = osmium.FileProcessor('../data/liechtenstein.osm.pbf')\\\n .with_filter(osmium.filter.KeyFilter('place').enable_for(osmium.osm.NODE))\\\n .with_filter(osmium.filter.KeyFilter('boundary').enable_for(osmium.osm.WAY | osmium.osm.RELATION))\n
Once an object has been filtered, the default behaviour of the FileProcessor is to simply drop the object. Sometimes it can be useful to do something different with the object. For example, when you want to change some tags in a file and then write the data out again, then you'd usually want to filter out the objects that are not to be modified. However, you wouldn't want to drop them completely but write the unmodified object out. For such cases it is possible to set a fallback handler for filtered objects using the handler_for_filtered() function.
The file writer can become a fallback handler for the file processor. The next chapter Handlers will show how to write a custom handler that can be used in this function.
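A minimal sketch of this pattern, assuming the goal is to strip the `source` tag from every object that has one: the KeyFilter lets only objects with a `source` key through to the loop, and the writer receives all other objects unmodified through handler_for_filtered(). The function names and file arguments are placeholders chosen for this example.

```python
def drop_source_tag(tags):
    """Return a copy of the tag dict without the 'source' key."""
    return {k: v for k, v in tags.items() if k != 'source'}


def rewrite_file(infile, outfile):
    # Imported here so that the helper above is usable without pyosmium.
    import osmium

    with osmium.SimpleWriter(outfile) as writer:
        fp = osmium.FileProcessor(infile)\
                   .with_filter(osmium.filter.KeyFilter('source'))\
                   .handler_for_filtered(writer)
        for o in fp:
            # Only objects with a 'source' key arrive in this loop.
            # Everything else was handed to the writer as-is.
            writer.add(o.replace(tags=drop_source_tag(dict(o.tags))))
```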
The following section briefly describes the filters that are built into pyosmium.
This filter removes all objects that have no tags at all. Most of the nodes in an OSM file fall under this category. So even when you don't want to apply any other filters, this one can make a huge difference in processing time:
print(\"Total number of objects:\",\n sum(1 for o in osmium.FileProcessor('liechtenstein.osm.pbf')))\n\nprint(\"Total number of tagged objects:\",\n sum(1 for o in osmium.FileProcessor('liechtenstein.osm.pbf')\n .with_filter(osmium.filter.EmptyTagFilter())))\n
Total number of objects: 340175\nTotal number of tagged objects: 49645\n
The Entity filter only lets through objects of the selected type:
print(\"Total number of objects:\",\n sum(1 for o in osmium.FileProcessor('../data/liechtenstein.osm.pbf')))\n\nprint(\"Of which are nodes:\",\n sum(1 for o in osmium.FileProcessor('../data/liechtenstein.osm.pbf')\n .with_filter(osmium.filter.EntityFilter(osmium.osm.NODE))))\n
Total number of objects: 340175\nOf which are nodes: 306700\n
On the surface, the filter is very similar to the entity selector that can be passed to the FileProcessor. In fact, it would be much faster to count the nodes using the entity selector:
print(\"Of which are nodes:\",\n sum(1 for o in osmium.FileProcessor('../data/liechtenstein.osm.pbf', osmium.osm.NODE)))\n
Of which are nodes: 306700\n
However, the two implementations use different mechanisms to drop the nodes. When the entity selector in the FileProcessor is used like in the second example, then only the selected entities are read from the file. In our example, the file reader would skip over the ways and relations completely. When the entity filter is used, then the entities are only dropped when they get to the filter. Most importantly, the objects will still be visible to any filters applied before the entity filter.
This can become of some importance when working with geometries. Let's say we want to compute the length of all highways in our file. You will remember from the last chapter about Working with Geometries that it is necessary to enable the location cache in order to be able to get the geometries of the roads:
total = 0.0\n\nfor o in osmium.FileProcessor('../data/liechtenstein.osm.pbf')\\\n .with_locations()\\\n .with_filter(osmium.filter.EntityFilter(osmium.osm.WAY)):\n if 'highway' in o.tags:\n total += osmium.geom.haversine_distance(o.nodes)\n\nprint(f'Total length of highways is {total/1000} km.')\n
Total length of highways is 1350.8030544343883 km.\n
The location cache needs to see all nodes in order to record their locations. This would not happen if the file reader skips over the nodes. It is therefore imperative to use the entity filter here. In fact, pyosmium will refuse to run when nodes are not enabled in a FileProcessor with location caching:
Bad example
for o in osmium.FileProcessor('../data/liechtenstein.osm.pbf', osmium.osm.WAY).with_locations():\n if 'highway' in o.tags:\n osmium.geom.haversine_distance(o.nodes)\n
---------------------------------------------------------------------------\n\nRuntimeError Traceback (most recent call last)\n\nCell In[14], line 1\n----> 1 for o in osmium.FileProcessor('../data/liechtenstein.osm.pbf', osmium.osm.WAY).with_locations():\n 2 if 'highway' in o.tags:\n 3 osmium.geom.haversine_distance(o.nodes)\n\n\nFile ~/osm/dev/pyosmium/build/lib.linux-x86_64-cpython-311/osmium/file_processor.py:46, in FileProcessor.with_locations(self, storage)\n 42 \"\"\" Enable caching of node locations. This is necessary in order\n 43 to get geometries for ways and relations.\n 44 \"\"\"\n 45 if not (self._entities & osmium.osm.NODE):\n---> 46 raise RuntimeError('Nodes not read from file. Cannot enable location cache.')\n 47 if isinstance(storage, str):\n 48 self._node_store = osmium.index.create_map(storage)\n\n\nRuntimeError: Nodes not read from file. Cannot enable location cache.\n
This filter only lets through objects whose tag list contains any of the keys given in the arguments of the filter.
If you want to ensure that all of the keys are present, use the KeyFilter multiple times:
print(\"Objects with 'building' _or_ 'amenity' key:\",\n sum(1 for o in osmium.FileProcessor('../data/liechtenstein.osm.pbf')\n .with_filter(osmium.filter.KeyFilter('building', 'amenity'))))\n\nprint(\"Objects with 'building' _and_ 'amenity' key:\",\n sum(1 for o in osmium.FileProcessor('../data/liechtenstein.osm.pbf')\n .with_filter(osmium.filter.KeyFilter('building'))\n .with_filter(osmium.filter.KeyFilter('amenity'))))\n
This filter works exactly the same as the KeyFilter, only it looks for the presence of whole tags (key and value) in the tag list of the object.
This filter takes an iterable of numbers and only lets through objects whose ID is in that list. This filter is particularly useful when doing a two-stage processing, where in the first stage the file is scanned for objects that are of interest (for example, members of certain relations) and then in the second stage these objects are read from the file. You pretty much always want to use this filter in combination with the enable_for() function to restrict it to a certain object type.
In its purest form, the filter could be used to search for a single object in a file:
fp = osmium.FileProcessor('../data/buildings.opl')\\\n .with_filter(osmium.filter.EntityFilter(osmium.osm.WAY))\\\n .with_filter(osmium.filter.IdFilter([1]))\n\nfor o in fp:\n print(o)\n
However, in practice it is a very expensive way to find a single object. Remember that the entire file will be scanned by the FileProcessor just to find that one piece of information.
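The two-stage use of the IdFilter mentioned above might look roughly like the following sketch. It assumes that you are interested in all ways that are members of relations with a `route` key; the function names are hypothetical. The first pass reads only relations and collects the member IDs, the second pass reads only ways and keeps those whose ID was collected.

```python
def collect_way_members(relations):
    """Return the IDs of all way members found in the given relations."""
    way_ids = set()
    for rel in relations:
        for member in rel.members:
            if member.type == 'w':
                way_ids.add(member.ref)
    return way_ids


def ways_of_route_relations(filename):
    # Imported here so that the helper above is usable without pyosmium.
    import osmium

    # Stage 1: scan the relations and remember the way members.
    stage1 = osmium.FileProcessor(filename, osmium.osm.RELATION)\
                   .with_filter(osmium.filter.KeyFilter('route'))
    way_ids = collect_way_members(stage1)

    # Stage 2: read the ways and let only the collected IDs through.
    stage2 = osmium.FileProcessor(filename, osmium.osm.WAY)\
                   .with_filter(osmium.filter.IdFilter(way_ids)
                                      .enable_for(osmium.osm.WAY))
    return [w.id for w in stage2]
```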
It is also possible to define a custom filter in Python. Most of the time this is not very useful because calling a filter implemented in Python is just as expensive as returning the OSM object to Python and doing the processing then. However, it can be useful when the FileProcessor is used as an Iterable input to other libraries like GeoPandas.
A Python filter needs to be implemented as a class that looks exactly like a Handler class: for each type that should be handled by the filter, implement a callback function node(), way(), relation(), area() or changeset(). If a callback for a certain type is not implemented, then the object type will automatically pass through the filter. The callback function needs to return either 'True', when the object should be filtered out, or 'False' when it should pass through.
Here is a simple example of a filter that filters out all nodes that are older than 2020:
import datetime as dt\n\nclass DateFilter:\n\n def node(self, n):\n return n.timestamp < dt.datetime(2020, 1, 1, tzinfo=dt.UTC)\n\n\nprint(\"Total number of objects:\",\n sum(1 for o in osmium.FileProcessor('../data/liechtenstein.osm.pbf')))\n\nprint(\"Without nodes older than 2020:\",\n sum(1 for o in osmium.FileProcessor('../data/liechtenstein.osm.pbf')\n .with_filter(DateFilter())))\n
All examples so far have used the FileProcessor for reading files. It provides an iterative way of working through the data, which comes quite naturally to a Python programmer. This chapter shows a different way of processing a file. It shows how to create one or more handler classes and apply those to an input file.
Note: handler classes used to be the only way of processing data in older pyosmium versions. You may therefore find them in many tutorials and examples. There is no disadvantage in using FileProcessors instead. Handlers simply provide a different syntax for achieving a similar goal.
A pyosmium handler object is simply a Python object that implements callbacks to handle the different types of entities (node, way, relation, area, changeset). Usually you would define a class with your handler functions and instantiate it. A complete handler class that prints out each object in the file would look like this:
class PrintHandler:\n def node(self, n):\n print(n)\n\n def way(self, w):\n print(w)\n\n def relation(self, r):\n print(r)\n\n def area(self, a):\n print(a)\n\n def changeset(self, c):\n print(c)\n
Such a handler is applied to an OSM file with the function osmium.apply(). The function takes a single file as an argument and then an arbitrary number of handlers:
import osmium\n\nmy_handler = PrintHandler()\n\nosmium.apply('buildings.opl', my_handler)\n
n1: location=45.0000000/13.0000000 tags={}\nn2: location=45.0001000/13.0000000 tags={}\nn3: location=45.0001000/13.0001000 tags={}\nn4: location=45.0000000/13.0001000 tags={entrance=yes}\nn11: location=45.0000100/13.0000100 tags={}\nn12: location=45.0000500/13.0000100 tags={}\nn13: location=45.0000500/13.0000500 tags={}\nn14: location=45.0000100/13.0000500 tags={}\nw1: nodes=[1,2,3,4,1] tags={amenity=restaurant}\nw2: nodes=[11,12,13,14,11] tags={}\nr1: members=[w1,w2], tags={type=multipolygon,building=yes}\n
Filter functions are also recognised as handlers by the apply functions. They have the same effect as when used in FileProcessors: when they signal to filter out an object, then the processing is stopped for that object and the next object is processed. You can arbitrarily mix filters and custom-made handlers. They are sequentially executed in the order in which they appear in the apply function:
osmium.apply('buildings.opl',\n osmium.filter.EntityFilter(osmium.osm.RELATION),\n my_handler,\n osmium.filter.KeyFilter('route'),\n my_other_handler)\n
The apply function is a very low-level function for processing. It will only apply the handler functions to the input and be done with it. It will in particular not care about providing the necessary building blocks for geometry processing. If you need to work with geometries, you can derive your handler class from osmium.SimpleHandler. This mix-in class adds two convenience functions to your handler: apply_file() and apply_buffer(). These functions apply the handler itself to a file or buffer but come with additional parameters to enable location handling. If the handler implements an area callback, then they automatically enable area processing as well.
apply
apply_buffer()
pyosmium can also be used to write OSM files. It offers different writer classes which support creating referentially correct files.
All writers are created by instantiating them with the name of the file to write to.
writer = osmium.SimpleWriter('my_extra_data.osm.pbf')\n
The format of the output file is usually determined through the file suffix. pyosmium will refuse to overwrite any existing files. Either make sure to delete the files before instantiating a writer or use the parameter overwrite=True.
overwrite=True
Once a writer is instantiated, one of the add* functions can be used to add an OSM object to the file. You can either use one of the add_node/way/relation functions to force writing a specific type of object or use the generic add function, which will try to determine the object type. The OSM objects are directly written out in the order in which they are given to the writer object. It is your responsibility as a user to make sure that the order is correct with respect to the conventions for object order.
add*
add_node/way/relation
add
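By convention, OSM files contain nodes first, then ways, then relations, each group sorted by ID. If you collect objects in memory before writing, a sort key along the following lines can establish that order. The helper name is hypothetical and the sketch assumes positive IDs and the one-letter type codes returned by type_str().

```python
# Position of each object type in a conventionally ordered OSM file.
TYPE_ORDER = {'n': 0, 'w': 1, 'r': 2}


def osm_sort_key(type_str, osm_id):
    """Sort key: nodes before ways before relations, then by ID."""
    return (TYPE_ORDER[type_str], osm_id)
```

For instance, sorting `[('r', 1), ('n', 11), ('w', 2), ('n', 4)]` with `key=lambda o: osm_sort_key(*o)` puts the two nodes first, followed by the way and the relation.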
After writing all data the writer needs to be closed using the close() function. It is usually easier to use a writer as a context manager.
Here is a complete example for a script that converts a file from OPL format to PBF format:
with osmium.SimpleWriter('buildings.osm.pbf') as writer:\n for o in osmium.FileProcessor('buildings.opl'):\n writer.add(o)\n
In the example above an OSM object from an input file was written out directly without modifications. Writers can accept OSM nodes, ways and relations that way. However, usually you want to modify some of the data in the object before writing it out again. Use the replace() function to create a mutable version of the object with the given parameters replaced.
replace()
Say you want to create a copy of an OSM file with all source tags removed:
source
with osmium.SimpleWriter('buildings.osm.pbf') as writer:\n for o in osmium.FileProcessor('buildings.opl'):\n if 'source' in o.tags:\n new_tags = dict(o.tags) # make a copy of the tags\n del new_tags['source']\n writer.add(o.replace(tags=new_tags))\n else:\n # No source tag. Write object out as-is.\n writer.add(o)\n
You can also write data that is not based on OSM input data at all. The write functions will accept any Python object that mimics the attributes of a node, way or relation.
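As an illustration of writing data from scratch, the sketch below creates nodes with the mutable node class `osmium.osm.mutable.Node` and hands them to a writer. The helper names, the choice of a `name` tag and the example values are assumptions for this example.

```python
def make_poi(node_id, lon, lat, name):
    """Collect the attributes of a new node as keyword arguments."""
    return {'id': node_id, 'location': (lon, lat), 'tags': {'name': name}}


def write_pois(filename, pois):
    # Imported here so that the helper above is usable without pyosmium.
    import osmium.osm.mutable

    with osmium.SimpleWriter(filename) as writer:
        for poi in pois:
            # The location is given as a (lon, lat) tuple.
            writer.add_node(osmium.osm.mutable.Node(**poi))
```

A call such as `write_pois('pois.osm.pbf', [make_poi(1, 9.52, 47.14, 'Example')])` would then produce a file with a single tagged node.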
pyosmium implements three different writer classes: the basic SimpleWriter and the two reference-completing writers ForwardReferenceWriter and BackReferenceWriter.
pyosmium can read OSM data from different sources and in different formats.
pyosmium has built-in support for the most common OSM data formats as well as formats specific to libosmium. The format to use is usually determined by the suffix of the file name. The following table gives an overview of the recognised suffixes, the corresponding formats and whether each format supports reading and/or writing.
.pbf
.osm
.xml
.o5m
.opl
.debug
.ids
All formats also support compression with gzip (suffix .gz) and bzip2 (suffix .bz2) with the exception of the PBF format.
.gz
.bz2
The suffixes may be further prefixed by three subtypes:
.osc
.osh
Thus the type .osh.xml.bz2 would be an OSM history file in XML format that has been compressed using the bzip2 algorithm.
.osh.xml.bz2
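To make the suffix notation concrete, here is a small plain-Python sketch that splits a file name into subtype, format and compression, reading the suffixes from the right. It only illustrates the naming scheme described above and is not the parser that pyosmium or libosmium use; the function name and the returned dict layout are assumptions.

```python
FORMATS = {'pbf', 'osm', 'xml', 'o5m', 'opl', 'debug', 'ids'}
COMPRESSIONS = {'gz', 'bz2'}
SUBTYPES = {'osc', 'osh'}


def split_suffix(filename):
    """Split a file name into subtype, format and compression suffixes."""
    parts = filename.split('.')
    result = {'subtype': None, 'format': None, 'compression': None}
    # Work from the right: compression, then format, then subtype.
    if parts and parts[-1] in COMPRESSIONS:
        result['compression'] = parts.pop()
    if parts and parts[-1] in FORMATS:
        result['format'] = parts.pop()
    if parts and parts[-1] in SUBTYPES:
        result['subtype'] = parts.pop()
    return result
```

For `extract.osh.xml.bz2` this yields the subtype `osh`, the format `xml` and the compression `bz2`.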
If you have file inputs where the suffix differs from the internal format, the file type can be explicitly set by instantiating an osmium.io.File object. It takes an optional format parameter which then must contain the suffix notation of the desired file format.
This example forces the given input text file to be read as OPL.
fp = osmium.FileProcessor(osmium.io.File('assorted.txt', 'opl'))\n
The special file name - can be used to read from standard input or write to standard output.
-
When reading data, use a File object to specify the file format. With the SimpleWriter, you need to use the parameter filetype.
File
filetype
This code snippet dumps all ids of your input file to the console.
with osmium.SimpleWriter('-', filetype='ids') as writer:\n for o in osmium.FileProcessor('test.pbf'):\n writer.add(o)\n
pyosmium can also read data from an in-memory byte buffer. Simply wrap the buffer in an osmium.io.FileBuffer. The file format always needs to be explicitly given.
Reading from a buffer comes in handy when loading OSM data from a URL. This example computes statistics over data downloaded from a URL.
import urllib.request as urlrequest\n\ndata = urlrequest.urlopen('https://example.com/some.osm.gz').read()\n\ncounter = {'n': 0, 'w': 0, 'r': 0}\n\nfor o in osmium.FileProcessor(osmium.io.FileBuffer(data, 'osm.gz')):\n counter[o.type_str()] += 1\n\nprint(\"Nodes: %d\" % counter['n'])\nprint(\"Ways: %d\" % counter['w'])\nprint(\"Relations: %d\" % counter['r'])\n
OpenStreetMap produces two kinds of data, full data files and diff files with updates. This chapter explains how to handle diff files.
An OSM data file usually contains a snapshot of the OpenStreetMap database at a certain point in time. The full database contains even more data. It has all the changes that were ever made. The full version of the database with the complete history is contained in so-called history files. They do require some special attention when processing.
OpenStreetMap is a database that is constantly extended and updated. When you download the planet or an extract of it, you only get a snapshot of the database at a given point in time. To keep up-to-date with the development of OSM, you either need to download a new snapshot or you can update your existing data from change files published along with the planet file. Pyosmium ships with two tools that help you to process change files: pyosmium-get-changes and pyosmium-up-to-date.
pyosmium-get-changes
pyosmium-up-to-date
This section explains the basics of OSM change files and how to use Pyosmium's tools to keep your data up to date.
Regular change files are published for the planet and also by some extract services. These change files are special OSM data files containing all changes to the database in a regular interval. Change files are not referentially complete. That means that they only contain OSM objects that have changed but not necessarily all the objects that are referenced by the changed objects. Because of that, change files are rarely useful on their own. But they can be used to update an existing snapshot of OSM data.
There are multiple sources for OSM change files available:
https://planet.openstreetmap.org/replication is the official source for planet-wide updates. There are change files for minutely, hourly and daily intervals available.
Geofabrik offers daily change files for all its extracts. See the extract page for a link to the replication URL. Note that change files go only about 3 months back. Older files are deleted.
download.openstreetmap.fr offers minutely change files for all its extracts.
For other services also check out the list of providers on the OSM wiki.
If you have downloaded the full planet or obtained a PBF extract file from one of the sites which offer a replication service, then updating your OSM file can be as easy as:
pyosmium-up-to-date <osmfile.osm.pbf>\n
This finds the right replication source and file to start with, downloads changes and updates the given file with the data. You can repeat this command whenever you want to have newer data. The command automatically picks up at the same point where it left off after the previous update.
OSM files in PBF format are able to save the replication source and the current status on their own. That is why pyosmium-up-to-date is able to automatically do the right thing. If you want to switch the replication source or have a file that does not have replication information, you need to bootstrap the update process and manually point pyosmium-up-to-date to the right service:
pyosmium-up-to-date --ignore-osmosis-headers --server <replication URL> <osmfile.osm.pbf>\n
pyosmium-up-to-date automatically finds the right sequence ID to use by looking at the age of the data in your OSM file. It updates the file and stores the new replication source in the file. The additional parameters are then not necessary anymore for subsequent updates.
Tip
Always use the PBF format to store your data. Other formats do not support saving the replication information. pyosmium-up-to-date is still able to update these kinds of files if you manually point it to the replication server, but the process is always more costly because it needs to find the right starting point for updates first.
When used without any parameters, pyosmium-up-to-date downloads at most about 1GB of changes. That corresponds to about 3 days of planet-wide changes. You can increase the amount using the additional --size parameter, which takes the maximum size in MB:
--size
pyosmium-up-to-date --size=10000 planet.osm.pbf\n
This would download about 10GB or 30 days of change data. If your OSM data file is older than that, downloading the full file anew is likely going to be faster.
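Going by the figures above (about 1GB of changes for 3 days of planet-wide updates), a value for --size can be estimated from the age of your data. The helper below is a back-of-the-envelope sketch with a hypothetical name. The actual volume of changes varies from day to day and is much smaller for extracts.

```python
import math

# Rough figure from the text above: about 1000 MB per 3 days of changes.
MB_PER_DAY = 1000 / 3


def size_for_days(days):
    """Estimate the --size value (in MB) needed to cover `days` of changes."""
    return math.ceil(days * MB_PER_DAY)
```

With this estimate, 30 days of planet-wide changes correspond to the `--size=10000` used in the example above.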
pyosmium-up-to-date uses return codes to signal if it has downloaded all available updates. A return code of 0 means that it has downloaded and applied all available data. A return code of 1 indicates that it has applied some updates but more are available.
A minimal script that updates a file until it is really up-to-date with the replication source would look like this:
status=1 # we want more data\nwhile [ $status -eq 1 ]; do\n pyosmium-up-to-date planet.osm.pbf\n # save the return code\n status=$?\ndone\n
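The same loop can be sketched in Python with the subprocess module, which may be more convenient when the update is part of a larger Python script. The function name is hypothetical. The `run` parameter exists only so that the command runner can be replaced, for example in tests, and defaults to `subprocess.call`.

```python
import subprocess


def update_until_current(osmfile, run=subprocess.call):
    """Call pyosmium-up-to-date until no more updates are pending.

    Returns the final return code and the number of invocations.
    """
    rounds = 0
    while True:
        status = run(['pyosmium-up-to-date', osmfile])
        rounds += 1
        # Return code 1 means that more updates are available. Anything
        # else (0 for "up to date", or another code) ends the loop.
        if status != 1:
            return status, rounds
```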
There are quite a few tools that can import OSM data into databases, for example osm2pgsql, imposm or Nominatim. These tools often can use change files to keep their database up-to-date. pyosmium can be used to create the appropriate change files. This is slightly more involved than updating a file.
Before downloading the updates, you need to find out with which sequence number to start. The easiest way to remember your current status is to save the number in a file. pyosmium can then read and update the file for you.
If you still have the OSM file you used to set up your database, then create a state file as follows:
pyosmium-get-changes -O <osmfile.osm.pbf> -f sequence.state -v\n
Note that there is no output file yet. This creates a new file sequence.state with the sequence ID where updates should start and prints the URL of the replication service to use.
sequence.state
If you do not have the original OSM file anymore, then a good strategy is to look for the date of the newest node in the database to find the snapshot date of your database. Find the highest node ID, then look up the date for version 1 on the OSM website. For example, the date for node 23672341 can be found at https://www.openstreetmap.org/api/0.6/node/23672341/1. Find and copy the timestamp field. Then create a state file using this date:
timestamp
pyosmium-get-changes -D 2007-01-01T14:16:21Z -f sequence.state -v\n
As before, this creates a new file sequence.state with the sequence ID where updates should start and prints the URL of the replication service to use.
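Looking up the timestamp of the first version of a node can also be scripted. The following sketch uses only the Python standard library and the API URL pattern shown above; the function names are hypothetical. It assumes that the API answers with an XML document containing a single `node` element with a `timestamp` attribute.

```python
import urllib.request
import xml.etree.ElementTree as ET


def parse_node_timestamp(xml_text):
    """Extract the timestamp attribute of the node in an API response."""
    root = ET.fromstring(xml_text)
    return root.find('node').get('timestamp')


def fetch_first_version_timestamp(node_id):
    """Download version 1 of the node and return its timestamp."""
    url = f'https://www.openstreetmap.org/api/0.6/node/{node_id}/1'
    with urllib.request.urlopen(url) as response:
        return parse_node_timestamp(response.read())
```

The returned string can be passed directly to the -D parameter of pyosmium-get-changes.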
Now you can create change files using the state:
pyosmium-get-changes --server <replication server> -f sequence.state -o newchange.osc.gz\n
This downloads the latest changes from the server, saves them in the file newchange.osc.gz and updates your state file. <replication server> is the URL that was printed when you set up the state file. The parameter can be omitted when you use minutely change files from openstreetmap.org. By default, multiple edits of the same element are simplified into the final change. If you want to retain the full version history, specify --no-deduplicate.
newchange.osc.gz
<replication server>
--no-deduplicate
pyosmium-get-changes loads only about 100MB worth of updates at once (about 8 hours of planet updates). If you want more, then add a --size parameter.
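pyosmium-get-changes emits special return codes that can be used to set up a script that continuously fetches updates and applies them to a database. The important return codes are 0, which signals that changes were downloaded and written to the change file, and 3, which signals that no new changes are available.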
All other error codes indicate fatal errors.
A simple shell script can look like this:
while true; do\n # pyosmium-get-changes would not overwrite an existing change file\n rm -f newchange.osc.gz\n # get the next batch of changes\n pyosmium-get-changes -f sequence.state -o newchange.osc.gz\n # save the return code\n status=$?\n\n if [ $status -eq 0 ]; then\n # apply newchange.osc.gz here\n ....\n elif [ $status -eq 3 ]; then\n # No new data, so sleep for a bit\n sleep 60\n else\n echo \"Fatal error, stopping updates.\"\n exit $status\n fi\ndone\n