Comments (20)
My C++ isn't great either tbh!
I suspect stxxl would be easiest as it's pretty much a drop-in replacement and it's easily configurable for your own disk/memory needs.
from tilemaker.
Has anyone considered accomplishing this by splitting up a larger pbf into smaller pbfs at tile boundaries then running tilemaker sequentially over them?
from tilemaker.
Doing a linear fit gives a RAM requirement for the current planet of 290GB. Sizing for 3 years' growth, I'd add 40% to the current planet size. A more complex conversion would require more RAM, which I'd put at a further 50%, for a total estimated RAM need of about 600 GB in 3 years (290 GB × 1.4 × 1.5 ≈ 609 GB).
7.5 years on and tilemaker 3.0 processes the planet in around 24GB, so I think we can close this 🎉
from tilemaker.
This is really informative: thank you.
I'd be interested to hear what ideas people have for reducing memory consumption. Two I've thought of:
- Use an attribute dictionary across all OutputObjects, rather than one map per OutputObject as at present (which results in thousands of 'highway' keys)
- Delta-encode node ids in ways (zigzag varint)
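To make the second idea concrete, here is a minimal sketch of delta + zigzag + varint encoding for the node ids of one way. The function names are hypothetical and this is not tilemaker code; it only illustrates why consecutive ids that are numerically close end up costing a byte or two each instead of a fixed width.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Map a signed delta to an unsigned value so that small magnitudes stay small:
// 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
// (relies on >> of a negative value being an arithmetic shift, as on all
// mainstream compilers and as guaranteed from C++20)
inline uint64_t zigzagEncode(int64_t v) {
    return (static_cast<uint64_t>(v) << 1) ^ static_cast<uint64_t>(v >> 63);
}

inline int64_t zigzagDecode(uint64_t u) {
    return static_cast<int64_t>(u >> 1) ^ -static_cast<int64_t>(u & 1);
}

// Store each node id as the zigzag-encoded difference to the previous id,
// written as a varint (7 payload bits per byte, high bit = "more bytes follow").
std::vector<uint8_t> encodeWayNodes(const std::vector<int64_t>& ids) {
    std::vector<uint8_t> out;
    int64_t prev = 0;
    for (int64_t id : ids) {
        uint64_t u = zigzagEncode(id - prev);
        prev = id;
        while (u >= 0x80) {
            out.push_back(static_cast<uint8_t>((u & 0x7F) | 0x80));
            u >>= 7;
        }
        out.push_back(static_cast<uint8_t>(u));
    }
    return out;
}

// Inverse of encodeWayNodes: read varints and accumulate the deltas.
std::vector<int64_t> decodeWayNodes(const std::vector<uint8_t>& bytes) {
    std::vector<int64_t> ids;
    int64_t prev = 0;
    std::size_t i = 0;
    while (i < bytes.size()) {
        uint64_t u = 0;
        int shift = 0;
        while (true) {
            assert(i < bytes.size() && shift < 64);  // reject truncated input
            uint8_t b = bytes[i++];
            u |= static_cast<uint64_t>(b & 0x7F) << shift;
            if (!(b & 0x80)) break;
            shift += 7;
        }
        prev += zigzagDecode(u);
        ids.push_back(prev);
    }
    return ids;
}
```

The first id of a way still costs several bytes because its delta is taken against zero; the saving comes from the ids that follow it, and only if they are numerically close to their predecessor.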
from tilemaker.
On average a node is a member of 1.1 ways. Assuming that tilemaker currently uses 4 bytes per way node, this puts way node information at 4-7% of RAM. I wouldn't pursue this right now, but https://github.com/ademakov/Oroch#comparison and https://github.com/powturbo/TurboPFor#libraries-benchmarked are some sources I found.
from tilemaker.
I just found this repo and I think your approach looks interesting. I haven't looked at the code (sorry), but is there a reason to not use mmap or a similar technology that might keep parts of the required data on the disk? Obviously this would slow down the whole process (not sure how long the import for Germany or Europe takes right now?).
BTW: Maybe a stupid question as well, but why do we need that much RAM anyway? The planet OSM xml is roughly 60GB. I would assume that storing everything in RAM with some overhead might be possible with ~100GB. I took a sneak peek at the code, and it looks like you store nodes/ways/relations per tile, which leads to duplication? Is this the reason?
from tilemaker.
No reason not to use mmap or perhaps stxxl as an option, it's just not something that I've implemented. But I'd very happily receive a pull request as long as there wasn't a significant slowdown in those cases where you do have enough RAM to keep everything memory-resident.
I suspect the biggest win for memory usage would be to have a layer of indirection for attributes (i.e. output keys/values), rather than repeating them in every OutputObject. In other words, OutputObject has a single reference to an entry in a dictionary, rather than the current attributes map. That's currently marked as a todo at the top of output_object.cpp, and again, PRs very much welcome. :)
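A rough sketch of what that layer of indirection could look like is below. All names here (AttributeDictionary, SlimOutputObject, intern) are made up for illustration and are not taken from output_object.cpp; the point is only that identical key/value sets are stored once and each object keeps a 4-byte index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical types for illustration; not tilemaker's actual classes.
using AttributeSet = std::map<std::string, std::string>;

class AttributeDictionary {
public:
    // Return the index of `attrs`, storing the set only if it is new.
    uint32_t intern(const AttributeSet& attrs) {
        auto it = index_.find(attrs);
        if (it != index_.end()) return it->second;
        uint32_t id = static_cast<uint32_t>(sets_.size());
        sets_.push_back(attrs);
        index_.emplace(attrs, id);
        return id;
    }

    const AttributeSet& get(uint32_t id) const {
        assert(id < sets_.size());
        return sets_[id];
    }

    std::size_t size() const { return sets_.size(); }

private:
    // For brevity each set is held twice (vector + lookup map); a real
    // implementation would want to avoid that duplication.
    std::vector<AttributeSet> sets_;
    std::map<AttributeSet, uint32_t> index_;
};

// Instead of owning a map per object, the object carries one small index.
struct SlimOutputObject {
    uint64_t objectId;
    uint32_t attributes;  // index into a shared AttributeDictionary
};
```

With this shape, a million residential roads that share the same output attributes would hold a million 4-byte indices plus one stored set, rather than a million copies of the 'highway' key and its value.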
from tilemaker.
@dieterdreist Sorry you are right, I haven't considered the compression, now it makes sense :).
@systemed I'd love to, but unfortunately my limited knowledge of C++ & tight schedule probably won't allow it. I was just wondering why mmap hasn't been discussed in this issue yet :), since I would assume that it would "solve" the issue. I mean there might be some optimization potential, but only to a certain degree.
"I suspect the biggest win for memory usage would be to have a layer of indirection for attributes"
Yes, that was my first thought as well :).
Sorry for not being very helpful ;).
from tilemaker.
I've started trying to use stxxl to hold the data on disk. https://github.com/TimSC/tilemaker/tree/stxxl I've only attempted the node storage so far. Inserting into the unordered map takes longer and longer as it grows making the performance really bad. I will take a look at the feasibility of improving stxxl. (Note that unordered map in stxxl is marked as experimental. https://stxxl.org/tags/1.4.1/tutorial_unordered_map.html )
UPDATE: I've switched to use an ordered map because it has higher performance in stxxl. Insert performance is good but data reads are quite slow. I've noticed another problem: stxxl does not support variable length values in a map. I think this would have been useful for the way store.
from tilemaker.
@TimSC do you perhaps have any news on your implementation? Is the current version stable, and are there any plans to merge your code with this repo?
from tilemaker.
@JoostvdB94 I've not looked at this in a while. The performance of stxxl does seem to be poor. I was thinking of switching to try sqlite, but that tends to have trouble after a few hundred million nodes. It seems to be quite hard to import the database into anything other than PostGIS. Perhaps we need to think of splitting the OSM extracts into tiles before we attempt to run tilemaker on the whole set?
from tilemaker.
If I were implementing node storage I'd use libosmium's structures to store positions, because they're compatible with both in-memory and on-disk.
fwiw, osm2pgsql uses libosmium's format for on-disk storage with flat nodes, which takes 64 bits * max node id on disk, and something custom for in memory which takes 1/0.85 * 64 bits * number of nodes for the planet, and 2 * 64 bits * number of nodes for small extracts.
Node positions are a special case, because they can be stored in a fixed size of 64 bits per node. This is used with flat nodes to avoid the need for any index on node id, and with the "dense" portion of osm2pgsql's cache to use a few bytes per block of 8k nodes.
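As an illustration of the "fixed 64 bits per node, no index" idea, here is a small in-memory sketch. It is not libosmium's or osm2pgsql's actual layout; the type names and the choice of two 32-bit fixed-point values (degrees × 1e7) are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>
#include <vector>

// One node position in exactly 64 bits: two fixed-point coordinates.
struct PackedLatLon {
    int32_t lat;  // degrees * 1e7
    int32_t lon;  // degrees * 1e7
};
static_assert(sizeof(PackedLatLon) == 8, "a node position must fit in 64 bits");

// Sentinel marking a slot that has never been written (assumption for this sketch).
constexpr int32_t kUnset = std::numeric_limits<int32_t>::min();

// The node id is used directly as the array offset, so no id -> position
// index is needed. The cost is one slot for every id up to the maximum.
class FlatNodeStore {
public:
    void set(uint64_t nodeId, double latDeg, double lonDeg) {
        if (nodeId >= slots_.size())
            slots_.resize(nodeId + 1, PackedLatLon{kUnset, kUnset});
        slots_[nodeId] = PackedLatLon{toFixed(latDeg), toFixed(lonDeg)};
    }

    bool get(uint64_t nodeId, double& latDeg, double& lonDeg) const {
        if (nodeId >= slots_.size() || slots_[nodeId].lat == kUnset) return false;
        latDeg = slots_[nodeId].lat / 1e7;
        lonDeg = slots_[nodeId].lon / 1e7;
        return true;
    }

    // Storage needed for ids 0..maxNodeId, i.e. roughly 64 bits * max node id.
    static uint64_t bytesFor(uint64_t maxNodeId) {
        return (maxNodeId + 1) * sizeof(PackedLatLon);
    }

private:
    static int32_t toFixed(double deg) {
        assert(deg >= -180.0 && deg <= 180.0);
        return static_cast<int32_t>(std::llround(deg * 1e7));
    }
    std::vector<PackedLatLon> slots_;
};
```

Because the size depends on the highest id rather than the number of nodes present, this layout is wasteful for a small extract whose ids are scattered across the whole id range, which is presumably why a different structure is used for small extracts.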
If I had to keep node tags in memory, I'd probably do it in a different structure. About 10% of nodes have tags, and their statistics are different.
Ways don't have this property. The geometry information of a way is up to 2000 node ids, but typically about 10. Fortunately, we don't have to store that, do we?
from tilemaker.
@msbarry I've tried this, using both Osmosis and Osmconvert to extract the bounding boxes. Tilemaker does not generate tiles that clip the bounding box of the pbf, and quite frankly, why would it? This results in no tiles at low zooms if the data extracts are smaller than the tile size at that zoom. It also means that there will be large areas on the edges of the bounding boxes with no tiles. I suppose you could calculate the exact borders of the tiles in WGS84 and feed that into osmosis or osmconvert to create fitting datasets, but I don't know how that will work when it comes to the non-square bounds of WGS84.
I think a better solution is to generate osm files with different kinds of features: one for highways, one for POI, one for landuse and so on. Generate tiles for each pbf file for the planet and then merge the layers using tippecanoe. I haven't tried this yet though, so there could be something obvious I'm missing.
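For anyone who does want to try the tile-aligned extracts, the borders of a tile in WGS84 can be computed with the standard slippy-map (Web Mercator) formulas. The sketch below uses hypothetical names and is not part of tilemaker, Osmosis or Osmconvert. Note that tile edges are lines of constant longitude and latitude, so each tile is an ordinary rectangle in lon/lat terms, just not a square one.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct BoundingBox {
    double minLon, minLat, maxLon, maxLat;
};

constexpr double kPi = 3.14159265358979323846;

// Longitude of the western edge of tile column x at the given zoom.
double tileXToLon(uint32_t x, uint32_t zoom) {
    return x / std::ldexp(1.0, static_cast<int>(zoom)) * 360.0 - 180.0;
}

// Latitude of the northern edge of tile row y at the given zoom.
double tileYToLat(uint32_t y, uint32_t zoom) {
    double n = kPi * (1.0 - 2.0 * y / std::ldexp(1.0, static_cast<int>(zoom)));
    return std::atan(std::sinh(n)) * 180.0 / kPi;
}

// WGS84 bounding box of tile z/x/y (y counted from the north, as in XYZ tiles).
BoundingBox tileBounds(uint32_t x, uint32_t y, uint32_t zoom) {
    assert(zoom <= 30 && x < (1u << zoom) && y < (1u << zoom));
    return BoundingBox{
        tileXToLon(x, zoom),      // west
        tileYToLat(y + 1, zoom),  // south edge is the north edge of the next row
        tileXToLon(x + 1, zoom),  // east
        tileYToLat(y, zoom)       // north
    };
}
```

The resulting box could then be passed to whatever extract tool is used. Features crossing a tile edge would still need some buffer around the box, which this sketch does not handle.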
from tilemaker.
A note for people coming here (and to memory limit problem)...
Memory usage is determined not only by extract size but also, heavily, by the number of layers produced.
A workaround is to process the layers as separate steps (roads, landuse, buildings etc.) and specify the results as different sources in the style. Here is the memory-use report from /usr/bin/time -v for one-step vs multi-step processing of a 1.1G Poland extract:
Maximum resident set size (kbytes):
  all layers at once   2 320 784
  roads                  976 288
  aeroway                619 704
  buildings            1 384 644
  labels                 794 188
  landuse              1 087 832
  water                  663 884
from tilemaker.
Another PR related to scaling up: #128, for reading input from tiled pbf files.
from tilemaker.
As a current benchmark: producing an OpenMapTiles-schema mbtiles from great-britain-latest.osm.pbf (around 1GB) requires 22GB memory and takes around 2 hours on my slowish machine. I'm sure it's possible to reduce the memory footprint.
from tilemaker.
As another data point, I'm currently converting germany-latest.osm.pbf (3.1 GB), using latest Tilemaker from Git. I haven't used any special compilation flags (in particular I haven't tried -DCOMPACT_NODES yet - does it save a noticeable amount of RAM?).
Tilemaker is currently writing the first tiles:
Stored 316922858 nodes, 48961414 ways, 175882 relations
Zoom level 14, writing tile 32700 of 171989
and is using 52 GB RAM.
from tilemaker.
Osmium uses a vector-based map to store the node ids using an mmap file. This is basically the same as the boost flat_map. It is a sorted array, which has poor insertion and lookup performance when it becomes this large.
Tilemaker used unordered_map before and now uses the tsl hash map. They have really good performance, but they are allocated in memory. Using Boost.Interprocess it is possible to create an allocator for STL data types that allows storing them on disk. This way, the OS will swap pages in and out as they are needed and as memory is available. This will greatly reduce the maximum amount of memory that is required for the import process.
I tried this a bit: you can store a vector and a boost::unordered_map in an mmap file. When a bad_alloc occurs, you can resize the file to allow for more storage. Have a look at the sample here:
MMAP Boost::unordered_map
It would be possible to store the whole osm_store in an mmap file using segments from boost::interprocess
https://github.com/systemed/tilemaker/blob/master/include/osm_store.h
This way, memory usage would be reduced greatly. But this will probably have some impact on the performance.
Storing the nodestore in mmap file would be quite straightforward. We can test what kind of effect this has on the import performance.
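To give a feel for how little code the file-backed part needs, here is a minimal sketch of a fixed-size array of 64-bit values living in a memory-mapped file. Note that it deliberately uses plain POSIX mmap rather than boost::interprocess, to keep the example dependency-free; the class name and layout are made up for illustration and it only runs on POSIX systems (Linux, macOS).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical file-backed array: the OS pages data in and out on demand,
// so resident memory can stay well below the size of the file.
class MappedNodeArray {
public:
    MappedNodeArray(const std::string& path, std::size_t count)
        : count_(count), bytes_(count * sizeof(uint64_t)) {
        assert(count > 0);
        fd_ = ::open(path.c_str(), O_RDWR | O_CREAT, 0600);
        if (fd_ < 0) throw std::runtime_error("open failed");
        // Size the file; newly added space reads back as zeros.
        if (::ftruncate(fd_, static_cast<off_t>(bytes_)) != 0) {
            ::close(fd_);
            throw std::runtime_error("ftruncate failed");
        }
        void* p = ::mmap(nullptr, bytes_, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd_, 0);
        if (p == MAP_FAILED) {
            ::close(fd_);
            throw std::runtime_error("mmap failed");
        }
        data_ = static_cast<uint64_t*>(p);
    }

    ~MappedNodeArray() {
        ::munmap(data_, bytes_);
        ::close(fd_);
    }

    MappedNodeArray(const MappedNodeArray&) = delete;
    MappedNodeArray& operator=(const MappedNodeArray&) = delete;

    uint64_t& operator[](std::size_t i) {
        assert(i < count_);
        return data_[i];
    }

    std::size_t size() const { return count_; }

private:
    std::size_t count_;
    std::size_t bytes_;
    int fd_ = -1;
    uint64_t* data_ = nullptr;
};
```

Growing the store would mean unmapping, extending the file with ftruncate and mapping it again, which is the same "resize the file when you run out" step described above. Hash maps with pointers inside are harder, and that is the part an allocator-based approach such as Boost.Interprocess is meant to solve.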
from tilemaker.
Please have a look at this PR #195
from tilemaker.