I've been testing tilemaker with larger extracts to see how RAM usage is with larger e

Scaling up about tilemaker HOT 20 CLOSED

systemed commented on May 12, 2024

Scaling up

from tilemaker.

Comments (20)

systemed commented on May 12, 2024

My C++ isn't great either tbh!

I suspect stxxl would be easiest as it's pretty much a drop-in replacement and it's easily configurable for your own disk/memory needs.

from tilemaker.

msbarry commented on May 12, 2024

Has anyone considered accomplishing this by splitting up a larger pbf into smaller pbfs at tile boundaries then running tilemaker sequentially over them?

from tilemaker.

systemed commented on May 12, 2024

Doing a linear fit gives a RAM requirement for the current planet of 290GB. Sizing for 3 years growth I'd add 40% to the current planet size and a more complex conversion would require more RAM, which I'd put at a 50% growth, for a total estimated RAM need of 600 GB in 3 years.

7.5 years on and tilemaker 3.0 processes the planet in around 24GB, so I think we can close this 🎉

from tilemaker.

systemed commented on May 12, 2024

This is really informative: thank you.

I'd be interested to hear what ideas people have for reducing memory consumption. Two I've thought of:

Use an attribute dictionary across all OutputObjects, rather than one map per OutputObject as at present (which results in thousands of 'highway' keys)
Delta-encode node ids in ways (zigzag varint)
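The second idea can be sketched as follows. This is a minimal illustration, not tilemaker's code: the function names (`encodeNodeIds`, `decodeNodeIds`, etc.) are hypothetical, node ids are assumed to be signed 64-bit, and the decoder assumes well-formed input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Zigzag maps signed deltas to unsigned so small negative values stay small.
inline uint64_t zigzagEncode(int64_t v) {
    return (static_cast<uint64_t>(v) << 1) ^ static_cast<uint64_t>(v >> 63);
}
inline int64_t zigzagDecode(uint64_t u) {
    return static_cast<int64_t>(u >> 1) ^ -static_cast<int64_t>(u & 1);
}

// Append v as a base-128 varint: 7 payload bits per byte, high bit means "more follows".
inline void putVarint(std::vector<uint8_t>& out, uint64_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<uint8_t>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<uint8_t>(v));
}

// Encode a way's node ids as zigzag varints of successive differences.
std::vector<uint8_t> encodeNodeIds(const std::vector<int64_t>& ids) {
    std::vector<uint8_t> out;
    int64_t prev = 0;
    for (int64_t id : ids) {
        putVarint(out, zigzagEncode(id - prev));
        prev = id;
    }
    return out;
}

// Inverse of encodeNodeIds (assumes the buffer was produced by it).
std::vector<int64_t> decodeNodeIds(const std::vector<uint8_t>& in) {
    std::vector<int64_t> ids;
    int64_t prev = 0;
    std::size_t i = 0;
    while (i < in.size()) {
        uint64_t u = 0;
        int shift = 0;
        while (true) {
            assert(i < in.size());  // a truncated buffer would trip this
            uint8_t b = in[i++];
            u |= static_cast<uint64_t>(b & 0x7F) << shift;
            if (!(b & 0x80)) break;
            shift += 7;
        }
        prev += zigzagDecode(u);
        ids.push_back(prev);
    }
    return ids;
}
```

The saving depends on how close together consecutive ids in a way are: a delta between -64 and 63 fits in one byte, while ids that jump around the id space gain much less.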

from tilemaker.

pnorman commented on May 12, 2024

On average a node is a member of 1.1 ways. Assuming that right now tilemaker is using 4 bytes/way node, this puts way node information at using 4-7% of RAM. I wouldn't pursue this right now, but https://github.com/ademakov/Oroch#comparison and https://github.com/powturbo/TurboPFor#libraries-benchmarked are some sources I found.

from tilemaker.

boldtrn commented on May 12, 2024

I just found this repo and I think your approach looks interesting. I haven't looked at the code (sorry), but is there a reason to not use mmap or a similar technology that might keep parts of the required data on the disk? Obviously this would slow down the whole process (not sure how long the import for Germany or Europe takes right now?).

BTW: Maybe a stupid question as well, but why do we need that much RAM anyway? The planet OSM xml is roughly 60GB. I would assume that storing everything in RAM with some overhead might be possible with ~100GB. I took a sneak peek at the code, so you store nodes/ways/relations per tile, which leads to duplication? Is this the reason?

from tilemaker.

dieterdreist commented on May 12, 2024

2017-06-16 16:34 GMT+02:00 Robin <notifications@github.com>:

Maybe a stupid question as well, but why do we need that much RAM anyway? The planet OSM xml is roughly 60GB. I would assume that storing everything in RAM with some overhead might be possible with ~100GB.

the planet is 60GB in bzip2 compressed xml, if you extract it you will get a much bigger file...

from tilemaker.

systemed commented on May 12, 2024

No reason not to use mmap or perhaps stxxl as an option, it's just not something that I've implemented. But I'd very happily receive a pull request as long as there wasn't a significant slowdown in those cases where you do have enough RAM to keep everything memory-resident.

I suspect the biggest win for memory usage would be to have a layer of indirection for attributes (i.e. output keys/values), rather than repeating them in every OutputObject. In other words, OutputObject has a single reference to an entry in a dictionary, rather than the current attributes map. That's currently marked as a todo at the top of output_object.cpp, and again, PRs very much welcome. :)
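A minimal sketch of that indirection, for illustration only. All names here (`AttributeDictionary`, `SlimOutputObject`, `intern`) are hypothetical and the real OutputObject in tilemaker has a different layout; attributes are assumed to be plain string key/value pairs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One distinct combination of output keys/values.
using AttributeSet = std::map<std::string, std::string>;

// Stores each distinct AttributeSet once and hands out a small integer handle.
class AttributeDictionary {
public:
    // Return the handle for attrs, adding the set the first time it is seen.
    uint32_t intern(const AttributeSet& attrs) {
        auto it = index_.find(attrs);
        if (it != index_.end()) return it->second;
        uint32_t id = static_cast<uint32_t>(sets_.size());
        sets_.push_back(attrs);
        index_.emplace(attrs, id);
        return id;
    }
    const AttributeSet& get(uint32_t id) const {
        assert(id < sets_.size());
        return sets_[id];
    }
    std::size_t size() const { return sets_.size(); }

private:
    std::vector<AttributeSet> sets_;          // handle -> attribute set
    std::map<AttributeSet, uint32_t> index_;  // attribute set -> handle (second copy, kept for simplicity)
};

// Hypothetical slimmed-down object: 4 bytes of handle instead of a whole map.
struct SlimOutputObject {
    uint64_t osmId;
    uint32_t attributes;  // handle into an AttributeDictionary
};
```

With this shape, thousands of objects tagged only `highway=residential` would share one dictionary entry; the gain shrinks for layers where most objects carry unique values such as names.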

from tilemaker.

boldtrn commented on May 12, 2024

@dieterdreist Sorry you are right, I haven't considered the compression, now it makes sense :).

@systemed I'd love to, but unfortunately my limited knowledge of C++ & tight schedule probably won't allow it. I was just wondering why mmap hasn't been discussed in this issue yet :), since I would assume that it would "solve" the issue. I mean there might be some optimization potential, but only to a certain degree.

I suspect the biggest win for memory usage would be to have a layer of indirection for attributes

Yes, that was my first thought as well :).

Sorry for not being very helpful ;).

from tilemaker.

TimSC commented on May 12, 2024

I've started trying to use stxxl to hold the data on disk. https://github.com/TimSC/tilemaker/tree/stxxl I've only attempted the node storage so far. Inserting into the unordered map takes longer and longer as it grows making the performance really bad. I will take a look at the feasibility of improving stxxl. (Note that unordered map in stxxl is marked as experimental. https://stxxl.org/tags/1.4.1/tutorial_unordered_map.html )

UPDATE: I've switched to use an ordered map because it has higher performance in stxxl. Insert performance is good but data reads are quite slow. I've noticed another problem: stxxl does not support variable length values in a map. I think this would have been useful for the way store.

from tilemaker.

JoostvdB94 commented on May 12, 2024

@TimSC do you perhaps have any news on your implementation? Is the current version stable, and are there any plans to merge your code with this repo?

from tilemaker.

TimSC commented on May 12, 2024

@JoostvdB94 I've not looked at this in a while. The performance of stxxl does seem to be poor. I was thinking of switching to try sqlite, but that tends to have trouble after a few hundred million nodes. It seems to be quite hard to import the database to anything other than PostGIS. Perhaps we need to think of splitting the OSM extracts into tiles before we attempt to run tilemaker on the whole set?

from tilemaker.

pnorman commented on May 12, 2024

If I were implementing node storage I'd use libosmium's structures to store positions, because they're compatible with both in-memory and on-disk.

fwiw, osm2pgsql uses libosmium's format for on-disk storage with flat nodes, which takes 64 bits * max node id on disk, and something custom for in memory which takes 1/0.85 * 64 bits * number of nodes for the planet, and 2 * 64 bits * number of nodes for small extracts.

Node positions are a special case, because they can be stored in a fixed size of 64 bits per node. This is used with flat nodes to avoid the need for any index on node id, and with the "dense" portion of osm2pgsql's cache to use a few bytes per block of 8k nodes.
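To illustrate the fixed-size idea, here is a rough sketch of a flat store that needs no index because the node id is the array offset. It is not libosmium's or osm2pgsql's actual layout: the names (`FlatNodeStore`, `packLatLon`), the 1e-7 degree fixed-point encoding and the empty-slot sentinel are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Pack a position into 64 bits: two signed 32-bit values in units of 1e-7 degrees.
inline uint64_t packLatLon(double lat, double lon) {
    assert(lat >= -90.0 && lat <= 90.0 && lon >= -180.0 && lon <= 180.0);
    int32_t ilat = static_cast<int32_t>(std::llround(lat * 1e7));
    int32_t ilon = static_cast<int32_t>(std::llround(lon * 1e7));
    return (static_cast<uint64_t>(static_cast<uint32_t>(ilat)) << 32) |
           static_cast<uint32_t>(ilon);
}
inline double unpackLat(uint64_t p) {
    return static_cast<int32_t>(static_cast<uint32_t>(p >> 32)) / 1e7;
}
inline double unpackLon(uint64_t p) {
    return static_cast<int32_t>(static_cast<uint32_t>(p & 0xFFFFFFFFu)) / 1e7;
}

// One 64-bit slot per possible node id, so lookup is a single array access.
class FlatNodeStore {
public:
    explicit FlatNodeStore(uint64_t maxNodeId) : slots_(maxNodeId + 1, kEmpty) {}

    void set(uint64_t id, double lat, double lon) {
        assert(id < slots_.size());
        slots_[id] = packLatLon(lat, lon);
    }
    // Returns false if no position was stored for this id.
    bool get(uint64_t id, double& lat, double& lon) const {
        if (id >= slots_.size() || slots_[id] == kEmpty) return false;
        lat = unpackLat(slots_[id]);
        lon = unpackLon(slots_[id]);
        return true;
    }
    std::size_t bytes() const { return slots_.size() * sizeof(uint64_t); }

private:
    // Latitude field = INT32_MIN, which no valid latitude can produce.
    static constexpr uint64_t kEmpty = 0x8000000000000000ULL;
    std::vector<uint64_t> slots_;
};
```

The cost is 8 bytes times the highest node id regardless of how many nodes are present, which is why this shape suits the planet better than small extracts.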

If I had to keep node tags in memory, I'd probably do it in a different structure. About 10% of nodes have tags, and their statistics are different.

Ways don't have this property. The geometry information of a way is up to 2000 node ids, but typically about 10. Fortunately, we don't have to store that, do we?

from tilemaker.

AdamOstgaard commented on May 12, 2024

@msbarry I've tried this, using both Osmosis and Osmconvert to extract the bounding boxes. Tilemaker does not generate tiles that clip the bounding box of the pbf, and quite frankly, why would it? This results in no tiles on low zooms if the data extracts are smaller than the tile size on that zoom. It also means that there will be large areas on the edge of the bounding boxes with no tiles. I suppose you could calculate the exact borders of the tiles in WGS84 and feed that into osmosis or osmconvert to create fitting datasets, but I don't know how that will work when it comes to the non-square bounds of WGS84.
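For reference, the tile borders mentioned above can be computed with the usual slippy-map (z/x/y, Web Mercator) formulas. This is a sketch under that assumption; `BBox` and `tileBounds` are names invented for the example. Tile edges are lines of constant longitude and latitude, so the box is exact in degrees even though it is not square.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct BBox { double west, south, east, north; };

static const double kPi = 3.14159265358979323846;

// Longitude of the left edge of tile column x at zoom z.
inline double tileEdgeLon(uint32_t z, uint32_t x) {
    return x / std::ldexp(1.0, static_cast<int>(z)) * 360.0 - 180.0;
}

// Latitude of the top edge of tile row y at zoom z (inverse Web Mercator).
inline double tileEdgeLat(uint32_t z, uint32_t y) {
    double t = kPi * (1.0 - 2.0 * y / std::ldexp(1.0, static_cast<int>(z)));
    return std::atan(std::sinh(t)) * 180.0 / kPi;
}

// Bounding box of tile z/x/y in degrees; row y+1 gives the bottom edge.
inline BBox tileBounds(uint32_t z, uint32_t x, uint32_t y) {
    assert(z <= 30 && x < (1u << z) && y < (1u << z));
    return { tileEdgeLon(z, x), tileEdgeLat(z, y + 1),
             tileEdgeLon(z, x + 1), tileEdgeLat(z, y) };
}
```

The four values could then be passed as a bounding box to an extract tool, though features crossing a tile edge would still need some buffer or clipping to render cleanly.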

I think a better solution is to generate osm files with different kinds of features: one for highways, one for POI, one for landuse and so on. Generate tiles for each pbf file for the planet and then merge the layers using tippecanoe. I haven't tried this yet though, so there could be something obvious I'm missing.

from tilemaker.

michalfratczak commented on May 12, 2024

A note for people coming here (and to memory limit problem)...
Mem usage is determined not only by extract size but also, heavily, by the number of layers produced.
A workaround is to process in separate steps (roads, landuse, buildings etc.) and specify these as different sources in the style. Here is a mem-use report from /usr/bin/time -v for one-step vs multi-step processing of a 1.1G Poland extract:

Maximum resident set size (kbytes):
all layers at once: 2 320 784
roads: 976 288
aeroway: 619 704
buildings: 1 384 644
labels: 794 188
landuse: 1 087 832
water: 663 884

from tilemaker.

TimSC commented on May 12, 2024

Another scaling up related PR #128 for reading input from tiled pbf files

from tilemaker.

systemed commented on May 12, 2024

As a current benchmark: producing an OpenMapTiles-schema mbtiles from great-britain-latest.osm.pbf (around 1GB) requires 22GB memory and takes around 2 hours on my slowish machine. I'm sure it's possible to reduce the memory footprint.

from tilemaker.

oliver commented on May 12, 2024

As another data point, I'm currently converting germany-latest.osm.pbf (3.1 GB), using latest Tilemaker from Git. I haven't used any special compilation flags (in particular I haven't tried -DCOMPACT_NODES yet - does it save a noticeable amount of RAM?).

Tilemaker is currently writing the first tiles:

Stored 316922858 nodes, 48961414 ways, 175882 relations
Zoom level 14, writing tile 32700 of 171989

and is using 52 GB RAM.

from tilemaker.

kleunen commented on May 12, 2024

Osmium uses a vector-based map to store the node ids using an mmap file. This is basically the same as the boost flat_map. It is a sorted array, which has poor insertion and lookup performance when it becomes this large.

Tilemaker used the unordered_map before and now the tsl hash map. They have really good performance, but are allocated in memory. Using boost.interprocess it is possible to create an allocator for STL data types, to allow storing them on disk. This way, the OS will swap pages in and out when they are needed and when memory is available. This will greatly reduce the maximum amount of memory that is required for the import process.

I tried this a bit: you can store a vector and a boost::unordered_map in an mmap file. When a bad_alloc occurs, you can resize the file to allow for more storage. Have a look at the sample here:
MMAP Boost::unordered_map

It would be possible to have the whole osm_store to be stored in an mmap file using segments from boost::interprocess
https://github.com/systemed/tilemaker/blob/master/include/osm_store.h

This way, memory usage would be reduced greatly. But this will probably have some impact on the performance.

Storing the nodestore in mmap file would be quite straightforward. We can test what kind of effect this has on the import performance.
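As a rough illustration of the file-backed idea, here is a fixed-size array of 64-bit slots living in an mmap'd file. Note that it uses plain POSIX `mmap` rather than boost::interprocess, has no resizing or bad_alloc handling, and the class name `MappedArray` is made up for the example; it is not the code in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <stdexcept>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

// Fixed-size array of uint64_t backed by a file; the OS pages data in and out on demand.
class MappedArray {
public:
    MappedArray(const char* path, std::size_t count)
        : count_(count), bytes_(count * sizeof(uint64_t)) {
        assert(count > 0);  // mmap rejects zero-length mappings
        fd_ = ::open(path, O_RDWR | O_CREAT, 0600);
        if (fd_ < 0) throw std::runtime_error("open failed");
        // Grow (or shrink) the file to the requested size; new bytes read as zero.
        if (::ftruncate(fd_, static_cast<off_t>(bytes_)) != 0) {
            ::close(fd_);
            throw std::runtime_error("ftruncate failed");
        }
        void* p = ::mmap(nullptr, bytes_, PROT_READ | PROT_WRITE, MAP_SHARED, fd_, 0);
        if (p == MAP_FAILED) {
            ::close(fd_);
            throw std::runtime_error("mmap failed");
        }
        data_ = static_cast<uint64_t*>(p);
    }
    ~MappedArray() {
        ::munmap(data_, bytes_);
        ::close(fd_);
    }
    MappedArray(const MappedArray&) = delete;
    MappedArray& operator=(const MappedArray&) = delete;

    uint64_t& operator[](std::size_t i) {
        assert(i < count_);
        return data_[i];
    }
    std::size_t size() const { return count_; }

private:
    std::size_t count_;
    std::size_t bytes_;
    int fd_ = -1;
    uint64_t* data_ = nullptr;
};
```

Resident memory is then bounded by what the kernel chooses to keep cached rather than by the size of the store, at the price of page faults when access is random.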

from tilemaker.

kleunen commented on May 12, 2024

Please have a look at this PR #195

from tilemaker.
