

kiselev-dv commented on June 19, 2024

Running java -jar gazetteer-1.4.jar uses the default JVM settings. The default heap size depends on the JDK version, but it's roughly one or two gigabytes of RAM.

So the first step is to specify the amount of memory:

java -Xmx4g -jar gazetteer-1.4.jar 

The next step: how many execution threads do you have? Each one takes about 0.5-1g of RAM. (That's an estimated average; some of them could take more.)

So even if you have 8 or 16 threads, restrict join by setting the number of threads:

 java -Xmx4g -jar gazetteer-1.4.jar --threads 2 join --handlers out-gazetteer $outFile
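
As a rough illustration of the arithmetic above, here is a minimal Python sketch that turns a thread count into a memory range using the 0.5-1g per thread figure from this comment. The function names are hypothetical and the result is only the per-thread part of the budget; as noted, some threads can take more, so treat it as a starting point rather than a guarantee.

```python
import math


def estimate_heap_gb(threads, per_thread_gb=(0.5, 1.0)):
    """Return a (low, high) memory estimate in GB for the given thread count.

    per_thread_gb is the rough average quoted in the thread (0.5-1g per thread).
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    low, high = per_thread_gb
    return threads * low, threads * high


def xmx_flag(threads):
    """Build an -Xmx flag from the upper estimate, rounded up to whole GB."""
    _, high = estimate_heap_gb(threads)
    return f"-Xmx{math.ceil(high)}g"


if __name__ == "__main__":
    for n in (1, 2, 8, 16):
        low, high = estimate_heap_gb(n)
        print(f"{n} threads: {low:g}-{high:g} GB, e.g. {xmx_flag(n)}")
```

Going the other way (pick the memory you have, then lower --threads until the upper estimate fits) is usually the safer direction on a small machine.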

from gazetteer.

ricadete commented on June 19, 2024

Good morning,

We have a VM with 15GB of RAM and we ran the join as:
java -Xmx10g -jar gazetteer-1.4.jar --threads 1 join --handlers out-gazetteer $outFile

It runs for hours and eventually gets stuck. It does not throw any exception; it just stops. We also tracked the memory, which rises up to 11GB. Eventually I had to stop the process. Do you have any idea what is happening? Maybe you could also check with this file:
http://download.geofabrik.de/europe/netherlands-latest.osm.bz2

Best regards,


kiselev-dv commented on June 19, 2024

Ok, I'll check it.
Not enough minerals.


ricadete commented on June 19, 2024

Hi again!

So it finally finished; we had to increase the memory to 15GB, use a single thread, and wait around 14 hours.
It seems that there may be a memory leak somewhere: the memory seems to be always increasing, even though your code works split by split, which is suspicious. To help you, I have attached my logs.
Let me know if you find this useful.

FYI:
The generated netherlands_2015-11-24-22:38:45.json.bz is 186GB.

These were the commands:
bzcat /opt/data/regions/netherlands-latest.osm.bz2 | java -jar gazetteer-1.4.jar --log-level DEBUG --log-file netherlands-split-2015-11-24-22:38:45.log --data-dir netherlands split - none

java -jar gazetteer-1.4.jar --log-level DEBUG --log-file netherlands-slice-2015-11-24-22:50:06.log --data-dir netherlands slice --x10

java -Xmx15g -jar gazetteer-1.4.jar --log-level DEBUG --log-file netherlands-join-2015-11-24-23:12:40.log --data-dir netherlands --threads 1 join --handlers out-gazetteer netherlands_2015-11-24-22:38:45.json.bz

netherlands-join-2015-11-24-23:12:40.txt
netherlands-slice-2015-11-24-22:50:06.txt
netherlands-split-2015-11-24-22:38:45.txt


kiselev-dv commented on June 19, 2024

split consumes memory from start to end and frees it at the end.

join should work with small pieces of data, so could you give me the output of

ls -lh | grep stripe

About the generated netherlands_2015-11-24-22:38:45.json.bz being 186GB:

That's actually by design: every line contains all the data for an address, including the data for all address parts, so it can be processed line by line without fetching related objects. All related parts of an address are imprinted into the main feature. It takes a huge amount of space, but it was done on purpose.
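
To illustrate what "processed line by line" buys you, here is a minimal Python sketch that streams such a file without loading it into memory. It assumes the output is gzip-compressed with one JSON object per line; the key name "type" is a hypothetical placeholder, not a confirmed field of the out-gazetteer format.

```python
import gzip
import json


def iter_features(path):
    """Yield one parsed JSON object per non-empty line of a gzip file."""
    with gzip.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_by_key(path, key="type"):
    """Count features by the value of `key` (a hypothetical field name)."""
    counts = {}
    for feature in iter_features(path):
        value = feature.get(key, "unknown")
        counts[value] = counts.get(value, 0) + 1
    return counts
```

Because each line is self-contained, memory use stays flat regardless of file size; the trade-off is the large output discussed above.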

You could override the out-gazetteer handler with a Groovy script to produce less verbose output, or use the out-csv handler, which produces much less verbose output. If that's what you need, I could write an example of such a handler.


ricadete commented on June 19, 2024

The joins are small: between a few KB and a few MB.
stripe.txt

I think you have done a great job so far :) let me know if I can help you somehow.


kiselev-dv commented on June 19, 2024

Thank you, but there are still a lot of things to be done.

So as I understand, most of the time was taken by join?


ricadete commented on June 19, 2024

Yes, the join is really the bottleneck. If you check the logs, the last steps take a really long time. Nothing was visibly happening in the foreground; all I could see was that the PID was still running.

2015-11-25 06.59.42.555 [main] INFO JoinExecutor - Join stripes done in 7:47:01.702
2015-11-25 06.59.42.562 [main] INFO JoinBoundariesExecutor - Run join boundaries, with filter []
2015-11-25 06.59.48.480 [main] INFO JoinBoundariesExecutor - 2999 boundaries was sorted
2015-11-25 06.59.48.482 [main] INFO JoinBoundariesExecutor - Admin levels: [2, 3, 4, 6, 7, 8, 9, 10]
2015-11-25 07.00.05.797 [main] INFO JoinBoundariesExecutor - 0 boundaries skiped
2015-11-25 07.00.05.859 [main] INFO JoinBoundariesExecutor - Join boundaries done in 0:00:23.297
2015-11-25 07.00.05.859 [main] INFO JoinExecutor - Join boundaries done in 0:00:23.300
2015-11-25 12.30.30.31 [main] INFO GazetteerOutWriter - Wrote poi points: 277689
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote address points: 8701051
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote highway segments: 1139017
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote highway networks: 370605
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote place boundaries: 0
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote place points: 6502
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote admin boundaries: 2999
2015-11-25 12.30.30.55 [main] INFO JoinExecutor - All handlers done in 5:30:24.194
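
To see at a glance which stage dominated, the "done in" lines in a log like the excerpt above can be pulled out and ranked. This is a minimal Python sketch; the regular expression assumes the "... INFO <logger> - <stage> done in H:MM:SS.mmm" shape seen in this excerpt, and the function name is hypothetical.

```python
import re
from datetime import timedelta

# Matches e.g. "INFO JoinExecutor - Join stripes done in 7:47:01.702"
DONE_IN = re.compile(r"INFO (\w+) - (.+?) done in (\d+):(\d{2}):(\d{2})\.(\d+)")


def stage_durations(log_lines):
    """Return (logger, stage, duration) tuples for every 'done in' line."""
    result = []
    for line in log_lines:
        match = DONE_IN.search(line)
        if match:
            logger, stage, hours, minutes, seconds, millis = match.groups()
            duration = timedelta(
                hours=int(hours),
                minutes=int(minutes),
                seconds=int(seconds),
                milliseconds=int(millis),  # assumption: fraction is milliseconds
            )
            result.append((logger, stage, duration))
    return result


if __name__ == "__main__":
    sample = [
        "2015-11-25 06.59.42.555 [main] INFO JoinExecutor - Join stripes done in 7:47:01.702",
        "2015-11-25 07.00.05.859 [main] INFO JoinExecutor - Join boundaries done in 0:00:23.300",
        "2015-11-25 12.30.30.55 [main] INFO JoinExecutor - All handlers done in 5:30:24.194",
    ]
    # Print the slowest stages first.
    for logger, stage, duration in sorted(stage_durations(sample), key=lambda r: r[2], reverse=True):
        print(f"{duration}  {logger}: {stage}")
```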


kiselev-dv commented on June 19, 2024

That's actually good news, or at least a kind of good news :)

https://github.com/kiselev-dv/gazetteer/blob/develop/Gazetteer/src/main/java/me/osm/gazetter/join/out_handlers/GazetteerOutWriter.java#L969

So 5 hours 30 minutes were taken by sorting out the results.
Two things actually happen there:

  1. sort features by hierarchy (referenced features before the features that depend on them)
  2. merge highways into networks (to get one highway instead of tons of small segments)
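
The first step is, in effect, a topological ordering. The following Python sketch shows that general idea only; it is not taken from GazetteerOutWriter, and the feature ids and the shape of the references mapping are hypothetical. It uses graphlib from the standard library (Python 3.9+).

```python
from graphlib import TopologicalSorter


def sort_by_hierarchy(references):
    """Order feature ids so that every referenced feature comes first.

    `references` maps a feature id to the ids it refers to
    (a hypothetical structure chosen for this illustration).
    """
    return list(TopologicalSorter(references).static_order())


if __name__ == "__main__":
    refs = {
        "address:1": ["street:A", "city:X"],
        "street:A": ["city:X"],
        "city:X": [],
    }
    print(sort_by_hierarchy(refs))
```

An ordering like this needs the whole reference graph at once, which is one plausible reason a sorting stage costs far more than writing features in whatever order they arrive.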

I've added some options to skip this part in the last commit. I'll test it out and let you know.


kiselev-dv commented on June 19, 2024

Please try 1.5: https://github.com/kiselev-dv/gazetteer/releases/tag/Gazetteer-1.5
If you didn't delete the --data-dir netherlands folder, just run the join again with:

java -Xmx10g -jar gazetteer-1.5.jar --log-file netherlands-join.log --data-dir netherlands --threads 1 join --handlers out-gazetteer out=netherlands.json.gz sort=NONE

I successfully converted the Netherlands within 4 hours, with 6g of RAM and two threads.
