Comments (10)
java -jar gazetteer-1.4.jar
will run with the default settings. The default memory limit depends on the JDK version, but it's roughly one or two gigabytes of RAM.
So the first step is to specify the amount of memory:
java -Xmx4g -jar gazetteer-1.4.jar
Next step: how many execution threads do you have? Each one takes about 0.5-1 GB of RAM. (That's an estimated average; some of them could take more.)
So if you have 8 or 16 threads, restrict join to a smaller number of threads:
java -Xmx4g -jar gazetteer-1.4.jar --threads 2 join --handlers out-gazetteer $outFile
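The rule of thumb above can be turned into a quick estimate. This is only a sketch in Python: the function name, the default of 8 CPU threads, and the choice of the pessimistic 1 GB per thread are assumptions for illustration, not something gazetteer provides.

```python
def max_join_threads(heap_gb: float, per_thread_gb: float = 1.0, cpu_threads: int = 8) -> int:
    """Estimate how many join threads fit into the JVM heap.

    per_thread_gb=1.0 is the upper end of the 0.5-1 GB estimate quoted above.
    Some threads may still need more, so treat the result as an upper bound.
    """
    if heap_gb <= 0 or per_thread_gb <= 0:
        raise ValueError("heap_gb and per_thread_gb must be positive")
    by_memory = int(heap_gb // per_thread_gb)
    # Always run at least one thread, and never more than the machine has.
    return max(1, min(by_memory, cpu_threads))
```

With -Xmx4g this gives at most 4 threads; the command above uses --threads 2, which leaves some headroom.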
from gazetteer.
Good morning,
We have a VM with 15 GB of RAM and we ran the join as:
java -Xmx10g -jar gazetteer-1.4.jar --threads 1 join --handlers out-gazetteer $outFile
It runs for hours and eventually gets stuck. It does not throw any exception, it just stops. We also tracked the memory, and it rises to 11 GB. Eventually I had to stop the process. Do you have any idea what is happening? Maybe you could also check with this file?
http://download.geofabrik.de/europe/netherlands-latest.osm.bz2
Best regards,
from gazetteer.
Ok, I'll check it.
Not enough minerals.
from gazetteer.
Hi again!
So it finally did it: we had to increase the memory to 15 GB, use a single thread, and wait around 14 hours.
It seems there may be a memory leak somewhere: memory usage keeps increasing even though your code goes split by split, which is suspicious. To help you, I have attached my logs.
Let me know if you find this useful.
FYI:
The generated netherlands_2015-11-24-22:38:45.json.bz is 186 GB.
These were the commands:
bzcat /opt/data/regions/netherlands-latest.osm.bz2 | java -jar gazetteer-1.4.jar --log-level DEBUG --log-file netherlands-split-2015-11-24-22:38:45.log --data-dir netherlands split - none
java -jar gazetteer-1.4.jar --log-level DEBUG --log-file netherlands-slice-2015-11-24-22:50:06.log --data-dir netherlands slice --x10
java -Xmx15g -jar gazetteer-1.4.jar --log-level DEBUG --log-file netherlands-join-2015-11-24-23:12:40.log --data-dir netherlands --threads 1 join --handlers out-gazetteer netherlands_2015-11-24-22:38:45.json.bz
netherlands-join-2015-11-24-23:12:40.txt
netherlands-slice-2015-11-24-22:50:06.txt
netherlands-split-2015-11-24-22:38:45.txt
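For anyone who wants to reproduce the same three steps with other paths, here is a small sketch in Python that only builds the command lines. It reuses the subcommands and flags exactly as they appear in the commands above (logging flags left out). The function name and its parameters are made up for illustration.

```python
import shlex

def gazetteer_pipeline(jar, data_dir, osm_bz2, out_file, heap="15g", threads=1):
    """Return the split, slice and join command lines as shell strings."""
    base = ["java", "-jar", jar, "--data-dir", data_dir]
    # split reads the decompressed OSM file from stdin ("-"), as in the thread.
    split = "bzcat {} | {}".format(
        shlex.quote(osm_bz2), shlex.join(base + ["split", "-", "none"])
    )
    slice_cmd = shlex.join(base + ["slice", "--x10"])
    # Only join gets an explicit heap size and thread count here.
    join = shlex.join(
        ["java", "-Xmx" + heap, "-jar", jar, "--data-dir", data_dir,
         "--threads", str(threads), "join", "--handlers", "out-gazetteer", out_file]
    )
    return [split, slice_cmd, join]
```

Printing the three strings and running them one after another in a shell should be equivalent to the manual steps.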
from gazetteer.
split consumes memory from start to end and frees it at the end.
join should work with small pieces of data, so could you give me the output of
ls -lh | grep stripe
The generated netherlands_2015-11-24-22:38:45.json.bz is 186 GB.
That's actually a design decision: every line contains all the data for an address, including the data for all of its address parts, so the file can be processed line by line without fetching related objects. All related parts of the address are embedded in the main feature. It takes a huge amount of space, but it was done on purpose.
You could override the out-gazetteer handler with a Groovy script to produce less verbose output, or use the out-csv handler, which produces much less verbose output. If that's what you need, I could write an example of such a handler.
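Because each line stands on its own, the big file can also be slimmed down after the fact without loading it into memory. A minimal sketch in Python, assuming the output is one JSON object per line; the helper names are made up, and the keys in the usage example are hypothetical, so check a real output line for the actual field names.

```python
import bz2
import gzip
import json

def open_lines(path):
    """Open a plain, .gz or .bz2/.bz text file for line-by-line reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    if path.endswith((".bz2", ".bz")):
        return bz2.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")

def slim_features(path, keep):
    """Yield one reduced dict per line, keeping only the keys listed in `keep`."""
    with open_lines(path) as lines:
        for line in lines:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            feature = json.loads(line)
            yield {key: feature[key] for key in keep if key in feature}
```

For example, `for f in slim_features("netherlands.json.gz", ["id", "name"]): print(json.dumps(f))` would write a much smaller file, with "id" and "name" standing in for whatever fields you actually need.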
from gazetteer.
The joins are small: from a few KB to a few MB.
stripe.txt
I think you have done a great job so far :) let me know if I can help you somehow.
from gazetteer.
Thank you, but there are still a lot of things to be done.
So, as I understand it, most of the time was taken by join?
from gazetteer.
Yes, the join is really the bottleneck. If you check the logs, the last steps take a really long time. Nothing was visibly happening in the foreground; the only thing I could see was that the PID was still running.
2015-11-25 06.59.42.555 [main] INFO JoinExecutor - Join stripes done in 7:47:01.702
2015-11-25 06.59.42.562 [main] INFO JoinBoundariesExecutor - Run join boundaries, with filter []
2015-11-25 06.59.48.480 [main] INFO JoinBoundariesExecutor - 2999 boundaries was sorted
2015-11-25 06.59.48.482 [main] INFO JoinBoundariesExecutor - Admin levels: [2, 3, 4, 6, 7, 8, 9, 10]
2015-11-25 07.00.05.797 [main] INFO JoinBoundariesExecutor - 0 boundaries skiped
2015-11-25 07.00.05.859 [main] INFO JoinBoundariesExecutor - Join boundaries done in 0:00:23.297
2015-11-25 07.00.05.859 [main] INFO JoinExecutor - Join boundaries done in 0:00:23.300
2015-11-25 12.30.30.31 [main] INFO GazetteerOutWriter - Wrote poi points: 277689
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote address points: 8701051
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote highway segments: 1139017
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote highway networks: 370605
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote place boundaries: 0
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote place points: 6502
2015-11-25 12.30.30.41 [main] INFO GazetteerOutWriter - Wrote admin boundaries: 2999
2015-11-25 12.30.30.55 [main] INFO JoinExecutor - All handlers done in 5:30:24.194
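To see at a glance where the time goes, the "done in" lines can be pulled out of a log like this one. A rough sketch in Python: the regular expression is written against the lines shown above only, and it assumes the last field of the duration is milliseconds. The function name is made up.

```python
import re
from datetime import timedelta

# Matches e.g. "... INFO JoinExecutor - Join stripes done in 7:47:01.702"
DONE_IN = re.compile(r"INFO (\w+) - (.+?) done in (\d+):(\d{2}):(\d{2})\.(\d{1,3})\s*$")

def stage_durations(log_lines):
    """Map (logger, stage) to the elapsed time reported by 'done in' lines."""
    result = {}
    for line in log_lines:
        match = DONE_IN.search(line)
        if match:
            logger, stage, hours, minutes, seconds, millis = match.groups()
            result[(logger, stage)] = timedelta(
                hours=int(hours), minutes=int(minutes),
                seconds=int(seconds), milliseconds=int(millis),
            )
    return result
```

On the excerpt above this reports 7:47:01.702 for "Join stripes" and 5:30:24.194 for "All handlers", so those two stages account for nearly all of the run.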
from gazetteer.
That's good news actually, or kind of good news :)
So 5 hours 30 minutes were spent sorting the results.
Two things actually happen there:
- sort features by hierarchy (referenced features come before the features that depend on them)
- merge highways into networks (so you find one highway instead of tons of small segments)
I've added some options to skip this part in the last commit. I'll test it out and let you know.
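For readers wondering what "referenced features first" means in practice, it is a dependency (topological) ordering. The sketch below uses Python's standard graphlib (Python 3.9+) to illustrate the idea; it is not how gazetteer implements it, and the feature ids in the usage note are invented.

```python
from graphlib import TopologicalSorter

def sort_by_hierarchy(references):
    """Order feature ids so that every referenced feature precedes its dependents.

    `references` maps a feature id to the ids of the features it refers to.
    Raises graphlib.CycleError if the references form a cycle.
    """
    return list(TopologicalSorter(references).static_order())
```

For example, with `{"addr": ["street", "city"], "street": ["city"], "city": []}` the city comes out before the street, and the street before the address.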
from gazetteer.
Please try 1.5: https://github.com/kiselev-dv/gazetteer/releases/tag/Gazetteer-1.5
If you didn't delete the --data-dir netherlands folder, just run it again with:
java -Xmx10g -jar gazetteer-1.5.jar --log-file netherlands-join.log --data-dir netherlands --threads 1 join --handlers out-gazetteer out=netherlands.json.gz sort=NONE
I successfully converted the Netherlands in 4 hours with 6 GB of RAM and two threads.
from gazetteer.