Comments (15)
Did you check whether the MD5 matches (see planet-210524.osm.bz2.md5)?
from planet-dump-ng.
I figured that if the file was corrupt, it would be very unlikely for bzip2 to output anything other than garbage. But playing around with it now, it does seem as if a corrupt bz2 file can decompress into something that isn't completely noise.
Unhelpfully, it seems that bzcat doesn't stop output when it senses a CRC error, but just outputs a warning to stderr and exits with a non-zero code after processing the rest of the file. So if you're not checking stderr or the exit code, it would be easy to think it had succeeded.
I started testing the original file on the planet server, but it is taking a very, very long time. I'll update here when it's finished.
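To make the point about the exit code concrete, here is a minimal sketch of a wrapper that refuses to treat the output as valid unless the decompressor exits with status 0. The function name `run_and_verify` and the file names in the usage comment are made up for illustration; nothing here is part of planet-dump-ng.

```python
import subprocess


def run_and_verify(cmd, out_path):
    """Run a decompression command, writing its stdout to out_path.

    Returns normally only if the command exits with status 0. Otherwise
    raises RuntimeError carrying whatever the command wrote to stderr.
    Note that out_path may still contain partial or bad output on failure.
    """
    with open(out_path, "wb") as out:
        proc = subprocess.run(cmd, stdout=out, stderr=subprocess.PIPE)
    if proc.returncode != 0:
        message = proc.stderr.decode("utf-8", errors="replace").strip()
        raise RuntimeError(
            f"{cmd[0]} exited with status {proc.returncode}: {message}"
        )


# Hypothetical usage (file names are placeholders):
# run_and_verify(["bzcat", "planet-210524.osm.bz2"], "planet-210524.osm")
```

The output file is deliberately left in place on failure so it can be inspected, but the caller should not use it after an exception.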
Did you check whether the MD5 matches (see planet-210524.osm.bz2.md5)?
Yes, did match.
This is a bit weird - the planet file on the server looks completely fine. I grepped it for the way ID you mention, and the result is:
<way id="933805767" timestamp="2021-04-22T09:46:48Z" version="1" changeset="103400299" user="lipsigal" uid="438670">
<nd ref="8654953875"/>
<nd ref="8654953876"/>
<nd ref="8654953877"/>
...
with no chaer type= or skipping into the relations section.
So if the file on the server is OK, and the MD5sum matches, and it matches your downloaded file too, does that mean that whatever problem is occurring must be during or after decompression? How are you decompressing? Using bzcat on the fly, or bunzip2, or something else?
I used 7-Zip File Manager version 19 on Windows 10 x64.
I will try another decompressor. Thanks for investigating so far.
This time I tried to decompress with another tool (https://github.com/philr/bzip2-windows/releases) but got the same result.
Any more guesses?
Looks to me like you (@gartenkralle) might have a hardware problem, such as faulty memory. I suggest running a memory tester.
I think it's unlikely that a hardware fault would affect the decompression in exactly the same way with two different programs (with different memory layouts, etc...).
@gartenkralle are you decompressing the whole file? (In other words, do you have a file called planet-210524.osm which is not compressed?) Please could you tell me how big it is, and what the MD5sum of the decompressed file is?
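For anyone wanting to compare these two numbers without writing the decompressed file to disk, the following is a minimal sketch using Python's standard `bz2` and `hashlib` modules. The function name `decompressed_size_and_md5` is made up for illustration, and on a file of this size a pure-Python run would likely be slow.

```python
import bz2
import hashlib


def decompressed_size_and_md5(path, chunk_size=1 << 20):
    """Return (size_in_bytes, md5_hex) of the data inside a .bz2 file.

    The data is read in chunks, so memory use stays bounded regardless
    of how large the decompressed content is.
    """
    md5 = hashlib.md5()
    size = 0
    with bz2.open(path, "rb") as stream:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            md5.update(chunk)
            size += len(chunk)
    return size, md5.hexdigest()
```

If the compressed data is damaged, `bz2` raises an exception rather than returning a digest, so a returned result means the stream decoded cleanly.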
I ran a 2-cycle memory check. No faulty memory was found.
Yes, I decompressed the whole file. I am decompressing it again and will then run MD5sum on it. I will report the results in a few days...
Size: 1.542.302.591.588 Bytes
MD5 now running...
MD5 checksum: dfdff2778d0dfad6569ecc2b3613fbb4
Here's what I got, for the same input file (our MD5s match for the .osm.bz2) - I guess the computer I was using was much slower!
MD5: 2cf5fcca63685b13440902f0f1fa24e6
Size: 1,542,302,591,588
We get the same size, but different MD5s. I think something might be going wrong because it's a 1.4TiB file, and that might be pushing the limits of what the decompression software has been tested with (perhaps some subtle bugs when the file length / offset exceeds 40 bits?)
It might be worth trying some other software. I'm using bzip2, a block-sorting file compressor. Version 1.0.8, 13-Jul-2019 on Linux, so it might be worth trying to replicate that (either a virtual machine, or Windows Subsystem for Linux).
Alternatively, is it possible to do what you wanted without decompressing the whole file? If whatever is parsing the OSM file is capable of streaming (e.g: SAX or event parser) then you could bzcat planet.osm.bz2 | whatever and not need to uncompress the whole thing.
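As a rough illustration of the streaming approach, here is a minimal sketch that counts top-level OSM elements using the event-based `iterparse` from Python's standard library. The function names are made up for illustration, and this is only one way to do it, not a recommendation over the tools mentioned below.

```python
import bz2
import xml.etree.ElementTree as ET
from collections import Counter


def count_osm_elements(stream):
    """Count <node>, <way> and <relation> elements from a binary XML stream."""
    counts = Counter()
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # first event is the start of the <osm> root
    for event, elem in context:
        if event == "end" and elem.tag in ("node", "way", "relation"):
            counts[elem.tag] += 1
            root.clear()  # discard finished children so memory stays bounded
    return counts


def count_in_bz2(path):
    """Same as above, decompressing a .osm.bz2 file on the fly."""
    with bz2.open(path, "rb") as stream:
        return count_osm_elements(stream)
```

When the data is piped in from bzcat instead, `count_osm_elements(sys.stdin.buffer)` (after `import sys`) would read it from standard input.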
Finally, if all those things won't work, then it might be worth rewriting your parser to use the PBF binary file. The data inside is exactly the same, but the PBF is about half the size of the XML and 10 or more times quicker to parse. @joto's excellent https://github.com/osmcode/libosmium is a well-tested and fast library for parsing PBFs, and there's a suite of utilities (https://github.com/osmcode/osmium-tool) for common tasks such as making geographic extracts and filtering by tags. (I think it builds on Windows, but I don't know enough about Windows to say for sure.)
Thanks for all your tips. Even with bzip2 under Cygwin I got the wrong MD5 checksum. Maybe it is a very low-level bug or a file system bug. I will now try decompressing on Linux and transferring the file to Windows. Otherwise I will go with the PBF.
@gartenkralle : do you have any updates on this? Can this issue be closed now?
Yes, issue can be closed.
The tool which calculated the checksum after decompression was wrong. I also made a mistake in my parsing method. The XML file contains relations which have no members, and I had not considered that case. Additionally, I had not considered that UTF-8 characters have variable size. After fixing this, it worked fine.
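Both pitfalls are easy to reproduce in a few lines. The sketch below is not the parser from this thread; the XML fragments, IDs and function names are invented purely to show the two cases.

```python
import xml.etree.ElementTree as ET


def relation_members(xml_fragment):
    """Return (type, ref) pairs for a <relation>; may legitimately be empty."""
    relation = ET.fromstring(xml_fragment)
    return [(m.get("type"), m.get("ref")) for m in relation.findall("member")]


def char_and_byte_length(text):
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


# A relation with no members is a single self-closing element:
empty = relation_members('<relation id="1"/>')

# A relation with one member, for comparison:
one = relation_members(
    '<relation id="2"><member type="way" ref="5" role="outer"/></relation>'
)

# Character counts and byte counts differ as soon as non-ASCII text appears:
ascii_lengths = char_and_byte_length("Berlin")
umlaut_lengths = char_and_byte_length("Köln")
```

A hand-written parser that tracks byte offsets needs to count encoded bytes, not characters, and must not assume every relation has a closing tag with children before it.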