Comments (10)
Hi Kris,
It should be possible to uniquely identify the file as EPUB by the signature, which is always the same, whether or not the contents of the container are compressed. Are you sure there's not something wrong with those EPUBs? If you open them in a hex editor, the start of the file should be something like this:
50 4B 03 04 14 00 00 00 00 00 55 A1 D5 42 6F 61 AB 2C 14 00 00 00
14 00 00 00 08 00 00 00 6D 69 6D 65 74 79 70 65 61 70 70 6C 69 63
61 74 69 6F 6E 2F 65 70 75 62 2B 7A 69 70
This corresponds to the signature used by PRONOM/FIDO. If you see something different, the file won't be identified as EPUB.
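If a hex editor is not at hand, the same check can be scripted. The following is only a minimal sketch: the names `EPUB_SIGNATURE`, `looks_like_epub` and `SAMPLE_HEADER` are made up for illustration and are not part of FIDO, and the pattern is my own byte-level rendering of the signature described above.

```python
import re

# ZIP local file header magic, 26 further header bytes, then the stored
# file name ("mimetype") immediately followed by its uncompressed content.
EPUB_SIGNATURE = re.compile(
    rb"\APK\x03\x04.{26}mimetype" + re.escape(b"application/epub+zip"),
    re.DOTALL,
)


def looks_like_epub(header: bytes) -> bool:
    """Return True if the given leading bytes match the EPUB signature."""
    return EPUB_SIGNATURE.match(header) is not None


# The hex dump shown above, used here as sample input.
SAMPLE_HEADER = bytes.fromhex(
    "50 4B 03 04 14 00 00 00 00 00 55 A1 D5 42 6F 61 AB 2C 14 00 00 00 "
    "14 00 00 00 08 00 00 00 6D 69 6D 65 74 79 70 65 61 70 70 6C 69 63 "
    "61 74 69 6F 6E 2F 65 70 75 62 2B 7A 69 70"
)

print(looks_like_epub(SAMPLE_HEADER))
```

For a real file, passing the first 58 bytes read in binary mode to `looks_like_epub` would be enough.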
Another easy check you can do yourself: rename one of the files that go wrong to a .zip extension, and then open it in a ZIP extractor. There should be a mimetype file in the archive root, with the following content when opened in a text editor:
application/epub+zip
If the file is missing, or if it contains something else then that could explain your results to some degree. In that case I would expect FIDO to give a hit for the ZIP container, rather than the XHTML inside it. So this might also indicate some problem with the way Fido handles containers ...
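The ZIP check can also be done with Python's standard `zipfile` module instead of a ZIP extractor. This is a rough sketch only; `check_epub_mimetype` is a hypothetical helper name, and the in-memory archive at the end stands in for one of your real files.

```python
import io
import zipfile

EXPECTED_MIMETYPE = b"application/epub+zip"


def check_epub_mimetype(path_or_file):
    """Return (ok, detail) describing the mimetype entry of a ZIP/EPUB."""
    with zipfile.ZipFile(path_or_file) as archive:
        if "mimetype" not in archive.namelist():
            return False, "no mimetype file in the archive root"
        content = archive.read("mimetype")
        if content != EXPECTED_MIMETYPE:
            return False, "unexpected mimetype content: %r" % content
        return True, "mimetype file is present and correct"


# Usage with an in-memory archive; a real check would pass a file path.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
print(check_epub_mimetype(buffer))
```

The comparison is deliberately exact, so a trailing newline in the mimetype file is reported as a mismatch.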
Just my 2 cents ...
Johan
from fido.
Hi Johan,
Indeed, the regex is '(?s)\A.{0,0}PK\x03\x04.{26}mimetypeapplication/epub+zip' and the epub file is well-formed (also checked with epub_validator) and recognised as epub by Fido, but also recognised as:
fmt/103,"Extensible Hypertext Markup Language","XHTML 1.1","application/xhtml+xml","container"
This happens on a few epubs and all of them happen to store the content uncompressed. I assume that fido is confused by the XHTML it sees in the file.
Unfortunately the epub files are commercial and I cannot share them. I created a dummy epub which shows the same issue. I have created a compressed and an uncompressed version. It is important to put the xhtml files early in the epub for the issue to pop up. Both epubs are exactly the same, the only difference is the compression.
compressed.epub: https://drive.google.com/file/d/0Bwh-YeRijm-GU0JYTGpQNXJLZkU/edit?usp=sharing
uncompressed.epub: https://drive.google.com/file/d/0Bwh-YeRijm-GTUU0VmlaOFRTMTA/edit?usp=sharing
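For anyone who wants to reproduce the effect without downloading the samples, here is a minimal sketch that builds two EPUB-like archives in memory. It is not how the files above were made, and the result is not a valid EPUB (no container.xml or package file); the names `build_epub_like` and `doctype_in_first_kb` and the XHTML content are invented for illustration. It only shows that the XHTML doctype ends up as plain bytes near the start of the file when the entry is stored rather than deflated.

```python
import io
import zipfile

DOCTYPE = b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1'

XHTML = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    '"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Dummy</title></head><body>"
    + "<p>Hello</p>" * 20
    + "</body></html>\n"
)


def build_epub_like(compression: int) -> bytes:
    """Build a ZIP with a stored mimetype entry followed by one XHTML file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry comes first and is never compressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        # The XHTML file is placed early, stored or deflated as requested.
        archive.writestr("chapter1.xhtml", XHTML, compress_type=compression)
    return buffer.getvalue()


def doctype_in_first_kb(data: bytes) -> bool:
    """Report whether the XHTML 1.1 doctype is visible in the first 1024 bytes."""
    return DOCTYPE in data[:1024]


stored = build_epub_like(zipfile.ZIP_STORED)
deflated = build_epub_like(zipfile.ZIP_DEFLATED)
print("stored:", doctype_in_first_kb(stored))
print("deflated:", doctype_in_first_kb(deflated))
```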
from fido.
Hi Kris,
Thanks for uploading the samples, this makes things a lot clearer. I did some tests with other ID tools:
- Version 5.09 of the Unix File tool identifies both EPUBs as application/epub+zip. Interestingly, an older version (I think 5.04, but not sure) of File on my home PC also mis-identified the uncompressed EPUB as HTML.
- Apache Tika 1.5 also correctly identifies both EPUBs as application/epub+zip, even after deliberately changing the file extensions.
If I look at the Fido signature for fmt/103, it seems that it is looking for an HTML-specific pattern within the first 1024 bytes of the file. Since the ZIP file is uncompressed, Fido will find this. I think the solution here will be to add the following to the epub entry:
<has_priority_over>fmt/103</has_priority_over>
Since this only covers XHTML 1.1, I suppose additional entries may be needed here for other formats that can live inside an EPUB container (see the EPUB specs for this). It should be possible to make this work without using the file extension info.
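To illustrate what such a priority entry is meant to achieve, here is a rough sketch of the idea. It is not Fido's actual code: `apply_priorities` and `PRIORITY_OVER` are hypothetical names, and I am assuming fmt/483 as the PUID of the epub entry.

```python
def apply_priorities(matches, priority_over):
    """Drop every matched format that another matched format has priority over."""
    suppressed = set()
    for puid in matches:
        # Only formats that actually matched can suppress other matches.
        suppressed.update(priority_over.get(puid, ()))
    return [puid for puid in matches if puid not in suppressed]


# Assumed relation: EPUB (fmt/483) has priority over XHTML 1.1 (fmt/103).
PRIORITY_OVER = {"fmt/483": {"fmt/103"}}

print(apply_priorities(["fmt/483", "fmt/103"], PRIORITY_OVER))
print(apply_priorities(["fmt/103"], PRIORITY_OVER))
```

A plain XHTML file is unaffected, because the relation only applies when the EPUB signature matched as well.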
I suppose this will affect DROID/PRONOM as well, since that is the upstream source of Fido's sigs.
I hope this helps.
Johan
from fido.
Update to above comment: DROID (6.1.3) identifies these files correctly, so this issue appears to be restricted to Fido.
from fido.
Seems to me that this is a PRONOM issue.
Consider fmt/103. The signature is seeking:
BOF (maximum offset 1024):
0x3C21444F43545950452068746D6C205055424C494320222D2F2F5733432F2F445444205848544D4C20312E31
(<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1)
VAR:
3C68746D6C20786D6C6E733D22687474703A2F2F7777772E77332E6F72672F313939392F7868746D6C22 then 3C7469746C653E then 3C2F7469746C653E
(<html xmlns="http://www.w3.org/1999/xhtml") then later (<title>) then later still (</title>)
In the uncompressed sample file provided, the BOF sequence begins at offset 709, so the file conforms to fmt/103.
Consider fmt/483. This signature is seeking:
504B0304{26}6D696D65747970656170706C69636174696F6E2F657075622B7A6970
(PK..), i.e. the zip stub, then 26 bytes later (mimetypeapplication/epub+zip)
This too is present in the sample file, so the file also conforms to fmt/483.
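The double match can be shown with a small sketch. The regular expressions below are my own simplified rendering of the two signatures as described here, not the expressions PRONOM, DROID or FIDO actually generate, and `synthetic_file` builds made-up test bytes rather than the sample file. The VAR sequence is modelled as a simple ordered search anywhere in the data.

```python
import re

# fmt/103: doctype within a maximum offset of 1024 from BOF ...
XHTML_BOF = re.compile(
    rb'\A.{0,1024}<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1\.1', re.DOTALL
)
# ... plus the variable-position sequence: <html xmlns=...>, <title>, </title>.
XHTML_VAR = re.compile(
    rb'<html xmlns="http://www\.w3\.org/1999/xhtml".*?<title>.*?</title>',
    re.DOTALL,
)
# fmt/483: zip stub, 26 bytes, then mimetypeapplication/epub+zip.
EPUB = re.compile(rb"\APK\x03\x04.{26}mimetypeapplication/epub\+zip", re.DOTALL)

XHTML_BODY = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN">\n'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b"<head><title>t</title></head></html>"
)


def matching_puids(data: bytes) -> list:
    """Return the PUIDs whose (simplified) signatures match the data."""
    puids = []
    if EPUB.match(data):
        puids.append("fmt/483")
    if XHTML_BOF.match(data) and XHTML_VAR.search(data):
        puids.append("fmt/103")
    return puids


def synthetic_file(doctype_offset: int) -> bytes:
    """Build made-up bytes: EPUB header, zero padding, then uncompressed XHTML."""
    header = b"PK\x03\x04" + b"\x00" * 26 + b"mimetypeapplication/epub+zip"
    padding = b"\x00" * (doctype_offset - len(header))
    return header + padding + XHTML_BODY


print(matching_puids(synthetic_file(709)))   # doctype inside the 1024 window
print(matching_puids(synthetic_file(2000)))  # doctype beyond the window
```

With the doctype at offset 709, as in the uncompressed sample, both signatures match; pushing it past offset 1024 leaves only the ePub match.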
DROID 6.1.3 is not giving a multiple identification, because a) the file conforms to the 'ZIP' format, so is passed to the container signature mechanism for identification, and b) the file conforms to something within the container signature, which in this case is ePub.
Run it through DROID 5, however (which has no notion of containers), and you get a multiple ID - fmt/103 and fmt/483.
I have no idea what FIDO is doing, but I understand it to work in a different way to DROID, albeit with the same data. As demonstrated above, the data itself is problematic.
The solution is simple: give ePub priority over the XHTML formats within PRONOM. Precedents here are formats like WARC, which also contain uncompressed HTML files, usually well within the 1024-byte offset PRONOM allows for the HTML format family.
I'll add this to PRONOM in June's release.
Many thanks to all for investigating and reporting.
David
from fido.
Hi David,
Apart from XHTML, you should probably also take into account other formats that can live in the EPUB container for the priority list. See the specs of EPUB 2 and EPUB 3 for details. Off the top of my head, this includes XML, HTML 5, SVG, and a bunch of other formats. This may be trickier than you would expect, since apart from a well-defined set of 'core media types' (formats that any EPUB reader must render), pretty much any other format is allowed as well.
from fido.
Thanks Johan,
Within PRONOM, XML and its subsets do not allow an offset from BOF (strictly speaking, we allow a max offset of 3 for XML to accommodate the optional UTF-8 BOM), so a file cannot conform to both XML (or SVG etc.) and ePub (remember that ePub needs PK.. first).
However, for the HTML formats we have allowed a max offset of 1024 to account for the slack around whitespace, comments etc. I'll give ePub priority over those, although would that be necessary for e.g. HTML 3.2? Or just XHTML and HTML5?
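A short sketch of why the two offsets behave differently. The patterns are assumptions for illustration only: the XML signature is modelled as `<?xml` at a max offset of 3, and the HTML family is represented by the XHTML 1.1 doctype at a max offset of 1024. `can_co_match` is a hypothetical helper, not anything in DROID or FIDO.

```python
import re

# Assumption: XML modelled as "<?xml" at offset 0..3 (room for a UTF-8 BOM).
XML = re.compile(rb"\A.{0,3}<\?xml", re.DOTALL)
# Assumption: HTML family represented by the XHTML 1.1 doctype, offset 0..1024.
XHTML = re.compile(
    rb'\A.{0,1024}<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1\.1', re.DOTALL
)
ZIP_MAGIC = b"PK\x03\x04"


def can_co_match(pattern, payload: bytes) -> bool:
    """Check whether a buffer starting with the zip stub also matches pattern."""
    return pattern.match(ZIP_MAGIC + payload) is not None


# Offsets 0..3 are occupied by PK\x03\x04, so "<?xml" can never start there.
print(can_co_match(XML, b'<?xml version="1.0"?>'))
# The 1024-byte window leaves plenty of room after the zip stub.
print(can_co_match(
    XHTML, b"\x00" * 100 + b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN">'
))
```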
from fido.
Hi David,
Full lists for EPUB 2/3 respectively are here:
http://www.idpf.org/epub/20/spec/OPS_2.0.1_draft.htm#Section1.3.7
http://www.idpf.org/epub/30/spec/epub30-publications.html#sec-core-media-types
As for XHTML: apparently only XHTML 1.1 is allowed in EPUB 2, and XHTML 5 in EPUB 3. I would probably include other versions as well, as I imagine there may be non-conforming creation tools out there.
from fido.
Great. Thanks for the refs Johan.
I suspect next PRONOM update will be around 27th June, possibly sooner.
from fido.
Hi Johan, Hi David,
Thanks a lot for the follow-up and fixing the problem.
from fido.