Comments (10)

bitsgalore commented on July 29, 2024

Hi Kris,

It should be possible to uniquely identify the file as EPUB by the signature, which is always the same, whether or not the contents of the container are compressed. Are you sure there's not something wrong with those EPUBs? If you open them in a hex editor, the start of the file should be something like this:

50 4B 03 04 14 00 00 00 00 00 55 A1 D5 42 6F 61 AB 2C 14 00 00 00 
14 00 00 00 08 00 00 00 6D 69 6D 65 74 79 70 65 61 70 70 6C 69 63 
61 74 69 6F 6E 2F 65 70 75 62 2B 7A 69 70

This corresponds to the signature used by PRONOM/FIDO. If you see something different, the file won't be identified as EPUB.
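
The same check can be scripted rather than done by eye in a hex editor. A minimal sketch in Python, assuming the pattern is exactly what the hex dump above shows (the ZIP local header magic, 26 further header bytes, then the entry name mimetype directly followed by its content). The names EPUB_SIGNATURE, SAMPLE_HEADER, looks_like_epub and file_looks_like_epub are made up for illustration and are not part of Fido:

```python
import re

# ZIP local file header magic "PK\x03\x04", the remaining 26 header bytes,
# then the entry name "mimetype" directly followed by its stored content.
EPUB_SIGNATURE = re.compile(
    rb"\APK\x03\x04.{26}mimetypeapplication/epub\+zip", re.DOTALL
)

# The 58 bytes from the hex dump above.
SAMPLE_HEADER = bytes.fromhex(
    "50 4B 03 04 14 00 00 00 00 00 55 A1 D5 42 6F 61 AB 2C 14 00 00 00 "
    "14 00 00 00 08 00 00 00 6D 69 6D 65 74 79 70 65 61 70 70 6C 69 63 "
    "61 74 69 6F 6E 2F 65 70 75 62 2B 7A 69 70"
)


def looks_like_epub(header: bytes) -> bool:
    """True if the bytes start with the EPUB signature described above."""
    return EPUB_SIGNATURE.match(header) is not None


def file_looks_like_epub(path: str) -> bool:
    """Read only the first 58 bytes of a file and test them."""
    with open(path, "rb") as handle:
        return looks_like_epub(handle.read(58))
```

Calling looks_like_epub(SAMPLE_HEADER) returns True; a file whose first ZIP entry is anything other than an uncompressed mimetype file fails the check.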

Another easy check you can do yourself: rename one of the files that go wrong to a .zip extension, and then open it in a ZIP extractor. There should be a mimetype file in the archive root, with the following content when opened in a text editor:

application/epub+zip

If the file is missing, or if it contains something else, that could explain your results to some degree. In that case I would expect FIDO to give a hit for the ZIP container, rather than the XHTML inside it. So this might also indicate some problem with the way Fido handles containers ...
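
The rename-and-extract check can also be done with Python's standard zipfile module, which does not care about the file extension. This is only a sketch: read_mimetype and has_epub_mimetype are illustrative names, and the in-memory archive at the end merely stands in for a real EPUB file.

```python
import io
import zipfile

EXPECTED_MIMETYPE = b"application/epub+zip"


def read_mimetype(source):
    """Return the bytes of the root 'mimetype' entry, or None if it is missing.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        try:
            return archive.read("mimetype")
        except KeyError:
            return None


def has_epub_mimetype(source) -> bool:
    """True only if the mimetype entry exists and has exactly the expected content."""
    return read_mimetype(source) == EXPECTED_MIMETYPE


# Demo on an in-memory archive (stands in for a renamed .epub file).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("mimetype", "application/epub+zip")
print(has_epub_mimetype(buffer))
```

The comparison is deliberately strict, so a mimetype file with a trailing newline or other extra content is reported as a mismatch.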

Just my 2 cents ...

Johan

from fido.

Kris-LIBIS commented on July 29, 2024

Hi Johan,

Indeed, the regex is '(?s)\A.{0,0}PK\x03\x04.{26}mimetypeapplication/epub+zip' and the epub file is well-formed (also checked with epub_validator) and recognised as epub by Fido, but also recognised as:

fmt/103,"Extensible Hypertext Markup Language","XHTML 1.1","application/xhtml+xml","container"

This happens on a few epubs and all of them happen to store the content uncompressed. I assume that fido is confused by the XHTML it sees in the file.

Unfortunately the epub files are commercial and I cannot share them. I created a dummy epub which shows the same issue. I have created a compressed and an uncompressed version. It is important to put the xhtml files early in the epub for the issue to pop up. Both epubs are exactly the same, the only difference is the compression.

compressed.epub: https://drive.google.com/file/d/0Bwh-YeRijm-GU0JYTGpQNXJLZkU/edit?usp=sharing
uncompressed.epub: https://drive.google.com/file/d/0Bwh-YeRijm-GTUU0VmlaOFRTMTA/edit?usp=sharing
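
For readers without access to the sample files, a similar pair can be approximated with Python's zipfile module. This is a sketch under assumptions, not a reconstruction of the files linked above: the XHTML content and the entry name chapter1.xhtml are invented, and the result is not a valid EPUB (it has no container.xml or package document). It only shows that the XHTML DOCTYPE is visible in the raw bytes of the stored variant and not in the deflated one.

```python
import io
import zipfile

DOCTYPE = b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1'

# Hypothetical page content; the repeated paragraph makes sure deflate
# really compresses the entry.
XHTML = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    '"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Dummy</title></head><body>"
    + "<p>Lorem ipsum dolor sit amet.</p>" * 40
    + "</body></html>\n"
)


def build_dummy_epub(compression: int) -> bytes:
    """Build a two-entry archive whose XHTML entry uses the given compression."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # The mimetype entry comes first and is always stored uncompressed.
        archive.writestr("mimetype", "application/epub+zip",
                         compress_type=zipfile.ZIP_STORED)
        # The XHTML entry follows immediately, so it sits early in the file.
        archive.writestr("chapter1.xhtml", XHTML, compress_type=compression)
    return buffer.getvalue()


uncompressed = build_dummy_epub(zipfile.ZIP_STORED)
compressed = build_dummy_epub(zipfile.ZIP_DEFLATED)

# bytes.find returns -1 when the DOCTYPE is not visible in the raw bytes.
print("uncompressed:", uncompressed.find(DOCTYPE))
print("compressed:  ", compressed.find(DOCTYPE))
```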

bitsgalore commented on July 29, 2024

Hi Kris,

Thanks for uploading the samples, this makes things a lot clearer. I did some tests with other ID tools:

  • Version 5.09 of the Unix File tool identifies both EPUBs as application/epub+zip. Interestingly, an older version (I think 5.04, but not sure) of File on my home PC also mis-identified the uncompressed EPUB as HTML.
  • Apache Tika 1.5 also correctly identifies both EPUBs as application/epub+zip, even after deliberately changing the file extensions.

If I look at the Fido signature for fmt/103, it seems that it is looking for an HTML-specific pattern within the first 1024 bytes of the file. Since the ZIP file is uncompressed, Fido will find this. I think the solution here will be to add the following to the epub entry:

<has_priority_over>fmt/103</has_priority_over>

Since this only covers XHTML 1.1, I suppose additional entries may be needed here for other formats that can live inside an EPUB container (see the EPUB specs for this). It should be possible to make this work without using the file extension info.

I suppose this will affect DROID/PRONOM as well, since that is the upstream source of Fido's sigs.

I hope this helps.

Johan

bitsgalore commented on July 29, 2024

Update to above comment: DROID (6.1.3) identifies these files correctly, so this issue appears to be restricted to Fido.

Dclipsham commented on July 29, 2024

Seems to me that this is a PRONOM issue.

Consider fmt/103. The signature is seeking:
BOF (maximum offset 1024):

0x3C21444F43545950452068746D6C205055424C494320222D2F2F5733432F2F445444205848544D4C20312E31

(<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1)
VAR:

3C68746D6C20786D6C6E733D22687474703A2F2F7777772E77332E6F72672F313939392F7868746D6C22 then 3C7469746C653E then 3C2F7469746C653E

(<html xmlns="http://www.w3.org/1999/xhtml") then later (<title>) then later still (</title>)

In the uncompressed sample file provided, the BOF sequence begins at offset 709, so the file conforms to fmt/103.
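
To make the matching logic concrete, here is a simplified reading of the fmt/103 signature in Python. It assumes that "maximum offset 1024" means the BOF sequence may start at any offset from 0 up to and including 1024, and that the three variable-position parts only need to occur somewhere in the file in the stated order. The function name matches_fmt103 is illustrative; this is not how DROID or Fido implement matching.

```python
# Byte sequences as decoded in the comment above.
BOF_SEQUENCE = b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1'
VAR_PARTS = (
    b'<html xmlns="http://www.w3.org/1999/xhtml"',
    b"<title>",
    b"</title>",
)
MAX_BOF_OFFSET = 1024  # assumed to be inclusive


def matches_fmt103(data: bytes) -> bool:
    """Simplified fmt/103 check: anchored DOCTYPE plus ordered variable parts."""
    # The BOF sequence must start at an offset no greater than MAX_BOF_OFFSET.
    bof_end = MAX_BOF_OFFSET + len(BOF_SEQUENCE)
    if data.find(BOF_SEQUENCE, 0, bof_end) == -1:
        return False
    # The variable-position parts must appear anywhere, in this order.
    position = 0
    for part in VAR_PARTS:
        position = data.find(part, position)
        if position == -1:
            return False
        position += len(part)
    return True
```

Under these assumptions, any file with an XHTML 1.1 DOCTYPE starting at offset 709 and the html and title tags following it matches, regardless of what the first bytes of the file are.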

Consider fmt/483. This signature is seeking:

504B0304{26}6D696D65747970656170706C69636174696F6E2F657075622B7A6970

(PK..), i.e. the ZIP stub, then 26 bytes later (mimetypeapplication/epub+zip)

This too is present in the sample file, so the file also conforms to fmt/483.

DROID 6.1.3 is not giving a multiple identification, because a) the file conforms to the 'ZIP' format, so is passed to the container signature mechanism for identification, and b) the file conforms to something within the container signature, which in this case is ePub.

Run it through DROID 5, however (which has no notion of containers), and you get a multiple ID - fmt/103 and fmt/483.

I have no idea what FIDO is doing, but I understand it to work in a different way to DROID, albeit with the same data. As demonstrated above, the data itself is problematic.

The solution is simple: give ePub priority over the XHTML formats within PRONOM. There is precedent for this in formats like WARC, which also contain uncompressed HTML files, usually well within the 1024-byte offset PRONOM allows for the HTML format family.
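
As an illustration of what such a priority rule does to a multiple identification, here is a small Python sketch. It is not Fido's or DROID's actual resolution code: apply_priorities and the PRIORITIES mapping are hypothetical, and the single rule shown (fmt/483 over fmt/103) is the one proposed in this thread.

```python
def apply_priorities(matches, has_priority_over):
    """Drop every matched format that another matched format overrides."""
    suppressed = set()
    for puid in matches:
        suppressed.update(has_priority_over.get(puid, ()))
    return [puid for puid in matches if puid not in suppressed]


# Assumed rule: the EPUB entry (fmt/483) overrides XHTML 1.1 (fmt/103).
PRIORITIES = {"fmt/483": {"fmt/103"}}

print(apply_priorities(["fmt/483", "fmt/103"], PRIORITIES))
print(apply_priorities(["fmt/103"], PRIORITIES))
```

With the rule in place, a file that matches both signatures is reported only as fmt/483, while a plain XHTML 1.1 file that matches fmt/103 alone is unaffected.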

I'll add this to PRONOM in June's release.

Many thanks to all for investigating and reporting.

David

bitsgalore commented on July 29, 2024

Hi David,
Apart from XHTML, you should probably also take into account other formats that can live in the EPUB container for the priority list. See the EPUB 2 and EPUB 3 specs for details. Off the top of my head, this includes XML, HTML 5, SVG, and a bunch of other formats. This may be trickier than you would expect: apart from a well-defined set of 'core media types' (formats that any EPUB reader must render), pretty much any other format is allowed as well.

Dclipsham commented on July 29, 2024

Thanks Johan,

Within PRONOM, XML and its subsets do not allow an offset from BOF (strictly speaking, we've allowed a max offset of 3 for XML to accommodate the optional UTF-8 BOM), so a file cannot conform to both XML (or SVG etc.) and ePub (remember that ePub needs PK.. first).

However, for the HTML formats we have allowed a max offset of 1024 to account for the slack around whitespace, comments etc. I'll give ePub priority over those, although would that be necessary for e.g. HTML 3.2, or just XHTML and HTML5?

bitsgalore commented on July 29, 2024

Hi David,

Full lists for EPUB 2/3 respectively are here:

http://www.idpf.org/epub/20/spec/OPS_2.0.1_draft.htm#Section1.3.7

http://www.idpf.org/epub/30/spec/epub30-publications.html#sec-core-media-types

As for XHTML: apparently only XHTML 1.1 is allowed in EPUB 2, and XHTML 5 in EPUB 3. I would probably include other versions as well, as I imagine there may be non-conforming creation tools out there.

Dclipsham commented on July 29, 2024

Great. Thanks for the refs Johan.

I suspect next PRONOM update will be around 27th June, possibly sooner.

Kris-LIBIS commented on July 29, 2024

Hi Johan, Hi David,

Thanks a lot for the follow-up and fixing the problem.
