
fido's Issues

wrong result for odt into zip

Scanning an archive (with the -zip argument) that contains an odt file identifies the odt as ZIP Format (wrong) and then analyzes all the objects inside it (jpg, xml, etc.).

Strangely, when I scan the same odt file directly with the -zip argument, fido correctly detects the OpenDocument Text format without looking inside it.

Maybe you could add an argument to scan inside odt containers, disabled by default when scanning an archive.

Either way, the result ZIP Format is wrong for an OpenDocument file inside a zip.

Suggestion for improved identification of XML using XMP parser

Identification of XML goes wrong if files don't contain an XML declaration (which is not required by the XML spec). Not a Fido bug, but simply a limitation of signature-based identification.

Possible solution: check for XML well-formedness using Python's Expat parser; possibly add this as a user-activated option. More details + sample code here:

http://www.openplanetsfoundation.org/blogs/2011-07-11-improved-identification-xml-python-experiment
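
A minimal sketch of the well-formedness check suggested above, using Python's standard `xml.parsers.expat` module. The function name `is_well_formed_xml` is hypothetical and not part of Fido; the sketch also assumes the whole file content is available, whereas Fido normally looks only at buffers from the start and end of a file.

```python
import xml.parsers.expat


def is_well_formed_xml(data):
    """Return True if `data` (bytes) parses as well-formed XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument marks this as the final chunk of input.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True
```

Because this needs a full parse, it would probably only make sense as a user-activated fallback when no signature matches.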

Original issue: FIDO-14

Add seek to the zip file-like-object

The zipitem file-like-object supports read(n_bytes), but does not support seek(). When the item is compressed, then seek will have to scan through from the start - inefficient, but it would eliminate the need for special handling.
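
One possible shape for this, as a sketch only: a wrapper that emulates `seek()` by reopening the item and reading forward, exactly the inefficient-but-simple approach described above. The class name `SeekableZipItem` is hypothetical; the sketch relies only on `zipfile.ZipFile.open()`, `getinfo()` and `read()`.

```python
import io
import zipfile


class SeekableZipItem(object):
    """Forward-only zip item reader with an emulated seek() (hypothetical helper)."""

    def __init__(self, archive, name):
        self._archive = archive
        self._name = name
        self._stream = archive.open(name)
        self._pos = 0

    def read(self, n_bytes=-1):
        data = self._stream.read(n_bytes)
        self._pos += len(data)
        return data

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_CUR:
            offset += self._pos
        elif whence == io.SEEK_END:
            offset += self._archive.getinfo(self._name).file_size
        target = max(offset, 0)
        if target < self._pos:
            # A compressed stream cannot be rewound: reopen and rescan.
            self._stream.close()
            self._stream = self._archive.open(self._name)
            self._pos = 0
        while self._pos < target:
            chunk = self.read(min(65536, target - self._pos))
            if not chunk:
                break  # reached the end of the item
        return self._pos
```

Seeking backwards costs a full rescan from the start of the item, so callers that read the end-of-file buffer after the start-of-file buffer pay for the decompression twice.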

Format Extensions "Registry", updating through "update_signatures" script

At the moment the "format_extensions.xml" file with advanced signatures or signatures unknown to PRONOM is updated by committing the changed file to the FIDO codebase.

The drawback of this method is that users who add their own signatures to this file are in danger that a new version of FIDO or a new version of the extension file overwrites their changes.

Ideally, we should create a GitHub project for the Format Extensions, as a sort of registry, from which the "update_signatures" script pulls the changes.

This way users are able to create pull requests for advanced or unknown signatures to have them added to the Format Extension file.

Additionally, there should be a "user_extensions.xml" file with "special" or "private" signatures which is untouched by any of the update processes.

Support local extensions to the format library

Provide a method to extend the set of signatures. Perhaps a file which holds the basic information. Once mature, the new signatures could be added to Pronom or into the Pronom XML syntax.

Delete old versions of PRONOM files?

Older PRONOM updates (eg 0bbf39d) don't keep the old version around, but update & rename the DROID_SignatureFile-V##.xml, formats-v##.xml and pronom-xml-v##.zip. Newer updates to PRONOM (eg #81) have kept the older version around, presumably as a reference/backup. Is there a benefit to keeping the old versions around or should they be deleted? They're still available in the version history if something goes wrong with the new version, but wouldn't be easily available in a non-development install.

Determining file formats within a ZIP file gives an [Error 2] but then correctly determines the format

fido.py Personal_Files_Folder.zip yields:

OK,168,x-fmt/263,"ZIP Format","ZIP format",294895,"Personal_Files_Folder.zip","application/zip","signature"
OK,6,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",3919,"Personal_Files_Folder.zip!Personal_Files_Folder/BIBORGAN.SC","None","signature"
[Errno 2] No such file or directory: 'Personal_Files_Folder.zip!Personal_Files_Folder/CH7.RD'
OK,11,fmt/111,"OLE2 Compound Document Format","OLE2 Compound Document Format",149504,"Personal_Files_Folder.zip!Personal_Files_Folder/CH7.RD","None","signature"
OK,6,fmt/393,"Borland Reflex flat datafile","Borland Reflex flat datafile",10808,"Personal_Files_Folder.zip!Personal_Files_Folder/COURTNE.RXD","None","signature"
OK,6,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",7400,"Personal_Files_Folder.zip!Personal_Files_Folder/DELIVERY","None","signature"
OK,20,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",156689,"Personal_Files_Folder.zip!Personal_Files_Folder/DIRMIN2","None","signature"
OK,10,x-fmt/22,"7-bit ASCII Text","External",30464,"Personal_Files_Folder.zip!Personal_Files_Folder/INDEX.ASC","text/plain","extension"
OK,10,x-fmt/283,"8-bit ASCII Text","External",30464,"Personal_Files_Folder.zip!Personal_Files_Folder/INDEX.ASC","text/plain","extension"
OK,6,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",9915,"Personal_Files_Folder.zip!Personal_Files_Folder/MODULE1.RH","None","signature"
OK,8,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",27580,"Personal_Files_Folder.zip!Personal_Files_Folder/NZ&AUST.WP","None","signature"
OK,26,x-fmt/8,"dBASE Database","dBase Table Version II (date last updated (month (1-12), day (1-31), year)",819200,"Personal_Files_Folder.zip!Personal_Files_Folder/NZPN.DBF","None","signature"
KO,23,,,,220672,"Personal_Files_Folder.zip!Personal_Files_Folder/NZPNPERS.NDX",,"fail"
OK,6,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",3062,"Personal_Files_Folder.zip!Personal_Files_Folder/PTCHALMI.WP","None","signature"
OK,12,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",59816,"Personal_Files_Folder.zip!Personal_Files_Folder/SEMINAR.DOC","None","signature"
OK,9,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",30937,"Personal_Files_Folder.zip!Personal_Files_Folder/SESSION2","None","signature"
OK,6,x-fmt/394,"WordPerfect for MS-DOS/Windows Document","WordPerfect 5.1",7494,"Personal_Files_Folder.zip!Personal_Files_Folder/TOWNNUMB","None","signature"
OK,17,fmt/125,"Microsoft Powerpoint Presentation","Powerpoint 95",76800,"Personal_Files_Folder.zip!Personal_Files_Folder/Week7.ppt","application/vnd.ms-powerpoint","signature"

The second file within the zip is CH7.RD which fido claims is not found, but then it successfully determines the format. Also, running fido.py on the unzipped files from this .zip works fine.

Maurice Bouchard commented:

The command I issued should read:
fido.py -zip Personal_Files_Folder.zip

sorry for the confusion.

Maurice de Rooij commented:

Thank you very much for reporting.

This issue will be fixed in the next commit.

The second file within the zip is CH7.RD which fido claims is not found, but then it successfully determines the format. Also, running fido.py on the unzipped files from this .zip works fine.

The read error is due to the fact that the function which analyzes container files is not yet able to recurse into zipfiles.
The successful determination of the format afterwards happens because that result is the one that originally triggered the container function.

Original issue: FIDO-28

invalid SRE code

Dev Effort

0.5D

Description

I don't know why, but for the past few days fido has failed to identify docx files and reports zip in their place, and the message 'invalid SRE code' appears at the start. I read in a previous issue that 'invalid SRE code' looks like a bug in Python v2.7.3, so I tested with Python v2.7.6: no more 'invalid SRE code', but docx is still not recognized:

FIDO v1.3.1 (formats-v78.xml, container-signature-20130501.xml, format_extensions.xml)
invalid SRE code
OK,9,x-fmt/263,"ZIP Format","ZIP format",3734,"/home/fajir/test.docx","application/zip","signature"

This happens although the first item (customized with the PRONOM PUID in place of the fido PUID) has priority over x-fmt/263 in my conf/format_extensions.xml:

<format>
<puid>fmt/412</puid>
<name>Microsoft Office Open XML - Word</name>
<mime>application/vnd.openxmlformats-officedocument.wordprocessingml.document</mime>
<extension>docx</extension>
<has_priority_over>x-fmt/263</has_priority_over>
<has_priority_over>fmt/189</has_priority_over>
<signature>
<name>Microsoft Office Open XML - Word</name>
<pattern><position>BOF</position><regex>(?s)\APK\x03\x04</regex></pattern>
<pattern><position>BOF</position><regex>(?s)\A.{30}\[Content_Types\]\.xml \xa2</regex></pattern>
<pattern><position>EOF</position><regex>(?s)\x00\x00word/.{1,20}\.xmlPK\x01\x02\x2d.{0,2000}\Z</regex></pattern>
</signature>
</format>

Any help ?

Specific regex fails to parse on some Python installations

Prior to Python 2.7 commit 82219:c1b3d25882ca, the maximum repetition number in a regular expression was 65535. (It's now 4294967294 for 64-bit platforms.) I believe 2.7.5 is the first 2.x series Python with this change; it was also applied to Python 3.2 and 3.3 releases in the last year.

This ends up being a problem because one regular expression in PRONOM, the one for x-fmt/386, actually checks for 65536 repetitions of something:

(?s)\A.{0,0}\x00\x00\x01\xba.{8,12}\x00\x00\x01\xbb.{8,65536}\x00\x00\x01\xb3.{8,128}\x00\x00\x01\xb5

As a result, older Python 2.7.x releases can't compile this regular expression, raising the RuntimeError "invalid SRE code". I encountered this when using FIDO on Ubuntu 12.04, which ships Python 2.7.3. (There's no such problem in recent OS X releases or Ubuntu 14.04.)

This has strange results on file identification. When scanning a TIFF, I noticed that (even though x-fmt/386 is an MPEG format that should not match the file either way), incorrect results are returned on OSs where the exception is raised vs OSs where it is not.

Python 2.7.6:

OK,317,fmt/353,"Tagged Image File Format","TIFF generic (little-endian)",28860926,"/Users/vlcice/Downloads/Structures and Landscapes-Volume1-mtp00006.tif","image/tiff","signature"

Python 2.7.3:

OK,40,fmt/152,"Digital Negative Format (DNG)","External",28860926,"Structures and Landscapes-Volume1-mtp00006.tif","image/tiff","extension"
OK,40,fmt/153,"Tagged Image File Format for Image Technology (TIFF/IT)","External",28860926,"Structures and Landscapes-Volume1-mtp00006.tif","image/tiff","extension"
OK,40,fmt/154,"Tagged Image File Format for Electronic Photography (TIFF/EP)","External",28860926,"Structures and Landscapes-Volume1-mtp00006.tif","image/tiff","extension"
OK,40,fmt/155,"Geographic Tagged Image File Format (GeoTIFF)","External",28860926,"Structures and Landscapes-Volume1-mtp00006.tif","image/tiff","extension"
OK,40,fmt/156,"Tagged Image File Format for Internet Fax (TIFF-FX)","External",28860926,"Structures and Landscapes-Volume1-mtp00006.tif","image/tiff","extension"

I've scanned through every regex in the current extension file, and x-fmt/386 is the only one that doesn't compile in earlier Pythons.
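
A scan like the one described above can be sketched as follows. The helper name `find_uncompilable` is hypothetical; extracting the regex strings from the signature XML is left out, so the function simply takes an iterable of patterns. `RuntimeError` is caught because that is what the affected Python 2.7.x releases raise; newer interpreters report problems as `re.error` or `OverflowError`.

```python
import re


def find_uncompilable(patterns):
    """Return (pattern, error message) pairs for regexes that fail to compile."""
    failures = []
    for pattern in patterns:
        try:
            re.compile(pattern)
        except (re.error, RuntimeError, OverflowError) as exc:
            failures.append((pattern, str(exc)))
    return failures
```

Running this over the signature patterns on each supported interpreter would flag entries like x-fmt/386 before they silently distort identification results.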

Accept content from stdin

As of 0.7, Fido accepts input from files and a list of files from stdin.
Add the ability to accept content from stdin, perhaps when the file list is '-'.

This will allow checking of one file per invocation.
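
A rough sketch of the '-' convention, assuming a hypothetical helper `read_input` and an illustrative default buffer size (neither is taken from Fido). On Python 3 the raw bytes are read from `sys.stdin.buffer`; on Python 2 `sys.stdin` itself is used.

```python
import sys


def read_input(name, bufsize=128 * 1024, stdin=None):
    """Return the first `bufsize` bytes of `name`, or of stdin when name is '-'."""
    if name == "-":
        if stdin is None:
            # Python 3 exposes raw bytes via sys.stdin.buffer.
            stdin = getattr(sys.stdin, "buffer", sys.stdin)
        return stdin.read(bufsize)
    with open(name, "rb") as handle:
        return handle.read(bufsize)
```

Since stdin is not seekable, matching end-of-file patterns would require buffering the whole stream, which this sketch does not attempt.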

Cleanup extensions file

The extensions file needs a cleanup as some signatures are in PRONOM and/or in the container signature file.

MKV File format

Hi,

I am trying to check whether an MKV file really is an MKV file (video/x-matroska MIME type), using FIDO with formats-v81.xml. I use this command:

$ python fido.py -matchprintf "%(info.mimetype)s\n" "my-file.mkv"

But it always returns "None". Can you help me, please?

Thanks

Fido "Can't convert 'bytes' object to str implicitly" in Python 3.4/3.5

Hi,
As a new user of Fido, I ran into the error message from the subject while analysing Fido's README.txt on:

  • Windows 8.1 Enterprise (64 bits)
  • Python 3.4
  • Fido 1.3.4

The problem seems related to differences between Python 2 and 3 (Unicode handling).
If I use Python 2.7.11, Fido works just fine ("Plain Text File").
If I use Python 3.5.1, I get the same error message.
See below for more details.

c:\fido>fido
usage: fido-script.py [-h] [-v] [-q] [-recurse] [-zip] [-nocontainer]
[-pronom_only] [-input INPUT] [-filename FILENAME]
[-useformats INCLUDEPUIDS] [-nouseformats EXCLUDEPUIDS]
[-matchprintf FORMATSTRING]
[-nomatchprintf FORMATSTRING] [-bufsize BUFSIZE]
[-container_bufsize CONTAINER_BUFSIZE]
[-loadformats XML1,...,XMLn] [-confdir CONFDIR]
[FILE [FILE ...]]
(etc. - Fido seems to have been installed properly)

c:\fido>fido README.txt
FIDO v1.3.4 (formats-v84.xml, container-signature-20160121.xml, format_extensions.xml)
Traceback (most recent call last):
File "C:\Python34\Scripts\fido-script.py", line 9, in <module>
load_entry_point('opf-fido==1.3.4', 'console_scripts', 'fido')()
File "C:\Python34\lib\site-packages\opf_fido-1.3.4-py3.4.egg\fido\fido.py", line 869, in main
fido.identify_file(file)
File "C:\Python34\lib\site-packages\opf_fido-1.3.4-py3.4.egg\fido\fido.py", line 375, in identify_file
bofbuffer, eofbuffer, _ = self.get_buffers(f, size, seekable=True)
File "C:\Python34\lib\site-packages\opf_fido-1.3.4-py3.4.egg\fido\fido.py", line 543, in get_buffers
bofbuffer = self.blocking_read(stream, bytes_to_read)
File "C:\Python34\lib\site-packages\opf_fido-1.3.4-py3.4.egg\fido\fido.py", line 527, in blocking_read
buffer += readbuffer
TypeError: Can't convert 'bytes' object to str implicitly

c:\fido>
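
The traceback points at `buffer += readbuffer`, which fails on Python 3 when the accumulator starts life as a `str` while the stream returns `bytes`. The following is only a sketch of a bytes-safe read loop with the same name and arguments as the function in the traceback; it is an assumption about the shape of a fix, not Fido's actual code.

```python
def blocking_read(stream, bytes_to_read):
    """Read up to `bytes_to_read` bytes, tolerating short reads."""
    buffer = b""  # bytes, not str: binary streams return bytes on Python 3
    while len(buffer) < bytes_to_read:
        chunk = stream.read(bytes_to_read - len(buffer))
        if not chunk:
            break  # end of stream
        buffer += chunk
    return buffer
```

On Python 2 `b""` is simply a `str`, so the same code runs unchanged on both major versions as long as files are opened in binary mode.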

1.3.3 struggling with zero byte files

Stack trace:

  goatslayer@goatslayer-acer-linux:~/git/droid-sqlite-analysis$ fido empty-file.empty 
  FIDO v1.3.3 (formats-v84.xml, container-signature-20160121.xml, format_extensions.xml)
  FIDO: Zero byte file (empty): Path is: empty-file.empty
  Traceback (most recent call last):
    File "/usr/local/bin/fido", line 9, in <module>
      load_entry_point('opf-fido==1.3.3', 'console_scripts', 'fido')()
    File "/usr/local/lib/python2.7/dist-packages/fido/fido.py", line 855, in main
      fido.identify_file(file)
    File "/usr/local/lib/python2.7/dist-packages/fido/fido.py", line 375, in identify_file
      bofbuffer, eofbuffer = self.get_buffers(f, size, seekable=True)
  ValueError: too many values to unpack

Empty file listing below:

  goatslayer@goatslayer-acer-linux:~/git/droid-sqlite-analysis$ ls -l empty-file.empty 
  -rw-rw-r-- 1 goatslayer goatslayer 0 May 15 12:48 empty-file.empty
  goatslayer@goatslayer-acer-linux:~/git/droid-sqlite-analysis$ 

Distro stats:

Python 2.7.6
No LSB modules are available.
Distributor ID: Ubuntu 
Description:    Ubuntu 14.04.4 LTS
Release:    14.04
Codename:   trusty

Matching failed when unzipping

I found that fido was not giving consistent results for the same files when they were stored in a ZIP rather than as a plain file. To reproduce, attempt to identify the contents of the govdocs1 subset0.zip test files.

Unittests

Currently there are no unit or regression tests. They should be added.

Assertion error while updating signature file

I just installed Fido 1.0 (Win XP), after which I tried to update the signature file (latest version is v59). At the end of the updating procedure, while Fido is trying to convert the PRONOM signatures to Fido's format, an assertion error occurs. Below is a screen dump of the updating procedure:

C:\fido>c:\python27\python .\update_signatures.py
FIDO signature updater v1.0
Contacting PRONOM...
Querying latest signaturefile version...
Downloading signature file version 59...
Extracting PRONOM PUID's from signature file...
Found 864 PRONOM PUID's
Downloading signatures can take a while
Continue and download signatures? (yes/no): y
Creating temporary folder for download: C:\fido\conf\tmp
Downloading signatures, one moment please...
  100%
Creating PRONOM zip, adding files with compression mode 'deflated'
Deleting temporary folder and files...
Preparing to convert PRONOM formats to FIDO signatures...
Conversion: Illegal character in bracket: char='0', at pos 31 in
  52494646{4}57415645*666D7420[12000000:FFFFFF7F][!FEFF]{16-*}64617461
                                 ^
Buffer = (?s)\ARIFF.{4}WAVE.*fmt [\x12
Traceback (most recent call last):
  File ".\update_signatures.py", line 131, in <module>
    main()
  File ".\update_signatures.py", line 126, in main
    prepare.main()
  File "C:\fido\prepare.py", line 574, in main
    info.load_pronom_xml(args.puid)
  File "C:\fido\prepare.py", line 105, in load_pronom_xml
    format = self.parse_pronom_xml(stream, puid_filter)
  File "C:\fido\prepare.py", line 185, in parse_pronom_xml
    regex = convert_to_regex(bytes, 'Little', pos, offset, max_offset)
  File "C:\fido\prepare.py", line 456, in convert_to_regex
    assert(chars[i] == ':')
AssertionError

Add multi-threading / multi-processing

The 0.5.x implementation appears to be IO bound. Throughput would be increased by moving file-reads to a separate thread so that they will happen in parallel with pattern matching.

One approach: add multiple workers, each of which reads, matches. Another: add a pool to do reads, and another to do matches.

But - it's all fast enough for now!
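
The first approach (multiple workers, each of which reads and matches) might look roughly like the sketch below. The functions `identify` and `identify_many`, the `(name, compiled regex)` signature shape, and the default buffer size are all hypothetical simplifications, not Fido's API. Threads are used on the assumption that the work is IO bound, as stated above.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def identify(path, signatures, bufsize=4096):
    """Read the head of `path` and return names of matching signatures."""
    with open(path, "rb") as handle:
        head = handle.read(bufsize)
    return [name for name, regex in signatures if regex.search(head)]


def identify_many(paths, signatures, workers=4):
    """Identify files concurrently; results keep the order of `paths`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: identify(p, signatures), paths))
```

`concurrent.futures` is in the standard library from Python 3.2; on Python 2 it would need the `futures` backport or a plain `threading` pool.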

Refactor code to Python 3.x

Refactoring to Python 3.x involves the following tasks:

* Can't convert 'bytes' object to str implicitly in def identify_file
* urlparse is now moved to urllib, but import fails
* builtin object problems
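
For the `urlparse` item, a common compatibility pattern is to try the Python 3 location first and fall back to the Python 2 module. This is a generic sketch, not taken from the Fido codebase; the URL in the usage line is only an example.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# Usage is identical on both versions.
parts = urlparse("http://www.example.org/path/file.xml")
host = parts.netloc
```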

Original issue: FIDO-17

Review recent code that extends fido XML to include more registry data

Submitted by Andrew Jackson:

To clarify, I added experimental code that pulls more details out of the Format Record and populates an additional 'details' section in the Fido signature file. It looks like this:

<details>
      <dc:description>This is an outline record only, and requires further details, research or authentication to provide information that will enable users to further understand the format and to assess digital preservation risks associated with it if appropriate. If you are able to help by supplying any additional information concerning this entry, please return to the main PRONOM page and select &#226;&#128;&#152;Add an Entry&#226;&#128;&#153;.</dc:description>
      <dcterms:available />
      <dc:creator />
      <dcterms:publisher />
      <content_type />
      <record_metadata>
        <status>unknown</status>
        <dc:creator>Digital Preservation Department / The National Archives</dc:creator>
        <dcterms:created>11 Mar 2005</dcterms:created>
        <dcterms:modified>02 Aug 2005</dcterms:modified>
        <dc:description />
      </record_metadata>
    </details>

The problem is that it's not clear that this is a good idea, and it may slow down parsing unnecessarily. I am increasingly of the opinion that most of this data should be in a true format registry, and that identification tools should only include a minimal amount of data and refer the user to the registry for this kind of detail.

Having said that, this is not a critical issue for Fido as it only slows things down, at worst.

Original issue: FIDO-2

Automated importing of PRONOM signatures and file extensions

Currently there is no way to automatically convert PRONOM/DROID signatures to a Fido-compatible format. In an operational setting this would be a pretty severe limitation, and it makes managing the signature information quite difficult. Also, it would be helpful to use some kind of versioning scheme for the 'formats' and 'format_extensions' files, and to record some information on the provenance of the information in these files (e.g. "DROID" + sig file number).

Andrew Jackson added a comment - 19/Sep/11 3:23 PM
I don't quite understand this issue, as Fido does contain code to turn PRONOM Format Records into Fido signatures. Does this issue refer to the ability to re-use the pre-compiled signatures in the DROID signature file? That might be possible, but will certainly be rather ugly. It might be easier to download the corresponding Format Record instead of using the DROID sig file directly.

Maurice de Rooij added a comment - 19/Sep/11 4:26 PM
There is a script called 'prepare.py' which converts the pronom-xml.zip in 'conf'.
Have just fixed a minor bug which caused the script to crash when it encountered a certain byte while saving the formats.xml file. The current script to fetch the Format Records is not very cross-platform friendly (a bash script) and am currently extending 'prepare.py' to fetch AND convert the Format Records on the fly.

Original issue: FIDO-6

Python 3 support

Dev Effort

10D

Description

Improve Python3 support.

Some initial work was done in #67 but, as issues like #78 suggest, it still needs some attention.

To Do:

  • Check for bytes vs str mismatch (eg #67)
  • Fix update_signatures to not use deprecated httplib.HTTP
  • ... other?

epub recognized as xls

Tried with several epub files; same behaviour each time.

$ ./fido.py ~/Downloads/Zizek\ -\ Vivere\ alla\ fine\ dei\ tempi.epub

FIDO v1.1.2 (formats-v66.xml, container-signature-20121218.xml, format_extensions.xml)
OK,295,x-fmt/263,"ZIP Format","ZIP format",742241,"/Users/void/Downloads/Zizek - Vivere alla fine dei tempi [Ladri di biblioteche].epub","application/zip","container"
OK,295,fmt/61,"Microsoft Excel 97 Workbook (xls)","BIFF 8 & 8X Workbook (generic)",742241,"/Users/raffaele/Downloads/Zizek - Vivere alla fine dei tempi.epub","application/vnd.ms-excel","container"
FIDO: Processed      1 files in 386.89 msec,  3 files/sec

Performance testing

I've done informal performance testing getting 20-60 files per second with the Oct-2010 signature files. This should be done in a proper controlled environment using an established corpus.

When using -zip experiencing errors on a file from OPF Format Corpus

Attempting to scan the opf-format-corpus, I'm seeing an error from a specific file:

pdfCabinetOfHorrors/embedded_video_quicktime.doc

  goatslayer@goatslayer-acer-linux:~/git/opf-format-corpus/format-corpus/pdfCabinetOfHorrors$ fido -zip embedded_video_quicktime.doc
  FIDO v1.3.3 (formats-v84.xml, container-signature-20160121.xml, format_extensions.xml)
  bad repeat interval
  bad repeat interval
  OK,250,fmt/111,"OLE2 Compound Document Format","OLE2 Compound Document Format",26624,"embedded_video_quicktime.doc","None","signature"
  Traceback (most recent call last):
    File "/usr/local/bin/fido", line 9, in <module>
      load_entry_point('opf-fido==1.3.3', 'console_scripts', 'fido')()
    File "/usr/local/lib/python2.7/dist-packages/opf_fido-1.3.3-py2.7.egg/fido/fido.py", line 855, in main
      fido.identify_file(file)
    File "/usr/local/lib/python2.7/dist-packages/opf_fido-1.3.3-py2.7.egg/fido/fido.py", line 400, in identify_file
      self.identify_contents(filename, type=self.container_type(matches))
    File "/usr/local/lib/python2.7/dist-packages/opf_fido-1.3.3-py2.7.egg/fido/fido.py", line 418, in identify_contents
      raise RuntimeError("Unknown container type: " + repr(type))
  RuntimeError: Unknown container type: 'ole'

Distro stats:

Python 2.7.6
No LSB modules are available.
Distributor ID: Ubuntu 
Description:    Ubuntu 14.04.4 LTS
Release:    14.04
Codename:   trusty

My mirror of the OPF Format Corpus can be found here: https://github.com/ross-spencer/opf-format-corpus

ascii text

Why is it that fido cannot identify a file with the contents:

Hello world.

Whereas the Unix file utility can?

Add format groups

Add format groups so that it is easier to use the -formats or -excludeformats arguments. For example, if all of the PDF formats were placed into a group, then
fido.run -formats pdf -r .
would identify all of the non-pdf documents in the directory tree.

bogus escape: '\\x' on word documents identification (container-signature updated : container-signature-20140923.xml)

I tried updating the conf/container-signature file (to container-signature-20140923.xml) to see the difference in Word document identification, and the result is bad: docx files are no longer recognized (they are identified as zip). See:

with container-signature-20140923.xml :

FIDO v1.3.1 (formats-v78.xml, container-signature-20140923.xml, format_extensions.xml)
OK,240,fmt/40,"Microsoft Word for Windows Document","Microsoft Word for Windows 97 - 2002",11776,"/home/fajir/docs/AnnexeDoc.doc","application/msword","signature"
bogus escape: '\\x'
OK,10,x-fmt/263,"ZIP Format","ZIP format",25204,"/home/fajir/docs/cours sur les theories de la motivation.docx","application/zip","signature"

with container-signature-20130501.xml :

FIDO v1.3.1 (formats-v78.xml, container-signature-20130501.xml, format_extensions.xml)
OK,230,fmt/40,"Microsoft Word for Windows Document","Microsoft Word for Windows 97 - 2002",11776,"/home/fajir/docs/AnnexeDoc.doc","application/msword","signature"
OK,29,fmt/412,"Microsoft Office Open XML - Word","Microsoft Office Open XML - Word",25204,"/home/fajir/docs/cours sur les theories de la motivation.docx","None","signature"

I think the bug comes from "bogus escape: '\x'".

uncompressed epubs

Currently epubs whose files are stored in the container uncompressed are recognized by fido as either format fmt/483 or fmt/103.

IMHO fmt/483 ('ePub format') should have precedence over fmt/103 ('Extensible Hypertext Markup Language'). Maybe this is a PRONOM issue, but I was bitten by the fact that fido lists the two formats in different order on different machines. On the test machines 483 is consistently listed first, but in production the 103 format was first, causing epub detection to fail as we assumed fido would list the best match first.

If I'm not mistaken fido currently does not use the file extension in determining the format, but I'm advocating that file extension should be used to determine the 'best' format match in case of multiple hits. It would make fido more reliable in cases like ours, where a background process cannot pause and wait for an operator to make the proper choice.
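
The tie-break proposed above could be sketched like this. The function name `order_by_extension` and the `(puid, extensions)` pair shape are hypothetical, and the extension lists in the usage example are illustrative rather than copied from PRONOM.

```python
import os


def order_by_extension(filename, matches):
    """Put matches whose known extensions include the file's extension first.

    `matches` is a list of (puid, extensions) pairs; this shape is hypothetical.
    """
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    # sorted() is stable, so ties keep their incoming order.
    return sorted(matches, key=lambda m: ext not in m[1])


# Illustrative data only.
hits = [("fmt/103", ["xhtml", "html"]), ("fmt/483", ["epub"])]
best = order_by_extension("book.epub", hits)[0][0]
```

Because the sort is stable, this only resolves ties that the extension can decide; the incoming order would still need to be made deterministic for the remaining cases.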

Match each pattern only once per file

Many signatures re-use patterns. For example, the PDFs all have the same end-of-file pattern. The Zip family (jar, zip, odf, ooxml, ...) all share some patterns. It would be easy to check these once per file.
A better approach might be to change the signature approach so that these tests are moved up to a super-type and only stored once in Pronom. This would help to avoid inconsistencies between signatures.
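
The first idea (check each shared pattern once per file) amounts to memoising pattern results per buffer. A minimal sketch, assuming a hypothetical `match_signatures` function and a simplified signature shape in which each name maps to a list of regexes that must all match a single buffer; Fido's separate start-of-file and end-of-file buffers are ignored here.

```python
import re


def match_signatures(buffer, signatures):
    """Return names of signatures whose patterns all match `buffer`.

    `signatures` maps a name to a list of regex patterns (hypothetical shape).
    Each distinct pattern is evaluated at most once per buffer.
    """
    cache = {}

    def matches(pattern):
        if pattern not in cache:
            cache[pattern] = re.search(pattern, buffer) is not None
        return cache[pattern]

    return [name for name, patterns in signatures.items()
            if all(matches(p) for p in patterns)]
```

With this shape, a zip header pattern shared by every member of the Zip family is searched once, however many signatures refer to it.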

Fix install for linux

The setup.py is not quite right for linux (although the windows installer works fine). The setup.py should probably be up a directory.

Some pdf and htm files are not recognised

With the v40 signatures, some files are not being correctly identified. These include some pdf's, doc's, htm's, and mov's. Need to (1) check if the behaviour has changed from v39; (2) check the signatures; (3) patch-up any missing signatures.

Fix fetching example URLs when updating signature

When updating signatures, if the format has a ReferenceFileIdentifier of type URL, we include a reference to it, including fetching it and calculating a checksum. However, ReferenceFileIdentifier is not consistent in its meaning or format.

E.g. in PRONOM 88, the identifiers for fmt/11 start with www and point directly at PNG files:

<ReferenceFileIdentifier>
  <Identifier>www.w3.org/Graphics/PNG/nurbcup2si.png</Identifier>
  <IdentifierType>URL</IdentifierType>
</ReferenceFileIdentifier>
...
<ReferenceFileIdentifier>
  <Identifier>www.w3.org/Graphics/PNG/666.png</Identifier>
  <IdentifierType>URL</IdentifierType>
</ReferenceFileIdentifier>

compared to fmt/569, where the identifier starts with http:// and points at an HTML page linking to examples:

<ReferenceFileIdentifier>
  <Identifier>http://www.matroska.org/downloads/test_w1.html</Identifier>
  <IdentifierType>URL</IdentifierType>
</ReferenceFileIdentifier>

When parsing it, we prepend http:// and fetch it, which breaks for identifiers that already include a scheme, such as http://www.matroska.org/downloads/test_w1.html:

url = "http://" + get_text_tna(id, 'Identifier')
...
sock = urlopen(url)

Options include removing the examples and checksums from formats-v##.xml, or adding error handling around that section.
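
The second option (error handling around that section) could look something like the sketch below. `normalise_url` and `fetch_example` are hypothetical helper names, the timeout value is arbitrary, and the set of exceptions caught is an assumption about which failures should be tolerated.

```python
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2


def normalise_url(identifier):
    """Prepend http:// only when the identifier has no scheme."""
    identifier = identifier.strip()
    if identifier.startswith(("http://", "https://")):
        return identifier
    return "http://" + identifier


def fetch_example(identifier, timeout=10):
    """Return the bytes behind `identifier`, or None if fetching fails."""
    try:
        sock = urlopen(normalise_url(identifier), timeout=timeout)
        try:
            return sock.read()
        finally:
            sock.close()
    except (IOError, ValueError):
        # Network errors and malformed URLs should not abort the update.
        return None
```

This does not address the deeper inconsistency that some identifiers point at a sample file and others at an HTML page listing samples, so a checksum computed from the response may still be meaningless.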

Signature file improvements?

Some ideas for improving how signature files are handled

One idea is to use Roy to create signature files. This would allow access to a wider range of formats since it can include information from Apache Tika, freedesktop.org MIME-info and Library of Congress FDDs.

Another is to use DROID's signature files directly, since they're what PRONOM offers by default, and not have to perform a transformation. This may be a simpler change than above.

Other suggestions welcome!

Fido identifies PowerPoint file as Excel

Dev Effort

0.5D

Description

The following file is misidentified by fido via its container signature: https://drive.google.com/file/d/0B_ULgjJDmvCkRGdzREc1WFB6Ym8/edit?usp=sharing (File => Download will download the original unconverted document.)

fido returns the following ID:

OK,550,fmt/61,"Microsoft Excel 97 Workbook (xls)","BIFF 8 & 8X Workbook (generic)",2706944,"d775b31c-b627-4f7c-908a-9a3502e18e69_Archivematica-0.6-alpha-screenshots.ppt","application/vnd.ms-excel","container"

Whereas DROID returns:

/home/mistydemeo/artefactual/archivematica/src/MCPServer/share/sharedDirectoryStructure/watchedDirectories/workFlowDecisions/selectFormatIDToolTransfer/maildir-dd365c04-0c91-4bbc-9e4a-b24c27533af1/objects/attachments/Gmail.Sent_Mail/cur/d775b31c-b627-4f7c-908a-9a3502e18e69_Archivematica-0.6-alpha-screenshots.ppt,fmt/126

fmt/126 (Microsoft Powerpoint Presentation (97-XP)) is the correct ID for this file.

Looks like this is a buffer size issue again (cf. #41): if I increase the buffer size to 1MB, the file is identified correctly. Is this just expected behaviour in the default config?

Release process

Github holds the source. We also need a place to put the Windows installer. What is the conventional approach?

File extension identification should use the most generic match

I came across an edge case of fido's file extension identification when trying to look at some invalid XML. It looks like fido is following normal signature precedence rules, even when that doesn't make as much sense for non-signature identification.

For example, look at this file: https://gist.github.com/mistydemeo/8967705/raw/f273b8df4ee2998776fafcd6b1e99b94549181a8/pointer.xml

There's no signature match, since it's missing an XML declaration. When fido falls back to using file extension, though, it has this curious result:

FIDO v1.3.1 (formats-v70.xml, container-signature-20130501.xml, format_extensions.xml)
OK,130,fmt/121,"DROID Signature File Format","External",2779,"pointer.xml","text/xml","extension"
OK,130,fmt/120,"DROID File Collection File Format","External",2779,"pointer.xml","text/xml","extension"
FIDO: Processed      1 files in 170.00 msec,  6 files/sec

It's not exactly a DROID signature file! Turns out that fmt/121 declares precedence over fmt/101 (XML), and fido duly follows that when identifying by extension.

Given that extension matching is spotty, and only happens when specific matches haven't occurred, I think fido's behaviour should be the opposite here - it should match the most general format instead of following precedence to the most specific.
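
One way to express "match the most general format" is to keep only the extension candidates that do not claim priority over another candidate. The sketch below assumes a hypothetical `most_general` function and a plain dict of priority relations; only the fmt/121 over fmt/101 relation in the usage example comes from the report above.

```python
def most_general(candidates, priority):
    """Keep candidates that do not claim priority over another candidate.

    `priority` maps a PUID to the PUIDs it has priority over (hypothetical shape).
    """
    chosen = set(candidates)
    general = [puid for puid in candidates
               if not chosen.intersection(priority.get(puid, ()))]
    # Fall back to all candidates if every one overrides another (a cycle).
    return general or list(candidates)


# fmt/121 declares precedence over fmt/101 (XML).
result = most_general(["fmt/121", "fmt/101"], {"fmt/121": ["fmt/101"]})
```

This inverts the usual precedence filter, so it would only be applied to extension matches, never to signature matches.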

Adobe Illustrator 14 file identified as PDF 1.5, not AI

This Adobe Illustrator sample is being misidentified in fido 1.3.1 using the PRONOM v70 signatures: https://github.com/artefactual/archivematica-sampledata/raw/master/SampleTransfers/Images/BBhelmet.ai

The file is an Illustrator 14 (CS4) file (fmt/563), but is being identified as PDF 1.5 (fmt/19). This isn't actually wrong per se (since AI files are a superset of PDF), but isn't fully accurate. DROID 6.1.2, using the same v70 signature files, correctly identifies the file as fmt/563.

xls format gives two results

When I analyze an xls document (http://lecompagnon.info/demos/demoxl1.xls),
fido gives me two results:

[0] => Array
    (
        [result] => OK
        [puid] => fmt/62
        [formatname] => Microsoft Excel 2000-2003 Workbook (xls)
        [version] => 8X
        [signaturename] => BIFF 8 & 8X Workbook (generic)
        [mimetype] => application/vnd.ms-excel
    )

[1] => Array
    (
        [result] => OK
        [puid] => fmt/61
        [formatname] => Microsoft Excel 97 Workbook (xls)
        [version] => 8
        [signaturename] => BIFF 8 & 8X Workbook (generic)
        [mimetype] => application/vnd.ms-excel
    )
