csv-validator's People

Contributors

adamretter, andy1138, benjaminparker, davidainslie, dependabot-preview[bot], dependabot[bot], etorreborre, jessflan, jim-collins, lauradamiantna, luketebbs, nickiwelch, nyango, rhubner, rwalpole, sparkhi, valydia, yysdsk

csv-validator's Issues

Case or switch conditional statement as alternative to nested ifs

The schemas we are writing, particularly for transcription metadata, often end up with some fairly complex nested if statements that are quite hard to read, and it is slightly difficult to keep track of the brackets. Suggested syntax:

case((ConditionalExpr1, NonConditionalExpr1), (ConditionalExpr2, NonConditionalExpr2), ...)

This could also have a final else (or other) clause as a catch-all.
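
A purely hypothetical sketch of the proposed syntax against an equivalent nested if (the case and else keywords are illustrative only, not part of the current CSV Schema language, and the column and values are made up):

if($metadata_type/is("ITEM"), range(1,613), if($metadata_type/is("SUBITEM"), range(1,99), is("")))

could instead be written as:

case(($metadata_type/is("ITEM"), range(1,613)), ($metadata_type/is("SUBITEM"), range(1,99)), else(is("")))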

Document usage

Document how to use the CSV Validator tool from:

  • Command Line
  • Scala
  • Java

Non-fail-fast reports failure from cmd-line on warning

When executing the CSV Validator from the cmd-line app in non-fail-fast mode, it exits with a non-zero status if it encounters a warning. It should only exit with a non-zero status if there is an error (not a warning).

Kind of related to #70

When processing large files, progress bar is not updated

Having recently received two batches of videos from a government department, it appears that when processing a relatively small number of large files the progress bar does not update for some reason. File sizes were in the range of tens of gigabytes: one batch had 36 files, the other 45, with an average file size of 32 GB. The CSV files eventually passed validation, but the progress bar never changed from its original blank state. Both fileExists and checksum checks were being used in the schema.

Update README.md

  • Remove material that has been moved to the project web pages, and link to those pages instead.
  • Add build instructions.

Combinatorial Expressions as condition in Conditional Expressions

When you are using a Conditional Expression such as "if" you cannot currently use Combinatorial Expressions as the condition, i.e. the following is currently invalid, but should be valid -

if(starts("a") or starts("b"), ends("10"))

You should be able to use all Combinatorial Expressions (e.g. "or" and "and") in the condition of a Conditional Expression.

upper/lower expressions

It would be nice to have expressions that assert that text in a field is either all upper-case or all lower-case. e.g.

field1: upper
field2: lower

DateTime checks don't enforce timezone inclusion

As the time zone is not mandatory in the ISO standard, the xDateTime check does not flag timestamps which do not include timezone information. However, we know that the absence of timezone information causes problems for record openings. We probably need an enhanced version of xDateTime: either a completely new expression that makes the timezone mandatory, i.e. xDateTimeWithTimezone, or a flag added to the existing expression to indicate that timezones are mandatory. Either would take the existing regex for XsdDateTimeStringLiteral ::= """ -?[0-9]{4}-(((0(1|3|5|7|8)|1(0|2))-(0[1-9]|(1|2)[0-9]|3[0-1]))|((0(4|6|9)|11)-(0[1-9]|(1|2)[0-9]|30))|(02-(0[1-9]|(1|2)[0-9])))T([0-1][0-9]|2[0-4]):(0[0-9]|[1-5][0-9]):(0[0-9]|[1-5][0-9])(.[0-999])?((+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z)? """ and make the last section mandatory, i.e. XsdDateTimeStringLiteralWithTimezone ::= """ -?[0-9]{4}-(((0(1|3|5|7|8)|1(0|2))-(0[1-9]|(1|2)[0-9]|3[0-1]))|((0(4|6|9)|11)-(0[1-9]|(1|2)[0-9]|30))|(02-(0[1-9]|(1|2)[0-9])))T([0-1][0-9]|2[0-4]):(0[0-9]|[1-5][0-9]):(0[0-9]|[1-5][0-9])(.[0-999])?((+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z) """ (i.e. removing the ? that follows the (...|Z) timezone group).

It would probably make sense to make similar changes to xTime too.
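
A hypothetical sketch of the first option, a new expression applied to an illustrative column (xDateTimeWithTimezone does not exist in the current language):

opening_date: xDateTimeWithTimezone

The flag-based alternative would keep the existing xDateTime name and add an argument indicating that the timezone is mandatory.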

Allow date and UK date expressions to take textual representations of months

Presently the date expression and UK date expression only take numeric representations for month values (i.e. integers 1-12). However, the Scanning and Transcription Framework, and the derived work for the 1939 Register, specify that the month column should be supplied as strings: January, February, March, April, ..., December. This means we can't currently fully validate these dates in the supplied CSV files. There may also be a case for allowing the common three-letter abbreviations for month names (i.e. Jan, Feb, Mar, ..., Dec), though that would be more of an enhancement.
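
Until such support exists, one possible interim check using the existing regex expression might be the following (column name illustrative; this only constrains the month string itself, it does not validate the date as a whole):

month: regex("(January|February|March|April|May|June|July|August|September|October|November|December)")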

Add total number of rows processed

Experience of running validation processes suggests it would be useful if the output always included the total number of data lines processed in a validation run. Currently line numbers are given for errors, but if one wants to reconcile the total number of lines in a CSV file against, e.g., the number of images delivered, one has to manually open the metadata file as well.

Status bar only updates when validation completes

The new status bar only updates when validation has completed - and never actually reaches 100%.

It would be useful if the output cleared when a new validation run started (rather than leaving previous messages there), and if it could also write "live" so you could see new validation errors as they occur.

Progress bar for the GUI

It would be good to have a progress bar on the GUI so you can tell that it is working and how long it is taking to work through the rows.

Improve performance of checksum expressions

The National Archives have reported that, when validating large CSV files where a row or rows reference files (and those files are large) and the schema includes a checksum expression, the validation process can be quite slow.

Inside the validator each row is validated sequentially. If a row is slow to validate, for example due to a checksum operation, it slows down the entire validation process. Some sort of multi-threading should be introduced to allow multiple checksums to be calculated in parallel, or better still, to allow out-of-order concurrent row validation.

Implement Partial Date Expression

The schema language defines the Partial Date Expression, but this has not yet been implemented in the CSV Validator, even though it is most common for transcription projects to record the date in separate columns, one each for day, month and year.

Create "identical" test

Some data should be the same in every row; in the standard TNA use case this would be, for example, batch_code. It differs from batch to batch, so we cannot use an "is" test without updating the schema for every batch, but we do know it should not change within the metadata received for a single batch. This is in a sense the converse of the "unique" test.
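
A hypothetical sketch of how such a test might read, using the batch_code example above (identical is not part of the current language):

batch_code: identical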

Error report sometimes truncated

With the test:
sub_schedule_no: if($metadata_type/regex("(SUBITEM_NAME)|(SUBITEM_QADDRESS)"), range(1,613) and unique($piece,$schedule_no,$sub_schedule_no), if($metadata_type/is("SUBITEM_CONNAME"),range(1,613),is("")))

When it is the if test that is taken and fails (i.e. "range(1,613) and unique($piece,$schedule_no,$sub_schedule_no)"), the error output only shows:

range(1,613) and unique( fails for line

Rather than showing the full test and reporting the line where the supposedly unique value was previously seen.

fails to parse over-complex path

/mnt/imf/RW_32/content/EA-TNA1108.www.publicappts-vacs.gov.uk~%28sstth555zlg2npizrrzqgo45%29Default.aspx/EA-TNA1108.www.publicappts-vacs.gov.uk%28sstth555zlg2npizrrzqgo45%29~Default.aspx-20081211214944-00000.arc.gz

returns 'fileExists fails', although the file is present.

Generate command line params from GUI

I think there might be a use case for allowing a user of the GUI to generate an equivalent set of command line parameters to the current settings being used in the GUI. This would allow less technical staff to test settings and schema in the GUI locally, and then supply appropriate parameters as a basis for running via command line for production processing.

This might also require that the GUI download also includes the .bat/shell script for running via the command line (at the moment you'd have to also download the command line version, the bulk of which is jar files identical to those already downloaded for the GUI, minus the GUI components).

Valid strings for FileNameExpr

FileNameExpr is defined as a simple StringLiteral, but is it intended that, in addition to simply representing the name of a file, this could be an expression such as * or *.jp2 to count all files, or all files with a particular extension, within a directory? (The path to the directory would be given by an optional filepath in the prepended StringProvider, as FileNameExpr is only used in the second argument to a FileExpr; see https://digital-preservation.github.io/csv-validator/csv-schema-1.0.html#file-related-sub-expressions)
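
The kind of usage being asked about, sketched with the fileCount expression (the column and folder names are illustrative, and whether a wildcard FileNameExpr is actually permitted is precisely the open question):

image_count: fileCount(file($image_folder, "*.jp2"))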

@separator doesn't exist

I was trying to use the @separator tag and it isn't working. I checked out your SchemaParser.scala file and I saw that there is no function for the @separator tag. Is this going to be implemented soon?

Thanks!

There is no "empty" expression

There is a "notEmpty" expression, but there is not an "empty" expression.

Whilst it is possible to express an empty column by using either -

is("")

or even -

length(0)

this does not read very well in complex rules. It would perhaps be better to be able to write something like -

if($other/starts("a"), is("a4"), empty)

as opposed to currently writing something like -

if($other/starts("a"), is("a4"), is(""))

EBNF does not match implementation

The EBNF has become outdated as the implementation progressed. We need to do two things -

  1. Update the EBNF to match terminal names in the implementation
  2. Document where constructs in the EBNF are currently not implemented in the CSV Validation Tool.

Support @separator ;

Hi,

I've tried to call csv-validator from Java to validate a CSV separated with the ; character, specifying the directive @separator ; as defined in the CSV Schema, but it throws an Error. Is this currently not supported?

Cheers,

Mariano.
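
For reference, a sketch of a schema using the separator directive roughly as the CSV Schema specification describes it (the exact quoting of the separator character is an assumption, and as this issue suggests, the directive may not yet be implemented in the validator):

version 1.0
@separator ';'
@totalColumns 2
col1: notEmpty
col2: notEmpty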

FailFast reports PASS even if there are warnings

FailFast is intended not to stop on warnings, but it is actually suppressing them altogether, reporting PASS even though running the same validation without the FailFast switch does produce warnings.

regex syntax is not checked during parse inside conditional expr

Using the CSV file -

col1,col2,col3
v1,v2,v3

Using the CSVS file -

version 1.0
@totalColumns 3
col1: is("v1")
col2: if(is("v8"), is("v2"), regex("*"))
col3: is("v3")

Results in the stack trace at runtime -

Dangling meta character '*' near index 0
*
^
at java.util.regex.Pattern.error(Pattern.java:1713)
at java.util.regex.Pattern.sequence(Pattern.java:1878)
at java.util.regex.Pattern.expr(Pattern.java:1752)
at java.util.regex.Pattern.compile(Pattern.java:1460)
at java.util.regex.Pattern.<init>(Pattern.java:1133)
at java.util.regex.Pattern.compile(Pattern.java:823)
at java.util.regex.Pattern.matches(Pattern.java:928)
at java.lang.String.matches(String.java:2090)
at uk.gov.tna.dri.schema.RegexRule.valid(Rule.scala:158)
at uk.gov.tna.dri.schema.Rule.evaluate(Rule.scala:34)
at uk.gov.tna.dri.schema.IfRule$$anonfun$5.apply(Rule.scala:131)
at uk.gov.tna.dri.schema.IfRule$$anonfun$5.apply(Rule.scala:130)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:309)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at uk.gov.tna.dri.schema.IfRule.evaluate(Rule.scala:130)

This implies that the problem with the regular expression syntax is not caught at Schema parse time, but is unexpectedly found at evaluation time.

However if you use the CSVS file -

version 1.0
@totalColumns 3
col1: is("v1")
col2: regex("*")
col3: is("v3")

The problem with the regular expression syntax is caught at Schema parse time. Therefore the issue seems to be with using invalid regular expressions inside conditional statements like 'if'.

min/max functions

Consider adding min/max-like functions. Possibly these could be done by overloading the range function in the same way as length.
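
A hypothetical sketch of that overloading, by analogy with the wildcard form of length (neither form currently exists for range; the column names are illustrative):

price: range(0, *)
discount: range(*, 100)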

Cross check values in other rows

The image acquisition metadata file consistency checks say that where image_split is("yes") we should be able to cross-check that image_split_other_uuid refers to another file_uuid within the same CSV file, and that in the row for that other image, image_split_other_uuid refers back to the file_uuid of the row we are actually validating.

integrityCheck bug

While validating the TNA transfer with redacted files, we tried to add an integrityCheck constraint to the schema as it appeared that references to some original files had been omitted from the metadata file.

integrityCheck was introduced in Pull Request #88 and has one mandatory parameter. On the first two attempts to include integrityCheck the mandatory parameter was mis-specified: the first time it was omitted in error, and the second time it wasn't quote-wrapped as it should be. However, on both occasions the validator ran rather than reporting a schema error, as should have occurred.

When the mandatory parameter was correctly included the validator reported PASS even though there were files in the content folder which were not referenced in the metadata.

As defined, TABExpr accepts 0 or -ve values

TABExpr is defined as taking an IntegerLiteral as its (optional) parameter. IntegerLiteral is ultimately defined as an Integer value, which allows 0 and negative values. TABExpr is used to supply a number of space characters to be recognised as the separator value in a CSV file, so 0 and negative values have no sensible meaning; this parameter should therefore be constrained to PositiveNonZeroIntegerLiteral.

Some form of concatenation as string provider

Many of the column validation expressions take a string provider as input; presently this can be either a simple string or a reference to a field. However, it would be useful to be able to concatenate the content of several fields, for example for the URI contained within the metadata that is embedded into JP2 files, which is included in the standard technical metadata files as the resource_uri. This is formed from a base URL plus catalogue information and the UUID for the image, all of which are also available in the metadata file. We can presently do a certain amount of cross-checking, but with a risk of false positives: we can say that the value for the piece must be present in the resource_uri, but as pieces often have single-digit values such as 1, there is a strong risk that the digit would appear somewhere else in the URI (e.g. within the UUID) rather than in the position that actually represents the piece reference.
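
A hypothetical sketch of what a concatenating string provider might allow (concat, the base URL and the column names here are illustrative only, not part of the current language):

resource_uri: is(concat("http://example.gov.uk/records/", $piece, "/", $file_uuid))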

any expression

It would be easier if we had an 'any' expression rather than combining several string options with 'or'.

e.g. This -

is("a") or is("b") or is("c")

Could be replaced with -

any("a", "b", "c")

This is arguably more readable, especially when many 'or' expressions are used.

Extend checksum expression to optionally produce warning or error if checksum value is for an empty message

The cryptographic hash functions will each produce a consistent hash value when presented with an empty file (or any empty message). This value is different for each hash function, but a given function will always produce the same checksum for an empty message.

For the case where we are receiving digitised files, a checksum indicating the file is empty would be considered an error; for born-digital files it may be admissible, but it would probably be useful for Digital Preservation to be given the affected files as a list of warnings, to see whether there is any case for including the empty files in the accession, or whether they should be dropped as not being a record.
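
As an interim measure a schema could flag the well-known empty-message digest directly, albeit as an error rather than a warning. For example, the SHA-256 digest of an empty message is e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855, so (assuming the existing not(...) expression, with illustrative column names) something like:

checksum: checksum(file($file_path, $file_name), "SHA-256") and not("e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855")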

Deeply nested if statements seem to cause validator to hang

In the WO 95 project there are quite involved relationships between piece numbers and the sub-sub-series and sub-series to which they are expected to belong. I attempted to verify that the scanning lists we are sending to the digitisation supplier had this information correct using a set of nested if statements; however, it seems that the depth of nesting required causes the validator to hang while parsing the schema, and it never actually begins validating the CSV file.

The statement was:

if($sub_series/is("1"),if($sub_sub_series/is("8"),range(572,587),if($sub_sub_series/is("10"),range(588,628),if($sub_sub_series/is("11"),range(629,667),if($sub_sub_series/is("12"),range(668,705),if($sub_sub_series/is("13"),range(706,742),if($sub_sub_series/is("14"),range(743,766),if($sub_sub_series/is("16"),range(767,803),if($sub_sub_series/is("17"),range(804,819),if($sub_sub_series/is("18"),range(820,834),if($sub_sub_series/is("19"),range(835,849),if($sub_sub_series/is("20"),range(850,879),if($sub_sub_series/is("21"),range(880,893),if($sub_sub_series/is("22"),is("894"),if($sub_sub_series/is("23"),range(895,909),if($sub_sub_series/is("24"),range(910,920),if($sub_sub_series/is("25"),range(921,933),if($sub_sub_series/is("26"),range(934,950),if($sub_sub_series/is("27"),range(951,958),if($sub_sub_series/is("28"),range(959,973),if($sub_sub_series/is("29"),range(974,979),if($sub_sub_series/is("30"),range(980,1031),if($sub_sub_series/is("31"),range(1032,1044),if($sub_sub_series/is("32"),range(1045,1087),range(1088,1095)))))))))))))))))))))))),if($sub_sub_series/is("1"),range(4965,5008),if($sub_sub_series/is("2"),range(5009,5010),if($sub_sub_series/is("3"),range(5011,5012),if($sub_sub_series/is("4"),range(5013,5026),if($sub_sub_series/is("5"),range(5027,5031),if($sub_sub_series/is("6"),range(5032,5041),if($sub_sub_series/is("7"),range(5042,5044),if($sub_sub_series/is("8"),range(5045,5047),if($sub_sub_series/is("9"),range(5048,5051),if($sub_sub_series/is("10"),range(5052,5060),if($sub_sub_series/is("11"),range(5061,5081),if($sub_sub_series/is("12"),range(5082,5083),if($sub_sub_series/is("13"),range(5084,5093),if($sub_sub_series/is("14"),range(5094,5111),if($sub_sub_series/is("15"),range(5112,5126),if($sub_sub_series/is("16"),range(5127,5141),if($sub_sub_series/is("17"),range(5142,5146),if($sub_sub_series/is("18"),range(5147,5162),if($sub_sub_series/is("19"),range(5163,5181),if($sub_sub_series/is("20"),range(5182,5199),if($sub_sub_series/is("21"),range(5200,5214),if($sub_sub_series/is("22"),range(5215,5230),if($sub_sub_series/is("23"),range(5231,5284),range(5285,5288)))))))))))))))))))))))))

In order to get WO95_scanning_list_Y15.csvs working this was replaced with a simpler sequence of basic range checks on the piece numbers (the original statement given here is still in the schema, but commented out), but unfortunately this does not verify that the pieces are correctly placed within the archival hierarchy.

We also validate the CSV file returned by the digitisation company when they return images etc. to us, and ideally the schema used then would also verify these relationships.

Empty lines still cause array out of bounds exception

Commit b205d97 of Dec 9 should have fixed any problems caused by a carriage return and subsequent empty line at the end of the file, but it doesn't seem to work in all cases (CSV file on a Linux system with a final CR and no spaces or other characters on the final empty line).

Enhance validator output to make filtering by error type more straightforward

Experience of using the validator in live projects suggests that it is often useful to be able to concentrate on one error type at a time. If the output were more structured it would be easier to import the data into other programs (e.g. Excel) and use the filtering options available there to examine error types one by one and understand the underlying issues. This is especially true for transcription data, where there are often different systemic issues affecting the transcription of each individual field.

Suggest at least a CSV output format, but could consider XML too.
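
A hypothetical sketch of what a more structured, CSV-style error record might look like (the column headings and values are illustrative only):

severity,line,column,rule,message
error,1024,sub_schedule_no,"range(1,613)","value out of range"

Rows of such a report could then be filtered by severity, column or rule in Excel or similar tools.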

Url rule

We have a UriRule for checking if a URI is valid with the "uri" schema expression. It would be good to have a corresponding "url" rule.
