csv-validator's People

Contributors

adamretter, andy1138, benjaminparker, davidainslie, dependabot-preview[bot], dependabot[bot], etorreborre, jessflan, jim-collins, lauradamiantna, luketebbs, nickiwelch, nyango, rhubner, rwalpole, sparkhi, valydia, yysdsk

csv-validator's Issues

Case or switch conditional statement as alternative to nested ifs

The schemas we are writing, particularly for transcription metadata, often end up with some fairly complex nested if statements that are quite hard to read, and it is slightly difficult to keep track of the brackets. Suggested syntax:

case((ConditionalExpr1, NonConditionalExpr1), (ConditionalExpr2, NonConditionalExpr2), ...)

This could also have a final else (or other) clause as a catch-all.
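
A purely hypothetical sketch of the proposed syntax against an equivalent nested if (the case and else keywords are illustrative only, not part of the current CSV Schema language, and the column and values are made up):

if($metadata_type/is("ITEM"), range(1,613), if($metadata_type/is("SUBITEM"), range(1,99), is("")))

could instead be written as:

case(($metadata_type/is("ITEM"), range(1,613)), ($metadata_type/is("SUBITEM"), range(1,99)), else(is("")))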

Document usage

Document how to use the CSV Validator tool from:

  • Command Line
  • Scala
  • Java

Non-fail-fast reports failure from cmd-line on warning

When executing the CSV Validator from the cmd-line app in non-fail-fast mode, it exits with a non-zero status if it encounters a warning. It should only exit with a non-zero status if there is an error (not a warning).

Kind of related to #70

When processing large files, progress bar is not updated

Having recently received two batches of videos from a government department, it appears that when processing a relatively small number of large files the progress bar does not update for some reason. File sizes were in the range of tens of gigabytes: one batch had 36 files, the other 45, with an average file size of 32 GB. The CSV files eventually passed validation, but the progress bar never changed from its original blank state. Both fileExists and checksum checks were being used in the schema.

Update README.md

  • Remove material that has been moved to the project web pages, and link to those pages instead.
  • Add build instructions.

Combinatorial Expressions as condition in Conditional Expressions

When you are using a Conditional Expression such as "if" you cannot currently use Combinatorial Expressions as the condition, i.e. the following is currently invalid, but should be valid -

if(starts("a") or starts("b"), ends("10"))

You should be able to use all Combinatorial Expressions (e.g. "or" and "and") in the condition of a Conditional Expression.

upper/lower expressions

It would be nice to have expressions that assert that text in a field is either all upper-case or all lower-case. e.g.

field1: upper
field2: lower

DateTime checks don't enforce timezone inclusion

As the time zone is not mandatory in the ISO standard, the xDateTime check does not flag timestamps which do not include timezone information. However, we know that the absence of timezone information causes problems for record openings. We probably need an enhanced version of xDateTime: either a completely new expression that makes the timezone mandatory, i.e. xDateTimeWithTimezone, or a flag added to the existing expression to indicate that timezones are mandatory. Either would take the existing regex for XsdDateTimeStringLiteral ::= """ -?[0-9]{4}-(((0(1|3|5|7|8)|1(0|2))-(0[1-9]|(1|2)[0-9]|3[0-1]))|((0(4|6|9)|11)-(0[1-9]|(1|2)[0-9]|30))|(02-(0[1-9]|(1|2)[0-9])))T([0-1][0-9]|2[0-4]):(0[0-9]|[1-5][0-9]):(0[0-9]|[1-5][0-9])(.[0-999])?((+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z)? """ and make the last section mandatory, i.e. XsdDateTimeStringLiteralWithTimezone ::= """ -?[0-9]{4}-(((0(1|3|5|7|8)|1(0|2))-(0[1-9]|(1|2)[0-9]|3[0-1]))|((0(4|6|9)|11)-(0[1-9]|(1|2)[0-9]|30))|(02-(0[1-9]|(1|2)[0-9])))T([0-1][0-9]|2[0-4]):(0[0-9]|[1-5][0-9]):(0[0-9]|[1-5][0-9])(.[0-999])?((+|-)(0[1-9]|1[0-9]|2[0-4]):(0[0-9]|[1-5][0-9])|Z) """ (i.e. removing the ? that follows the (...|Z) timezone group).

It would probably make sense to make similar changes to xTime too.
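
A hypothetical sketch of the first option, a new expression applied to an illustrative column (xDateTimeWithTimezone does not exist in the current language):

opening_date: xDateTimeWithTimezone

The flag-based alternative would keep the existing xDateTime name and add an argument indicating that the timezone is mandatory.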

Allow date and UK date expressions to take textual representations of months

Presently the date expression and UK date expression only take numeric representations for month values (i.e. integers 1-12). However, the Scanning and Transcription Framework, and the derived work for the 1939 Register, specify that the month column should be supplied as strings: January, February, March, April, ..., December. This means we can't currently fully validate these dates in the supplied CSV files. There may also be a case for allowing the common three-letter abbreviations for month names (i.e. Jan, Feb, Mar, ..., Dec), though that would be more of an enhancement.
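
Until such support exists, one possible interim check using the existing regex expression might be the following (column name illustrative; this only constrains the month string itself, it does not validate the date as a whole):

month: regex("(January|February|March|April|May|June|July|August|September|October|November|December)")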

Add total number of rows processed

Experience of running validation processes suggests it would be useful if the output always included the total number of data lines processed in a validation run. Currently line numbers are given for errors, but if one wants to reconcile the total number of lines in a CSV file against, e.g., the number of images delivered, one has to manually open the metadata file as well.

Status bar only updates when validation completes

The new status bar only updates when validation has completed - and never actually reaches 100%.

It would be useful if the output cleared when a new validation run started (rather than leaving previous messages there), and if it could also write "live" so you could see new validation errors as they occur.

Progress bar for the GUI

It would be good to have a progress bar on the GUI so you can tell that it is working and how long it is taking to work through the rows.

Improve performance of checksum expressions

The National Archives have reported that, when validating large CSV files where a row or rows reference files (and those files are large) and the schema includes a checksum expression, the validation process can be quite slow.

Inside the validator each row is validated sequentially. If a row is slow to validate, for example due to a checksum operation, it slows down the entire validation process. Some sort of multi-threading should be introduced to allow multiple checksums to be calculated in parallel, or better still, to allow out-of-order concurrent row validation.

Implement Partial Date Expression

The schema language defines the Partial Date Expression, but this has not yet been implemented in the CSV Validator, even though it is most common for transcription projects to record the date in separate columns, one each for day, month and year.

Create "identical" test

Some data should be the same in every row; in the standard TNA use case this would be, for example, batch_code. It differs from batch to batch, so we cannot use an "is" test without updating the schema for every batch, but we do know it should not change within the metadata received for a single batch. This is in a sense the converse of the "unique" test.
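
A hypothetical sketch of how such a test might read, using the batch_code example above (identical is not part of the current language):

batch_code: identical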

Error report sometimes truncated

With the test:
sub_schedule_no: if($metadata_type/regex("(SUBITEM_NAME)|(SUBITEM_QADDRESS)"), range(1,613) and unique($piece,$schedule_no,$sub_schedule_no), if($metadata_type/is("SUBITEM_CONNAME"),range(1,613),is("")))

When it is the if test that is taken and fails (i.e. "range(1,613) and unique($piece,$schedule_no,$sub_schedule_no)"), the error output only shows:

range(1,613) and unique( fails for line

Rather than showing the full test and reporting the line where the supposedly unique value was previously seen.

fails to parse over-complex path

/mnt/imf/RW_32/content/EA-TNA1108.www.publicappts-vacs.gov.uk~%28sstth555zlg2npizrrzqgo45%29Default.aspx/EA-TNA1108.www.publicappts-vacs.gov.uk%28sstth555zlg2npizrrzqgo45%29~Default.aspx-20081211214944-00000.arc.gz

returns 'fileExists fails', although the file is present.

Generate command line params from GUI

I think there might be a use case for allowing a user of the GUI to generate an equivalent set of command line parameters to the current settings being used in the GUI. This would allow less technical staff to test settings and schema in the GUI locally, and then supply appropriate parameters as a basis for running via command line for production processing.

This might also require that the GUI download also includes the .bat/shell script for running via the command line (at the moment you'd have to also download the command line version, the bulk of which is jar files identical to those already downloaded for the GUI, minus the GUI components).

Valid strings for FileNameExpr

FileNameExpr is defined as a simple StringLiteral, but is it intended that, in addition to simply representing the name of a file, this could be an expression such as * or *.jp2 to count all files, or all files with a particular extension, within a directory? (The path to the directory would be given by an optional filepath in the prepended StringProvider, as FileNameExpr is only used in the second argument to a FileExpr; see https://digital-preservation.github.io/csv-validator/csv-schema-1.0.html#file-related-sub-expressions)
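
The kind of usage being asked about, sketched with the fileCount expression (the column and folder names are illustrative, and whether a wildcard FileNameExpr is actually permitted is precisely the open question):

image_count: fileCount(file($image_folder, "*.jp2"))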

@separator doesn't exist

I was trying to use the @separator tag and it isn't working. I checked out your SchemaParser.scala file and I saw that there is no function for the @separator tag. Is this going to be implemented soon?

Thanks!

There is no "empty" expression

There is a "notEmpty" expression, but there is not an "empty" expression.

Whilst it is possible to express an empty column by using either -

is("")

or even -

length(0)

this does not read very well in complex rules. It would perhaps be better to be able to write something like -

if($other/starts("a"), is("a4"), empty)

as opposed to currently writing something like -

if($other/starts("a"), is("a4"), is(""))

EBNF does not match implementation

The EBNF has become outdated as the implementation progressed. We need to do two things -

  1. Update the EBNF to match terminal names in the implementation
  2. Document where constructs in the EBNF are currently not implemented in the CSV Validation Tool.

Support @separator ;

Hi,

I've tried to call csv-validator from Java to validate a CSV separated with the ; character, specifying the directive @separator ; as defined in the CSV Schema, but it throws an Error. Is this currently not supported?

Cheers,

Mariano.
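
For reference, a sketch of a schema using the separator directive roughly as the CSV Schema specification describes it (the exact quoting of the separator character is an assumption, and as this issue suggests, the directive may not yet be implemented in the validator):

version 1.0
@separator ';'
@totalColumns 2
col1: notEmpty
col2: notEmpty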

FailFast reports PASS even if there are warnings

FailFast is intended not to stop on warnings, but it is actually suppressing them altogether, reporting PASS even though running the same validation without the FailFast switch does produce warnings.

regex syntax is not checked during parse inside conditional expr

Using the CSV file -

col1,col2,col3
v1,v2,v3

Using the CSVS file -

version 1.0
@totalColumns 3
col1: is("v1")
col2: if(is("v8"), is("v2"), regex("*"))
col3: is("v3")

Results in the stack trace at runtime -

Dangling meta character '*' near index 0
*
^
at java.util.regex.Pattern.error(Pattern.java:1713)
at java.util.regex.Pattern.sequence(Pattern.java:1878)
at java.util.regex.Pattern.expr(Pattern.java:1752)
at java.util.regex.Pattern.compile(Pattern.java:1460)
at java.util.regex.Pattern.<init>(Pattern.java:1133)
at java.util.regex.Pattern.compile(Pattern.java:823)
at java.util.regex.Pattern.matches(Pattern.java:928)
at java.lang.String.matches(String.java:2090)
at uk.gov.tna.dri.schema.RegexRule.valid(Rule.scala:158)
at uk.gov.tna.dri.schema.Rule.evaluate(Rule.scala:34)
at uk.gov.tna.dri.schema.IfRule$$anonfun$5.apply(Rule.scala:131)
at uk.gov.tna.dri.schema.IfRule$$anonfun$5.apply(Rule.scala:130)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:309)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at uk.gov.tna.dri.schema.IfRule.evaluate(Rule.scala:130)

This implies that the problem with the regular expression syntax is not caught at Schema parse time, but is unexpectedly found at evaluation time.

However if you use the CSVS file -

version 1.0
@totalColumns 3
col1: is("v1")
col2: regex("*")
col3: is("v3")

The problem with the regular expression syntax is caught at Schema parse time. Therefore the issue seems to be with using invalid regular expressions inside conditional statements like 'if'.

min/max functions

Consider adding min/max-like functions. Possibly these could be done by overloading the range function in the same way as length.
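
A hypothetical sketch of that overloading, by analogy with the wildcard form of length (neither form currently exists for range; the column names are illustrative):

price: range(0, *)
discount: range(*, 100)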

Cross check values in other rows

The image acquisition metadata file consistency checks say that where image_split is("yes") we should be able to cross-check that image_split_other_uuid refers to another file_uuid within the same CSV file, and that in the row for that other image, image_split_other_uuid refers back to the file_uuid of the row we are actually validating.

integrityCheck bug

While validating the TNA transfer with redacted files, we tried to add an integrityCheck constraint to the schema as it appeared that references to some original files had been omitted from the metadata file.

integrityCheck was introduced in Pull Request #88 and has one mandatory parameter. On the first two attempts to include integrityCheck the mandatory parameter was mis-specified: the first time it was omitted in error, and the second time it wasn't quote-wrapped as it should be. However, on both occasions the validator ran rather than reporting a schema error, as should have occurred.

When the mandatory parameter was correctly included the validator reported PASS even though there were files in the content folder which were not referenced in the metadata.

As defined, TABExpr accepts 0 or -ve values

TABExpr is defined as taking an IntegerLiteral as its (optional) parameter. IntegerLiteral is ultimately defined as an Integer value, which allows 0 and negative values. TABExpr is used to supply a number of space characters to be recognised as the separator value in a CSV file, so 0 and negative values have no sensible meaning; this parameter should therefore be constrained to PositiveNonZeroIntegerLiteral.

Some form of concatenation as string provider

Many of the column validation expressions take a string provider as input; presently this can be either a simple string or a reference to a field. However, it would be useful to be able to concatenate the content of several fields, for example for the URI contained within the metadata that is embedded into JP2 files, which is included in the standard technical metadata files as the resource_uri. This is formed from a base URL plus catalogue information and the UUID for the image, all of which are also available in the metadata file. We can presently do a certain amount of cross-checking, but with a risk of false positives: we can say that the value for the piece must be present in the resource_uri, but as pieces often have single-digit values such as 1, there is a strong risk that the digit would appear somewhere else in the URI (e.g. within the UUID) rather than in the position that actually represents the piece reference.
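
A hypothetical sketch of what a concatenating string provider might allow (concat, the base URL and the column names here are illustrative only, not part of the current language):

resource_uri: is(concat("http://example.gov.uk/records/", $piece, "/", $file_uuid))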

any expression

It would be easier if we had an 'any' expression rather than combining several string options with 'or'.

e.g. This -

is("a") or is("b") or is("c")

Could be replaced with -

any("a", "b", "c")

This is arguably more readable, especially when many 'or' expressions are used.

Extend checksum expression to optionally produce warning or error if checksum value is for an empty message

The cryptographic hash functions will each produce a consistent hash value when presented with an empty file (or any empty message). This value is different for each hash function, but a given function will always produce the same checksum for an empty message.

For the case where we are receiving digitised files, a checksum indicating the file is empty would be considered an error; for born-digital files it may be admissible, but it would probably be useful for Digital Preservation to be given the affected files as a list of warnings, to see whether there is any case for including the empty files in the accession, or whether they should be dropped as not being a record.
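
As an interim measure a schema could flag the well-known empty-message digest directly, albeit as an error rather than a warning. For example, the SHA-256 digest of an empty message is e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855, so (assuming the existing not(...) expression, with illustrative column names) something like:

checksum: checksum(file($file_path, $file_name), "SHA-256") and not("e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855")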

Deeply nested if statements seem to cause validator to hang

In the WO 95 project there are quite involved relationships between piece numbers and the sub-sub-series and sub-series to which they are expected to belong. I attempted to verify that the scanning lists we are sending to the digitisation supplier had this information correct using a set of nested if statements; however, it seems that the depth of nesting required causes the validator to hang while parsing the schema, and it never actually begins validating the CSV file.

The statement was:

if($sub_series/is("1"),if($sub_sub_series/is("8"),range(572,587),if($sub_sub_series/is("10"),range(588,628),if($sub_sub_series/is("11"),range(629,667),if($sub_sub_series/is("12"),range(668,705),if($sub_sub_series/is("13"),range(706,742),if($sub_sub_series/is("14"),range(743,766),if($sub_sub_series/is("16"),range(767,803),if($sub_sub_series/is("17"),range(804,819),if($sub_sub_series/is("18"),range(820,834),if($sub_sub_series/is("19"),range(835,849),if($sub_sub_series/is("20"),range(850,879),if($sub_sub_series/is("21"),range(880,893),if($sub_sub_series/is("22"),is("894"),if($sub_sub_series/is("23"),range(895,909),if($sub_sub_series/is("24"),range(910,920),if($sub_sub_series/is("25"),range(921,933),if($sub_sub_series/is("26"),range(934,950),if($sub_sub_series/is("27"),range(951,958),if($sub_sub_series/is("28"),range(959,973),if($sub_sub_series/is("29"),range(974,979),if($sub_sub_series/is("30"),range(980,1031),if($sub_sub_series/is("31"),range(1032,1044),if($sub_sub_series/is("32"),range(1045,1087),range(1088,1095)))))))))))))))))))))))),if($sub_sub_series/is("1"),range(4965,5008),if($sub_sub_series/is("2"),range(5009,5010),if($sub_sub_series/is("3"),range(5011,5012),if($sub_sub_series/is("4"),range(5013,5026),if($sub_sub_series/is("5"),range(5027,5031),if($sub_sub_series/is("6"),range(5032,5041),if($sub_sub_series/is("7"),range(5042,5044),if($sub_sub_series/is("8"),range(5045,5047),if($sub_sub_series/is("9"),range(5048,5051),if($sub_sub_series/is("10"),range(5052,5060),if($sub_sub_series/is("11"),range(5061,5081),if($sub_sub_series/is("12"),range(5082,5083),if($sub_sub_series/is("13"),range(5084,5093),if($sub_sub_series/is("14"),range(5094,5111),if($sub_sub_series/is("15"),range(5112,5126),if($sub_sub_series/is("16"),range(5127,5141),if($sub_sub_series/is("17"),range(5142,5146),if($sub_sub_series/is("18"),range(5147,5162),if($sub_sub_series/is("19"),range(5163,5181),if($sub_sub_series/is("20"),range(5182,5199),if($sub_sub_series/is("21"),range(5200,5214),if($sub_sub_series/is("22"),range(5215,5230),if($sub_sub_series/is("23"),range(5231,5284),range(5285,5288)))))))))))))))))))))))))

In order to get WO95_scanning_list_Y15.csvs working this was replaced with a simpler sequence of basic range checks on the piece numbers (the original statement given here is still in the schema, but commented out), but unfortunately this does not verify that the pieces are correctly placed within the archival hierarchy.

We also validate the CSV file returned by the digitisation company when they return images etc. to us, and ideally the schema used then would also verify these relationships.

Empty lines still cause array out of bounds exception

Commit b205d97 of Dec 9 should have fixed any problems caused by a carriage return and subsequent empty line at the end of the file, but it doesn't seem to work in all cases (CSV file on a Linux system with a final CR and no spaces or other characters on the final empty line).

Enhance validator output to make filtering by error type more straightforward

Experience of using the validator in live projects suggests that it is often useful to be able to concentrate on one error type at a time. If the output were more structured it would be easier to import the data into other programs (e.g. Excel) and use the filtering options available there to examine error types one by one and understand the underlying issues. This is especially true for transcription data, where there are often different systemic issues affecting the transcription of each individual field.

Suggest at least a CSV output format, but could consider XML too.
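
A hypothetical sketch of what a more structured, CSV-style error record might look like (the column headings and values are illustrative only):

severity,line,column,rule,message
error,1024,sub_schedule_no,"range(1,613)","value out of range"

Rows of such a report could then be filtered by severity, column or rule in Excel or similar tools.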

Url rule

We have a UriRule for checking if a URI is valid with the "uri" schema expression. It would be good to have a corresponding "url" rule.
