jscott7 / csvcomparer

.NET Tool for comparing two csv files

License: MIT License

C# 100.00%

csv-files comparisons

csvcomparer's Introduction

👋 Hi, I’m Jonathan Scott, also known as JSco
👀 I’m interested in Physics and the natural world.
🌱 I’m currently restoring my old Physics analysis code, learning Quantum Computing and generally trying new things.
📫 How to reach me

Blog

JScos Blog

csvcomparer's Issues

Use quotes in output where required

If a break value contains a comma, it should be enclosed in quotes. That way the output opens cleanly in a spreadsheet, for example, and the values stay aligned with the column headers.

Perform a value comparison when surrounded by " quotes

If a column value is surrounded by " quotes, we always perform a string comparison. Where the column holds a double, we would like to take advantage of the tolerance settings, for example

"String Value A","String Value B","10.5"
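One possible approach, sketched below, is to strip the surrounding quotes and try a numeric parse before falling back to a string comparison. The class and method names (`QuotedValueSketch`, `TryGetNumericValue`) are hypothetical and are not taken from the csvcomparer source.

```csharp
using System.Globalization;

public static class QuotedValueSketch
{
    // Hypothetical helper: removes one pair of surrounding double quotes, then
    // reports whether the remaining text parses as a double (invariant culture).
    public static bool TryGetNumericValue(string field, out double value)
    {
        string trimmed = field.Trim();
        if (trimmed.Length >= 2 && trimmed[0] == '"' && trimmed[trimmed.Length - 1] == '"')
        {
            trimmed = trimmed.Substring(1, trimmed.Length - 2);
        }

        return double.TryParse(trimmed, NumberStyles.Float, CultureInfo.InvariantCulture, out value);
    }
}
```

With the row above, `"10.5"` would parse as 10.5 and could go through the numeric comparison, while `"String Value A"` fails the parse and stays a string comparison.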

Improve logging of exceptions when a key definition doesn't define unique rows

If the key columns in the configuration do not define unique rows, the comparer can throw a confusing exception.

This exception should be improved to make it clear what is happening.

In this example if only column A is defined as the key column an exception could be thrown when checking the second row.

||A||B||C||
|1|1|A|
|1|2|B|
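A minimal sketch of how a duplicate key could be detected up front so that a clearer message can be raised. `KeyUniquenessSketch` and `FindDuplicateKey` are hypothetical names, and keys are assumed to be plain strings.

```csharp
using System.Collections.Generic;

public static class KeyUniquenessSketch
{
    // Returns the first key that appears more than once, or null if every key is unique.
    public static string FindDuplicateKey(IEnumerable<string> keys)
    {
        var seen = new HashSet<string>();
        foreach (string key in keys)
        {
            // HashSet.Add returns false when the key is already present.
            if (!seen.Add(key))
            {
                return key;
            }
        }

        return null;
    }
}
```

For the table above with only column A as the key, the keys are `1` and `1`, so the second row would be reported as a duplicate of the first.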

Support quotes within a field

Add support as per RFC 4180 https://tools.ietf.org/html/rfc4180

If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:

"aaa","b""bb","ccc"
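The escaping rule quoted above can be sketched as a small state machine. This is an illustration only, not the project's `SplitStringWithQuotes` method; `Rfc4180Sketch.SplitLine` is a hypothetical name, and the sketch assumes a single line with no embedded line breaks.

```csharp
using System.Collections.Generic;
using System.Text;

public static class Rfc4180Sketch
{
    // Splits one CSV line into unquoted field values, treating "" inside a
    // quoted field as a literal double quote.
    public static List<string> SplitLine(string line, char delimiter = ',')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // escaped quote: keep one, skip the other
                    i++;
                }
                else
                {
                    inQuotes = false; // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true; // opening quote
            }
            else if (c == delimiter)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString()); // last field, including when the line ends with a quote
        return fields;
    }
}
```

Because the loop index always advances, the method terminates even when the final character is a closing quote.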

Improve unit test coverage

See coverage here https://dev.azure.com/jonathanscott80/CSVComparer/_build/results?buildId=53&view=codecoverage-tab

Coverage needs improving in Program.cs and ComparisonUtils.cs

Review code duplication in ComparisonUtils

The RunDirectoryComparison and RunFileComparison methods seem to have unnecessary duplication, especially in the generation of the results file. Review this and refactor where necessary

Show full path to input file in output

Run a comparison with the following arguments:
ExampleData\ReferenceFile.csv ExampleData\CandidateFile.csv ExampleData\Configuration.xml C:\temp\TestResults

The output shows this:

Date run,26/02/2023 20:01:37
Reference,ExampleData\ReferenceFile.csv
Candidate,ExampleData\CandidateFile.csv

I would like the files reported here to include the full path

Improve clarity of output in README

The output is formatted as CSV but for the README it's just being shown raw. Change to tabulate the output to make it clearer what the differences mean

Implement Tolerance

The configuration allows for absolute and relative tolerance to be defined. This isn't yet implemented though

Default to all double value columns. If no tolerance is defined then report precise matches only
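A sketch of what the check might look like. How the absolute and relative tolerances combine is an assumption here (a value passes if it satisfies either one), and `ToleranceSketch.WithinTolerance` is a hypothetical name.

```csharp
using System;

public static class ToleranceSketch
{
    // Returns true when the values match exactly or fall within either tolerance.
    // With both tolerances left at 0, only precise matches pass.
    public static bool WithinTolerance(
        double reference,
        double candidate,
        double absoluteTolerance = 0.0,
        double relativeTolerance = 0.0)
    {
        if (reference == candidate)
        {
            return true;
        }

        double difference = Math.Abs(reference - candidate);
        if (difference <= absoluteTolerance)
        {
            return true;
        }

        // The relative check is measured against the reference value (an assumption).
        double scale = Math.Abs(reference);
        return scale > 0 && difference / scale <= relativeTolerance;
    }
}
```

Measuring the relative difference against the reference value rather than the larger of the two is a design choice that would need to be settled in the configuration documentation.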

Fix success code

CompareRow returns false even if it is successful. This logic needs to be made obvious and clear

Comparison Runner output shows incorrect line for saving results

Run with command line
ExampleData\ReferenceFile.csv ExampleData\CandidateFile.csv ExampleData\Configuration.xml C:\temp\TestResults

There should be breaks for this comparison but the output also shows Saving results to ..\ComparisonResults.csv, which doesn't exist

Reference: ExampleData\ReferenceFile.csv
Candidate: ExampleData\CandidateFile.csv
Saving results to C:\temp\TestResults\ComparisonResults.csv
Saving results to C:\temp\TestResults\ComparisonResults.BREAKS.csv
Finished. Comparison took 13ms

We must only log the file actually being saved to

Check key column consistency

If a Key column is defined in the configuration but that column doesn't exist in the csv file we should check and terminate the comparison. Otherwise there could be unexpected runtime behaviour
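The check could be as simple as the following sketch, run once the header row has been read. `KeyColumnCheckSketch` and `FindMissingKeyColumns` are hypothetical names; matching is assumed to be case-sensitive after trimming header whitespace.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class KeyColumnCheckSketch
{
    // Returns the configured key columns that do not appear in the header row.
    public static List<string> FindMissingKeyColumns(
        IEnumerable<string> keyColumns,
        IEnumerable<string> headerColumns)
    {
        var header = new HashSet<string>(headerColumns.Select(h => h.Trim()));
        return keyColumns.Where(k => !header.Contains(k)).ToList();
    }
}
```

If the returned list is not empty, the comparison can stop with a message naming the missing columns.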

Row number is inconsistent in output

Consider checking two files with a difference at key value 1000

COL A,COL B,COL C,COL D
:
999,AA,G,-9.05
1000,D,FF,10.5
1001,BB,FF,-9.05

and

COL A,COL B,COL C,COL D
:
999,AA,G,-9.05
1000,AA,H,-9.05
1001,BB,FF,-9.05

The row number for LHS and RHS is the same as the key. However it should be row 1001 in the file because of the header row

Break Type,Key (COL A),Column Name,LHS Row,LHS Value,RHS Row,RHS Value
ValueMismatch,1000,COL B,1000,D,1000,AA
ValueMismatch,1000,COL C,1000,FF,1000,H
ValueMismatch,1000,COL D,1000,10.5,1000,-9.05

The row number should be for the file.

Save comparison output to file

Save all the breaks to a file.
Include summary information
Output of application will be summary and a link to the output file

Add column exclusion configuration

Some columns may contain unique data that will not match across comparisons. For example a GUID, or timestamp
In this case we want to add the ability to exclude one or more columns
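One way to apply such a setting is to work out, once per file, which column indexes take part in the comparison. The sketch below assumes exclusions are given by column name; `ColumnExclusionSketch.ColumnsToCompare` is a hypothetical name.

```csharp
using System.Collections.Generic;

public static class ColumnExclusionSketch
{
    // Returns the indexes of the header columns that are not excluded.
    public static List<int> ColumnsToCompare(
        IReadOnlyList<string> header,
        IEnumerable<string> excludedColumns)
    {
        var excluded = new HashSet<string>(excludedColumns);
        var indexes = new List<int>();
        for (int i = 0; i < header.Count; i++)
        {
            if (!excluded.Contains(header[i]))
            {
                indexes.Add(i);
            }
        }

        return indexes;
    }
}
```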

Date in ComparisonResult is not being set

The date field in ComparisonResult should record the date the comparison was made. But this is not being set.

Ignore trailing rows

Sometimes a CSV can have information at the end of the file.
This should be optionally ignored, for example

COLA, COLB, COLC
A,B,1
C,D,2
E,F,3
Some non-csv information
On extra Rows

SplitStringWithQuotes does not exit when a quote is last character

If a row ends with a quote, for example

A,B,"C,D"

Then the method SplitStringWithQuotes does not exit

Create an interface for CSVComparer

I would like to add other comparison implementations in future. Create an interface to CSVComparer to allow this to be achieved

Fix path to README.md for nuget packaging

The README path points to RunningTracker. It was inadvertently set from Visual Studio for another project when setting up nuget.
This is breaking the Azure pipeline

Tidy class usage

When should configuration be applied: at class construction, or during comparison?

Given the configuration is stored as a class field I think for consistency it should be applied in the constructor

Support key exclusions based on regex

Consider a value break in a row with a 3 part key for ColA, ColB, ColC:

A:B:SomeValue

We may wish to exclude value breaks where the Col C part is "SomeSpecialValue"

Add support for this
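A sketch of how such an exclusion might be evaluated, assuming the key is available as a single string with its parts joined by `:` as in the example above. `KeyExclusionSketch.IsExcluded` is a hypothetical name, and the patterns are assumed to be ordinary .NET regular expressions.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeyExclusionSketch
{
    // Returns true when the key matches at least one exclusion pattern.
    public static bool IsExcluded(string key, IEnumerable<string> exclusionPatterns)
    {
        return exclusionPatterns.Any(pattern => Regex.IsMatch(key, pattern));
    }
}
```

A pattern such as `:SomeSpecialValue$` would then exclude the key `A:B:SomeSpecialValue` while leaving `A:B:SomeValue` in the results.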

Add support for empty files

If one or both files are completely empty, the comparison should still complete and report meaningful results. At the moment the comparison thread is permanently waiting on loader completion.

Data with commas won't be presented in result CSV

The comparer handles columns with embedded delimiters but if there is a break in such a column, either as an orphan or a value mismatch, the output CSV file doesn't wrap this in quotes. This makes it difficult to view in an editor

COL A,COLB,COLC
A,x,"A value, with a comma"

We want this:

Break Type,Key,Reference Row,Reference Value,Candidate Row,Candidate Value
RowInReferenceNotInCandidate,A,1,"A value, with a comma",,
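A minimal sketch of the quoting step, applied to each value just before it is written to the results file. `CsvOutputSketch.QuoteIfRequired` is a hypothetical name; the rule follows RFC 4180, where embedded double quotes are doubled.

```csharp
public static class CsvOutputSketch
{
    // Wraps the value in double quotes when it contains a comma, a double quote
    // or a line break; embedded double quotes are doubled.
    public static string QuoteIfRequired(string value)
    {
        bool needsQuotes = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        if (!needsQuotes)
        {
            return value;
        }

        return "\"" + value.Replace("\"", "\"\"") + "\"";
    }
}
```

Values without any special character are returned unchanged, so existing output for simple breaks would not change.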

Check command line output makes sense

Confirm the command-line output makes sense. For example if the output file is a *.BREAKS.csv then make it clear

Naming change Reference/Candidate to LeftHand/RightHand

When comparing in UIs it makes more sense to have the Left Hand/Right Hand name for the input CSVs rather than Reference/Candidate which is more for testing.

As this library is not explicitly a testing tool we should change the naming convention

Support multiple configurations in single process

Say, for example, we want to compare multiple csv files in a folder. These may be of different types. Rather than running multiple instances of the process, we should set up a configuration that contains multiple definitions. These can be associated with files using pattern matching. For example:

<configs>
 <config pattern="Type1*.csv">
    <definition>...Definition1...</definition>
 </config>
 <config pattern="Type2*.csv">
    <definition>...Definition2...</definition>
 </config>
</configs>
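A sketch of how a file name could be matched against the `pattern` attributes. It assumes the patterns are simple wildcards (`*` and `?`) matched case-insensitively and that the first matching config wins; none of this is confirmed by the source, and `ConfigPatternSketch` is a hypothetical name.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ConfigPatternSketch
{
    // Converts a simple wildcard pattern into an anchored regular expression.
    public static Regex WildcardToRegex(string pattern)
    {
        string escaped = Regex.Escape(pattern)
            .Replace("\\*", ".*")  // '*' matches any run of characters
            .Replace("\\?", ".");  // '?' matches a single character
        return new Regex("^" + escaped + "$", RegexOptions.IgnoreCase);
    }

    // Returns the first pattern that matches the file name, or null if none does.
    public static string FindMatchingPattern(string fileName, IEnumerable<string> patterns)
    {
        foreach (string pattern in patterns)
        {
            if (WildcardToRegex(pattern).IsMatch(fileName))
            {
                return pattern;
            }
        }

        return null;
    }
}
```

Files that match no pattern return null, which leaves the caller free to skip them or fall back to a default definition.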

Add column name in breaks report

Although we have row information, it would be clearer if the Reference Column and Candidate Column were also included in a break report

Break Type	Key	Reference Row	Reference Value	Candidate Row	Candidate Value
ValueMismatch	7	8	32.1	8	42.1

Becomes

Break Type	Key	Reference Row	Reference Col	Reference Value	Candidate Row	Candidate Col	Candidate Value
ValueMismatch	7	8	COL B	32.1	8	COL B	42.1

Remove build warnings and set compiler to fail on warnings

There are 8 warnings in ComparisonUtils that need resolving

Exclude Orphan breaks from results based on patterns

There may be an orphan break that we don't want to consider as failing the reconciliation. Add the ability to exclude breaks based on patterns. Define a Regex in the configuration, for example:

<ComparisonDefinition>
  ...
  <OrphanExclusions>
      <ExclusionPattern>Regex</ExclusionPattern>
      <ExclusionPattern>Regex</ExclusionPattern>
  </OrphanExclusions>
</ComparisonDefinition>

Refactor duplicate orphan checks

The following methods in CSVComparer are duplicates. This should be refactored into a single method.

        private CsvRow CheckReferenceOrphan(CsvRow candidateRow)
        {
            CsvRow csvRow = null;

            if (_referenceOrphans.ContainsKey(candidateRow.Key))
            {
                csvRow = _referenceOrphans[candidateRow.Key];
                _referenceOrphans.Remove(candidateRow.Key);
            }
            else
            {
                if (_candidateOrphans.ContainsKey(candidateRow.Key))
                {
                    throw new ComparisonException($"Candidate orphan {candidateRow.Key} already exists. This usually means the key columns do not define unique rows.");
                }

                _candidateOrphans.Add(candidateRow.Key, candidateRow);
            }

            return csvRow;
        }

        private CsvRow CheckCandidateOrphan(CsvRow referenceRow)
        {
            CsvRow csvRow = null;
            if (_candidateOrphans.ContainsKey(referenceRow.Key))
            {
                csvRow = _candidateOrphans[referenceRow.Key];
                _candidateOrphans.Remove(referenceRow.Key);
            }
            else
            {
                if (_referenceOrphans.ContainsKey(referenceRow.Key))
                {
                    throw new ComparisonException($"Reference orphan {referenceRow.Key} already exists. This usually means the key columns do not define unique rows.");
                }

                _referenceOrphans.Add(referenceRow.Key, referenceRow);
            }

            return csvRow;
        }
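One possible shape for the refactored method is sketched below. The two dictionaries are passed in so the same logic serves both directions. `CsvRow` and `ComparisonException` are minimal stand-ins declared only so the sketch compiles; the real types live in the project, and the key is assumed to be a string.

```csharp
using System;
using System.Collections.Generic;

// Stand-ins for the project's own types (assumptions, not the real definitions).
public class CsvRow
{
    public string Key { get; set; }
}

public class ComparisonException : Exception
{
    public ComparisonException(string message) : base(message) { }
}

public static class OrphanCheckSketch
{
    // Looks for a matching row among the opposite side's orphans. If one is found
    // it is removed and returned; otherwise the row is stored as an orphan for its
    // own side and null is returned.
    public static CsvRow CheckOrphan(
        CsvRow row,
        IDictionary<string, CsvRow> oppositeOrphans,
        IDictionary<string, CsvRow> ownOrphans,
        string ownSideName)
    {
        if (oppositeOrphans.TryGetValue(row.Key, out CsvRow match))
        {
            oppositeOrphans.Remove(row.Key);
            return match;
        }

        if (ownOrphans.ContainsKey(row.Key))
        {
            throw new ComparisonException($"{ownSideName} orphan {row.Key} already exists. This usually means the key columns do not define unique rows.");
        }

        ownOrphans.Add(row.Key, row);
        return null;
    }
}
```

Under this sketch, `CheckReferenceOrphan(candidateRow)` would become `CheckOrphan(candidateRow, _referenceOrphans, _candidateOrphans, "Candidate")`, and `CheckCandidateOrphan(referenceRow)` would swap the two dictionaries and pass `"Reference"`.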

Update to dotnet core 3.1

Warning from CI (Azure pipeline)

##[warning]C:\Program Files\dotnet\sdk\5.0.201\Sdks\Microsoft.NET.Sdk\targets\Microsoft.NET.EolTargetFrameworks.targets(28,5): Warning NETSDK1138: The target framework 'netcoreapp3.0' is out of support and will not receive security updates in the future. Please refer to https://aka.ms/dotnet-core-support for more information about the support policy.