jscott7 / csvcomparer

.NET Tool for comparing two csv files

License: MIT License

C# 100.00%

csvcomparer's Issues

Check key column consistency

If a Key column is defined in the configuration but does not exist in the CSV file, we should detect this and terminate the comparison. Otherwise there could be unexpected runtime behaviour.
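A minimal sketch of such a check, assuming the header row and the configured key columns are both available as lists of names before any rows are compared. The class and method names here are hypothetical and not part of the existing library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: not part of the existing csvcomparer API.
public static class KeyColumnCheck
{
    // Returns the configured key columns that are absent from the CSV header.
    public static IReadOnlyList<string> FindMissingKeyColumns(
        IEnumerable<string> headerColumns, IEnumerable<string> keyColumns)
    {
        var header = new HashSet<string>(headerColumns.Select(c => c.Trim()), StringComparer.Ordinal);
        return keyColumns.Where(k => !header.Contains(k.Trim())).ToList();
    }

    // Throws before the comparison starts if any key column is missing.
    public static void EnsureKeyColumnsExist(
        IEnumerable<string> headerColumns, IEnumerable<string> keyColumns)
    {
        var missing = FindMissingKeyColumns(headerColumns, keyColumns);
        if (missing.Count > 0)
        {
            throw new InvalidOperationException(
                "Key column(s) not found in CSV header: " + string.Join(", ", missing));
        }
    }
}
```

Failing fast with the list of missing names should make the configuration error easier to diagnose than a later runtime failure.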

Data with commas won't be presented in result CSV

The comparer handles columns with embedded delimiters, but if there is a break in such a column, either as an orphan or a value mismatch, the output CSV file does not wrap the value in quotes. This makes the output difficult to view in an editor. For example, given this input:

COL A,COLB,COLC
A,x,"A value, with a comma"

We want this:

Break Type,Key,Reference Row,Reference Value,Candidate Row,Candidate Value
RowInReferenceNotInCandidate,A,1,"A value, with a comma",,
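The quoting described above could be sketched as follows, using the RFC 4180 convention of wrapping a field in double quotes when it contains the delimiter, a quote, or a line break. `CsvOutput`, `EscapeField` and `JoinRow` are hypothetical names chosen for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: not part of the existing csvcomparer API.
public static class CsvOutput
{
    // Quotes a field only when it needs it; embedded quotes are doubled.
    public static string EscapeField(string value, char delimiter = ',')
    {
        if (string.IsNullOrEmpty(value))
        {
            return string.Empty;
        }

        bool needsQuotes = value.IndexOf(delimiter) >= 0
            || value.IndexOf('"') >= 0
            || value.IndexOf('\n') >= 0
            || value.IndexOf('\r') >= 0;

        return needsQuotes ? "\"" + value.Replace("\"", "\"\"") + "\"" : value;
    }

    // Joins already-separated field values into one output row.
    public static string JoinRow(IEnumerable<string> fields, char delimiter = ',')
    {
        return string.Join(delimiter.ToString(), fields.Select(f => EscapeField(f, delimiter)));
    }
}
```

Calling `JoinRow` with the six values of the orphan row above produces the quoted line shown in the desired output.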

Ignore trailing rows

Sometimes a CSV file has non-CSV information at the end of the file.
It should be possible to optionally ignore this, for example:

COLA, COLB, COLC
A,B,1
C,D,2
E,F,3
Some non-csv information
On extra Rows

Show full path to input file in output

Run a comparison with the following arguments:
ExampleData\ReferenceFile.csv ExampleData\CandidateFile.csv ExampleData\Configuration.xml C:\temp\TestResults

The output shows this:

Date run,26/02/2023 20:01:37
Reference,ExampleData\ReferenceFile.csv
Candidate,ExampleData\CandidateFile.csv

I would like the files reported here to include the full path
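One way to do this, sketched under the assumption that the summary lines are built from the raw command-line arguments, is to resolve each argument with `Path.GetFullPath` before writing it. The `SummaryLines` class is a hypothetical name.

```csharp
using System.IO;

// Hypothetical helper: not part of the existing csvcomparer API.
public static class SummaryLines
{
    // Builds a "label,full path" summary line. Relative paths are resolved
    // against the current working directory by Path.GetFullPath.
    public static string Format(string label, string inputPath)
    {
        return label + "," + Path.GetFullPath(inputPath);
    }
}
```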

Add column exclusion configuration

Some columns may contain unique data that will not match across comparisons, for example a GUID or a timestamp.
In this case we want the ability to exclude one or more columns from the comparison.

Naming change Reference/Candidate to LeftHand/RightHand

When comparing in UIs it makes more sense to have the Left Hand/Right Hand name for the input CSVs rather than Reference/Candidate which is more for testing.

As this library is not explicitly a testing tool we should change the naming convention

Use commas instead of colons in output summary so they appear as different columns in CSV view

Reference: xxx
Candidate: xxx

If this is opened in a spreadsheet, the same cell contains both the label (for example "Reference") and its value. If we separate them with commas they will be placed in different cells.

Exclude Orphan breaks from results based on patterns

There may be an orphan break that we don't want to consider as failing the reconciliation. Add the ability to exclude breaks based on patterns. Define a Regex in the configuration, for example:

<ComparisonDefinition>
  ...
  <OrphanExclusions>
      <ExclusionPattern>Regex</ExclusionPattern>
      <ExclusionPattern>Regex</ExclusionPattern>
  </OrphanExclusions>
</ComparisonDefinition>
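A rough sketch of how such patterns might be applied, assuming each `ExclusionPattern` is matched against the orphan's key (the issue does not say which text the pattern should be tested against). `OrphanExclusionFilter` is a hypothetical class name.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper: not part of the existing csvcomparer API.
public sealed class OrphanExclusionFilter
{
    private readonly List<Regex> _patterns;

    // Each string is one ExclusionPattern value read from the configuration.
    public OrphanExclusionFilter(IEnumerable<string> exclusionPatterns)
    {
        _patterns = exclusionPatterns.Select(p => new Regex(p, RegexOptions.Compiled)).ToList();
    }

    // Assumption: an orphan is excluded when any pattern matches its key.
    public bool IsExcluded(string orphanKey)
    {
        return _patterns.Any(p => p.IsMatch(orphanKey));
    }
}
```

Excluded orphans could then be omitted from the breaks report, or reported but not counted as failures; the issue leaves that choice open.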

Implement early termination

Early termination should happen when one side of the comparison is missing data: either the CSV file is missing or it contains no data.
In this case the comparison should stop immediately and give a high-level report.

Update to dotnet core 3.1

Warning from CI (Azure pipeline)

##[warning]C:\Program Files\dotnet\sdk\5.0.201\Sdks\Microsoft.NET.Sdk\targets\Microsoft.NET.EolTargetFrameworks.targets(28,5): Warning NETSDK1138: The target framework 'netcoreapp3.0' is out of support and will not receive security updates in the future. Please refer to https://aka.ms/dotnet-core-support for more information about the support policy.

Allow same instance to run multiple comparisons

At the moment we need to instantiate a new instance of the CSV Comparer to run a comparison. This is because of internal state being retained. We should be able to reuse the same object for multiple comparisons

Add column name in breaks report

Although we have row information, it would be clearer if the Reference Column and Candidate Column were also included in a break report.

Break Type	Key	Reference Row	Reference Value	Candidate Row	Candidate Value
ValueMismatch	7	8	32.1	8	42.1

Becomes

Break Type	Key	Reference Row	Reference Col	Reference Value	Candidate Row	Candidate Col	Candidate Value
ValueMismatch	7	8	COL B	32.1	8	COL B	42.1

Fix success code

CompareRow returns false even if it is successful. This logic needs to be made obvious and clear

Improve unit test coverage

See coverage here https://dev.azure.com/jonathanscott80/CSVComparer/_build/results?buildId=53&view=codecoverage-tab

Coverage needs improving in Program.cs and ComparisonUtils.cs

Support quotes within a field

Add support as per RFC 4180 https://tools.ietf.org/html/rfc4180

If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:

"aaa","b""bb","ccc"
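A minimal sketch of a splitter that follows this rule, written as a single pass over the line so that it always terminates, including when the last character is a closing quote. `CsvLineParser.SplitLine` is a hypothetical name and is not the project's existing `SplitStringWithQuotes` method; fields that span multiple lines are not handled.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: not part of the existing csvcomparer API.
public static class CsvLineParser
{
    public static List<string> SplitLine(string line, char delimiter = ',')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c != '"')
                {
                    current.Append(c);
                }
                else if (i + 1 < line.Length && line[i + 1] == '"')
                {
                    // A doubled quote inside a quoted field is a literal quote.
                    current.Append('"');
                    i++;
                }
                else
                {
                    inQuotes = false; // closing quote
                }
            }
            else if (c == '"')
            {
                inQuotes = true; // opening quote
            }
            else if (c == delimiter)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        fields.Add(current.ToString()); // last field has no trailing delimiter
        return fields;
    }
}
```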

Refactor application and library

Refactor the application into a separate project and use a class library for the comparison logic

Improve logging of exceptions when a key definition doesn't define unique rows

If the key columns in the configuration do not define unique rows, the comparer can throw a confusing exception.

This exception should be improved to make it clear what is happening.

In this example if only column A is defined as the key column an exception could be thrown when checking the second row.

||A||B||C||
|1|1|A|
|1|2|B|

Refactor duplicate orphan checks

The following methods in CSVComparer are duplicates. This should be refactored into a single method.

        private CsvRow CheckReferenceOrphan(CsvRow candidateRow)
        {
            CsvRow csvRow = null;

            if (_referenceOrphans.ContainsKey(candidateRow.Key))
            {
                csvRow = _referenceOrphans[candidateRow.Key];
                _referenceOrphans.Remove(candidateRow.Key);
            }
            else
            {
                if (_candidateOrphans.ContainsKey(candidateRow.Key))
                {
                    throw new ComparisonException($"Candidate orphan {candidateRow.Key} already exists. This usually means the key columns do not define unique rows.");
                }

                _candidateOrphans.Add(candidateRow.Key, candidateRow);
            }

            return csvRow;
        }

        private CsvRow CheckCandidateOrphan(CsvRow referenceRow)
        {
            CsvRow csvRow = null;

            if (_candidateOrphans.ContainsKey(referenceRow.Key))
            {
                csvRow = _candidateOrphans[referenceRow.Key];
                _candidateOrphans.Remove(referenceRow.Key);
            }
            else
            {
                if (_referenceOrphans.ContainsKey(referenceRow.Key))
                {
                    throw new ComparisonException($"Reference orphan {referenceRow.Key} already exists. This usually means the key columns do not define unique rows.");
                }

                _referenceOrphans.Add(referenceRow.Key, referenceRow);
            }

            return csvRow;
        }
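One possible shape for the refactoring is a single private method that takes the two dictionaries as parameters, with the original methods reduced to one-line wrappers. This is only a sketch: `CsvRow` and `ComparisonException` below are minimal stand-ins for the real types, and `OrphanTracker` and `CheckOrphan` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Minimal stand-ins for the real types, for illustration only.
public class CsvRow { public string Key { get; set; } }
public class ComparisonException : Exception
{
    public ComparisonException(string message) : base(message) { }
}

public class OrphanTracker
{
    private readonly Dictionary<string, CsvRow> _referenceOrphans = new Dictionary<string, CsvRow>();
    private readonly Dictionary<string, CsvRow> _candidateOrphans = new Dictionary<string, CsvRow>();

    // Returns the matching row from the opposite side, or null after
    // recording the row as an orphan on its own side.
    private static CsvRow CheckOrphan(
        CsvRow row,
        Dictionary<string, CsvRow> oppositeOrphans,
        Dictionary<string, CsvRow> ownOrphans,
        string ownSideName)
    {
        if (oppositeOrphans.TryGetValue(row.Key, out CsvRow match))
        {
            oppositeOrphans.Remove(row.Key);
            return match;
        }

        if (ownOrphans.ContainsKey(row.Key))
        {
            throw new ComparisonException($"{ownSideName} orphan {row.Key} already exists. This usually means the key columns do not define unique rows.");
        }

        ownOrphans.Add(row.Key, row);
        return null;
    }

    public CsvRow CheckReferenceOrphan(CsvRow candidateRow)
        => CheckOrphan(candidateRow, _referenceOrphans, _candidateOrphans, "Candidate");

    public CsvRow CheckCandidateOrphan(CsvRow referenceRow)
        => CheckOrphan(referenceRow, _candidateOrphans, _referenceOrphans, "Reference");
}
```

The exception messages are kept identical to the originals, so callers and existing tests that rely on the message text should be unaffected.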

Support key exclusions based on regex

Consider a value break in a row with a 3 part key for ColA, ColB, ColC:

A:B:SomeValue

We may wish to exclude value breaks where the Col C part is "SomeSpecialValue"

Add support for this

Add support for empty files

If one or both files are completely empty, the comparison should still complete and report meaningful results. At the moment the comparison thread is permanently waiting on loader completion.

Add the row numbers to the break detail

The break detail currently reports the unique key for the break but it will also be useful to include the actual row number this occurs on for both the reference and candidate files.

Save comparison output to file

Save all the breaks to a file.
Include summary information
Output of application will be summary and a link to the output file

Implement Tolerance

The configuration allows absolute and relative tolerances to be defined, but these are not yet implemented.

Default to all double value columns. If no tolerance is defined then report precise matches only
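A sketch of what the check could look like, under several assumptions: values are parsed with the invariant culture, relative tolerance is measured against the reference value, a value passes if it is within either configured tolerance, and "precise match" for numeric values means numeric equality. None of these choices is confirmed by the issue, and `ToleranceCheck` is a hypothetical name.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: not part of the existing csvcomparer API.
public static class ToleranceCheck
{
    public static bool WithinAbsolute(double reference, double candidate, double tolerance)
    {
        return Math.Abs(reference - candidate) <= tolerance;
    }

    // Assumption: the difference is taken relative to the reference value.
    public static bool WithinRelative(double reference, double candidate, double tolerance)
    {
        if (reference == candidate) return true;
        if (reference == 0.0) return false; // relative difference is undefined
        return Math.Abs((reference - candidate) / reference) <= tolerance;
    }

    public static bool ValuesMatch(string reference, string candidate, double? absolute, double? relative)
    {
        // Non-numeric values fall back to an exact string comparison.
        if (!double.TryParse(reference, NumberStyles.Float, CultureInfo.InvariantCulture, out double r)
            || !double.TryParse(candidate, NumberStyles.Float, CultureInfo.InvariantCulture, out double c))
        {
            return string.Equals(reference, candidate, StringComparison.Ordinal);
        }

        // No tolerance configured: only precise matches pass.
        if (!absolute.HasValue && !relative.HasValue)
        {
            return r == c;
        }

        return (absolute.HasValue && WithinAbsolute(r, c, absolute.Value))
            || (relative.HasValue && WithinRelative(r, c, relative.Value));
    }
}
```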

Date in ComparisonResult is not being set

The date field in ComparisonResult should record the date the comparison was made. But this is not being set.

SplitStringWithQuotes does not exit when a quote is last character

If a row ends with a quote, for example

A,B,"C,D"

Then the method SplitStringWithQuotes does not exit

Support multiple configurations in single process

Say, for example, we want to compare multiple CSV files in a folder. These may be of different types. Rather than running multiple instances of the process we should set up a configuration that contains multiple definitions. These can be associated with files using pattern matching. For example:

<configs>
 <config pattern="Type1*.csv">
    <definition>...Definition1...</definition>
 </config>
 <config pattern="Type2*.csv">
    <definition>...Definition2...</definition>
 </config>
</configs>
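The association between a file and its definition could be sketched like this, assuming the `pattern` attribute uses simple wildcards (`*` for any run of characters, `?` for a single character) and that matching is case-insensitive. `ConfigPatternMatcher` is a hypothetical name.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper: not part of the existing csvcomparer API.
public static class ConfigPatternMatcher
{
    // Converts a wildcard pattern such as "Type1*.csv" to an anchored regex
    // and tests the file name against it.
    public static bool Matches(string fileName, string pattern)
    {
        string regex = "^"
            + Regex.Escape(pattern).Replace("\\*", ".*").Replace("\\?", ".")
            + "$";
        return Regex.IsMatch(fileName, regex, RegexOptions.IgnoreCase);
    }
}
```

The comparer would then pick the first `config` element whose pattern matches the file name; what to do when no pattern or more than one pattern matches is left open by the issue.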

Investigate configuration using yaml

Use quotes in output where required

If a break value includes a comma, we want to make sure it is enclosed in quotes. That way the file can be opened in a spreadsheet, for example, and the values will stay aligned with the column headers.

Some comparisons terminate while there are still items in the queue

We sometimes get breaks reported where there are a number of orphans (either Candidate or Reference).
This seems to happen because the comparison stops before all the items in the queue have been drained after loading has completed.
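The issue does not say how the queue is implemented. As an illustration only, the sketch below uses `BlockingCollection<T>`, where the consumer loop ends only once the producer has called `CompleteAdding` and every queued item has been taken, so nothing is left behind. `QueueDrainSketch` is a hypothetical name.

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

// Hypothetical illustration: the real loader and comparer are not shown in the issue.
public static class QueueDrainSketch
{
    // Loads rows on a background task and counts how many the consumer processes.
    public static int LoadAndConsume(IEnumerable<string> rows)
    {
        using (var queue = new BlockingCollection<string>(boundedCapacity: 100))
        {
            Task loader = Task.Run(() =>
            {
                try
                {
                    foreach (string row in rows)
                    {
                        queue.Add(row);
                    }
                }
                finally
                {
                    // Signals that no more items will arrive, even if loading failed.
                    queue.CompleteAdding();
                }
            });

            int processed = 0;
            // Ends only when adding is complete AND the queue is empty.
            foreach (string row in queue.GetConsumingEnumerable())
            {
                processed++;
            }

            loader.Wait(); // surfaces any loader exception
            return processed;
        }
    }
}
```

With this pattern an empty input also finishes cleanly, because `CompleteAdding` is called even when no rows were added.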

Fix path to README.md for NuGet packaging

The path to the README points to RunningTracker; it was inadvertently set from Visual Studio for another project when setting up NuGet.
This is breaking the Azure pipeline.

Remove build warnings and set compiler to fail on warnings

There are 8 warnings in ComparisonUtils that need resolving

Update readme.md

As it's the main landing document this should include more information

Improve clarity of output in README

The output is formatted as CSV but for the README it's just being shown raw. Change to tabulate the output to make it clearer what the differences mean

Review code duplication in ComparisonUtils

The RunDirectoryComparison and RunFileComparison methods seem to have unnecessary duplication, especially in the generation of the results file. Review this and refactor where necessary

Check command line output makes sense

Confirm the command-line output makes sense. For example if the output file is a *.BREAKS.csv then make it clear

Create an interface for CSVComparer

I would like to add other comparison implementations in future. Create an interface to CSVComparer to allow this to be achieved

Row number is inconsistent in output

Consider checking two files with a difference at key value 1000

COL A,COL B,COL C,COL D
:
999,AA,G,-9.05
1000,D,FF,10.5
1001,BB,FF,-9.05

and

COL A,COL B,COL C,COL D
:
999,AA,G,-9.05
1000,AA,H,-9.05
1001,BB,FF,-9.05

The row number for LHS and RHS is the same as the key. However it should be row 1001 in the file because of the header row

Break Type,Key (COL A),Column Name,LHS Row,LHS Value,RHS Row,RHS Value
ValueMismatch,1000,COL B,1000,D,1000,AA
ValueMismatch,1000,COL C,1000,FF,1000,H
ValueMismatch,1000,COL D,1000,10.5,1000,-9.05

The row number should be for the file.

Reference: ExampleData\ReferenceFile.csv
Candidate: ExampleData\CandidateFile.csv
Saving results to C:\temp\TestResults\ComparisonResults.csv
Saving results to C:\temp\TestResults\ComparisonResults.BREAKS.csv
Finished. Comparison took 13ms

We must only log the file actually being saved to
