jscott7 / csvcomparer
.NET Tool for comparing two csv files
License: MIT License
If a Key column is defined in the configuration but that column doesn't exist in the csv file, we should detect this and terminate the comparison. Otherwise there could be unexpected runtime behaviour.
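A minimal sketch of such a check, assuming the header row has already been split into column names. `KeyColumnValidator` and `FindMissingKeyColumns` are hypothetical names used for illustration, not part of the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: not part of the existing library.
public static class KeyColumnValidator
{
    // Returns the configured key columns that are absent from the header row.
    // An empty result means every key column was found.
    public static List<string> FindMissingKeyColumns(
        IEnumerable<string> headerColumns,
        IEnumerable<string> keyColumns)
    {
        var header = new HashSet<string>(
            headerColumns.Select(c => c.Trim()), StringComparer.Ordinal);
        return keyColumns.Where(k => !header.Contains(k.Trim())).ToList();
    }
}
```

If the returned list is not empty, the comparison could stop before any rows are read and report the missing column names.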
The comparer handles columns with embedded delimiters, but if there is a break in such a column, either as an orphan or a value mismatch, the output CSV file doesn't wrap the value in quotes. This makes the output difficult to view in an editor. For example, given this input:
COL A,COLB,COLC
A,x,"A value, with a comma"
We want this:
Break Type,Key,Reference Row,Reference Value,Candidate Row,Candidate Value
RowInReferenceNotInCandidate,A,1,"A value, with a comma",,
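The quoting wanted in the output above could be sketched as follows. `CsvOutput.QuoteField` is a hypothetical helper, not an existing method of the library. It assumes the RFC 4180 convention of doubling any embedded quote.

```csharp
using System;

// Hypothetical helper: not part of the existing library.
public static class CsvOutput
{
    // Wraps a value in double quotes when it contains the delimiter, a quote
    // or a line break, doubling any embedded quotes (RFC 4180 convention).
    public static string QuoteField(string value, char delimiter = ',')
    {
        if (string.IsNullOrEmpty(value))
        {
            return string.Empty;
        }

        bool needsQuotes = value.IndexOf(delimiter) >= 0
            || value.IndexOf('"') >= 0
            || value.IndexOf('\n') >= 0
            || value.IndexOf('\r') >= 0;

        return needsQuotes
            ? "\"" + value.Replace("\"", "\"\"") + "\""
            : value;
    }
}
```

Applying this to every value written to the breaks file would keep the values aligned with the column headers when the file is opened in a spreadsheet.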
Sometimes a CSV file has non-CSV information at the end of the file.
It should be possible to optionally ignore this, for example:
COLA, COLB, COLC
A,B,1
C,D,2
E,F,3
Some non-csv information
On extra Rows
Run a comparison with the following arguments:
ExampleData\ReferenceFile.csv ExampleData\CandidateFile.csv ExampleData\Configuration.xml C:\temp\TestResults
The output shows this:
Date run,26/02/2023 20:01:37
Reference,ExampleData\ReferenceFile.csv
Candidate,ExampleData\CandidateFile.csv
I would like the files reported here to include the full path.
Some columns may contain unique data that will not match across comparisons, for example a GUID or a timestamp.
In this case we want to add the ability to exclude one or more columns.
When comparing in UIs it makes more sense to have the Left Hand/Right Hand name for the input CSVs rather than Reference/Candidate which is more for testing.
As this library is not explicitly a testing tool we should change the naming convention
Reference: xxx
Candidate: xxx
If this is opened in a spreadsheet, the same cell contains both the label (Reference or Candidate) and its value. If we separate these by columns, they will be placed in different cells.
There may be an orphan break that we don't want to consider as failing the reconciliation. Add the ability to exclude breaks based on patterns. Define a Regex in the configuration, for example:
<ComparisonDefinition>
  ...
  <OrphanExclusions>
    <ExclusionPattern>Regex</ExclusionPattern>
    <ExclusionPattern>Regex</ExclusionPattern>
  </OrphanExclusions>
</ComparisonDefinition>
Early termination should happen where one side of the comparison is missing data, either the CSV file is missing, or it doesn't have data.
In this case the comparison should stop immediately and give a high-level report
Warning from CI (Azure pipeline)
##[warning]C:\Program Files\dotnet\sdk\5.0.201\Sdks\Microsoft.NET.Sdk\targets\Microsoft.NET.EolTargetFrameworks.targets(28,5): Warning NETSDK1138: The target framework 'netcoreapp3.0' is out of support and will not receive security updates in the future. Please refer to https://aka.ms/dotnet-core-support for more information about the support policy.
At the moment we need to instantiate a new instance of the CSV Comparer to run a comparison. This is because of internal state being retained. We should be able to reuse the same object for multiple comparisons
Although we have row information, it would be clearer if the Reference Column and Candidate Column were also included in a break report:
Break Type | Key | Reference Row | Reference Value | Candidate Row | Candidate Value |
---|---|---|---|---|---|
ValueMismatch | 7 | 8 | 32.1 | 8 | 42.1 |
Becomes
Break Type | Key | Reference Row | Reference Col | Reference Value | Candidate Row | Candidate Col | Candidate Value |
---|---|---|---|---|---|---|---|
ValueMismatch | 7 | 8 | COL B | 32.1 | 8 | COL B | 42.1 |
CompareRow returns false even when the comparison is successful. This logic needs to be made clear.
See coverage here https://dev.azure.com/jonathanscott80/CSVComparer/_build/results?buildId=53&view=codecoverage-tab
Coverage needs improving in Program.cs and ComparisonUtils.cs
Add support as per RFC 4180 https://tools.ietf.org/html/rfc4180
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with
another double quote. For example:
"aaa","b""bb","ccc"
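One way to read the rule quoted above is a small state machine over the characters of a record. The sketch below is an illustration only: `CsvLineSplitter` is a hypothetical name, it handles a single line (quoted fields spanning several lines are out of scope), and it is not the library's existing `SplitStringWithQuotes` method.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: not part of the existing library.
public static class CsvLineSplitter
{
    // Splits one CSV record into fields. Inside a quoted field, "" is read as
    // a literal quote and delimiters are treated as ordinary characters.
    public static List<string> Split(string line, char delimiter = ',')
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (inQuotes)
            {
                if (c == '"' && i + 1 < line.Length && line[i + 1] == '"')
                {
                    current.Append('"'); // escaped quote: consume both characters
                    i++;
                }
                else if (c == '"')
                {
                    inQuotes = false;    // closing quote
                }
                else
                {
                    current.Append(c);
                }
            }
            else if (c == '"')
            {
                inQuotes = true;         // opening quote
            }
            else if (c == delimiter)
            {
                fields.Add(current.ToString());
                current.Clear();
            }
            else
            {
                current.Append(c);
            }
        }

        // The final field is added after the loop, so a record that ends
        // with a closing quote terminates normally.
        fields.Add(current.ToString());
        return fields;
    }
}
```

Because the loop advances by at least one character on every iteration, it always terminates, including for a record whose last field is quoted.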
Refactor the application into a separate project and use a class library for the comparison logic
If the key columns in the configuration do not define unique rows, the comparer can throw a confusing exception.
This exception should be improved to make it clear what is happening.
In this example, if only column A is defined as the key column, an exception could be thrown when checking the second row.
||A||B||C||
|1|1|A|
|1|2|B|
The following methods in CSVComparer are near duplicates. They should be refactored into a single method.
private CsvRow CheckReferenceOrphan(CsvRow candidateRow)
{
    CsvRow csvRow = null;
    if (_referenceOrphans.ContainsKey(candidateRow.Key))
    {
        csvRow = _referenceOrphans[candidateRow.Key];
        _referenceOrphans.Remove(candidateRow.Key);
    }
    else
    {
        if (_candidateOrphans.ContainsKey(candidateRow.Key))
        {
            throw new ComparisonException($"Candidate orphan {candidateRow.Key} already exists. This usually means the key columns do not define unique rows.");
        }
        _candidateOrphans.Add(candidateRow.Key, candidateRow);
    }
    return csvRow;
}

private CsvRow CheckCandidateOrphan(CsvRow referenceRow)
{
    CsvRow csvRow = null;
    if (_candidateOrphans.ContainsKey(referenceRow.Key))
    {
        csvRow = _candidateOrphans[referenceRow.Key];
        _candidateOrphans.Remove(referenceRow.Key);
    }
    else
    {
        if (_referenceOrphans.ContainsKey(referenceRow.Key))
        {
            throw new ComparisonException($"Reference orphan {referenceRow.Key} already exists. This usually means the key columns do not define unique rows.");
        }
        _referenceOrphans.Add(referenceRow.Key, referenceRow);
    }
    return csvRow;
}
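One possible shape for the refactoring is a single helper that takes the two dictionaries as parameters. This is a sketch under assumptions: `CsvRow` and `ComparisonException` below are minimal stand-ins so the example compiles on its own, the key is assumed to be a `string`, and `OrphanTracker` and `CheckOrphan` are hypothetical names. The two wrappers are public here only so the sketch can be exercised.

```csharp
using System;
using System.Collections.Generic;

// Minimal stand-ins for the library's own types (assumed shapes).
public class CsvRow
{
    public string Key { get; set; } = "";
}

public class ComparisonException : Exception
{
    public ComparisonException(string message) : base(message) { }
}

// Hypothetical class holding the orphan state for one comparison.
public class OrphanTracker
{
    private readonly Dictionary<string, CsvRow> _referenceOrphans = new Dictionary<string, CsvRow>();
    private readonly Dictionary<string, CsvRow> _candidateOrphans = new Dictionary<string, CsvRow>();

    public CsvRow CheckReferenceOrphan(CsvRow candidateRow) =>
        CheckOrphan(candidateRow, _referenceOrphans, _candidateOrphans, "Candidate");

    public CsvRow CheckCandidateOrphan(CsvRow referenceRow) =>
        CheckOrphan(referenceRow, _candidateOrphans, _referenceOrphans, "Reference");

    // Looks for a match for the row among the other side's orphans. A match is
    // removed and returned; otherwise the row is stored as an orphan of its own
    // side and null is returned.
    private static CsvRow CheckOrphan(
        CsvRow row,
        Dictionary<string, CsvRow> otherSideOrphans,
        Dictionary<string, CsvRow> sameSideOrphans,
        string sideName)
    {
        if (otherSideOrphans.TryGetValue(row.Key, out CsvRow match))
        {
            otherSideOrphans.Remove(row.Key);
            return match;
        }

        if (sameSideOrphans.ContainsKey(row.Key))
        {
            throw new ComparisonException(
                $"{sideName} orphan {row.Key} already exists. This usually means the key columns do not define unique rows.");
        }

        sameSideOrphans.Add(row.Key, row);
        return null;
    }
}
```

The behaviour and exception messages are intended to match the two original methods; only the dictionary lookups are shared.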
Consider a value break in a row with a 3 part key for ColA, ColB, ColC:
A:B:SomeValue
We may wish to exclude value breaks where the Col C part is "SomeSpecialValue"
Add support for this
If one or both files are completely empty, the comparison should still complete and report meaningful results. At the moment the comparison thread is permanently waiting on loader completion.
The break detail currently reports the unique key for the break but it will also be useful to include the actual row number this occurs on for both the reference and candidate files.
Save all the breaks to a file.
Include summary information
Output of application will be summary and a link to the output file
The configuration allows absolute and relative tolerances to be defined, but this isn't implemented yet.
By default, apply the tolerance to all double value columns. If no tolerance is defined, report precise matches only.
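A sketch of how such a check might look. The exact semantics are assumptions, since the issue does not define them: absolute tolerance is taken as a bound on the difference, relative tolerance as a bound on the difference as a fraction of the reference value, and absolute tolerance takes precedence when both are given. `ToleranceCheck.Matches` is a hypothetical name.

```csharp
using System;

// Hypothetical helper: the tolerance semantics here are assumptions.
public static class ToleranceCheck
{
    public static bool Matches(
        double reference,
        double candidate,
        double? absoluteTolerance = null,
        double? relativeTolerance = null)
    {
        double difference = Math.Abs(reference - candidate);

        if (absoluteTolerance.HasValue)
        {
            // Assumed meaning: the values may differ by at most this amount.
            return difference <= absoluteTolerance.Value;
        }

        if (relativeTolerance.HasValue)
        {
            // Assumed meaning: the difference as a fraction of the reference value.
            return difference <= relativeTolerance.Value * Math.Abs(reference);
        }

        // No tolerance configured: only a precise match counts.
        return reference == candidate;
    }
}
```

For example, `ToleranceCheck.Matches(100.0, 104.0, relativeTolerance: 0.05)` accepts the pair under these assumptions, while the same values with no tolerance would be reported as a value mismatch.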
The date field in ComparisonResult should record the date the comparison was made. But this is not being set.
If a row ends with a quote, for example
A,B,"C,D"
then the method SplitStringWithQuotes does not exit.
Say, for example, we want to compare multiple csv files in a folder. These may be of different types. Rather than running multiple instances of the process, we should set up a configuration that contains multiple definitions. These can be associated with files using pattern matching. For example:
<configs>
  <config pattern="Type1*.csv">
    <definition>...Definition1...</definition>
  </config>
  <config pattern="Type2*.csv">
    <definition>...Definition2...</definition>
  </config>
</configs>
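Matching a file name against the `pattern` attribute could be sketched by translating the wildcard into a regular expression. `ConfigPatternMatcher` is a hypothetical name, and the case-insensitive match is an assumption made here because Windows file names are usually treated that way.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: not part of the existing library.
public static class ConfigPatternMatcher
{
    // Converts a wildcard pattern such as "Type1*.csv" to an anchored regex:
    // '*' matches any run of characters and '?' matches a single character.
    public static bool IsMatch(string fileName, string pattern)
    {
        string regex = "^"
            + Regex.Escape(pattern).Replace("\\*", ".*").Replace("\\?", ".")
            + "$";
        return Regex.IsMatch(fileName, regex, RegexOptions.IgnoreCase);
    }
}
```

Each file in the folder would then be compared using the first definition whose pattern matches its name; what to do with files that match no pattern is left open.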
If a break includes a comma (,) in the value, we want to make sure the value is enclosed in quotes. This way it will be easy to open in a spreadsheet, for example, and the values will stay aligned with the column headers.
We sometimes get breaks reported where there are a number of orphans (either Candidate or Reference).
This seems to happen because the comparison stops before all the items in the queue have been drained after loading has completed.
The path to the readme is pointing to RunningTracker, inadvertently set from visual studio for another project when setting up nuget.
This is breaking the Azure pipeline
There are 8 warnings in ComparisonUtils that need resolving
As the README is the main landing document, it should include more information.
The output is formatted as CSV but for the README it's just being shown raw. Change to tabulate the output to make it clearer what the differences mean
The RunDirectoryComparison and RunFileComparison methods seem to have unnecessary duplication, especially in the generation of the results file. Review this and refactor where necessary
Confirm the command-line output makes sense. For example if the output file is a *.BREAKS.csv then make it clear
I would like to add other comparison implementations in future. Create an interface to CSVComparer to allow this to be achieved
Consider checking two files with a difference at key value 1000:
COL A,COL B,COL C,COL D
:
999,AA,G,-9.05
1000,D,FF,10.5
1001,BB,FF,-9.05
and
COL A,COL B,COL C,COL D
:
999,AA,G,-9.05
1000,AA,H,-9.05
1001,BB,FF,-9.05
The row number reported for LHS and RHS is the same as the key. However, it should be row 1001 in the file because of the header row.
Break Type,Key (COL A),Column Name,LHS Row,LHS Value,RHS Row,RHS Value
ValueMismatch,1000,COL B,1000,D,1000,AA
ValueMismatch,1000,COL C,1000,FF,1000,H
ValueMismatch,1000,COL D,1000,10.5,1000,-9.05
The row number should be for the file.
When should configuration be applied: at class construction, or during comparison?
Given the configuration is stored as a class field, I think for consistency it should be applied in the constructor.
If the column is surrounded by " quotes we always perform a string comparison. In the case where the column is a double we would like to take account of tolerance considerations, for example
"String Value A","String Value B","10.5"
Run with command line
ExampleData\ReferenceFile.csv ExampleData\CandidateFile.csv ExampleData\Configuration.xml C:\temp\TestResults
There should be breaks for this comparison, but the output also shows "Saving results to ..\ComparisonResults.csv" for a file that doesn't exist:
Reference: ExampleData\ReferenceFile.csv
Candidate: ExampleData\CandidateFile.csv
Saving results to C:\temp\TestResults\ComparisonResults.csv
Saving results to C:\temp\TestResults\ComparisonResults.BREAKS.csv
Finished. Comparison took 13ms
We must only log the file that is actually being saved.