csvcomparer's Introduction

  • 👋 Hi, I'm Jonathan Scott, also known as JSco
  • 👀 I'm interested in Physics and the natural world.
  • 🌱 I'm currently restoring my old Physics analysis code, learning Quantum Computing and generally trying new things.
  • 📫 How to reach me: Twitter, LinkedIn

Blog

JScos Blog

csvcomparer's People

Contributors

jscott7

Forkers

jonadamsfromnc

csvcomparer's Issues

Use quotes in output where required

If a break value contains a comma, it should be enclosed in quotes. That way the output opens cleanly in a spreadsheet, for example, and the values stay aligned with the column headers.

Perform a value comparison when surrounded by " quotes

If a column value is surrounded by " quotes we always perform a string comparison. Where the column is a double we would like to take advantage of the tolerance settings, for example

"String Value A","String Value B","10.5"
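One way this could work, as a rough sketch only: strip one pair of surrounding quotes, then try to parse what is left as a double. If it parses, the value can go through the numeric comparison, otherwise it stays a string comparison. The class and method names below are hypothetical and not part of csvcomparer, and parsing with the invariant culture is an assumption.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not part of the csvcomparer API.
public static class QuotedValueParser
{
    // Removes one pair of surrounding double quotes, if present.
    public static string StripQuotes(string raw)
    {
        string trimmed = raw.Trim();
        bool quoted = trimmed.Length >= 2
            && trimmed[0] == '"'
            && trimmed[trimmed.Length - 1] == '"';
        return quoted ? trimmed.Substring(1, trimmed.Length - 2) : trimmed;
    }

    // True when the unquoted text parses as a double.
    // Assumption: numbers are written with the invariant culture ('.' as decimal separator).
    public static bool TryGetDouble(string raw, out double value)
    {
        return double.TryParse(
            StripQuotes(raw),
            NumberStyles.Float,
            CultureInfo.InvariantCulture,
            out value);
    }
}
```

With the example row, "10.5" would parse as a double while "String Value A" would not and would fall back to a string comparison.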

Improve logging of exceptions when a key definition doesn't define unique rows

If the key columns in the configuration do not define unique rows, the comparer can throw a confusing exception.

This exception should be improved to make it clear what is happening.

In this example if only column A is defined as the key column an exception could be thrown when checking the second row.

||A||B||C||
|1|1|A|
|1|2|B|
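As an illustration of the kind of message that would help, here is a minimal sketch that reports which rows share a key. It assumes the key for each data row is already available as a string; `KeyUniqueness` and `FindFirstDuplicate` are made-up names, not existing csvcomparer code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the csvcomparer API.
public static class KeyUniqueness
{
    // Returns a description of the first duplicate key, or null when all keys are unique.
    // Each entry in 'keys' is the key built for one data row, in file order.
    public static string FindFirstDuplicate(IReadOnlyList<string> keys)
    {
        var firstSeenAt = new Dictionary<string, int>();
        for (int i = 0; i < keys.Count; i++)
        {
            int rowNumber = i + 1; // 1-based data row, header not counted
            if (firstSeenAt.TryGetValue(keys[i], out int earlierRow))
            {
                return $"Key '{keys[i]}' appears on rows {earlierRow} and {rowNumber}. "
                    + "The key columns do not define unique rows.";
            }

            firstSeenAt[keys[i]] = rowNumber;
        }

        return null;
    }
}
```

For the table above with only column A as the key, both rows have key 1, so the message would name data rows 1 and 2. With A and B together as the key there is no duplicate.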

Review code duplication in ComparisonUtils

The RunDirectoryComparison and RunFileComparison methods seem to have unnecessary duplication, especially in the generation of the results file. Review this and refactor where necessary

Show full path to input file in output

Run a comparison with the following arguments:
ExampleData\ReferenceFile.csv ExampleData\CandidateFile.csv ExampleData\Configuration.xml C:\temp\TestResults

The output shows this:

Date run,26/02/2023 20:01:37
Reference,ExampleData\ReferenceFile.csv
Candidate,ExampleData\CandidateFile.csv

I would like the files reported here to include the full path

Improve clarity of output in README

The output is formatted as CSV but for the README it's just being shown raw. Change to tabulate the output to make it clearer what the differences mean

Implement Tolerance

The configuration allows for absolute and relative tolerance to be defined. This isn't yet implemented though

Default to all double value columns. If no tolerance is defined then report precise matches only
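A minimal sketch of what the two checks might look like, under assumptions: a missing tolerance is modelled as null and means exact matches only, and relative tolerance is measured against the reference value. Neither choice is confirmed by the current code, and the names are hypothetical.

```csharp
using System;

// Hypothetical helper for illustration; not part of the csvcomparer API.
public static class ToleranceCheck
{
    // Values match when they differ by no more than the absolute tolerance.
    // A null tolerance means "not configured", so only exact equality matches.
    public static bool WithinAbsolute(double reference, double candidate, double? tolerance)
    {
        if (tolerance == null)
        {
            return reference == candidate;
        }

        return Math.Abs(reference - candidate) <= tolerance.Value;
    }

    // Relative difference is taken against the reference value (an assumption).
    // When the reference is zero the ratio is undefined, so fall back to exact equality.
    public static bool WithinRelative(double reference, double candidate, double? tolerance)
    {
        if (tolerance == null || reference == 0.0)
        {
            return reference == candidate;
        }

        return Math.Abs((reference - candidate) / reference) <= tolerance.Value;
    }
}
```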

Fix success code

CompareRow returns false even if it is successful. This logic needs to be made obvious and clear

Comparison Runner output shows incorrect line for saving results

Run with command line
ExampleData\ReferenceFile.csv ExampleData\CandidateFile.csv ExampleData\Configuration.xml C:\temp\TestResults

There should be breaks for this comparison but the output also shows Saving results to ..\ComparisonResults.csv, which doesn't exist

Reference: ExampleData\ReferenceFile.csv
Candidate: ExampleData\CandidateFile.csv
Saving results to C:\temp\TestResults\ComparisonResults.csv
Saving results to C:\temp\TestResults\ComparisonResults.BREAKS.csv
Finished. Comparison took 13ms

We must only log the file actually being saved to

Check key column consistency

If a key column is defined in the configuration but that column doesn't exist in the CSV file, we should detect this and terminate the comparison. Otherwise there could be unexpected runtime behaviour.
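A possible shape for the check, as a sketch only: compare the configured key column names against the header row and list any that are missing, so the comparison can stop with a clear message. The names are hypothetical, and exact (case-sensitive) matching after trimming is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not part of the csvcomparer API.
public static class KeyColumnCheck
{
    // Returns the configured key columns that do not appear in the header row.
    // An empty result means the configuration is consistent with the file.
    public static List<string> MissingKeyColumns(
        IEnumerable<string> keyColumns,
        IEnumerable<string> headerColumns)
    {
        var header = new HashSet<string>(
            headerColumns.Select(h => h.Trim()),
            StringComparer.Ordinal);

        return keyColumns.Where(k => !header.Contains(k.Trim())).ToList();
    }
}
```

If the returned list is not empty, the caller could terminate before loading any rows and report the missing names.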

Row number is inconsistent in output

Consider checking two files with a difference at key value 1000

COL A,COL B,COL C,COL D
:
999,AA,G,-9.05
1000,D,FF,10.5
1001,BB,FF,-9.05

and

COL A,COL B,COL C,COL D
:
999,AA,G,-9.05
1000,AA,H,-9.05
1001,BB,FF,-9.05

The row number for LHS and RHS is the same as the key. However it should be row 1001 in the file because of the header row

Break Type,Key (COL A),Column Name,LHS Row,LHS Value,RHS Row,RHS Value
ValueMismatch,1000,COL B,1000,D,1000,AA
ValueMismatch,1000,COL C,1000,FF,1000,H
ValueMismatch,1000,COL D,1000,10.5,1000,-9.05

The row number should be for the file.

Save comparison output to file

Save all the breaks to a file.
Include summary information
Output of application will be summary and a link to the output file

Add column exclusion configuration

Some columns may contain unique data that will not match across comparisons, for example a GUID or a timestamp.
In this case we want the ability to exclude one or more columns.

Ignore trailing rows

Sometimes a CSV can have information at the end of the file.
This should be optionally ignored, for example

COLA, COLB, COLC
A,B,1
C,D,2
E,F,3
Some non-csv information
On extra Rows
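One naive way to spot such rows, sketched under a strong assumption: trailing lines whose field count differs from the header are treated as non-CSV and dropped. The split below does not understand quoted fields, and the names are hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the csvcomparer API.
public static class TrailingRows
{
    // Drops lines from the end of the input while their field count differs
    // from the header's. The header (first line) is never removed.
    // Naive: splits on the delimiter and ignores quoting.
    public static List<string> TrimTrailingNonCsv(IReadOnlyList<string> lines, char delimiter = ',')
    {
        var result = new List<string>(lines);
        if (result.Count == 0)
        {
            return result;
        }

        int expected = result[0].Split(delimiter).Length;
        while (result.Count > 1
            && result[result.Count - 1].Split(delimiter).Length != expected)
        {
            result.RemoveAt(result.Count - 1);
        }

        return result;
    }
}
```

For the example above this would keep the header and the three data rows. A trailing line that happens to have the same number of commas would not be caught, so a configured row count or marker may be a better option.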

Tidy class usage

When should configuration be applied? At class construction, or during comparison?

Given the configuration is stored as a class field I think for consistency it should be applied in the constructor

Support key exclusions based on regex

Consider a value break in a row with a 3 part key for ColA, ColB, ColC:

A:B:SomeValue

We may wish to exclude value breaks where the Col C part is "SomeSpecialValue"

Add support for this
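A rough sketch of how such an exclusion could be evaluated. It assumes the key parts are joined with ':' as in the example and that the exclusion names a key part by position; both are assumptions, and the names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the csvcomparer API.
public static class KeyExclusion
{
    // True when the key part at 'partIndex' matches the exclusion pattern.
    // Keys with fewer parts than expected are never excluded.
    public static bool IsExcluded(string key, int partIndex, Regex pattern, char separator = ':')
    {
        string[] parts = key.Split(separator);
        return partIndex >= 0
            && partIndex < parts.Length
            && pattern.IsMatch(parts[partIndex]);
    }
}
```

For example, with a pattern of ^SomeSpecialValue$ on part index 2, the key A:B:SomeSpecialValue would be excluded while A:B:SomeValue would still be reported.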

Add support for empty files

If one or both files are completely empty, the comparison should still complete and report meaningful results. At the moment the comparison thread is permanently waiting on loader completion.

Data with commas won't be presented in result CSV

The comparer handles columns with embedded delimiters but if there is a break in such a column, either as an orphan or a value mismatch, the output CSV file doesn't wrap this in quotes. This makes it difficult to view in an editor

COL A,COLB,COLC
A,x,"A value, with a comma"

We want this:

Break Type,Key,Reference Row,Reference Value,Candidate Row,Candidate Value
RowInReferenceNotInCandidate,A,1,"A value, with a comma",,
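A minimal sketch of the quoting step, assuming the usual CSV convention (RFC 4180) of wrapping the field in double quotes and doubling any embedded quotes. `CsvOutput.QuoteIfRequired` is a hypothetical name, not an existing method in the library.

```csharp
using System;

// Hypothetical helper for illustration; not part of the csvcomparer API.
public static class CsvOutput
{
    // Wraps the value in double quotes when it contains the delimiter,
    // a double quote or a line break. Embedded double quotes are doubled.
    public static string QuoteIfRequired(string value, char delimiter = ',')
    {
        if (string.IsNullOrEmpty(value))
        {
            return string.Empty;
        }

        bool needsQuotes = value.IndexOf(delimiter) >= 0
            || value.IndexOf('"') >= 0
            || value.IndexOf('\n') >= 0
            || value.IndexOf('\r') >= 0;

        return needsQuotes
            ? "\"" + value.Replace("\"", "\"\"") + "\""
            : value;
    }
}
```

Each value would pass through this before the output row is joined with the delimiter, so plain values such as A or 1 are written unchanged.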

Naming change Reference/Candidate to LeftHand/RightHand

When comparing in UIs it makes more sense to have the Left Hand/Right Hand name for the input CSVs rather than Reference/Candidate which is more for testing.

As this library is not explicitly a testing tool we should change the naming convention

Support multiple configurations in single process

Say, for example, we want to compare multiple CSV files in a folder. These may be of different types. Rather than running multiple instances of the process we should set up a configuration that contains multiple definitions. These can be associated with files using pattern matching. For example:

<configs>
 <config pattern="Type1*.csv">
    <definition>...Definition1...</definition>
 </config>
 <config pattern="Type2*.csv">
    <definition>...Definition2...</definition>
 </config>
</configs>
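The pattern attribute could be matched against file names along these lines. This is a sketch that assumes simple wildcard semantics, where * matches any run of characters and ? matches a single character, and that matching ignores case. `ConfigPattern` is a hypothetical name.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the csvcomparer API.
public static class ConfigPattern
{
    // Converts a wildcard pattern such as "Type1*.csv" into an anchored regex
    // and tests the file name against it.
    public static bool Matches(string pattern, string fileName)
    {
        // Regex.Escape turns '*' into "\*" and '?' into "\?", which are then
        // swapped for their regex equivalents.
        string regex = "^"
            + Regex.Escape(pattern).Replace("\\*", ".*").Replace("\\?", ".")
            + "$";

        return Regex.IsMatch(fileName, regex, RegexOptions.IgnoreCase);
    }
}
```

The runner would then pick the first config whose pattern matches the file name. What to do when no pattern or several patterns match is left open here.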

Add column name in breaks report

Although we have row information it would be clearer if the Reference Column and Candidate Column were also included in a break report

Break Type,Key,Reference Row,Reference Value,Candidate Row,Candidate Value
ValueMismatch,7,8,32.1,8,42.1

Becomes

Break Type,Key,Reference Row,Reference Col,Reference Value,Candidate Row,Candidate Col,Candidate Value
ValueMismatch,7,8,COL B,32.1,8,COL B,42.1

Exclude Orphan breaks from results based on patterns

There may be an orphan break that we don't want to consider as failing the reconciliation. Add the ability to exclude breaks based on patterns. Define a Regex in the configuration, for example:

  <OrphanExclusions>
      <ExclusionPattern>Regex</ExclusionPattern>
      <ExclusionPattern>Regex</ExclusionPattern>
  </OrphanExclusions>
</ComparisonDefinition>

Refactor duplicate orphan checks

The following methods in CSVComparer are duplicates. This should be refactored into a single method.

        private CsvRow CheckReferenceOrphan(CsvRow candidateRow)
        {
            CsvRow csvRow = null;

            if (_referenceOrphans.ContainsKey(candidateRow.Key))
            {
                csvRow = _referenceOrphans[candidateRow.Key];
                _referenceOrphans.Remove(candidateRow.Key);
            }
            else
            {
                if (_candidateOrphans.ContainsKey(candidateRow.Key))
                {
                    throw new ComparisonException($"Candidate orphan {candidateRow.Key} already exists. This usually means the key columns do not define unique rows.");
                }

                _candidateOrphans.Add(candidateRow.Key, candidateRow);
            }

            return csvRow;
        }

        private CsvRow CheckCandidateOrphan(CsvRow referenceRow)
        {
            CsvRow csvRow = null;

            if (_candidateOrphans.ContainsKey(referenceRow.Key))
            {
                csvRow = _candidateOrphans[referenceRow.Key];
                _candidateOrphans.Remove(referenceRow.Key);
            }
            else
            {
                if (_referenceOrphans.ContainsKey(referenceRow.Key))
                {
                    throw new ComparisonException($"Reference orphan {referenceRow.Key} already exists. This usually means the key columns do not define unique rows.");
                }

                _referenceOrphans.Add(referenceRow.Key, referenceRow);
            }

            return csvRow;
        }
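One possible refactoring, as a sketch: both methods delegate to a single helper that takes the two dictionaries and the name to use in the error message. `CsvRow` and `ComparisonException` below are minimal stand-ins so the example compiles on its own, the string key type is an assumption, and the two entry points are made public here only so they can be exercised in isolation.

```csharp
using System;
using System.Collections.Generic;

// Minimal stand-ins so the sketch compiles on its own; the real types live in the library.
public class CsvRow
{
    public string Key { get; set; } // assumption: keys are strings
}

public class ComparisonException : Exception
{
    public ComparisonException(string message) : base(message) { }
}

// Hypothetical wrapper class; in the library these members would sit in CSVComparer.
public class OrphanTracker
{
    private readonly Dictionary<string, CsvRow> _referenceOrphans = new Dictionary<string, CsvRow>();
    private readonly Dictionary<string, CsvRow> _candidateOrphans = new Dictionary<string, CsvRow>();

    public CsvRow CheckReferenceOrphan(CsvRow candidateRow) =>
        CheckOrphan(candidateRow, _referenceOrphans, _candidateOrphans, "Candidate");

    public CsvRow CheckCandidateOrphan(CsvRow referenceRow) =>
        CheckOrphan(referenceRow, _candidateOrphans, _referenceOrphans, "Reference");

    // Looks for the row's key among the other side's orphans. A hit is removed and
    // returned. Otherwise the row is stored as an orphan on its own side and null
    // is returned. A key already stored on its own side means the keys are not unique.
    private static CsvRow CheckOrphan(
        CsvRow row,
        Dictionary<string, CsvRow> otherSide,
        Dictionary<string, CsvRow> ownSide,
        string ownSideName)
    {
        if (otherSide.TryGetValue(row.Key, out CsvRow match))
        {
            otherSide.Remove(row.Key);
            return match;
        }

        if (ownSide.ContainsKey(row.Key))
        {
            throw new ComparisonException(
                $"{ownSideName} orphan {row.Key} already exists. This usually means the key columns do not define unique rows.");
        }

        ownSide.Add(row.Key, row);
        return null;
    }
}
```

The exception messages are kept identical to the originals, so behaviour seen by callers should not change.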

Update to dotnet core 3.1

Warning from CI (Azure pipeline)

##[warning]C:\Program Files\dotnet\sdk\5.0.201\Sdks\Microsoft.NET.Sdk\targets\Microsoft.NET.EolTargetFrameworks.targets(28,5): Warning NETSDK1138: The target framework 'netcoreapp3.0' is out of support and will not receive security updates in the future. Please refer to https://aka.ms/dotnet-core-support for more information about the support policy.

Allow same instance to run multiple comparisons

At the moment we need to create a new instance of the CSV Comparer for every comparison, because internal state is retained between runs. We should be able to reuse the same object for multiple comparisons.

Add the row numbers to the break detail

The break detail currently reports the unique key for the break but it will also be useful to include the actual row number this occurs on for both the reference and candidate files.

Update readme.md

As it's the main landing document this should include more information

Implement early termination

Early termination should happen where one side of the comparison is missing data, either the CSV file is missing, or it doesn't have data.
In this case the comparison should stop immediately and give a high-level report
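A minimal sketch of a pre-check that could run before any loading starts. It treats a zero-length file as having no data, which is an assumption, since a file containing only a header row may need the same treatment. `InputCheck` is a hypothetical name.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of the csvcomparer API.
public static class InputCheck
{
    // Returns null when the file looks usable, otherwise a short reason
    // that can go straight into the high-level report.
    public static string ReasonToSkip(string path)
    {
        if (!File.Exists(path))
        {
            return $"File not found: {path}";
        }

        if (new FileInfo(path).Length == 0)
        {
            return $"File is empty: {path}";
        }

        return null;
    }
}
```

Running this for both inputs up front means the comparison can return its report without starting the loader threads at all.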
