csvlint.io's People

Contributors

benfoxall, danielgavrilov, dependabot-preview[bot], dependabot-support, dependabot[bot], floppy, jolankester, ldodds, olivierthereaux, parkr, pezholio, quadrophobiac, thill-odi

csvlint.io's Issues

is Kaminari.paginate_array necessary?

I just came across this in the schemas_controller:

schemas = Schema.all
@schemas = Kaminari.paginate_array(schemas).page(params[:page])

I think paginate_array is for when you have a simple array object. With an AR query, I think you could have

@schemas = Schema.all.page params[:page]

This would have the advantage of calling limit/skip on the Schema query rather than loading all the items each time.

Allow validation of CSV dataset

For example, take a CKAN URL, validate all the CSVs in that dataset, and then cross-validate them.
Imagine you have a series of monthly CSVs: validating them one by one would take a while. Further, they may be valid individually but inconsistent as a collection, e.g. in their column titles.
It would be great to be able to detect errors across a whole dataset.
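
A minimal sketch of the cross-validation idea, assuming the header row of each CSV in the dataset has already been extracted. The helper name `inconsistent_headers` and the file names are hypothetical, and treating the most common header row as the expected one is an assumption, not something csvlint does today.

```ruby
# Hypothetical helper: given a hash of file name => header row, return the
# files whose header row differs from the most common one in the dataset.
def inconsistent_headers(headers_by_file)
  # Count how many files share each distinct header row.
  counts = headers_by_file.values.tally
  # Assumption: the most frequent header row is the "expected" one.
  expected, _count = counts.max_by { |_header, count| count }
  headers_by_file.reject { |_file, header| header == expected }
end

# Made-up monthly files for illustration.
files = {
  "2014-01.csv" => ["date", "amount", "supplier"],
  "2014-02.csv" => ["date", "amount", "supplier"],
  "2014-03.csv" => ["Date", "Amount", "Supplier Name"]
}

odd_ones = inconsistent_headers(files)
puts odd_ones.keys.inspect
```

Each file could still be validated on its own first; this check would only run once all the headers are known.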

Handle 404s gracefully

CSVs that 404 currently produce a broken image in the list view. Perhaps a "not found" badge would be better?

Colours in results table on report page are inconsistent

I think the colours in the table have not been implemented quite right. The background should be grey if the number is 0; otherwise it should be the appropriate colour.

[Screenshot]

So in this example:
errors/context should be red background
warnings/structure should be grey background
warnings/schema should be grey background
messages/structure should be grey background
messages/schema should be grey background
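
The rule above could be sketched as a small view helper. This is only an illustration: the helper name `cell_class` and the CSS class names are hypothetical, and the real stylesheet may use different ones.

```ruby
# Hypothetical CSS class names, one per severity.
SEVERITY_CLASSES = {
  errors: "error",
  warnings: "warning",
  messages: "message"
}.freeze

# Return the class for a results-table cell: grey when the count is 0,
# otherwise the class that matches the severity.
def cell_class(severity, count)
  return "grey" if count.zero?

  SEVERITY_CLASSES.fetch(severity)
end

puts cell_class(:errors, 1)
puts cell_class(:warnings, 0)
```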

Home page thoughts

  1. Possible text:

Check your CSV files with CSVLint

CSV looks easy, but it can be hard to make a CSV file that other people can read easily. CSVLint helps you to check that your CSV file is readable. And you can use it to check whether it contains the columns and types of values that it should.

Just enter the location of the file you want to check, or upload it. If you have a schema which describes the contents of the CSV file, you can also give its URL or upload it. Read more...

  2. Maybe number the steps 1 (enter CSV), 2 (enter schema if you want), 3 (hit the validate button).

Validate in background

In order to cope with big files, we need to:

  • Make sure we only revalidate every so often (define a minimum time period of validity)
  • Validate using a background job so we don't block the UI

Duplicate column names

http://csvlint.io/validation/530b5aa263737633f8000000 gives duplicate column name errors, as there's a title line on the first row, then a blank row, followed by headers.

It seems to be a common issue that people put information about the CSV in the first line (or few lines). Should we generate a separate error for this if the first few lines have one column? Something which in the front end would say:

"Your CSV seems to contain unstructured text at the beginning of the file. It is important that your CSV contains only structured data; any background information or metadata should be included on a referring web page or accompanying document."
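
One possible way to detect this, sketched on rows that have already been parsed into arrays of cells. The helper names are hypothetical, and the rule (count the single-cell rows that come before the first row with two or more non-empty cells) is an assumption about what "unstructured text at the beginning" means. A genuinely single-column CSV would be flagged in full, so a real check would need to handle that case.

```ruby
# Number of cells in a row that contain something other than whitespace.
def non_empty_cells(row)
  row.count { |cell| !cell.to_s.strip.empty? }
end

# Hypothetical check: count the title-like rows (exactly one non-empty cell)
# that appear before the first row with two or more non-empty cells.
# Blank rows in that leading section are skipped but not counted.
def leading_title_rows(rows)
  leading = rows.take_while { |row| non_empty_cells(row) <= 1 }
  leading.count { |row| non_empty_cells(row) == 1 }
end

# Made-up rows: a title line, a blank row, then the real header and data.
rows = [
  ["Spending over 25,000", nil, nil],
  [nil, nil, nil],
  ["date", "amount", "supplier"],
  ["2014-01-01", "100", "ACME"]
]

puts leading_title_rows(rows)
```

A count above zero could then raise the dedicated error instead of (or alongside) the duplicate column name error.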

Timed revalidation of CSVs

If a server doesn't support If-Modified-Since, then its CSVs will be revalidated every time. We should make it only revalidate every few hours, perhaps, with a manual 'revalidate' button to override.
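
A minimal sketch of that decision, independent of how the validation itself is run. The method name, the `force` flag for the manual button, and the three-hour minimum are all assumptions for illustration.

```ruby
# Assumed minimum period of validity ("every few hours" in the text above).
MIN_VALIDITY_SECONDS = 3 * 60 * 60

# Decide whether a CSV should be validated again.
# last_validated_at: Time of the previous validation, or nil if never validated.
# force: true when the user pressed a manual 'revalidate' button.
def revalidate?(last_validated_at, now: Time.now, force: false)
  return true if force || last_validated_at.nil?

  (now - last_validated_at) >= MIN_VALIDITY_SECONDS
end

validated_at = Time.at(0)
puts revalidate?(validated_at, now: Time.at(60))
puts revalidate?(validated_at, now: Time.at(60), force: true)
```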

Provide suggestions to user how to fix errors/warnings

We need to give users some feedback that helps them work out how to fix some of the errors/warnings.

We might be able to qualify this based on some additional information in the headers. For example, I just came across an interesting one while testing the Land Registry CSV data. This file reports the wrong content type:

http://publicdata.landregistry.gov.uk/market-trend-data/price-paid-data/b/pp-monthly-update-new-version.csv

It's served as application/octet-stream. Doing a HEAD request, I can see the Server is being reported as AmazonS3. We could provide some specific guidance here, e.g. that application/octet-stream is the default for S3 unless you specify the content type during upload.
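
As a rough sketch of how such guidance might be chosen, assuming the response headers are available as a hash with lower-cased keys. The method name and the message wording are hypothetical; only the S3 case described above is covered.

```ruby
# Hypothetical helper: return a suggestion string for a wrong content type,
# or nil when the file is already served as text/csv.
def content_type_hint(headers)
  # Drop any parameters such as "; charset=utf-8" before comparing.
  content_type = headers["content-type"].to_s.split(";").first.to_s.strip.downcase
  server = headers["server"].to_s

  return nil if content_type == "text/csv"

  if content_type == "application/octet-stream" && server.include?("AmazonS3")
    "This file appears to be hosted on Amazon S3. Set the content type " \
      "to text/csv when you upload it."
  else
    "Ask your server administrator to serve this file as text/csv."
  end
end

puts content_type_hint("content-type" => "application/octet-stream", "server" => "AmazonS3")
```

Further server-specific hints could be added as extra branches if other common cases turn up.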

Add missing messages

<span class="translation_missing" title="translation missing: en.wrong_content_type">wrong content type</span>

<span class="translation_missing" title="translation missing: en.quoting">quoting</span> on row 12

Brain dump of validating data

This is literally a brain dump, so apologies for the unstructured form.

Warning flags would go up if I found any of the following:

  • A column has fewer than 5% non-missing entries.
  • CSV contains only one column
  • Negative values for numeric variables (some valid exceptions exist of course)
  • Duplicate rows, especially if there's an identifier.
  • Characters are used in suspicious ways, e.g. em-dash and en-dash in dates
  • Headers appear within the data and not just in the first row
  • Dates.
  • What should be numeric formats are parsed as strings because they include symbols, e.g. "%" or "£" or even just ",".
  • Symbols like "Â" are often encoding errors.
  • Column names are not recognised, or are recognised erroneously, because of whitespace.
  • What is the indicator for missing data? Are 0s actually missings or vice versa?
  • Is there more than one type of missing data? Coded as "." or perhaps ".a"?
  • Are there any patterns that would indicate a column got shifted half-way down the file e.g. one column to the right?
  • If there's a unique identifier is it unique?
  • Is there data in a column that is a gross outlier? E.g. a table of regional population statistics that includes overall UK population in the last row, which is several magnitudes larger.
  • Validate common data columns, e.g. postcodes or email addresses
  • The first and last rows follow a different pattern than the rest.
  • Maximum character length of a string column is suspiciously long.
  • Boolean variables are not boolean, i.e. include misspelled T/F or other values than (0,1). (Though sometimes factors are coded with (1,2).)
  • Large percentage of numeric data is sprinkled with a few character strings.
  • Fields start with code language such as "{" or "<"

and my favourite one:

  • Infinite jest: CSV-file comes from a European country where they use ";" as separator and "," for the decimal point.
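
To give a flavour of how one of these checks could look, here is a sketch of the first item in the list (columns with fewer than 5% non-missing entries). It works on rows already parsed into arrays, treats empty or whitespace-only cells as missing, and uses a hypothetical method name; other missing-data markers such as "." are not handled.

```ruby
# Hypothetical check: return the names of columns where the share of
# non-missing cells is below the threshold (5% by default).
def sparse_columns(header, rows, threshold: 0.05)
  return [] if rows.empty?

  sparse_indexes = (0...header.size).select do |i|
    filled = rows.count { |row| !row[i].to_s.strip.empty? }
    filled.to_f / rows.size < threshold
  end
  sparse_indexes.map { |i| header[i] }
end

# Made-up data: 40 rows, "notes" filled in only once (2.5% of rows).
header = ["id", "notes"]
rows = (1..40).map { |n| [n.to_s, n == 7 ? "checked" : nil] }

puts sparse_columns(header, rows).inspect
```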

Here's a terrific example:

http://data.gov.uk/data/dumps/data.gov.uk-ckan-meta-data-latest.csv.zip

Layout of homepage

Just playing with layouts for the homepage. @JeniT we talked about making it clear that you only have to fill in 1 field: do you think either of these layouts is clearer?

[Screenshot: layout 1]

[Screenshot: layout 2]

(Note that the 2nd layout would need some tweaking as there isn't much room for the url / filename )

Headers in CSV files

Following discussion on the CSV on the Web Working Group, I think we should downgrade issues with headers to warnings rather than errors, or at least in general. More specifically:

  • give an error if there is no header line and there's no Content-Type header or the Content-Type header doesn't include the parameter header=absent
  • give a warning if there is no header and the file has been uploaded rather than retrieved through HTTP
  • :empty_column_name and :duplicate_column_name should be warnings rather than errors
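
The first two rules could be sketched as follows. The method name, the `source` values, and the choice to return nil when the server declares header=absent are assumptions; the list above does not say what should happen in that last case.

```ruby
# Hypothetical helper: severity to report when a CSV has no header line.
# source: :upload for uploaded files, :http for files retrieved over HTTP.
# content_type: raw Content-Type header value, or nil if there was none.
def missing_header_severity(source:, content_type: nil)
  return :warning if source == :upload

  # Split "text/csv; header=absent" into ["text/csv", "header=absent"].
  parts = content_type.to_s.downcase.split(";").map(&:strip)
  # Assumption: a declared header=absent means there is nothing to report.
  parts.include?("header=absent") ? nil : :error
end

puts missing_header_severity(source: :http, content_type: "text/csv").inspect
puts missing_header_severity(source: :upload).inspect
```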

Validations Page

Tasks:

· current 'File URL' column to change functionality on-click to 'view report'
· current 'view-report' becomes 'download file'
· possible to support an optional 'file name' in current 'File URL' column? will need separate column? how often is this column likely to be filled? - future functionality - consider supporting in design

Validation dashboard

@JeniT as discussed we are proposing pulling out key information into a dashboard style.

Errors will continue to be grouped by category (probably in the form of an accordion) and colours will continue to be used to indicate severity.

[Wireframe: odi_csv_lint_wireframes-03]

· include 'Messages' under errors & warnings in dashboard
· colour code the dashboard errors/warnings/messages in line with report
· change yellow of warnings to amber to bring further in line with 'traffic light' system
· create 'accordion' sections of report - structural/schema/context problem with breakdown of errors/warnings/messages
· include column confliction form under dashboard
· allow space for multiple 'badges' in dashboard render

Homepage

@JeniT As discussed we will remove the "Powered by the ODI" link in the header. Stephen will update this issue with a more high fidelity version soon.

[Wireframe: odi_csv_lint_wireframes-04]

Tasks:
· Remove ODI logo from top right of header nav
· Menu moves into header nav with same styling as ODI ( http://theodi.org )
· Add 'About' section to menu illustrated above
· Insert existing copy/edit of 'About' text
· remove 'Home' title
· Social links in footer?
· investigate possibility of cleaner solution to double tabbed 'upload/from URL'
· Correct wording of validation form
