CSV inspection

csv-detective's Introduction

CSV Detective

CSV Detective is a package that automatically detects the content of each column in a CSV file. For now, it reads the first few rows of the CSV and checks each column against a set of known content types, using regular expressions and string comparison.

How To ?

Install the package

You need Python >= 3.7 installed. We recommend using a virtual environment.

pip install csv-detective

Detect some columns

Say you have a CSV file located in file_path. This is how you could use csv_detective:

# Import the csv_detective package
from csv_detective.explore_csv import routine
import os # for this example only
import json # for json dump only

# Replace with your file path
file_path = os.path.join('.', 'tests', 'code_postaux_v201410.csv')

# Open your file and run csv_detective
inspection_results = routine(
  file_path,
  num_rows=-1, # -1 analyzes every line of the CSV; set a positive number to analyze only that many lines
  output_mode="LIMITED", # Default is "LIMITED"; use output_mode="ALL" to get the result of every detection made
  save_results=False, # Default False. If True, the output is saved in the same directory as the analyzed CSV
  output_profile=True, # Default False. If True, the returned dict contains a "profile" property with the profile (min, max, mean, tops...) of every column of your CSV
  output_schema=True, # Default False. If True, the returned dict contains a "schema" property holding a basic [tableschema](https://specs.frictionlessdata.io/table-schema/) of your file. It can be used to validate other CSVs that should share the same structure.
)


# Write your file as json
with open(file_path.replace('.csv', '.json'), 'w', encoding='utf8') as fp:
    json.dump(inspection_results, fp, indent=4, separators=(',', ': '))

So What Do You Get ?

Output

The program creates a Python dictionary with the following information:

{
    "heading_columns": 0,        # Number of heading columns
    "encoding": "windows-1252",  # Encoding detected
    "ints_as_floats": [],        # Columns where integers may be represented as floats
    "trailing_columns": 0,       # Number of trailing columns
    "headers": ['code commune INSEE', 'nom de la commune', 'code postal', "libell\\u00e9 d'acheminement\n"], # Header row
    "separator": ";",            # Detected CSV separator
    "headers_row": 0,            # Index of the header row
    "columns": { # Reconciles the detections made from the label and from the content of each column
        "Code commune": {
            "python_type": "string",
            "format": "code_commune_insee",
            "score": 1.25
        },
    },
    "columns_labels": { # Detection based on the column headers
        "Code commune": {
            "python_type": "string",
            "format": "code_commune_insee",
            "score": 0.5
        },
    },
    "columns_fields": { # Detection based on the column contents
        "Code commune": {
            "python_type": "string",
            "format": "code_commune_insee",
            "score": 1.0
        },
    },
    "profile": {
      "column_name" : {
        "min": 1, # only int and float
        "max": 12, # only int and float
        "mean": 5, # only int and float
        "std": 5, # only int and float
        "tops": [  # limited to 10
          "xxx",
          "yyy",
          "..."
        ],
        "nb_distinct": 67,
        "nb_missing_values": 102
      }
    },
    "schema": {
      "$schema": "https://frictionlessdata.io/schemas/table-schema.json",
      "name": "",
      "title": "",
      "description": "",
      "countryCode": "FR",
      "homepage": "",
      "path": "https://github.com/etalab/csv-detective",
      "resources": [],
      "sources": [
        {"title": "Spécification Tableschema", "path": "https://specs.frictionlessdata.io/table-schema"},
        {"title": "schema.data.gouv.fr", "path": "https://schema.data.gouv.fr"}
      ],
      "created": "2023-02-10",
      "lastModified": "2023-02-10",
      "version": "0.0.1",
      "contributors": [
        {"title": "Table schema bot", "email": "[email protected]", "organisation": "data.gouv.fr", "role": "author"}
      ],
      "fields": [
        {
          "name": "Code commune",
          "description": "Le code INSEE de la commune",
          "example": "23150",
          "type": "string",
          "formatFR": "code_commune_insee",
          "constraints": {
            "required": False,
            "pattern": "^([013-9]\\d|2[AB1-9])\\d{3}$",
          }
        }
      ]
    }
}
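
As an illustration of how this dictionary might be consumed, the sketch below groups column names by their detected format. The sample dictionary is hand-written to mirror the structure documented above (the "Population" column is hypothetical), and `columns_by_format` is an illustrative helper, not part of csv_detective.

```python
from collections import defaultdict

# Hypothetical sample mirroring the documented "columns" structure.
inspection_results = {
    "columns": {
        "Code commune": {"python_type": "string", "format": "code_commune_insee", "score": 1.25},
        "Population": {"python_type": "int", "format": "int", "score": 1.0},
    }
}


def columns_by_format(results):
    """Group column names by the format reported in results["columns"]."""
    grouped = defaultdict(list)
    for name, detection in results.get("columns", {}).items():
        grouped[detection["format"]].append(name)
    return dict(grouped)


formats = columns_by_format(inspection_results)
print(formats)
```

Grouping by format makes it easy to pick out, for example, every column that holds INSEE commune codes when linking datasets together.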

What Formats Can Be Detected

Includes:

Communes, departments, regions, countries
Commune codes, postal codes, department codes, ISO country codes
CSP codes, CSP descriptions, SIREN
Emails, URLs, French phone numbers
Years, dates, French days of the week
UUIDs, Mongo ObjectIds

Format detection and scoring

For each column, three scores are computed for each format. The higher the score, the more likely the format:

the field score based on the values contained in the column (0.0 to 1.0).
the label score based on the header of the column (0.0 to 1.0).
the overall score, computed as field_score * (1 + label_score/2) (0.0 to 1.5).

The overall score computation aims to give more weight to the column contents while still leveraging the column header.
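
The formula can be checked with a few lines of Python. This is only a sketch of the arithmetic stated above; the function name `overall_score` is illustrative and is not taken from the package.

```python
def overall_score(field_score: float, label_score: float) -> float:
    """Combine the content-based and header-based scores as described above."""
    return field_score * (1 + label_score / 2)


print(overall_score(1.0, 0.5))  # 1.25
print(overall_score(1.0, 1.0))  # 1.5
print(overall_score(0.0, 1.0))  # 0.0
```

Because the field score is a multiplier, a matching header alone can never produce a non-zero overall score: the header can only boost a format that the column contents already support.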

`output_mode` - Select the output mode for the JSON report

This option selects the output mode of the report. Pass an output_mode argument to the routine function; it accepts two values:

output_mode='LIMITED' (the default): the report contains only the column formats detected above a pre-selected threshold proportion in the data. This is the standard output (see the example in the 'Output' section above). Only the format with the highest score is reported for each column.
output_mode='ALL': the report lists every possible format for each input column, along with a value giving the proportion of the column's data that matches that format. With this report, users can tune their own detection rules around a specific threshold and get a better view of detection quality for each column. The results can also easily be turned into a dataframe (formats as columns, column names as rows) for analysis and testing.

TODO (this list is too long)

Clean up
Make more robust
Batch analyse
Command line interface
Improve output format
Improve testing structure to make modular searches (search only for cities for example)
Get rid of pandas dependency
Improve pre-processing and pre-processing tracing (removing heading rows for example)
Make differentiated pre-processing (no lower case for country codes for example)
Give a sense of probability in the prediction
Add more and more detection modules...

Related ideas:

store column names to train a model based on column names (possibly as a pre-screen)
normalising data based on column prediction
entity resolution (good luck...)

Why Could This Be of Any Use ?

Organisations such as data.gouv aggregate huge amounts of un-normalised data. Performing cross-examination across datasets can be difficult. This tool could help enrich the datasets' metadata and facilitate linking them together.

Here is a project (just started) that has code to download all CSV files from the data.gouv website and analyse them using csv_detective.

Release

The release process uses bumpr.

pip install -r requirements-build.txt

Process

bumpr will handle bumping the version according to your command (patch, minor, major)
It will update the CHANGELOG according to the new version being published
It will push a tag with the given version to GitHub
CircleCI will pick up this tag, build the package and publish it to PyPI
bumpr will have everything ready for the next version (version, changelog...)

Dry run

bumpr -d -v

Release

This will release a patch version:

bumpr -v

See bumpr options for minor and major:

$ bumpr -h
usage: bumpr [-h] [--version] [-v] [-c CONFIG] [-d] [-st] [-b | -pr] [-M] [-m] [-p]
             [-s SUFFIX] [-u] [-pM] [-pm] [-pp] [-ps PREPARE_SUFFIX] [-pu]
             [--vcs {git,hg}] [-nc] [-P] [-nP]
             [file] [files ...]

[...]

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -v, --verbose         Verbose output
  -c CONFIG, --config CONFIG
                        Specify a configuration file
  -d, --dryrun          Do not write anything and display a diff
  -st, --skip-tests     Skip tests
  -b, --bump            Only perform the bump
  -pr, --prepare        Only perform the prepare

bump:
  -M, --major           Bump major version
  -m, --minor           Bump minor version
  -p, --patch           Bump patch version
  -s SUFFIX, --suffix SUFFIX
                        Set suffix
  -u, --unsuffix        Unset suffix

[...]

csv-detective's Issues

Add possibility to read from URL

Instead of having to download the file, could set an option to enable reading directly from URL
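
Until such an option exists, one possible workaround is to copy the remote file to a temporary location and pass that path to routine. The sketch below uses only the standard library; `download_to_tempfile` is a hypothetical helper and the URL in the comment is a placeholder.

```python
import shutil
import tempfile
import urllib.request


def download_to_tempfile(url: str, suffix: str = ".csv") -> str:
    """Copy the resource at `url` into a temporary file and return its path."""
    with urllib.request.urlopen(url) as response:
        with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
            shutil.copyfileobj(response, tmp)
            return tmp.name


# Usage (placeholder URL, requires csv_detective to be installed):
# local_path = download_to_tempfile("https://example.org/some_file.csv")
# inspection_results = routine(local_path, num_rows=-1)
```

The temporary file is not deleted automatically, so the caller would need to remove it once the analysis is done.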

Add support for detection of ISO 3166-1 alpha-3 country codes.

Currently only alpha2 is supported.

Make sure the list of INSEE codes is kept up to date

https://github.com/etalab/csv-detective/tree/master/csv_detective/detect_fields/FR/geo/code_commune_insee

Some columns that contain numbers are detected as latitude or longitude even though they are not

Example:
https://csvapi-front.etalab.studio/?url=https://www.data.gouv.fr/fr/datasets/r/5c4e1452-3850-4b59-b11c-3dd51d7fb8b5
Look at columns tx_pos and tx_incid

Lat lon wgs seems to be prioritised over lat long wgs FR when it should be the other way around.

If lat lon WGS coordinates lie on the French territory they should be considered lat lon WGS FR
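
A minimal sketch of the requested priority rule, shown for latitudes only. The bounds used for metropolitan France are rough, assumed values chosen for illustration, and `preferred_latitude_format` is a hypothetical helper rather than the package's detection code.

```python
# Assumed, approximate latitude range of metropolitan France (illustrative only).
FR_LAT_MIN, FR_LAT_MAX = 41.0, 51.5


def preferred_latitude_format(values):
    """Prefer the more specific FR format when every value fits in the FR range."""
    if not values:
        return None
    if all(FR_LAT_MIN <= v <= FR_LAT_MAX for v in values):
        return "latitude_wgs_fr_metropole"
    if all(-90 <= v <= 90 for v in values):
        return "latitude_wgs"
    return None


print(preferred_latitude_format([48.85, 43.30]))   # latitude_wgs_fr_metropole
print(preferred_latitude_format([48.85, -33.90]))  # latitude_wgs
```

Testing the narrower range first is what gives the FR format priority: any column that passes it would also pass the generic WGS check.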

CamelCase label handling

Label DateBlabla (camelCase) should be treated as date, which is currently not the case.
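
One way this could be approached is to split labels on case changes before matching them against known keywords. The sketch below is an assumption about a possible fix, not the package's current behaviour; `label_tokens` is a hypothetical helper.

```python
import re


def label_tokens(label: str) -> list:
    """Split a header into lowercase tokens on case changes and separators."""
    # Insert a space wherever a lowercase letter or digit is followed by an uppercase letter.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    # Then split on anything that is not a letter or a digit.
    return [token.lower() for token in re.split(r"[^A-Za-z0-9]+", spaced) if token]


print(label_tokens("DateBlabla"))    # ['date', 'blabla']
print(label_tokens("date_der_maj"))  # ['date', 'der', 'maj']
```

With the label tokenised this way, a keyword such as "date" can be matched regardless of whether the header uses camelCase or underscores.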

Allow reading dataframes instead of files

Currently csv-detective can only analyse a file. It should also be possible to analyse a dataframe.

Recode all tests

The tests depend heavily on previous versions and do not work properly with the current one.
==> Clean tests need to be written for the repo.

date or datetime as python_type?

Would it make sense to output date or datetime as python_type for date_der_maj in the following example?

{
   "encoding":"UTF-8-SIG",
   "separator":";",
   "header_row_idx":0,
   "header":[
      "cle_interop",
      "uid_adresse",
      "voie_nom",
      "numero",
      "suffixe",
      "commune_nom",
      "position",
      "x",
      "y",
      "long",
      "lat",
      "source",
      "date_der_maj",
      "refparc",
      "voie_nom_eu",
      "complement"
   ],
   "total_lines":82,
   "heading_columns":0,
   "trailing_columns":0,
   "continuous":[
      "x",
      "y",
      "long",
      "lat"
   ],
   "categorical":[
      "uid_adresse",
      "suffixe",
      "commune_nom",
      "position",
      "source",
      "complement"
   ],
   "columns_fields":{
      "cle_interop":{
         "python_type":"float",
         "format":"float",
         "score":1.0
      },
      "uid_adresse":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "voie_nom":{
         "python_type":"string",
         "format":"adresse",
         "score":1.0
      },
      "numero":{
         "python_type":"int",
         "format":"int",
         "score":1.0
      },
      "suffixe":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "commune_nom":{
         "python_type":"string",
         "format":"commune",
         "score":1.0
      },
      "position":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "x":{
         "python_type":"float",
         "format":"longitude_l93",
         "score":0.9795918367346939
      },
      "y":{
         "python_type":"float",
         "format":"latitude_l93",
         "score":1.0
      },
      "long":{
         "python_type":"float",
         "format":"latitude_wgs",
         "score":1.0
      },
      "lat":{
         "python_type":"float",
         "format":"longitude_wgs",
         "score":1.0
      },
      "source":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "date_der_maj":{
         "python_type":"string",
         "format":"date",
         "score":1.0
      },
      "refparc":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "voie_nom_eu":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "complement":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      }
   },
   "columns_labels":{
      "cle_interop":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "uid_adresse":{
         "python_type":"string",
         "format":"adresse",
         "score":0.5
      },
      "voie_nom":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "numero":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "suffixe":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "commune_nom":{
         "python_type":"string",
         "format":"commune",
         "score":0.5
      },
      "position":{
         "python_type":"string",
         "format":"latlon_wgs",
         "score":1.0
      },
      "x":{
         "python_type":"float",
         "format":"longitude_wgs_fr_metropole",
         "score":1.0
      },
      "y":{
         "python_type":"float",
         "format":"latitude_wgs_fr_metropole",
         "score":1.0
      },
      "long":{
         "python_type":"float",
         "format":"longitude_wgs_fr_metropole",
         "score":1.0
      },
      "lat":{
         "python_type":"float",
         "format":"latitude_wgs_fr_metropole",
         "score":1.0
      },
      "source":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "date_der_maj":{
         "python_type":"string",
         "format":"date",
         "score":1.0
      },
      "refparc":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "voie_nom_eu":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "complement":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      }
   },
   "columns":{
      "cle_interop":{
         "python_type":"float",
         "format":"float",
         "score":1.0
      },
      "uid_adresse":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "voie_nom":{
         "python_type":"string",
         "format":"adresse",
         "score":1.0
      },
      "numero":{
         "python_type":"int",
         "format":"int",
         "score":1.0
      },
      "suffixe":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "commune_nom":{
         "python_type":"string",
         "format":"commune",
         "score":1.25
      },
      "position":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "x":{
         "python_type":"float",
         "format":"longitude_l93",
         "score":1.4693877551020407
      },
      "y":{
         "python_type":"float",
         "format":"latitude_l93",
         "score":1.5
      },
      "long":{
         "python_type":"float",
         "format":"longitude_wgs_fr_metropole",
         "score":1.5
      },
      "lat":{
         "python_type":"float",
         "format":"latitude_wgs_fr_metropole",
         "score":1.5
      },
      "source":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "date_der_maj":{
         "python_type":"string",
         "format":"date",
         "score":1.5
      },
      "refparc":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "voie_nom_eu":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      },
      "complement":{
         "python_type":"string",
         "format":"string",
         "score":1.0
      }
   },
   "formats":{
      "float":[
         "cle_interop"
      ],
      "string":[
         "uid_adresse",
         "suffixe",
         "position",
         "source",
         "refparc",
         "voie_nom_eu",
         "complement"
      ],
      "adresse":[
         "voie_nom"
      ],
      "int":[
         "numero"
      ],
      "commune":[
         "commune_nom"
      ],
      "longitude_l93":[
         "x"
      ],
      "latitude_l93":[
         "y"
      ],
      "longitude_wgs_fr_metropole":[
         "long"
      ],
      "latitude_wgs_fr_metropole":[
         "lat"
      ],
      "date":[
         "date_der_maj"
      ]
   }
}

Wrong type and header line detection

header_row_idx should be 1 (there are two duplicate header lines)
NUMCOM and NUMDEP should not be detected as int (Corsica forever)

http://data.caf.fr/dataset/f6411f07-10bf-4f13-b4fb-8d30ba9328b5/resource/94a182c4-19c8-4d3a-987c-187a49756365/download/txcouvglo2014.csv

[:~] $ head /Users/alexandre/Downloads/txcouvglo2014.csv
NUMCOM;NOMCOM;NUMDEP;NOMDEP;NUMEPCI;NOMEPCI;TXCOUVGLO_COM_2014;TXCOUVGLO_DEP_2014;TXCOUVGLO_EPCI_2014
NUMCOM;NOMCOM;NUMDEP;NOMDEP;NUMEPCI;NOMEPCI;TXCOUVGLO_COM_2014;TXCOUVGLO_DEP_2014;TXCOUVGLO_EPCI_2014
01001;L'ABERGEMENT-CLEMENCIAT;01;AIN;200035210;CC CHALARONNE CENTRE;41.7;65.2;72.9
01002;L'ABERGEMENT-DE-VAREY;01;AIN;240100883;CC DE LA PLAINE DE L'AIN;34.1;65.2;75.2
01004;AMBERIEU-EN-BUGEY;01;AIN;240100883;CC DE LA PLAINE DE L'AIN;61.8;65.2;75.2
01005;AMBERIEUX-EN-DOMBES;01;AIN;200042497;CC DOMBES SAONE VALLEE;73.6;65.2;77.8
01006;AMBLEON;01;AIN;200040350;CC BUGEY SUD;93.1;65.2;52.4
01007;AMBRONAY;01;AIN;240100883;CC DE LA PLAINE DE L'AIN;51.4;65.2;75.2
01008;AMBUTRIX;01;AIN;240100883;CC DE LA PLAINE DE L'AIN;92;65.2;75.2
01009;ANDERT-ET-CONDON;01;AIN;200040350;CC BUGEY SUD;34.2;65.2;52.4
[:~] $ tail /Users/alexandre/Downloads/txcouvglo2014.csv
97415;SAINT-PAUL;974;LA REUNION;249740101;CA TERRITOIRE DE LA COTE OUEST (TCO);33.2;25.8;29
97416;SAINT-PIERRE;974;LA REUNION;249740077;CA CIVIS (COMMUNAUTE INTERCOMMUNALE DES VILLES SOLIDAIRES);34.5;25.8;25.8
97417;SAINT-PHILIPPE;974;LA REUNION;249740085;CA DU SUD;16.9;25.8;18.4
97418;SAINTE-MARIE;974;LA REUNION;249740119;CA INTERCOMMUNALE DU NORD DE LA REUNION (CINOR);32;25.8;31.1
97419;SAINTE-ROSE;974;LA REUNION;249740093;CA INTERCOMMUNALE DE LA REUNION EST (CIREST);17.2;25.8;20
97420;SAINTE-SUZANNE;974;LA REUNION;249740119;CA INTERCOMMUNALE DU NORD DE LA REUNION (CINOR);28.1;25.8;31.1
97421;SALAZIE;974;LA REUNION;249740093;CA INTERCOMMUNALE DE LA REUNION EST (CIREST);17.7;25.8;20
97422;LE TAMPON;974;LA REUNION;249740085;CA DU SUD;20.3;25.8;18.4
97423;LES TROIS-BASSINS;974;LA REUNION;249740101;CA TERRITOIRE DE LA COTE OUEST (TCO);14.3;25.8;29
97424;CILAOS;974;LA REUNION;249740077;CA CIVIS (COMMUNAUTE INTERCOMMUNALE DES VILLES SOLIDAIRES);9.1;25.8;25.8
[:~] $ grep "2A" /Users/alexandre/Downloads/txcouvglo2014.csv
2A001;AFA;2A;CORSE DU SUD;242010056;CA DU PAYS AJACCIEN;32.6;35.8;27.6
2A004;AJACCIO;2A;CORSE DU SUD;242010056;CA DU PAYS AJACCIEN;29.5;35.8;27.6
2A006;ALATA;2A;CORSE DU SUD;242010056;CA DU PAYS AJACCIEN;20.8;35.8;27.6
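
The sample rows above suggest why an integer type is wrong for NUMCOM and NUMDEP: the values carry leading zeros ("01001") or letters ("2A"). The sketch below shows one conservative check that would reject such columns; it is an illustrative assumption, not the package's detection logic, and it deliberately ignores signs and thousands separators.

```python
def looks_like_int_column(values):
    """Return True only if every non-empty value is digits without a leading zero."""
    cleaned = [v.strip() for v in values if v.strip()]
    if not cleaned:
        return False
    return all(v.isdigit() and (v == "0" or not v.startswith("0")) for v in cleaned)


print(looks_like_int_column(["01001", "01002", "97424"]))  # False: leading zeros
print(looks_like_int_column(["01", "2A", "974"]))          # False: "2A" is not numeric
print(looks_like_int_column(["200035210", "240100883"]))   # True
```

Treating a leading zero as evidence of an identifier keeps codes such as "01001" as strings, so they survive a round trip without losing characters.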

{
   "header":[
      "NUMCOM",
      "NOMCOM",
      "NUMDEP",
      "NOMDEP",
      "NUMEPCI",
      "NOMEPCI",
      "TXCOUVGLO_COM_2014",
      "TXCOUVGLO_DEP_2014",
      "TXCOUVGLO_EPCI_2014"
   ],
   "columns":{
      "NOMCOM":{
         "score":1.0,
         "format":"commune",
         "python_type":"string"
      },
      "NOMDEP":{
         "score":1.0,
         "format":"departement",
         "python_type":"string"
      },
      "NUMCOM":{
         "score":1.0,
         "format":"int",
         "python_type":"int"
      },
      "NUMDEP":{
         "score":1.0,
         "format":"int",
         "python_type":"int"
      },
      "NOMEPCI":{
         "score":1.0,
         "format":"string",
         "python_type":"string"
      },
      "NUMEPCI":{
         "score":1.0,
         "format":"siren",
         "python_type":"string"
      },
      "TXCOUVGLO_COM_2014":{
         "score":1.0,
         "format":"float",
         "python_type":"float"
      },
      "TXCOUVGLO_DEP_2014":{
         "score":1.0,
         "format":"float",
         "python_type":"float"
      },
      "TXCOUVGLO_EPCI_2014":{
         "score":1.0,
         "format":"float",
         "python_type":"float"
      }
   },
   "formats":{
      "int":[
         "NUMCOM",
         "NUMDEP"
      ],
      "float":[
         "TXCOUVGLO_COM_2014",
         "TXCOUVGLO_DEP_2014",
         "TXCOUVGLO_EPCI_2014"
      ],
      "siren":[
         "NUMEPCI"
      ],
      "string":[
         "NOMEPCI"
      ],
      "commune":[
         "NOMCOM"
      ],
      "departement":[
         "NOMDEP"
      ]
   },
   "encoding":"ISO-8859-1",
   "separator":";",
   "continuous":[
      "TXCOUVGLO_DEP_2014",
      "TXCOUVGLO_EPCI_2014"
   ],
   "categorical":[
      
   ],
   "total_lines":36636,
   "columns_fields":{
      "NOMCOM":{
         "score":1.0,
         "format":"commune",
         "python_type":"string"
      },
      "NOMDEP":{
         "score":1.0,
         "format":"departement",
         "python_type":"string"
      },
      "NUMCOM":{
         "score":1.0,
         "format":"code_commune_insee",
         "python_type":"string"
      },
      "NUMDEP":{
         "score":1.0,
         "format":"code_departement",
         "python_type":"string"
      },
      "NOMEPCI":{
         "score":1.0,
         "format":"string",
         "python_type":"string"
      },
      "NUMEPCI":{
         "score":1.0,
         "format":"siren",
         "python_type":"string"
      },
      "TXCOUVGLO_COM_2014":{
         "score":1.0,
         "format":"float",
         "python_type":"float"
      },
      "TXCOUVGLO_DEP_2014":{
         "score":0.9183673469387755,
         "format":"latitude_wgs",
         "python_type":"float"
      },
      "TXCOUVGLO_EPCI_2014":{
         "score":0.9387755102040817,
         "format":"longitude_wgs",
         "python_type":"float"
      }
   },
   "columns_labels":{
      "NOMCOM":{
         "score":1.0,
         "format":"string",
         "python_type":"string"
      },
      "NOMDEP":{
         "score":1.0,
         "format":"string",
         "python_type":"string"
      },
      "NUMCOM":{
         "score":1.0,
         "format":"string",
         "python_type":"string"
      },
      "NUMDEP":{
         "score":1.0,
         "format":"string",
         "python_type":"string"
      },
      "NOMEPCI":{
         "score":1.0,
         "format":"string",
         "python_type":"string"
      },
      "NUMEPCI":{
         "score":1.0,
         "format":"string",
         "python_type":"string"
      },
      "TXCOUVGLO_COM_2014":{
         "score":0.5,
         "format":"code_commune_insee",
         "python_type":"string"
      },
      "TXCOUVGLO_DEP_2014":{
         "score":0.5,
         "format":"code_departement",
         "python_type":"string"
      },
      "TXCOUVGLO_EPCI_2014":{
         "score":1.0,
         "format":"string",
         "python_type":"string"
      }
   },
   "header_row_idx":0,
   "heading_columns":0,
   "trailing_columns":0
}

APE code mistaken for a Fantoir code

For example, in the file etablissements-du-domaine-sanitaire-et-social-en-france-2020.csv, the APE code is detected as a Fantoir code.
A potential solution would be to add detection of APE codes.

Performance issue with csv-detective

When running the csv-detective routine (with num_rows=-1) on the datasets catalog (~100 MB), the total run time is about 160 seconds.

Almost all of this time (~96%) is spent in the "Testing columns" step.

Verbose logs in detail

INFO:root:Detecting encoding
INFO:root:Detected encoding: "UTF-8" in 0.213s (confidence: 99%)
INFO:root:Detecting separator
INFO:root:Detected separator: ";" in 0.0s
INFO:root:Detecting headers
INFO:root:Detected headers in 0.0s
INFO:root:Detecting heading columns
INFO:root:No heading column detected in 0.0s
INFO:root:Detecting trailing columns
INFO:root:No trailing column detected in 0.0s
INFO:root:Parsing table
WARNING:root:Table parsed successfully in 2.613s
INFO:root:Detecting categorical columns
INFO:root:Detected 6 categorical columns out of 30 in 0.658s

INFO:root:Testing columns to get types
CRITICAL:root:  - Done with type "date" in 21.878s (1/47)
INFO:root:      - Done with type "year" in 0.305s (2/47)
INFO:root:      - Done with type "email" in 0.389s (3/47)
INFO:root:      - Done with type "mongo_object_id" in 0.418s (4/47)
INFO:root:      - Done with type "uuid" in 0.41s (5/47)
INFO:root:      - Done with type "url" in 0.335s (6/47)
INFO:root:      - Done with type "iso_country_code_alpha2" in 0.308s (7/47)
INFO:root:      - Done with type "iso_country_code_alpha3" in 0.35s (8/47)
INFO:root:      - Done with type "iso_country_code_numeric" in 0.324s (9/47)
INFO:root:      - Done with type "jour_de_la_semaine" in 0.353s (10/47)
INFO:root:      - Done with type "csp_insee" in 0.33s (11/47)
INFO:root:      - Done with type "tel_fr" in 0.357s (12/47)
INFO:root:      - Done with type "siren" in 0.348s (13/47)
INFO:root:      - Done with type "code_csp_insee" in 0.313s (14/47)
INFO:root:      - Done with type "sexe" in 0.286s (15/47)
CRITICAL:root:  - Done with type "pays" in 17.903s (16/47)
INFO:root:      - Done with type "code_departement" in 0.407s (17/47)
CRITICAL:root:  - Done with type "adresse" in 18.212s (18/47)
INFO:root:      - Done with type "code_commune_insee" in 0.363s (19/47)
CRITICAL:root:  - Done with type "commune" in 20.625s (20/47)
INFO:root:      - Done with type "region" in 0.647s (21/47)
INFO:root:      - Done with type "code_postal" in 0.587s (22/47)
CRITICAL:root:  - Done with type "departement" in 22.128s (23/47)
INFO:root:      - Done with type "uai" in 0.495s (24/47)
INFO:root:      - Done with type "siret" in 0.569s (25/47)
CRITICAL:root:  - Done with type "latitude_wgs" in 3.878s (26/47)
CRITICAL:root:  - Done with type "longitude_wgs" in 5.02s (27/47)
INFO:root:      - Done with type "latlon_wgs" in 0.406s (28/47)
INFO:root:      - Done with type "json_geojson" in 0.579s (29/47)
INFO:root:      - Done with type "code_fantoir" in 0.438s (30/47)
INFO:root:      - Done with type "insee_ape700" in 0.388s (31/47)
INFO:root:      - Done with type "datetime_iso" in 0.451s (32/47)
INFO:root:      - Done with type "datetime_rfc822" in 0.402s (33/47)
CRITICAL:root:  - Done with type "latitude_wgs_fr_metropole" in 3.489s (34/47)
CRITICAL:root:  - Done with type "longitude_wgs_fr_metropole" in 3.126s (35/47)
INFO:root:      - Done with type "code_region" in 0.347s (36/47)
INFO:root:      - Done with type "booleen" in 0.404s (37/47)
INFO:root:      - Done with type "twitter" in 0.357s (38/47)
WARNING:root:   - Done with type "float" in 1.248s (39/47)
WARNING:root:   - Done with type "int" in 1.056s (40/47)
INFO:root:      - Done with type "json" in 0.433s (41/47)
CRITICAL:root:  - Done with type "latitude_l93" in 3.56s (42/47)
CRITICAL:root:  - Done with type "longitude_l93" in 3.231s (43/47)
CRITICAL:root:  - Done with type "insee_canton" in 19.299s (44/47)
INFO:root:      - Done with type "date_fr" in 0.347s (45/47)
INFO:root:      - Done with type "code_waldec" in 0.494s (46/47)
INFO:root:      - Done with type "code_rna" in 0.44s (47/47)
CRITICAL:root:Done testing columns in 158.045s

INFO:root:Testing labels to get types
INFO:root:      - Done with type "adresse" in 0.002s (1/48)
INFO:root:      - Done with type "code_commune_insee" in 0.002s (2/48)
INFO:root:      - Done with type "code_departement" in 0.002s (3/48)
INFO:root:      - Done with type "code_fantoir" in 0.002s (4/48)
INFO:root:      - Done with type "code_postal" in 0.003s (5/48)
INFO:root:      - Done with type "code_region" in 0.002s (6/48)
INFO:root:      - Done with type "commune" in 0.002s (7/48)
INFO:root:      - Done with type "departement" in 0.003s (8/48)
INFO:root:      - Done with type "insee_canton" in 0.003s (9/48)
INFO:root:      - Done with type "latitude_l93" in 0.003s (10/48)
INFO:root:      - Done with type "latitude_wgs_fr_metropole" in 0.003s (11/48)
INFO:root:      - Done with type "longitude_l93" in 0.003s (12/48)
INFO:root:      - Done with type "longitude_wgs_fr_metropole" in 0.002s (13/48)
INFO:root:      - Done with type "pays" in 0.003s (14/48)
INFO:root:      - Done with type "region" in 0.002s (15/48)
INFO:root:      - Done with type "code_csp_insee" in 0.002s (16/48)
INFO:root:      - Done with type "code_rna" in 0.002s (17/48)
INFO:root:      - Done with type "code_waldec" in 0.002s (18/48)
INFO:root:      - Done with type "csp_insee" in 0.002s (19/48)
INFO:root:      - Done with type "date_fr" in 0.002s (20/48)
INFO:root:      - Done with type "insee_ape700" in 0.002s (21/48)
INFO:root:      - Done with type "sexe" in 0.002s (22/48)
INFO:root:      - Done with type "siren" in 0.004s (23/48)
INFO:root:      - Done with type "siret" in 0.004s (24/48)
INFO:root:      - Done with type "tel_fr" in 0.003s (25/48)
INFO:root:      - Done with type "uai" in 0.002s (26/48)
INFO:root:      - Done with type "jour_de_la_semaine" in 0.002s (27/48)
INFO:root:      - Done with type "mois_de_annee" in 0.002s (28/48)
INFO:root:      - Done with type "iso_country_code_alpha2" in 0.003s (29/48)
INFO:root:      - Done with type "iso_country_code_alpha3" in 0.002s (30/48)
INFO:root:      - Done with type "iso_country_code_numeric" in 0.002s (31/48)
INFO:root:      - Done with type "json_geojson" in 0.002s (32/48)
INFO:root:      - Done with type "latitude_wgs" in 0.003s (33/48)
INFO:root:      - Done with type "latlon_wgs" in 0.004s (34/48)
INFO:root:      - Done with type "longitude_wgs" in 0.003s (35/48)
INFO:root:      - Done with type "booleen" in 0.002s (36/48)
INFO:root:      - Done with type "email" in 0.003s (37/48)
INFO:root:      - Done with type "mongo_object_id" in 0.003s (38/48)
INFO:root:      - Done with type "uuid" in 0.002s (39/48)
INFO:root:      - Done with type "float" in 0.002s (40/48)
INFO:root:      - Done with type "int" in 0.002s (41/48)
INFO:root:      - Done with type "money" in 0.002s (42/48)
INFO:root:      - Done with type "twitter" in 0.002s (43/48)
INFO:root:      - Done with type "url" in 0.003s (44/48)
INFO:root:      - Done with type "date" in 0.003s (45/48)
INFO:root:      - Done with type "datetime_iso" in 0.003s (46/48)
INFO:root:      - Done with type "datetime_rfc822" in 0.003s (47/48)
INFO:root:      - Done with type "year" in 0.002s (48/48)
INFO:root:Done testing labels in 0.133s
INFO:root:Creating profile
WARNING:root:Created profile in 2.445s
CRITICAL:root:Routine completed in 164.138s
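
To see at a glance which types dominate, the per-type timings can be pulled out of the log with a short script. This is a sketch that assumes the log format shown above; `slowest_types` is a hypothetical helper and the sample holds three lines copied from the log.

```python
import re

# Matches lines such as: Done with type "date" in 21.878s (1/47)
LOG_LINE = re.compile(r'Done with type "(?P<type>[^"]+)" in (?P<seconds>[\d.]+)s')


def slowest_types(log_text, top=5):
    """Return the `top` (type, seconds) pairs with the largest durations."""
    timings = [(m.group("type"), float(m.group("seconds"))) for m in LOG_LINE.finditer(log_text)]
    return sorted(timings, key=lambda item: item[1], reverse=True)[:top]


sample = '''CRITICAL:root:  - Done with type "date" in 21.878s (1/47)
INFO:root:      - Done with type "year" in 0.305s (2/47)
CRITICAL:root:  - Done with type "departement" in 22.128s (23/47)
'''
print(slowest_types(sample, top=2))  # [('departement', 22.128), ('date', 21.878)]

# Share of the whole routine spent testing columns, from the two totals in the log.
print(round(158.045 / 164.138 * 100, 1))  # 96.3
```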

This ends up timing out in hydra workers, making csv parsing fail : https://errors.data.gouv.fr/organizations/sentry/issues/129487/events/fced5f3fae964450b7d249efa9a35f96/?project=2&referrer=issue-list&statsPeriod=14d

Detection of lat lon not working when in square brackets
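
A possible way to handle this case is to strip a balanced pair of square brackets before parsing. The sketch below assumes the value is written latitude first with a comma separator, for example "[48.8566, 2.3522]"; `parse_latlon` is a hypothetical helper, not the package's detection function.

```python
import re

# Two decimal numbers separated by a comma, with optional whitespace.
LATLON = re.compile(r"^\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_latlon(value: str):
    """Return (lat, lon) for 'lat,lon' or '[lat, lon]', or None if it does not parse."""
    text = value.strip()
    if text.startswith("[") and text.endswith("]"):
        text = text[1:-1]
    match = LATLON.match(text)
    if not match:
        return None
    lat, lon = float(match.group(1)), float(match.group(2))
    if -90 <= lat <= 90 and -180 <= lon <= 180:
        return lat, lon
    return None


print(parse_latlon("[48.8566, 2.3522]"))  # (48.8566, 2.3522)
print(parse_latlon("48.8566,2.3522"))     # (48.8566, 2.3522)
```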

Column is mistaken for a postcode column

See https://csvapi-front.etalab.studio/?url=https://www.data.gouv.fr/fr/datasets/r/2824df12-c542-4250-97f4-d6e566f69ef1
Column code_canal_parent

json as python_type?

While not a primary type (same as #43), it would be nice to know that a column contains some json.
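
As an illustration of what such a check could look like, the sketch below flags a column as JSON when every non-empty value parses as a JSON object or array. The helper names `looks_like_json` and `column_is_json` are hypothetical, and restricting the test to objects and arrays is an assumption made here so that plain numbers and strings are not reported as JSON.

```python
import json


def looks_like_json(value):
    """True if value is a string holding a JSON object or array."""
    if not isinstance(value, str):
        return False
    text = value.strip()
    # Cheap pre-filter: only objects and arrays count, not bare scalars.
    if not text or text[0] not in "{[":
        return False
    try:
        json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def column_is_json(values):
    """True if the column has at least one value and all non-empty values are JSON."""
    present = [v for v in values if v not in (None, "")]
    return bool(present) and all(looks_like_json(v) for v in present)
```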

Add max length of column

[Maybe not: use CHARACTER VARYING instead]
Useful for SQL uploads later on, to set the VARCHAR max length (we could use TEXT, but it is impossible to set an index on a TEXT column).
It could go in the profile section:
{'tops': [
{'count': 2772, 'value': 'TTTTTTTT'},
{'count': 780, 'value': 'XXXXXXXXXX'}],
'nb_distinct': 10,
'nb_missing_values': 28,
'max_length': 12}
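
A minimal sketch of how `max_length` could be computed next to the other profile keys. It uses plain Python lists rather than the library's internals, and the helper name `max_length` and the choice to treat `None` and empty strings as missing are assumptions made for the example.

```python
def max_length(values):
    """Length of the longest non-missing value, measured on its string form."""
    lengths = [len(str(v)) for v in values if v is not None and v != ""]
    return max(lengths, default=0)


# Toy column: two missing values (None and the empty string).
column = ["TTTTTTTT", "XXXXXXXXXX", None, "ABCDEFGHIJKL", ""]
present = [v for v in column if v is not None and v != ""]

profile = {
    "nb_distinct": len(set(present)),
    "nb_missing_values": len(column) - len(present),
    "max_length": max_length(column),
}
```

The resulting `max_length` could then be used as the VARCHAR size when generating a table definition.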

/Users/alexandre/Developer/Etalab/udata-hydra/.venv/lib/python3.9/site-packages/csv_detective/utils.py:64: FutureWarning: In a future version, object-dtype columns with all-bool values will not be included in reductions with bool_only=True. Explicitly cast to bool dtype instead.
  return_table.loc[key] = table.apply(lambda serie: test_col_val(

Add pandas profiling info to the csv-detective report

Add to the report, for each column:

min
max
mean
std
top 10 values with count for each
nb distinct values
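
The statistics listed above can be sketched as follows. The issue asks for pandas-based profiling. This example uses only the standard library so that it runs without extra dependencies, and the function name `profile_column` is hypothetical. Numeric statistics are only added when every non-missing value is a number.

```python
from collections import Counter
import statistics


def profile_column(values, top_n=10):
    """Build a small profile dict for one column given as a list of values."""
    present = [v for v in values if v is not None]
    report = {
        # Most frequent values with their counts, most common first.
        "tops": [
            {"count": count, "value": value}
            for value, count in Counter(present).most_common(top_n)
        ],
        "nb_distinct": len(set(present)),
        "nb_missing_values": len(values) - len(present),
    }
    numbers = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numbers and len(numbers) == len(present):
        report["min"] = min(numbers)
        report["max"] = max(numbers)
        report["mean"] = statistics.mean(numbers)
        # Sample standard deviation needs at least two points.
        report["std"] = statistics.stdev(numbers) if len(numbers) > 1 else 0.0
    return report
```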

Lat/lon, code INSEE and postcode should only be detected when there is a label boost

Currently there are many false positives, because plain numeric columns often contain values that happen to match these formats.
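
One way to express this rule is sketched below. It assumes each detection comes with a content-based score and a header-based (label) score. The names `LABEL_REQUIRED` and `keep_detection`, the 0.5 threshold, and the type names `code_commune_insee` and `code_postal` are all assumptions made for the example and may not match the library.

```python
# Types that plain numbers can match by accident (names partly assumed).
LABEL_REQUIRED = {
    "latitude_wgs",
    "longitude_wgs",
    "latlon_wgs",
    "code_commune_insee",
    "code_postal",
}


def keep_detection(type_name, field_score, label_score, threshold=0.5):
    """Decide whether a detected type should be reported for a column."""
    if field_score < threshold:
        return False
    # Ambiguous types also need support from the column header.
    if type_name in LABEL_REQUIRED:
        return label_score > 0
    return True
```

With this rule, a numeric column named `montant` whose values all fit in [-90, 90] would no longer be reported as `latitude_wgs`, while a column named `latitude` still would.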


`output_mode` - Select the output mode you want for json report