csv_detective_api's Introduction

CSV Detective API and Frontend

What?

CSV Detective is a tool that gives you information about a CSV, such as its encoding and separator, as well as the type of columns contained inside: whether there are columns containing a SIRET or a SIREN number, a postal code, a department or a commune name, a geographic position, etc.
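As a rough illustration of what detecting a separator involves, the sketch below uses Python's standard `csv.Sniffer`. This is not CSV Detective's own implementation; the function name `sniff_separator` and the sample data are assumptions made for the example.

```python
import csv


def sniff_separator(sample: str, candidates: str = ",;\t|") -> str:
    """Guess the delimiter of a CSV sample using the standard library's Sniffer.

    `candidates` restricts the guess to a few plausible delimiter characters.
    """
    dialect = csv.Sniffer().sniff(sample, delimiters=candidates)
    return dialect.delimiter


# Hypothetical sample: a small semicolon-separated file.
sample = "code_postal;commune\n69007;Lyon\n75001;Paris\n"
print(sniff_separator(sample))
```

Restricting the candidate delimiters keeps the sniffer from picking a letter or digit that happens to occur once per line.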

This project builds on CSV Detective: we improved it, exposed it as an API, and added this interface to make it friendlier to use. We also added a machine learning model for detecting column types, which is still a work in progress.

Why?

This tool was developed with data.gouv.fr (DGF) in mind. One of DGF's main tasks is to serve as a repository of open datasets, so knowing what is inside its large collection of CSVs is useful for several tasks:

  • Enrich the results of the search engine with the contents of the CSVs.
  • Link datasets together according to their values.
  • Link datasets with well-maintained, trustworthy reference datasets.
  • Group datasets together according to their general topic.

How?

CSV Detective has two strategies to detect a column type:

  1. Rules + References: using regular expressions and comparing the values with reference data (e.g., if the value 69007 appears in a list of postal codes, then it is a postal code).
  2. Supervised Learning (in progress): manually tagging column types, then deriving simple features from the content of the cells to train classification algorithms.
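The first strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not CSV Detective's actual code: the reference set `KNOWN_POSTAL_CODES`, the function names, and the 90% threshold are all hypothetical.

```python
import re

# Hypothetical reference data: a tiny stand-in for a real list of postal codes.
KNOWN_POSTAL_CODES = {"69007", "75001", "13001"}

# Rule: a French postal code is written as exactly five digits.
POSTAL_CODE_PATTERN = re.compile(r"\d{5}")


def looks_like_postal_code(value: str) -> bool:
    """Apply the rule first, then confirm the value against the reference data."""
    value = value.strip()
    return bool(POSTAL_CODE_PATTERN.fullmatch(value)) and value in KNOWN_POSTAL_CODES


def detect_postal_code_column(values, threshold=0.9):
    """Tag a column as postal codes if enough of its non-empty cells match.

    The threshold is an assumed value that tolerates a few dirty cells.
    """
    non_empty = [v for v in values if v.strip()]
    if not non_empty:
        return False
    hits = sum(looks_like_postal_code(v) for v in non_empty)
    return hits / len(non_empty) >= threshold


print(detect_postal_code_column(["69007", "75001", "", "13001"]))
print(detect_postal_code_column(["Lyon", "Paris"]))
```

Checking the regular expression before the reference lookup discards malformed values cheaply. The reference lookup then rejects five-digit strings that are not real postal codes.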

Requirements

The easiest way to install this API is to clone the repository and run it in a Docker container. This requires docker and docker-compose. After cloning, move into the project's folder and run docker-compose up.

Using the API

The API is documented through its Swagger interface, available at localhost:5000.

csv_detective_api's People

Contributors

dependabot[bot], psorianom

csv_detective_api's Issues

One CSV resource ID does not work

When inspecting https://www.data.gouv.fr/fr/datasets/annuaire-de-leducation/, I manage to make it work with "Export au format CSV" (b22f04bf-64a8-495d-b8bb-d84dbc4c7983) but not with "Annuaire au 12 septembre 2019" (85aefd85-3025-400f-90ff-ccfd17ca588e).

In the console I see:

{
  "columns_ml": {
    "Fax": "tel_fr",
    "Fiche_onisep": "url",
    "Identifiant_de_l_etablissement": "uai",
    "Lycee_Agricole": "booleen",
    "Restauration": "booleen",
    "Section_europeenne": "booleen",
    "Section_internationale": "booleen",
    "Voie_professionnelle": "booleen",
    "Voie_technologique": "booleen",
    "date_maj_ligne": "date",
    "latitude": "latitude_wgs",
    "position": "latlon_wgs",
    "rpi_concentre": "booleen"
  },
  "columns_rb": {
    "Adresse_1": "adresse",
    "Apprentissage": "booleen",
    "Code postal": "code_postal",
    "Ecole_elementaire": "booleen",
    "Ecole_maternelle": "booleen",
    "Fax": "tel_fr",
    "Fiche_onisep": "url",
    "GRETA": "booleen",
    "Hebergement": "booleen",
    "Identifiant_de_l_etablissement": "uai",
    "Libelle_departement": "departement",
    "Lycee_Agricole": "booleen",
    "Lycee_des_metiers": "booleen",
    "Lycee_militaire": "booleen",
    "Mail": "email",
    "Nom_commune": "commune",
    "Post_BAC": "booleen",
    "Restauration": "booleen",
    "Section_arts": "booleen",
    "Section_cinema": "booleen",
    "Section_europeenne": "booleen",
    "Section_internationale": "booleen",
    "Section_sport": "booleen",
    "Section_theatre": "booleen",
    "Segpa": "booleen",
    "ULIS": "booleen",
    "Voie_generale": "booleen",
    "Voie_professionnelle": "booleen",
    "Voie_technologique": "booleen",
    "Web": "url",
    "date_maj_ligne": "date",
    "date_ouverture": "date",
    "etablissement_multi_lignes": "booleen",
    "latitude": "latitude_wgs",
    "longitude": "latitude_wgs",
    "position": "latlon_wgs",
    "precision_localisation": "adresse",
    "rpi_concentre": "booleen"
  },
  "metadata": {
    "encoding": "UTF-8",
    "header": [
      "Identifiant_de_l_etablissement",
      "Nom_etablissement",
      "Type_etablissement",
      "Statut_public_prive",
      "Adresse_1",
      "Adresse_2",
      "Adresse_3",
      "Code postal",
      "Code_commune",
      "Nom_commune",
      "Code_departement",
      "Code_academie",
      "Code_region",
      "Ecole_maternelle",
      "Ecole_elementaire",
      "Voie_generale",
      "Voie_technologique",
      "Voie_professionnelle",
      "Telephone",
      "Fax",
      "Web",
      "Mail",
      "Restauration",
      "Hebergement",
      "ULIS",
      "Apprentissage",
      "Segpa",
      "Section_arts",
      "Section_cinema",
      "Section_theatre",
      "Section_sport",
      "Section_internationale",
      "Section_europeenne",
      "Lycee_Agricole",
      "Lycee_militaire",
      "Lycee_des_metiers",
      "Post_BAC",
      "Appartenance_Education_Prioritaire",
      "GRETA",
      "SIREN_SIRET",
      "Nombre_d_eleves",
      "Fiche_onisep",
      "position",
      "Type_contrat_prive",
      "Libelle_departement",
      "Libelle_academie",
      "Libelle_region",
      "coordonnee_X",
      "coordonnee_Y",
      "epsg",
      "nom_circonscription",
      "latitude",
      "longitude",
      "precision_localisation",
      "date_ouverture",
      "date_maj_ligne",
      "etat",
      "ministere_tutelle",
      "etablissement_multi_lignes",
      "rpi_concentre",
      "rpi_disperse",
      "code_nature",
      "libelle_nature"
    ],
    "header_row_idx": 0,
    "heading_columns": 0,
    "ints_as_floats": [],
    "separator": ";",
    "total_lines": 65182,
    "trailing_columns": 0
  },
  "reference_matched_datasets": {
    "matched_datasets": {
      "0": [
        "code_postal",
        "adresse",
        "commune"
      ]
    },
    "reference_datasets": {
      "0": {
        "acronym": "BAN",
        "name": "Base Adresse Nationale",
        "url": "https://www.data.gouv.fr/en/datasets/base-adresse-nationale/"
      },
      "1": {
        "acronym": "RNA",
        "name": "Répertoire National des Associations",
        "url": "https://www.data.gouv.fr/en/datasets/repertoire-national-des-associations/"
      },
      "2": {
        "acronym": "SUB",
        "name": "Subventions",
        "url": "https://www.data.gouv.fr/en/search/?q=subventions"
      }
    }
  }
}

And then I see the error:

TypeError: "can't convert undefined to object"
    value App.js:292
    value App.js:291
    React 7
    unstable_runWithPriority scheduler.production.min.js:338
    React 6
    handlePredictClick App.js:160
react-dom.production.min.js:4260:12
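The response above carries two independent sets of predictions: `columns_ml` from the machine learning model and `columns_rb` from the rules. When investigating a report like this one, it can help to compare them. The sketch below is only an illustration: the helper `compare_detections` is hypothetical and not part of the API, and the `report` dictionary is a small excerpt of the response shown above.

```python
def compare_detections(report: dict) -> dict:
    """Summarise where the ML and rule-based detections agree or differ."""
    ml = report.get("columns_ml", {})
    rb = report.get("columns_rb", {})
    return {
        "agree": sorted(c for c in ml if c in rb and ml[c] == rb[c]),
        "disagree": sorted(c for c in ml if c in rb and ml[c] != rb[c]),
        "rules_only": sorted(c for c in rb if c not in ml),
        "ml_only": sorted(c for c in ml if c not in rb),
    }


# Excerpt of the response from the issue, reduced to a few columns.
report = {
    "columns_ml": {"Fax": "tel_fr", "latitude": "latitude_wgs"},
    "columns_rb": {
        "Fax": "tel_fr",
        "Mail": "email",
        "latitude": "latitude_wgs",
        "longitude": "latitude_wgs",
    },
}

summary = compare_detections(report)
for key, columns in summary.items():
    print(key, columns)
```

Using `.get` with an empty default keeps the comparison from failing when one of the two sections is missing from a response.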
