anon's Introduction

Anon - A UNIX Command To Anonymise Data

Anon is a tool for taking delimited files and anonymising or transforming columns until the output is useful for applications where sensitive information cannot be exposed.

Installation

Releases of Anon are available as pre-compiled static binaries on the corresponding GitHub release. Simply download the appropriate build for your machine and make sure it's in your PATH (or use it directly).

Usage

anon [--config <path to config file, default is ./config.json>]
     [--output <path to output to, default is STDOUT>]

Anon is designed to take input from STDIN and by default will output the anonymised file to STDOUT:

anon < some_file.csv > some_file_anonymised.csv

Configuration

In order to be useful, Anon needs to be told what you want to do to each column of the CSV. The config is defined as a JSON file (defaults to a file called config.json in the current directory):

{
  "csv": {
    "delimiter": ","
  },
  // Optionally sample the input down to roughly 1 in every "mod" rows.
  // To do so, Anon hashes (using FNV-1, 32 bits) the column containing the
  // ID and takes the result modulo the value specified to decide whether
  // the row is included: include = hash(idColumn) % mod == 0
  "sampling": {
    // Number used to mod the hash of the ID and determine whether the row
    // is included in the sample.
    "mod": 30000,
    // Specify the column in which a unique ID exists on which the sampling
    // can be performed. Indices are 0 based, so this would sample on the
    // first column.
    "idColumn": 0
  },
  // An array of actions to take on each column - indices are 0 based, so index
  // 0 in this array corresponds to column 1, and so on.
  //
  // There must be an action for every column in the CSV.
  "actions": [
    {
      // The no-op, leaves the input unchanged.
      "name": "nothing"
    },
    {
      // Takes a UK format postcode (eg. W1W 8BE) and just keeps the outcode
      // (eg. W1W).
      "name": "outcode"
    },
    {
      // Hash (SHA1) the input.
      "name": "hash",
      // Optional salt that will be appended to the input.
      // If not defined, a random salt will be generated.
      "salt": "salt"
    },
    {
      // Given a date, just keep the year.
      "name": "year",
      "dateConfig": {
        // Define the format of the input date here.
        "format": "YYYYmmmdd"
      }
    },
    {
      // Summarise a range of values.
      "name": "range",
      "rangeConfig": {
        "ranges": [
          // For example, this will take values between 0 (inclusive) and 100
          // (exclusive), and convert them to the string "0-100".
          // Each range may use at most one of (gt, gte) and at most one of
          // (lt, lte), never both members of the same pair.
          // You also need to define at least one of (gt, gte, lt, lte).
          {
            "gte": 0,
            "lt": 100,
            "output": "0-100"
          }
        ]
      }
    }
  ]
}
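
The sampling rule described in the config comments, include = hash(idColumn) % mod == 0, can be sketched in a few lines. This is a minimal Python illustration, not Anon's own code: the function names are hypothetical, and hashing the ID as UTF-8 bytes is an assumption. Only the standard FNV-1 32-bit constants and the formula itself come from established definitions and the README.

```python
FNV_OFFSET_BASIS_32 = 0x811C9DC5  # standard FNV 32-bit offset basis
FNV_PRIME_32 = 0x01000193         # standard FNV 32-bit prime


def fnv1_32(data: bytes) -> int:
    """Return the 32-bit FNV-1 hash of data (multiply first, then XOR)."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the product within 32 bits
        h ^= byte
    return h


def include_row(row: list[str], id_column: int, mod: int) -> bool:
    """Apply include = hash(idColumn) % mod == 0 to one parsed CSV row."""
    # Assumption: the ID is hashed as its UTF-8 bytes.
    return fnv1_32(row[id_column].encode("utf-8")) % mod == 0


# Hypothetical data: 10,000 rows with a unique ID in column 0.
rows = [[f"user-{i}", "W1W 8BE"] for i in range(10_000)]
sample = [r for r in rows if include_row(r, id_column=0, mod=100)]
print(f"kept {len(sample)} of {len(rows)} rows")
```

Because the decision depends only on the ID, the same row is kept or dropped on every run, which is what makes samples taken from related files line up on the same IDs.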

Contributing

Any contribution is welcome. Please refer to our contributing guidelines for more information.

License

This project is licensed under the MIT license.

The icon is by Pixel Perfect from Flaticon, and is licensed under a Creative Commons 3.0 BY license.

anon's Issues

Feature: k-anonymity

Hello,

It would be great if this software supported k-anonymity, so that no row in the output were uniquely distinguishable. It should be possible to output the maximum k for a given dataset, as well.

Thanks!
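
As a rough illustration of the "maximum k" part of this request, the value can be computed as the size of the smallest group of identical rows in the output. The sketch below is in Python and is not part of Anon. The `max_k` name and the sample data are hypothetical, and it treats every column as a quasi-identifier, which is an assumption.

```python
import csv
import io
from collections import Counter


def max_k(rows: list[list[str]]) -> int:
    """Return the largest k for which the rows are k-anonymous.

    That is the size of the smallest group of identical rows (0 if no rows).
    Assumption: every column counts as a quasi-identifier.
    """
    counts = Counter(tuple(row) for row in rows)
    return min(counts.values(), default=0)


# Hypothetical anonymised output: outcode, year, range.
anonymised = "W1W,1990,0-100\nW1W,1990,0-100\nSE1,1985,0-100\n"
rows = list(csv.reader(io.StringIO(anonymised)))
print(max_k(rows))  # the SE1 row is unique, so this prints 1
```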

Address concerns raised on HN about this tool being used on data intended to be made public.

There are some comments on our HN post about this tool raising the concern that we don't address the elephant in the room: this tool is not a good solution if you intend to make the resulting data public. A large body of research shows that de-anonymising data is possible with ever less effort, because anonymised data almost always retains unique "fingerprints".

We should add something to the README that:

  • Addresses these issues, and talks about when you should use this tool.
  • Makes sure that we point users to tools more appropriate to the job if you want to make resulting data public.
  • Makes sure that even those tools are caveated with links to the research showing there is no silver bullet to prevent de-anonymisation.

Add anonymisation action: Remove value/column

It would be great to be able to entirely remove a column from the input.

Config could be something like the following:

{
  "actions": [
    {
      "name": "identity"
    },
    {
      "name": "remove"
    },
    {
      "name": "hash"
    }
  ]
}

Then, for an input like this one:

a,b,c
d,e,f

The output would be:

a,84a516841ba77a5b4648de2cd0dfcb30ea46dbb4
d,4a0a19218e082a343a1b17e5333409af9d98f0f5
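
One way the proposed behaviour could work is sketched below in Python. It is an illustration only, not Anon's implementation. The `apply_actions` helper is hypothetical, and the hash is assumed to be unsalted SHA1 so that it matches the example output above.

```python
import hashlib


def apply_actions(row: list[str], actions: list[dict]) -> list[str]:
    """Apply one action per column, dropping columns marked "remove"."""
    if len(row) != len(actions):
        raise ValueError("there must be an action for every column")
    out = []
    for value, action in zip(row, actions):
        name = action["name"]
        if name == "identity":
            out.append(value)
        elif name == "remove":
            continue  # emit nothing for this column
        elif name == "hash":
            # Assumption: unsalted SHA1 of the UTF-8 bytes.
            out.append(hashlib.sha1(value.encode("utf-8")).hexdigest())
        else:
            raise ValueError(f"unknown action: {name}")
    return out


actions = [{"name": "identity"}, {"name": "remove"}, {"name": "hash"}]
for row in (["a", "b", "c"], ["d", "e", "f"]):
    print(",".join(apply_actions(row, actions)))
```

Each output row then has one column fewer than the input, so anything consuming the output needs to account for the shifted column indices.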

Add anonymisation action: Range of dates

Another very common way to reduce date precision is to group dates according to a period of time from an initial date.

For example, if we have the date of birth of a person, we may want to output what range of years the age of this person belongs to.

e.g. 1/1/1990 -> 1990 or 20-30 years

Possible config:

{
  "actions": [
    {
      "name": "timeElapsed",
      "dateConfig": {
        "format": "YYYYmmmdd",
        // should we count the number of months or years
        "elapsedIn": "years",
        // since when should we count
        // accepts a date in the above format or `now` as a value
        "since": "19901212"
      },
      "rangeConfig": {
        "ranges": [
          {
            "gt": 20,
            "lte": 30,
            "output": "20-30 years"
          }
        ]
      }
    }
  ]
}
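
To make the proposal concrete, here is a small Python sketch of how such an action might combine an elapsed-time calculation with the range matching. It is not Anon's code and rests on several assumptions: the input layout is taken to be strptime's "%Y%m%d", elapsed time is counted in whole years from the input date up to a reference date (standing in for `since` or `now`), and all function names are hypothetical.

```python
from datetime import date, datetime


def years_between(start: date, end: date) -> int:
    """Whole years elapsed from start to end (end must not precede start)."""
    years = end.year - start.year
    # Subtract one if the anniversary has not been reached yet in end's year.
    if (end.month, end.day) < (start.month, start.day):
        years -= 1
    return years


def to_range(value: float, ranges: list[dict]) -> str:
    """Return the output of the first range whose bounds contain value."""
    for r in ranges:
        if "gt" in r and not value > r["gt"]:
            continue
        if "gte" in r and not value >= r["gte"]:
            continue
        if "lt" in r and not value < r["lt"]:
            continue
        if "lte" in r and not value <= r["lte"]:
            continue
        return r["output"]
    raise ValueError(f"no range matches {value}")


def time_elapsed(value: str, reference: date, ranges: list[dict]) -> str:
    # Assumption: the input uses strptime's "%Y%m%d" layout, e.g. "19900101".
    start = datetime.strptime(value, "%Y%m%d").date()
    return to_range(years_between(start, reference), ranges)


ranges = [{"gt": 20, "lte": 30, "output": "20-30 years"}]
# 1 Jan 1990 to 1 Jun 2015 is 25 whole years, which falls in (20, 30].
print(time_elapsed("19900101", date(2015, 6, 1), ranges))  # 20-30 years
```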

Add support for salts within the `hash` action.

Sometimes you want to make sure the hash action is irreversible and not vulnerable to rainbow table attacks. To support this, it would be useful if one was able to optionally turn on random salts being added to the hash (and perhaps this should be the default, for safety).

For example, given the following config and CSV, you'd expect to get the following output:

Config:

{
  "csv": {
    "delimiter": ","
  },
  "actions": [
    {
      // Salt is not given, so is random and on by default.
      "name": "hash"
    },
    {
      "name": "hash",
      // Have no salt.
      "salt": false
    },
    {
      "name": "hash",
      // Have a salt, but one which stays the same for all values.
      "salt": "somesalt"
    }
  ]
}

Input:

foo,bar,lux

Output:

d8b685c1a4b889369299f275d583e34f94831bb6,62cdb7020ff920e5aa642c3d4066950dd1f01f4d,98307a2daa4aa31a9e0b2deeeb98dad737970927

Where the first column is effectively random, the second column is a deterministic hash, and the third is deterministic but with the salt added as a suffix. That is:

sha1(foo<some random noise>),sha1(bar),sha1(luxsomesalt)
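
The three salt modes could be sketched as follows. This is a hedged Python illustration rather than Anon's implementation. The `make_hasher` helper is hypothetical, the salt length is an arbitrary choice, and generating the random salt once per column (so equal inputs still hash equally within a run) is an assumption.

```python
import hashlib
import secrets


def make_hasher(salt=None):
    """Build a hash function for one column.

    salt=None  -> a random salt, generated once and reused for the column
    salt=False -> no salt
    salt="..." -> the given fixed salt
    """
    if salt is None:
        salt = secrets.token_hex(16)  # hypothetical choice of salt length
    elif salt is False:
        salt = ""

    def hash_value(value: str) -> str:
        # The salt is appended as a suffix: sha1(value + salt).
        return hashlib.sha1((value + salt).encode("utf-8")).hexdigest()

    return hash_value


hashers = [make_hasher(), make_hasher(salt=False), make_hasher(salt="somesalt")]
row = ["foo", "bar", "lux"]
print(",".join(h(v) for h, v in zip(hashers, row)))
```

With this shape the first column changes from run to run, while the second and third columns are reproducible.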
