
postgres_upsert

Allows your Rails app to load data very quickly, bypassing ActiveRecord.

Using the pg gem and Postgres's powerful COPY command, you can create thousands of records in your database with a single query.

Compatibility Note

The master branch requires the 'pg' gem, which only supports MRI Ruby. The jruby branch requires 'activerecord-jdbcpostgresql-adapter', which of course only supports JRuby. Installation is the same on either platform.

Install

Put it in your Gemfile

gem 'postgres_upsert'

Run the bundle command

bundle

Usage

PostgresUpsert.write <class_or_table_name>, <io_object_or_file_path>[, options]

<class_or_table_name> is either an ActiveRecord::Base subclass, or a string naming a database table. <io_object_or_file_path> is either a string representing a file path, or an IO object (StringIO, File, etc.)

options:

  • :delimiter - the string used to delimit fields in the source data. Default is ","
  • :header - specifies whether the file/IO source contains a header row. Either the :header option must be true, or a :columns list must be passed. Default is true.
  • :unique_key - the primary key or unique key column (or composite key columns) on your destination table, used to distinguish new records from existing records. Default is the primary key of your destination table/model.
  • :update_only - when true, postgres_upsert will ONLY update existing records, never insert new ones. Default is false.
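
For example, loading a tab-delimited file with no header row might look like this (a sketch; the file path, column names, and exact :columns format are assumptions):

PostgresUpsert.write User, "/tmp/users.tsv",
  :delimiter => "\t",
  :header => false,
  :columns => ["email", "password"]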

Examples

For these examples, let's assume we have a users table and model:

class User < ActiveRecord::Base
end

In the Rails console we can run:

PostgresUpsert.write User, "/tmp/users.csv"

This command will use the headers in the CSV file as the fields of the target table (by default). If the CSV file's headers do not match the column names of the User class, you can pass a :map in the options parameter.

PostgresUpsert.write "users", "/tmp/users.csv", :map => {'name' => 'first_name'}

The name column in the CSV file will be mapped to the first_name field in the users table.

postgres_upsert supports 'merge' operations, which Postgres does not yet support natively. The data can include both new and existing records; postgres_upsert will update or insert each record as appropriate. Since the Postgres COPY command does not handle this, postgres_upsert accomplishes it using an intermediary temp table.

The merge/upsert happens in 5 steps (assume your destination table is called "users"); a rough sketch of the SQL behind steps 3 and 4 follows the list.

  • Create a temp table named users_temp_123, where "123" is a random int. In Postgres, temp tables are only visible to the current database session, so naming conflicts should not be a problem; the random suffix is added just for additional safety.
  • COPY the data to users_temp_123.
  • Issue a query to insert all new records from users_temp_123 into users ("new" records are those whose primary key does not already exist in the users table).
  • Issue a query to update all existing records in users with the data from users_temp_123 ("existing" records are those whose primary key already exists in the users table).
  • Drop the temp table.
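
Roughly, steps 3 and 4 boil down to SQL like the following (a minimal sketch, not the gem's actual implementation, assuming a single-column primary key id and email/password data columns):

conn = ActiveRecord::Base.connection
# Step 3: insert rows whose primary key is not yet present in users.
conn.execute(<<-SQL)
  INSERT INTO users (id, email, password)
  SELECT t.id, t.email, t.password
  FROM users_temp_123 t
  LEFT JOIN users u ON u.id = t.id
  WHERE u.id IS NULL
SQL
# Step 4: update rows whose primary key already exists in users.
conn.execute(<<-SQL)
  UPDATE users u
  SET email = t.email, password = t.password
  FROM users_temp_123 t
  WHERE u.id = t.id
SQL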

Timestamp columns

Currently, postgres_upsert detects and manages the default Rails timestamp columns created_at and updated_at. If these fields exist in your destination table, postgres_upsert will keep them current as expected. I recommend you do NOT include these fields in your source CSV/IO, as postgres_upsert will not honor them.

  • newly inserted records get a current timestamp for created_at
  • records existing in the source file/IO will get an update to their updated_at timestamp (even if all fields maintain the same value)
  • records that are in the destination table but not the source will not have their timestamps changed.

Overriding the unique_key

By default, postgres_upsert uses the primary key on your ActiveRecord table to determine whether each record should be inserted or updated. You can override the column using the :unique_key option:

PostgresUpsert.write User, "/tmp/users.csv", :unique_key => ["external_twitter_id"]

Obviously, the field you pass must be a unique key in your database (this is not enforced at the moment, but will be).

If your source data does not contain the primary key or a single unique key column, you can pass multiple columns in the :unique_key option:

  PostgresUpsert.write User, "/tmp/users.csv", :unique_key => ["state_id_number", "state_name"]

As long as the combined columns uniquely identify a row, the upsert will succeed.

Passing :update_only => true ensures that no new records are created; existing records will still be updated.
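
For example, to refresh existing users from a feed keyed on the (hypothetical) external_twitter_id column without creating any new rows:

PostgresUpsert.write User, "/tmp/users.csv",
  :unique_key => ["external_twitter_id"],
  :update_only => true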

Insert/Update Counts

PostgresUpsert.write will also return a PostgresUpsert::Result object that tells you how many records were inserted or updated:

User.delete_all
result = PostgresUpsert.write User, "/tmp/users.csv"
result.inserted 
# => 10000
result.updated
# => 0

Huge Caveat!

Since postgres_upsert does not run validations or even instantiate Rails objects, you can get invalid data if you're not careful. postgres_upsert assumes that your source data is minimally cleaned up, and will not tell you if any data is invalid according to your Rails model rules. It will, of course, raise an error if the data violates your database constraints.

Benchmarks!

Given a User model (which validates presence of email and password):

2.1.3 :008 > User
 => User(id: integer, email: string, password: string, created_at: datetime, updated_at: datetime) 

And the following railsy code to create 10,000 users:

def insert_dumb
  time = Benchmark.measure do
    (1..10000).each do |n|
      User.create!(:email => "number#{n}@email.com", :password => "password#{n}")
    end
  end
  puts time
end

Compared to the following code using Postgres_upsert:

def insert_smart
  time = Benchmark.measure do
    csv_string = CSV.generate do |csv|
      csv << %w(email password)    # CSV header row
      (1..10000).each do |n|
        csv << ["number#{n}@email.com", "password#{n}"]
      end
    end
    io = StringIO.new(csv_string)
    PostgresUpsert.write User, io, unique_key: "email"
  end
  puts time
end

Let's compare!

2.1.3 :002 > insert_dumb
   #...snip  ~30k lines of output :( (10k queries, each wrapped in a transaction)
   (0.3ms)  COMMIT
26.639246
2.1.3 :004 > User.delete_all
  SQL (15.4ms)  DELETE FROM "users"
2.1.3 :006 > insert_smart
   #...snip ~30 lines of output, composing 5 sql queries...
0.275503

...That's 26.6 seconds for the classic create loop... vs. 0.276 seconds for postgres_upsert.
This is over 96x faster. And it only cost me ~6 extra lines of code.

Note that for the benchmark, my database is local. The performance improvement should only increase when we have network latency to worry about.

Note on Patches/Pull Requests

I greatly appreciate contributions to this gem.

  • Fork the project and clone the repo locally
  • Run 'bin/setup' to set up dependencies and create the test DB
  • Add your feature/fix to your fork (add and run rspec tests, please)
  • Submit a PR
  • If you find an issue but can't fix it in a PR, please log an issue. I'll do my best.

Issues

Add option for quote character

I have CSVs with long text descriptions, which often include HTML tags and weird characters. It would be helpful to have an option to specify a non-standard quote character, like Ruby's CSV quote_char parameter.
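
For reference, Ruby's CSV library exposes this via its quote_char option; the request is for postgres_upsert to pass something similar through:

require 'csv'
# Fields quoted with single quotes; an embedded quote_char is doubled.
CSV.parse("1,'a ''quoted'' value'", quote_char: "'")
# => [["1", "a 'quoted' value"]]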

Interpret path_or_io as a CSV string if it isn't a valid file path

PostgresUpsert.write( User, <io_object_or_file_path>). The idea of taking either an IO or a string as the second parameter came from the old postgres_copy project.

The idea makes sense, I can either pass a file, or a stream of data.

In practice though, I often find myself building a CSV string from a chunk of data, rather than as a stream. And I often need to

io = StringIO.new(string_variable)

If I try to pass the string_variable directly, postgres_upsert will interpret it as a file path and try to read the file.

So options:

  1. Pass a File object to PostgresUpsert.write. If the passed param isn't a File, interpret it as a data String or IO. This would break the API again.
  2. Use a guard like Pathname.new(arg).readable? to decide whether it's a file path; otherwise, interpret it as data. A minimal sketch of this option follows.
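
A minimal sketch of option 2 (hypothetical; resolve_source is not part of the gem's API):

require 'pathname'
require 'stringio'

def resolve_source(path_or_io)
  # Already an IO-like object: use it as-is.
  return path_or_io if path_or_io.respond_to?(:read)
  # A readable file path: open the file.
  return File.open(path_or_io, 'r') if Pathname.new(path_or_io).readable?
  # Otherwise, treat the string itself as CSV data.
  StringIO.new(path_or_io)
end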

Upsert fails when trying to import non-unique rows

I am trying to import a file which contains duplicate rows. Instead of updating subsequent records, I get this error:

ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR:  duplicate key value violates unique constraint "index_dealers_on_source_id_and_source_dealer_id"
DETAIL:  Key (source_id, source_dealer_id)=(2, 236) already exists.

Shouldn't a duplicate key get updated instead of throwing this error?

Add an option for log levels

Running postgres_upsert in development is terrible because every SQL query is logged. We should either set the log level to :info by default, or use a different logger for pg_upsert statements.

Simplify the API

There are currently 2 ways to call pg_upsert: ARClass.pg_upsert (for an AR class), or calling PostgresUpsert::Writer directly for non-ActiveRecord tables.

Let's unify these to look something like:
PostgresUpsert.write(ARClass, io, options) => uses ActiveRecord to access things like columns, primary key, etc.
PostgresUpsert.write("table_name", io, options) => uses explicit SQL to look up the table's columns and primary key.

key_column verification

It's easy, when first using pg_upsert, to pass a data source without an "id" column or a column matching the primary key of the destination table. This results in no records being inserted or updated. Passing the key_column is the workaround, but it's not intuitive enough. Options:

  1. Raise an error if all (any?) records in the temp table lack a value in the primary key field.
  2. pg_upsert or Writer.write could return the number of records inserted/updated.
  3. Raise an error if the values loaded into the temp table are not unique, or if the key_column field does not have a uniqueness constraint.

ActiveRecord::RecordNotUnique: PG::UniqueViolation: ERROR: duplicate key value violates unique constraint

Hi,
I am using this gem (thanks 👍) for updating records and also for seeding the db.
When I try to seed, the CSV file includes the id (primary key) and the other columns with their respective values.
I'm using this method: PostgresUpsert.write table_name, io, unique_key: unique_key, where in this case unique_key = ['id'].

The problem starts when I try to add a new record from the console (or admin panel), where I get the above error. It seems that the primary key sequence is not incremented, so Rails tries to create the new record with the first available id, which is already taken because of the seed upsert.

Is there a correct way to work around this issue? (Basically, use upsert for seeding and then continue "manually".)

Thanks
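
A common workaround (not part of postgres_upsert; reset_pk_sequence! is provided by Rails' PostgreSQL adapter): after seeding with explicit ids, advance the table's primary-key sequence so later inserts don't collide with the seeded rows:

PostgresUpsert.write "users", io, unique_key: ["id"]
# Resets the sequence based on the table's current maximum id.
ActiveRecord::Base.connection.reset_pk_sequence!("users")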

Option to delete after import?

It would be great to have a way to delete records that are not included in an import. Not sure how that works. Possibly after an import:

@timestamp = DateTime.now
User.pg_upsert...
User.pg_upsert...
User.pg_upsert...
User.pg_drop(@timestamp)

...That only buys some small syntax sugar over User.where("updated_at < ?", @timestamp).delete_all

or possibly

User.delete_after do
     User.pg_upsert...
     User.pg_upsert...
     User.pg_upsert...
end
