
sequelizer's Issues

Merge options from different sources

Originally, I designed Sequelizer to gather options from config/database.yml OR .env OR a set of options passed to the db/new_db methods.

The idea was to avoid confusing situations where .env overrode config/database.yml.

But I've decided to reverse that decision.

Options are given the following precedence: passed options > .env > manually defined environment variables > config/database.yml

So if config/database.yml defines the host as 'a', and there is an environment variable that defines SEQUELIZER_HOST=b, the host will be 'b'. Likewise, if a calling program passes the db function a hash with { host: 'c' } in it, the host will now be 'c'.
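A minimal sketch of the merge, assuming hypothetical helper methods (yaml_options, env_var_options, dotenv_options) for each source; later merges win, so sources are merged from lowest to highest precedence:

# Sketch only: each helper below is a hypothetical stand-in for one source.
def merged_options(passed_opts = {})
  yaml_options                # config/database.yml (lowest precedence)
    .merge(env_var_options)   # manually defined environment variables
    .merge(dotenv_options)    # .env
    .merge(passed_opts)       # options passed to db/new_db (highest precedence)
end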

This is useful for Thor-based scripts that take a set of connection options since they can just pass that options hash to the db method and allow users to override certain connection parameters.

My specific use case is that I have data living in many different "schemas" within my PostgreSQL database and I want to easily switch between those schemas using an argument, but I don't want to have to specify all my connection parameters on the command line.

Return Sequel::Database given a DataStore

outcomesinsights/jigsaw-diagram-editor#968 introduces a DataStore ActiveRecord model. I'd like Sequelizer to take in this DataStore object and return a Sequel::Database connection. I imagine Jigsaw and TShank will both leverage this feature.
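A hypothetical sketch of the shape this could take; every attribute on DataStore below is an assumption, since the model lives in the Jigsaw repo:

# Sketch: the attribute names on data_store are guesses, not the real model.
def db_from_data_store(data_store)
  new_db(
    adapter:  data_store.adapter,
    host:     data_store.host,
    database: data_store.database,
    username: data_store.username,
    password: data_store.password
  )
end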

Sequelizer script to modify Gemfile

As I mentioned in #1, right now a user has to manually edit their Gemfile to include the appropriate database gem to work with Sequel.

I'd love to have a completely hands-off, super-magical solution like the one proposed in #1, but for the time being, perhaps Sequelizer should ship a small script that lets users run a command in their project directory to modify their Gemfile for them.

Something like:

# sequelizer update_gemfile

This would add or replace a line in the Gemfile with the correct database gem for the adapter specified in the user's Sequelizer configuration. It's a lot less magical, but probably much more practical.
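A rough sketch of what the command might do; the adapter-to-gem mapping and the Gemfile rewriting below are illustrative assumptions, not the actual implementation:

# Sketch: map the configured adapter to its database gem, then rewrite the
# Gemfile so exactly one database gem line remains.
ADAPTER_GEMS = {
  'postgres' => 'pg',
  'mysql2'   => 'mysql2',
  'sqlite'   => 'sqlite3'
}.freeze

def update_gemfile(adapter)
  gem_name = ADAPTER_GEMS.fetch(adapter.to_s)
  lines = File.readlines('Gemfile').reject do |line|
    ADAPTER_GEMS.values.any? { |g| line =~ /^\s*gem\s+['"]#{g}['"]/ }
  end
  lines << "gem '#{gem_name}'\n"
  File.write('Gemfile', lines.join)
end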

Refactor SET command code

d3ba973 fixes a bug where Sequelizer called Impala::Connection#execute but never closed the cursor it returned.

I had consulted with @jeremyevans on how to best implement #7 and he suggested calling #log_connection_execute on any connection passed into the after_connect proc.

However, Impala::Connection doesn't implement this method, so, for now, Sequelizer is checking for a cursor and closing it if it finds one.
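In code, the workaround looks something like this (a sketch; set_statements stands in for however Sequelizer builds its SET commands):

# Sketch of the current workaround inside the after_connect proc:
after_connect = proc do |conn|
  set_statements.each do |sql|                  # hypothetical list of SET commands
    result = conn.execute(sql)
    result.close if result.respond_to?(:close)  # Impala hands back a cursor
  end
end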

It feels like a Bad Idea to be handling something so low-level in the middle of after_connect.

After outcomesinsights/sequel_impala#17 is implemented, we should refactor the code in Sequelizer::DbOpts to take advantage of the new API.

Pass extensions into Sequelizer

I'd like to have Sequelizer enable certain Sequel extensions.

I can think of a couple ways to do this:

  1. SEQUELIZER_EXTENSIONS=extension1,extension2
  2. SEQUELIZER_EXTENSION_EXTENSION1=1, SEQUELIZER_EXTENSION_EXTENSION2=1 (one environment variable per extension)

Because Sequelizer accumulates settings by reading config files and ENV variables, I think I prefer option 2 since it allows us to "stack" extension settings without overriding them.
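Under option 2, collecting the extensions is straightforward; a sketch, assuming a value of 1 means "enabled":

# Sketch: gather every SEQUELIZER_EXTENSION_* variable set to 1 and enable
# the corresponding Sequel extensions on the database.
extensions = ENV.keys.grep(/\ASEQUELIZER_EXTENSION_/)
                .select { |k| ENV[k] == '1' }
                .map { |k| k.sub('SEQUELIZER_EXTENSION_', '').downcase.to_sym }

db.extension(*extensions) unless extensions.empty?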

Migrate SparkPrepper from ConceptQL

ConceptQL currently has a "SparkPrepper" class that handles setting up views in Spark based on externally managed Parquet files.

TShank takes advantage of the class as well, but neither should really be responsible for setting up Spark views.

I would like to migrate this code into Sequelizer.

Originally the use case for SparkPrepper was to create a set of temporary views so externally managed Parquet files could be used as tables in Spark. It contained a lot of logic to create partitioned tables from some of these views.

Moving forward, we will have a more permanent set of managed tables that we want to be able to query from Spark. However, for the next few months, we'll most likely have a set of managed tables that have been specially partitioned for testing purposes, living in specific schemas, and then a set of unmanaged tables that need to be available across a range of schemas.

Unlike PostgreSQL, which supports multiple schemas via the search_path variable, Spark's USE <database> command limits us to a single schema. Only tables that live in that schema or in the global namespace are visible to queries.

Under PostgreSQL, TShank normally operates by defining a search_path to read from, letting us mix and match tables from various schemas; it makes unqualified references to tables like observations or patients. We need to support similar behavior under Spark.
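To make the difference concrete (Sequel calls shown for illustration only):

# PostgreSQL: unqualified names resolve across every schema in search_path.
db.run("SET search_path TO schema1, schema2")
db[:observations].count  # finds schema1.observations or schema2.observations

# Spark: USE selects exactly one schema; anything else needs a qualified
# name or a view that points at it.
db.run("USE schema1")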

We can probably emulate this behavior via SparkPrepper. It will need to take a path to externally managed Parquet files, from which it will derive temporary views, and a set of schema names, from which it will figure out which tables live in each schema and create views as necessary to pull from some of those schemas.

For example, say I have:

  • schema1.observations
  • schema2.supplemented_payer_reimbursements
  • schema2.observations
  • Externally managed files /opt/data/gdm/{observations,patients,supplemented_payer_reimbursements}.parquet
  • I pass it the following schemas and path:
    • schema1
    • schema2
    • path /opt/data/gdm

I'd like SparkPrepper to execute the following to cobble together the appropriate tables:

USE schema1;
CREATE TEMPORARY VIEW `supplemented_payer_reimbursements` AS SELECT * FROM schema2.supplemented_payer_reimbursements;
CREATE TEMPORARY VIEW `patients` USING org.apache.spark.sql.parquet OPTIONS ('path' = '/opt/data/gdm/patients.parquet');

I'll need to test Spark to make sure that creating this kind of "pass-through view" on schema2 doesn't impact performance. If it doesn't, it might make sense to make a view for observations as well and not worry about the USE command at all.
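A hedged sketch of the view-generation logic; the method name is a guess, and it assumes Sequel's db.tables accepts a :schema option against Spark:

# Sketch: create pass-through views for tables visible only in secondary
# schemas, then parquet-backed views for any remaining external files.
def create_views(db, schemas, parquet_path)
  primary, *rest = schemas
  db.run("USE #{primary}")
  seen = db.tables.map(&:to_s)
  rest.each do |schema|
    db.tables(schema: schema).map(&:to_s).each do |table|
      next if seen.include?(table)
      db.run("CREATE TEMPORARY VIEW `#{table}` AS SELECT * FROM #{schema}.#{table}")
      seen << table
    end
  end
  Dir.glob(File.join(parquet_path, '*.parquet')).each do |path|
    table = File.basename(path, '.parquet')
    next if seen.include?(table)
    db.run("CREATE TEMPORARY VIEW `#{table}` USING org.apache.spark.sql.parquet " \
           "OPTIONS ('path' = '#{path}')")
  end
end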

Dynamically require appropriate adapter

Currently, a user has to explicitly specify in their Gemfile the database gem they want to use with Sequelizer.

It would be great if Sequelizer could detect which adapter the user specified in their configuration and then require the appropriate database gem at runtime.

I haven't found a way to achieve this, but I'm interested in suggestions. Ultimately, I envision something like the following scenario:

  • User includes gem 'sequelizer' in their Gemfile
  • User creates a database configuration file for Sequelizer
  • User runs bundle to install gems and their dependencies
  • Bundler installs the sequelizer gem
  • User runs bundle again; this time, Sequelizer determines which database adapter the user specified in their configuration and tells Bundler to install that gem?
  • User runs program/script that uses Sequelizer
  • Sequelizer detects which database adapter is in use and requires the appropriate database gem (see the sketch below)
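The runtime half of that last step might look like this (a sketch; it reuses an adapter-to-gem mapping like the ADAPTER_GEMS hash sketched earlier):

# Sketch: require the configured adapter's gem lazily at connection time.
def require_adapter!(adapter, mapping = ADAPTER_GEMS)
  gem_name = mapping[adapter.to_s] or return
  require gem_name
rescue LoadError
  raise LoadError, "add gem '#{gem_name}' to your Gemfile to use the #{adapter} adapter"
end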

Universal Connection URL format?

Connection URLs vary quite a bit between different database drivers.

For instance, the pg driver allows a connection string like postgres://username:password@somehost.example.com/somedatabase whereas the JDBC driver only allows jdbc:postgresql://somehost.example.com/somedatabase?username=username&password=password

In Impala, two of the JDBC drivers have vastly different connection strings when employing Kerberos. jdbc/impala likes the following:

jdbc:impala://worker1.hadoop.jsaw.io:21050/default;AuthMech=1;KrbServiceName=impala;KrbAuthType=2;KrbHostFQDN=ip-172-31-15-50.us-west-2.compute.internal;KrbRealm=CLOUDERADEV.OUTINS.COM

whereas jdbc/hive2 likes:

jdbc:hive2://worker1.hadoop.jsaw.io:21050/default;principal=impala/ip-172-31-15-50.us-west-2.compute.internal@CLOUDERADEV.OUTINS.COM

I'd like to find a way to generate the correct connection URL based on a set of options passed into Sequelizer.

@jeremyevans proposed a simple case statement that switches on the value of opts[:adapter]. I think this is a good place to start; filling in the Kerberos URL formats from above gives a rough sketch:

url = case opts[:adapter]
      when :jdbc_hive2
        # jdbc/hive2 format, per the Kerberos example above
        "jdbc:hive2://#{opts[:host]}:#{opts[:port]}/#{opts[:database]};" \
        "principal=impala/#{opts[:host_fqdn]}@#{opts[:realm]}"
      when :jdbc_impala
        # jdbc/impala format, per the Kerberos example above
        "jdbc:impala://#{opts[:host]}:#{opts[:port]}/#{opts[:database]};" \
        "AuthMech=1;KrbServiceName=impala;KrbAuthType=2;" \
        "KrbHostFQDN=#{opts[:host_fqdn]};KrbRealm=#{opts[:realm]}"
      when :impala
        # impala-ruby connects with host/port options rather than a URL
      end

From what I can tell, the following options need to be supported:

  • username
  • password
  • realm
  • host*
  • host_fqdn (defaults to host)
  • database
  • port

I'd like to focus on URL support for:

  • PostgreSQL
    • drivers I'd like to support:
      • pg
      • jdbc:postgres
    • username/password are the only options required; no need to handle Kerberos.
  • Impala
    • drivers I'd like to support:
      • jdbc:hive2
      • jdbc:impala
      • impala-ruby
    • Handle no authentication and authentication via Kerberos
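For the PostgreSQL side, a sketch built from the URL formats quoted at the top of this issue (the :jdbc_postgres symbol and the helper name are assumptions):

# Sketch: both formats mirror the pg and JDBC examples above; no Kerberos.
def postgres_url(opts)
  case opts[:adapter]
  when :postgres
    "postgres://#{opts[:username]}:#{opts[:password]}@#{opts[:host]}/#{opts[:database]}"
  when :jdbc_postgres
    "jdbc:postgresql://#{opts[:host]}/#{opts[:database]}" \
    "?username=#{opts[:username]}&password=#{opts[:password]}"
  end
end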
