outcomesinsights / sequelizer
A Ruby Gem to quickly establish a Sequel-based connection to a database
License: MIT License
As described in the following bug report from Amgen (https://github.com/outcomesinsights/amgen_jigsaw_external/issues/119), studies were failing because connections were not being assigned to the jigsaw.tst pool. When study queries are sent, they do not all use the same connection for the entire process. In some instances, new connections with the wrong pool assignment are handed to study queries, causing the studies to fail.
Originally I designed Sequelizer to gather options from config/database.yml OR .env OR from a set of options passed to the db/new_db methods.
The idea was to avoid confusing situations where .env overrode config/database.yml.
But I've decided to reverse that decision.
Options are given the following precedence: passed options > .env > manually defined environment variables > config/database.yml
So if config/database.yml defines the host as 'a', and there is an environment variable that defines SEQUELIZER_HOST=b, the host will be 'b'. Likewise, if a calling program passes the db function a hash with { host: 'c' } in it, the host will now be 'c'.
This is useful for Thor-based scripts that take a set of connection options since they can just pass that options hash to the db method and allow users to override certain connection parameters.
My specific use case is that I have data living in many different "schemas" within my PostgreSQL database and I want to easily switch between those schemas using an argument, but I don't want to have to specify all my connection parameters on the command line.
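The precedence rules above can be sketched as a chain of hash merges, lowest-precedence source first. This is only an illustration; the method and variable names here are assumptions, not Sequelizer's actual internals:

```ruby
# Hypothetical sketch of option precedence: later merges win, so
# passed options beat environment variables, which beat database.yml.
def resolve_options(passed = {})
  yaml_opts = { host: 'a' }  # stand-in for config/database.yml
  env_opts  = { host: 'b' }  # stand-in for SEQUELIZER_HOST=b
  yaml_opts.merge(env_opts).merge(passed)
end

resolve_options              # => { host: "b" }
resolve_options(host: 'c')   # => { host: "c" }
```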
outcomesinsights/jigsaw-diagram-editor#968 introduces a DataStore AR Model. I'd like Sequelizer to take in this DataStore object and return a Sequel::Database connection. I imagine Jigsaw and TShank will both leverage this feature.
As I mentioned in #1, right now a user has to manually edit their Gemfile to include the appropriate database gem to work with Sequel.
I'd love to have a completely hands-off, super-magical solution like the one proposed in #1, but for the time being, perhaps I should create a small script with Sequelizer that will let users run a command in their project directory and it will modify their Gemfile for them.
Something like:
$ sequelizer update_gemfile
Which would add/replace a line in the Gemfile with the correct database gem to work with the adapter that they specified in their Sequelizer configuration. It's a lot less magical, but probably much more practical.
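A minimal sketch of what such a command might do, assuming a hard-coded adapter-to-gem table (the table, the method name, and the gem list are illustrative assumptions, not Sequelizer's API):

```ruby
# Hypothetical sketch: swap the database gem line in a Gemfile to match
# the adapter from the user's Sequelizer configuration.
ADAPTER_GEMS = {
  'postgres' => 'pg',
  'mysql2'   => 'mysql2',
  'sqlite'   => 'sqlite3',
  'tinytds'  => 'tiny_tds'
}.freeze

def update_gemfile(adapter, path = 'Gemfile')
  gem_name = ADAPTER_GEMS.fetch(adapter) { raise ArgumentError, "unknown adapter: #{adapter}" }
  # Drop any existing database-gem lines, then append the right one.
  lines = File.readlines(path).reject do |line|
    ADAPTER_GEMS.values.any? { |g| line =~ /\Agem ['"]#{Regexp.escape(g)}['"]/ }
  end
  File.write(path, (lines + ["gem '#{gem_name}'\n"]).join)
end
```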
Jigsaw needs to know what DB options are set before running a study, so Sequelizer needs to expose this as part of its API.
d3ba973 fixes a bug where Sequelizer called Impala::Connection#execute, but never closed the cursor that is returned.
I had consulted with @jeremyevans on how to best implement #7 and he suggested calling #log_connection_execute on any connection passed into the after_connect proc.
However, Impala::Connection doesn't implement this method, so, for now, Sequelizer is checking for a cursor and closing it if it finds one.
It feels like a Bad Idea to be handling something so low-level in the middle of after_connect.
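The stopgap described above amounts to closing whatever `execute` returns, if it responds to `#close`. The sketch below uses stand-in classes purely for illustration; the real objects would be `Impala::Connection` and its cursor:

```ruby
# Stand-ins for Impala::Connection and its cursor, for illustration only.
class FakeCursor
  attr_reader :closed

  def initialize
    @closed = false
  end

  def close
    @closed = true
  end
end

class FakeConnection
  def execute(_sql)
    FakeCursor.new
  end
end

# The workaround: inside after_connect, close any cursor that a
# statement returns so it doesn't leak.
after_connect = proc do |conn|
  result = conn.execute('SET some_setting=1')
  result.close if result.respond_to?(:close)
  result
end

after_connect.call(FakeConnection.new)
```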
After outcomesinsights/sequel_impala#17 is implemented, we should refactor the code in Sequelizer::DbOpts to take advantage of the new API.
I'd like to have Sequelizer enable certain Sequel extensions.
I can think of a couple ways to do this:
Because Sequelizer accumulates settings by reading config files and ENV variables, I think I prefer option 2 since it allows us to "stack" extensions settings without overriding them.
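The "stacking" idea might look like the following sketch, where extension names from every source are accumulated instead of overridden (the source hashes and key name here are assumptions for illustration):

```ruby
# Hypothetical sketch: collect extension names from multiple
# configuration sources and combine them, rather than letting the
# highest-precedence source replace the rest.
sources = [
  { extensions: 'pg_array' },          # stand-in for config/database.yml
  { extensions: 'pg_json,error_sql' }  # stand-in for SEQUELIZER_EXTENSIONS
]

extensions = sources.flat_map { |opts| opts[:extensions].to_s.split(',') }
                    .map(&:strip)
                    .uniq
# extensions => ["pg_array", "pg_json", "error_sql"]
```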
ConceptQL currently has a "SparkPrepper" class that handles setting up views in Spark based on externally managed Parquet files.
TShank takes advantage of the class as well, but neither should really be responsible for setting up Spark views.
I would like to migrate this code into Sequelizer.
Originally the use case for SparkPrepper was to create a set of temporary views so externally managed Parquet files could be used as tables in Spark. It contained a lot of logic to create partitioned tables from some of these views.
Moving forward, we will have a more permanent set of managed tables that we want to be able to query from Spark. However, for the next few months, we'll most likely have a set of managed tables that have been specially partitioned for testing purposes, living in specific schemas, and then a set of unmanaged tables that need to be available across a range of schemas.
Unlike PostgreSQL, which supports multiple schemas defined in the `search_path` variable, Spark's `USE <database>` command limits us to a single schema. Only tables that live in that schema and the global namespace are visible to queries.
Under PostgreSQL, TShank normally operates by defining a `search_path` to read from, and we're able to mix and match tables from various schemas. It makes unqualified calls to tables like `observations` or `patients`. We need to support similar behavior under Spark.
We can probably emulate this behavior via SparkPrepper. It will need to be able to take a path to externally managed Parquet files, from which it will derive temporary views. It will need to be able to take a set of schema names and, from that, figure out which tables live in each schema, then create views as necessary to pull from some of those schemas.
For example, say I have schema1 as my primary schema, a supplemented_payer_reimbursements table living in schema2, and a patients table stored as an externally managed Parquet file.
I'd like SparkPrepper to execute the following to cobble together the appropriate tables:
USE schema1;
CREATE TEMPORARY VIEW `supplemented_payer_reimbursements` AS SELECT * FROM schema2.supplemented_payer_reimbursements;
CREATE TEMPORARY VIEW `patients` USING org.apache.spark.sql.parquet OPTIONS ('path' = '/opt/data/gdm/patients.parquet');
I'll need to test Spark to make sure that creating this kind of "pass thru view" on schema2 doesn't impact performance. If it doesn't, it might make sense to make a view for observations as well and not worry about the USE command at all.
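A sketch of how SparkPrepper might generate those statements from a primary schema, a mapping of other schemas to their tables, and a mapping of external Parquet files. Every name here is illustrative, not Sequelizer's or SparkPrepper's actual API:

```ruby
# Hypothetical sketch: cobble together USE and CREATE TEMPORARY VIEW
# statements so tables from other schemas and external Parquet files
# are visible under one primary schema.
def spark_prep_statements(primary_schema, schema_tables, parquet_paths)
  stmts = ["USE #{primary_schema};"]
  # Pass-thru views for tables living in other schemas.
  schema_tables.each do |schema, tables|
    next if schema == primary_schema
    tables.each do |table|
      stmts << "CREATE TEMPORARY VIEW `#{table}` AS SELECT * FROM #{schema}.#{table};"
    end
  end
  # Views over externally managed Parquet files.
  parquet_paths.each do |table, path|
    stmts << "CREATE TEMPORARY VIEW `#{table}` USING org.apache.spark.sql.parquet OPTIONS ('path' = '#{path}');"
  end
  stmts
end

spark_prep_statements('schema1',
                      { 'schema2' => ['supplemented_payer_reimbursements'] },
                      { 'patients' => '/opt/data/gdm/patients.parquet' })
```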
Currently, a user has to explicitly specify in their Gemfile the database gem they want to use with Sequelizer.
It would be great if Sequelizer could detect which adapter the user specified in their configuration and then require the appropriate database gem at runtime.
I haven't found a way to achieve this, but I'm interested in suggestions. Ultimately, I envision something like the following scenario:
1. The user adds gem 'sequelizer' to their Gemfile
2. The user runs bundle to install gems and their dependencies
3. The user runs bundle again, and this time Sequelizer determines which database adapter the user specified in their configuration and tells Bundler to install that gem?

Connection URLs vary quite a bit between different database drivers.
For instance, the pg driver allows a connection string like postgres://username:password@somehost.example.com/somedatabase
whereas the JDBC driver only allows jdbc:postgresql://somehost.example.com/somedatabase?username=username&password=password
In Impala, two of the JDBC drivers have vastly different connection strings when employing Kerberos. jdbc/impala likes the following:
jdbc:impala://worker1.hadoop.jsaw.io:21050/default;AuthMech=1;KrbServiceName=impala;KrbAuthType=2;KrbHostFQDN=ip-172-31-15-50.us-west-2.compute.internal;KrbRealm=CLOUDERADEV.OUTINS.COM
whereas jdbc/hive2 likes:
jdbc:hive2://worker1.hadoop.jsaw.io:21050/default;principal=impala/ip-172-31-15-50.us-west-2.compute.internal@CLOUDERADEV.OUTINS.COM
I'd like to find a way to generate the correct connection URL based on a set of options passed into Sequelizer.
@jeremyevans proposed a simple case statement that switches on the value of options[:adapter]. I think this is a good place to start.
url = case opts[:adapter]
when :jdbc_hive2
# ...
when :jdbc_impala
# ...
when :impala
# ...
end
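Filled in with the URL shapes from the examples above, that case statement might look like the sketch below. The option keys (`:host`, `:port`, `:database`, `:principal`, `:host_fqdn`, `:realm`) are assumptions, not an existing Sequelizer API:

```ruby
# Hypothetical sketch: build a driver-specific connection URL from a
# common options hash, switching on the adapter.
def connection_url(opts)
  case opts[:adapter]
  when :jdbc_hive2
    "jdbc:hive2://#{opts[:host]}:#{opts[:port]}/#{opts[:database]};" \
      "principal=#{opts[:principal]}"
  when :jdbc_impala
    "jdbc:impala://#{opts[:host]}:#{opts[:port]}/#{opts[:database]};" \
      "AuthMech=1;KrbServiceName=impala;KrbAuthType=2;" \
      "KrbHostFQDN=#{opts[:host_fqdn]};KrbRealm=#{opts[:realm]}"
  else
    raise ArgumentError, "unsupported adapter: #{opts[:adapter].inspect}"
  end
end
```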
From what I can tell, the following options need to be supported:
I'd like to focus on support for URLs for: