outcomesinsights / sequelizer
A Ruby Gem to quickly establish a Sequel-based connection to a database
License: MIT License
As described in the following bug report from Amgen (https://github.com/outcomesinsights/amgen_jigsaw_external/issues/119), studies were failing because connections were not being assigned to the jigsaw.tst pool. When study queries are sent, they do not all use the same connection for the entire process. In some instances, new connections with the wrong pool assignment are handed to study queries, causing the studies to fail.
Originally I designed Sequelizer to gather options from config/database.yml OR .env OR from a set of options passed to the db/new_db methods.
The idea was to avoid confusing situations where .env overrode config/database.yml.
But I've decided to reverse that decision.
Options are given the following precedence: passed options > .env > manually defined environment variables > config/database.yml
So if config/database.yml defines the host as 'a', and there is an environment variable that defines SEQUELIZER_HOST=b, the host will be 'b'. Likewise, if a calling program passes the db function a hash with { host: 'c' } in it, the host will now be 'c'.
This is useful for Thor-based scripts that take a set of connection options since they can just pass that options hash to the db method and allow users to override certain connection parameters.
My specific use case is that I have data living in many different "schemas" within my PostgreSQL database and I want to easily switch between those schemas using an argument, but I don't want to have to specify all my connection parameters on the command line.
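The precedence rules above can be sketched as a chain of hash merges, lowest-precedence source first. This is only an illustration; the method and variable names here are assumptions, not Sequelizer's actual internals:

```ruby
# Hypothetical sketch of option precedence: later merges win, so
# passed options beat environment variables, which beat database.yml.
def resolve_options(passed = {})
  yaml_opts = { host: 'a' }  # stand-in for config/database.yml
  env_opts  = { host: 'b' }  # stand-in for SEQUELIZER_HOST=b
  yaml_opts.merge(env_opts).merge(passed)
end

resolve_options              # => { host: "b" }
resolve_options(host: 'c')   # => { host: "c" }
```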
outcomesinsights/jigsaw-diagram-editor#968 introduces a DataStore AR Model. I'd like Sequelizer to take in this DataStore object and return a Sequel::Database connection. I imagine Jigsaw and TShank will both leverage this feature.
As I mentioned in #1, right now a user has to manually edit their Gemfile to include the appropriate database gem to work with Sequel.
I'd love to have a completely hands-off, super-magical solution like the one proposed in #1, but for the time being, perhaps I should create a small script with Sequelizer that will let users run a command in their project directory and it will modify their Gemfile for them.
Something like:
$ sequelizer update_gemfile
Which would add/replace a line in the Gemfile with the correct database gem to work with the adapter that they specified in their Sequelizer configuration. It's a lot less magical, but probably much more practical.
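A minimal sketch of what such a command might do, assuming a hard-coded adapter-to-gem table (the table, the method name, and the gem list are illustrative assumptions, not Sequelizer's API):

```ruby
# Hypothetical sketch: swap the database gem line in a Gemfile to match
# the adapter from the user's Sequelizer configuration.
ADAPTER_GEMS = {
  'postgres' => 'pg',
  'mysql2'   => 'mysql2',
  'sqlite'   => 'sqlite3',
  'tinytds'  => 'tiny_tds'
}.freeze

def update_gemfile(adapter, path = 'Gemfile')
  gem_name = ADAPTER_GEMS.fetch(adapter) { raise ArgumentError, "unknown adapter: #{adapter}" }
  # Drop any existing database-gem lines, then append the right one.
  lines = File.readlines(path).reject do |line|
    ADAPTER_GEMS.values.any? { |g| line =~ /\Agem ['"]#{Regexp.escape(g)}['"]/ }
  end
  File.write(path, (lines + ["gem '#{gem_name}'\n"]).join)
end
```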
Jigsaw needs to know what DB options are set before running a study, so Sequelizer needs to expose this as part of its API.
d3ba973 fixes a bug where Sequelizer called Impala::Connection#execute, but never closed the cursor that is returned.
I had consulted with @jeremyevans on how to best implement #7 and he suggested calling #log_connection_execute on any connection passed into the after_connect proc.
However, Impala::Connection doesn't implement this method, so, for now, Sequelizer is checking for a cursor and closing it if it finds one.
It feels like a Bad Idea to be handling something so low-level in the middle of after_connect.
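The stopgap described above amounts to closing whatever `execute` returns, if it responds to `#close`. The sketch below uses stand-in classes purely for illustration; the real objects would be `Impala::Connection` and its cursor:

```ruby
# Stand-ins for Impala::Connection and its cursor, for illustration only.
class FakeCursor
  attr_reader :closed

  def initialize
    @closed = false
  end

  def close
    @closed = true
  end
end

class FakeConnection
  def execute(_sql)
    FakeCursor.new
  end
end

# The workaround: inside after_connect, close any cursor that a
# statement returns so it doesn't leak.
after_connect = proc do |conn|
  result = conn.execute('SET some_setting=1')
  result.close if result.respond_to?(:close)
  result
end

after_connect.call(FakeConnection.new)
```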
After outcomesinsights/sequel_impala#17 is implemented, we should refactor the code in Sequelizer::DbOpts to take advantage of the new API.
I'd like to have Sequelizer enable certain Sequel extensions.
I can think of a couple ways to do this:
Because Sequelizer accumulates settings by reading config files and ENV variables, I think I prefer option 2 since it allows us to "stack" extensions settings without overriding them.
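The "stacking" idea might look like the following sketch, where extension names from every source are accumulated instead of overridden (the source hashes and key name here are assumptions for illustration):

```ruby
# Hypothetical sketch: collect extension names from multiple
# configuration sources and combine them, rather than letting the
# highest-precedence source replace the rest.
sources = [
  { extensions: 'pg_array' },          # stand-in for config/database.yml
  { extensions: 'pg_json,error_sql' }  # stand-in for SEQUELIZER_EXTENSIONS
]

extensions = sources.flat_map { |opts| opts[:extensions].to_s.split(',') }
                    .map(&:strip)
                    .uniq
# extensions => ["pg_array", "pg_json", "error_sql"]
```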
ConceptQL currently has a "SparkPrepper" class that handles setting up views in Spark based on externally managed Parquet files.
TShank takes advantage of the class as well, but neither should really be responsible for setting up Spark views.
I would like to migrate this code into Sequelizer.
Originally the use case for SparkPrepper was to create a set of temporary views so externally managed Parquet files could be used as tables in Spark. It contained a lot of logic to create partitioned tables from some of these views.
Moving forward, we will have a more permanent set of managed tables that we want to be able to query from Spark. However, for the next few months, we'll most likely have a set of managed tables that have been specially partitioned for testing purposes, living in specific schemas, and then a set of unmanaged tables that need to be available across a range of schemas.
Unlike PostgreSQL, which supports multiple schemas defined in the `search_path` variable, Spark's `USE <database>` command limits us to a single schema. Only tables that live in that schema and the global namespace are visible to queries.
Under PostgreSQL, TShank normally operates by defining a `search_path` to read from, and we're able to mix and match tables from various schemas. It makes unqualified calls to tables like `observations` or `patients`. We need to support similar behavior under Spark.
We can probably emulate this behavior via SparkPrepper. It will need to be able to take a path to externally managed Parquet files, from which it will derive temporary views. It will need to be able to take a set of schema names and, from that, figure out which tables live in each schema, then create views as necessary to pull from some of those schemas.
For example, say I have schema1 as my primary schema, a supplemented_payer_reimbursements table living in schema2, and a patients table stored as an externally managed Parquet file.
I'd like SparkPrepper to execute the following to cobble together the appropriate tables:
USE schema1;
CREATE TEMPORARY VIEW `supplemented_payer_reimbursements` AS SELECT * FROM schema2.supplemented_payer_reimbursements;
CREATE TEMPORARY VIEW `patients` USING org.apache.spark.sql.parquet OPTIONS ('path' = '/opt/data/gdm/patients.parquet');
I'll need to test Spark to make sure that creating this kind of "pass thru view" on schema2 doesn't impact performance. If it doesn't, it might make sense to make a view for observations as well and not worry about the USE command at all.
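A sketch of how SparkPrepper might generate those statements from a primary schema, a mapping of other schemas to their tables, and a mapping of external Parquet files. Every name here is illustrative, not Sequelizer's or SparkPrepper's actual API:

```ruby
# Hypothetical sketch: cobble together USE and CREATE TEMPORARY VIEW
# statements so tables from other schemas and external Parquet files
# are visible under one primary schema.
def spark_prep_statements(primary_schema, schema_tables, parquet_paths)
  stmts = ["USE #{primary_schema};"]
  # Pass-thru views for tables living in other schemas.
  schema_tables.each do |schema, tables|
    next if schema == primary_schema
    tables.each do |table|
      stmts << "CREATE TEMPORARY VIEW `#{table}` AS SELECT * FROM #{schema}.#{table};"
    end
  end
  # Views over externally managed Parquet files.
  parquet_paths.each do |table, path|
    stmts << "CREATE TEMPORARY VIEW `#{table}` USING org.apache.spark.sql.parquet OPTIONS ('path' = '#{path}');"
  end
  stmts
end

spark_prep_statements('schema1',
                      { 'schema2' => ['supplemented_payer_reimbursements'] },
                      { 'patients' => '/opt/data/gdm/patients.parquet' })
```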
Currently, a user has to explicitly specify in their Gemfile the database gem they want to use with Sequelizer.
It would be great if Sequelizer could detect which adapter the user specified in their configuration and then require the appropriate database gem at runtime.
I haven't found a way to achieve this, but I'm interested in suggestions. Ultimately, I envision something like the following scenario:
1. The user adds gem 'sequelizer' to their Gemfile
2. The user runs bundle to install gems and their dependencies
3. The user runs bundle again, and this time Sequelizer determines which database adapter the user specified in their configuration and tells Bundler to install that gem?

Connection URLs vary quite a bit between different database drivers.
For instance, the pg driver allows a connection string like postgres://username:password@somehost.example.com/somedatabase
whereas the JDBC driver only allows jdbc:postgresql://somehost.example.com/somedatabase?username=username&password=password
In Impala, two of the JDBC drivers have vastly different connection strings when employing Kerberos. jdbc/impala likes the following:
jdbc:impala://worker1.hadoop.jsaw.io:21050/default;AuthMech=1;KrbServiceName=impala;KrbAuthType=2;KrbHostFQDN=ip-172-31-15-50.us-west-2.compute.internal;KrbRealm=CLOUDERADEV.OUTINS.COM
whereas jdbc/hive2 likes:
jdbc:hive2://worker1.hadoop.jsaw.io:21050/default;principal=impala/ip-172-31-15-50.us-west-2.compute.internal@CLOUDERADEV.OUTINS.COM
I'd like to find a way to generate the correct connection URL based on a set of options passed into Sequelizer.
@jeremyevans proposed a simple case statement that switches on the value of options[:adapter]. I think this is a good place to start.
url = case opts[:adapter]
when :jdbc_hive2
# ...
when :jdbc_impala
# ...
when :impala
# ...
end
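Filled in with the URL shapes from the examples above, that case statement might look like the sketch below. The option keys (`:host`, `:port`, `:database`, `:principal`, `:host_fqdn`, `:realm`) are assumptions, not an existing Sequelizer API:

```ruby
# Hypothetical sketch: build a driver-specific connection URL from a
# common options hash, switching on the adapter.
def connection_url(opts)
  case opts[:adapter]
  when :jdbc_hive2
    "jdbc:hive2://#{opts[:host]}:#{opts[:port]}/#{opts[:database]};" \
      "principal=#{opts[:principal]}"
  when :jdbc_impala
    "jdbc:impala://#{opts[:host]}:#{opts[:port]}/#{opts[:database]};" \
      "AuthMech=1;KrbServiceName=impala;KrbAuthType=2;" \
      "KrbHostFQDN=#{opts[:host_fqdn]};KrbRealm=#{opts[:realm]}"
  else
    raise ArgumentError, "unsupported adapter: #{opts[:adapter].inspect}"
  end
end
```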
From what I can tell, the following options need to be supported:
I'd like to focus on support for URLs for: