
lcc-server's Introduction


LCC-Server is a Python framework to serve collections of light curves. The code here forms the basis for the HAT data server. See the installation notes below for how to install and configure the server.

Features

LCC-Server includes the following functionality:

  • collection of light curves from various projects into a single output format (text CSV files)
  • HTTP API and an interactive frontend for searching over multiple light curve collections by:
    • spatial cone search near specified coordinates
    • full-text search on object names, descriptions, and tags, with name resolution using SIMBAD's SESAME resolver for individual objects as well as open clusters, nebulae, etc.
    • queries based on applying filters to database columns of object properties, e.g. object names, magnitudes, colors, proper motions, variability and object type tags, variability indices, etc.
    • cross-matching to uploaded object lists with object IDs and coordinates
  • HTTP API for asynchronously generating datasets from search results and an interactive frontend for browsing these, caching results from searches, and generating output ZIP bundles containing search results and all matching light curves
  • HTTP API and interactive frontend for detailed information per object, including light curve plots, external catalog info, and period-finding results plus phased LCs if available
  • Access controls for all generated datasets, and support for user sign-ins and sign-ups
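
For example, the dataset-listing endpoint of the HTTP API (the same one exercised in the benchmarks further down this page) can be queried with the requests package; the host and port below are assumptions for a locally running indexserver:

import requests

# list the datasets available on a locally running LCC-Server instance;
# adjust the host and port for your installation
resp = requests.get('http://127.0.0.1:12500/api/datasets', timeout=10)
resp.raise_for_status()
print(resp.status_code, len(resp.content), 'bytes received')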

Installation

NOTE: Python >= 3.6 is required. Use of a virtualenv is recommended; something like this will work well:

$ python3 -m venv lcc
$ source lcc/bin/activate

This package is available on PyPI. Install it with the virtualenv activated:

$ pip install numpy  # to set up Fortran bindings for dependencies
$ pip install lccserver  # add --pre to install unstable versions

To install the latest version from Github:

$ git clone https://github.com/waqasbhatti/lcc-server
$ cd lcc-server
$ pip install -e .

If you're on Linux or MacOS, you can optionally install the uvloop package to speed up some of the event-loop operations:

$ pip install uvloop

SQLite requirement

The LCC-Server relies on the system SQLite library being new enough to include the fts5 full-text search module. On some older Enterprise Linux systems, this isn't the case. To get the LCC-Server and its tests running on these systems, you'll have to install a newer version of the SQLite amalgamation. I recommend downloading the autoconf tarball so it's easy to install; e.g. for SQLite 3.27.2, use this file: sqlite-autoconf-3270200.tar.gz.

To install at the default location /usr/local/lib:

$ tar xvf sqlite-autoconf-3270200.tar.gz
$ cd sqlite-autoconf-3270200
$ ./configure
$ make
$ sudo make install

Then, override the default location that Python uses for its SQLite library using LD_LIBRARY_PATH:

$ export LD_LIBRARY_PATH='/usr/local/lib'

# create a virtualenv using Python 3
# here I've installed Python 3.7 to /opt/python37
$ /opt/python37/bin/python3 -m venv env

# activate the virtualenv, launch Python, and check if we've got a newer SQLite
$ source env/bin/activate
(env) $ python3
Python 3.7.0 (default, Jun 28 2018, 15:17:26)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
>>> sqlite3.sqlite_version
'3.27.2'

You can then run the LCC-Server using this virtualenv. You can use an Environment directive in the systemd service files to add in the LD_LIBRARY_PATH override before launching the server.
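
To confirm that the interpreter is actually picking up an fts5-capable SQLite build, a quick check like the following (a minimal sketch) can be run inside the virtualenv:

import sqlite3

# a minimal check that the loaded SQLite build includes the fts5 module;
# if this raises sqlite3.OperationalError, the override didn't take effect
conn = sqlite3.connect(':memory:')
try:
    conn.execute('CREATE VIRTUAL TABLE ftscheck USING fts5(content)')
    print('fts5 is available')
except sqlite3.OperationalError:
    print('fts5 is NOT available; check LD_LIBRARY_PATH and the SQLite build')
finally:
    conn.close()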

Using the server

Some post-installation setup is required to begin serving light curves. In particular, you will need to set up a base directory for the LCC-Server to work from, along with various sub-directories.

To make this process easier, there's an interactive CLI available when you install LCC-Server. This will be in your $PATH as lcc-server.

A Jupyter notebook walkthrough using this CLI to stand up an LCC-Server instance, with example light curves, can be found in the astrobase-notebooks repo: lcc-server-setup.ipynb (Jupyter nbviewer).

Documentation

Server docs are automatically generated from the server-docs directory in the git repository. Sphinx-based documentation for the Python modules is on the TODO list and will be linked here when done.

Changelog

Please see: https://github.com/waqasbhatti/lcc-server/blob/master/CHANGELOG.md for a list of changes applicable to tagged release versions.

Screenshots

The search interface

LCC server search interface

Datasets from search results

LCC server results display

Per-object information

LCC server object info

License

LCC-Server is provided under the MIT License. See the LICENSE file for the full text.

lcc-server's People

Contributors

dependabot[bot], waqasbhatti


lcc-server's Issues

make the table div in the dataset view height = window.height

this is so we always have the horizontal scroll bar at the bottom of the window. should also add an event listener on the window.resize event so we can always keep the table's height = window.height

also need to figure out the lingering table width issue. look up how I solved this for LC server v1.

site-specific settings from site.json in basedir, add lcc-server info footer to pages

This will allow us to add site specific stuff like:

  • documentation for the collections available (/docs/collections)
  • documentation for the project running this LCC server (/docs/project)
  • citations required when downloading and using the light curves (/docs/cite)

All of this stuff will go into the footer, which will look something like:

[lcc-server -> /docs/lcc-server] [github]         [operating project name and link]
                                                  [light curve collection docs]
                                                  [citing data]

add ability to make individual objects private/public

this requires an integer column in objectinfo-catalog.sqlite called object_is_public. This column will not be returned by the dbsearch.py functions, but an optional kwarg for these functions, require_object_ispublic, can be used to enforce the rule that only objects marked public are returned.
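
A minimal sketch of how such a kwarg could gate results (the table and column layout here are placeholders, not the actual dbsearch.py code):

import sqlite3

def fetch_public_objects(dbpath, require_object_ispublic=True):
    # object_catalog and its columns are hypothetical names for illustration
    query = 'SELECT objectid, ra, decl FROM object_catalog'
    if require_object_ispublic:
        # enforce the rule: only objects marked public are returned
        query = query + ' WHERE object_is_public = 1'
    with sqlite3.connect(dbpath) as conn:
        return conn.execute(query).fetchall()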

separate out docs handling into internal lcc-server docs vs basedir docs

we'll do this because we want to separate out internal docs for LCC server processes from any collection- or installation-specific docs people put in the lcc-server-basedir. This way we can also version control useful stuff like the lcformat and API docs.

the /docs endpoint will then read the doc-index.json file from two locations:

  • in this repository: lcc-server/lccserver/lcc-docs for LCC server specific doc markdown files. a static subdir contains any images, etc.
  • in the user's LCC server base directory for their specific documentation markdown files (like their original LC format spec, what the data contains, etc.). a static subdir contains any images, etc.

LCC server specific docs should include:

in this repository: lcc-server/lccserver/lcc-docs:
-> doc-index.json
-> lcformat.md
-> about.md
-> conesearch.md
-> columnsearch.md
-> ftsquery.md
-> xmatch.md
-> sqlsearch.md
-> datasets.md
-> api.md

The installation specific docs directory should include:

directory is same as the docspath kwarg provided to indexserver:
-> doc-index.json
-> whatever-else.md
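
A sketch of how the /docs endpoint could merge the two locations (the file layout follows the lists above; the key-merge behavior is an assumption):

import json
import os

def load_doc_index(server_docs_dir, basedir_docs_dir):
    merged = {}
    for docdir in (server_docs_dir, basedir_docs_dir):
        index_file = os.path.join(docdir, 'doc-index.json')
        if os.path.exists(index_file):
            with open(index_file) as infd:
                # the basedir docs are loaded second so they can override
                # server docs if the same key appears in both indexes
                merged.update(json.load(infd))
    return merged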

we shouldn't have to open an sqlitecurve to get the objectid out of it when converting to CSV

should figure out some way to collect CSV LCs without hitting sqlitecurves heavily. currently, convert_to_csvlc needs to open the sqlitecurve to get the objectid out, which it then uses to form the output CSV LC filename, which is then checked to see if it already exists. only at that point do we return, but most of the time has already been lost in opening the sqlitecurve. need to get rid of this.
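
One possible approach (purely an assumption about the filename convention, not the repository's actual fix) is to derive the object ID from the sqlitecurve filename and only open the file when the cached CSV is missing:

import os

def cached_csvlc_path(sqlitecurve_path, csvlc_dir):
    # hypothetical convention: '<objectid>-<rest>.sqlite.gz'
    objectid = os.path.basename(sqlitecurve_path).split('-')[0]
    csvlc_path = os.path.join(csvlc_dir, '%s-csvlc.gz' % objectid)
    return csvlc_path if os.path.exists(csvlc_path) else None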

how to solve the per-collection vs common column dilemma

we'll just swap out the columns whenever collections are selected. the column listings in the search controls will always contain the most appropriate values:

  • for all collections, only the columns common to all collections will be searchable
  • for a single collection, we'll load and show its columns in the column list widgets
  • for any other combination, we'll take intersections of the column lists of the selected collections

This should effectively allow seamless searching across collections.
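
A minimal sketch of the intersection rule described above:

def searchable_columns(selected_collections, columns_per_collection):
    # columns_per_collection maps a collection name to its list of columns
    column_sets = [set(columns_per_collection[coll])
                   for coll in selected_collections]
    if not column_sets:
        return []
    # only columns present in every selected collection are searchable
    return sorted(set.intersection(*column_sets))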

add postgres support

This will involve:

  • separating dbsearch.py into two modules: sqlite.py and postgres.py
  • adding in support for kwargs for indexserver to pick up which backend to use
  • adding support for full-text search in postgres (it doesn't have BM25, but maybe pg_trgm ops on a GIN index on the appropriate FTS columns is enough? we could also implement BM25 in Python and rerank after results come back)
  • searchserver_handlers will have to do something like from ..backend import sqlite, postgres instead of from ..backend import dbsearch (see the sketch after this list)
  • decide what to do with the lcc-datasets.sqlite and lcc-index.sqlite databases we use (I think we can keep them around, no need to shove everything into postgres)
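
A sketch of how the backend-selection kwarg could work (the module paths follow the split proposed above and are assumptions):

def get_search_backend(backend='sqlite'):
    # indexserver would pass this kwarg down to the search handlers
    if backend == 'postgres':
        from lccserver.backend import postgres as dbsearch
    else:
        from lccserver.backend import sqlite as dbsearch
    return dbsearch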

The translation between sqlite -> postgres:

  • separate collections go into separate tables instead of separate databases
  • we can now join across tables (if we ever wanted to)

performance testing and improvement thoughts

This is mostly to investigate if uvloop is actually as awesome as claimed.

Python 3.7 on OSX 10.13, MacBook Pro 2015 with i7 4-core/8-thread at 2.2 GHz and 16 GB of RAM

pip freeze:

astrobase==0.3.16
astropy==3.0.3
bleach==2.1.3
certifi==2018.4.16
chardet==3.0.4
cycler==0.10.0
html5lib==1.0.1
idna==2.7
itsdangerous==0.24
jplephem==2.8
kiwisolver==1.0.1
lcc-server@d603197b94111a1c8e16167381017eac1e8c9e0d
Markdown==2.6.11
matplotlib==2.2.2
numpy==1.15.0
passlib==1.7.1
Pillow==5.2.0
psutil==5.4.6
pyeebls==0.1.6
Pygments==2.2.0
pyparsing==2.2.0
python-dateutil==2.7.3
pytz==2018.5
requests==2.19.1
scikit-learn==0.19.2
scipy==1.1.0
six==1.11.0
tornado==5.1
tqdm==4.23.4
urllib3==1.23
uvloop==0.11.0
webencodings==0.5.1

Running indexserver with the default 4 background workers, with logging set to a file instead of stdout.

Before uvloop:

@nerrivik:~/scratch
[23:49]$ wrk -t12 -c400 -d30s http://127.0.0.1:12500/api/datasets
Running 30s test @ http://127.0.0.1:12500/api/datasets
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   775.47ms  154.82ms 962.30ms   86.16%
    Req/Sec    44.35     38.09   240.00     81.64%
  14145 requests in 30.10s, 370.63MB read
  Socket errors: connect 0, read 441, write 0, timeout 0
Requests/sec:    469.89
Transfer/sec:     12.31MB

After uvloop:

[23:50]$ wrk -t12 -c400 -d30s http://127.0.0.1:12500/api/datasets
Running 30s test @ http://127.0.0.1:12500/api/datasets
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   713.53ms  145.67ms 989.02ms   85.81%
    Req/Sec    52.22     45.63   250.00     77.57%
  15316 requests in 30.08s, 401.31MB read
  Socket errors: connect 0, read 464, write 0, timeout 0
Requests/sec:    509.11
Transfer/sec:     13.34MB

Slight improvement I guess. Will continue to investigate.
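
For reference, enabling uvloop in an asyncio-based Tornado application typically looks like the following (a generic sketch, not necessarily how indexserver wires it up internally):

import asyncio
import uvloop

# install uvloop's event loop policy before the Tornado IOLoop is created
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())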

add a minimal test suite for the frontend and backend

we need a test suite to exercise the server and all the backend search functions. can probably set up a tiny sqlite db in the test module which the tests can then run against

this won't be hugely complicated (i.e. no selenium yet):

  • make a lclist-augcat.pkl of 150 objects randomly chosen from the HN Kepler and HS 579 data and their sqlitecurves
  • split these into 3 collections to test cross-collection searching
  • run the whole machinery from generating the databases to the backend search functions, to generating datasets, to hitting the API endpoints with requests.

We'll need:

  • test_abcat.py: for testing DB creation
  • test_dbsearch.py for testing individual search functions
  • test_datasets.py for testing:
    • preparing dataset
    • new dataset
    • do LC zip
    • generate data table rows
    • generate data CSV
  • test_indexserver.py for testing the API:
    • do conesearch, columnsearch, ftsquery, xmatch
    • list datasets, list collections
    • get dataset
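
A minimal sketch of what one of the test_indexserver.py API checks could look like (assumes a test indexserver instance is already running on the port used in the benchmarks above):

import requests

BASE_URL = 'http://127.0.0.1:12500'

def test_list_datasets():
    # the datasets listing endpoint should respond with HTTP 200
    resp = requests.get('%s/api/datasets' % BASE_URL, timeout=10)
    assert resp.status_code == 200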

add API key and session secret handling

we'll use the following (to make sure we don't have to store passwords or session IDs or other nonsense)

  • to enable saving private datasets and get an API key, the user must give us an email address
  • we'll generate a short-lived token (10 minutes) using their IP address, email address, and the services they want access to

the token is generated like so:

  • serialize the IP address, email address, user-requested services, and API version to a JSON string
  • encrypt the serialized string using cryptography.Fernet with an expiry of 10 minutes
  • sign the encrypted string using the itsdangerous.URLSafeSerializer with a 'salt' of email verification
  • async send email to the user with a generated link

when the user hits the generated link:

  • check signature with expiry date using itsdangerous.URLSafeSerializer
  • decrypt the string with the serialized JSON representation of the user's info
  • JSON load the string

if all succeeds above, then we can actually grant them access and generate a long-lived authorization token (1 year until renewed) using the email address and services they want access to. this is shown only on an HTTPS page on our server and we'll prompt the user to save it.

from this access token, we'll generate a new API key that expires in 60 days and a session cookie that also expires in 60 days. save the session cookie and prompt the user to remember the API key. all datasets created by this user from this point on will include the encrypted serialized authorization token (only on our server, we will strip them when we send out the pickles via download)

  • figure out how renewal and expiry for API keys will work
  • figure out how we deal with the auth token itself expiring
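
A sketch of the two-layer token scheme described above (key handling and payload fields are assumptions; this is not the server's actual implementation):

import json
from cryptography.fernet import Fernet
from itsdangerous import URLSafeSerializer

FERNET_KEY = Fernet.generate_key()             # would be persisted server-side
SIGNING_SECRET = 'replace-with-a-real-secret'  # likewise persisted

def make_verification_token(ip_address, email, services, api_version):
    # serialize the user's info to JSON, encrypt it, then sign it
    payload = json.dumps({'ip': ip_address,
                          'email': email,
                          'services': services,
                          'apiversion': api_version})
    encrypted = Fernet(FERNET_KEY).encrypt(payload.encode('utf-8'))
    signer = URLSafeSerializer(SIGNING_SECRET, salt='email verification')
    return signer.dumps(encrypted.decode('utf-8'))

def check_verification_token(token, max_age=600):
    # verify the signature, then decrypt; Fernet's ttl argument enforces
    # the 10-minute expiry at decryption time
    signer = URLSafeSerializer(SIGNING_SECRET, salt='email verification')
    encrypted = signer.loads(token).encode('utf-8')
    payload = Fernet(FERNET_KEY).decrypt(encrypted, ttl=max_age)
    return json.loads(payload)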

clean up duplicated code

the different query types all use identical code to handle the busy work of parsing data on the frontend and dealing with async queries on the backend. this isn't optimal, so we should consolidate all the common stuff into a couple of functions so we don't have to repeat ourselves.

also, the backend XMatchHandler implements an API key verifier that should be moved into a BaseHandler we can inherit from. this will enable all other RequestHandlers to do the same and will be super useful.
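
A sketch of the proposed BaseHandler refactor (the key lookup here is simplified to a plain in-memory set for illustration):

import tornado.web

class BaseHandler(tornado.web.RequestHandler):

    def initialize(self, valid_apikeys=None):
        # in the real server, keys would be verified cryptographically
        # rather than looked up in a plain set
        self.valid_apikeys = valid_apikeys or set()

    def verify_apikey(self):
        apikey = self.request.headers.get('Authorization')
        return apikey is not None and apikey in self.valid_apikeys

class XMatchHandler(BaseHandler):

    def post(self):
        # every handler inheriting from BaseHandler gets the same check
        if not self.verify_apikey():
            self.set_status(401)
            self.finish({'status': 'failed', 'message': 'API key invalid'})
            return
        # ... handle the cross-match query here ...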

faster LC collection by using caching and other backend fixes

Here is the plan of attack:

  • Cache the dataset header and data rows in JSON using the datasets directory. We'll use the naming scheme dataset-[setid]-header.json and dataset-[setid]-datarows-strformat.json, and generate these the first time the dataset is accessed. All subsequent hits to the dataset will look for these files, load them, and send them back instead.
  • Cache the light curves ZIP. See the scheme below.
  • Make sure any query that returns more than 20,000 rows never produces a light curve ZIP.
  • Always make the dataset complete when the header and datarows are ready. Light curve zipping then becomes an async operation carried out on the dataset page.

Caching light curve ZIPs:

  • Let's cache based on the list of LCs returned instead of the search args because those can be in any order. When we get a list of LCs from all collections at the end of sqlite_new_dataset, alpha-sort it, then sha256 it. write this along with the info of the dataset to the lcc-datasets.sqlite database. Put an index on this column.
  • When an LC ZIP is requested for a list of LCs, make_dataset_lczip should regen this cache key and check if a lightcurve ZIP with the matching cache key exists. If it does, then symlink that file to lightcurves-[setid].zip using the current setid and return that instead of going through all the collection bits over again.
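
A sketch of the ZIP cache key described above: alpha-sort the list of light curve files and hash the result, so identical result sets share one ZIP regardless of search-argument order:

import hashlib

def lczip_cache_key(lcfile_list):
    # the order of the input list doesn't matter; identical sets of LCs
    # always produce the same key
    sorted_lcs = '\n'.join(sorted(lcfile_list))
    return hashlib.sha256(sorted_lcs.encode('utf-8')).hexdigest()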

add a lcc manager command line utility

This will be lcc-server. Using the magical psutil module, it will be able to:

  • see all indexserver instances (their current dir, bytes r/w on disk, network connections)
  • kill indexserver instances
  • start indexserver instances (in specific basedirs, with automatic log file prefixes as needed)

Args:

  • status: shows status of all indexserver instances, instance_id
  • logs <instance id> -> gets logs from all instances by default
  • start <basedir or current dir by default> <other args pass directly to indexserver> -> need to figure out how to use subprocess and daemonize processes, etc.
  • stop <instance id | all>
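
A sketch of how the status command could find running indexserver instances with psutil (matching on the process command line is an assumption):

import psutil

def find_indexserver_instances():
    instances = []
    for proc in psutil.process_iter(['pid', 'cwd', 'cmdline']):
        cmdline = ' '.join(proc.info['cmdline'] or [])
        if 'indexserver' in cmdline:
            instances.append((proc.info['pid'], proc.info['cwd'], cmdline))
    return instances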

change collection ispublic and object ispublic fields to access_mask or similar

Right now we only have the following:

  • ispublic = 0 -> fully private, ispublic = 1 -> fully public for collections
  • same for objects

should change these integer fields to something like a bitmask for user groups and IDs, so:

  • ispublic -> access_mask
  • a user ID or group ID must be in the mask for the object to be accessible by that user or group
  • a mask value of 0 means the object is public

TODO: figure out how many users/groups this restricts us to. Maybe 64 for 64-bit integers? That's not enough for user IDs, but may be enough for user groups.

This may be useful: https://docs.python.org/3/library/enum.html#flag
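
A sketch of the access_mask idea using enum.Flag, as suggested above (the group names are placeholders):

from enum import Flag, auto

class AccessGroup(Flag):
    PUBLIC = 0            # a mask value of 0 means the object is public
    STAFF = auto()
    COLLABORATORS = auto()
    ARCHIVE_USERS = auto()

def is_accessible(object_mask, user_groups):
    # public objects are visible to everyone; otherwise the user must be
    # in at least one group present in the object's mask
    return object_mask == AccessGroup.PUBLIC or bool(object_mask & user_groups)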

add composite columns in objectinfo-catalog.sqlite (e.g. sdssr-jmag)

We need to add these so people can select on color, etc.

Probably need to mess with abcat, abcat_columns, and generate these automatically from the specs in abcat_columns in abcat.sqlite_make_objectinfo_catalog if both columns are present in the lclist augmented catalog pickle.

will probably need to regen the hatnet_keplerfield and hatsouth_hs579 objectinfo catalog sqlite files
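
A sketch of generating one composite color column from two existing magnitude columns (assumes both columns are present in the augmented catalog, as the issue requires):

import numpy as np

def composite_color(augcat_columns, col_a='sdssr', col_b='jmag'):
    # missing values stay NaN so downstream filters can skip them
    mag_a = np.asarray(augcat_columns[col_a], dtype=float)
    mag_b = np.asarray(augcat_columns[col_b], dtype=float)
    return mag_a - mag_b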

what to do when we run into the sqlite3 attached DB limit

Our cunning scheme to allow many fully-independent databases may run into this roadblock. The limit can apparently be raised to 125 from 10, but this requires recompiling (which we could think about doing). Think about what to do here (maybe recommend postgres -- this means we should start implementing the postgres_ dbsearch functions soon).
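
On Python 3.11+ the compiled-in limit can be checked directly (a quick sketch); on the older Pythons targeted here, the practical check is simply whether ATTACH starts failing after the tenth database:

import sqlite3

conn = sqlite3.connect(':memory:')
# SQLITE_LIMIT_ATTACHED reports the attached-database limit of this build
print(conn.getlimit(sqlite3.SQLITE_LIMIT_ATTACHED))
conn.close()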

add a favicon

in /static/images:

  • convert the existing top icon to 32x32 PNG from SVG
  • add to top of base.html:
<link rel="icon" type="image/png" href="https://example.com/favicon.png">
