
lcc-server's Introduction


LCC-Server is a Python framework to serve collections of light curves. The code here forms the basis for the HAT data server. See the installation notes below for how to install and configure the server.

Features

LCC-Server includes the following functionality:

  • collection of light curves from various projects into a single output format (text CSV files)
  • HTTP API and an interactive frontend for searching over multiple light curve collections by:
    • spatial cone search near specified coordinates
    • full-text search on object names, descriptions, and tags, with name resolution using SIMBAD's SESAME resolver for individual objects as well as open clusters, nebulae, etc.
    • queries based on applying filters to database columns of object properties, e.g. object names, magnitudes, colors, proper motions, variability and object type tags, variability indices, etc.
    • cross-matching to uploaded object lists with object IDs and coordinates
  • HTTP API for asynchronously generating datasets from search results and an interactive frontend for browsing these, caching results from searches, and generating output ZIP bundles containing search results and all matching light curves
  • HTTP API and interactive frontend for detailed information per object, including light curve plots, external catalog info, and period-finding results plus phased LCs if available
  • Access controls for all generated datasets, and support for user sign-ins and sign-ups
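
For example, the dataset-listing endpoint of the HTTP API (the same one exercised in the benchmarks further down this page) can be queried with the requests package; the host and port below are assumptions for a locally running indexserver:

import requests

# list the datasets available on a locally running LCC-Server instance;
# adjust the host and port for your installation
resp = requests.get('http://127.0.0.1:12500/api/datasets', timeout=10)
resp.raise_for_status()
print(resp.status_code, len(resp.content), 'bytes received')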

Installation

NOTE: Python >= 3.6 is required. Use of a virtualenv is recommended; something like this will work well:

$ python3 -m venv lcc
$ source lcc/bin/activate

This package is available on PyPI. Install it with the virtualenv activated:

$ pip install numpy  # to set up Fortran bindings for dependencies
$ pip install lccserver  # add --pre to install unstable versions

To install the latest version from Github:

$ git clone https://github.com/waqasbhatti/lcc-server
$ cd lcc-server
$ pip install -e .

If you're on Linux or MacOS, you can optionally install the uvloop package to speed up some of the event-loop operations:

$ pip install uvloop

SQLite requirement

The LCC-Server relies on the system SQLite library being new enough to include the fts5 full-text search module. On some older Enterprise Linux systems, this isn't the case. To get the LCC-Server and its tests running on these systems, you'll have to install a newer version of the SQLite amalgamation. I recommend downloading the autoconf tarball so it's easy to install; e.g. for SQLite 3.27.2, use this file: sqlite-autoconf-3270200.tar.gz.

To install at the default location /usr/local/lib:

$ tar xvf sqlite-autoconf-3270200.tar.gz
$ cd sqlite-autoconf-3270200
$ ./configure
$ make
$ sudo make install

Then, override the default location that Python uses for its SQLite library using LD_LIBRARY_PATH:

$ export LD_LIBRARY_PATH='/usr/local/lib'

# create a virtualenv using Python 3
# here I've installed Python 3.7 to /opt/python37
$ /opt/python37/bin/python3 -m venv env

# activate the virtualenv, launch Python, and check if we've got a newer SQLite
$ source env/bin/activate
(env) $ python3
Python 3.7.0 (default, Jun 28 2018, 15:17:26)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
>>> sqlite3.sqlite_version
'3.27.2'

You can then run the LCC-Server using this virtualenv. You can use an Environment directive in the systemd service files to add in the LD_LIBRARY_PATH override before launching the server.
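
To confirm that the interpreter is actually picking up an fts5-capable SQLite build, a quick check like the following (a minimal sketch) can be run inside the virtualenv:

import sqlite3

# a minimal check that the loaded SQLite build includes the fts5 module;
# if this raises sqlite3.OperationalError, the override didn't take effect
conn = sqlite3.connect(':memory:')
try:
    conn.execute('CREATE VIRTUAL TABLE ftscheck USING fts5(content)')
    print('fts5 is available')
except sqlite3.OperationalError:
    print('fts5 is NOT available; check LD_LIBRARY_PATH and the SQLite build')
finally:
    conn.close()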

Using the server

Some post-installation setup is required to begin serving light curves. In particular, you will need to set up a base directory for the LCC-Server to work from, along with various sub-directories.

To make this process easier, there's an interactive CLI available when you install LCC-Server. This will be in your $PATH as lcc-server.

A Jupyter notebook walkthrough using this CLI to stand up an LCC-Server instance, with example light curves, can be found in the astrobase-notebooks repo: lcc-server-setup.ipynb (Jupyter nbviewer).

Documentation

Server docs are automatically generated from the server-docs directory in the git repository. Sphinx-based documentation for the Python modules is on the TODO list and will be linked here when done.

Changelog

Please see: https://github.com/waqasbhatti/lcc-server/blob/master/CHANGELOG.md for a list of changes applicable to tagged release versions.

Screenshots

The search interface

LCC server search interface

Datasets from search results

LCC server results display

Per-object information

LCC server object info

License

LCC-Server is provided under the MIT License. See the LICENSE file for the full text.

lcc-server's People

Contributors

dependabot[bot], waqasbhatti


lcc-server's Issues

make the table div in the dataset view height = window.height

this is so we always have the horizontal scroll bar at the bottom of the window. should also add an event listener on the window.resize event so we can always keep the table's height = window.height

also need to figure out the lingering table width issue. look up how I solved this for LC server v1.

site-specific settings from site.json in basedir, add lcc-server info footer to pages

This will allow us to add site specific stuff like:

  • documentation for the collections available (/docs/collections)
  • documentation for the project running this LCC server (/docs/project)
  • citations required when downloading and using the light curves (/docs/cite)

All of this stuff will go into the footer, which will look something like:

[lcc-server -> /docs/lcc-server] [github]         [operating project name and link]
                                                  [light curve collection docs]
                                                  [citing data]

add ability to make individual objects private/public

this requires an integer column in objectinfo-catalog.sqlite called object_is_public. This column will not be returned by the dbsearch.py functions, but an optional kwarg for these functions, require_object_ispublic, can be used to enforce the rule that only objects marked public are returned.
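
A minimal sketch of how such a kwarg could gate results (the table and column layout here are placeholders, not the actual dbsearch.py code):

import sqlite3

def fetch_public_objects(dbpath, require_object_ispublic=True):
    # object_catalog and its columns are hypothetical names for illustration
    query = 'SELECT objectid, ra, decl FROM object_catalog'
    if require_object_ispublic:
        # enforce the rule: only objects marked public are returned
        query = query + ' WHERE object_is_public = 1'
    with sqlite3.connect(dbpath) as conn:
        return conn.execute(query).fetchall()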

separate out docs handling into internal lcc-server docs vs basedir docs

we'll do this because we want to separate out internal docs for LCC server processes from any collection- or installation-specific docs people put in the lcc-server-basedir. This way we can also version control useful stuff like the lcformat and API docs.

the /docs endpoint will then read the doc-index.json file from two locations:

  • in this repository: lcc-server/lccserver/lcc-docs for LCC server specific doc markdown files. a static subdir contains any images, etc.
  • in the user's LCC server base directory for their specific documentation markdown files (like their original LC format spec, what the data contains, etc.). a static subdir contains any images, etc.

LCC server specific docs should include:

in this repository: lcc-server/lccserver/lcc-docs:
-> doc-index.json
-> lcformat.md
-> about.md
-> conesearch.md
-> columnsearch.md
-> ftsquery.md
-> xmatch.md
-> sqlsearch.md
-> datasets.md
-> api.md

The installation specific docs directory should include:

directory is same as the docspath kwarg provided to indexserver:
-> doc-index.json
-> whatever-else.md
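
A sketch of how the /docs endpoint could merge the two locations (the file layout follows the lists above; the key-merge behavior is an assumption):

import json
import os

def load_doc_index(server_docs_dir, basedir_docs_dir):
    merged = {}
    for docdir in (server_docs_dir, basedir_docs_dir):
        index_file = os.path.join(docdir, 'doc-index.json')
        if os.path.exists(index_file):
            with open(index_file) as infd:
                # the basedir docs are loaded second so they can override
                # server docs if the same key appears in both indexes
                merged.update(json.load(infd))
    return merged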

we shouldn't have to open an sqlitecurve to get the objectid out of it when converting to CSV

should figure out some way to collect CSV LCs without hitting sqlitecurves heavily. currently, convert_to_csvlc needs to open the sqlitecurve to get the objectid out, which it then uses to form the output CSV LC filename, which is then checked to see if it already exists. only at that point do we return, but most of the time has already been lost in opening the sqlitecurve. need to get rid of this.
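
One possible approach (purely an assumption about the filename convention, not the repository's actual fix) is to derive the object ID from the sqlitecurve filename and only open the file when the cached CSV is missing:

import os

def cached_csvlc_path(sqlitecurve_path, csvlc_dir):
    # hypothetical convention: '<objectid>-<rest>.sqlite.gz'
    objectid = os.path.basename(sqlitecurve_path).split('-')[0]
    csvlc_path = os.path.join(csvlc_dir, '%s-csvlc.gz' % objectid)
    return csvlc_path if os.path.exists(csvlc_path) else None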

how to solve the per-collection vs common column dilemma

we'll just swap out the columns whenever collections are selected. the column listings in the search controls will always contain the most appropriate values:

  • for all collections, only the columns common to all collections will be searchable
  • for a single collection, we'll load and show its columns in the column list widgets
  • for any other combination, we'll take intersections of the column lists of the selected collections

This should effectively allow seamless searching across collections.
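
A minimal sketch of the intersection rule described above:

def searchable_columns(selected_collections, columns_per_collection):
    # columns_per_collection maps a collection name to its list of columns
    column_sets = [set(columns_per_collection[coll])
                   for coll in selected_collections]
    if not column_sets:
        return []
    # only columns present in every selected collection are searchable
    return sorted(set.intersection(*column_sets))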

add postgres support

This will involve:

  • separating dbsearch.py into two modules: sqlite.py and postgres.py
  • adding in support for kwargs for indexserver to pick up which backend to use
  • adding support for full-text search in postgres (it doesn't have BM25, but maybe pg_trgm ops on a GIN index on the appropriate FTS columns is enough? we could also implement BM25 in Python and rerank after results come back)
  • searchserver_handlers will have to do something like from ..backend import sqlite, postgres instead of from ..backend import dbsearch (see the sketch after this list)
  • decide what to do with the lcc-datasets.sqlite and lcc-index.sqlite databases we use (I think we can keep them around, no need to shove everything into postgres)
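
A sketch of how the backend-selection kwarg could work (the module paths follow the split proposed above and are assumptions):

def get_search_backend(backend='sqlite'):
    # indexserver would pass this kwarg down to the search handlers
    if backend == 'postgres':
        from lccserver.backend import postgres as dbsearch
    else:
        from lccserver.backend import sqlite as dbsearch
    return dbsearch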

The translation between sqlite -> postgres:

  • separate collections go into separate tables instead of separate databases
  • we can now join across tables (if we ever wanted to)

performance testing and improvement thoughts

This is mostly to investigate if uvloop is actually as awesome as claimed.

Python 3.7 on OSX 10.13, MacBook Pro 2015 with i7 4-core/8-thread at 2.2 GHz and 16 GB of RAM

pip freeze:

astrobase==0.3.16
astropy==3.0.3
bleach==2.1.3
certifi==2018.4.16
chardet==3.0.4
cycler==0.10.0
html5lib==1.0.1
idna==2.7
itsdangerous==0.24
jplephem==2.8
kiwisolver==1.0.1
lcc-server@d603197b94111a1c8e16167381017eac1e8c9e0d
Markdown==2.6.11
matplotlib==2.2.2
numpy==1.15.0
passlib==1.7.1
Pillow==5.2.0
psutil==5.4.6
pyeebls==0.1.6
Pygments==2.2.0
pyparsing==2.2.0
python-dateutil==2.7.3
pytz==2018.5
requests==2.19.1
scikit-learn==0.19.2
scipy==1.1.0
six==1.11.0
tornado==5.1
tqdm==4.23.4
urllib3==1.23
uvloop==0.11.0
webencodings==0.5.1

Running indexserver with the default 4 background workers, with logging set to a file instead of stdout.

Before uvloop:

@nerrivik:~/scratch
[23:49]$ wrk -t12 -c400 -d30s http://127.0.0.1:12500/api/datasets
Running 30s test @ http://127.0.0.1:12500/api/datasets
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   775.47ms  154.82ms 962.30ms   86.16%
    Req/Sec    44.35     38.09   240.00     81.64%
  14145 requests in 30.10s, 370.63MB read
  Socket errors: connect 0, read 441, write 0, timeout 0
Requests/sec:    469.89
Transfer/sec:     12.31MB

After uvloop:

[23:50]$ wrk -t12 -c400 -d30s http://127.0.0.1:12500/api/datasets
Running 30s test @ http://127.0.0.1:12500/api/datasets
  12 threads and 400 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   713.53ms  145.67ms 989.02ms   85.81%
    Req/Sec    52.22     45.63   250.00     77.57%
  15316 requests in 30.08s, 401.31MB read
  Socket errors: connect 0, read 464, write 0, timeout 0
Requests/sec:    509.11
Transfer/sec:     13.34MB

Slight improvement I guess. Will continue to investigate.
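
For reference, enabling uvloop in an asyncio-based Tornado application typically looks like the following (a generic sketch, not necessarily how indexserver wires it up internally):

import asyncio
import uvloop

# install uvloop's event loop policy before the Tornado IOLoop is created
asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())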

add a minimal test suite for the frontend and backend

we need a test suite to exercise the server and all the backend search functions. can probably set up a tiny sqlite db in the test module which the tests can then run against

this won't be hugely complicated (i.e. no selenium yet):

  • make a lclist-augcat.pkl of 150 objects randomly chosen from the HN Kepler and HS 579 data and their sqlitecurves
  • split these into 3 collections to test cross-collection searching
  • run the whole machinery from generating the databases to the backend search functions, to generating datasets, to hitting the API endpoints with requests.

We'll need:

  • test_abcat.py: for testing DB creation
  • test_dbsearch.py for testing individual search functions
  • test_datasets.py for testing:
    • preparing dataset
    • new dataset
    • do LC zip
    • generate data table rows
    • generate data CSV
  • test_indexserver.py for testing the API:
    • do conesearch, columnsearch, ftsquery, xmatch
    • list datasets, list collections
    • get dataset
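
A minimal sketch of what one of the test_indexserver.py API checks could look like (assumes a test indexserver instance is already running on the port used in the benchmarks above):

import requests

BASE_URL = 'http://127.0.0.1:12500'

def test_list_datasets():
    # the datasets listing endpoint should respond with HTTP 200
    resp = requests.get('%s/api/datasets' % BASE_URL, timeout=10)
    assert resp.status_code == 200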

add API key and session secret handling

we'll use the following (to make sure we don't have to store passwords or session IDs or other nonsense)

  • to enable saving private datasets and get an API key, the user must give us an email address
  • we'll generate a short-lived token (10 minutes) using their IP address, email address, and the services they want access to

the token is generated like so:

  • serialize the IP address, email address, user-requested services, and API version to a JSON string
  • encrypt the serialized string using cryptography.Fernet with an expiry of 10 minutes
  • sign the encrypted string using the itsdangerous.URLSafeSerializer with a 'salt' of email verification
  • async send email to the user with a generated link

when the user hits the generated link:

  • check signature with expiry date using itsdangerous.URLSafeSerializer
  • decrypt the string with the serialized JSON representation of the user's info
  • JSON load the string

if all succeeds above, then we can actually grant them access and generate a long-lived authorization token (1 year until renewed) using the email address and services they want access to. this is shown only on an HTTPS page on our server and we'll prompt the user to save it.

from this access token, we'll generate a new API key that expires in 60 days and a session cookie that also expires in 60 days. save the session cookie and prompt the user to remember the API key. all datasets created by this user from this point on will include the encrypted serialized authorization token (only on our server, we will strip them when we send out the pickles via download)

  • figure out how renewal and expiry for API keys will work
  • figure out how we deal with the auth token itself expiring
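
A sketch of the two-layer token scheme described above (key handling and payload fields are assumptions; this is not the server's actual implementation):

import json
from cryptography.fernet import Fernet
from itsdangerous import URLSafeSerializer

FERNET_KEY = Fernet.generate_key()             # would be persisted server-side
SIGNING_SECRET = 'replace-with-a-real-secret'  # likewise persisted

def make_verification_token(ip_address, email, services, api_version):
    # serialize the user's info to JSON, encrypt it, then sign it
    payload = json.dumps({'ip': ip_address,
                          'email': email,
                          'services': services,
                          'apiversion': api_version})
    encrypted = Fernet(FERNET_KEY).encrypt(payload.encode('utf-8'))
    signer = URLSafeSerializer(SIGNING_SECRET, salt='email verification')
    return signer.dumps(encrypted.decode('utf-8'))

def check_verification_token(token, max_age=600):
    # verify the signature, then decrypt; Fernet's ttl argument enforces
    # the 10-minute expiry at decryption time
    signer = URLSafeSerializer(SIGNING_SECRET, salt='email verification')
    encrypted = signer.loads(token).encode('utf-8')
    payload = Fernet(FERNET_KEY).decrypt(encrypted, ttl=max_age)
    return json.loads(payload)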

clean up duplicated code

the different query types all use identical code to handle the busy work of parsing data on the frontend and dealing with async queries on the backend. this isn't optimal, so we should consolidate all the common stuff into a couple of functions so we don't have to repeat ourselves.

also, the backend XMatchHandler implements an API key verifier that should be moved into a BaseHandler we can inherit from. this will enable all other RequestHandlers to do the same and will be super useful.
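
A sketch of the proposed BaseHandler refactor (the key lookup here is simplified to a plain in-memory set for illustration):

import tornado.web

class BaseHandler(tornado.web.RequestHandler):

    def initialize(self, valid_apikeys=None):
        # in the real server, keys would be verified cryptographically
        # rather than looked up in a plain set
        self.valid_apikeys = valid_apikeys or set()

    def verify_apikey(self):
        apikey = self.request.headers.get('Authorization')
        return apikey is not None and apikey in self.valid_apikeys

class XMatchHandler(BaseHandler):

    def post(self):
        # every handler inheriting from BaseHandler gets the same check
        if not self.verify_apikey():
            self.set_status(401)
            self.finish({'status': 'failed', 'message': 'API key invalid'})
            return
        # ... handle the cross-match query here ...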

faster LC collection by using caching and other backend fixes

Here is the plan of attack:

  • Cache the dataset header and data rows in JSON using the datasets directory. We'll use the naming scheme dataset-[setid]-header.json and dataset-[setid]-datarows-strformat.json, and generate these the first time the dataset is accessed. All subsequent hits to the dataset will look for these files, load them, and send them back instead.
  • Cache the light curves ZIP. See the scheme below.
  • Make sure any query that returns more than 20,000 rows never produces a light curve ZIP.
  • Always make the dataset complete when the header and datarows are ready. Light curve zipping then becomes an async operation carried out on the dataset page.

Caching light curve ZIPs:

  • Let's cache based on the list of LCs returned instead of the search args because those can be in any order. When we get a list of LCs from all collections at the end of sqlite_new_dataset, alpha-sort it, then sha256 it. write this along with the info of the dataset to the lcc-datasets.sqlite database. Put an index on this column.
  • When an LC ZIP is requested for a list of LCs, make_dataset_lczip should regen this cache key and check if a lightcurve ZIP with the matching cache key exists. If it does, then symlink that file to lightcurves-[setid].zip using the current setid and return that instead of going through all the collection bits over again.
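
A sketch of the ZIP cache key described above: alpha-sort the list of light curve files and hash the result, so identical result sets share one ZIP regardless of search-argument order:

import hashlib

def lczip_cache_key(lcfile_list):
    # the order of the input list doesn't matter; identical sets of LCs
    # always produce the same key
    sorted_lcs = '\n'.join(sorted(lcfile_list))
    return hashlib.sha256(sorted_lcs.encode('utf-8')).hexdigest()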

add a lcc manager command line utility

This will be lcc-server. Using the magical psutil module, it will be able to:

  • see all indexserver instances (their current dir, bytes r/w on disk, network connections)
  • kill indexserver instances
  • start indexserver instances (in specific basedirs, with automatic log file prefixes as needed)

Args:

  • status: shows status of all indexserver instances, instance_id
  • logs <instance id> -> gets logs from all instances by default
  • start <basedir or current dir by default> <other args pass directly to indexserver> -> need to figure out how to use subprocess and daemonize processes, etc.
  • stop <instance id | all>
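
A sketch of how the status command could find running indexserver instances with psutil (matching on the process command line is an assumption):

import psutil

def find_indexserver_instances():
    instances = []
    for proc in psutil.process_iter(['pid', 'cwd', 'cmdline']):
        cmdline = ' '.join(proc.info['cmdline'] or [])
        if 'indexserver' in cmdline:
            instances.append((proc.info['pid'], proc.info['cwd'], cmdline))
    return instances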

change collection ispublic and object ispublic fields to access_mask or similar

Right now we only have the following:

  • ispublic = 0 -> fully private, ispublic = 1 -> fully public for collections
  • same for objects

should change these integer fields to something like a bitmask for user groups and IDs, so:

  • ispublic -> access_mask
  • a user ID or group ID must be in the mask for the object to be accessible by that user or group
  • a mask value of 0 means the object is public

TODO: figure out how many users/groups this restricts us to. Maybe 64 for 64-bit integers? That's not enough for user IDs, but may be enough for user groups.

This may be useful: https://docs.python.org/3/library/enum.html#flag
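
A sketch of the access_mask idea using enum.Flag, as suggested above (the group names are placeholders):

from enum import Flag, auto

class AccessGroup(Flag):
    PUBLIC = 0            # a mask value of 0 means the object is public
    STAFF = auto()
    COLLABORATORS = auto()
    ARCHIVE_USERS = auto()

def is_accessible(object_mask, user_groups):
    # public objects are visible to everyone; otherwise the user must be
    # in at least one group present in the object's mask
    return object_mask == AccessGroup.PUBLIC or bool(object_mask & user_groups)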

add composite columns in objectinfo-catalog.sqlite (e.g. sdssr-jmag)

We need to add these so people can select on color, etc.

Probably need to mess with abcat, abcat_columns, and generate these automatically from the specs in abcat_columns in abcat.sqlite_make_objectinfo_catalog if both columns are present in the lclist augmented catalog pickle.

will probably need to regen the hatnet_keplerfield and hatsouth_hs579 objectinfo catalog sqlite files
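
A sketch of generating one composite color column from two existing magnitude columns (assumes both columns are present in the augmented catalog, as the issue requires):

import numpy as np

def composite_color(augcat_columns, col_a='sdssr', col_b='jmag'):
    # missing values stay NaN so downstream filters can skip them
    mag_a = np.asarray(augcat_columns[col_a], dtype=float)
    mag_b = np.asarray(augcat_columns[col_b], dtype=float)
    return mag_a - mag_b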

what to do when we run into the sqlite3 attached DB limit

Our cunning scheme to allow many fully-independent databases may run into this roadblock. The limit can apparently be raised to 125 from 10, but this requires recompiling (which we could think about doing). Think about what to do here (maybe recommend postgres -- this means we should start implementing the postgres_ dbsearch functions soon).
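
On Python 3.11+ the compiled-in limit can be checked directly (a quick sketch); on the older Pythons targeted here, the practical check is simply whether ATTACH starts failing after the tenth database:

import sqlite3

conn = sqlite3.connect(':memory:')
# SQLITE_LIMIT_ATTACHED reports the attached-database limit of this build
print(conn.getlimit(sqlite3.SQLITE_LIMIT_ATTACHED))
conn.close()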

add a favicon

in /static/images:

  • convert the existing top icon to 32x32 PNG from SVG
  • add to top of base.html:
<link rel="icon" type="image/png" href="https://example.com/favicon.png">
