
ebop_maven's Introduction

EBOP Model Automatic input Value Estimation Neural network

A machine learning model for estimating input values for the characterization of detached eclipsing binary stars by JKTEBOP.

Included in this repository are the codes for generating the training & test datasets, for building, training & testing the machine learning model, and for other supporting functionality still to be decided.

Warning

This is a work in progress. Handle with care.

Installation

This code base was developed within the context of an Anaconda 3 conda environment named ebop_maven. This environment supports Python 3.9+, TensorFlow, Keras, lightkurve, astropy and any further libraries upon which the code is dependent. To set up the ebop_maven conda environment, having first cloned this GitHub repo, open a Terminal, navigate to this local directory and run the following command:

$ conda env create -f environment.yaml

You will need to activate the environment whenever you wish to run any of these modules. Use the following command:

$ conda activate ebop_maven

JKTEBOP

These codes have a dependency on the JKTEBOP tool for generating and fitting lightcurves. The installation media and build instructions can be found here. The JKTEBOP_DIR environment variable is used by ebop_maven to locate the executable at runtime and is set to ~/jktebop/ in the ebop_maven conda env. This may require updating to match the location where JKTEBOP has been set up.
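
For illustration, a minimal sketch of how code might resolve the executable from this environment variable (this is an assumption, not the actual ebop_maven lookup, and the executable file name is hypothetical):

    import os
    from pathlib import Path

    # Resolve the JKTEBOP executable from the JKTEBOP_DIR environment variable,
    # falling back to the default ~/jktebop/ location set in the conda env.
    jktebop_dir = Path(os.environ.get("JKTEBOP_DIR", "~/jktebop/")).expanduser()
    jktebop_exe = jktebop_dir / "jktebop"  # hypothetical executable file name
    if not jktebop_exe.exists():
        raise FileNotFoundError(f"JKTEBOP not found at {jktebop_exe}")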

Alternative: venv installation

If you prefer not to use a conda environment, the following venv setup works, although I haven't tested it as thoroughly. Again, from this directory, run the following to create and activate the .ebop_maven env:

$ python -m venv .ebop_maven

$ source .ebop_maven/bin/activate

Then run the following to set up the required packages within the environment:

$ pip install -r requirements.txt

You may need to install the jupyter kernel in the new venv:

$ ipython kernel install --user --name=.ebop_maven

The ebop_maven package

Finally, there is support for installing ebop_maven as a pip package; however, this is still very much a "work in progress" and subject to change. Simply run:

$ pip install git+https://github.com/SteveOv/ebop_maven

This will install the Estimator class, a pre-built default model and the required support libraries. The code used in the following steps for training and testing models is not installed.
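
As a rough, hedged illustration of the intended usage (the module path, constructor and predict() call shown here are assumptions; check the package source for the real API):

    import numpy as np
    from ebop_maven.estimator import Estimator  # module path is an assumption

    estimator = Estimator()                 # assumed to load the pre-built default model
    mags_feature = np.zeros((1, 4096, 1))   # placeholder: one folded, binned light curve
    predictions = estimator.predict(mags_feature)  # assumed predict() signature
    print(predictions)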

Usage

Generation of training and testing datasets

To generate the datasets which will be used to train and test the machine learning model, first run

$ python3 make_training_dataset.py

to generate the formal-training-dataset-250k. This is a synthetic training dataset built by randomly sampling distributions of JKTEBOP model parameters across its entire parameter space. It generates 250,000 instances split 80:20 between training and validation sets.

Next run

$ python3 make_synthetic_test_dataset.py

to build the synthetic-mist-tess-dataset. This is the full dataset of synthetic light-curves generated from physically plausible systems based on MIST stellar models and the TESS photometric bandpass. It generates 20,000 randomly oriented instances based on an initial random selection of metallicity, age and initial masses supplemented with lookups of stellar parameters in the isochrones.

This module depends on MIST isochrone files which are not distributed as part of this GitHub repo. You will need to download and extract a pre-built model grid by following the instructions in readme.txt.

Finally run

$ python3 make_formal_test_dataset.py

to build the formal-test-dataset. These are a set of real, well-characterized systems from DEBCAT selected on the availability of TESS lightcurves, suitability for fitting with JKTEBOP and a published characterization from which parameters can be taken. The chosen systems are configured in the file ./config/formal-test-dataset.json which contains the search criteria, labels and supplementary information for each.

These steps will take roughly one to two hours on a moderately powerful system, with the resulting datasets taking up ~10 GB of disk space under the ./datasets/ directory.

Training and testing the machine learning model

The default machine learning model can be built and tested by running the following:

$ python3 make_trained_cnn_model.py

This will create the default CNN/DNN model, trained and validated on the formal-training-dataset to predict the $r_A+r_B$, $k$, $J$, $e\cos{\omega}$, $e\sin{\omega}$ and $b_P$ labels. Once trained it is evaluated on the synthetic-mist-tess-dataset before a final evaluation on the real systems of the formal-test-dataset.
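
For reference, the eccentricity and argument of periastron can be recovered from the predicted $e\cos{\omega}$ and $e\sin{\omega}$ values via $e = \sqrt{(e\cos{\omega})^2 + (e\sin{\omega})^2}$ and $\omega = \operatorname{atan2}(e\sin{\omega},\, e\cos{\omega})$.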

By default CUDA cores are disabled so that training and testing are repeatable. In this configuration the process above takes about an hour and a half on my laptop with an 8 core 11th gen Intel i7 CPU. If you have them, CUDA cores can be enabled by setting the ENFORCE_REPEATABILITY const to False, giving a significant reduction in training time.
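
As a hedged sketch of the kind of TensorFlow configuration such a flag typically controls (the actual mechanism used in the training script may differ):

    import tensorflow as tf

    ENFORCE_REPEATABILITY = True  # set to False to allow CUDA cores to be used

    if ENFORCE_REPEATABILITY:
        # Hide any GPUs and pin the random seeds so results are repeatable run to run.
        tf.config.set_visible_devices([], "GPU")
        tf.keras.utils.set_random_seed(42)
        tf.config.experimental.enable_op_determinism()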

Note: there are recorded incidents where TensorFlow v2.16.1 does not "see" installed GPUs (me, for one) and under these circumstances the above change may have no effect.

The compiled and trained model will be saved to the ./drop/training/cnn-new-ext0-4096-0.75-250k/default-model.keras file. Plots of the learning curves and the model structure are written to the plots sub-directory.

A detailed evaluation of any models can be invoked with the following command:

$ python3 model_testing.py [model_files ...]

This will initially evaluate model predictions against the synthetic-mist-tess-dataset and the formal-test-dataset. Subsequently it will run the full end-to-end testing of model predictions and JKTEBOP fitting against the formal-test-dataset. Testing output files and a log file will be written to a testing sub-directory alongside any tested models.

You can test the pre-built model, at ./ebop_maven/data/estimator/default-model.keras, by running model_testing without any arguments. In this case, the results will be written to the ./drop/training/published/testing/ directory.

Warning

The model structure and hyperparameters are still subject to change as ongoing testing and model searches continue to reveal improvements.

Interactive model tester

This is a Jupyter notebook which can be used to download, predict and fit any target system selected from the formal-test-dataset, using either the pre-built default-model within the ./ebop_maven/data/estimator directory or any model found within ./drop/training/. It can be run with:

$ jupyter notebook model_interactive_tester.ipynb

Model structure and hyperparameter search

A search over a range of model structures and hyperparameter values, using the hyperopt library's tpe.suggest algorithm, can be run with the following command:

$ python3 model_search.py

Warning

This will take a long time! As in hours, if not days.


ebop_maven's Issues

better/more consistent support for uncertainties

Needs some work so that we propagate uncertainties from predictions/fitting through the calculations. We have a couple of options:

  • uncertainties package: a brief prototype with the orbital.orbital_inclination() function shows this is far from just a drop-in. It doesn't appear to play well with numpy trig functions (or at least arccos) or astropy units (see the sketch after this list)
  • astropy.uncertainty: I've never used this before so there may be a learning curve
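
For reference, a minimal sketch of the kind of workaround the uncertainties package itself offers, using its umath wrapper rather than the numpy function (the example value is arbitrary; this is not the orbital.orbital_inclination() code):

    from uncertainties import ufloat, umath

    b = ufloat(0.25, 0.02)   # example quantity with an uncertainty
    # numpy's arccos fails on uncertainties' Variable type, but the package's
    # own umath wrapper propagates the uncertainty through acos:
    inc_rad = umath.acos(b)
    print(inc_rad)           # value with propagated uncertainty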

investigate revised strategy for training/validation dataset

Currently I have the training/validation/testing data split in advance by supplying valid_ratio and test_ratio params when calling datasets.make_dataset_files(). Another option would be to leave these ratios at zero, then let TF do the training/validation split on the fly and use the MIST dataset as the test dataset. Final testing would be with the formal-test-dataset as it is now. This would make life simpler but I need to see how it would work out.
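
As a rough sketch of what the on-the-fly option could look like with tf.data (illustrative only, assuming a pre-shuffled dataset of known size; not the current datasets module code):

    import tensorflow as tf

    def split_train_valid(ds: tf.data.Dataset, total_count: int, valid_ratio: float = 0.2):
        """Split a pre-shuffled dataset into training and validation subsets on the fly."""
        valid_count = int(total_count * valid_ratio)
        return ds.skip(valid_count), ds.take(valid_count)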

compare results of mags+ext_features model vs mags only model

It would be good to move away from the mags+ext_features approach as getting the phiS and dS_over_dP values can be quite difficult as part of an input pipeline. If I can get good enough results without these features it will make future developments much easier.

Update StellarModels to resolve the pandas FutureWarning

Initializing MistStellarModels on model data in '/home/steveo/projects/main/ebop_maven/ebop_maven/libs/data/stellar_models/mist/default.pkl.xz'
/home/steveo/projects/main/ebop_maven/ebop_maven/libs/stellarmodels.py:258: FutureWarning: Logical ops (and, or, xor) between Pandas objects and dtype-less sequences (e.g. list, tuple) are deprecated and will raise in a future version. Wrap the object in a Series, Index, or np.array before operating instead.
  mask &= index_df[fname] == fval
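
A likely fix (an assumption, not checked against the actual stellarmodels.py code) is to make sure the mask starts out as a numpy/pandas boolean object rather than a plain list or tuple, for example:

    import numpy as np

    # Start from a numpy boolean array so the in-place &= stays within numpy/pandas types.
    mask = np.ones(len(index_df), dtype=bool)
    for fname, fval in criteria.items():   # `criteria` is a hypothetical name here
        mask &= (index_df[fname] == fval).to_numpy()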

In datasets allow directly specified validation and test set row count

make_dataset_files() & make_dataset_file() currently have valid_ratio and test_ratio params for setting the ratio [0, 1) of the total dataset to assign to the validation and testing subsets. Extend this so that the row counts can be given explicitly. The rule could be:

  • <= 1 indicates a ratio
  • > 1 indicates an explicit row count

The make_dataset_files() function will need updating to read the input total row counts from the source trainset so that it can accurately split an explicit row count across the output files when calling make_dataset_file() for each.
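
A sketch of how the rule above might be interpreted (illustrative only; the actual make_dataset_files() signature and naming may differ):

    def resolve_subset_size(value: float, total_rows: int) -> int:
        """Interpret a valid_ratio/test_ratio style argument: values <= 1 are treated
        as a ratio of the total rows, values > 1 as an explicit row count."""
        return int(total_rows * value) if value <= 1 else int(value)

    # e.g. resolve_subset_size(0.2, 250_000)    -> 50_000 (ratio)
    #      resolve_subset_size(10_000, 250_000) -> 10_000 (explicit count)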

make_dataset_files() doesn't support setting test_ratio arg to 1.0

We get the following (the true error is probably masked by the process pooling):

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/steveo/projects/main/ebop_maven/make_training_datasets.py", line 108, in <module>
    datasets.make_dataset_files(trainset_files=sorted(dataset_dir.glob("trainset*.csv")),
  File "/home/steveo/projects/main/ebop_maven/ebop_maven/datasets.py", line 89, in make_dataset_files
    pool.starmap(make_dataset_file, iter_params)
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/multiprocessing/pool.py", line 375, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/multiprocessing/pool.py", line 774, in get
    raise self._value
UnboundLocalError: cannot access local variable 'msg' where it is not associated with a value

Not the most helpful bug report, but the repro is simple: make_dataset_files(..., valid_ratio=0., test_ratio=1.0, ...). I'm expecting the function to only create a testing subdirectory and tfrecord files.

datasets.make_formal_test_dataset() not reading ecc from config

In the following code (from line 347) it looks like, if ecc is set in the target or sector config, it is never used.

                # omega & ecc are not used as labels but we need them for phiS and impact params
                ecosw, esinw = labels["ecosw"], labels["esinw"]
                omega = sector_cfg.get("omega", None) \
                    or np.rad2deg(np.arctan(np.divide(esinw, ecosw))) if ecosw else 0
                ecc = np.divide(ecosw, np.cos(np.deg2rad(omega))) if ecosw else 0
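
A possible shape for the fix (a sketch of intent only, not the actual code) is to prefer any configured values before falling back to deriving them from the labels:

    # Sketch: use omega/ecc from the sector config when given, otherwise derive them.
    ecosw, esinw = labels["ecosw"], labels["esinw"]
    omega = sector_cfg.get("omega", None)
    if omega is None:
        omega = np.rad2deg(np.arctan(np.divide(esinw, ecosw))) if ecosw else 0
    ecc = sector_cfg.get("ecc", None)
    if ecc is None:
        ecc = np.divide(ecosw, np.cos(np.deg2rad(omega))) if ecosw else 0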

remove remaining references to deb_example from Estimator

The Estimator currently has to look to deb_example for information such as input feature names, default values and the mags wrap phase.

It would be better if it could read these metadata from its model so we have the option of using various models, each set up for different features. Even if we don't have multiple models, this change would prevent bugs if deb_example were to change after a model has been saved.

This will require a change to the model structure to enable the metadata to be persisted in it. Perhaps something similar to the custom OutputLayer we've implemented which enables storage of the label names and scales.

move to newer version of tensorflow/keras with support for python >= 3.8

With the inclusion of the ML training code and the supporting changes to requirements.txt the venv/pip install setup has stopped working because the tensorflow library (pinned at 2.6.*) doesn't support the version of Python (3.11) on my system.

Need to move to later/latest tensorflow/keras and make the related code changes.

Investigate having ML model predict sin(i) or cos(i) instead of i/100

I've added sini and cosi to the trainsets output (see 7e71715).

One of these would be a natural way of predicting a value for the inclination without units and in the range 0 - 1 (similar to the magnitude of the other predicted values).

I'm tempted by cos(i) as that ranges over 0 to 0.5 for inclinations in the range 90° to 60° (where most inclinations will lie). Over the same range of inclinations sin(i) only ranges from 1 to 0.87, so a narrower range than the 0.9 to 0.6 of the current approach.

Need to be able to work against multiple concatenated sectors

Will need some rework as currently it all works based on each target/sector being a separate entity. However, some targets have long periods and would benefit from >1 sector in the fitting.

The following will need updating:

  • datasets make_formal_test_dataset() - we still only want one row per target system, but made up of 1 or more sectors
  • model_interactive_tester.ipynb
  • model_testing test_fitting_against_formal_test_dataset() and the methods it depends on

This is not a small change ;)

review best way to work with MIST models: ISOs or EEPs?

The synthetic-mist-tess-dataset is derived from synthetic star systems based on MIST 1.2 stellar models. This currently works by random selection of Z (for which there is one choice) and the initial masses. The system age is selected to be in the late main-sequence of the more massive star. From this the radii, temps, luminosity and logg are looked up, which gives us the info we need for surface brightness calcs and limb darkening lookups. Throw in period, inclination and eccentricity params chosen from random distributions (a la training dataset) and we have everything we need to produce a test light-curve.

The crux of the question being, is this the best way to work with the MIST data?

The models are effectively published in two forms:

  • isochrones (ISOs)
    • a single file (for each Z, Y & vcrit) containing many EEP tables, one per distinct stellar age
    • each table covers a range of stellar initial_mass values and publishes the "current" params
    • effectively the data is "grouped by" log10 age
  • evolutionary tracks (EEP)
    • a single file for each combination of initial mass, Z, Y and vcrit
    • each file contains the evolutionary track for a single star (of i_mass, Z, Y and vcrit)

The current implementation uses the latter data via a convoluted import process.

Investigate whether it will be simpler to use ISOs. The generator code could select a Z and age value, then two masses from those available in the EEP table, and then validate that the phases were reasonable (we don't want protostars or remnants, so bias towards M-S) - repeat if there's a problem. With the Z, age and initial masses we're ready to continue with the current generator approach. The benefit: it removes the need to pre-process, plus the ISO data is more compact.

something wrong with the formal-trainset or formal-test-dataset

I've managed to successfully use the existing code (not yet added here) to train a CNN model on the new formal-trainset. However, when it is tested against the formal-test-dataset I get very bad results. There's something wrong with one, or maybe both, of these new datasets.

[attached file: formal.test.mc.csv]

A quick look at the predictions-vs-labels plot hints that the formal-test-dataset labels haven't been serialized correctly; they're all being read as zero.

[attached plot: predictions_vs_labels_mc]

resolve warning from limb_darkening module

/home/steveo/projects/main/ebop_maven/ebop_maven/libs/limb_darkening.py:91: FutureWarning: The 'delim_whitespace' keyword in pd.read_csv is deprecated and will be removed in a future version. Use ``sep='\s+'`` instead
  return pd.read_csv(data_file,
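
The fix suggested by the warning itself should apply here (hedged: the full read_csv call in limb_darkening.py isn't shown, so the other arguments are omitted):

    import pandas as pd

    # Replace the deprecated delim_whitespace=True keyword with the equivalent separator.
    df = pd.read_csv(data_file, sep=r"\s+")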

Create a new ebop_maven/pipeline module

There's a lot of code spread across datasets, model_testing and model_interactive_tester which is doing much the same thing: find and download lightcurves, prepare them for the estimator, then take the results and get JKTEBOP to fit them. There's not too much duplication, however there are lots of calls between these modules, which is starting to get difficult to manage/extend.

What is needed is a single pipeline module which publishes a set of functions which carry out these tasks and can be stitched together by datasets, model_testing, interactive_tester and future code. This will be published as part of the ebop_maven package for use in client applications too.

This is mainly a refactoring exercise. Once complete it should make some of the other outstanding issues more tractable.

In model_search, failure when setting the trials_save_file arg (~ln 486)

If I try:

    best = fmin(fn = train_and_test_model,
                space = trials_pspace,
                trials = trials,
                algo = tpe.suggest,
                max_evals = MAX_HYPEROPT_EVALS,
                loss_threshold = HYPEROPT_LOSS_TH,
                catch_eval_exceptions = True,
                rstate=np.random.default_rng(SEED),
                trials_save_file=f"{results_dir}/trials.pkl",
                verbose=True,
                show_progressbar=False)

I get the following error:

Traceback (most recent call last):
  File "/home/steveo/projects/main/ebop_maven/model_search.py", line 478, in <module>
    best = fmin(fn = train_and_test_model,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/site-packages/hyperopt/fmin.py", line 540, in fmin
    return trials.fmin(
           ^^^^^^^^^^^^
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/site-packages/hyperopt/base.py", line 671, in fmin
    return fmin(
           ^^^^^
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/site-packages/hyperopt/fmin.py", line 586, in fmin
    rval.exhaust()
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/site-packages/hyperopt/fmin.py", line 364, in exhaust
    self.run(self.max_evals - n_done, block_until_done=self.asynchronous)
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/site-packages/hyperopt/fmin.py", line 304, in run
    pickler.dump(self.trials, open(self.trials_save_file, "wb"))
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/site-packages/cloudpickle/cloudpickle.py", line 1461, in dump
    Pickler(file, protocol=protocol, buffer_callback=buffer_callback).dump(obj)
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/site-packages/cloudpickle/cloudpickle.py", line 1245, in dump
    return super().dump(obj)
           ^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'FuncGraph' object

add flatten() functionality to ingest pipeline

Following a catch-up with John, he's recommended that fitting QR Hya (and probably other targets) would benefit from a flattened LC to get rid of the trends that normal de-trending isn't helping with.
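
For reference, lightkurve already provides a flatten() method on its light curve objects; a rough sketch of what wiring it into the ingest pipeline could look like (the target and window_length are just examples, not the pipeline code):

    import lightkurve as lk

    # Download a TESS light curve and remove long-term trends with flatten().
    lc = lk.search_lightcurve("QR Hya", mission="TESS", author="SPOC")[0].download()
    flat_lc = lc.remove_nans().flatten(window_length=401)  # window_length is illustrative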

visualizations for bulk testset results

This is not the formal test dataset.

Need some visualization of the bulk quality of the predictions against the bulk test dataset. Potential options:

  • overlaid histograms
  • Chang at the exocomm had some nice plots showing the residual across a label's range - should be easy to produce something like this

Review MIST train/test dataset

I've attached a histogram produced from synthetic-mist-tess-dataset.json with a 0.5 drop ratio. Review this and the json to see if we can get better/smoother coverage (e.g. relatively few systems with e=0).

Perhaps fewer stellar masses but a denser parameter space and/or a higher drop ratio to keep the instance count reasonable.

[attached plot: histogram_full]

Investigate "UserWarning: Your input ran out of data; interrupting training." when training CNN

I see the following text whenever training the CNN: UserWarning: Your input ran out of data; interrupting training.

Training continues and I get a usable model. Putting a repeat() in, as advised, just means that epoch 1 never seems to end.

Training the model on 80000 training and 10000 validation instances, with a further 10000 instances held back for test.
Epoch 1/10
   1000/Unknown 12s 9ms/step - loss: 0.1261 - mse: 0.0508/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/contextlib.py:158: UserWarning: Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.
  self.gen.throw(value)
1000/1000 ━━━━━━━━━━━━━━━━━━━━ 13s 10ms/step - loss: 0.1261 - mse: 0.0508 - val_loss: 0.0555 - val_mse: 0.0157
Epoch 2/10

It only seems to be a problem on the first epoch.
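
A common pattern for silencing this (hedged: this is the generic Keras/tf.data recipe, not a confirmed fix for this training code) is to pair .repeat() with an explicit steps_per_epoch so each epoch has a defined end; the instance counts are taken from the log above, while the batch size and names are illustrative:

    # Illustrative: 80,000 training / 10,000 validation instances at batch size 80.
    # `model`, `train_ds` and `valid_ds` are assumed to have been built already.
    batch_size = 80
    steps_per_epoch = 80_000 // batch_size
    validation_steps = 10_000 // batch_size

    model.fit(train_ds.repeat(), epochs=10,
              steps_per_epoch=steps_per_epoch,
              validation_data=valid_ds.repeat(),
              validation_steps=validation_steps)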

investigate discrepancy in reported MAE/MSE values from model_testing

The MAE/MSE values reported by evaluate_model_against_dataset(), which uses keras metrics classes, appear significantly different to those reported by fit_against_formal_test_dataset() which uses preds_vs_labels_dicts_to_table()/numpy functions.

For example, on the current default-model, we get the following logged from evaluate_model_against_dataset():

-----------------------------------
Total      MAE (nonmc): 0.081985429
Total      MSE (nonmc): 0.019339621
Total r2_score (nonmc): 0.652802169
-----------------------------------

whereas in predctions-nonmc-vs-label.txt we see:

----------------------------------------------------------------------------------------------------
           | rA_plus_rB          k          J      ecosw      esinw         bP        MAE        MSE
----------------------------------------------------------------------------------------------------
 ...
====================================================================================================
MAE        |   0.019151   0.151148   0.067645   0.007184   0.030970   0.130484   0.067764
MSE        |   0.000701   0.043867   0.011438   0.000089   0.001744   0.025792              0.013938

I think it may be down to the fact that the former runs against all 27 targets in the config json, whereas the latter is against those 22 that are not excluded. Need to confirm that this is the case and make things consistent.
