
ebop_maven's Introduction

EBOP Model Automatic input Value Estimation Neural network

A machine learning model for estimating input values for the characterization of detached eclipsing binary stars by JKTEBOP.

Included in this repository are the codes for generating the training & test datasets, for building, training & testing the machine learning model, and for other supporting functionality still to be decided.

Warning

This is a work in progress. Handle with care.

Installation

This code base was developed within the context of an Anaconda 3 conda environment named ebop_maven. This environment supports Python 3.9+, TensorFlow, Keras, lightkurve, astropy and any further libraries upon which the code is dependent. To set up the ebop_maven conda environment, having first cloned this GitHub repo, open a Terminal, navigate to this local directory and run the following command:

$ conda env create -f environment.yaml

You will need to activate the environment whenever you wish to run any of these modules. Use the following command:

$ conda activate ebop_maven

JKTEBOP

These codes have a dependency on the JKTEBOP tool for generating and fitting lightcurves. The installation media and build instructions can be found here. The JKTEBOP_DIR environment variable is used by ebop_maven to locate the executable at runtime and is set to ~/jktebop/ in the ebop_maven conda env. This may require updating to match the location where JKTEBOP has been set up.
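
For illustration, a minimal sketch of how code might resolve the executable from this environment variable (this is an assumption, not the actual ebop_maven lookup, and the executable file name is hypothetical):

    import os
    from pathlib import Path

    # Resolve the JKTEBOP executable from the JKTEBOP_DIR environment variable,
    # falling back to the default ~/jktebop/ location set in the conda env.
    jktebop_dir = Path(os.environ.get("JKTEBOP_DIR", "~/jktebop/")).expanduser()
    jktebop_exe = jktebop_dir / "jktebop"  # hypothetical executable file name
    if not jktebop_exe.exists():
        raise FileNotFoundError(f"JKTEBOP not found at {jktebop_exe}")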

Alternative: venv installation

If you prefer not to use a conda environment, the following venv setup works, although I haven't tested it as thoroughly. Again, from this directory, run the following to create and activate the .ebop_maven env:

$ python -m venv .ebop_maven

$ source .ebop_maven/bin/activate

Then run the following to set up the required packages within the environment:

$ pip install -r requirements.txt

You may need to install the jupyter kernel in the new venv:

$ ipython kernel install --user --name=.ebop_maven

The ebop_maven package

Finally, there is support for installing ebop_maven as a pip package; however, this is still very much a "work in progress" and subject to change. Simply run:

$ pip install git+https://github.com/SteveOv/ebop_maven

This will install the Estimator class, a pre-built default model and the required support libraries. The code used in the following steps for training and testing models is not installed.
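
As a rough, hedged illustration of the intended usage (the module path, constructor and predict() call shown here are assumptions; check the package source for the real API):

    import numpy as np
    from ebop_maven.estimator import Estimator  # module path is an assumption

    estimator = Estimator()                 # assumed to load the pre-built default model
    mags_feature = np.zeros((1, 4096, 1))   # placeholder: one folded, binned light curve
    predictions = estimator.predict(mags_feature)  # assumed predict() signature
    print(predictions)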

Usage

Generation of training and testing datasets

To generate the datasets which will be used to train and test the machine learning model, first run

$ python3 make_training_dataset.py

to generate the formal-training-dataset-250k. This is a synthetic training dataset built by randomly sampling distributions of JKTEBOP model parameters across its entire parameter space. It generates 250,000 instances split 80:20 between training and validation sets.

Next run

$ python3 make_synthetic_test_dataset.py

to build the synthetic-mist-tess-dataset. This is the full dataset of synthetic light-curves generated from physically plausible systems based on MIST stellar models and the TESS photometric bandpass. It generates 20,000 randomly oriented instances based on an initial random selection of metallicity, age and initial masses supplemented with lookups of stellar parameters in the isochrones.

This module depends on MIST isochrone files which are not distributed as part of this GitHub repo. You will need to download and extract a pre-built model grid by following the instructions in readme.txt.

Finally run

$ python3 make_formal_test_dataset.py

to build the formal-test-dataset. These are a set of real, well-characterized systems from DEBCAT selected on the availability of TESS lightcurves, suitability for fitting with JKTEBOP and a published characterization from which parameters can be taken. The chosen systems are configured in the file ./config/formal-test-dataset.json which contains the search criteria, labels and supplementary information for each.

These steps will take roughly one to two hours on a moderately powerful system, with the resulting datasets taking up ~10 GB of disk space under the ./datasets/ directory.

Training and testing the machine learning model

The default machine learning model can be built and tested by running the following:

$ python3 make_trained_cnn_model.py

This will create the default CNN/DNN model, trained and validated on the formal-training-dataset to predict the $r_A+r_B$, $k$, $J$, $e\cos{\omega}$, $e\sin{\omega}$ and $b_P$ labels. Once trained it is evaluated on the synthetic-mist-tess-dataset before a final evaluation on the real systems of the formal-test-dataset.
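
For reference, the eccentricity and argument of periastron can be recovered from the predicted $e\cos{\omega}$ and $e\sin{\omega}$ values via $e = \sqrt{(e\cos{\omega})^2 + (e\sin{\omega})^2}$ and $\omega = \operatorname{atan2}(e\sin{\omega},\, e\cos{\omega})$.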

By default CUDA cores are disabled so that training and testing are repeatable. In this configuration the process above takes about an hour and a half on my laptop with an 8 core 11th gen Intel i7 CPU. If you have them, CUDA cores can be enabled by setting the ENFORCE_REPEATABILITY const to False, giving a significant reduction in training time.
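
As a hedged sketch of the kind of TensorFlow configuration such a flag typically controls (the actual mechanism used in the training script may differ):

    import tensorflow as tf

    ENFORCE_REPEATABILITY = True  # set to False to allow CUDA cores to be used

    if ENFORCE_REPEATABILITY:
        # Hide any GPUs and pin the random seeds so results are repeatable run to run.
        tf.config.set_visible_devices([], "GPU")
        tf.keras.utils.set_random_seed(42)
        tf.config.experimental.enable_op_determinism()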

Note: there are recorded incidents where TensorFlow v2.16.1 does not "see" installed GPUs (me, for one) and under these circumstances the above change may have no effect.

The compiled and trained model will be saved to the ./drop/training/cnn-new-ext0-4096-0.75-250k/default-model.keras file. Plots of the learning curves and the model structure are written to the plots sub-directory.

A detailed evaluation of any models can be invoked with the following command:

$ python3 model_testing.py [model_files ...]

This will initially evaluate model predictions against the synthetic-mist-tess-dataset and the formal-test-dataset. Subsequently it will run the full end-to-end testing of model predictions and JKTEBOP fitting against the formal-test-dataset. Testing output files and a log file will be written to a testing sub-directory alongside any tested models.

You can test the pre-built model, at ./ebop_maven/data/estimator/default-model.keras, by running model_testing without any arguments. In this case, the results will be written to the ./drop/training/published/testing/ directory.

Warning

The model structure and hyperparameters are still subject to change as ongoing testing and model searches continue to reveal improvements.

Interactive model tester

This is a Jupyter notebook which can be used to download, predict and fit any target system selected from the formal-test-dataset, using either the pre-built default-model within the ./ebop_maven/data/estimator directory or any model found within ./drop/training/. It can be run with:

$ jupyter notebook model_interactive_tester.ipynb

Model structure and hyperparameter search

A search over a range of model structures and hyperparameter values, using the hyperopt library's tpe.suggest algorithm, can be run with the following command:

$ python3 model_search.py

Warning

This will take a long time! As in hours, if not days.


ebop_maven's Issues

better/more consistent support for uncertainties

Needs some work so that we propagate uncertainties from predictions/fitting through the calculations. We have a couple of options:

  • uncertainties package: a brief prototype with the orbital.orbital_inclination() function shows this is far from just a drop-in. It doesn't appear to play well with numpy trig functions (or at least arccos) or astropy units (see the sketch after this list)
  • astropy.uncertainty: I've never used this before so there may be a learning curve
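
For reference, a minimal sketch of the kind of workaround the uncertainties package itself offers, using its umath wrapper rather than the numpy function (the example value is arbitrary; this is not the orbital.orbital_inclination() code):

    from uncertainties import ufloat, umath

    b = ufloat(0.25, 0.02)   # example quantity with an uncertainty
    # numpy's arccos fails on uncertainties' Variable type, but the package's
    # own umath wrapper propagates the uncertainty through acos:
    inc_rad = umath.acos(b)
    print(inc_rad)           # value with propagated uncertainty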

investigate revised strategy for training/validation dataset

Currently I have the training/validation/testing data split in advance by supplying valid_ratio and test_ratio params when calling datasets.make_dataset_files(). Another option would be to leave these ratios at zero, then let TF do the training/validation split on the fly and use the MIST dataset as the test dataset. Final testing would be with the formal-test-dataset as it is now. This would make life simpler but I need to see how it would work out.
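
As a rough sketch of what the on-the-fly option could look like with tf.data (illustrative only, assuming a pre-shuffled dataset of known size; not the current datasets module code):

    import tensorflow as tf

    def split_train_valid(ds: tf.data.Dataset, total_count: int, valid_ratio: float = 0.2):
        """Split a pre-shuffled dataset into training and validation subsets on the fly."""
        valid_count = int(total_count * valid_ratio)
        return ds.skip(valid_count), ds.take(valid_count)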

compare results of mags+ext_features model vs mags only model

It would be good to move away from the mags+ext_features approach as getting the phiS and dS_over_dP values can be quite difficult as part of an input pipeline. If I can get good enough results without these features it will make future developments much easier.

Update StellarModels to resolve the pandas FutureWarning

Initializing MistStellarModels on model data in '/home/steveo/projects/main/ebop_maven/ebop_maven/libs/data/stellar_models/mist/default.pkl.xz'
/home/steveo/projects/main/ebop_maven/ebop_maven/libs/stellarmodels.py:258: FutureWarning: Logical ops (and, or, xor) between Pandas objects and dtype-less sequences (e.g. list, tuple) are deprecated and will raise in a future version. Wrap the object in a Series, Index, or np.array before operating instead.
  mask &= index_df[fname] == fval
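
A likely fix (an assumption, not checked against the actual stellarmodels.py code) is to make sure the mask starts out as a numpy/pandas boolean object rather than a plain list or tuple, for example:

    import numpy as np

    # Start from a numpy boolean array so the in-place &= stays within numpy/pandas types.
    mask = np.ones(len(index_df), dtype=bool)
    for fname, fval in criteria.items():   # `criteria` is a hypothetical name here
        mask &= (index_df[fname] == fval).to_numpy()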

In datasets allow directly specified validation and test set row count

make_dataset_files() & make_dataset_file() currently have valid_ratio and test_ratio params for setting the ratio [0, 1) of the total dataset to assign to the validation and testing subsets. Extend this so that the row counts can be given explicitly. The rule could be:

  • <= 1 indicates a ratio
  • > 1 indicates an explicit row count

The make_dataset_files() function will need updating to read the input total row counts from the source trainset so that it can accurately split an explicit row count across the output files when calling make_dataset_file() for each.
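
A sketch of how the rule above might be interpreted (illustrative only; the actual make_dataset_files() signature and naming may differ):

    def resolve_subset_size(value: float, total_rows: int) -> int:
        """Interpret a valid_ratio/test_ratio style argument: values <= 1 are treated
        as a ratio of the total rows, values > 1 as an explicit row count."""
        return int(total_rows * value) if value <= 1 else int(value)

    # e.g. resolve_subset_size(0.2, 250_000)    -> 50_000 (ratio)
    #      resolve_subset_size(10_000, 250_000) -> 10_000 (explicit count)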

make_dataset_files() doesn't support setting test_ratio arg to 1.0

We get the following (the true error is probably masked by the process pooling):

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/steveo/projects/main/ebop_maven/make_training_datasets.py", line 108, in <module>
    datasets.make_dataset_files(trainset_files=sorted(dataset_dir.glob("trainset*.csv")),
  File "/home/steveo/projects/main/ebop_maven/ebop_maven/datasets.py", line 89, in make_dataset_files
    pool.starmap(make_dataset_file, iter_params)
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/multiprocessing/pool.py", line 375, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/multiprocessing/pool.py", line 774, in get
    raise self._value
UnboundLocalError: cannot access local variable 'msg' where it is not associated with a value

Not the most helpful bug report, but the repro is simple: make_dataset_files(..., valid_ratio=0., test_ratio=1.0, ...). I'm expecting the function to only create a testing subdirectory and tfrecord files.

datasets.make_formal_test_dataset() not reading ecc from config

In the following code (from line 347) it looks like, if ecc is set in the target or sector config, it is never used.

                # omega & ecc are not used as labels but we need them for phiS and impact params
                ecosw, esinw = labels["ecosw"], labels["esinw"]
                omega = sector_cfg.get("omega", None) \
                    or np.rad2deg(np.arctan(np.divide(esinw, ecosw))) if ecosw else 0
                ecc = np.divide(ecosw, np.cos(np.deg2rad(omega))) if ecosw else 0
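
A possible shape for the fix (a sketch of intent only, not the actual code) is to prefer any configured values before falling back to deriving them from the labels:

    # Sketch: use omega/ecc from the sector config when given, otherwise derive them.
    ecosw, esinw = labels["ecosw"], labels["esinw"]
    omega = sector_cfg.get("omega", None)
    if omega is None:
        omega = np.rad2deg(np.arctan(np.divide(esinw, ecosw))) if ecosw else 0
    ecc = sector_cfg.get("ecc", None)
    if ecc is None:
        ecc = np.divide(ecosw, np.cos(np.deg2rad(omega))) if ecosw else 0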

remove remaining references to deb_example from Estimator

The Estimator currently has to look to deb_example for information such as input feature names, default values and the mags wrap phase.

It would be better if it could read these metadata from its model so we have the option of using various models, each set up for different features. Even if we don't have multiple models, this change would prevent bugs if deb_example were to change after a model has been saved.

This will require a change to the model structure to enable the metadata to be persisted in it. Perhaps something similar to the custom OutputLayer we've implemented which enables storage of the label names and scales.

move to newer version of tensorflow/keras with support for python >= 3.8

With the inclusion of the ML training code and the supporting changes to requirements.txt the venv/pip install setup has stopped working because the tensorflow library (pinned at 2.6.*) doesn't support the version of Python (3.11) on my system.

Need to move to later/latest tensorflow/keras and make the related code changes.

Investigate having ML model predict sin(i) or cos(i) instead of i/100

I've added sini and cosi to the trainsets output (see 7e71715).

One of these would be a natural way of predicting a value for the inclination without units and in the range 0 - 1 (similar to the magnitude of the other predicted values).

I'm tempted by cos(i) as that ranges over 0 to 0.5 for inclinations in the range 90° to 60° (where most inclinations will lie). Over the same range of inclinations sin(i) only ranges from 1 to 0.87, so a narrower range than the 0.9 to 0.6 of the current approach.

Need to be able to work against multiple concatenated sectors

Will need some rework as currently it all works based on each target/sector being a separate entity. However, some targets have long periods and would benefit from >1 sector in the fitting.

The following will need updating:

  • datasets make_formal_test_dataset() - we still only want one row per target system, but made up of 1 or more sectors
  • model_interactive_tester.ipynb
  • model_testing test_fitting_against_formal_test_dataset() and the methods it depends on

This is not a small change ;)

review best way to work with MIST models: ISOs or EEPs?

The synthetic-mist-tess-dataset is derived from synthetic star systems based on MIST 1.2 stellar models. This currently works by random selection of Z (for which there is one choice) and the initial masses. The system age is selected to be in the late main-sequence of the more massive star. From this the radii, temps, luminosity and logg are looked up, which gives us the info we need for surface brightness calcs and limb darkening lookups. Throw in period, inclination and eccentricity params chosen from random distributions (a la training dataset) and we have everything we need to produce a test light-curve.

The crux of the question being, is this the best way to work with the MIST data?

The models are effectively published in two forms:

  • isochrones (ISOs)
    • a single file (for each Z, Y & vcrit) containing many EEP tables, one per distinct stellar age
    • each table covers a range of stellar initial_mass values and publishes the "current" params
    • effectively the data is "grouped by" log10 age
  • evolutionary tracks (EEP)
    • a single file for each combination of initial mass, Z, Y and vcrit
    • each file contains the evolutionary track for a single star (of i_mass, Z, Y and vcrit)

The current implementation uses the latter data via a convoluted import process.

Investigate whether it will be simpler to use ISOs. The generator code could select a Z and age value, then two masses from those available in the EEP table, and then validate that the phases were reasonable (we don't want protostars or remnants, so bias towards M-S) - repeat if there's a problem. With the Z, age and initial masses we're ready to continue with the current generator approach. The benefit: it removes the need to pre-process, plus the ISO data is more compact.

something wrong with the formal-trainset or formal-test-dataset

I've managed to successfully use the existing code (not yet added here) to train a CNN model on the new formal-trainset. However, when it is tested against the formal-test-dataset I get very bad results. There's something wrong with one, or maybe both, of these new datasets.

[attached file: formal.test.mc.csv]

A quick look at the predictions-vs-labels plot hints that the formal-test-dataset labels haven't been serialized correctly; they're all being read as zero.

[attached plot: predictions_vs_labels_mc]

resolve warning from limb_darkening module

/home/steveo/projects/main/ebop_maven/ebop_maven/libs/limb_darkening.py:91: FutureWarning: The 'delim_whitespace' keyword in pd.read_csv is deprecated and will be removed in a future version. Use ``sep='\s+'`` instead
  return pd.read_csv(data_file,
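
The fix suggested by the warning itself should apply here (hedged: the full read_csv call in limb_darkening.py isn't shown, so the other arguments are omitted):

    import pandas as pd

    # Replace the deprecated delim_whitespace=True keyword with the equivalent separator.
    df = pd.read_csv(data_file, sep=r"\s+")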

Create a new ebop_maven/pipeline module

There's a lot of code spread across datasets, model_testing and model_interactive_tester which is doing much the same thing: find and download lightcurves, prepare them for the estimator, then take the results and get JKTEBOP to fit them. There's not too much duplication, however there are lots of calls between these modules, which is starting to get difficult to manage/extend.

What is needed is a single pipeline module which publishes a set of functions which carry out these tasks and can be stitched together by datasets, model_testing, interactive_tester and future code. This will be published as part of the ebop_maven package for use in client applications too.

This is mainly a refactoring exercise. Once complete it should make some of the other outstanding issues more tractable.

In model_search, failure when setting the trials_save_file arg (~ln 486)

If I try:

    best = fmin(fn = train_and_test_model,
                space = trials_pspace,
                trials = trials,
                algo = tpe.suggest,
                max_evals = MAX_HYPEROPT_EVALS,
                loss_threshold = HYPEROPT_LOSS_TH,
                catch_eval_exceptions = True,
                rstate=np.random.default_rng(SEED),
                trials_save_file=f"{results_dir}/trials.pkl",
                verbose=True,
                show_progressbar=False)

I get the following error:

Traceback (most recent call last):
  File "/home/steveo/projects/main/ebop_maven/model_search.py", line 478, in <module>
    best = fmin(fn = train_and_test_model,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/site-packages/hyperopt/fmin.py", line 540, in fmin
    return trials.fmin(
           ^^^^^^^^^^^^
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/site-packages/hyperopt/base.py", line 671, in fmin
    return fmin(
           ^^^^^
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/site-packages/hyperopt/fmin.py", line 586, in fmin
    rval.exhaust()
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/site-packages/hyperopt/fmin.py", line 364, in exhaust
    self.run(self.max_evals - n_done, block_until_done=self.asynchronous)
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/site-packages/hyperopt/fmin.py", line 304, in run
    pickler.dump(self.trials, open(self.trials_save_file, "wb"))
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/site-packages/cloudpickle/cloudpickle.py", line 1461, in dump
    Pickler(file, protocol=protocol, buffer_callback=buffer_callback).dump(obj)
  File "/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/site-packages/cloudpickle/cloudpickle.py", line 1245, in dump
    return super().dump(obj)
           ^^^^^^^^^^^^^^^^^
TypeError: cannot pickle 'FuncGraph' object

add flatten() functionality to ingest pipeline

Following a catch-up with John, he's recommended that fitting QR Hya (and probably other targets) would benefit from a flattened LC to get rid of the trends that normal de-trending isn't helping with.
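
For reference, lightkurve already provides a flatten() method on its light curve objects; a rough sketch of what wiring it into the ingest pipeline could look like (the target and window_length are just examples, not the pipeline code):

    import lightkurve as lk

    # Download a TESS light curve and remove long-term trends with flatten().
    lc = lk.search_lightcurve("QR Hya", mission="TESS", author="SPOC")[0].download()
    flat_lc = lc.remove_nans().flatten(window_length=401)  # window_length is illustrative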

visualizations for bulk testset results

This is not the formal test dataset.

Need some visualization of the bulk quality of the predictions against the bulk test dataset. Potential options:

  • overlaid histograms
  • Chang at the exocomm had some nice plots showing the residual across a label's range - should be easy to produce something like this

Review MIST train/test dataset

I've attached a histogram produced from synthetic-mist-tess-dataset.json with a 0.5 drop ratio. Review this and the json to see if we can get better/smoother coverage (e.g. relatively few systems with e=0).

Perhaps fewer stellar masses but a denser parameter space and/or a higher drop ratio to keep the instance count reasonable.

[attached plot: histogram_full]

Investigate "UserWarning: Your input ran out of data; interrupting training." when training CNN

I see the following text whenever training the CNN: UserWarning: Your input ran out of data; interrupting training.

Training continues and I get a usable model. Putting a repeat() in, as advised, just means that epoch 1 never seems to end.

Training the model on 80000 training and 10000 validation instances, with a further 10000 instances held back for test.
Epoch 1/10
   1000/Unknown 12s 9ms/step - loss: 0.1261 - mse: 0.0508/home/steveo/anaconda3/envs/ebop_maven/lib/python3.12/contextlib.py:158: UserWarning: Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches. You may need to use the `.repeat()` function when building your dataset.
  self.gen.throw(value)
1000/1000 ━━━━━━━━━━━━━━━━━━━━ 13s 10ms/step - loss: 0.1261 - mse: 0.0508 - val_loss: 0.0555 - val_mse: 0.0157
Epoch 2/10

It only seems to be a problem on the first epoch.
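
A common pattern for silencing this (hedged: this is the generic Keras/tf.data recipe, not a confirmed fix for this training code) is to pair .repeat() with an explicit steps_per_epoch so each epoch has a defined end; the instance counts are taken from the log above, while the batch size and names are illustrative:

    # Illustrative: 80,000 training / 10,000 validation instances at batch size 80.
    # `model`, `train_ds` and `valid_ds` are assumed to have been built already.
    batch_size = 80
    steps_per_epoch = 80_000 // batch_size
    validation_steps = 10_000 // batch_size

    model.fit(train_ds.repeat(), epochs=10,
              steps_per_epoch=steps_per_epoch,
              validation_data=valid_ds.repeat(),
              validation_steps=validation_steps)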

investigate discrepancy in reported MAE/MSE values from model_testing

The MAE/MSE values reported by evaluate_model_against_dataset(), which uses keras metrics classes, appear significantly different to those reported by fit_against_formal_test_dataset() which uses preds_vs_labels_dicts_to_table()/numpy functions.

For example, on the current default-model, we get the following logged from evaluate_model_against_dataset():

-----------------------------------
Total      MAE (nonmc): 0.081985429
Total      MSE (nonmc): 0.019339621
Total r2_score (nonmc): 0.652802169
-----------------------------------

whereas in predctions-nonmc-vs-label.txt we see:

----------------------------------------------------------------------------------------------------
           | rA_plus_rB          k          J      ecosw      esinw         bP        MAE        MSE
----------------------------------------------------------------------------------------------------
 ...
====================================================================================================
MAE        |   0.019151   0.151148   0.067645   0.007184   0.030970   0.130484   0.067764
MSE        |   0.000701   0.043867   0.011438   0.000089   0.001744   0.025792              0.013938

I think it may be down to the fact that the former runs against all 27 targets in the config json, whereas the latter is against those 22 that are not excluded. Need to confirm that this is the case and make things consistent.
