gully / blase Goto Github PK

View Code? Open in Web Editor NEW

24.0 24.0 6.0 29.57 MB

Interpretable Machine Learning for astronomical spectroscopy in PyTorch and JAX

Home Page: https://blase.readthedocs.io

License: MIT License

Python 1.21% Jupyter Notebook 92.56% Makefile 0.01% TeX 6.23%

astronomy interpretable-machine-learning machine-learning spectroscopy

blase's Introduction

Hi there 👋 I'm gully

I build scientific Python software, ideally with PyTorch/JAX, focusing on interpretability, flexibility, uncertainty, and impact.

Creator of:

blasé: Interpretable Machine Learning for super-resolution semi-empirical spectroscopy in PyTorch and JAX (paper)
gollum: Python API to precomputed synthetic spectral models
muler: Python API to échelle spectra from IGRINS, HPF, and Keck NIRSPEC (paper)
ynot: A spectrograph digital twin for pixel-perfect échellogram forward modeling in PyTorch (paper draft)

Contributor to:

lightkurve: Python API for precision time series photometry from Kepler/K2/TESS
Starfish: Robust likelihood function for astronomical spectral inference

Selected publications:

A Large and Variable Leading Tail of Helium in a Hot Saturn Undergoing Runaway Inflation (paper, code)
Observationally Constraining the Starspot Properties of Magnetically Active M67 Sub-subgiant S1063 (paper, code)
Placing the Spotted T Tauri Star LkCa 4 on an HR Diagram (paper, code)
Optical characterization of gaps in directly bonded Si compound optics using infrared spectroscopy (paper, code)
My PhD Thesis at UT Austin Astronomy (dissertation, revision history)

🆕 Unpublished projects

Starspots: contracosta, acdc, 🔒monhegan, 🔒 star-witness, 🔒hopful, 🔒 calico, xveganx, 🔒 KaneDoe, 🔒 varasly
Brown Dwarfs: 🔒gandules, varsity, 🔒 lombok, 🔒 ucdwhpf, jammer, 🔒 BAADE, 🔒zoja
Methodological: coldrum (ft. pysr), fiatlux, TgiF, HelloWorldNet, gpytorch-astro, ffi-motion, bombcat, probabilisticAGN
Hardware: nubble, postale

blase's People

Contributors

Stargazers

Watchers

Forkers

erwanp mgebran jibancat aperez313 dkrolikowski sujay-shankar

blase's Issues

IGRINS figure for paper

We mention IGRINS a bunch in the beginning and then haven't shown any figures for it yet. We should! Let's add a figure of IGRINS + Sonora Bobcat models.

Themes that could go in the paper, but currently are left out

Here are some themes that could go in the paper, but currently are left out.

Injection/recovery tests
We did these, but left them out for now because of time/space/priorities. The code had changed somewhat since they were run and so we'd likely have to redo them. I think they're better suited to a follow-on paper anyways.
Earth Science Remote Sensing
The newfound precision semi-empirical telluric capability may be so good as to rival Earth Science remote sensing techniques. We could catalog a grid of semi-empirical high-bandwidth Earth Atmosphere spectra observed under a range of conditions, with the astronomical spectrum removed (use A0V stars so it is easy). I am not aware of such a resource, and many Earth science practitioners don't have access to 10 meter telescopes or such high-bandwidth spectrographs in this wavelength range at night time. We could/should mention these themes. Since it's not "Astrophysics" I left it out for now, but it is still "Science". In particular, remote sensing of Methane would have both climate change and policy enforcement implications. This topic is big enough to be explored in its own paper, probably not for an astronomy journal.
Plots of the clustering of line properties
I made these but did not put them in the paper. As a result all the figures (other than Fig 1) are of Spectra and not derived properties. While the spectra are nice, showing that we have access to the full catalog of line properties is only stated (albeit repeatedly) in the paper. Showing it with a graphic would emphasize this important aspect of the approach. There are many choices for how to generate that figure, upto and including a corner plot. I am fine to leave these plots for a separate paper on EW-finding or something like that. The paper is long enough as it is!

Slides for Cool Stars 21 Machine Learning Splinter Session

I am giving a talk at the Machine Learning splinter session at Cool Stars 21 in Toulouse tomorrow. 🥳

Here is a draft of the slides:
https://speakerdeck.com/gully/blase-an-interpretable-transfer-learning-approach-to-cool-star-echelle-spectroscopy

Any feedback is invited!

GPU vs CPU benchmarking for sparse (and/or dense) emulator

We have the sparse implementation of the emulator working, yay! 🥳 🍾

I am running the sparse implementation on GPU now and it works nicely. On the PHOENIX HPF-bandwith clone task, I'm seeing about 0.5 seconds per iteration on RTX 2070 GPU whereas my Macbook Air M1 is 3.8 seconds per iteration for a 7.5 boost on the GPU. We are only using <10% of the memory and computation of the GPU though so I suspect we will see even bigger gains as I increase the wingcut size, and decrease the prominence of lines (i.e. emulator weaker and weaker lines, producing more lines to clone).

I'm impressed that the sparse matrix does Okay on the GPU. I heard that the NVIDIA A100's have sparse matrix primatives, but I'm not sure how much of that exists in CUDA/PyTorch, and how an RTX-series GPU compares, especially for double precision, etc.

Paper 1 submission hiccups

Greetings from the rooftop of the Simons Center for Computational Astrophysics, on a beautiful summer-in-NYC day 👋 🌇

I submitted the paper last Friday evening, yay! The AASJournal came back with a few hiccups printed below:

During our quality check we notice the following problems:

The abstract is over the 250 words by 27 words. Please reduce the words and upload a corrected file. Also, please ensure the abstract in the ms.pdf matches the abstract in the system.
Question marks on pages 8 and 17:
such as MOLECFIT (?)
same way as described in Section (?).
Our new policy for authors is to add line numbers to their manuscript source file. This will help the reviewer write up the report. Please make the change and upload a new source file. Instructions can be found at:

https://journals.aas.org/manuscript-preparation/#linenumber

Injection/recovery figure for paper

We did a bunch of injection/recovery experiments, let's include one of those in the paper. It's a great way to show the role of regularization.

Killer app demo brainstorm?

Good conversation with @astrocaroline today about a killer app demo for blase.

The framework is already pretty slick, and has lots of promise for awesome extensions. We seek a scientific application that "works out of the box" and is transformative to demo its current usefulness. Here is a brainstorm of some applications. We can sort them into ease of implementation and impact axes later.

Precision Doppler Imaging
Telluric-immune spectral templates
Precision spectral decomposition
"Precision Equivalent width machine"
Robust exoplanet atmosphere emission spectra
M-dwarf spectral templates
Precision stellar activity (let line positions vary)

Some of these applications require complicated extensions that don't meet our "works out of the box" criterion. Items 1, 3, 4, 6, and 7 seem plausible. Maybe multi-epoch young stars are a good fit?

An attribute (not nn.Param) for Equivalent Width and FWHM

What we want

An attribute for Equivalent Width. It will have the same shape as lam_centers, amp, sigma, and gamma, but it will not be an nn.Parameter. Just a derived parameter.

How to compute it

Two options:

Analytically
This turns out to be impossible for Voigt profiles because the Lorenzian long tails integrate to infinity. However, we could simply provide a number of FWHMs out to which we integrate. This choice has the benefit of operating on a table of derived properties $\sigma$, $A$, $\gamma$.
Numerically
Simply torch.sum() the n_lines axis of flux2D, including the size of the pixels in angstroms. Probably np.trapz or equivalent would work better (or scipy quad).

Refactor/deprecate the notebooks

There are a bunch of notebooks that have served their purpose. We should deprecate and refactor the notebooks. I'd recommend a running-number naming convention for whichever notebooks we want to keep.

Add unit tests for LogEmulator

We need to make sure that it can actually work.

Update the API docstrings to reflect the mathematical notation in the paper

Title says it all!

Migrate telluric module towards the sparse implementation

We have some awesome existing source code for the telluric module. The code takes molecular data from the Hitran Python Interface (HAPI) and produces spectra.

That code goes through the trouble of cascading the transmission through a multi-layer atmosphere. While that's nice and accurate and all, it's a bit of a computational waste, since for 100 layers, we have to do 100x the amount of computation. In the end the spectral lines still mostly look like Voigt profiles. So what we should do instead is emulate a precomputed synthetic model spectrum of telluric transmission from a high-resolution, high accuracy model such as TelFit. We proceed the same way we currently do for emulating precomputed synthetic spectra of stars: we will set a threshold and use scipy find-peaks to find all the prominent telluric absorption lines.

We will introduce one augmentation at that stage: rather than instantiate Voigts with unknown sigmas and gammas, we can initialize the amplitude, sigma and gamma with values that a typical temperature/humidity/pressure-level would dictate. That might be more trouble and less accurate than it's worth though? So there are some tradeoffs in the initialization to explore.

In either case, the training will proceed in the same way as we're used to. One good thing is that our wingcut can be much smaller than in the stellar case. We do not have to worry about RV and vsini, so the influence of a given telluric line stays very close to its few tens of pixels. On the downside, the sampling of the telluric model is much finer than the stellar model. So we will have a very tall-and-thin sparse 2D matrix for telluric models and---when coalesced---it will yield a high resolution, high-bandwidth 1D spectrum.

Once the telluric model parameters are pretrained on the TelFit model, they can be combined with the stellar model. This introduces another question: how do we want to handle the sampling? Should we evaluate the stellar model on the same wavelength coordinate grid as the telluric model? That would makes the stellar model match the finer sampling of the telluric model, blooming the computation time. Ideally we would want to get away with the coarser stellar sampling, but some telluric lines may be too narrow to accurately sample with "merely" 0.01 Angstrom resolution. We'll just have to explore! The cool thing is that I think we can stick with a small wingcut for the telluric model even through the data-application phase.

Loading a pre-trained model with custom wavelength_grid requires sufficient bandwidth to see all the lines

SparseLinearEmulator (and all the emulators for that matter) support a neat feature-- you can hand-in a wavelength_grid vector that is different from the native precomputed synthetic spectral model's wavelength coodinates. That strategy allows us to pass in logarithmically spaced points which facilitates later Doppler convolution steps. Yay!

A problem arises if you pass in too small of a bandwidth because a given line may then exist outside of the wavelength grid.

Two solutions:

Check for and then demand that the bandwidth be at least as wide as the reddest and bluest wavelengths.
Truncate the input lines to only include those in the handed-in bandwidth.

The right answer is problem-specific and to be determinded.

Version 0.8 of the paper

Yay, I sent version "0.7" of the paper to @astrocaroline, while that review is underway, there are a few placeholders to fill-in:

IGRINS Sonora-Bobcat demo figure and text (Section 5.5) and Issue #41
Injection recovery test figure (Section 5.6) and Issue #42
Figure 4-- Replace with a telluric demo, may require redoing the whole figure sequence to illustrate a region with telluric lines (see the step-by-step tutorial) #37
Write conclusions
Re-order the presentation of the Extensions section? (Section 7) #43 ...also emphasize modularity for separable physics
Put in a blurb about how to calibrate Teff and logg (Section 7.x) #44

What to do next with our blazing fast sparse implementation?

We now have blazing fast initialization and training of a spectral model, which gives us a ripe environment for innovation: we can do lots of experiments and improvements for relatively cheap. There are a few steps forward worth thinking about.

More accurate: Line wings

We covered this in #9 and #10, and it should definitely be done.

More accurate: True Voigt instead of pseudo-Voigt

Since we're so fast now we can afford to spend more on the previously too-expensive Voigt computation. We can use the Voigt-Hjerting function, which we have already previously implemented. It appears to add ~27x the computation time, but should be feasible.

Parameter studies: accuracy scaling with P_rom

We could make a plot of residual sigma versus Prominence. At what point do we hit a noise floor from the inaccuracies of our model?

Parameter studies: Clustering of erroneous line widths

A plot of Amplitude versus gamma width reveals a bunch of impossibly wide lines that are overfitting continuum imperfections

Application to data

The model accuracy is probably "good enough" to apply to data. Yay! Let's do that, since we'll learn a bunch of new things along the way. We need to implement vsini, etc.

Continuum flexibility: Polynomial fitting

We could simultaneously fit the continuum polynomial. We currently have this turned off.

Continuum flexibility: Gaussian Process

A GP seems like the right way to specify the continuum, but it has a scaling problem. We could investigate scaling with GPyTorch, or celerite-for-pytorch?

Upgrade PyTorch from 1.7 to 1.9

Long time no update! 🕸️

Last time I checked in the most recent version of PyTorch was 1.7. Since then PyTorch has released v1.9, which includes some special functions including erfc, which is useful for computing the Voigt profile from the Faddeeva Function.

Let's migrate to PyTorch 1.9 so that

The log versus linear debate, revisited for tellurics

There's a tradeoff in whether to take the log of the model flux before processing, or simply leave them as-is, in linear fluxes. I chose the latter for stellar spectra for various reasons (that I still think are valid).

However, the calculus for tellurics is different. Telluric spectra---as produced by TelFit at least---have the continuum set to exactly 1.0 everywhere, with no additional continuum opacity to deal with. Tellurics also have have deep saturated lines. These two differences tip the balance of the scales towards using log rather than linear, ideally.

The problem is that I've already coded up the stellar emulator for linear and it may be awkward to do one in linear and one in log.

What to do! I'm not sure.

Flag for blended lines

You can imagine a function that takes in a line list (or heatmap of line properties) and identifies those lines that are blended. It would then assign a binary flag that can be preserved as metadata and then used in subsequent processing steps.

Title for the paper(s)

AAS Journals Paper:

Transfer learning for échelle spectroscopy with blasé
Semi-empirical forward models for échelle spectra with automatic differentiation
Gradient-based inference for line-by-line whole-spectrum fitting
A differentiable algorithm for simultaneously fitting an entire spectrum
Automatically fitting an entire stellar spectrum
[...]

JOSS Paper:

blasé: flexible models for astronomical échelle spectroscopy built on PyTorch'
[...]

Experiment with training on M1 Mac GPU, with MetalPerformanceShaders

PyTorch just announced some support for training and inference on M1 Mac GPUs. This is awesome!

Let's try to demo this. It should hypothetically be as easy as swapping device = torch.device('mps') instead of either 'cuda' or 'cpu'.

Tutorials on the website

We have a bunch of examples in either examples/ or experiments/ or notebooks/. While these are already useful guides, we should turn them into tutorials rendered on the online documentation, with some nice explanatory material.

A built-in `.optimize()` method for the individual models.

It seems like we copy over a lot of boilerplate code for optimizing the model. Why not build that code into a method on the emulators themselves? So we would get an emulator, and then could simply do:

emulator.optimize(epochs=1000, LR=0.01)

This would simplify a lot of the notebooks and examples that have to first pretrain the clone and then apply it, so there are two stages of optimizing that share a lot of the same boilerplate.

Ideas for Discussion Section and reorganize by scientific impact, not ease of implementation

We currently order the presentation of the section 7 by "ease of implementation" (sort of, it's hard to judge what is easy sometimes...)
We should probably focus the discussion on science outcomes rather than technical hurdles.

So that would mean ordering the presentation of ideas in terms of scientific impact. For example there's not much mention about joint modeling spectroscopic binaries, or a better spelling out of multi-epoch spectroscopy, etc.

Get blasé up and running on TACC and the Morley machine

We are approaching a point where we are GPU memory-limited on our laptops. That's great, it means we are pushing the computational frontier. Let's get blasé up-and-running on our local group computer and TACC so that we can push to wider bandwidths.

Make an animation of how blasé works

We should have an animation of how it works, for talks.
Here is a demo:

voigt_demo.mp4

Weird CUDA error when initializing with `wing_cut_pixels=None`

I'm getting a weird GPU-specific error that I hadn't seen before. It only occurs when I initialize the SparseLogEmulator with wing_cut_pixels=None. It's expecting an integer, but should have an default when passed None, so it seems like a passthrough glitch. Meh! No matter, we have a workaround--- simply pass in your wingcut.

Voigt profiles pose a computational bottleneck

The naive computation of the Voigt profile involves assembling a Tensor of shape:

N_lines x N_wavelength_points x N_approximation_terms
=
1258 x 338624 x 27

A Tensor of this size is too big for RAM on my Laptop!

We have a few options:

Use a supercomputer/GPU with more RAM
Use minibatches of lines or wavelengths
Use sparse tensors

I kind of like the idea of using minibatches of wavelengths--- it means we could probably use the DataParallel technique to parallelize on multiple GPUs, and it can expand to multiple larger datasets. The question is whether to use minibatches of lines or of wavelengths. Hmm...

Referee report from Sept 14, 2022

We got our referee report back and it is really positive, with mostly minor revisions. Yay! 🥳 Here is a paraphased and editorialized version of some of the feedback from the report, plus some themes that the report may have indirectly catalyzed. I'll add check boxes for the purposes of addressing each line item:

First round:

Revise running title. It's currently "Transfer Learning with Blasé"
Transfer learning does not appear to be defined in the text of the paper. (Maybe we should add it to Section 7.1 "Surrogate modeling" sub paragraph?). The phases "physics-informed ML" and "simulation intelligence" could slot in here as well. Basically, we what to describe how an ML may describe this approach, and transfer learning is only partly accurate.
Define the use of "clone" (we appear to just use it without explanation)
(aside) Consider citing more work from the Cannon
Expand and revise Section 8.4 about previously unknown spectral lines
Why PyTorch? Compared to TensorFlow / Jax / Theano
Related: Discuss framework incompatibility and vendor lock-in.
Consider discussing the approach in terms of chemically peculiar stars (e.g. HgMn stars)

Reorganize examples, experiments, notebooks, etc. directories

We currently have examples/, experiments/, tutorials, FAQ/, and notebooks/ directories. Many of those examples pre-dated our telluric emulator and new API, so they are somewhat stale now. Let's deprecate some content, take the useful stuff, and put it somewhere conspicuous.

I'm almost inclined to put everything into notebooks as "Case studies" following the awesome lead of the exoplanet case study gallery.

Move to muler and gollum APIs as interfaces to Data and Models.

We should move towards using muler and gollum as the APIs for data and models respectively. Then blase will simply become the layer for cloning the models and transferring to data. It's not clear if we want to make those dependencies of blase or just use them in the examples. TBD!

Discussion section: refinements to the model

I am scanning through the Methodology section and I see a bunch of examples where I introduce a design choice and then describe a possible refinement to that choice. I think these asides are distracting.

A few examples:

the Pseudo-Voigt could be sampled
We could have used GP's
We could have done gradients analytically
We could have fit for the Polynomial at the same time

I think it would be better to relocate those "we could have done X..."-type statements, placing them in their own standalone Discussion subsection, titled something like "Refinements to the model" or "Assessing alternative design choices".

That way the method section is much clearer and more compact. We want the method section to appear as simple as possible.

Overhaul paper figure demos

Currently the paper figures do not illustrate the telluric cloning (the figure predated the existence of effective telluric cloning). We should pick an example spectrum that has both telluric and stellar lines in the same wavelength chunk.

Calibrating Teff, logg, and metallicity

We should mention the prospect of calibrating Teff, logg, and metalicity. There are many ways to achieve that goal.

One is to apply Machine Learning on the outputs of blase. Basically, run blase cloning on the entire grid, do some cluster finding or correlation finding on the lines, and then calibrate the models en masse based on piles of spectra. This strategy may amount to fancy equivalent widths, but 1) it's systematic! and 2) it leverages the whole dataset!

There are fancier strategies including hierarchical models for an ensemble of similar spectra. But that's much more complicated to fit into a blurb in a discussion. So conveying the concept is key.

Fit Voigt Profiles instead of Lorentzians

We currently fit Lorentzian profiles rather than the more righteous Voigt. That choice was partly due to the absence of an easy-to-use Voigt profile function in PyTorch. Ideally we would use wofz, and take the real part to get the Voigt Profile, but there is no current implementation of wofz in PyTorch. However this is a new erfc method in torch.special that can compose wofz alongside an exponential term. Composing the two terms in this way is known to be numerically unstable, I think especially around z~0. But I am hoping that we can avoid numerical artifacts by placing a finite minimum Gaussian width, which is kosher anyways since we are going to convolve all lines with a large instrumental profile anyways.

JOSS / AAS paper

I think blasé is the one framework in our trifecta (w/ muler, gollum) that may deserve the royal treatment: Both an AAS Journals paper and an affiliated (linked) JOSS paper. So I have gone ahead and made a JOSS short paper with a long statement of need.

The AAS Journals paper will be the pacing item. Maybe aim for a mid-March timescale for that one? That would put:

muler: February
blase 1 and 2: March
gollum: April

Idea for paper figure: histogram of stellar vs telluric lines, chunked by order

Figure 1 already shows how many stellar lines materialize as a function of effective temperature, over the entire bandwidth. Those lines are clustered in wavelength, and so it could be helpful to show how they cluster both for stellar and telluric lines. This figure would be useful for planning purposes, and in the Discussion when we explain that some "telluric-free" regions could forgo the joint telluric modeling step. It would also be helpful for showing that blasé can handle a large range of spectral line density, and overlap, owing to its sparse implementation, and native-resolution cloning.

This figure should be considered optional though-- it's not fundamental to the undersanding of the approach, just a supporting piece of information.

Paper edits for the proofs stage

We should add some citations that came to mind:

HITRAN, EXOMOL, etc. -- Mentioned, but not cited
Jo Ciucă has a Neural Net paper for astro that could be included when mentioning the heritage of PyTorch
We didn't mention PICASO as a relevant open source spectral modeling toolkit

Run the grammar checker on the Appendix (we only did the main body, due to a word limit on grammarly)

Add IGRINS as a facility? (although these are already-published spectra, not new, pending info from G Mace).

Possibly add a link to the pathfinder fiatlux project for physics-based telluric forward modeling in JAX?

Radial Velocity needs to be able to "sense" backpropagation-- implement interpolation from native models instead of resampling

Currently we shift the wavelength grid with the scalar radial velocity (a nn.Parameter that requires grad), and then simply use the data's wavelength grid to define the boundaries for conducting a groupby sum operation. This sequence of steps is not "backprop"-able---- how do you take the derivative of boundaries. In other words, for this operation to work as intended, backprop would have to somehow decide which model indices would have moved into adjacent bins given different choices for RV. That's not something that a derivative alone can answer, and so the message-passing aspect of backprop will fail. Instead, we should follow wobble, and use an interpolation scheme. The problem is that wobble implemented their own interpolation, because apparently TensorFlow didn't have one at the time. In any case, pytorch does not support irregularly sampled interpolation, although a feature request for it has existed for a few years: pytorch/pytorch#1552

There are a few ways to get around this:

1. Use a Gaussian Process mean posterior model.

Train a GP on the synthetic spectrum at native or near-native resolution (it doesn't even have to be right, just close, in fact some smoothing is desired). Then compute the mean model, evaluated at the RV-shifted data grid points. This approach should work, since everything is matrix multiplies and solves, and it should be fine to back-prop through all that like a normal GP. The demerits are: 1) it'll be "slow" since you have to make a GP on ~20,000 points (that's OK on GPU actually!), we'd probably have to code it up, exactly, ourselves, since at first sight, it looks wants to separate train and eval steps for the mean posterior evaluation. So while it would work, and might even be preferable, there some roadblocks that make this a backup or rainy day option. By the way, this idea is similar in spirit to the psoap framework.

2. Find a way to do interpolation with PyTorch

This is probably the easiest way forward. There's at least one third party tool that seems to work:
https://github.com/aliutkus/torchinterp1d

Let's pursue option 2 for now!

Experiment 3: Envelope of injection/recovery of line amplitude for fixed SNR, fixed vsini, fixed RV.

Experiment 3

Envelope of injection/recovery of line amplitude for fixed SNR, fixed vsini, fixed RV.

We will perturb some fraction, f_perturb ~ 15% of spectral lines with some known injection from its initialization. We will generate a noisefre spectrum from the perturbed value, and save it. Then we will draw a noise vector with a characteristic SNR of 100:1 per pixel, at the HPF sampling, and add it to the noise free spectrum to create a "synthetic spectrum". We will train the emulator and save the recovered parameters to plot them against the injected params. In a noise-free scenario it should be one-to-one, a straight line spanning say 4 orders of magnitude of amplitude, but the injection of noise into the synthetic spectrum will produce some characteristic turn-over amplitude where the recovered amplitudes go from matching injected to being dominated by overfitting the noise.

We will repeat this process with maybe 100 different instances of new noise draw vectors, sending different weak lines into overfitting noise spikes. The result will be a heatmap of the turnover amplitude, with contours for, say, 1/2/3 sigma.

How low can you go: Will blase work for M dwarfs?

I cloned an M-dwarf spectrum across the entire HPF bandwidth, yielding 23112 lines. I started with a 2800 K, logg=5 PHOENIX spectrum at native resolution. Here is a snippet of what that looked like:

The cloning process captures lots but not all of the residuals. I think we would need Gaussian processes to define the pseudo-continuum to get much better performance, since some lines chaotically devolve into pseudo-continuum right now.

Here is a snapshot of what real data looks like for an M-dwarf with comparable photospheric properties:

Key result: the imperfections in cloning are much smaller than the data-model mismatch. That's about what I expected, but it's useful to see it quantified. This result means tuning the blase model to the data will yield a much better fit, but the interpretability gets tricky: each line may not map onto its adjusted value correctly. In other words, I think the M-dwarf lines may be missing or out-of-order, not simply quasistatically off in amplitude and width.

In any case it'll be easier to demo the method with warmer M-dwarfs or K stars, that do not have this mapping confusion problem.

Update the readme with a badge for the paper!

It's been a whole year! Duh!

Applying blasé to spectra from new instruments, e.g. VLT ESPRESSO

In the paper, we demonstrate blasé on IGRINS and HPF spectra, using muler to do the pre-processing. There has been some interest in applying blasé to spectra from other instruments, which should be straightforward, since muler is not a prerequisite to using blasé.

I propose we make a tutorial that shows an end-to-end worked example, which would have these pre-processing steps:

Read in a .csv file of wavelength, flux, error
Pack that into a Specutils Spectrum1D object
Retrieve a PHOENIX template for that target (with or without gollum, probably without for simplicity)
Access the .bin_edges attribute of the Spectrum1D object, for the resampling step
Proceed identically as in existing tutorials

Software Architecture strategy for loading pretrained models without repeating the spectral line-finding in the emulator initialization

We currently have to hand-in the wl_native and flux_native attributes in order to load a pretrained model. This requires us to exactly mimic the preprocessing steps that led to the compressed model in the first place, in particular it has to have exactly the same number of spectral lines identified.

Instead it would be convenient to simply initiate and load a pretrained model in one-fell-swoop. In order for that to happen, we would need to make wl_native and flux_native attributes would have to be nn.Parameters (with requires_grad=False).

We could then do something like:

state_dict = torch.load('file.pt') # file.pt is a pretrained PyTorch model state dictionary
emulator = SparsePhoenixEmulator.from_pretrained(state_dict)

So we've reduced all the delicate preamble of loading raw PHOENIX models. An alternative would be to make wl_native and flux_native as optional, so long as a n_lines parameter is passed in. They would get init'ed with zeros and loaded from the pre-trained model. That's the more PyTorch-centric way to do things, I think, and the more I think of it I like it better than a static_method strategy.

Migrate towards Log Emulator (rather than Linear) for all stellar spectra

I just made a demo showing that the stellar spectra can be modeled just as well (in fact probably better) with a Log emulator than a Linear Emulator. The choice of Linear is mostly historical at this point. Let's not deprecate the Linear one, just favor Log in practice.

Advanced lineshapes for broad lines

We will need to implement advanced lineshapes---or at least more flexible lineshapes---for Issue #9. The lines are so broad that the pseudo-Voigt and even the exact Voigt profiles are no longer accurate enough. We'll have to experiment with some more flexible functional forms.

Paper outline

Outline

Transfer learning for echelle spectroscopy

Introduction
- Spectral fitting past and present
- Ideal goal is both physical self-consistency and high predictive accuracy
  - We want noise-free templates
  - We want accurate posterior inferences (ability to do retrievals)
  - We want precision and accuracy over a wide bandwidth and high dynamic range
  - Ideally we also want accuracy in telluric and instrument models
- Some problems prevent this goal
  - High latent dimensionality: Many effects matter at the precision level of the data
  - Degeneracy among parameters
  - Underlying atomic and molecular data may be wrong or approximate or missing
  - Underlying radiative transfer may be inaccurate or approximate (T-P profile / T-Tau)
  - Time is a dimension: Star is changing
  - Stellar surface inhomogeneities become important: Doppler Imaging and limb darkening
  - We do not know how to accurately parameterize some dimensions
  - High dynamic range is both a blessing and a curse:
    - Feeding back locations of model imperfections becomes unwieldy for many lines
    - Many lines essentially becomes a book-keeping and continuum assignment problem
    - Computational problem
    - Some lines have large line wings, blurring continuum and line
    - Line blanketing has almost no continuum
- So in practice, tradeoffs between model flexibility and physical self-consistency
- Semi-empirical models may be a middle ground: informed from models, but revised with data
  - The radiative transfer step is computationally expensive, so we want to avoid redoing that
  - Instead focus on pre-computed models that have that wisdom baked-in
  - We are essentially compressing pre-computed models into evaluable, interpretable models
- Transfer learning is one plausible approach: pre-train on pre-computed models, transfer to data
- Neural network frameworks make this possible with autodiff
- We use the autodiff and GPU acceleration, but not the neural architectures
- We do not use the neural architectures because we have good understand for how a spectrum is generated
- Instead of tuning weights of a neural network, we tune the atomic and molecular properties of a spectrum
Data and reduction
- Example data from HPF
  - A0V
  - Narrow-lined G-star? M Dwarf?
Methodology: Blase
- Currently limit the bandwidth to a single high-bandwidth echelle spectrograph (plus buffer at edges)
  - This choice is principally computational at the moment, no fundamental limit
- We fetch a high resolution pre-computed spectrum from PHOENIX
- We use the exact native wavelength sampling, yielding N pixels
- Tradeoff of working in log or linear flux
  - First normalize by a scalar since we do not intend to use absolute flux calibration
  - Log prevents negative fluxes
  - Trickier to do Pseudo-Voigt approximation as sum in linear space
- Initialize the model with preprocessing steps:
  - Low-pass filter to remove narrow spectral lines
  - (Optional) Place known Hydrogen, Sodium lines with large wings and non-standard line shapes
  - Define a prominence threshold and find locations of peaks above that threshold
  - Derive coarse properties about those peaks: widths and amplitudes
- We now have a coarse model that needs to be fine-tuned with autodiff
- How to describe the continuum:
  - Black body / Rayleigh Jeans plus perturbation
  - High order polynomial: Chebyshev? Somewhere between 3rd and 20th order?
  - Use cross validation?
  - GPs are also possible, but high Number of pixels--> need scalable GP
  - Celerite not implemented for PyTorch?
  - GPyTorch possible, but has certain restrictions on kernel and sampling
- Faced with a question: what lineshape to use
  - To First order the lines are Gaussian or Lorentzian
  - To Second order the lines are Voigt Profiles
  - In detail the lines are the result of radiative transfer that can smear the Voigt profile
  - In practice the high-res lines get convolved with instrumental profiles, so details don't matter too much
  - Computational tradeoff of Lorentzian/Gaussian versus Pseudo-Voigt versus Voigt
  - Additional Implementation challenge: Fadeeva function not implemented in PyTorch
  - We choose a pseudo-Voigt as a balance between computation and adequate accuracy
- Total number of parameters:
  - For the lines = N-lines x 4 (lambda_c, fwhm_L, sigma_G, Amplitude)
  - For the "continuum" = Nth order polynomial = N+1 parameters, or Gaussian Process = 2 hyperparameters
- GPU- and autodiff- specific considerations:
  - Each Tunable parameter gets grad=True
  - We assemble the product of all lines and all continuum sources, this is the computational bottleneck
  - This becomes a matrix of size N_lines x N_wavelength points
  - We take the element-wise product over the lines axis, contracting to 1 x N_wavelength points
  - For N_lines = X and N_wavelength_points = Y, we expect XYk primative operations.
  - For N_lines = X and N_wavelength_points = Y, we expect XY32 bits of memory => XX GB
  - NVIDIA GPUs can handle 8 - 40 GB of memory depending on the model
- In the parlance of Machine Learning, we have a big model and small data
- Really our model is too big: most of the entries are zero, so we should use a sparse implementation
- If the model is too big to fit in memory, we have to train in mini-batches
- We hand-in some fraction f ~ 1/20th of the entire wavelength range as a mini-batch
  - The model is only evaluated at that ~5% of pixels, N_lines x (N_wl *0.05)
  - Technically all lines are updated
  - In practice, only lines that are sufficiently informed by those pixels will get large updates
  - Causes some stochasticity--- Stochastic Gradient Descent "SGD"
  - Analogous but different from the notion of minibatch in ML: not a distinct axis, like N_images
  - We use random minibatches
- Run the training:
  - Set the optimizer: Adam with a LR=0.004
  - Set the loss function: Chi-squared with no per-pixel uncertainty
  - Set the "mini-batch" fraction f~1-5%
  - Choice of epochs or steps N_steps = 300-10,000
  - The training process sees 20 minibatches per epoch, sees N_pixels, with some repeats some misses
  - Monitor with Tensorboard: see the improvement
  - The training takes X minutes on an RTX2070 GPU with PyTorch v1.11.x, CUDA v11.x, and Intel X CPU
  - We save the model parameters, refer to them collectively as the pre-trained model
Cloning performance
- We compute the residual of native PHOENIX minus cloned model
- We see X percent residuals at native resolution
- We smooth the residual to HPF instrumental resolution and see Y percent residuals
- The main residuals come from missing line opacity due to our prominence threshold:
  - Including smaller prominence lines means smaller residuals, but more lines and higher computational cost
- We see some lines that devolve into missing continuum opacity
  - Especially acute near broad line wings if you don't handle the non-standard line shapes
- How do the lineshapes cluster in sigma_G vs fwhm_L?
- How do the lineshapes cluster in Amplitude vs FWHM? Can some lines be rejected?
- Does the pre-trained model round-trip effectively (model saved at end of training = model from loaded params)?
Application to data
- Augment the pre-trained model with extrinsic parameters:
  - vsini and RV
- How to re-sample into the data space?
- Option 1: Convolve the Pseudo-Voigt with a Instrumenal Gaussian
  - The convolution of a Gaussian with a Pseudo-Voigt is simply a new Pseudo-Voigt!
  - Simply evaluate on a new coarse wavelength grid! Saves computation!
- Option 2: Numerically convolve output on a fine grid and sum pixels
  - Computationally wasteful
  - Requires generating full bandwidth spectrum at native resolution
- Both options should give the same answer (right?)
  - Need to double-check commutivity of convolution and element-wise product...
- Ideally you need wavelength-dependent resolution, not just a single stock resolving power
- Need accurate per-pixel uncertainties
- We have to apply some amount of regularization to the pre-trained model
  - Regularization tunes how much to trust the data versus the model in the semi-empirical model
  - No Regularization means likely to overfit noise in the data
  - Too much regularization resembles a static (fixed) PHOENIX model
Data transfer learning performance
- What is the residual level with bare PHOENIX (i.e. tuning continuum only, not lines)?
- What is the residual level with no regularization (i.e. tune all lines and continuum)?
  - Should be near-zero except for missing lines
  - What should we do about missing lines?
- What is the residual level with modest regularization? What is the typical change to the line?
- Plot of cloned FWHM versus FWHM transferred
- The pseudo-voigt line properties can be analytically integrated to give an equivalent width
- Plot of cloned EW (before) versus transfer EW (after)
Discussion
- Questioning the assumptions of the method
  - Is PseudoVoigt Adequate?
  - What performance level is adequate? (depends on your application)
  - What about PRV applications
  - What about the choice of minibatch size?
  - How would a sparse tensor change the computational performance?
  - Are we using the GPU effectively? Are we memory bandwidth limited?
  - Should we update the line center positions?
- Limitations
  - What to do when continuum is absent even in at the native resolution of precomputed synthetic model?
    - Relevant to brown dwarfs and possibly M dwarfs molecular bands
  - Bandwidth limitations
  - Interplay with tellurics

Special handling for lines with broad wings-- either dense sampling or larger sparse arrays

Some lines, such as Hydrogen, sodium, potassium, neutral metal lines, etc. have extremely broad line wings, approaching larger than the 1000-6000 pixels we allocate for the sparse implementation. These need to be handled separately from the weak lines, both from a computational performance perspective and an accuracy perspective.

Computational performance

We may actually gain performance by isolating these strong lines as dense (i.e. evaluated across the entire bandwidth), or semi-dense (i.e. evaluated across a large fraction of the bandwidth, say 50,000 pixels or 20% of the bandwidth), since we reduce the sparse evaluation window on all the other spectral lines from 6000 to 1000 pixels, or even less, perhaps as low as 100 pixels for most lines.

Accuracy

Extremely broad lines will exhibit truncation effects if the sparse window is small compared to the linewing size. The truncation effects will look like tophat functions severing the asymptotic wings. We can afford to increase the sparse window on a few, say ~20 spectral lines. We would then evaluate two matrices at each forward/backward pass of the emulator: ~330,000 x 20 for the dense broad-wing lines, and ~2000 x 8000 for the sparse narrow lines. The number of FLOPS in each category scales as ~6 Million versus ~16 Million, depending on the exact choices for wingcuts.

Experiment 4: Full bandwidth HPF training, most params

Let's go for gold: Full bandwidth HPF training with most parameters turned on. I am inclined to freeze the lam_centers for now, but allow RV to vary.

We will have to mask tellurics for now because those will complicate things. The current design of wavelength sampling by fence posts (and not spaces) will produce an awkward scenario, where the big blocks of masked regions may get undue attention. Let's think about applying a "good pixel" mask to both the model and data?

Deprecate, refactor, or scrap old source code in favor of the new sparse implementation.

The new sparse implementation is awesome! There's a bunch of heritage code that can probably be scrapped as we move towards supporting just the sparse implementation. In particular the multiorder module is largely deprecated but has a few useful parts: Chebyshev coefficients and some hard-coded attributes, such as speed of light, n_pix, etc.

We should completely remove read_native_PHOENIX_model now that we've migrated to gollum-based inputs.

Telluric Emulator from PhoenixEmulator

A historical preamble: this code started as a telluric modeling code and has morphed into a stellar modeling code. But the infrastructure exists to do line-by-line radiative transfer of the Earth's atmosphere. It has felt like a distant dream to unite these two arms of the code: joint modeling of stellar and telluric lines.

One augmentation might make this unification even easier than we had thought. If you inspect the PhoenixEmulator code, there is now zero code that makes it truly specialized to PHOENIX models (we refactored that stuff out with gollum). So PhoenixEmulator should simply be rebranded as a more general "Precomputed Synthetic Spectrum Emulator" or something like that. To take that idea to the extreme, we could hypothetically just hand-in a TelFit-generated telluric spectrum and voilà, obtain a cloned telluric model. We could then finally marry the stellar and telluric models by injecting in a multiplication step after the extrinsic model but before the instrumental layer.

So in summary, all the tech for joint stellar and telluric modeling is in place.

There are a few implementation considerations, not necessarily "bad" or "good", just aspects to keep in mind:

The pixel sampling of Telluric models is finer than the Phoenix models, therefore
The PHOENIX models would have to be evaluated on this finer grid, driving up computation time, but
The telluric lines are usually much narrower than stellar lines, so therefore,
We can get away with much shorter sparse matrix evaluation sequences, going from 6000 to possibly 60 pixels. But,
The Stellar models might have to remain at ~6000 with finer sampling.
The instrumental convolution step would have to spot-check that the convolution is operating on the correct pixel-scale-to-wavelength-delta conversion (we've been a bit cavalier about this).
We'd still have to regularize the precomputed telluric model and so ideally we would start with a, say, TelFit model that is as close as possible to the observed weather conditions at the time of observation. This lookup step is laborious and we would probably skip it for now, but make it easier later.

I think we should go for it.