
crabnet's Introduction

Hi, I'm Anthony 👋

A passionate Data Scientist from Germany

  • 🔭 I'm currently working on my PhD at the Technische Universität Berlin

    • Topic keywords: deep learning, machine learning, data science and visualization, materials science
  • 🛠 I use the following Python tools:

    • AI/ML: PyTorch, TensorFlow, scikit-learn
    • Data wrangling: Pandas, NumPy, SQLite
    • Data visualization: Matplotlib, Seaborn, Plotly, Datashader
  • ☁️ I also have experience with cloud computing and HPC:

    • Colab / Jupyter / Kaggle notebooks
    • Grid.ai
    • Model training and deployment on SageMaker
    • High-performance computing at the TU Berlin
  • 🌱 I'm currently learning reinforcement learning and code golfing

  • 📄 More about me & contact info: https://anthonywang.de/

  • 💬 Ask me about our latest work, CrabNet, which makes accurate and inspectable predictions of materials properties based on the Transformer architecture!

crabnet's People

Contributors

anthony-wang · mahamadsalah74 · sgbaird


crabnet's Issues

about the predictor

I'm a beginner. May I ask whether your code can predict the yield strength of a material?

Reason to use pe_scaler and ple_scaler

Could you explain why you use the pe_scaler and ple_scaler in the forward pass of the Encoder class in kingcrab.py?
In particular, why did you choose the forms
pe_scaler = 2**(1-self.pos_scaler)**2
and
ple_scaler = 2**(1-self.pos_scaler_log)**2?
I don't really understand why one needs these two scalers (and also self.emb_scaler) in the first place, or why you chose the above exponential forms for them.
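
For context, here is a minimal sketch of how such learnable scalers could enter the forward pass. The parameter names come from this issue, but the surrounding module and the way the scaled terms are combined are assumptions, not the actual kingcrab.py code:

import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    # Hypothetical reconstruction for illustration only.
    def __init__(self):
        super().__init__()
        # Learnable scalars; at s = 1 each scaler is 2**((1-1)**2) = 1.
        self.emb_scaler = nn.Parameter(torch.tensor(1.0))
        self.pos_scaler = nn.Parameter(torch.tensor(1.0))
        self.pos_scaler_log = nn.Parameter(torch.tensor(1.0))

    def forward(self, x_emb, pe, ple):
        # ** is right-associative, so 2**(1-s)**2 parses as 2**((1-s)**2):
        # a smooth, always-positive gate that equals 1 when s == 1 and
        # grows as s moves away from 1 in either direction.
        pe_scaler = 2 ** (1 - self.pos_scaler) ** 2
        ple_scaler = 2 ** (1 - self.pos_scaler_log) ** 2
        # Assumed combination: scaled element embedding plus scaled
        # linear and log-scaled fractional encodings.
        return self.emb_scaler * x_emb + pe_scaler * pe + ple_scaler * ple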

fit() and predict() methods

Not necessarily critical, but it struck me that it might be good to have mdl.fit() and mdl.predict() methods wrapping what happens in train_crabnet.py, both to make it easier to "train once" and then reuse the model several times, and for consistency with similar packages. Then I realized there is a model.fit() call inside train_crabnet.py on model = Model(CrabNet(...), ...). If I want to "train once" and then reuse the model, how would you suggest going about that?
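
For reference, a rough sketch of the train-once / predict-many pattern I have in mind. The class names come from train_crabnet.py, but the import paths, constructor arguments, and helper calls below are guesses, not the real API:

# Hypothetical usage sketch -- names mirror train_crabnet.py,
# but the signatures here are assumptions.
from crabnet.kingcrab import CrabNet  # assumed import path
from crabnet.model import Model       # assumed import path

model = Model(CrabNet())              # constructor arguments elided

model.load_data("train.csv", train=True)  # assumed data-loading helper
model.fit()                               # train once...

for csv in ["batch1.csv", "batch2.csv"]:  # ...then reuse many times
    model.load_data(csv)
    preds = model.predict(model.data_loader)  # assumed signature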

saved models are not displayed well.

use_crabnet.py has a function "list_saved_models".

This function lists models, but the listed names aren't usable as-is when you copy them. We should clean this up so you can double-click a model name and copy it directly into the predict_crabnet() function.
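
One possible shape for the cleanup, sketched under the assumption that saved models live as .pth files in a single directory (the path below is hypothetical):

# Hypothetical sketch of a copy-paste-friendly listing.
from pathlib import Path

def list_saved_models(model_dir="models/trained_models"):
    # Print bare model names, one per line, with no decoration, so a
    # name can be double-clicked and pasted into predict_crabnet().
    names = sorted(p.stem for p in Path(model_dir).glob("*.pth"))
    for name in names:
        print(name)
    return names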

Parameters used for the results in the published work

Please correct me if I am wrong, but right now, when I run the sample dataset in example_materials_property, it runs for 40 epochs with three checks for early stopping. Was the same setting used to produce the results in the paper, or was it different? I could not find the exact details of the run parameters. Thank you!

CrabNet matbench data, possible mismatch between submission notebook and results

@MahamadSalah74 I ran the matbench notebook posted on the matbench GitHub page (materialsproject/matbench#23), and got somewhat higher MAEs for a few repeat runs (see below) compared to what was reported.

0.3683
0.3661
0.3658

The reported matbench result (0.3463) seems more in line with matbench_crabnet.py, which uses 300 epochs and the full train/val dataset for training. It's not a huge difference, but I'm trying to figure out where the discrepancy comes from, if there is one.

Maybe I'm missing something basic. Could you comment on this?

AttributeError: 'SWA' object has no attribute '_optimizer_step_pre_hooks'. Did you mean: '_optimizer_step_code'?

The PyTorch Optimizer class has changed in recent releases, which leads to the following error:

[...]
stepping every 16 training passes, cycling lr every 1 epochs
checkin at 2 epochs to match lr scheduler
Traceback (most recent call last):
  File "/home/pbenner/Source/pycoordinationnet-results/model_comparison/crabnet/eval.py", line 141, in <module>
    run_cv(X, y, f'eval-{task}-{target}.txt', n_splits)
  File "/home/pbenner/Source/pycoordinationnet-results/model_comparison/crabnet/eval.py", line 95, in run_cv
    model = train_model()
  File "/home/pbenner/Source/pycoordinationnet-results/model_comparison/crabnet/eval.py", line 62, in train_model
    model.fit(epochs=1000, losscurve=False)
  File "/home/pbenner/Source/pycoordinationnet-results/model_comparison/crabnet/model.py", line 228, in fit
    self.train()
  File "/home/pbenner/Source/pycoordinationnet-results/model_comparison/crabnet/model.py", line 140, in train
    self.optimizer.step()
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 69, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/pbenner/.local/opt/anaconda3/envs/crysfeat/lib/python3.10/site-packages/torch/optim/optimizer.py", line 271, in wrapper
    for pre_hook in chain(_global_optimizer_pre_hooks.values(), self._optimizer_step_pre_hooks.values()):
AttributeError: 'SWA' object has no attribute '_optimizer_step_pre_hooks'. Did you mean: '_optimizer_step_code'?

The following patch fixed the issue; it pre-populates the hook registries that the newer Optimizer.step() wrapper expects to find on the wrapped optimizer:

diff --git a/utils/optim.py b/utils/optim.py
index 33008dd..18224ea 100644
--- a/utils/optim.py
+++ b/utils/optim.py
@@ -1,6 +1,7 @@
-from collections import defaultdict
+from collections import defaultdict, OrderedDict
 from itertools import chain
 from torch.optim import Optimizer
+from typing import Callable, Dict
 import torch
 import warnings
 import numpy as np
@@ -116,6 +117,8 @@ class SWA(Optimizer):
         self.optimizer = optimizer
 
         self.defaults = self.optimizer.defaults
+        self._optimizer_step_pre_hooks: Dict[int, Callable] = OrderedDict()
+        self._optimizer_step_post_hooks: Dict[int, Callable] = OrderedDict()
         self.param_groups = self.optimizer.param_groups
         self.state = defaultdict(dict)
         self.opt_state = self.optimizer.state

attention-heads as samples from posterior distribution in a Bayesian sense

https://aclanthology.org/2020.emnlp-main.17.pdf

Though I think CrabNet might need to be refitted for new samples (i.e., if you specify N=10, then you only get 10 samples from the posterior; getting more would probably require refitting, and I'm not sure those would be directly comparable to the 10 from the first run). I'm also not exactly sure how this could be converted to individual predictions. Maybe just some basic plumbing in and after:

CrabNet/crabnet/kingcrab.py

Lines 151 to 157 in 9e0d79c

if self.attention:
    encoder_layer = nn.TransformerEncoderLayer(self.d_model,
                                               nhead=self.heads,
                                               dim_feedforward=2048,
                                               dropout=0.1)
    self.transformer_encoder = nn.TransformerEncoder(encoder_layer,
                                                     num_layers=self.N)
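
As a starting point for that plumbing, here is a sketch of pulling per-head attention maps out of a MultiheadAttention call directly (nn.TransformerEncoder discards the weights by default, and average_attn_weights needs a reasonably recent PyTorch; the dimensions below are placeholders):

import torch
import torch.nn as nn

# Hypothetical sketch: query a self-attention layer directly so each of
# the `heads` attention maps can be treated as one posterior "sample".
d_model, heads, seq_len, batch = 512, 4, 9, 2
attn = nn.MultiheadAttention(d_model, num_heads=heads, batch_first=True)
x = torch.randn(batch, seq_len, d_model)

_, attn_weights = attn(x, x, x, need_weights=True,
                       average_attn_weights=False)
# attn_weights: (batch, heads, seq_len, seq_len) -- one map per head.
print(attn_weights.shape)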

pip or conda install

How tough do you think it would be to publish CrabNet on one of the package managers?
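
For what it's worth, the Python side of a PyPI release could be as small as something like this (all metadata below is illustrative, not CrabNet's actual packaging):

# setup.py -- hypothetical packaging sketch, not CrabNet's real metadata.
from setuptools import setup, find_packages

setup(
    name="crabnet",          # placeholder; the name must be free on PyPI
    version="0.1.0",         # placeholder version
    packages=find_packages(),
    install_requires=[       # guessed from the README's tool list
        "torch",
        "numpy",
        "pandas",
        "scikit-learn",
    ],
    python_requires=">=3.7",
)

Once a PyPI package exists, a conda-forge recipe can typically be generated from it (e.g., with grayskull), so the pip side is the main lift.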

CrabNet matbench results - possibly neglecting 25% of the training data it could have used

@anthony-wang,

In the CrabNet matbench notebook, it does train/val/test splits. However, if #15 (comment) is correct and the validation data (i.e., val.csv) doesn't contribute to hyperparameter tuning, then that 25% of the training data is essentially getting thrown away, correct?

In other words, the CrabNet results are based on only 75% of the training data compared to what the other matbench models use for training. From what I understand, the train/val/test split in the context of matbench only really makes sense if you're doing hyperparameter optimization in a nested CV scheme, as follows:

[Figure: nested CV schematic. Source: https://hackingmaterials.lbl.gov/automatminer/advanced.html]

To correct this, I think all that needs to be done is change:

# split_train_val splits the training data into two sets: training and validation
def split_train_val(df):
    df = df.sample(frac=1.0, random_state=7)
    val_df = df.sample(frac=0.25, random_state=7)
    train_df = df.drop(val_df.index)

    return train_df, val_df

to

# split_train_val splits the training data into two sets: training and validation
def split_train_val(df):
    train_df = df.sample(frac=1.0, random_state=7)
    val_df = df.sample(frac=0.25, random_state=7)

    return train_df, val_df

This introduces data bleed between train_df and val_df, but val_df ends up being essentially just a dummy dataset so that CrabNet doesn't error out when a val.csv isn't available.

Sterling

Multiclass classification

Is there an easy way to modify CrabNet for a multi-class classification problem? I can see there is a built-in function for binary classification. Is there any way other than introducing my own custom function (say, one that uses CrossEntropyLoss) for the task?
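
In case it helps frame the question, here is a minimal sketch of what such a custom criterion could look like, keeping the same (output, log_std, target) signature as the repo's binary criterion; everything else here is an assumption:

import torch.nn as nn

# Hypothetical sketch: a drop-in multiclass criterion.
# output: (batch, n_classes) raw logits; target: (batch,) class indices.
def CrossEntropyLossCriterion(output, log_std, target):
    # log_std is accepted only for interface compatibility and is
    # ignored, mirroring the existing binary classification criterion.
    loss = nn.functional.cross_entropy(output, target.long())
    return loss

The model's output head would presumably also need to emit n_classes logits rather than a single value.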

the classification criterion doesn't factor in the uncertainty - does this mean the uncertainty should be ignored for classification?

i.e. log_std is an unused parameter in the classification criterion:

CrabNet/utils/utils.py

Lines 263 to 265 in a5be06f

def BCEWithLogitsLoss(output, log_std, target):
    loss = nn.functional.binary_cross_entropy_with_logits(output, target)
    return loss

If this is the case, should the uncertainty output from CrabNet be ignored by the user during classification? In other words, are the uncertainty values essentially just a bunch of random numbers for classification?
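
For contrast, this is roughly what a criterion that did use log_std could look like, following the loss-attenuation idea of Kendall & Gal (2017). This is a sketch of that general technique, not anything that exists in CrabNet:

import torch
import torch.nn as nn

def AttenuatedBCEWithLogitsLoss(output, log_std, target, n_samples=10):
    # Treat `output` as the mean of a Gaussian over logits and
    # exp(log_std) as its standard deviation, then average the BCE over
    # Monte Carlo samples so confident-but-wrong logits can be damped.
    # target is expected as float 0/1 labels, as with the BCE above.
    std = torch.exp(log_std)
    noise = torch.randn(n_samples, *output.shape, device=output.device)
    noisy_logits = output.unsqueeze(0) + std.unsqueeze(0) * noise
    losses = nn.functional.binary_cross_entropy_with_logits(
        noisy_logits,
        target.unsqueeze(0).expand_as(noisy_logits),
        reduction="none",
    )
    return losses.mean()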
