suncat-center / catlearn

A machine learning environment for atomic-scale modeling in surface science and catalysis.

Home Page: http://catlearn.readthedocs.io/

License: GNU General Public License v3.0

Python 99.80% Shell 0.02% Dockerfile 0.18%
machine-learning materials-science computational-chemistry catalyst materials-informatics python catalysis atomistic-machine-learning nanotechnology

catlearn's Introduction

CatLearn

An environment for atomistic machine learning in Python for applications in catalysis.

Utilities for building and testing atomic machine learning models. Gaussian process (GP) regression routines are implemented. These will take any numpy array of training and test feature matrices along with a vector of target values.

In general, any data prepared in this fashion can be fed to the GP routines. A number of additional functions have been added that interface with ASE. This integration allows for the manipulation of atoms objects through GP predictions, as well as dynamic generation of descriptors through use of the many ASE functions.

CatLearn also includes the MLNEB algorithm for efficient transition state search, and the MLMIN algorithm for efficient atomic structure optimization.

Please see the tutorials for a detailed overview of what the code can do and the conventions used in setting up the predictive models. For an overview of all the functionality available, please read the documentation.

Table of contents

Installation

(Back to top)

The easiest way to install the code is with:

$ pip install catlearn

This will automatically install the code as well as the dependencies.

Installation without dependencies

(Back to top)

If you want to install catlearn without dependencies, you can do:

$ pip install catlearn --no-deps

MLMIN and MLNEB need nothing apart from ASE 3.17.0 or newer to run, but other parts of the code need the dependencies listed in requirements.txt.

Developer installation

$ git clone https://github.com/SUNCAT-Center/CatLearn.git

Then add <install_dir>/ to your $PYTHONPATH environment variable.

You can install the dependencies with:

$ pip install -r requirements.txt

Docker

To use the docker image, it is necessary to have docker installed and running. After cloning the project, build and run the image as follows:

$ docker build -t catlearn .

The image can then be used in two ways. It can be run as a bash environment in which CatLearn can be used with all dependencies in place.

$ docker run -it catlearn bash

Or python can be run from the docker image.

$ docker run -it catlearn python2 [file.py]
$ docker run -it catlearn python3 [file.py]

Use Ctrl + d to exit the docker image when done.

Optional Dependencies

The tutorial scripts will generally output graphical representations of the results. For these scripts, it is advisable to have at least matplotlib (and preferably also seaborn) installed:

$ pip install matplotlib seaborn

Tutorials

(Back to top)

Helpful examples and test scripts are present in tutorials.

Usage

(Back to top)

Set up CatLearn's Gaussian Process model and make some predictions using the following lines of code:

import numpy as np
from catlearn.regression import GaussianProcess

# Define some input data.
train_features = np.arange(200).reshape(50, 4)
target = np.random.random_sample((50,))
test_features = np.arange(100).reshape(25, 4)

# Setup the kernel.
kernel = [{'type': 'gaussian', 'width': 0.5}]

# Train the GP model.
gp = GaussianProcess(kernel_list=kernel, regularization=1e-3,
                     train_fp=train_features, train_target=target,
                     optimize_hyperparameters=True)

# Get the predictions.
prediction = gp.predict(test_fp=test_features)
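
The returned object is a dictionary. A hedged example of inspecting it follows; the uncertainty keyword and the 'prediction'/'uncertainty' keys are assumptions based on the tutorials, not guaranteed by this snippet:

# Optionally request predictive uncertainties as well (keyword and dictionary
# keys assumed; see the tutorials/documentation for the authoritative interface).
prediction = gp.predict(test_fp=test_features, uncertainty=True)

print(prediction['prediction'][:5])   # predicted targets for the first 5 test points
print(prediction['uncertainty'][:5])  # corresponding predictive uncertainties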

Functionality

(Back to top)

There is much functionality in CatLearn to assist in handling atom data and building optimal models. This includes:

  • API to other codes:
  • Fingerprint generators:
    • Bulk systems
    • Support/slab systems
    • Discrete systems
  • Preprocessing routines:
    • Data cleaning
    • Feature elimination
    • Feature engineering
    • Feature extraction
    • Feature scaling
  • Regression methods:
    • Regularized ridge regression
    • Gaussian processes regression
  • Cross-validation:
    • K-fold cv
    • Ensemble k-fold cv
  • Machine Learning Algorithms
    • Machine Learning Nudged Elastic Band (ML-NEB) algorithm.
  • General utilities:
    • K-means clustering
    • Neighborlist generators
    • Penalty functions
    • SQLite db storage

How to cite CatLearn

(Back to top)

If you find CatLearn useful in your research, please cite

1) M. H. Hansen, J. A. Garrido Torres, P. C. Jennings, 
   Z. Wang, J. R. Boes, O. G. Mamun and T. Bligaard.
   An Atomistic Machine Learning Package for Surface Science and Catalysis.
   https://arxiv.org/abs/1904.00904

If you use CatLearn's ML-NEB module, please cite:

2) J. A. Garrido Torres, M. H. Hansen, P. C. Jennings,
   J. R. Boes and T. Bligaard. Phys. Rev. Lett. 122, 156001.
   https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.122.156001

Contribution

(Back to top)

Anyone is welcome to contribute to the project. Please see the contribution guide for help setting up a local copy of the code. There are some TODO items in the README files for the various modules that give suggestions on parts of the code that could be improved.


catlearn's Issues

Requirements for PyPi.

The setup.py file currently has the requirements defined.

install_requires=['ase==3.16.0',
                  'h5py==2.7.1',
                  'networkx==2.1.0',
                  'numpy==1.14.2',
                  'pandas==0.22.0',
                  'pytest-cov==2.5.1',
                  'scikit-learn==0.19.1',
                  'scipy==1.0.1',
                  'tqdm==4.20.0',
                  ],

This is a bad idea as they need to be kept updated along with the requirements.txt file. At some point it is highly likely these will diverge.

There needs to be a way to automatically parse requirements.txt when setup.py is run, in a way that remains compatible with the uploaded PyPI package.
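
A minimal sketch of one way to do this, reading requirements.txt at build time (the helper below is hypothetical; note that requirements.txt would also need to be shipped in the source distribution, e.g. via MANIFEST.in, for this to work from PyPI):

# setup.py (sketch): read the pinned requirements from requirements.txt so the
# two files cannot diverge. Assumes requirements.txt sits next to setup.py.
from pathlib import Path
from setuptools import setup, find_packages


def parse_requirements(path='requirements.txt'):
    lines = Path(path).read_text().splitlines()
    # Drop blank lines and comments.
    return [line.strip() for line in lines if line.strip() and not line.startswith('#')]


setup(
    name='catlearn',
    packages=find_packages(),
    install_requires=parse_requirements(),
)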

Error in docs on featurizing

Just a small error, but the docs say that one should use:

from catlearn.fingerprint.setup import FeatureGenerator

whereas the FeatureGenerator function now seems to be in catlearn.featurize.setup
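
In other words, the import would change roughly as follows (old path from the docs commented out, new path per this issue):

# Old path given in the docs (no longer works):
# from catlearn.fingerprint.setup import FeatureGenerator

# Current location:
from catlearn.featurize.setup import FeatureGenerator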

sklearn deprecated Imputer breaking `clean_data.py` module

The most recent version of sklearn (0.22) has removed the Imputer class from the preprocessing module, and as a result the following traceback is given:

~/TEMP/CatLearn/catlearn/preprocess/clean_data.py in <module>
      2 import numpy as np
      3 from collections import defaultdict
----> 4 from sklearn.preprocessing import Imputer
      5 from scipy.stats import skew
      6 

ImportError: cannot import name 'Imputer'

The following deprecation message is located in the old preprocessing/imputation.py file:

@deprecated("Imputer was deprecated in version 0.20 and will be "                      
    "removed in 0.22. Import impute.SimpleImputer from "                       
    "sklearn instead.")                

It looks like simply replacing Imputer with SimpleImputer would be sufficient, but we should make sure that these classes are in fact equivalent before fixing.
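
A minimal sketch of the likely replacement (assuming the default mean-imputation behaviour is what clean_data.py needs; one known difference is that SimpleImputer has no axis argument and always imputes column-wise, which should be verified against the old code):

import numpy as np
from sklearn.impute import SimpleImputer  # replaces sklearn.preprocessing.Imputer

# Hypothetical feature matrix with missing entries.
features = np.array([[1.0, np.nan], [2.0, 4.0], [np.nan, 6.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
clean_features = imputer.fit_transform(features)  # NaNs replaced by column means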

Force prediction in the Gaussian process model

Hi all,

I have noticed that CatLearn includes both energy and forces in the training of a Gaussian process model, but only predicts the energy from the GP model. The predicted forces are computed using finite differences according to Phys. Rev. Lett. 122, 156001 (2019). However, predicting forces directly from the GP model is also straightforward once forces are included in the training, just as J. Chem. Phys. 147, 152720 (2017) did. Why not predict forces directly from the GP model in CatLearn? Is there any benefit to using the finite-difference approach?

Best,
Zeyuan

Update Gradients Tutorials

The gradients tutorials need updating to jupyter notebook format with some additional discussion of what is going on/expected.

VASP internal relaxation for CatLearn NEB

Hi all,
I am just wondering if it is possible to run CatLearn ML-NEB without using the ASE VASP calculator, but instead using the VASP internal relaxation (i.e. optimizing the initial and final end-points in VASP).

Evaluate diagonal only on predict std

Predicting the mean and covariance on a test set, X_test, scales as N**2, because we are constructing K(X_test, X_test). In case X_test is large, we would prefer not to calculate the full covariance matrix, but just the standard deviation based on the diagonal (see gpfunctions.uncertainty).
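
For reference, a minimal numpy/scipy sketch of computing only the diagonal, so memory scales with the number of test points rather than its square (this is not the gpfunctions.uncertainty implementation; kernel is a hypothetical callable returning the pairwise kernel matrix):

import numpy as np
from scipy.linalg import cho_factor, cho_solve


def predictive_std(kernel, X_train, X_test, K_train, regularization):
    """Return only the predictive standard deviation (the diagonal of the
    test covariance), without building K(X_test, X_test)."""
    n_train = K_train.shape[0]
    # Cholesky factorization of the regularized training covariance.
    cf = cho_factor(K_train + regularization * np.eye(n_train))
    k_star = kernel(X_train, X_test)   # shape (n_train, n_test)
    v = cho_solve(cf, k_star)          # K^{-1} k_*
    # Prior variance of each test point minus the explained part, column by
    # column, instead of forming the full test-test covariance matrix.
    prior_var = np.array([kernel(x[None, :], x[None, :])[0, 0] for x in X_test])
    var = prior_var - np.einsum('ij,ij->j', k_star, v)
    return np.sqrt(np.clip(var, 0.0, None))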

Convergence issue with newer ASE

Workaround: ML-NEB is still stable and compatible with ASE 3.17.0.

The two latest ASE stable releases, however, break ML-NEB, causing each iteration to slow down dramatically and possibly preventing convergence.

Help is wanted in identifying the bug.

Parallel Testing

There are some issues with parallelism in Python2.7 with the adsorbate fingerprinting and maybe others. This is specific to 2.7 and does not affect Python3+.

In general I think it would be best for tests to be run with nprocs=None to pick up these errors in any code with parallelism.

TravisCI has a server for parallel testing that could be used I think. But specific tests would need to be written for this. Otherwise I think we only have access to a single core by default, so even when nprocs=None everything is still being run in serial.

Tests generate 200000 PendingDeprecationWarnings, which causes failure.

This is due to use of np.mat or np.matrix in ASE. ASE has fixed this in the master branch, but the warnings will crash our tests until the next ASE release.

All warnings were supposed to be filtered to "once" or "ignore", but unfortunately either unittest or pytest overrides this.
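
One possible workaround, assuming the suite is run with pytest, is to pass the filter on the command line rather than relying on filters set in the code:

$ pytest -W "ignore::PendingDeprecationWarning"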

PLOTNEB problem with plot's text positional requirement

I am currently testing the tutorials, in particular the tutorials/11_NEB/04_CO_Cu111/nebCO.py
Everything ran properly except the PLOTNEB module.
I encountered the following output:

The ML-NEB algorithm required  19.636363636363637 times less number of function evaluations than the standard NEB algorithm.
Energy barrier: 0.05631016623911789 eV
Traceback (most recent call last):
  File "/home/krojas/student/rai/catlearn_test/test03/nebCO.py", line 137, in <module>
    plotneb(trajectory='ML-NEB.traj', view_path=False)
  File "/home/krojas/APPS/mambaforge/envs/catlearn/lib/python3.10/site-packages/catlearn/optimize/tools.py", line 50, in plotneb
    ax.annotate(s=str(np.round(e_barrier, 3))+' eV',
TypeError: Axes.annotate() missing 1 required positional argument: 'text'

Method to replicate:

  1. Create a conda environment: conda create -n pycatlearn python=3.10 catlearn
  2. Download and run tutorials/11_NEB/04_CO_Cu111/nebCO.py

May I ask how to fix this?
Should I specify the matplotlib version ?
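
For context, the error comes from a Matplotlib API change: newer Matplotlib renamed Axes.annotate's first argument from s to text and later removed the old keyword entirely. A minimal illustration of the change follows (not the actual CatLearn code; pinning an older matplotlib would be an alternative workaround):

import numpy as np
import matplotlib
matplotlib.use('Agg')  # headless backend, just for this example
import matplotlib.pyplot as plt

e_barrier = 0.0563  # hypothetical barrier value, mirroring the traceback above

fig, ax = plt.subplots()
# Old call, fails on recent Matplotlib:
#   ax.annotate(s=str(np.round(e_barrier, 3)) + ' eV', xy=(0.5, 0.5))
# Working call on recent Matplotlib:
ax.annotate(text=str(np.round(e_barrier, 3)) + ' eV', xy=(0.5, 0.5))
fig.savefig('neb_annotation_example.png')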

Highly pedantic `requirements.txt`

Right now the requirements.txt looks as follows:

ase==3.16.0
click==6.7
cycler==0.10.0
decorator==4.3.0
flask==1.0.2
h5py==2.7.1
itsdangerous==0.24
jinja2==2.10
kiwisolver==1.0.1
markupsafe==1.0
matplotlib==2.2.2
networkx==2.1.0
numpy==1.14.3
pandas==0.23.0 
pyparsing==2.2.0
python-dateutil==2.7.3
pytz==2018.4
scikit-learn==0.19.1
scipy==1.1.0
six==1.11.0
tqdm==4.23.3
werkzeug==0.14.1

This means that, in order to install CatLearn with pip install catlearn, my system needs to match all of these packages down to the patch level or pip will refuse to install it. Case in point: if I upgrade my numpy today I would get version 1.14.4, but pip then refuses to install CatLearn since it thinks CatLearn requires exactly 1.14.3. This leaves me with two options: either I downgrade all my other packages to match CatLearn exactly (and potentially break other packages), or I have to escape into a virtualenv or docker to spin up exactly those versions. Would it be possible to relax some of these version numbers using >= or ~=? >= simply requires a version greater than or equal to the stated number and allows trailing digits to be skipped, so numpy>=1.4 would allow everything greater than or equal to 1.4.0. ~= skips one more rank. To quote the essential part of the pip documentation:

Mopidy-Dirble ~= 1.1        # Compatible release. Same as >= 1.1, == 1.*

This page documents the different possible version specifiers: https://pip.pypa.io/en/stable/reference/pip_install/#example-requirements-file. Better yet, unless there is a specific reason, I would never state the third version number at all: assuming the dependency sticks to semantic versioning, that number only counts patches and should not break backwards compatibility.
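
For illustration, a relaxed requirements.txt along those lines might look like the following (the particular floors are assumptions, not tested combinations):

ase>=3.16
numpy~=1.14
scipy~=1.1
scikit-learn~=0.19
pandas~=0.23
matplotlib~=2.2
tqdm>=4.20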

Question: Correct parallelization usage

Hi, I would like to ask how to run the CatLearn code with proper parallelization.

I tested CatLearn and compared it with the traditional neb.x code of Quantum ESPRESSO.
With the same system and number of images, CatLearn (single node, 64 cores) and neb.x (5 nodes, 64 cores each, image-parallelized) have the same duration. This means that CatLearn is more efficient with resources.

I would like to expand on this by utilizing more nodes for the DFT calculation, say 5 nodes for 1 DFT evaluation.
When I run CatLearn on 5 nodes (64 cores each), the calculation becomes rather slow.
The 5-node setup is applied to the DFT calculation via ASE_ESPRESSO_COMMAND.
I think the bottleneck may be due to the parallelization of CatLearn across 5 nodes (at least the automatic treatment is not correct).

May I ask how to do this properly?

Update Docstrings

  • Add docstrings to all functions.
  • Make sure everything has Returns.
  • Add attributes to docstring.
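
For reference, a minimal sketch of the docstring format these items point towards (the function below is hypothetical; classes would additionally get an Attributes section):

def scale_features(train_matrix, test_matrix):
    """Return standardized copies of the training and test feature matrices.

    Parameters
    ----------
    train_matrix : array
        Feature matrix for the training data.
    test_matrix : array
        Feature matrix for the test data, scaled with the training statistics.

    Returns
    -------
    dict
        Dictionary containing the scaled 'train' and 'test' matrices.
    """
    mean, std = train_matrix.mean(axis=0), train_matrix.std(axis=0)
    return {'train': (train_matrix - mean) / std,
            'test': (test_matrix - mean) / std}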

MLMIN bug with ASE 3.19

There is a major bug in MLMIN when used with ASE 3.19. It behaves like the previous NEB bug (if I'm not mistaken).
The bug is that the iteration of geo_opt using GPR in line 283 doesn't work well.

I have compared runs with ASE 3.17 and 3.19;
the one that optimizes quickly is done with ASE 3.17.
[Two screenshots comparing the optimization behaviour with ASE 3.17 and ASE 3.19]

MLNeb hangs with latest ASE from git

The tutorial described in 11_NEB/00_Tutorial/Tutorial_MLNEB.ipynb hangs when using the latest ASE git head due to changes in the Dynamics class located in ase/optimize/optimize.py. Specifically, this loop can repeat forever:

while ml_converged is False:
    # Save prev. positions:
    prev_save_positions = []
    for i in self.images:
        prev_save_positions.append(i.get_positions())
    neb_opt.run(fmax=(fmax * 0.85), steps=1)
    get_results_predicted_path(self)
    unc_ml = np.max(self.uncertainty_path[1:-1])
    e_ml = np.max(self.e_path[1:-1])
    if e_ml >= self.max_target + 0.2:
        for i in range(0, self.n_images):
            self.images[i].positions = prev_save_positions[i]
        if self.fullout is True:
            print('Pred. energy above max. energy. '
                  'Early stop.')
        ml_converged = True
    if unc_ml >= max_step:
        for i in range(0, self.n_images):
            self.images[i].positions = prev_save_positions[i]
        if self.fullout is True:
            print('Maximum uncertainty reach. Early stop.')
        ml_converged = True
    if neb_opt.converged():
        ml_converged = True
    n_steps_performed = neb_opt.__dict__['nsteps']
    if np.isnan(ml_neb.emax):
        sp = str(-self.n_images) + ':'
        self.images = read('./all_predicted_paths.traj', sp)
        for i in self.images:
            i.get_potential_energy()
        n_steps_performed = 10000
    if n_steps_performed > ml_steps-1:
        if self.fullout is True:
            print('Not converged yet...')
        ml_converged = True

This happens when neb_opt doesn't immediately converge, because neb_opt.run(..., steps=1) now returns before performing any steps if neb_opt.max_steps has been reached. Additionally, neb_opt.nsteps isn't incremented by neb_opt.run(...) beyond max_steps, so the bailout condition of n_steps_performed > ml_steps-1 never evaluates to True.

A simple workaround would be to manually set neb_opt.nsteps = 0 immediately before neb_opt.run(..). There's probably a more elegant way of telling ASE's NEB class to iterate once, but that would require more changes to the code, and the workaround I describe seems to work for me.
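
A sketch of that workaround in context (variables as in the loop above):

# Workaround: reset the step counter so Dynamics.run() does not return
# immediately once max_steps was reached in a previous iteration.
neb_opt.nsteps = 0
neb_opt.run(fmax=(fmax * 0.85), steps=1)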

As an aside, I don't understand why CatLearn accesses nsteps through neb_opt's __dict__ attribute (L407). Is there a reason for this atypical access pattern?

CI Docs

It would be good if the CI could generate the sphinx docs on-the-fly so we didn't have to keep updating things. It would probably be as simple as calling:

sphinx-apidoc -o docs catlearn

within an additional step of the CI that runs after the build completes.
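
For example (assuming Sphinx is available in the CI environment and the docs sources live in docs/), the extra step could simply run:

$ sphinx-apidoc -o docs catlearn
$ sphinx-build -b html docs docs/_build/html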
