
pyxtal_ml's Introduction

PyXtal_ml

Overview

A library for machine-learning (ML) training of materials' properties

  • datasets: Python class to download data from open databases, plus data in JSON format
  • descriptors: Python class for different types of descriptors (RDF, ADF, chemical labeling and environments)
  • ml: Python class for choosing among different pipelines of ML methods (KRR, ...)
  • test: Python class for unit tests (to be implemented)
  • results: a collection of results.

Hierarchy

This code has the following hierarchy:

pyxtal_ml
├── main.py
├── descriptors -> descriptors.py -> (RDF.py, ADF.py, DDF.py, Chem.py, etc.)
├── datasets -> collection.py -> (json files)
├── ml -> methods.py (KRR, KNN, ANN, etc.)
└── test -> (various scripts to test the accuracy and efficiency of the code)

Installation

$ git clone https://github.com/qzhu2017/PyXtal_ml.git
$ python setup.py install

Dependencies:

pyxtal_ml's People

Contributors: david-zagaceta, qzhu2017, yanxon

Forkers: chrinide

pyxtal_ml's Issues

[Alert] Voronoi isn't working with the main code yet.

ML learning with GradientBoosting algorithm

-------------SUMMARY of ML--------------

Number of samples: 100
Number of features: 1344
Algorithm: GradientBoosting
Feature: Chem+RDF+ADF+Charge+Packing_efficiency+Volume_stats+Bond_stats+Coord_number+Chemical_ordering
Property: formation_energy
r^2: -0.1959
MAE: 0.8273
Mean train_score: 1.0000
Std train_score: 0.0000

ML learning with GradientBoosting algorithm

-------------SUMMARY of ML--------------

Number of samples: 100
Number of features: 1344
Algorithm: GradientBoosting
Feature: Chem+RDF+ADF+Charge
Property: formation_energy
r^2: -0.0926
MAE: 0.7997
Mean train_score: 1.0000
Std train_score: 0.0000

Despite the first run adding several more feature types, both runs report 1344 features, so the extra descriptors apparently are not being included.

Parameter sets for ML methods

@yanxon
I have resolved the conflicts. Please update your repo.

For the next step, please implement the three sets of ML parameters we discussed, say:

  • light
  • medium
  • tight
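
A minimal sketch of what such levels could look like, assuming they feed a grid search; the names and values below are illustrative placeholders, not the settings we agreed on:

    import numpy as np

    # Hypothetical parameter levels for KRR; values are placeholders.
    KRR_LEVELS = {
        'light':  {'alpha': [1.0], 'kernel': ['linear']},
        'medium': {'alpha': [0.1, 1.0, 10.0], 'kernel': ['linear', 'rbf']},
        'tight':  {'alpha': np.logspace(-3, 2, 6).tolist(),
                   'kernel': ['linear', 'rbf'],
                   'gamma': np.logspace(-3, 1, 5).tolist()},
    }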

covariance

@David-Zagaceta @yanxon
Below is my understanding of the covariance, based on Fig. 1 of that paper.

The covariance is only computed between element features and structure features. Say you have:

  • 4 atoms in the unit cell
  • 20 element features (element index, melting point, ...)
  • 18 structure features (say BOP: q1, q2, q3, ...)

You should calculate the covariance between the structure and element features: for example, element index (a 4-number array) against q1 (a 4-number array). The covariance is a symmetric 2*2 matrix; we just take its upper triangle, which has 3 numbers.

You repeat this for each pair, so in the end you get 20*18*3 = 1080 numbers for the covariance.
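
A minimal sketch of that construction, assuming the per-atom features are stacked column-wise (the array names are illustrative):

    import numpy as np

    def covariance_features(elem_feats, struc_feats):
        """elem_feats: (n_atoms, 20); struc_feats: (n_atoms, 18)."""
        feats = []
        for i in range(elem_feats.shape[1]):
            for j in range(struc_feats.shape[1]):
                # 2x2 covariance of one element column vs one structure
                # column; keep the upper triangle (3 numbers).
                c = np.cov(elem_feats[:, i], struc_feats[:, j])
                feats.extend(c[np.triu_indices(2)])
        return np.array(feats)  # length 20 * 18 * 3 = 1080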

[Plan] add Multiprocessing

In order to speed up the calculation, I think the most important thing is to add multiprocessing to our descriptor calculation.
We need to run the following code in parallel:

    def convert_data_1D(self):
        """
        convert the structures to descriptors in the format of 1D array
        """
        start = time()
        print('Calculating the descriptors for {:} materials'.format(len(self.strucs)))
        pbar = ProgressBar()
        feas = []
        for struc in pbar(self.strucs):
            try:
                feas.append(descriptor(struc, self.feature0))
            except Exception:  # don't let one bad structure abort the whole loop
                feas.append([])
                print('Problem occurs in {}'.format(struc.formula))
        end = time()
        self.time['convert_data'] = end-start

        self.features = feas

The best way is probably to wrap it with the multiprocessing module:
https://docs.python.org/3.6/library/multiprocessing.html
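
A minimal sketch of that, assuming `descriptor` and the structures are picklable; the function names and process count are illustrative, not the final design:

    from multiprocessing import Pool
    from pyxtal_ml.descriptors.descriptors import descriptor

    def _calc_one(args):
        """Module-level wrapper so it can be pickled by the pool."""
        struc, feature0 = args
        try:
            return descriptor(struc, feature0)
        except Exception:
            print('Problem occurs in {}'.format(struc.formula))
            return []

    def convert_data_1D_parallel(strucs, feature0, n_procs=4):
        # n_procs is an illustrative default
        with Pool(processes=n_procs) as pool:
            return pool.map(_calc_one, [(s, feature0) for s in strucs])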

@David-Zagaceta @yanxon Can you look into this?

bispectrum calc

After studying how the code calculates the bispectrum coefficients, I think there are at least three places where we can improve it:
1, We routinely call the Clebsch-Gordan (CG) coefficients. It would be cheaper to build a lookup table once and then just read off the values.
2, There are a lot of cases where CG = 0. In those cases, we shouldn't compute the c coefficients at all.
3, I guess the most expensive part is the calculation of the c coefficients. However, when you check the formula, we again repeat such calculations many times. We should tabulate those as well (see the sketch below).
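
One possible form for the CG table, a hedged sketch assuming a SymPy evaluation (the real code may compute CG differently); memoization also makes it cheap to skip the CG = 0 cases:

    from functools import lru_cache
    from sympy.physics.quantum.cg import CG

    @lru_cache(maxsize=None)
    def cg(j1, m1, j2, m2, j3, m3):
        """Memoized Clebsch-Gordan coefficient; repeated calls hit the cache."""
        return float(CG(j1, m1, j2, m2, j3, m3).doit())

    # in the main loop, skip the c-coefficient work whenever the
    # prefactor vanishes:
    #     if cg(j1, m1, j2, m2, j3, m3) == 0.0: continue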

@David-Zagaceta @yanxon I suggest you take care of the above points, try to improve the Python code, and see how much we can benefit from such improvements.

Please try to come see me on Monday; let's have another go at the formulas.
@yanxon I am also eager to know the performance of linear regression. As I said, I would like to know the contribution of each bispectrum component to the total energy.

General guides in commits

@David-Zagaceta
@yanxon

When you make commits, please follow some standards:
1, Always run git pull before you start to work on the code.

2, Selectively add your files. Please add ONLY the .py and data source files; don't include folders like __pycache__.

3, Commit the code more frequently. Don't make a commit that contains many modified files; as soon as you are done with one file, commit it right away.

4, Try to avoid conflicts during the commit. If you encounter a conflict, notify the author who is responsible for that part.

gaussian derivatives

I just finished implementing the Gaussian derivatives (G2, G4, and G5).
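
For reference, G2 here presumably denotes the standard Behler-Parrinello radial symmetry function (our conventions may differ); the derivatives then follow by the chain rule through the pair distances R_ij:

    G^2_i = \sum_{j \neq i} \exp\!\left[-\eta\,(R_{ij}-R_s)^2\right] f_c(R_{ij}),
    \qquad
    f_c(R) = \begin{cases} \tfrac{1}{2}\left[\cos\left(\pi R/R_c\right)+1\right], & R \le R_c \\ 0, & R > R_c \end{cases}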

I will clean up the code (perhaps make a class for the similar pieces) so that it's readable.

RDF + Chem

The grid search for the GradientBoosting method is done.

Qiang mentioned trying RDF + Chem to predict the formation enthalpy. I got a result similar to what is discussed in the Jarvis paper (https://journals.aps.org/prmaterials/pdf/10.1103/PhysRevMaterials.2.083801). The result in our simulation is r2 = 0.939 and MAE = 0.156.

Here, I used a broad grid search; I believe a fine grid search would give a lower MAE and a better r2. I think the reason we couldn't get a nice result for RDF alone is that machine learning is dataset-dependent; their paper doesn't report an RDF-only result.

I will integrate this machine-learning method in the ML-Materials repo today so it can be reproducible.

(figure: enthalpy_form prediction results)

[Announcement] Update your local repo

Dear All,

I have just managed to make this library installable via setup.py. I also renamed the library PyXtal_ml.

This is going to be a new stage of our development. For your convenience, I suggest you update your repo via:

git clone https://github.com/qzhu2017/PyXtal_ml.git

From now on, please ignore the previous copy, and implement your code based on the new local copy.

Please follow the instructions below to commit your code.

$ git pull
$ python setup.py install
$ python main.py

This will make sure that you get the latest copy and run it without any problem.
Before committing your code:

$ python setup.py install
$ python main.py

This will make sure you can run the modified code without any problem.

To commit your code:

$ git status  # check if you have any changes
$ git add `the file you want to change`
$ git commit -m "describe your change"
$ git push

When you want to import functions from PyXtal_ml,
please use statements like:

from pyxtal_ml.descriptors.descriptors import descriptor

[FYI] Pie Chart

The way the Jarvis paper got their pie chart is by taking the absolute percentage change of the MAE, with ALL descriptors as the reference:
(All-Chem)/Chem ~ 42% ---> Chem
(All-(Chem+RDF))/(Chem+RDF) ~ 18% ---> RDF

With the data right here, created by Dr. Zhu and me, we can clearly see that adding Charge actually makes the performance drop: the MAE of Chem is 0.2197 and the MAE of (Chem+Charge) is 0.2226.

Descriptors                      r2       MAE      MAE diff (%)
Chem+RDF+ADF+Charge+Voronoi      0.9257   0.2023   0
Chem+RDF+ADF+Charge              0.9161   0.2217   9.59
Chem+ADF+Charge                  0.9144   0.2230   10.23
Chem+Charge                      0.9136   0.2226   10.03
Chem                             0.9153   0.2197   8.60
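
For concreteness, the MAE diff column is the relative MAE increase with respect to the full descriptor set (the Voronoi row), e.g. for Chem+RDF+ADF+Charge:

    (0.2217 - 0.2023) / 0.2023 * 100% ≈ 9.59%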

first complete test

@yanxon

I have implemented all parts to make the code run as follows.

$ python main.py

However, the ML performance is very poor. Below is an example which I got for band_gap:

Time elapsed for reading the json files: 3.005 seconds
Total number of materials: 8049
The chosen feature is: RDF+Chem
Time elapsed for creating the descriptors: 267.185 seconds
Each material has 498 descriptors
Time elapsed for machine learning: 3.926 seconds

(figure: band_gap ML prediction results)

Please try to repeat the calculation, and tell me what's wrong with it.

A package design

@David-Zagaceta
@yanxon
I have completed the class design for datasets and descriptors.
Now we need to add the ML part as well.

The ultimate goal is that we can just use one script main.py to complete the following
1, choose the dataset and specify the target property (Y)
2, compute the descriptors for the structure set (X)
3, perform machine learning on X/Y

@yanxon
1, Try to follow the way I did and create a methods.py file in ml/.
2, Please complete the ML part in main.py to call the different ML methods.

@David-Zagaceta , you can start to think about implementing new descriptors, following the way I did for ADF.py, RDF.py, and Chem.py.

After this is done, we can make it a Python package via setup.py.

link the dynamic C library

Looks like we cannot just copy the .so file and call it from Python. Need to figure out a solution later.
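
One possible route is ctypes, a minimal sketch assuming a compiled shared library with a function of known signature (both the file name and the function are hypothetical):

    import ctypes

    # load a shared library by explicit path (hypothetical file name)
    lib = ctypes.CDLL('./libdescriptor.so')

    # declare the C signature before calling (hypothetical function)
    lib.compute_rdf.argtypes = [ctypes.c_double]
    lib.compute_rdf.restype = ctypes.c_double
    value = lib.compute_rdf(3.2)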

[Question] how to pass the user defined parameters to ml training

@yanxon , as we discussed, we should allow the user to provide their own parameter list for ML training. However, I don't know how to call it from run.py or main.py. Could you please create an example? For instance, I want to pass
n_estimator = 20 for a single run of RF training, and
n_estimator = [10, 50] for a grid search over RF training.
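
A hedged sketch of the two call patterns on the sklearn side (note the sklearn keyword is n_estimators, plural; the tiny random data stands in for our descriptors). How run.py should expose this is still the open question:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    # tiny synthetic stand-in for the descriptor matrix and target
    rng = np.random.RandomState(0)
    X_train, y_train = rng.rand(20, 5), rng.rand(20)

    # single run with a fixed value
    rf = RandomForestRegressor(n_estimators=20).fit(X_train, y_train)

    # grid search over a list of candidate values
    grid = GridSearchCV(RandomForestRegressor(),
                        param_grid={'n_estimators': [10, 50]}, cv=5)
    grid.fit(X_train, y_train)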

PRDF speed

@qzhu2017

Right now the PRDF is consuming a lot of memory and time. Instead of returning the RDF for each element combination, including the elements not in the crystal (these would be zero distributions), I think we should return the integrals of the distribution functions. This would give a single value for each pairwise element combination and would reduce the overhead by a large factor.

Right now, stacking each RDF array at the end is consuming most of the computation time.
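
A minimal sketch of the proposed reduction, assuming each pairwise RDF is a binned distribution on a common r grid (the dict layout is an assumption, not the current data structure):

    import numpy as np

    def rdf_integrals(prdf, r_grid):
        """Collapse each pairwise RDF to one scalar: its integral.

        prdf: dict mapping (elem_a, elem_b) -> RDF array on r_grid
        """
        return {pair: np.trapz(g, r_grid) for pair, g in prdf.items()}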

[ToDo] Rewrite the ADF.py to include DDF as an option

The proposed algorithm:
1, Get the table of neighbors for each atom (based on the criterion of 1.2*sum(radii)).
2, For each i, find the list of j-i-k triplets; if the calc_DDF option is on, for each j-i and i-k, find the lists of l-j-i-k and j-i-k-l.
3, Calculate the angles for (j-i-k) and (l-j-i-k); see the sketch after this list.
4, Error handling: if the list is empty, return all zeros.
5, If there is only one atom in the unit cell, the l-j-i-k list cannot be created; a simple fix is to double the cell.
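
A minimal sketch of the angle part of steps 2-3, assuming Cartesian coordinates for a center atom and its neighbor table (the function and variable names are illustrative):

    import numpy as np

    def angles_jik(center, neighbors):
        """All j-i-k angles (degrees) around one center atom i."""
        angles = []
        for a in range(len(neighbors)):
            for b in range(a + 1, len(neighbors)):
                v1 = neighbors[a] - center
                v2 = neighbors[b] - center
                cos_t = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
                angles.append(np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))))
        return np.array(angles)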

Further optimization:
1, Check the scaling of step (1); this may become an issue for large crystals.
2, Check step (2); looping over the table may become slow as well.

[Question] How to export the pre-trained model?

Can anyone investigate how to export a pre-trained model? I wonder if we can save the ML model trained from the calculation and then load it later to directly predict properties of an arbitrary crystal structure.
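
One common option is joblib persistence, a minimal sketch where a trivially fitted sklearn model stands in for our pipeline's estimator:

    import joblib
    from sklearn.linear_model import LinearRegression

    # stand-in for the fitted estimator from our pipeline
    estimator = LinearRegression().fit([[0.0], [1.0]], [0.0, 1.0])

    joblib.dump(estimator, 'pyxtal_ml_model.joblib')   # save to disk

    model = joblib.load('pyxtal_ml_model.joblib')      # reload later
    print(model.predict([[2.0]]))                      # predict on new descriptors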

a bug in BOP

@David-Zagaceta I noticed that there is a bug in the bond order parameter calc.
For NaCl, the first four atoms are equivalent, so they should return the same results. But you see the 2nd number in the qw_series for atoms 1 and 2 is different. Please fix this issue ASAP.

qzhu@cms:/scratch/qzhu/github/PyXtal_ml/pyxtal_ml/descriptors$ python bond_order_params.py -c POSCARs/POSCAR-NaCl 
bond_order_params.py:76: ComplexWarning: Casting complex values to real discards the imaginary part
  return float(np.sum(q1*np.conjugate(q2)))
[[ 4.59381495e-17 -2.81584178e-02  7.72009938e-17  0.00000000e+00
   7.63762616e-01  9.29351343e-02  3.03372715e-16  0.00000000e+00
   3.53553391e-01  1.64507509e-03  2.60540884e-16 -1.44091792e-18
   7.18070331e-01  3.01407518e-02  3.78571193e-16 -1.39700044e-16
   4.11425368e-01  1.52564160e-02  5.99375599e-16  6.03663967e-16
   6.95502666e-01  1.15831953e-02]
 [ 3.53525080e-17  0.00000000e+00  8.17406390e-17  0.00000000e+00
   7.63762616e-01  9.29351343e-02  2.96368233e-16  0.00000000e+00
   3.53553391e-01  1.64507509e-03  2.60638893e-16 -1.02806645e-18
   7.18070331e-01  3.01407518e-02  3.81349846e-16 -1.25248782e-16
   4.11425368e-01  1.52564160e-02  5.99227097e-16  7.22169401e-16
   6.95502666e-01  1.15831953e-02]
 [ 3.53525080e-17  0.00000000e+00  8.03665665e-17  0.00000000e+00
   7.63762616e-01  9.29351343e-02  2.96837331e-16  0.00000000e+00
   3.53553391e-01  1.64507509e-03  2.82115348e-16 -9.72836723e-19
   7.18070331e-01  3.01407518e-02  3.85474658e-16 -1.72989429e-16
   4.11425368e-01  1.52564160e-02  5.98118812e-16  7.82250664e-16
   6.95502666e-01  1.15831953e-02]
 [ 3.53525080e-17  0.00000000e+00  9.05876542e-17  0.00000000e+00
   7.63762616e-01  9.29351343e-02  2.89962676e-16  0.00000000e+00
   3.53553391e-01  1.64507509e-03  2.56941181e-16  0.00000000e+00
   7.18070331e-01  3.01407518e-02  3.73211073e-16 -1.13775913e-16
   4.11425368e-01  1.52564160e-02  5.96878535e-16  6.68980069e-16
   6.95502666e-01  1.15831953e-02]
 [ 3.53525080e-17  0.00000000e+00  7.65647998e-17  0.00000000e+00
   7.63762616e-01  9.29351343e-02  2.99893688e-16  0.00000000e+00
   3.53553391e-01  1.64507509e-03  2.66454730e-16 -3.07906266e-18
   7.18070331e-01  3.01407518e-02  3.91991224e-16 -1.49241280e-16
   4.11425368e-01  1.52564160e-02  6.04876861e-16  6.40434954e-16
   6.95502666e-01  1.15831953e-02]
 [ 3.53525080e-17  0.00000000e+00  8.88474250e-17  0.00000000e+00
   7.63762616e-01  9.29351343e-02  2.95461933e-16  0.00000000e+00
   3.53553391e-01  1.64507509e-03  2.56299165e-16 -1.29741383e-18
   7.18070331e-01  3.01407518e-02  3.67717983e-16 -1.13198929e-16
   4.11425368e-01  1.52564160e-02  5.98187807e-16  6.56140861e-16
   6.95502666e-01  1.15831953e-02]
 [ 3.53525080e-17  0.00000000e+00  7.88597060e-17  0.00000000e+00
   7.63762616e-01  9.29351343e-02  3.00675996e-16  0.00000000e+00
   3.53553391e-01  1.64507509e-03  2.74096457e-16 -1.76790361e-18
   7.18070331e-01  3.01407518e-02  3.77367346e-16 -1.61950379e-16
   4.11425368e-01  1.52564160e-02  5.98796693e-16  7.10296295e-16
   6.95502666e-01  1.15831953e-02]
 [ 4.59381495e-17 -2.81584178e-02  8.43191699e-17  0.00000000e+00
   7.63762616e-01  9.29351343e-02  2.92501848e-16  0.00000000e+00
   3.53553391e-01  1.64507509e-03  2.62248501e-16 -1.81665439e-18
   7.18070331e-01  3.01407518e-02  3.77818589e-16 -1.41862359e-16
   4.11425368e-01  1.52564160e-02  5.96288214e-16  6.17792040e-16
   6.95502666e-01  1.15831953e-02]] (8, 22)

NAN value in bond_order_params class

run: 100%|####################################| 100/100 [03:53<00:00,  2.07s/it]

ML learning with KRR algorithm
Traceback (most recent call last):
  File "main.py", line 27, in <module>
    runner.ml_train(algo=algorithm)
  File "/homes/davidz87/PyXtal_ml/pyxtal_ml/run.py", line 165, in ml_train
    pipeline=self.pipeline, params=self.level)
  File "/homes/davidz87/PyXtal_ml/pyxtal_ml/ml/ml_sklearn.py", line 94, in __init__
    self.ml()
  File "/homes/davidz87/PyXtal_ml/pyxtal_ml/ml/ml_sklearn.py", line 204, in ml
    best_estimator.fit(self.X_train, self.Y_train)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 722, in fit
    self._run_search(evaluate_candidates)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 1191, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 711, in evaluate_candidates
    cv.split(X, y, groups)))
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 983, in __call__
    if self.dispatch_one_batch(iterator):
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 545, in __init__
    self.results = batch()
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 261, in __call__
    for func, args, kwargs in self.items]
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 261, in <listcomp>
    for func, args, kwargs in self.items]
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 568, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer, is_multimetric)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 605, in _score
    return _multimetric_score(estimator, X_test, y_test, scorer)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 635, in _multimetric_score
    score = scorer(estimator, X_test, y_test)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/metrics/scorer.py", line 228, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/base.py", line 328, in score
    return r2_score(y, self.predict(X), sample_weight=sample_weight,
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/kernel_ridge.py", line 193, in predict
    K = self._get_kernel(X, self.X_fit_)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/kernel_ridge.py", line 125, in _get_kernel
    filter_params=True, **params)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 1559, in pairwise_kernels
    return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 1069, in _parallel_pairwise
    return func(X, Y, **kwds)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 722, in linear_kernel
    X, Y = check_pairwise_arrays(X, Y)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 111, in check_pairwise_arrays
    warn_on_dtype=warn_on_dtype, estimator=estimator)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/utils/validation.py", line 568, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I will investigate the source of these values tomorrow. I think they may be coming from the odd-numbered bond order parameters.
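
A quick way to locate the offending entries before they reach sklearn; `features` stands in for the (n_samples, n_features) descriptor matrix from convert_data_1D (here a tiny example with one bad entry):

    import numpy as np

    features = [[0.1, 0.2], [np.nan, 0.4]]
    X = np.asarray(features, dtype=float)
    for r, c in np.argwhere(~np.isfinite(X)):   # (sample, feature) indices
        print('sample {}, feature {}: {}'.format(r, c, X[r, c]))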

[Plan] Add a step of feature selection in ML pipeline

Currently, we have a relatively small array of descriptors compared to the size of the database. However, there may be cases where we have more descriptors than materials data. In that case, it would be good to add a feature-selection step as part of the pipeline.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),  # l1 requires dual=False
  ('classification', RandomForestClassifier())
])
clf.fit(X, y)

Something to be done in the future.

[Example]

I created an example folder to systematically test the performance of different algorithms and descriptors.

example_01

Dataset:
~8000 non-metal materials from the Materials Project.

For the descriptors:
It looks to me that the Chem descriptor plays the dominant role; the addition of other descriptors does not significantly improve the results.

For the algorithms:
Both GradientBoosting and RandomForest perform well; they get better scores in less time.
KNN is very fast, but its r2 value is too small.
KRR takes longer, and its r2 is only slightly better than KNN's.

Of course, all the results are based on the light setting. The GridSearchCV function may improve the performance of KRR and KNN.

We need to expand the pool of datasets, descriptors and ML algorithms; that can help us gain a better understanding of how to optimize the ML training.

Please feel free to add more examples to our repo.

dataset

@yanxon

I noticed a couple of issues in sp_metal_aflow_844.json:

1, It has a lot of entries that return None for 'form_energy_cell'.
2, It would be good to normalize the values when you create the json files; this applies to the extensive properties.

For instance:
DOS: should always be per A^3.
Formation energy: should always be per atom.
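
A hedged sketch of that normalization at json-creation time; apart from 'form_energy_cell', the key names below are assumptions about the aflow dump, not its actual schema:

    def normalize_entry(entry):
        """Divide extensive properties by cell size before storing."""
        n_atoms = entry['natoms']      # assumed key
        volume = entry['volume']       # assumed key, in A^3
        if entry['form_energy_cell'] is not None:
            entry['form_energy_atom'] = entry['form_energy_cell'] / n_atoms
        if entry.get('dos') is not None:
            entry['dos_per_A3'] = entry['dos'] / volume   # assumed DOS field
        return entry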

[Results] Add your results

I just ran a quick test to get a feeling for the accuracy and efficiency of our code at the moment.

Example-01

Please have a look and feel free to add some new results there. By doing this, we can gain more experience with the code.

General guide in programming

I will keep updating these lists.

  • When you import libraries, please avoid the import * statement; otherwise people will be confused about which function is used in the code.

[Bug] main.py is not working

It looks like main.py is no longer working.

Also, I found that the code complains when I increase N_sample from 300 to 800 in torch_main.py.
