
pyxtal_ml's Introduction

PyXtal_ml

Overview

A library for machine-learning (ML) training of materials' properties

  • datasets: Python class to download data from open databases, plus data in JSON format
  • descriptors: Python class for different types of descriptors (RDF, ADF, chemical labeling and environments)
  • ml: Python class for choosing among different pipelines of ML methods (KRR, ...)
  • test: Python class for unit tests (to be implemented)
  • results: a collection of results.

Hierarchy

This code has the following hierarchy:

pyxtal_ml
├── main.py
├── descriptors -> descriptors.py -> (RDF.py, ADF.py, DDF.py, Chem.py, etc.)
├── datasets -> collection.py -> (json files)
├── ml -> methods.py (KRR, KNN, ANN, etc.)
└── test -> (various scripts to test the accuracy and efficiency of the code)

Installation

$ git clone https://github.com/qzhu2017/PyXtal_ml.git
$ python setup.py install

Dependencies:

pyxtal_ml's People

Contributors: david-zagaceta, qzhu2017, yanxon

Forkers: chrinide

pyxtal_ml's Issues

[Alert] Voronoi isn't working with the main code yet.

ML learning with GradientBoosting algorithm

-------------SUMMARY of ML--------------

Number of samples: 100
Number of features: 1344
Algorithm: GradientBoosting
Feature: Chem+RDF+ADF+Charge+Packing_efficiency+Volume_stats+Bond_stats+Coord_number+Chemical_ordering
Property: formation_energy
r^2: -0.1959
MAE: 0.8273
Mean train_score: 1.0000
Std train_score: 0.0000

ML learning with GradientBoosting algorithm

-------------SUMMARY of ML--------------

Number of samples: 100
Number of features: 1344
Algorithm: GradientBoosting
Feature: Chem+RDF+ADF+Charge
Property: formation_energy
r^2: -0.0926
MAE: 0.7997
Mean train_score: 1.0000
Std train_score: 0.0000

Despite the first run adding several more feature types, both runs report 1344 features, so the extra descriptors apparently are not being included.

Parameter sets for ML methods

@yanxon
I have resolved the conflicts. Please update your repo.

For the next step, please implement the three sets of ML parameters we discussed, say:

  • light
  • medium
  • tight
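
A minimal sketch of what such levels could look like, assuming they feed a grid search; the names and values below are illustrative placeholders, not the settings we agreed on:

    import numpy as np

    # Hypothetical parameter levels for KRR; values are placeholders.
    KRR_LEVELS = {
        'light':  {'alpha': [1.0], 'kernel': ['linear']},
        'medium': {'alpha': [0.1, 1.0, 10.0], 'kernel': ['linear', 'rbf']},
        'tight':  {'alpha': np.logspace(-3, 2, 6).tolist(),
                   'kernel': ['linear', 'rbf'],
                   'gamma': np.logspace(-3, 1, 5).tolist()},
    }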

covariance

@David-Zagaceta @yanxon
Below is my understanding of the covariance, based on Fig. 1 of that paper.

The covariance is only computed between element features and structure features. Say you have:

  • 4 atoms in the unit cell
  • 20 element features (element index, melting point, ...)
  • 18 structure features (say BOP: q1, q2, q3, ...)

You should calculate the covariance between the structure and element features: for example, element index (a 4-number array) against q1 (a 4-number array). The covariance is a symmetric 2*2 matrix; we just take its upper triangle, which has 3 numbers.

You repeat this for each pair, so in the end you get 20*18*3 = 1080 numbers for the covariance.
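
A minimal sketch of that construction, assuming the per-atom features are stacked column-wise (the array names are illustrative):

    import numpy as np

    def covariance_features(elem_feats, struc_feats):
        """elem_feats: (n_atoms, 20); struc_feats: (n_atoms, 18)."""
        feats = []
        for i in range(elem_feats.shape[1]):
            for j in range(struc_feats.shape[1]):
                # 2x2 covariance of one element column vs one structure
                # column; keep the upper triangle (3 numbers).
                c = np.cov(elem_feats[:, i], struc_feats[:, j])
                feats.extend(c[np.triu_indices(2)])
        return np.array(feats)  # length 20 * 18 * 3 = 1080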

[Plan] add Multiprocessing

In order to speed up the calculation, I think the most important thing is to add multiprocessing to our descriptor calculation.
We need to run the following code in parallel:

    def convert_data_1D(self):
        """
        convert the structures to descriptors in the format of 1D array
        """
        start = time()
        print('Calculating the descriptors for {:} materials'.format(len(self.strucs)))
        pbar = ProgressBar()
        feas = []
        for struc in pbar(self.strucs):
            try:
                feas.append(descriptor(struc, self.feature0))
            except Exception:  # don't let one bad structure abort the whole loop
                feas.append([])
                print('Problem occurs in {}'.format(struc.formula))
        end = time()
        self.time['convert_data'] = end-start

        self.features = feas

The best way is probably to wrap it with the multiprocessing module:
https://docs.python.org/3.6/library/multiprocessing.html
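
A minimal sketch of that, assuming `descriptor` and the structures are picklable; the function names and process count are illustrative, not the final design:

    from multiprocessing import Pool
    from pyxtal_ml.descriptors.descriptors import descriptor

    def _calc_one(args):
        """Module-level wrapper so it can be pickled by the pool."""
        struc, feature0 = args
        try:
            return descriptor(struc, feature0)
        except Exception:
            print('Problem occurs in {}'.format(struc.formula))
            return []

    def convert_data_1D_parallel(strucs, feature0, n_procs=4):
        # n_procs is an illustrative default
        with Pool(processes=n_procs) as pool:
            return pool.map(_calc_one, [(s, feature0) for s in strucs])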

@David-Zagaceta @yanxon Can you look into this?

bispectrum calc

After studying how the code calculates the bispectrum coefficients, I think there are at least three places where we can improve it:
1, We routinely call the Clebsch-Gordan (CG) coefficients. It would be cheaper to build a lookup table once and then just read off the values.
2, There are a lot of cases where CG = 0. In those cases, we shouldn't compute the c coefficients at all.
3, I guess the most expensive part is the calculation of the c coefficients. However, when you check the formula, we again repeat such calculations many times. We should tabulate those as well (see the sketch below).
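
One possible form for the CG table, a hedged sketch assuming a SymPy evaluation (the real code may compute CG differently); memoization also makes it cheap to skip the CG = 0 cases:

    from functools import lru_cache
    from sympy.physics.quantum.cg import CG

    @lru_cache(maxsize=None)
    def cg(j1, m1, j2, m2, j3, m3):
        """Memoized Clebsch-Gordan coefficient; repeated calls hit the cache."""
        return float(CG(j1, m1, j2, m2, j3, m3).doit())

    # in the main loop, skip the c-coefficient work whenever the
    # prefactor vanishes:
    #     if cg(j1, m1, j2, m2, j3, m3) == 0.0: continue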

@David-Zagaceta @yanxon I suggest you take care of the above points, try to improve the Python code, and see how much we can benefit from such improvements.

Please try to come see me on Monday; let's have another go at the formulas.
@yanxon I am also eager to know the performance of linear regression. As I said, I would like to know the contribution of each bispectrum component to the total energy.

General guides in commits

@David-Zagaceta
@yanxon

When you make commits, please follow some standards:
1, Always run git pull before you start to work on the code.

2, Selectively add your files. Please add ONLY the .py and data source files; don't include folders like __pycache__.

3, Commit the code more frequently. Don't make a commit that contains many modified files; as soon as you are done with one file, commit it right away.

4, Try to avoid conflicts during the commit. If you encounter a conflict, notify the author who is responsible for that part.

gaussian derivatives

I just finished implementing the Gaussian derivatives (G2, G4, and G5).
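
For reference, G2 here presumably denotes the standard Behler-Parrinello radial symmetry function (our conventions may differ); the derivatives then follow by the chain rule through the pair distances R_ij:

    G^2_i = \sum_{j \neq i} \exp\!\left[-\eta\,(R_{ij}-R_s)^2\right] f_c(R_{ij}),
    \qquad
    f_c(R) = \begin{cases} \tfrac{1}{2}\left[\cos\left(\pi R/R_c\right)+1\right], & R \le R_c \\ 0, & R > R_c \end{cases}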

I will clean up the code (perhaps make a class for the similar pieces) so that it's readable.

RDF + Chem

The grid search for the GradientBoosting method is done.

Qiang mentioned trying RDF + Chem to predict the formation enthalpy. I got a result similar to what is discussed in the Jarvis paper (https://journals.aps.org/prmaterials/pdf/10.1103/PhysRevMaterials.2.083801). The result in our simulation is r2 = 0.939 and MAE = 0.156.

Here, I used a broad grid search; I believe a fine grid search would give a lower MAE and a better r2. I think the reason we couldn't get a nice result for RDF alone is that machine learning is dataset-dependent; their paper doesn't report an RDF-only result.

I will integrate this machine-learning method in the ML-Materials repo today so it can be reproducible.

(figure: enthalpy_form prediction results)

[Announcement] Update your local repo

Dear All,

I have just managed to make this library installable via setup.py. I also renamed the library PyXtal_ml.

This is going to be a new stage of our development. For your convenience, I suggest you update your repo via:

git clone https://github.com/qzhu2017/PyXtal_ml.git

From now on, please ignore the previous copy, and implement your code based on the new local copy.

Please follow the instructions below to commit your code.

$ git pull
$ python setup.py install
$ python main.py

This will make sure that you get the latest copy and run it without any problem.
Before committing your code:

$ python setup.py install
$ python main.py

This will make sure you can run the modified code without any problem.

To commit your code:

$ git status  # check if you have any changes
$ git add `the file you want to change`
$ git commit -m "describe your change"
$ git push

When you want to import functions from PyXtal_ml,
please use statements like:

from pyxtal_ml.descriptors.descriptors import descriptor

[FYI] Pie Chart

The way the Jarvis paper got their pie chart is by taking the absolute percentage change of the MAE, with ALL descriptors as the reference:
(All-Chem)/Chem ~ 42% ---> Chem
(All-(Chem+RDF))/(Chem+RDF) ~ 18% ---> RDF

With the data right here, created by Dr. Zhu and me, we can clearly see that adding Charge actually makes the performance drop: the MAE of Chem is 0.2197 and the MAE of (Chem+Charge) is 0.2226.

Descriptors                      r2       MAE      MAE diff (%)
Chem+RDF+ADF+Charge+Voronoi      0.9257   0.2023   0
Chem+RDF+ADF+Charge              0.9161   0.2217   9.59
Chem+ADF+Charge                  0.9144   0.2230   10.23
Chem+Charge                      0.9136   0.2226   10.03
Chem                             0.9153   0.2197   8.60
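
For concreteness, the MAE diff column is the relative MAE increase with respect to the full descriptor set (the Voronoi row), e.g. for Chem+RDF+ADF+Charge:

    (0.2217 - 0.2023) / 0.2023 * 100% ≈ 9.59%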

first complete test

@yanxon

I have implemented all parts to make the code run as follows.

$ python main.py

However, the ML performance is very poor. Below is an example which I got for band_gap:

Time elapsed for reading the json files: 3.005 seconds
Total number of materials: 8049
The chosen feature is: RDF+Chem
Time elapsed for creating the descriptors: 267.185 seconds
Each material has 498 descriptors
Time elapsed for machine learning: 3.926 seconds

(figure: band_gap ML prediction results)

Please try to repeat the calculation, and tell me what's wrong with it.

A package design

@David-Zagaceta
@yanxon
I have completed the class design for datasets and descriptors.
Now we need to add the ML part as well.

The ultimate goal is that we can just use one script main.py to complete the following
1, choose the dataset and specify the target property (Y)
2, compute the descriptors for the structure set (X)
3, perform machine learning on X/Y

@yanxon
1, Try to follow the way I did and create a methods.py file in ml/.
2, Please complete the ML part in main.py to call the different ML methods.

@David-Zagaceta , you can start to think about implementing new descriptors, following the way I did for ADF.py, RDF.py, and Chem.py.

After this is done, we can make it a Python package via setup.py.

link the dynamic C library

Looks like we cannot just copy the .so file and call it from Python. Need to figure out a solution later.
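
One possible route is ctypes, a minimal sketch assuming a compiled shared library with a function of known signature (both the file name and the function are hypothetical):

    import ctypes

    # load a shared library by explicit path (hypothetical file name)
    lib = ctypes.CDLL('./libdescriptor.so')

    # declare the C signature before calling (hypothetical function)
    lib.compute_rdf.argtypes = [ctypes.c_double]
    lib.compute_rdf.restype = ctypes.c_double
    value = lib.compute_rdf(3.2)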

[Question] how to pass the user defined parameters to ml training

@yanxon , as we discussed, we should allow the user to provide their own parameter list for ML training. However, I don't know how to call it from run.py or main.py. Could you please create an example? For instance, I want to pass
n_estimator = 20 for a single run of RF training, and
n_estimator = [10, 50] for a grid search over RF training.
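
A hedged sketch of the two call patterns on the sklearn side (note the sklearn keyword is n_estimators, plural; the tiny random data stands in for our descriptors). How run.py should expose this is still the open question:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import GridSearchCV

    # tiny synthetic stand-in for the descriptor matrix and target
    rng = np.random.RandomState(0)
    X_train, y_train = rng.rand(20, 5), rng.rand(20)

    # single run with a fixed value
    rf = RandomForestRegressor(n_estimators=20).fit(X_train, y_train)

    # grid search over a list of candidate values
    grid = GridSearchCV(RandomForestRegressor(),
                        param_grid={'n_estimators': [10, 50]}, cv=5)
    grid.fit(X_train, y_train)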

PRDF speed

@qzhu2017

Right now the PRDF is consuming a lot of memory and time. Instead of returning the RDF for each element combination, including the elements not in the crystal (these would be zero distributions), I think we should return the integrals of the distribution functions. This would give a single value for each pairwise element combination and would reduce the overhead by a large factor.

Right now, stacking each RDF array at the end is consuming most of the computation time.
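
A minimal sketch of the proposed reduction, assuming each pairwise RDF is a binned distribution on a common r grid (the dict layout is an assumption, not the current data structure):

    import numpy as np

    def rdf_integrals(prdf, r_grid):
        """Collapse each pairwise RDF to one scalar: its integral.

        prdf: dict mapping (elem_a, elem_b) -> RDF array on r_grid
        """
        return {pair: np.trapz(g, r_grid) for pair, g in prdf.items()}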

[ToDo] Rewrite the ADF.py to include DDF as an option

The proposed algorithm:
1, Get the table of neighbors for each atom (based on the criterion of 1.2*sum(radii)).
2, For each i, find the list of j-i-k triplets; if the calc_DDF option is on, for each j-i and i-k, find the lists of l-j-i-k and j-i-k-l.
3, Calculate the angles for (j-i-k) and (l-j-i-k); see the sketch after this list.
4, Error handling: if the list is empty, return all zeros.
5, If there is only one atom in the unit cell, the l-j-i-k list cannot be created; a simple fix is to double the cell.
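
A minimal sketch of the angle part of steps 2-3, assuming Cartesian coordinates for a center atom and its neighbor table (the function and variable names are illustrative):

    import numpy as np

    def angles_jik(center, neighbors):
        """All j-i-k angles (degrees) around one center atom i."""
        angles = []
        for a in range(len(neighbors)):
            for b in range(a + 1, len(neighbors)):
                v1 = neighbors[a] - center
                v2 = neighbors[b] - center
                cos_t = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
                angles.append(np.degrees(np.arccos(np.clip(cos_t, -1.0, 1.0))))
        return np.array(angles)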

Further optimization:
1, Check the scaling of step (1); this may become an issue for large crystals.
2, Check step (2); looping over the table may become slow as well.

[Question] How to export the pre-trained model?

Can anyone investigate how to export a pre-trained model? I wonder if we can save the ML model trained from the calculation and then load it later to directly predict properties of an arbitrary crystal structure.
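
One common option is joblib persistence, a minimal sketch where a trivially fitted sklearn model stands in for our pipeline's estimator:

    import joblib
    from sklearn.linear_model import LinearRegression

    # stand-in for the fitted estimator from our pipeline
    estimator = LinearRegression().fit([[0.0], [1.0]], [0.0, 1.0])

    joblib.dump(estimator, 'pyxtal_ml_model.joblib')   # save to disk

    model = joblib.load('pyxtal_ml_model.joblib')      # reload later
    print(model.predict([[2.0]]))                      # predict on new descriptors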

a bug in BOP

@David-Zagaceta I noticed that there is a bug in the bond order parameter calc.
For NaCl, the first four atoms are equivalent, so they should return the same results. But you see the 2nd number in the qw_series for atoms 1 and 2 is different. Please fix this issue ASAP.

qzhu@cms:/scratch/qzhu/github/PyXtal_ml/pyxtal_ml/descriptors$ python bond_order_params.py -c POSCARs/POSCAR-NaCl 
bond_order_params.py:76: ComplexWarning: Casting complex values to real discards the imaginary part
  return float(np.sum(q1*np.conjugate(q2)))
[[ 4.59381495e-17 -2.81584178e-02  7.72009938e-17  0.00000000e+00
   7.63762616e-01  9.29351343e-02  3.03372715e-16  0.00000000e+00
   3.53553391e-01  1.64507509e-03  2.60540884e-16 -1.44091792e-18
   7.18070331e-01  3.01407518e-02  3.78571193e-16 -1.39700044e-16
   4.11425368e-01  1.52564160e-02  5.99375599e-16  6.03663967e-16
   6.95502666e-01  1.15831953e-02]
 [ 3.53525080e-17  0.00000000e+00  8.17406390e-17  0.00000000e+00
   7.63762616e-01  9.29351343e-02  2.96368233e-16  0.00000000e+00
   3.53553391e-01  1.64507509e-03  2.60638893e-16 -1.02806645e-18
   7.18070331e-01  3.01407518e-02  3.81349846e-16 -1.25248782e-16
   4.11425368e-01  1.52564160e-02  5.99227097e-16  7.22169401e-16
   6.95502666e-01  1.15831953e-02]
 [ 3.53525080e-17  0.00000000e+00  8.03665665e-17  0.00000000e+00
   7.63762616e-01  9.29351343e-02  2.96837331e-16  0.00000000e+00
   3.53553391e-01  1.64507509e-03  2.82115348e-16 -9.72836723e-19
   7.18070331e-01  3.01407518e-02  3.85474658e-16 -1.72989429e-16
   4.11425368e-01  1.52564160e-02  5.98118812e-16  7.82250664e-16
   6.95502666e-01  1.15831953e-02]
 [ 3.53525080e-17  0.00000000e+00  9.05876542e-17  0.00000000e+00
   7.63762616e-01  9.29351343e-02  2.89962676e-16  0.00000000e+00
   3.53553391e-01  1.64507509e-03  2.56941181e-16  0.00000000e+00
   7.18070331e-01  3.01407518e-02  3.73211073e-16 -1.13775913e-16
   4.11425368e-01  1.52564160e-02  5.96878535e-16  6.68980069e-16
   6.95502666e-01  1.15831953e-02]
 [ 3.53525080e-17  0.00000000e+00  7.65647998e-17  0.00000000e+00
   7.63762616e-01  9.29351343e-02  2.99893688e-16  0.00000000e+00
   3.53553391e-01  1.64507509e-03  2.66454730e-16 -3.07906266e-18
   7.18070331e-01  3.01407518e-02  3.91991224e-16 -1.49241280e-16
   4.11425368e-01  1.52564160e-02  6.04876861e-16  6.40434954e-16
   6.95502666e-01  1.15831953e-02]
 [ 3.53525080e-17  0.00000000e+00  8.88474250e-17  0.00000000e+00
   7.63762616e-01  9.29351343e-02  2.95461933e-16  0.00000000e+00
   3.53553391e-01  1.64507509e-03  2.56299165e-16 -1.29741383e-18
   7.18070331e-01  3.01407518e-02  3.67717983e-16 -1.13198929e-16
   4.11425368e-01  1.52564160e-02  5.98187807e-16  6.56140861e-16
   6.95502666e-01  1.15831953e-02]
 [ 3.53525080e-17  0.00000000e+00  7.88597060e-17  0.00000000e+00
   7.63762616e-01  9.29351343e-02  3.00675996e-16  0.00000000e+00
   3.53553391e-01  1.64507509e-03  2.74096457e-16 -1.76790361e-18
   7.18070331e-01  3.01407518e-02  3.77367346e-16 -1.61950379e-16
   4.11425368e-01  1.52564160e-02  5.98796693e-16  7.10296295e-16
   6.95502666e-01  1.15831953e-02]
 [ 4.59381495e-17 -2.81584178e-02  8.43191699e-17  0.00000000e+00
   7.63762616e-01  9.29351343e-02  2.92501848e-16  0.00000000e+00
   3.53553391e-01  1.64507509e-03  2.62248501e-16 -1.81665439e-18
   7.18070331e-01  3.01407518e-02  3.77818589e-16 -1.41862359e-16
   4.11425368e-01  1.52564160e-02  5.96288214e-16  6.17792040e-16
   6.95502666e-01  1.15831953e-02]] (8, 22)

NAN value in bond_order_params class

run: 100%|####################################| 100/100 [03:53<00:00,  2.07s/it]

ML learning with KRR algorithm
Traceback (most recent call last):
  File "main.py", line 27, in <module>
    runner.ml_train(algo=algorithm)
  File "/homes/davidz87/PyXtal_ml/pyxtal_ml/run.py", line 165, in ml_train
    pipeline=self.pipeline, params=self.level)
  File "/homes/davidz87/PyXtal_ml/pyxtal_ml/ml/ml_sklearn.py", line 94, in __init__
    self.ml()
  File "/homes/davidz87/PyXtal_ml/pyxtal_ml/ml/ml_sklearn.py", line 204, in ml
    best_estimator.fit(self.X_train, self.Y_train)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 722, in fit
    self._run_search(evaluate_candidates)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 1191, in _run_search
    evaluate_candidates(ParameterGrid(self.param_grid))
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/model_selection/_search.py", line 711, in evaluate_candidates
    cv.split(X, y, groups)))
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 983, in __call__
    if self.dispatch_one_batch(iterator):
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 825, in dispatch_one_batch
    self._dispatch(tasks)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 782, in _dispatch
    job = self._backend.apply_async(batch, callback=cb)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 182, in apply_async
    result = ImmediateResult(func)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/externals/joblib/_parallel_backends.py", line 545, in __init__
    self.results = batch()
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 261, in __call__
    for func, args, kwargs in self.items]
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/externals/joblib/parallel.py", line 261, in <listcomp>
    for func, args, kwargs in self.items]
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 568, in _fit_and_score
    test_scores = _score(estimator, X_test, y_test, scorer, is_multimetric)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 605, in _score
    return _multimetric_score(estimator, X_test, y_test, scorer)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/model_selection/_validation.py", line 635, in _multimetric_score
    score = scorer(estimator, X_test, y_test)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/metrics/scorer.py", line 228, in _passthrough_scorer
    return estimator.score(*args, **kwargs)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/base.py", line 328, in score
    return r2_score(y, self.predict(X), sample_weight=sample_weight,
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/kernel_ridge.py", line 193, in predict
    K = self._get_kernel(X, self.X_fit_)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/kernel_ridge.py", line 125, in _get_kernel
    filter_params=True, **params)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 1559, in pairwise_kernels
    return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 1069, in _parallel_pairwise
    return func(X, Y, **kwds)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 722, in linear_kernel
    X, Y = check_pairwise_arrays(X, Y)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/metrics/pairwise.py", line 111, in check_pairwise_arrays
    warn_on_dtype=warn_on_dtype, estimator=estimator)
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/utils/validation.py", line 568, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "/homes/davidz87/Python-3.6.5/lib/python3.6/site-packages/sklearn/utils/validation.py", line 56, in _assert_all_finite
    raise ValueError(msg_err.format(type_err, X.dtype))
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I will investigate the source of these values tomorrow. I think they may be coming from the odd-numbered bond order parameters.
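
A quick way to locate the offending entries before they reach sklearn; `features` stands in for the (n_samples, n_features) descriptor matrix from convert_data_1D (here a tiny example with one bad entry):

    import numpy as np

    features = [[0.1, 0.2], [np.nan, 0.4]]
    X = np.asarray(features, dtype=float)
    for r, c in np.argwhere(~np.isfinite(X)):   # (sample, feature) indices
        print('sample {}, feature {}: {}'.format(r, c, X[r, c]))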

[Plan] Add a step of feature selection in ML pipeline

Currently, we have a relatively small array of descriptors compared to the size of the database. However, there may be cases where we have more descriptors than materials data. In that case, it would be good to add a feature-selection step as part of the pipeline.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier

clf = Pipeline([
  ('feature_selection', SelectFromModel(LinearSVC(penalty="l1", dual=False))),  # l1 requires dual=False
  ('classification', RandomForestClassifier())
])
clf.fit(X, y)

Something to be done in the future.

[Example]

I created an example folder to systematically test the performance of different algorithms and descriptors.

example_01

Dataset:
~8000 non-metal materials from the Materials Project.

For the descriptors:
It looks to me that the Chem descriptor plays the dominant role; the addition of other descriptors does not significantly improve the results.

For the algorithms:
Both GradientBoosting and RandomForest perform well; they get better scores in less time.
KNN is very fast, but its r2 value is too small.
KRR takes longer, and its r2 is only slightly better than KNN's.

Of course, all the results are based on the light setting. The GridSearchCV function may improve the performance of KRR and KNN.

We need to expand the pool of datasets, descriptors and ML algorithms; that can help us gain a better understanding of how to optimize the ML training.

Please feel free to add more examples to our repo.

dataset

@yanxon

I noticed a couple of issues in sp_metal_aflow_844.json:

1, It has a lot of entries that return None for 'form_energy_cell'.
2, It would be good to normalize the values when you create the json files; this applies to the extensive properties.

For instance:
DOS: should always be per A^3.
Formation energy: should always be per atom.
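
A hedged sketch of that normalization at json-creation time; apart from 'form_energy_cell', the key names below are assumptions about the aflow dump, not its actual schema:

    def normalize_entry(entry):
        """Divide extensive properties by cell size before storing."""
        n_atoms = entry['natoms']      # assumed key
        volume = entry['volume']       # assumed key, in A^3
        if entry['form_energy_cell'] is not None:
            entry['form_energy_atom'] = entry['form_energy_cell'] / n_atoms
        if entry.get('dos') is not None:
            entry['dos_per_A3'] = entry['dos'] / volume   # assumed DOS field
        return entry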

[Results] Add your results

I just ran a quick test to get a feeling for the accuracy and efficiency of our code at the moment.

Example-01

Please have a look and feel free to add some new results there. By doing this, we can gain more experience with the code.

General guide in programming

I will keep updating these lists.

  • When you import libraries, please avoid the import * statement; otherwise people will be confused about which function is used in the code.

[Bug] main.py is not working

It looks like main.py is no longer working.

Also, I found that the code complains when I increase N_sample from 300 to 800 in torch_main.py.
