cedergrouphub / smol Goto Github PK

View Code? Open in Web Editor NEW

61.0 5.0 14.0 90.11 MB

Statistical Mechanics on Lattices

Home Page: https://cedergrouphub.github.io/smol/

License: Other

Python 95.15% Cython 4.85%

cluster-expansion materials-science monte-carlo statistical-mechanics ising-model lattice-gas

smol's People

Contributors

Stargazers

Watchers

Forkers

lbluque kamronald qchempku2017 juliayang raghurama123 irumeria lijiezhong xig126 standardgalactic 5l1v3r1 leo3701 liaojh123 lnnbig

smol's Issues

Provide a method in smol.cofe.extern.ewald to allow electrostatic correction on CE

Reduce ewald terms into pair interactions, return the corresponding pairs in a form similar to that mentioned in:
#34

Enable user-specified spin flips to canonical MC

Is your feature request related to a problem? Please describe.
There are some examples which require modification to the way spin flips are currently implemented:

Species which can exist on different sublattices should be flipped. For example, Li+ exists in both tetrahedral or octahedral sites. Therefore, Li+ on sublattice 1 (tetrahedral) or sublattice 2 (octahedral) should be allowed to flip between the two. Currently a Li+ on sublattice 1 will only be allowed to flip within sublattice 1.
Mn disproportionation are relevant in some systems. Therefore, Mn3+ and Mn3+ on sublattice 1 should be allowed to flip to Mn2+ and Mn4+ on sublattice 1 (or to another sublattice).

It has been suggested by Gerd to include a feature where a user can specify allowed perturbations since it seems that other perturbations, besides for the two examples above, may want to be enforced.

Describe the solution you'd like
Ideally, user specifies a table of allowed perturbations and a probability by which the perturbation can occur. In example 1, a user could specify: {'Li+': {'sublattice_1': 0.5}, {'sublattice_2': 0.5}} indicating that a Li+ can equally flip into sublattice 1 or sublattice 2. The code will then enforce that such possible flips are picked with equal probability.

Note that detailed balance should be obeyed at all times, which could be either up the user, or the code, to enforce.

Describe alternatives you've considered
@tchen0965 has code which can address the first issue of Li+ spin flips between different sublattices. She may want to submit a PR soon.

A more convenient convenience method to create a sampler!

Is your feature request related to a problem? Please describe.
Please describe the problem you are trying to solve.
Not related to a problem, but I find myself creating to many ensembles that I rarely use directly.
Except for obtaining an initial occupancy, but other than that not really used for anything else.

Describe the solution you'd like
Please describe the desired behavior.
Perhaps having a convenience constructor in the sampler to create directly from a Cluster Expansion like so,
Sampler.from_cluster_expansion(expansion, supercell_matrix, ...all other goodies needed)

Describe alternatives you've considered
Please describe alternative solutions or features you have considered.
Doing so would likely limit how much tweaking a user could do for each of the classes involved (processor, ensemble, kernel, etc) but for those cases one can just fall back to the lengthier more verbose construction of each class individually.

Modify ClusterSubspace to allow intialization from lattice and generaized species

[Code Review] ewald.py

File Name	Module Path	Authors
ewald.py	smol.extern	William Davidson Richard, Luis Barroso-Luque

Code Review

We use code review issues to open up a pinned location to discuss a specific
source file.

Source file summary

Implementation of EwaldTerm class.

Provides the functionality to fit an Ewald term as an additional feature in a
cluster expansion as proposed refs below to improve convergence of expansions
of ionic materials.
Chapter 4.6 of W.D. Richard's thesis.
W. D. Richards, et al., Energy Environ. Sci., 2016, 9, 3272–3278

Use github features when possible please!

Point to specific lines in code

If you are mentioning a set of specific lines in code already implemented, it
may come in handy to add a
permanent link to it.

Format suggested code niceley

def roast(beans, *flavoring_oils):
    for oil in flavoring_oils:
        beans.add_oil(oils)
    return np.trapz(beans.get_heat_vector())

Use other handy features too

Using many of other handy features offered, can help improve discussion.
https://docs.github.com/en

Unit tests for prune method in ClusterExpansion class

I think a pruning test could be a good idea. Could just be something simple like before and after pruning, when the tolerance is small (1e-6 or even 0), the energy difference should be below some reasonable number like 10 meV. Maybe also plus a check that n_bit_orderings is equal to number of ECI + # external terms.

Clusters visualization?

Hey just a small idea for user friendliness:

How about we add a method that visualizes clusters, so that users knows what it is on a projected structure plot? I think this might be very helpful when user wants to analyze short range orderings. They can just type like: some_orbit.visualize() and get a image right away, not having to match the indices of each atom in a cluster, and connect them manually.

Is your feature request related to a problem? Please describe.
Please describe the problem you are trying to solve.

Describe the solution you'd like
Please describe the desired behavior.

Describe alternatives you've considered
Please describe alternative solutions or features you have considered.

New update to ClusterSubspace.from_cutoffs can miss some clusters below cutoff

Expected Behavior

When creating a ClusterSubspace.from_cutoffs all the clusters of given size below the provided cutoff should be obtained.

Current Behavior

The newest update which improved performance of orbit/cluster generation has too tight of a search window such that in some cases it may miss clusters that are within the given cutoff. For example for a system I am currently using setting the cutoff for clusters of size 4 to 3, misses a quadruplet of diameter 2.97. Setting the cutoff slightly higher to 3.1 correctly finds it.

Possible Solution

The optimal point where not too many unnecessary or ideally no clusters are discarded is needed to get the correct behavior.

[Code Review] wrangler.py

File Name	Module Path	Authors
wrangler.py	smol	Luis Barroso-Luque

Code Review

We use code review issues to open up a pinned location to discuss a specific
source file.

Source file summary

Implementation of the StructureWrangler class and functions to obtain fit weights.

Includes functions used to preprocess and check (wrangle) fitting data of
structures and properties.

Use github features when possible please!

Point to specific lines in code

If you are mentioning a set of specific lines in code already implemented, it
may come in handy to add a
permanent link to it.

Format suggested code niceley

def roast(beans, *flavoring_oils):
    for oil in flavoring_oils:
        beans.add_oil(oils)
    return np.trapz(beans.get_heat_vector())

Use other handy features too

Using many of other handy features offered, can help improve discussion.
https://docs.github.com/en

updating name of classes

(1)
path :CederGroupHub/smol/examples/using-monte-carlo.ipynb
modification needed for cell 1:

please update names: "ClusterExpansionProcessor" to CExpansionProcessor and SGCanonicalEnsemble to BaseSemiGrandEnsemble
may missing : from smol.cofe.configspace import EwaldTerm
So, overall, cell 1 can be changed to:

import numpy as np
import json
import matplotlib.pyplot as plt
from pymatgen.io.cif import CifParser
from pymatgen.core.structure import Structure
from pymatgen.transformations.standard_transformations import OrderDisorderedStructureTransformation
from smol.cofe import ClusterExpansion
from smol.moca import CExpansionProcessor, CanonicalEnsemble, MuSemiGrandEnsemble, FuSemiGrandEnsemble, BaseSemiGrandEnsemble
from smol.cofe.configspace import EwaldTerm
(2) In the section "Now Create a Processor"
you can change processor = ClusterExpansionProcessor(ce, sc_matrix) to
processor = CExpansionProcessor(ce, sc_matrix)

(3) in the section "Look at the ensemble current state properties"
you can change print(f'The current step energy is {censemble.energy} eV') to
print(f'The current step energy is {censemble.current_energy} eV')

So , it seems censemble.energy does not exist.

Adding EwaldTerm with ClusterExpansion.from_radii()

I'd like to include the Ewald energy from pymatgen's Ewald summation algorithm and was thinking I could add it when generating the ClusterExpansion object as follows:

`
fitting_data = [(Structure.from_dict(x['s']), x['toten']) for x in json.load(fin)]
EwaldTerms = []

for sData in fitting_data:
EwaldTerms.append(EwaldSummation(sData[0]))

ce = ClusterExpansion.from_radii(structure = prim,
data = fitting_data,
external_terms = [EwaldTerms],
...)
`

However, the input to external_terms throws an error. Any leads or suggestions on how to add it properly?

Thank you in advance!

Move occupancy related tool functions from moca.processor to cofe.clusterspace

Is your feature request related to a problem? Please describe.
Please describe the problem you are trying to solve.

Describe the solution you'd like
Please describe the desired behavior.

Describe alternatives you've considered
Please describe alternative solutions or features you have considered.

Automatic relaxation time detection method in MC Ensembles

Is your feature request related to a problem? Please describe.
For MC runs a production iteration sampling start can be set by the user as something like ensemble.production_start = X. It would be helpful to have a class method that can auto-detect a value for this.

Describe the solution you'd like
A helper method in either the BaseEnsemble or the CanonicalEnsemble class, so that other ensembles simply inherit this. Since there are a few methods to do this we can start with one we think is best or allow users to choose how this is done.

[Code Review] cluster.py

File Name	Module Path	Authors
cluster.py	smol.configspace	Luis Barroso-Luque, William Davidson Richard

Code Review

We use code review issues to open up a pinned location to discuss a specific
source file.

Source file summary

Implementation of the Cluster class.

Represents a group of sites of a given lattice. These are the building blocks
for a cluster basis of functions over configurational space.

Use github features when possible please!

Point to specific lines in code

If you are mentioning a set of specific lines in code already implemented, it
may come in handy to add a
permanent link to it.

Format suggested code niceley

def roast(beans, *flavoring_oils):
    for oil in flavoring_oils:
        beans.add_oil(oils)
    return np.trapz(beans.get_heat_vector())

Use other handy features too

Using many of other handy features offered, can help improve discussion.
https://docs.github.com/en

Provide a method in smol.cofe.config_space.clustersubspace to give a list of all specie occupied, 'bit clusters'

This functionality will be very helpful for ground state solver implementation. For a 1D chain lattice, with each site's site domain = ['A','B','C'], and 2 sites in each sublattice.
Consider a supercell containing 3 adjacent sites ,therefore nbits = [[0,1],[0,1],[0,1]], and the corresponding site Boolean indices are:
[[0,1],[2,3],[4,5]]. If we consider only NN interactions, then all possible 'bit cluster' terms should be (including periodic boundary considition): [[0,2],[0,3],[1,2],[1,3],[2,4],[2,5],[3,4],[3,5],[0,4],[0,5],[1,4],[1,5]].
If we can make a list like this, it will be a lot easier to write down the ground state energy Boolean function.

Obtaining general observables during MC sampling

Is your feature request related to a problem? Please describe.
In analysis of MC sampling we will eventually want access to many observables, for example correlation vectors, the occupation of (certain) clusters, etc.

Describe the solution you'd like
In theory all these things could be calculated from the sampled occupancy strings and/or properties. But if they are already being calculated in the run might as well save them. Here are some thoughts:

Create a general observable like interface that when added can save the observable it represents at each sample. If the information for that is already being computed saving it will have little effect in slowing down a run.
If the computation of an observable is not already done and/or takes a bit of time then we can just define functions that compute these after running monte carlo using the necessary sample information.

Ewald energies and energy differences from EwaldTerm do not match EwaldSummation for charged cells

Expected Behavior

The portions of the Ewald energy (total, real, reciprocal, and point) should match when computed using the "expanded" matrices created with the EwaldTerm or directly with an EwaldSummation for an ordered structure.

For the case of charged cells all should match except the charged cell correction since the "expanded" Ewald structure has the charge of the disordered structure and not the particular ordered structure.

Current Behavior

The computed Ewald interaction for charged cells using the "expanded" ewald matrices created by the EwaldTerm, do not match the corresponding Ewald energies computed directly for a structure using pymatgen.EwaldSummation when the structure is charged.

The real and reciprocal energies do not match, and as a result the total energy (even excluding the charged cell correction) does not match. The point energies do match.

In MC we should decide if it is necessary to add charge corrections when going off charge, because that is currently not being considered.

Everything works well for charge neutral structures.

Possible Solution

Need to find out what is changing the values! Currently I do not know what is causing this.
Need to decide if and how to add the charged cell correction in MC sampling.

Steps to Reproduce

For the real space energy for example:

Create a disordered structure and an ordered structure with the same supercell.
Obtain ewald energy directly from EwaldSummation(ordered structure).real_space_energy
Obtain ewald energy by using EwaldTerm(use_term='real').value_from_occupancy(occu for ordered structure, disordered structure)

[Code Review] orbit.py

File Name	Module Path	Authors
orbit.py	smol.configspace	Luis Barroso-Luque, William Davidson Richard

Code Review

We use code review issues to open up a pinned location to discuss a specific source file.

Source file summary

Implementation of the Orbit class.

A set of symmetrically equivalent (with respect to the given random structure symmetry) clusters.

Use github features when possible please!

Point to specific lines in code

If you are mentioning a set of specific lines in code already implemented, it may come in handy to add a
permanent link to it.

Format suggested code niceley

def roast(beans, *flavoring_oils):
    for oil in flavoring_oils:
        beans.add_oil(oils)
    return np.trapz(beans.get_heat_vector())

Use other handy features too

Using many of other handy features offered, can help improve discussion.
https://docs.github.com/en

AttributeError with EwaldCEProcessor

The error message is:

raise AttributeError('The provided ClusterExpansion must have only'
AttributeError: The provided ClusterExpansion must have onlyone external term being an EwaldTerm

Arises for both ce.prune() and not pruning. CEProcessor(ce, scs_matrix) works fine even though ce here also contains the EwaldTerm.

[Code Review] expansion.py

Make sure that a code review issue for the source file you are about to create
one does not already exist. Please edit the above fields based on the file you
are opening this issue for.

File Name	Module Path	Authors
expansion.py	smol	Luis Barroso-Luque

Code Review

We use code review issues to open up a pinned location to discuss a specific
source file.

Source file summary

This module implements the ClusterExpansion class.

A ClusterExpansion holds the necessary attributes to represent a CE and predict
the property for new structures.

Use github features when possible please!

Point to specific lines in code

If you are mentioning a set of specific lines in code already implemented, it
may come in handy to add a
permanent link to it.

Format suggested code nicely

def roast(beans, *flavoring_oils):
    for oil in flavoring_oils:
        beans.add_oil(oils)
    return np.trapz(beans.get_heat_vector())

Use other handy features too

Using many of other handy features offered, can help improve discussion.
https://docs.github.com/en

[Code Review] canonical.py

File Name	Module Path	Authors
canonical.py	smol.moca.ensembles	Luis Barroso-Luque

Code Review

We use code review issues to open up a pinned location to discuss a specific
source file.

Source file summary

Implementation of a CanonicalEnsemble Class.

Used when running Monte Carlo simulations for fixed number of sites and fixed
concentration of species.

Use github features when possible please!

Point to specific lines in code

If you are mentioning a set of specific lines in code already implemented, it
may come in handy to add a
permanent link to it.

Format suggested code niceley

def roast(beans, *flavoring_oils):
    for oil in flavoring_oils:
        beans.add_oil(oils)
    return np.trapz(beans.get_heat_vector())

Use other handy features too

Using many of other handy features offered, can help improve discussion.
https://docs.github.com/en

[Code Review] basis.py

File Name	Module Path	Authors
basis.py	smol.configspace	Luis Barroso-Luque

Code Review

We use code review issues to open up a pinned location to discuss a specific
source file.

Source file summary

Implementations of SiteBasis classes used to define different basis sets for site function spaces.

The product of single site functions make up a cluster/orbit function used to
obtain correlation vectors. Site function spaces include the basis functions
and measure that defines the inner product for a single site. Most commonly a
uniform measure, but this can be changed to use "concentration" biased bases.

Use github features when possible please!

Point to specific lines in code

If you are mentioning a set of specific lines in code already implemented, it
may come in handy to add a
permanent link to it.

Format suggested code niceley

def roast(beans, *flavoring_oils):
    for oil in flavoring_oils:
        beans.add_oil(oils)
    return np.trapz(beans.get_heat_vector())

Use other handy features too

Using many of other handy features offered, can help improve discussion.
https://docs.github.com/en

Add a more complex system to the synthetic test data

Many of the unit-tests are being ran on a synthetic binary system. It would be much more robust adding a somewhat more complex system such that things like multiple orbit labelings and different site spaces are also being checked in tests.

Failure in reloading trimmed cluster subspace.

If you remove some bit_combos with zero ecis, then you first dumpfn, then reload with loadfn, the number of correlation functions won't match before and after reload

The reason might be that you actually never save the changed orbit.bit_combos in as_dict, and you always re_initialize bit_combos when you re_initialize an orbit object. So for each saved orbit, you have added some previously removed bit_combo (correlation function) back during re-initialization.

I would not really call this as a bug, because people can still redo the trimming process whenever they want to use a cluster expansion with zero eci correlation functions trimmed out. But you know this still causes some inconvenience, and personally I would feel safer and more comfortable if I'm able to save and reload the trimmed cluster expansion directly.

[Code Review] processor.py

Make sure that a code review issue for the source file you are about to create
one does not already exist. Please edit the above fields based on the file you
are opening this issue for.

File Name	Module Path	Authors
processor.py	smol.moca	Luis Barroso-Luque

Code Review

We use code review issues to open up a pinned location to discuss a specific
source file.

Source file summary

Implementation of processor classes for a fixed size super cell.

A processor is optimized to compute correlation vectors and local changes in
correlation vectors. This class allows the use a cluster expansion hamiltonian
to run Monte Carlo based simulations.

Two classes are implemented

CEProcessor - for flips in CE Hamiltonians only
EwaldCEProcessor - for flips in CE Hamiltonians with Ewald electrostatics

Use github features when possible please!

Point to specific lines in code

If you are mentioning a set of specific lines in code already implemented, it
may come in handy to add a
permanent link to it.

Format suggested code nicely

def roast(beans, *flavoring_oils):
    for oil in flavoring_oils:
        beans.add_oil(oils)
    return np.trapz(beans.get_heat_vector())

Use other handy features too

Using many of other handy features offered, can help improve discussion.
https://docs.github.com/en

Traces as numpy structured arrays

Is your feature request related to a problem? Please describe.
Please describe the problem you are trying to solve.

In #154 Trace classes where introduced to generalize sampling values during MC. These classes only allow np.arrays as attributes. It seems this is a good opportunity to use numpy structured arrays instead, since they essentially provide all the functionality needed and will also benefit from the better memory layout,

Describe the solution you'd like
Please describe the desired behavior.

Use numpy structured arrays instead of Trace/StepTrace classes

Enabling orbit (undecorated) hierarchy.

Is your feature request related to a problem? Please describe.
Functionality for bit combo hierarchy (or function decorated clusters) has been added in #106. It would be nice to allow to get orbit hierarchy as well.

Describe the solution you'd like
The ClusterSubspace class should have a method similar to bit_combo_hierarchy that gives a hierarchy of only the undecorated Orbit objects. It will also be helpful if a user can quickly obtain all the sub-orbits (or finer grained sub orbit and bit_combos) for a given orbit (and bit combo).

Describe alternatives you've considered
I would suggest adding functionality to the Orbit class to determine if a given orbit is a sub-orbit of it. This can also be fine-grained to clean up the bit combo hierarchy code, by enable orbits to check directly if a given orbit and bit combo is a subset of one of its bit combos.

Rewrite smol's get_site_domains to allow generalized species

In the old smol.cofe.config_space.utils.get_site_domains, a site domain is generated as a list of strings. Now for more generalized purposes (for example, if we want to consider a molecular fragment as a 'specie', or encode more properties such as magnetization direction into a specie), we may need to return a list of GenealizedSpecie object, rather than just strings.

Other function that uses a site domain as an input should also be modified accordingly.
`def get_site_domains(structure, include_measure=False):

"""Get site domains for sites in a disordered structure.



Helper method to obtain the single site domains for the sites in a

structure. The single site domains are represented by the allowed species

for each site (with an optional measure/concentration for disordered sites)



Vacancies are included in sites where the site element composition does not

sum to 1 (i.e. the total occupation is not 1)



Args:

    structure (Structure):

        Structure to determine site domains from at least some sites should

        be disordered, otherwise there is no point in using this.

    include_measure (bool): (optional)

         To include the site element compositions as the domain measure.



Returns:

    list: Of allowed species for each site if include_measure is False

    Ordereddict: Of allowed species and their measure for each site if

        include_measure is True`

Recent changes to pymatgen Structure.sites_in_sphere break ClusterSubspace

Recent changes to pymatgen Structure break its use in ClusterSubspace to obtain sites in sphere.

Expected Behavior

The new implementations returns a PeriodicNeighbor object and no longer a list of tuples.

Current Behavior

The new single return value is causing an unpacking error.

Possible Solution

At some point just update the unpacking to match new pymatgen signature.

Vacancy class can not be copied correctly

smol.cofe.space.domain.Vacancy can not be copied correctly.
We have a deepcopy method in the cn-sgmc branch, but it's not adapted into the master branch

Expected Behavior

copied Vacancy() should be exactly the same with before.

Current Behavior

from smol.cofe.space.domain import Vacancy
Vacancy().as_dict()
{'@module': 'smol.cofe.space.domain', '@Class': 'Vacancy', 'element': 'A', 'oxidation_state': 0}
Vacancy().copy().as_dict()
Traceback (most recent call last):
File "", line 1, in
File "/home/fengyu_xie/anaconda3/envs/py37/lib/python3.7/site-packages/pymatgen/core/periodic_table.py", line 1475, in getattr
raise AttributeError(a)
AttributeError: copy
from copy import deepcopy
deepcopy(Vacancy()).as_dict()
{'@module': 'pymatgen.core.periodic_table', '@Class': 'DummySpecies', 'element': 'A', 'oxidation_state': 0}

Possible Solution

Add copy and deepcopy methods to Vacancy class.

Steps to Reproduce

Context

Detailed Description

Possible Implementation

[Code Review] sgcanonical.py

File Name	Module Path	Authors
sgcanonical.py	smol.moca.ensembles	Luis Barroso-Luque

Code Review

We use code review issues to open up a pinned location to discuss a specific
source file.

Source file summary

Implementation of Semi-Grand Canonical Ensemble Classes.

These are used to run Monte Carlo sampling for fixed number of sites but
variable concentration of species.

Two classes are different SGC ensembles implemented:

MuSemiGrandEnsemble - for which relative chemical potentials are fixed
FuSemiGrandEnsemble - for which relative fugacity fractions are fixed.

Use github features when possible please!

Point to specific lines in code

If you are mentioning a set of specific lines in code already implemented, it
may come in handy to add a
permanent link to it.

Format suggested code niceley

def roast(beans, *flavoring_oils):
    for oil in flavoring_oils:
        beans.add_oil(oils)
    return np.trapz(beans.get_heat_vector())

Use other handy features too

Using many of other handy features offered, can help improve discussion.
https://docs.github.com/en

Using pymatgen ComputedStructureEntry in StructureWrangler

This is just a half chewed idea...

Would it make sense to use ComputedStructureEntry for items added to StructureWranglers? As a first benefit it punts part of the saving procedures to pymatgen. But additionally could make the code cleaner and allow an easier path of other analysis that can be done with pymatgen entries.

ensemble.processor.occupancy_from_structure not reproducing expected occupancy?

Expected Behavior

struct1 = ensemble.processor.structure_from_occupancy(init_occu)
occu2 = ensemble.processor.occupancy_from_structure(struct1)
print (np.where(init_occu != occu2)) # should return (array([], dtype=int64),)

Current Behavior

Instead, many occupancies are different. init_occu is not being reproduced with occu2:
print (np.where(init_occu != occu2)) # returns (array([ 2, 14, 25, ..., 3447, 3450, 3453]),)

I am using a large cell with 3456 occupancies. There are 1210 occupancies in occu2 which are different from init_occu.

Possible Solution

Is it related to the "#noqa" in the occupancy_from_structure() line? My supercell is big (864 supercell) so maybe that has something to do with this bug?

Steps to Reproduce

I can provide all my mson, MC data files if helpful. The steps I am taking to get this bug are as follows:

Load the ClusterExpansion mson file, and initialize an expansion.
Initialize the ensemble and sc_matrix which is the same as the one used to generate init_occu:
ensemble = CanonicalEnsemble.from_cluster_expansion(expansion, sc_matrix)
Load the occupancy from a MC saved run:
init_occu = mc_data[T]['occupancies'][0]
structure1 = ensemble.processor.structure_from_occupancy(init_occu)
occu2 = ensemble.processor.occupancy_from_structure(structure1)
The issue is that init_occu and occu2 are different.

Context

I would like to analyze occupancies for order parameter calculations during an MC simulation. One example is the Mn-16d occupancy which tells how spinel-like a MC structure is.

I am using the table-swap method, but I don't think using this algorithm should affect the regeneration of the occupancies from structure, and vice versa. I am mystified because I haven't encountered this bug before...

Reloading clustersubspace using orthonormal, does not keep orthonormalization

Orthonormalizing the site basis sets in a clusterspace is not recorded when saving the objects as mson dicts, so when reloaded the basis sets are not orthonormal.

Expected Behavior

The state of a cluster subspace should be saved completely to faithfully recreate the exact same object from a dictionary.

Current Behavior

Since the SiteBasis class is not MSONable, the exact state of the basis sets is not completely saved when saving a ClusterSubspace.

Possible Solution

Either make the SiteBasis class MSOnable and save its full state, or create boolean flags as attributes to keep track of changes, specifically orthonormalizing. As a quick and dirty fix users can simply "remember" if the were using an orthonormal set and then set the basis accordingly (this should NOT be a long term solution and the issue should be fixed in the source code asap):

subspace = ClusterSubpsace.from_dict(d)
subspace.change_site_bases('same basis as before', orthormal=True)

Steps to Reproduce

Create a ClusterSubspace with orthonormal=True
Save it as dict
Load it from dict

from smol.cofe import ClusterSubspace
subspace = ClusterSubspace.from_cutoffs(prim, {2: 6, 3: 4}, basis='sinusoid', orthonormal=True)
subspace1 = ClusterSubspace.from_dict(subspace.as_dict())

assert subspace.basis_orthonormal  # this is good
assert subspace1.basis_orthonormal # this is not good!

Possible Implementation

Make the SiteBasis class MSONable such that it saves the exact values of all the bases arrays and therefore faithfully saves any changes done to it. (Better solution imo)
Add boolean attributes to record changes in state, ie, orbit._basis_orthonormal, and save in in dict, such that when the object are recreated the basis sets are properly normalized.

Add a check of basis sets in unit tests checking as_dict/from_dict of ClusterSubspaces

Save fit information and metrics in ClusterExpansion

Is your feature request related to a problem? Please describe.
After fitting and creating a ClusterExpansion object there is currently no saving of the provenance and quality of the fit. Things like the regression model that was used, the hyper-parameters, and model fit metrics is important information that users may want (and may not remember) at some point in the future.

Describe the solution you'd like
It would be nice to save this information in the ClusterExpansion object. Such that for most practical cases if a user gets a CE object they can easily get the information used in the fit. The main glaring issue here is that this is somewhat heterogeneous information and is not straightforward to create a fixed schema for it.
A list of things to save:

The regression model (Ideally name of python module/object used).
The parameters/hyper-parameters used in the above model.
Fit metrics such as error (in/out sample), CV score, R2, etc.
The feature matrix used in fit (this is already saved).
The property vector used in fit.
Possible test set feature matrix and property vector.

Describe alternatives you've considered
Currently, a CE has a metadata dictionary as an attribute that is meant to save this information. But the user is fully responsible of saving whatever they deem important and most usually this is basically nothing.

One solution would be to implement a "FitData" dataclass that can be optionally passed to the CE constructor, and if not passed a warning can be raised letting the user know they might regret not keeping this information.

smol.cofe.basis.SiteBasis._encode

The _encode function should return a integer for function sinusoid to work ,but currently it's written as:

def _encode(self, specie): """Encode species to another set (i.e. species names to integers).""" return specie

This is not very safe to inputs types. If your imput species are strings, the basis functions will break. I have edited the _encode function in my forked version to solve this:

return self.species.index(specie)

Poor scaling of MC with number of sites

There is a pretty big difference in scaling when the ewald term (EwaldCEProcessor) is used in canonical MC compared to when its not used (CEProcessor). For example, a 12x12x10 supercell of my system (containing ~6000 sites) appears to take ~2.5 seconds per attempted flip when the Ewald term is included but only ~.003 seconds per attempted flip without the Ewald term. It seems that the update to the Ewald term is quite expensive per flip once the Ewald matrix gets very large. However, I'm not yet sure which part of updating the Ewald term is the limiting step.

I'm not entirely sure what a solution to this would look like. Potentially only using the reciprocal space part of the Ewald sum would help. There are also likely a number of instances in literature of people trying to optimize electrostatic/Ewald sum calculations that we can turn to.

[Code Review] base.py

Make sure that a code review issue for the source file you are about to create
one does not already exist. Please edit the above fields based on the file you
are opening this issue for.

File Name	Module Path	Authors
base.py	smol.moca.ensembles	Luis Barroso-Luque

Code Review

We use code review issues to open up a pinned location to discuss a specific
source file.

Source file summary

Abstract base class for Monte Carlo Ensembles.

Use github features when possible please!

Point to specific lines in code

If you are mentioning a set of specific lines in code already implemented, it
may come in handy to add a
permanent link to it.

Format suggested code niceley

def roast(beans, *flavoring_oils):
    for oil in flavoring_oils:
        beans.add_oil(oils)
    return np.trapz(beans.get_heat_vector())

Use other handy features too

Using many of other handy features offered, can help improve discussion.
https://docs.github.com/en

Correlation function hierarchy may be missing or in some cases can add incorrect hierarchy

Expected Behavior

The method ClusterSubspace.bit_combo_hierarchy(self, min_size=2, invert=False) should give a list of lists where each sub list should give the id's of all correlation functions that are subfunctions (or factors) of the one for the corresponding index of the outer list.
i.e. bit_combo_hierarchy()[l] = [i, j, k] where i, j, k are sub corr functions of corr function l.

Current Behavior

Currently the method can miss some correlation functions in systems with only 1 Sublattices with species n > 2. (It will give a weak inclusion constraints). @zhongpc and I looked into this and it seems it is because some fractional coordinates will not be correctly matched.

For systems with more than 1 sublattice that have n > 2 (such as two different ternary sublattices). The method can add additional incorrect hierarchy entries, since it does not distinguish which bit indices correspond to each different sublattice.

Possible Solution

I have implemented a new method ClusterSubspace.function_hierarchy(self, level=1, min_size=2, invert=False), that I think addresses both issues above, but I have not tested it yet.

https://github.com/lbluque/smol/blob/ce4f8697ca55a1aa224935cd7720c558aed0bb56/smol/cofe/space/clusterspace.py#L377-L404

Steps to Reproduce

For the weak inclusion problem simply compare the outputs of both methods for a say a ternary-binary rocksalt (like Li-Mn-Vac + O-F
For the incorrect constraints do the same thing but with a ternary-ternary system.

Unit tests for anneal method in ensemble classes

The simulated annealing method in the ensemble classes is not tested. There is a TODO in the code, but going to go ahead and just create an issue for it.

smol/tests/test_moca/test_canonical.py

Line 111 in 9e4625d

def test_anneal(self):

It's not super straightforward whats the best type of unit test for this, but something is better than nothing.

CompositeProcessor created from dictionary does not have coefs property

When a CompositeProcessor is created using as_dict(), it does not recreate the coefs property. Not sure whether it's a bug or not, but may be helpful if it is more clear how to recreate the CompositeProcessor. I believe this should be easily fixable though.

Expected Behavior

When recreated from CompositeProcessor.from_dict(), the new CompositeProcessor object should have a coefs property, as I believe this is necessary to be used in an ensemble.

Current Behavior

The CompositeProcessor is instantiated with an empty array.

Possible Solution

When created from a dictionary, CompositeProcessor can set its coefficients based on its constituent Processor objects.

Steps to Reproduce

Create a CompositeProcessor
Create a dictionary representation of the CompositeProcessor object (as_dict)
Create a new CompositeProcessor using the dictionary representation (from_dict). The resulting newly created CompositeProcessor does not have any coefficients.

This isn't a super necessary change, as it's easy enough to add the coefs once its apparent that this is the issue, but it's slightly more clear if the coefficients are automatically set based on the coefficients of the CompositeProcessor's individual processors.

Define the feature branches we are working on (Can we transfer this to milestones? I do not have permission)

As a followup of today's discussion. I have a few feature branches suggestions in mind that we probably want to work on in the near feature. Feel free to modify or comment on it. I feel some of them are a bit urgent as we may want to update in next Materials Project meeting or we want to send papers out.

The purpose of this issue is just to plan ahead of time so we can make progress with some directions, also we can prioritize something that is more important. We can also come back to amend and merge some branches later.

Charge neutral feature branch (mentioned by Fengyu already)
Lite version of smol for distribution to other group (This is urgent in my opinion)
Cluster expansion automator branch (Are we developing something independently or maybe just a new module?)
KMC related new features (I suppose Tina and Zinab will work on it)
Sample selection features (Peichen used to work on that, do we want to implement?)
New regularization/compressive sensing features
SRO modules (Bin can work on that)
Others that I forgot to mention

Adding cluster hierarchy as a @property of clustersubspace

Implementing cluster hierarchy as a part of clustersubspace will greatly help with the implementation of hierarchy constraints.

In near future, we will make peichen's L0L2-hierarchy method as default, therefore adding this part to smol is essential.

Sanity check for energy differences in MC

Is your feature request related to a problem? Please describe.
During an attempted MC flip, a difference in the property vector needs to be computed. We can call this dProp.

However, the user should be able to easily check that dProp is close to the true value, which is the CE-predicted final state minus the CE-predicted initial state.

Describe the solution you'd like
Ideally the solution would be some assertion check, which would look like:

assertTrue(np.allclose(dProp, CE.predict(final) - CE.predict(initial), rtol = rtol, atol = atol))

The sanity check should be within some numerical tolerance such that the energy drift over many MC steps should be small.

Describe alternatives you've considered
Alternatives would be user should carry out this test on their own.

Implement cooling functions for simulated annealing

Implement allowing using cooling functions for the simulated annealing method in the ensemble classes.

The option is already there but it has not actually been implemented.

smol/smol/moca/ensembles/canonical.py

Line 103 in 9e4625d

cool_function=None, set_min_occu=True):

Some testing would be necessary too (#26).

missing ECI because of cutoff issue

 max_lp = max(exp_struct.lattice.abc) / 2 
        for size, diameter in sorted(cutoffs.items()):
            new_orbits = []
            neighbors = exp_struct.get_sites_in_sphere([0.5, 0.5, 0.5],
                                                       diameter + max_lp,
                                                       include_index=True)

get neighbors in [0.5,0.5,0.5] with a max_lp can potentially miss the interactions with the centroid. For example, four-anion interactions on a tetrahedra.

To resolve this issue, by changing max_lp = max(exp_struct.lattice.abc) / 2 into max_lp = max(exp_struct.lattice.abc) would work.

This gives exactly the same correlation function as pyabinitio. (More general tests all also welcomed)

[Code Review] clusterspace.py

File Name	Module Path	Authors
clusterspace.py	smol.cofe.configspace	Luis Barroso-Luque, William Davidson Richard

Code Review

We use code review issues to open up a pinned location to discuss a specific
source file.

Source file summary

This file contains the implementation of the ClusterSubspace class.

A ClusterSubspace is necessary to define the terms to be included in a cluster
expansion. A cluster subspace is a finite set of clusters, more precisely
orbits that contain symmetrically equivalent clusters, that are used to define
orbit/cluster basis functions which span a subspace of the total function space
over the configurational space of a given lattice system.

Use github features when possible please!

Point to specific lines in code

If you are mentioning a set of specific lines in code already implemented, it
may come in handy to add a
permanent link to it.

Format suggested code niceley

def roast(beans, *flavoring_oils):
    for oil in flavoring_oils:
        beans.add_oil(oils)
    return np.trapz(beans.get_heat_vector())

Use other handy features too

Using many of other handy features offered, can help improve discussion.
https://docs.github.com/en

Allow using reciprocal space Ewald term only

Is your feature request related to a problem? Please describe.
As mentioned by The Boss on slack, allowing to use only the reciprocal space portion of the ewald summation term to be used as a feature and in subsequent monte carlo sampling could be useful to account only for the long range electrostatic interactions, and hopefully let the CE pair terms take care of the short term ones.

Describe the solution you'd like
Solution should be "simple" to implement (maybe a little more involved to test and be comfortable with results). This involves simply allowing one to use the reciprocal space matrix instead of the total energy matrix for computing the feature. For the monte carlo we just need to make sure we carefully account for the local contributions only.

Loading cluster subpspace with polynomial basis from an mson fails

Summary

When loading a ClusterSubspace created with a polynomial site basis (ie Chebyshev) from a mson serialized dictionary, loading fails when attempting to create the SiteBasis objects in the orbits.

Expected Behavior

The cluster subspace object should be created without any errors.

Current Behavior

An error is thrown when attempting to create the SiteBasis objects in the Orbits. Here is the traceback:

Traceback (most recent call last):
  File "/home/lbluque/Develop/smol/smol/utils.py", line 45, in derived_class_factory
    derived_class = get_subclasses(base_class)[class_name]
KeyError: 'AbstractIterator'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/lbluque/opt/miniconda3/envs/matx_dev/lib/python3.8/site-packages/IPython/core/interactiveshell.py", line 3331, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-5-5e9c1382ac61>", line 1, in <module>
    cs2 = loadfn('test.mson')
  File "/home/lbluque/opt/miniconda3/envs/matx_dev/lib/python3.8/site-packages/monty/serialization.py", line 88, in loadfn
    return json.load(fp, *args, **kwargs)
  File "/home/lbluque/opt/miniconda3/envs/matx_dev/lib/python3.8/json/__init__.py", line 293, in load
    return loads(fp.read(),
  File "/home/lbluque/opt/miniconda3/envs/matx_dev/lib/python3.8/json/__init__.py", line 370, in loads
    return cls(**kw).decode(s)
  File "/home/lbluque/opt/miniconda3/envs/matx_dev/lib/python3.8/site-packages/monty/json.py", line 368, in decode
    return self.process_decoded(d)
  File "/home/lbluque/opt/miniconda3/envs/matx_dev/lib/python3.8/site-packages/monty/json.py", line 340, in process_decoded
    return cls_.from_dict(data)
  File "/home/lbluque/Develop/smol/smol/cofe/space/clusterspace.py", line 866, in from_dict
    orbits = {int(s): [Orbit.from_dict(o) for o in v]
  File "/home/lbluque/Develop/smol/smol/cofe/space/clusterspace.py", line 866, in <dictcomp>
    orbits = {int(s): [Orbit.from_dict(o) for o in v]
  File "/home/lbluque/Develop/smol/smol/cofe/space/clusterspace.py", line 866, in <listcomp>
    orbits = {int(s): [Orbit.from_dict(o) for o in v]
  File "/home/lbluque/Develop/smol/smol/cofe/space/orbit.py", line 345, in from_dict
    site_bases = [basis_factory(*sb_d) for sb_d in site_bases]
  File "/home/lbluque/Develop/smol/smol/cofe/space/orbit.py", line 345, in <listcomp>
    site_bases = [basis_factory(*sb_d) for sb_d in site_bases]
  File "/home/lbluque/Develop/smol/smol/cofe/space/basis.py", line 352, in basis_factory
    basis_funcs = derived_class_factory(iterator_name, BasisIterator, species)
  File "/home/lbluque/Develop/smol/smol/utils.py", line 48, in derived_class_factory
    raise NotImplementedError(f'{class_name} is not implemented.')
NotImplementedError: AbstractIterator is not implemented.

Possible Solution

Seems like the issue is the derived_class_factory in basis_factory is not being able to recursively find the derived class ChebyshevIterator from BasisIterator. Probably need to fix that. The issue does not occur with other basis iterator classes that are derived directly from BasisIterator.

Steps to Reproduce

Create a ClusterSubspace with basis=chebyshev
Save using monty serialization (ie as_dict)
Load the file/dictionary to recreate the ClusterSubspace

from monty.serialization import loadfn, dumpfn
from smol.cofe import ClusterSubspace

subspace = ClusterSubspace.from_cutoffs(prim, {2: 6}, basis='chebyshev')

# This throws error
dumpfn(subspace, 'test.mson')
subspace1 = loadfn('test.mson')

# So does this
subspace1 = ClusterSubspace.from_dict(subspace.as_dict())

Cached supercell orbit mappings can make structurewrangler/clustersubspace json files very large

The cached dictionary for orbit mappings for different supercells can be very very large in terms of file size when saving a ClusterSubspace or a StructureWrangler with a cluster subspace that has seen many different structures and cached their orbit mappings.

This is not a "bug" per say but it can become hard to load large files since python consumes much more memory than the file itself during loading which can lead to crashing python itself.

It may be useful to find a more efficient way to save it, or an option to load or not the cached mappings.