
Utility to compile string of chemical terms into data structure with chemical formula and composition

License: MIT License


MaterialParser

This is an old repo for the text2chem parser. The new one is available at https://github.com/CederGroupHub/text2chem.

Material Parser is a package that provides various functionality to parse material entities and convert them into a chemical structure.

The Material Parser addresses the problem of unification of materials entities widely used in scientific literature.

This tool is built to facilitate and promote text-mining research in materials science.

Material Parser functionality includes:

converting standard chemical terms into a chemical formula,
parsing a chemical formula into a chemical composition,
constructing a dictionary of material abbreviations from text snippets,
finding the values of stoichiometric and element variables in text snippets,
splitting mixtures/composites/alloys/solid solutions into compounds.

Note: Material Parser is not intended for finding material entities in text or for performing chemical named entity recognition (NER). Rather, it processes and standardizes already extracted chemical (material) entities and extracts relevant chemical information from them.

The parser was tested to work well for inorganic material terms; its performance on organic terms has not been evaluated thoroughly.

Pipeline overview

Material Parser runs as a pipeline of default, pre- and post-processing modules. The default module processes a chemical formula and converts it into a chemical structure. Preprocessing modules take the input material string, separate the relevant information from it and store that information in the output chemical structure. Postprocessing modules augment the chemical structure with any other information found in the surrounding text.

The current version of Material Parser package includes the following preprocessing modules:

PhaseProcessing - gets information about the material phase if it is present in the string, e.g. for "P2-Na2/3(CoxNi1/3-xMn2/3)O2", "P2" will be separated and stored in the structure while the rest of the string will be sent down the pipeline.
AdditivesProcessing - gets information about any additives, such as dopants, stabilizers or mixed compounds, e.g. i) for "(Na0.5K0.5)NbO3 + 1.5 mol% CuF2", "CuF2" will be separated as an additive and "(Na0.5K0.5)NbO3" will be sent down the pipeline; ii) for "Ca2BO3Cl:Sm3+, Eu3+", "Sm3+" and "Eu3+" will be separated and "Ca2BO3Cl" will be sent down the pipeline.
ChemicalNameProcessing - attempts to convert a sequence of chemical terms into a formula, e.g. "zinc (II) acetate dihydrate" will be converted into "Zn(CH3COO)2·2H2O", which will be sent down the pipeline, and the chemical terms will be stored in the output structure as the material name. This tool can also separate a chemical formula from the rest of the terms if they were combined by the tokenizer, e.g. "ammonium molybdate (NH4)6Mo7O24⋅4H2O" will be split into "ammonium molybdate" (the material name) and "(NH4)6Mo7O24⋅4H2O" (the chemical formula that will be sent down the pipeline).
MixtureProcessing - splits a mixture/solid solution/alloy/composite/hydrate into its constituent compounds, e.g. "(1-x-y)BaTiO3-xBaBiO3-y(Bi0.5Na0.5)TiO3" will be split into "BaTiO3", "BaBiO3", and "(Bi0.5Na0.5)TiO3" with the corresponding amounts "(1-x-y)", "x" and "y", respectively.
PubchemPreprocessing - looks up chemical terms in the PubChem database if a chemical formula is not found. This step usually slows down the overall pipeline performance.
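
To make the first two preprocessing steps concrete, here is a minimal standalone sketch of splitting a phase prefix and a dopant list off a material string. It does not use or reproduce the package's own implementation: split_phase, split_dopants and the regular expression are hypothetical, and they cover only the simple patterns quoted in the list above.

```python
import re

# Hypothetical pattern: a short phase label such as "P2" or "O3", or a single
# Greek letter, followed by a hyphen. The real module may recognize more forms.
PHASE_RE = re.compile(r"^(?P<phase>[A-Z]\d|[\u03b1\u03b2\u03b3\u03b4])-(?P<rest>.+)$")


def split_phase(material: str) -> tuple[str, str]:
    """Return (phase, remainder); phase is '' when no prefix is recognized."""
    match = PHASE_RE.match(material)
    if match is None:
        return "", material
    return match.group("phase"), match.group("rest")


def split_dopants(material: str) -> tuple[str, list[str]]:
    """Split 'host:dopant1, dopant2' into the host formula and a dopant list."""
    host, sep, tail = material.partition(":")
    if not sep:
        return material, []
    dopants = [item.strip() for item in tail.split(",") if item.strip()]
    return host.strip(), dopants


print(split_phase("P2-Na2/3(CoxNi1/3-xMn2/3)O2"))
print(split_dopants("Ca2BO3Cl:Sm3+, Eu3+"))
```

In both cases the first element of the result is what would continue down the pipeline, and the second is what would be stored in the output structure.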

The postprocessing module includes:

Substitute_Additives - appends dopant elements to the formula so that the total stoichiometry is an integer, or adds the mixture compound to the composition, e.g. i) "Zn1.92-2xYxLixSiO4:0.08Mn2+" becomes "Mn0.08Zn1.92-2xYxLixSiO4" and ii) "(Na0.5K0.5)NbO3 + 1.5 mol% CuF2" becomes "(1-x)(Na0.5K0.5)NbO3-(x)CuF2".

Installation:

git clone https://github.com/CederGroupHub/MaterialParser.git
cd MaterialParser
pip install -r requirements.txt -e .

Running default setup:

By default, Material Parser gets chemical composition from a proper material formula (i.e. no chemical terms, dopants or mixtures):

from material_parser.core.material_parser import MaterialParserBuilder
mp = MaterialParserBuilder().build()
output = mp.parse("Li5+xLa3Ta2-xGexO12")
print(output.to_dict())

Output:

{'additives': [],
 'amounts_x': {'x': []},
 'composition': [{'amount': '1',
                  'elements': OrderedDict([('Li', 'x+5'),
                                           ('La', '3'),
                                           ('Ta', '2-x'),
                                           ('Ge', 'x'),
                                           ('O', '12')]),
                  'formula': 'Li5+xLa3Ta2-xGexO12',
                  'species': OrderedDict([('Li', 'x+5'),
                                          ('La', '3'),
                                          ('Ta', '2-x'),
                                          ('Ge', 'x'),
                                          ('O', '12')])}],
 'elements_x': {},
 'material_formula': 'Li5+xLa3Ta2-xGexO12',
 'material_name': '',
 'material_string': 'Li5+xLa3Ta2-xGexO12',
 'oxygen_deficiency': '',
 'phase': ''}

Other attributes can be filled in by adding further pipeline blocks if required.
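
For readers who want a feel for what the default step does, the following toy tokenizer maps each element of a bracket-free formula to its raw stoichiometry string. It is only a sketch under stated assumptions: toy_composition and the small ELEMENTS set are hypothetical, brackets and repeated elements are not handled, and amounts are left as written (for example '5+x') rather than simplified to the 'x+5' form shown in the output above.

```python
from collections import OrderedDict

# Small illustrative subset of element symbols; a real parser needs the
# full periodic table.
ELEMENTS = {"Li", "La", "Ta", "Ge", "O", "Na", "Zr", "Y", "Si", "P"}


def toy_composition(formula: str) -> "OrderedDict[str, str]":
    """Map each element to its raw stoichiometry string, in order of appearance."""
    amounts = OrderedDict()
    current = None
    i = 0
    while i < len(formula):
        two, one = formula[i:i + 2], formula[i:i + 1]
        if two in ELEMENTS:          # prefer two-letter symbols such as "Li"
            symbol, step = two, 2
        elif one in ELEMENTS:
            symbol, step = one, 1
        else:                        # any other character belongs to the amount
            if current is None:
                raise ValueError(f"formula must start with an element: {formula!r}")
            amounts[current] += one
            i += 1
            continue
        if symbol in amounts:
            raise ValueError(f"repeated element {symbol!r} is not handled here")
        amounts[symbol] = ""
        current = symbol
        i += step
    # An element without an explicit amount has stoichiometry 1.
    return OrderedDict((s, a or "1") for s, a in amounts.items())


print(dict(toy_composition("Li5+xLa3Ta2-xGexO12")))
# {'Li': '5+x', 'La': '3', 'Ta': '2-x', 'Ge': 'x', 'O': '12'}
```

Checking symbols against a known element set is what keeps a lowercase variable such as the "x" in "Gex" from being read as part of an element symbol.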

Adding modules to the pipeline:

from material_parser.core.material_parser import MaterialParserBuilder
from material_parser.core.preprocessing_tools.additives_processing import AdditivesProcessing
from material_parser.core.preprocessing_tools.chemical_name_processing import ChemicalNameProcessing
from material_parser.core.preprocessing_tools.mixture_processing import MixtureProcessing
from material_parser.core.preprocessing_tools.phase_processing import PhaseProcessing
from material_parser.core.postprocessing_tools.substitute_additives import SubstituteAdditives

mp = MaterialParserBuilder() \
    .addPreprocessing(AdditivesProcessing()) \
    .addPreprocessing(ChemicalNameProcessing()) \
    .addPreprocessing(PhaseProcessing()) \
    .addPreprocessing(MixtureProcessing())\
    .addPostprocessing(SubstituteAdditives())\
    .build()

output = mp.parse("Nasicon (Na1+x+yZr2-yYySixP3-xO12)")
print(output.to_dict())

Output:

{'additives': [],
 'amounts_x': {'x': [], 'y': []},
 'composition': [{'amount': '1',
                  'elements': OrderedDict([('Na', 'x+y+1'),
                                           ('Zr', '2-y'),
                                           ('Y', 'y'),
                                           ('Si', 'x'),
                                           ('P', '3-x'),
                                           ('O', '12')]),
                  'formula': 'Na1+x+yZr2-yYySixP3-xO12',
                  'species': OrderedDict([('Na', 'x+y+1'),
                                          ('Zr', '2-y'),
                                          ('Y', 'y'),
                                          ('Si', 'x'),
                                          ('P', '3-x'),
                                          ('O', '12')])}],
 'elements_x': {},
 'material_formula': 'Na1+x+yZr2-yYySixP3-xO12',
 'material_name': 'Nasicon',
 'material_string': 'Nasicon (Na1+x+yZr2-yYySixP3-xO12)',
 'oxygen_deficiency': '',
 'phase': ''}

Note: the order of the modules may affect the resulting output.

Running customized pipeline:

Material Parser supports custom pre- and post-processing modules. These are defined by the corresponding interfaces: core/preprocessing_tools/preprocessing_abc.py and core/postprocessing_tools/postprocessing_abc.py. Add a class implementing the interface to the corresponding directory and import it as a regular module.
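
The general shape of such a module can be sketched without the package installed. Everything below is an assumption made for illustration: PreprocessingSketch stands in for the real abstract base class, and the method name process_string, its signature, and the plain dict used as the output structure should all be checked against preprocessing_abc.py before writing a real module.

```python
import re
from abc import ABC, abstractmethod


class PreprocessingSketch(ABC):
    """Hypothetical stand-in for the package's preprocessing interface."""

    @abstractmethod
    def process_string(self, material_string: str, structure: dict) -> tuple:
        """Return (remaining_string, updated_structure)."""


class HydrateProcessing(PreprocessingSketch):
    """Hypothetical module: strips a trailing hydrate such as '·2H2O'."""

    # Accept both the middle dot (U+00B7) and the dot operator (U+22C5).
    _pattern = re.compile(r"[\u00b7\u22c5](\d*)H2O$")

    def process_string(self, material_string, structure):
        match = self._pattern.search(material_string)
        if match is None:
            return material_string, structure
        updated = dict(structure, water=match.group(1) or "1")
        return material_string[:match.start()], updated


def run_pipeline(material_string, modules):
    """Apply each module in order, threading the string and structure through."""
    structure = {}
    for module in modules:
        material_string, structure = module.process_string(material_string, structure)
    return material_string, structure


print(run_pipeline("Zn(CH3COO)2\u00b72H2O", [HydrateProcessing()]))
# ('Zn(CH3COO)2', {'water': '2'})
```

Because each module receives the string left over by the previous one, reordering the list passed to run_pipeline can change the result, which is the same effect the note on module order describes.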

Output

mp.parse() outputs a ChemicalStructure with the following attributes:

material_string: string
material_name: string
material_formula: string
additives: list or string
phase: string
oxygen_deficiency: char
amounts_x: Variables
elements_x: Variables
composition: list of Compound
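
As an illustration of how the dictionary form of this structure might be consumed, the sketch below lists which elements depend on each stoichiometric variable. It assumes the layout printed by to_dict() in the examples earlier in this README; variable_elements is a hypothetical helper, not part of the package, and its substring test is deliberately naive (it suits single-letter variables only).

```python
def variable_elements(parsed: dict) -> dict:
    """Map each variable in 'amounts_x' to the elements whose amount mentions it."""
    found = {name: [] for name in parsed["amounts_x"]}
    for compound in parsed["composition"]:
        for element, amount in compound["elements"].items():
            for name in found:
                if name in amount and element not in found[name]:
                    found[name].append(element)
    return found


# Trimmed copy of the first example output shown in this README.
parsed = {
    "amounts_x": {"x": []},
    "composition": [
        {
            "amount": "1",
            "formula": "Li5+xLa3Ta2-xGexO12",
            "elements": {"Li": "x+5", "La": "3", "Ta": "2-x", "Ge": "x", "O": "12"},
        }
    ],
}

print(variable_elements(parsed))
# {'x': ['Li', 'Ta', 'Ge']}
```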

Citing

If you use Material Parser in your work, please cite the following paper:

Kononova et al., "Text-mined dataset of inorganic materials synthesis recipes", Scientific Data 6 (1), 1-11 (2019), doi:10.1038/s41597-019-0224-1


materialparser's Issues

'MaterialParserBuilder' object has no attribute 'addPreprocessing'

Installed this version of Material Parser. However, unable to run the code given in example.py

Getting the error message as follows:

TypeError Traceback (most recent call last)
in
----> 1 mp = (MaterialParserBuilder().add_preprocessing(AdditivesProcessing())
2 .add_preprocessing(ChemicalNameProcessing())
3 .add_preprocessing(PhaseProcessing())
4 .add_preprocessing(MixtureProcessing())
5 .add_postprocessing(SubstituteAdditives())

TypeError: __init__() missing 1 required positional argument: 'regex_parser'

Formula string not parsed when unicode character \u2212 included

When the Unicode minus sign (\u2212) is included in a formula string, the string is not parsed. Replacing it with a hyphen fixes the issue.
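
Until the parser handles this itself, the workaround can be applied before calling it. A minimal sketch, where normalize_minus is a hypothetical helper name and only U+2212 is mapped because that is the only character reported here:

```python
# Translation table: Unicode minus sign (U+2212) -> ASCII hyphen-minus.
# Other dash-like characters could be added here if they turn up in the data.
_MINUS_TABLE = {0x2212: "-"}


def normalize_minus(formula: str) -> str:
    """Replace the Unicode minus sign with an ASCII hyphen before parsing."""
    return formula.translate(_MINUS_TABLE)


print(normalize_minus("Li5+xLa3Ta2\u2212xGexO12"))
# Li5+xLa3Ta2-xGexO12
```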

Some materials that are not parsed in data v11

Here is a list of materials that were not parsed during the construction of the v11 dataset:

Li(Mny2+Fe1-y2+)PO4
SrTiO3:Pr,Al,Ga
Sr3Gd1-y(BO3)3:yTb3+
Ca1-xSnSiO5: Rx3+
Na0.67[Mn0.5+yNiyFe0.5-2y]O2
Ba5Si8O21:0.02Eu2+,xDy3+
Na2Y2Ti3O10:Eu3+,Sm3+
Ca3Y(GaO)3(BO3)4:zCe3+,xMn2+,yTb3+
Ba[Mg(1-x)/3SnxTa2(1-x)/3]O3
Sr2+x-yMgSi2O7+x:yCe3+
RE1.98-2xO3:xYb3+,1Er3+
(Ba0.6Sr04)TiO3
Ba(1- x)Mg(1- y)P2O7:xEu2+,yMn2+
Gd2O3:Yb3+/Er3+

Mostly they are due to doping representation and ion representation. But doping information can be equally important as well...

descriptive string appears in 'amount'

This occurs 6 times [6200, 7304, 10655, 13285, 13654, 13743]

2 examples

Item 6200:
data['target']['composition'] = {'W': {'elements': {'W': '1.0'}, 'amount': 'ultrafine'}, 'Y2O3': {'elements': {'Y': '2.0', 'O': '3.0'}, 'amount': '1.0'}}

Item 7304:
data['target']['composition'] = {'BaCe0.5Zr0.4Y0.1O3': {'elements': {'Ba': '1.0', 'Ce': '0.5', 'Zr': '0.4', 'Y': '0.1', 'O': '3.0'}, 'amount': 'bariumzirconates'}}

keep or remove H2O from material_formula

remove H2O from formula, keep it in string and as an item in composition

composite amounts variable '1-x' not distributed properly

This occurs 3 times [376, 7327, 18580]

These all look like composites (1-x)STUFF-x*OTHER_STUFF, but only the (1-x)STUFF is being included in the composition.

1 example

Item 376 (CuO is not included in composition):

data['target']['material_string'] = 1 - x(K0.5Na0.5)(Nb0.995Mn0.005O3)-xCuO
data['target']['composition'] = {'K0.5Na0.5)(Nb0.995Mn0.005O3)-xCuO': {'elements': {'Nb': '-0.995*x', 'Mn': '-0.005*x', 'O': '-3*x'}, 'amount': '-x + 1'}}

remove "()"

in recipe_extractor: range is mixed with values

results in always non-empty values for amounts_vars if max/min values are given

Materials with the form A(BC)x

Examples of materials: Ca0.6La0.267Ti0.9(Nb0.05Ga0.05)xO3 or BaTi0.9(Co0.05Nb0.05)xO3 or (BiFeO3)1-x+(PbTiO3)

For the first material, the stoichiometry of Nb should be 0.05*x but the parser finds 0.05.
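
The expected behaviour for a single bracketed group can be sketched as follows. This is not the parser's code: expand_group and both regular expressions are hypothetical, and the sketch assumes numeric amounts inside the brackets and a single lowercase variable or number as the group subscript.

```python
import re

# A bracketed group followed by one lowercase variable or a number.
GROUP_RE = re.compile(r"\((?P<body>[^()]+)\)(?P<mult>[a-z]|\d+(?:\.\d+)?)")
# An element symbol with an optional numeric amount.
ELEMENT_RE = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def expand_group(group: str) -> dict:
    """Distribute the group subscript over every element inside the brackets."""
    match = GROUP_RE.fullmatch(group)
    if match is None:
        raise ValueError(f"not a simple bracketed group: {group!r}")
    multiplier = match.group("mult")
    expanded = {}
    for symbol, amount in ELEMENT_RE.findall(match.group("body")):
        expanded[symbol] = f"{amount or '1'}*{multiplier}"
    return expanded


print(expand_group("(Nb0.05Ga0.05)x"))
# {'Nb': '0.05*x', 'Ga': '0.05*x'}
```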

Materials Parser - some problems

@OlgaGKononova
cases = {
'Ca2AlMg0.5Si1.5O7x:Eu2+': 'error',
'Ca1-xEuxBi2Ta2O9': '(0)+(1-x)(1)',
'Tb3+solely-dopedNa2Ca3Si2O8': 'should return error',
'Co2xOx-2(OFe)x-3Cox': '((0)+(2x)(1))+(x)(1) -- (0)+(x)(2))+(x)*(1)'
}

'elements' dictionary is empty

This occurs 29 times (all item numbers provided at the end).

For some compositions, 'elements' = {}.

A couple examples:

Item 18736:
data['target']['composition'] = {'BiFe1-xCrxO3': {'elements': {'Bi': '1.0', 'Fe': '-x + 1', 'Cr': 'x', 'O': '3.0'}, 'amount': '0.7'}, 'BaTiO3(BFOCx-BT)': {'elements': {}, 'amount': '0.3'}}

Item 18500:
data['target']['composition'] = {'Fe2O3': {'elements': {'Fe': '2.0', 'O': '3.0'}, 'amount': 'x'}, 'PbZn1/3Nb2/3O3': {'elements': {'Pb': '1.0', 'Zn': '0.3333', 'Nb': '0.6667', 'O': '3.0'}, 'amount': '0.2'}, 'Pb(Ti05Zr0.5)O3': {'elements': {}, 'amount': '0.8'}}

Occurrences = [199, 200, 1828, 1829, 2123, 2124, 2271, 2273,
                      2414, 2613, 2615, 2723, 2955, 2956, 4858, 5038,
                      5885, 5886, 5887, 8019, 8024, 8877, 11886, 12760, 
                      18211, 18499, 18500, 18735, 18736]

Distinguish oxygen deficiency from oxygen excess

This affects 1802 entries.

Currently, data['target']['oxygen_deficiency'] == True when oxygen is deficient, in excess, or both.

Excess example (item 19390):
data['target']['material_string'] = 'GdBa1-xSrxCo2O5+δ'

Deficiency example (item 19700):
data['target']['material_string'] = 'SrFe0.75Mo0.25O3-δ'

Excess/deficiency example (19040):
data['target']['material_string'] = 'Sr1-1.5xCexTiO3 ± δ'

Functionality request:
Instead of 'oxygen_deficiency' (bool) as key, make new key which indicates the direction of oxygen nonstoichiometry - e.g., 'oxygen_nonstoichiometry' which can be False, 'excess', 'deficient', 'both'
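
One way the requested key could be computed from the material string is sketched below. The function name and the regular expression are assumptions for illustration; the sketch only looks at the sign directly in front of δ after an oxygen subscript, as in the three examples above.

```python
import re

# Oxygen, an optional numeric subscript, then a sign followed by delta (U+03B4).
# U+00B1 is the plus-minus sign.
_OXYGEN_DELTA_RE = re.compile(r"O\d*(?:\.\d+)?([+\-\u00b1])\u03b4")

_DIRECTION = {"+": "excess", "-": "deficient", "\u00b1": "both"}


def oxygen_nonstoichiometry(material_string: str):
    """Return False, 'excess', 'deficient' or 'both'."""
    compact = material_string.replace(" ", "")
    match = _OXYGEN_DELTA_RE.search(compact)
    if match is None:
        return False
    return _DIRECTION[match.group(1)]


for example in ("GdBa1-xSrxCo2O5+\u03b4", "SrFe0.75Mo0.25O3-\u03b4",
                "Sr1-1.5xCexTiO3 \u00b1 \u03b4", "BaTiO3"):
    print(example, "->", oxygen_nonstoichiometry(example))
```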

composite amounts variable 'b' not distributed properly

This occurs 2 times [1947, 1948].

data['target']['composition']['Y2O3+bCeO2']['elements'] = {'Y': '2.0', 'O': 'b + 5', 'Ce': '1.0'}

This should be {'Y' : '2.0', 'O' : '3.0 + 2.0*b', 'Ce' : 'b'} or perhaps better, Y2O3 and CeO2 should be separated as their own keys in data['target']['composition']

Sort out elements and ions

delta sometimes appears in 'amount'

This occurs 2 times.

This seems irregular, considering that for all other 'oxygen_deficiency' = True examples there is no delta in 'amount'.

2 examples

Item 441:
data['target']['composition'] = {'(SrCoO3-δ)1-x(Sr2Fe3O6.5±δ)x': {'elements': {'Sr': '-x + 1', 'Co': '-x + 1', 'O': '(x - 1)*(δ - 3)'}, 'amount': '1.0'}}

Item 4866:
data['target']['composition'] = {'(YBa2Cu3O7-δ)1-x': {'elements': {'Y': '-x + 1', 'Ba': '-2*x + 2', 'Cu': '-3*x + 3', 'O': '(x - 1)*(δ - 7)'}, 'amount': '1.0'}}
