
materialparser's Introduction

MaterialParser

This is an old repo for the text2chem parser. The new one is available at https://github.com/CederGroupHub/text2chem.

Material Parser is a package that provides various functionality to parse material entities and convert them into a chemical structure.

The Material Parser addresses the problem of unification of materials entities widely used in scientific literature.

This tool is built to facilitate and promote text mining research in materials science.

Material Parser functionality includes:

  • converting standard chemical terms into a chemical formula,
  • parsing a chemical formula into a chemical composition,
  • constructing a dictionary of material abbreviations from text snippets,
  • finding the values of stoichiometric and element variables from text snippets,
  • splitting mixtures/composites/alloys/solid solutions into compounds.

Note: Material Parser is not intended to be used for finding material entities in the text or to perform any chemical named entities recognition (NER) in a text. It rather processes and standardizes already extracted chemical (material) entities and extracts relevant chemical information from them.

The parser was tested to work well for inorganic material terms; its performance on organic terms has not been evaluated thoroughly.

Pipeline overview

Material Parser runs as a pipeline of default, pre- and post-processing modules. The default module processes a chemical formula and converts it into a chemical structure. Preprocessing modules take the input material string, separate out the relevant information, and store it in the output chemical structure. Postprocessing modules augment the chemical structure with any other information found in the surrounding text.
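The flow described above can be pictured with a small toy sketch. It is not the package's implementation: `phase_step`, `default_step`, `run_pipeline` and the plain-dict structure are hypothetical names chosen for illustration, and the phase regex only covers labels of the form "P2-".

```python
import re
from typing import Callable, Dict, List

# Hypothetical stand-in for the output chemical structure: a plain dict.
Structure = Dict[str, object]
PreStep = Callable[[str, Structure], str]
PostStep = Callable[[Structure], None]


def phase_step(material: str, structure: Structure) -> str:
    """Toy preprocessing: peel a leading phase label such as 'P2-' off the string."""
    match = re.match(r"^([A-Z]\d)-(.+)$", material)
    if match:
        structure["phase"] = match.group(1)
        return match.group(2)
    return material


def default_step(formula: str, structure: Structure) -> None:
    """Toy default module: only stores the formula; a real parser builds the composition."""
    structure["material_formula"] = formula


def run_pipeline(material: str, pre_steps: List[PreStep], post_steps: List[PostStep]) -> Structure:
    structure: Structure = {"material_string": material, "phase": "", "material_formula": ""}
    current = material
    for step in pre_steps:       # each step strips what it recognizes and passes the rest on
        current = step(current, structure)
    default_step(current, structure)
    for step in post_steps:      # post steps only see the filled structure
        step(structure)
    return structure


result = run_pipeline("P2-Na2/3(CoxNi1/3-xMn2/3)O2", [phase_step], [])
print(result["phase"], result["material_formula"])
```

Each preprocessing step returns the remainder of the string, which is why the order of the steps can change the result.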

The current version of Material Parser package includes the following preprocessing modules:

  • PhaseProcessing - gets information about the material phase if it is present in the string, e.g. for "P2-Na2/3(CoxNi1/3-xMn2/3)O2", "P2" will be separated and stored in the structure while the rest of the string will be sent down the pipeline.
  • AdditivesProcessing - gets information about any additives, such as dopants, stabilizers or mixed compounds, e.g. i) for "(Na0.5K0.5)NbO3 + 1.5 mol% CuF2", "CuF2" will be separated as an additive and "(Na0.5K0.5)NbO3" will be sent down the pipeline; ii) for "Ca2BO3Cl:Sm3+, Eu3+", "Sm3+" and "Eu3+" will be separated and "Ca2BO3Cl" will be sent down the pipeline.
  • ChemicalNameProcessing - attempts to convert a sequence of chemical terms into a formula, e.g. "zinc (II) acetate dihydrate" will be converted into "Zn(CH3COO)2·2H2O", which will be sent down the pipeline, and the chemical terms will be stored in the output structure as the material name. This tool can also separate a chemical formula from the rest of the terms if they are combined by the tokenizer, e.g. "ammonium molybdate (NH4)6Mo7O24⋅4H2O" will be split into "ammonium molybdate" (material name) and "(NH4)6Mo7O24⋅4H2O" (chemical formula), which will be sent down the pipeline.
  • MixtureProcessing - splits a mixture/solid solution/alloy/composite/hydrate into its constituent compounds, e.g. "(1-x-y)BaTiO3-xBaBiO3-y(Bi0.5Na0.5)TiO3" will be split into "BaTiO3", "BaBiO3", and "(Bi0.5Na0.5)TiO3" with the corresponding amounts "(1-x-y)", "x" and "y", respectively.
  • PubchemPreprocessing - looks up chemical terms in the PubChem database if a chemical formula is not found. This step usually slows down the overall pipeline performance.
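To make the second AdditivesProcessing example concrete, here is a minimal sketch of separating dopants written after a colon. `split_dopants` is a hypothetical helper, not a function of the package, and it assumes the simple "host:dopant1, dopant2" notation only.

```python
from typing import List, Tuple


def split_dopants(material: str) -> Tuple[str, List[str]]:
    """Split 'host:dopant1, dopant2' into the host formula and a list of dopants."""
    host, sep, tail = material.partition(":")
    if not sep:
        # No colon: nothing to separate, the whole string goes down the pipeline.
        return material.strip(), []
    dopants = [d.strip() for d in tail.split(",") if d.strip()]
    return host.strip(), dopants


host, dopants = split_dopants("Ca2BO3Cl:Sm3+, Eu3+")
print(host, dopants)
```

Notations such as "+ 1.5 mol% CuF2" or dopants with amounts ("0.08Mn2+") need more than a split on ":" and are not handled by this sketch.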

The postprocessing modules include:

  • SubstituteAdditives - appends dopant elements to the formula to obtain an integer total stoichiometry, or adds a mixture compound to the composition, e.g. i) "Zn1.92-2xYxLixSiO4:0.08Mn2+" becomes "Mn0.08Zn1.92-2xYxLixSiO4" and ii) "(Na0.5K0.5)NbO3 + 1.5 mol% CuF2" becomes "(1-x)(Na0.5K0.5)NbO3-(x)CuF2".

Installation:

git clone https://github.com/CederGroupHub/MaterialParser.git
cd MaterialParser
pip install -r requirements.txt -e .

Running default setup:

By default, Material Parser gets chemical composition from a proper material formula (i.e. no chemical terms, dopants or mixtures):

from material_parser.core.material_parser import MaterialParserBuilder
mp = MaterialParserBuilder().build()
output = mp.parse("Li5+xLa3Ta2-xGexO12")
print(output.to_dict())

Output:

{'additives': [],
 'amounts_x': {'x': []},
 'composition': [{'amount': '1',
                  'elements': OrderedDict([('Li', 'x+5'),
                                           ('La', '3'),
                                           ('Ta', '2-x'),
                                           ('Ge', 'x'),
                                           ('O', '12')]),
                  'formula': 'Li5+xLa3Ta2-xGexO12',
                  'species': OrderedDict([('Li', 'x+5'),
                                          ('La', '3'),
                                          ('Ta', '2-x'),
                                          ('Ge', 'x'),
                                          ('O', '12')])}],
 'elements_x': {},
 'material_formula': 'Li5+xLa3Ta2-xGexO12',
 'material_name': '',
 'material_string': 'Li5+xLa3Ta2-xGexO12',
 'oxygen_deficiency': '',
 'phase': ''}

Other attributes can be filled using additional pipeline blocks if required.

Adding modules to the pipeline:

from material_parser.core.material_parser import MaterialParserBuilder
from material_parser.core.preprocessing_tools.additives_processing import AdditivesProcessing
from material_parser.core.preprocessing_tools.chemical_name_processing import ChemicalNameProcessing
from material_parser.core.preprocessing_tools.mixture_processing import MixtureProcessing
from material_parser.core.preprocessing_tools.phase_processing import PhaseProcessing
from material_parser.core.postprocessing_tools.substitute_additives import SubstituteAdditives

mp = MaterialParserBuilder() \
    .addPreprocessing(AdditivesProcessing()) \
    .addPreprocessing(ChemicalNameProcessing()) \
    .addPreprocessing(PhaseProcessing()) \
    .addPreprocessing(MixtureProcessing())\
    .addPostprocessing(SubstituteAdditives())\
    .build()

output = mp.parse("Nasicon (Na1+x+yZr2-yYySixP3-xO12)")
print(output.to_dict())

Output:

{'additives': [],
 'amounts_x': {'x': [], 'y': []},
 'composition': [{'amount': '1',
                  'elements': OrderedDict([('Na', 'x+y+1'),
                                           ('Zr', '2-y'),
                                           ('Y', 'y'),
                                           ('Si', 'x'),
                                           ('P', '3-x'),
                                           ('O', '12')]),
                  'formula': 'Na1+x+yZr2-yYySixP3-xO12',
                  'species': OrderedDict([('Na', 'x+y+1'),
                                          ('Zr', '2-y'),
                                          ('Y', 'y'),
                                          ('Si', 'x'),
                                          ('P', '3-x'),
                                          ('O', '12')])}],
 'elements_x': {},
 'material_formula': 'Na1+x+yZr2-yYySixP3-xO12',
 'material_name': 'Nasicon',
 'material_string': 'Nasicon (Na1+x+yZr2-yYySixP3-xO12)',
 'oxygen_deficiency': '',
 'phase': ''}

Note: the order of the modules may affect the resulting output.

Running customized pipeline:

Material Parser supports custom pre- and post-processing modules. These are defined by the corresponding interfaces: core/preprocessing_tools/preprocessing_abc.py and core/postprocessing_tools/postprocessing_abc.py. Add the class implementing the interface to the corresponding directory and import it as a regular module.

Output

mp.parse() returns a ChemicalStructure with the following attributes:

  • material_string: string
  • material_name: string
  • material_formula: string
  • additives: list or string
  • phase: string
  • oxygen_deficiency: char
  • amounts_x: Variables
  • elements_x: Variables
  • composition: list of Compound
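Element amounts in the composition are stored as strings that may contain variables (for example 'x+5'). The sketch below shows one possible way to turn them into numbers once a value for the variable is known. `evaluate` is a hypothetical helper, not part of the package; it only supports +, -, *, / and unary minus, and the element dictionary is copied from the 'Li5+xLa3Ta2-xGexO12' example output above.

```python
import ast
import operator
from typing import Dict

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def evaluate(expr: str, values: Dict[str, float]) -> float:
    """Evaluate a simple arithmetic expression such as 'x+5' for given variable values."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return float(node.value)
        if isinstance(node, ast.Name):
            return float(values[node.id])
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))


# Element amounts as they appear in the example output for 'Li5+xLa3Ta2-xGexO12'.
elements = {"Li": "x+5", "La": "3", "Ta": "2-x", "Ge": "x", "O": "12"}
numeric = {el: evaluate(amount, {"x": 0.5}) for el, amount in elements.items()}
print(numeric)
```

Walking the syntax tree instead of calling `eval` keeps the helper restricted to plain arithmetic on the listed variables.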

Citing

If you use Material Parser in your work, please cite the following paper:

  • Kononova et al., "Text-mined dataset of inorganic materials synthesis recipes", Scientific Data 6 (1), 1-11 (2019), doi: 10.1038/s41597-019-0224-1

materialparser's People

Contributors

hhaoyan, olgagkononova, vtshitoyan, zherenwang


materialparser's Issues

composite amounts variable 'b' not distributed properly

This occurs 2 times [1947, 1948].

data['target']['composition']['Y2O3+bCeO2']['elements'] = {'Y': '2.0', 'O': 'b + 5', 'Ce': '1.0'}

This should be {'Y' : '2.0', 'O' : '3.0 + 2.0*b', 'Ce' : 'b'} or perhaps better, Y2O3 and CeO2 should be separated as their own keys in data['target']['composition']
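The expected totals can be checked numerically for a sample value of b. This is only an arithmetic sketch of the expected result, not the parser's code; `combine` is a hypothetical helper and b = 0.5 is an arbitrary choice.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def combine(components: List[Tuple[float, Dict[str, float]]]) -> Dict[str, float]:
    """Sum element counts over compounds, each weighted by its amount."""
    totals: Dict[str, float] = defaultdict(float)
    for amount, elements in components:
        for element, count in elements.items():
            totals[element] += amount * count
    return dict(totals)


b = 0.5  # arbitrary sample value for the variable
totals = combine([
    (1.0, {"Y": 2, "O": 3}),   # 1 * Y2O3
    (b, {"Ce": 1, "O": 2}),    # b * CeO2
])
print(totals)  # Y = 2, O = 3 + 2*b, Ce = b
```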

composite amounts variable '1-x' not distributed properly

This occurs 3 times [376, 7327, 18580]

These all look like composites (1-x)STUFF-x*OTHER_STUFF, but only the (1-x)STUFF is being included in the composition.

1 example

Item 376 (CuO is not included in composition):

data['target']['material_string'] = 1 - x(K0.5Na0.5)(Nb0.995Mn0.005O3)-xCuO
data['target']['composition'] = {'K0.5Na0.5)(Nb0.995Mn0.005O3)-xCuO': {'elements': {'Nb': '-0.995*x', 'Mn': '-0.005*x', 'O': '-3*x'}, 'amount': '-x + 1'}}

Materials with the form A(BC)x

Examples of materials: Ca0.6La0.267Ti0.9(Nb0.05Ga0.05)xO3 or BaTi0.9(Co0.05Nb0.05)xO3 or (BiFeO3)1-x+(PbTiO3)

For the first material, the stoichiometry of Nb should be 0.05*x, but the parser finds 0.05.

Materials Parser - some problems

@OlgaGKononova
cases = {
'Ca2AlMg0.5Si1.5O7x:Eu2+': 'error',
'Ca1-xEuxBi2Ta2O9': '(0)+(1-x)(1)',
'Tb3+solely-dopedNa2Ca3Si2O8': 'should return error',
'Co2xOx-2(OFe)x-3Cox': '((0)+(2x)
(1))+(x)(1) -- (0)+(x)(2))+(x)*(1)'
}

'elements' dictionary is empty

This occurs 29 times (all item numbers provided at the end).

For some compositions, 'elements' = {}.

A couple examples:

Item 18736:
data['target']['composition'] = {'BiFe1-xCrxO3': {'elements': {'Bi': '1.0', 'Fe': '-x + 1', 'Cr': 'x', 'O': '3.0'}, 'amount': '0.7'}, 'BaTiO3(BFOCx-BT)': {'elements': {}, 'amount': '0.3'}}

Item 18500:
data['target']['composition'] = {'Fe2O3': {'elements': {'Fe': '2.0', 'O': '3.0'}, 'amount': 'x'}, 'PbZn1/3Nb2/3O3': {'elements': {'Pb': '1.0', 'Zn': '0.3333', 'Nb': '0.6667', 'O': '3.0'}, 'amount': '0.2'}, 'Pb(Ti05Zr0.5)O3': {'elements': {}, 'amount': '0.8'}}

Occurrences = [199, 200, 1828, 1829, 2123, 2124, 2271, 2273,
                      2414, 2613, 2615, 2723, 2955, 2956, 4858, 5038,
                      5885, 5886, 5887, 8019, 8024, 8877, 11886, 12760, 
                      18211, 18499, 18500, 18735, 18736]
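A simple filter of this kind can locate such entries. It is a sketch that assumes the composition has the dictionary shape shown in the examples above; `empty_element_compounds` is a hypothetical helper name.

```python
from typing import Dict, List


def empty_element_compounds(composition: Dict[str, dict]) -> List[str]:
    """Return the formulas whose 'elements' dictionary is missing or empty."""
    return [formula for formula, entry in composition.items()
            if not entry.get("elements")]


# Composition of item 18736, copied from the example above.
composition = {
    "BiFe1-xCrxO3": {"elements": {"Bi": "1.0", "Fe": "-x + 1", "Cr": "x", "O": "3.0"},
                     "amount": "0.7"},
    "BaTiO3(BFOCx-BT)": {"elements": {}, "amount": "0.3"},
}
print(empty_element_compounds(composition))
```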

delta sometimes appears in 'amount'

This occurs 2 times.

This seems irregular, considering that for all other 'oxygen_deficiency' = True examples there is no delta in 'amount'.

2 examples

Item 441:
data['target']['composition'] = {'(SrCoO3-δ)1-x(Sr2Fe3O6.5±δ)x': {'elements': {'Sr': '-x + 1', 'Co': '-x + 1', 'O': '(x - 1)*(δ - 3)'}, 'amount': '1.0'}}

Item 4866:
data['target']['composition'] = {'(YBa2Cu3O7-δ)1-x': {'elements': {'Y': '-x + 1', 'Ba': '-2*x + 2', 'Cu': '-3*x + 3', 'O': '(x - 1)*(δ - 7)'}, 'amount': '1.0'}}

descriptive string appears in 'amount'

This occurs 6 times [6200, 7304, 10655, 13285, 13654, 13743]

2 examples

Item 6200:
data['target']['composition'] = {'W': {'elements': {'W': '1.0'}, 'amount': 'ultrafine'}, 'Y2O3': {'elements': {'Y': '2.0', 'O': '3.0'}, 'amount': '1.0'}}

Item 7304:
data['target']['composition'] = {'BaCe0.5Zr0.4Y0.1O3': {'elements': {'Ba': '1.0', 'Ce': '0.5', 'Zr': '0.4', 'Y': '0.1', 'O': '3.0'}, 'amount': 'bariumzirconates'}}
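One way to flag such cases is a loose sanity check on the 'amount' string. The rule below is an assumption made for illustration (digits, operators, parentheses and single-letter variables are allowed, any multi-letter word is rejected); `looks_like_amount` is a hypothetical helper, not part of the parser.

```python
import re

# Allowed pieces: digits and dots, a single-letter variable (not followed by
# another letter), arithmetic operators, parentheses and spaces.
_AMOUNT = re.compile(r"(?:[0-9.]|[a-zδ](?![a-z])|[-+*/() ])+")


def looks_like_amount(amount: str) -> bool:
    """Return True if the string looks like a number or a simple expression."""
    return _AMOUNT.fullmatch(amount) is not None


for value in ["1.0", "-x + 1", "ultrafine", "bariumzirconates"]:
    print(value, looks_like_amount(value))
```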

Some materials that are not parsed in data v11

Here is a list of materials that were not parsed during construction of the v11 dataset:

  • Li(Mny2+Fe1-y2+)PO4
  • SrTiO3:Pr,Al,Ga
  • Sr3Gd1-y(BO3)3:yTb3+
  • Ca1-xSnSiO5: Rx3+
  • Na0.67[Mn0.5+yNiyFe0.5-2y]O2
  • Ba5Si8O21:0.02Eu2+,xDy3+
  • Na2Y2Ti3O10:Eu3+,Sm3+
  • Ca3Y(GaO)3(BO3)4:zCe3+,xMn2+,yTb3+
  • Ba[Mg(1-x)/3SnxTa2(1-x)/3]O3
  • Sr2+x-yMgSi2O7+x:yCe3+
  • RE1.98-2xO3:xYb3+,1Er3+
  • (Ba0.6Sr04)TiO3
  • Ba(1- x)Mg(1- y)P2O7:xEu2+,yMn2+
  • Gd2O3:Yb3+/Er3+

Mostly they are due to doping representation and ion representation. But doping information can be equally important as well...

'MaterialParserBuilder' object has no attribute 'addPreprocessing'

Installed this version of Material Parser. However, unable to run the code given in example.py

Getting the error message as follows:

TypeError Traceback (most recent call last)
in
----> 1 mp = (MaterialParserBuilder().add_preprocessing(AdditivesProcessing())
2 .add_preprocessing(ChemicalNameProcessing())
3 .add_preprocessing(PhaseProcessing())
4 .add_preprocessing(MixtureProcessing())
5 .add_postprocessing(SubstituteAdditives())

TypeError: __init__() missing 1 required positional argument: 'regex_parser'

Distinguish oxygen deficiency from oxygen excess

This affects 1802 entries.

Currently, data['target']['oxygen_deficiency'] == True when oxygen is deficient, in excess, or both.

Excess example (item 19390):
data['target']['material_string'] = 'GdBa1-xSrxCo2O5+δ'

Deficiency example (item 19700):
data['target']['material_string'] = 'SrFe0.75Mo0.25O3-δ'

Excess/deficiency example (19040):
data['target']['material_string'] = 'Sr1-1.5xCexTiO3 ± δ'

Functionality request:
Instead of 'oxygen_deficiency' (bool) as key, make new key which indicates the direction of oxygen nonstoichiometry - e.g., 'oxygen_nonstoichiometry' which can be False, 'excess', 'deficient', 'both'
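A rough sketch of what the requested key could return, based only on the sign written before δ after the oxygen subscript. The function name `oxygen_nonstoichiometry` follows the key proposed above; the regex is an assumption and covers only the three notations shown in the examples.

```python
import re
from typing import Union

# 'O', an optional numeric subscript, then the sign in front of δ.
_DELTA = re.compile(r"O[\d.]*\s*([+\-−±])\s*δ")
_LABELS = {"+": "excess", "-": "deficient", "−": "deficient", "±": "both"}


def oxygen_nonstoichiometry(material_string: str) -> Union[str, bool]:
    """Return 'excess', 'deficient', 'both', or False if no δ follows oxygen."""
    match = _DELTA.search(material_string)
    return _LABELS[match.group(1)] if match else False


for s in ["GdBa1-xSrxCo2O5+δ", "SrFe0.75Mo0.25O3-δ", "Sr1-1.5xCexTiO3 ± δ", "BaTiO3"]:
    print(s, oxygen_nonstoichiometry(s))
```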
