text-mined-synthesis_public's Introduction

Text-mined Synthesis

In our project on text-mining data from literature, we have built a large dataset of solid-state reactions. Here, we provide our auto-generated open-source dataset of 30,031 chemical reactions retrieved from 95,283 solid-state synthesis paragraphs: text-mined dataset. The data are collected using an automated extraction pipeline (see below) that converts unstructured scientific paragraphs describing inorganic materials synthesis into a so-called “codified recipe” of synthesis. The pipeline uses a variety of text-mining and NLP approaches to find information about target materials, starting compounds, synthesis steps, and conditions in the text, and to process them into a chemical equation.

Intro

This repo contains the code and modules built to create the solid-state reactions dataset. If you find the code and data useful, please cite our papers:

Dataset:

  • Kononova, O., Huo, H., He, T., Rong, Z., Botari, T., Sun, W., Tshitoyan, V. and Ceder, G., 2019. Text-mined dataset of inorganic materials synthesis recipes. Scientific Data, 6, p.203.

Paragraphs classification:

  • Huo, H., Rong, Z., Kononova, O., Sun, W., Botari, T., He, T., Tshitoyan, V. and Ceder, G., 2019. Semi-supervised machine-learning classification of materials synthesis procedures. npj Computational Materials, 5(1), p.62.

Materials Entity Recognition (MER):

  • He, T., Sun, W., Huo, H., Kononova, O., Rong, Z., Tshitoyan, V., Botari, T. and Ceder, G., 2020. Similarity of Precursors in Solid-State Synthesis as Text-Mined from Scientific Literature. Chemistry of Materials, 32(18), pp.7861-7873.

Versions

  • [2020-07-13] Updated dataset: 31,782 solid-state reactions and 9,518 sol-gel precursor synthesis reactions. The updated data schema is in dataset_typing.py.
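
The dataset releases named in the issues on this page are xz-compressed JSON files (for example solid-state_dataset_20200713.json.xz). A minimal sketch for opening such a file with Python's standard library follows. The helper names `load_xz_json` and `summarize_top_level` are illustrative and not part of this repository, and the top-level layout of the file is not described here, so the sketch only reports what it finds; dataset_typing.py remains the reference for the actual schema.

```python
import json
import lzma


def load_xz_json(path):
    """Read an xz-compressed JSON file and return the parsed object."""
    # lzma.open in text mode decompresses on the fly, so the file
    # never has to be unpacked to disk first.
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def summarize_top_level(data):
    """Describe the outermost JSON container without assuming a schema."""
    if isinstance(data, dict):
        return "dict with keys: " + ", ".join(sorted(data))
    if isinstance(data, list):
        return "list with %d items" % len(data)
    return type(data).__name__
```

Calling `summarize_top_level(load_xz_json(path))`, with `path` pointing at a downloaded release file, is a quick way to see whether the records sit in a top-level list or under a dictionary key before writing any further processing code.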

Getting help

If you have questions about the project, please submit an issue or contact us ([email protected]). Thanks!

text-mined-synthesis_public's People

Contributors

hhaoyan, olgagkononova, zhugeyicixin

text-mined-synthesis_public's Issues

MER install problem

Hi, I get an error when I install Materials Entity Recognition (MER) using 'git clone [email protected]:CederGroupHub/MatEntityRecognition.git':

ERROR: Repository not found.
fatal: Could not read from remote repository.

Do you have any idea what is wrong?
Thanks!

Full script for recipe extraction

Hi,

Is there a script for running the full pipeline? Specifically, I am interested in a script that takes an HTML/XML/PDF article and outputs the recipe extracted from it in JSON format, like the ones in the dataset. If not, could you provide instructions on exactly what to run to do this?

Problem with MER

When I try to import MatRecognition from material_entity_recognition, I get the error "zsh: illegal hardware instruction python3"

I have installed Git LFS successfully and installed all the dependencies as described in the README file.

Can anyone please help?

Thanks

OSError: [E053] Could not read config file from \OperationsExtraction\operations_extractor\models\SpaCy_updated_v1.model\config.cfg

from operations_extractor import OperationsExtractor

w2v_model = 'models/w2v_embeddings_lemmas_v3'
classifier_model = 'models/fnn-model-1_7classes_dense32_perSentence_3'
spacy_model = 'models/SpaCy_updated_v1.model'

OC = OperationsExtractor(w2v_model, classifier_model, spacy_model)

I ran the code above, but it reports that it cannot find config.cfg for the spaCy model:

OSError: [E053] Could not read config file from \OperationsExtraction\operations_extractor\models\SpaCy_updated_v1.model\config.cfg

Data request

Hey,

Thanks for the great project.
Is it possible to share the original data? I mean not the automatically generated data (the JSON files) but the original text used for learning.
In the shared JSON data, the paragraph_string field includes only the first and last 50 characters of the original text. Is it possible to get the full paragraph text?

Again, thank you very much for the great effort.

The full corpus is not provided?

Hi,
I would like to use your corpus to challenge the extraction of the recipes. However, when I checked the contents of the json files (extracted from solid-state_dataset_2019-12-03.json.xz and solid-state_dataset_20200713.json.xz), the text of the paragraph is omitted, as shown below in "<...>".

Is the full dataset published?

    {"token": "repelleted", "type": "ShapingOperation", "conditions": null}],
  "paragraph_string": "All materials were obtained from Aldrich Chemicals<...>d repelleted after each 24 or 48 h heating period."
},
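
As a small illustration of the point above, the elided form of paragraph_string can be detected by looking for the "<...>" marker. This is only a sketch based on the record shown here: `ELISION_MARKER` and `split_elided` are hypothetical names, and it assumes the marker appears at most once per string.

```python
ELISION_MARKER = "<...>"


def split_elided(paragraph_string):
    """Return (head, tail) around the elision marker, or None if absent."""
    # str.partition splits on the first occurrence only; when the marker
    # is missing, the middle element comes back as an empty string.
    head, marker, tail = paragraph_string.partition(ELISION_MARKER)
    if not marker:
        return None
    return head, tail


example = (
    "All materials were obtained from Aldrich Chemicals"
    "<...>d repelleted after each 24 or 48 h heating period."
)
parts = split_elided(example)
```

For the record shown above, both parts are 50 characters long, which matches the "first and last 50 characters" described in the earlier data request.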

[Bug]: OSError: [E053] Could not read config file from <path>\OperationsExtraction\operations_extractor\models\SpaCy_updated_v1.model\config.cfg

Email (Optional)

No response

Version

not sure

Which OS(es) are you using?

  • MacOS
  • Windows
  • Linux

What happened?

from operations_extractor import OperationsExtractor

w2v_model = 'models/w2v_embeddings_lemmas_v3'
classifier_model = 'models/fnn-model-1_7classes_dense32_perSentence_3'
spacy_model = 'models/SpaCy_updated_v1.model'

OC = OperationsExtractor(w2v_model, classifier_model, spacy_model)

I ran the code above, but it reports that it cannot find config.cfg for the spaCy model:

OSError: [E053] Could not read config file from \OperationsExtraction\operations_extractor\models\SpaCy_updated_v1.model\config.cfg

Code snippet

from operations_extractor import OperationsExtractor

w2v_model = 'models/w2v_embeddings_lemmas_v3'
classifier_model = 'models/fnn-model-1_7classes_dense32_perSentence_3'
spacy_model = 'models/SpaCy_updated_v1.model'

OC = OperationsExtractor(w2v_model, classifier_model, spacy_model)

Log output

Operations Extractor v2.9
---------------------------------------------------------------------------
OSError                                   Traceback (most recent call last)
Cell In[12], line 5
      2 classifier_model = 'models/fnn-model-1_7classes_dense32_perSentence_3'
      3 spacy_model = 'models/SpaCy_updated_v1.model'
----> 5 OC = OperationsExtractor(w2v_model, classifier_model, spacy_model)

File D:\conda_envs\Text-Mined dataset Paper\text-mined-synthesis_public\MaterialParser\OperationsExtraction\operations_extractor\operations_extractor.py:120, in OperationsExtractor.__init__(self, w2v_model, classifier_model, spacy_model)
    117 print("Operations Extractor v2.9")
    119 my_folder = os.path.dirname(os.path.realpath(__file__))
--> 120 self.__nlp = spacy.load(os.path.join(my_folder, spacy_model))
    121 self.__embeddings = Word2Vec.load(os.path.join(my_folder, w2v_model))
    122 self.__model = keras.models.load_model(os.path.join(my_folder, classifier_model))

File ~\anaconda3\envs\MaterialParser\Lib\site-packages\spacy\__init__.py:51, in load(name, vocab, disable, enable, exclude, config)
     27 def load(
     28     name: Union[str, Path],
     29     *,
   (...)
     34     config: Union[Dict[str, Any], Config] = util.SimpleFrozenDict(),
     35 ) -> Language:
     36     """Load a spaCy model from an installed package or a local path.
     37 
     38     name (str): Package name or model path.
   (...)
     49     RETURNS (Language): The loaded nlp object.
     50     """
---> 51     return util.load_model(
     52         name,
     53         vocab=vocab,
     54         disable=disable,
     55         enable=enable,
     56         exclude=exclude,
     57         config=config,
     58     )

File ~\anaconda3\envs\MaterialParser\Lib\site-packages\spacy\util.py:467, in load_model(name, vocab, disable, enable, exclude, config)
    465         return load_model_from_package(name, **kwargs)  # type: ignore[arg-type]
    466     if Path(name).exists():  # path to model data directory
--> 467         return load_model_from_path(Path(name), **kwargs)  # type: ignore[arg-type]
    468 elif hasattr(name, "exists"):  # Path or Path-like to model data
    469     return load_model_from_path(name, **kwargs)  # type: ignore[arg-type]

File ~\anaconda3\envs\MaterialParser\Lib\site-packages\spacy\util.py:538, in load_model_from_path(model_path, meta, vocab, disable, enable, exclude, config)
    536 config_path = model_path / "config.cfg"
    537 overrides = dict_to_dot(config, for_overrides=True)
--> 538 config = load_config(config_path, overrides=overrides)
    539 nlp = load_model_from_config(
    540     config,
    541     vocab=vocab,
   (...)
    545     meta=meta,
    546 )
    547 return nlp.from_disk(model_path, exclude=exclude, overrides=overrides)

File ~\anaconda3\envs\MaterialParser\Lib\site-packages\spacy\util.py:714, in load_config(path, overrides, interpolate)
    712 else:
    713     if not config_path or not config_path.is_file():
--> 714         raise IOError(Errors.E053.format(path=config_path, name="config file"))
    715     return config.from_disk(
    716         config_path, overrides=overrides, interpolate=interpolate
    717     )

OSError: [E053] Could not read config file from D:\conda_envs\Text-Mined dataset Paper\text-mined-synthesis_public\MaterialParser\OperationsExtraction\operations_extractor\models\SpaCy_updated_v1.model\config.cfg
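
The traceback above shows spaCy building the path `model_path / "config.cfg"` and raising E053 because that file does not exist, even though the model directory itself was found. One possible explanation, stated here as an assumption rather than a confirmed diagnosis, is that the model was saved with spaCy 2.x (which writes `meta.json` but no `config.cfg`) while the environment has spaCy 3.x installed. An incomplete download of the model files would be another. The sketch below only inspects the directory; `describe_spacy_model_dir` and its return labels are hypothetical and belong to neither spaCy nor this repository.

```python
from pathlib import Path


def describe_spacy_model_dir(model_dir):
    """Classify a model directory by the metadata files it contains."""
    model_dir = Path(model_dir)
    if not model_dir.is_dir():
        return "missing"      # the directory itself is absent
    if (model_dir / "config.cfg").is_file():
        return "has-config"   # the file the traceback is looking for
    if (model_dir / "meta.json").is_file():
        return "meta-only"    # metadata present, but no config.cfg
    return "unknown"          # directory exists, neither file found
```

Running it on the `models/SpaCy_updated_v1.model` directory inside `operations_extractor` narrows things down: a result of "meta-only" would be consistent with the version-mismatch assumption, in which case trying a spaCy 2.x release in a separate environment may be worth a test.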

Code of Conduct

  • I agree to follow this project's Code of Conduct

Problem with MER

I would like to ask for your advice about MER.
Following the steps in the README, I tried to run the example in the test folder and encountered this error:
File "D:\aconcon\envs\txmine\lib\site-packages\keras\optimizers\optimizer_experimental\optimizer.py", line 1151, in weight_decay_fn
wd = tf.cast(self.weight_decay, variable.dtype)
Node: 'Cast_1'
Cast string to float is not supported
[[{{node Cast_1}}]] [Op:__inference_train_function_52474]
