
aimmdb's Introduction

AIMMDB

AIMMDB is a data access and search tool for multimodal scientific data built on top of tiled. Currently, aimmdb is focused on X-ray absorption spectroscopy (XAS) data.

Examples

Code examples for interacting with the database are maintained in the examples folder of this repository.

Funding acknowledgement

This research is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Basic Energy Sciences, under Award Number FWP PS-030. This research used resources of the Center for Functional Nanomaterials (CFN), which is a U.S. Department of Energy Office of Science User Facility, at Brookhaven National Laboratory under Contract No. DE-SC0012704.

Disclaimer

The Software resulted from work developed under a U.S. Government Contract No. DE-SC0012704 and are subject to the following terms: the U.S. Government is granted for itself and others acting on its behalf a paid-up, nonexclusive, irrevocable worldwide license in this computer software and data to reproduce, prepare derivative works, and perform publicly and display publicly.

THE SOFTWARE IS SUPPLIED "AS IS" WITHOUT WARRANTY OF ANY KIND. THE UNITED STATES, THE UNITED STATES DEPARTMENT OF ENERGY, AND THEIR EMPLOYEES: (1) DISCLAIM ANY WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE OR NON-INFRINGEMENT, (2) DO NOT ASSUME ANY LEGAL LIABILITY OR RESPONSIBILITY FOR THE ACCURACY, COMPLETENESS, OR USEFULNESS OF THE SOFTWARE, (3) DO NOT REPRESENT THAT USE OF THE SOFTWARE WOULD NOT INFRINGE PRIVATELY OWNED RIGHTS, (4) DO NOT WARRANT THAT THE SOFTWARE WILL FUNCTION UNINTERRUPTED, THAT IT IS ERROR-FREE OR THAT ANY ERRORS WILL BE CORRECTED.

IN NO EVENT SHALL THE UNITED STATES, THE UNITED STATES DEPARTMENT OF ENERGY, OR THEIR EMPLOYEES BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, CONSEQUENTIAL, SPECIAL OR PUNITIVE DAMAGES OF ANY KIND OR NATURE RESULTING FROM EXERCISE OF THIS LICENSE AGREEMENT OR THE USE OF THE SOFTWARE.

aimmdb's People

Contributors

charlesc30, danielballan, dylanmcreynolds, estavitski, jmaruland, kleinhenz, matthewcarbone, runboj, zleung9


aimmdb's Issues

Add averaging operator to postprocessing

Eli's next request is to add an operator to postprocessing for averaging data from multiple aimmdb entries. This operator should take an arbitrary number of entries as input and return a new entry with the averaged data, and metadata that points to the original unaveraged data.

@x94carbone I noticed all the currently defined operators use the UnaryOperator class. Is there any abstraction defined for operators that take multiple inputs?
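
Since the multi-input abstraction is still an open question, here is a minimal sketch of just the computation, assuming each entry is a (DataFrame, metadata) pair whose energy grids are already aligned; the "parents" metadata key is hypothetical:

import pandas as pd


def average_entries(entries):
    """Average mu over multiple aimmdb entries.

    entries: list of (DataFrame, metadata) pairs. Assumes every DataFrame
    has 'energy' and 'mu' columns already interpolated onto the same grid.
    """
    dfs = [df for df, _ in entries]
    averaged = pd.DataFrame({
        "energy": dfs[0]["energy"],
        "mu": pd.concat([df["mu"] for df in dfs], axis=1).mean(axis=1),
    })
    # Point back at the original, unaveraged entries (key name is hypothetical).
    metadata = {"parents": [md["uid"] for _, md in entries]}
    return averaged, metadata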

Rewrite tests

We want to do two things:

  • Update the unit tests in the CI pipeline
  • Figure out why the current tests are failing

Ingest NCM XAS simulations

This is item 5i in the Google doc. Data owner is Yiming.

Data comprises TM K-edge (from Yiming), TM L-edge, and O K-edge (from Haili).

There are two kinds of data here:

  • the simulated structure (chi file)
  • the simulated spectrum

For a given simulated structure, there may be multiple simulated spectra; that is, there is a one-to-many relationship between structure and spectra. We can encode this association in the metadata of each spectrum: {"structure": <uid of structure>}. Later, this would be an ideal use case for references, but we don't have that feature available in aimmdb yet.
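
A hedged sketch of what tagging a spectrum with its parent structure could look like, assuming the target node is writable and supports tiled's write_dataframe (the node path, column values, and structure uid below are all illustrative):

import pandas as pd
from tiled.client import from_uri

client = from_uri("https://aimm.lbl.gov/api")

spectrum_df = pd.DataFrame({"energy": [8330.0, 8340.0], "mu": [0.1, 0.9]})  # dummy values
structure_uid = "abc123"  # uid of the already-ingested simulated structure

# The "structure" key links this spectrum back to its parent structure entry.
client["dataset"]["nmc"].write_dataframe(
    spectrum_df,
    metadata={"structure": structure_uid},
)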

Upgrade aimmdb to new tiled

AIMMDB is based on a pretty old version of tiled, where everything was based on and stored in Mongo. Tiled has since progressed to use SQLite or PostgreSQL. We need to upgrade aimmdb to a more modern version of tiled, which should be a single effort that also lets us remove much of the custom code that was built into aimmdb along the way.

Some functionality will be lost in exchange for much better maintainability:

  • This will no longer support the key/value browsing functionality.
  • We would lose, I think, the data validation logic? Can this easily be preserved?

I think the work looks something like:

  • Change Spin from using the docker image from this repo to using the standard tiled image
  • Change configuration yaml, adding snippet below
  • Reload data from the client

example config.yaml addition:

trees:
  - path: /
    tree: catalog
    args:
      uri: /data/aimm/catalog.db # change to postgres
      writable_storage: /data/aimm/data
    access_control:
      access_policy: tiled.access_policies:SimpleAccessPolicy
      args:
        provider: orcid
        access_lists:
          <an orcid> : tiled.access_policies:ALL_ACCESS 
          <another orcid> : tiled.access_policies:ALL_ACCESS 

Refactor schemas

Given we're now working on different types of schemas (we have an NMC one, we're developing a FEFF one, and I'm hoping to develop more in the future), does it make sense to refactor schemas.py into a module itself? E.g.

schemas/
|---- base.py
|---- xas.py
|---- nmc.py
|---- feff.py
|---- ...

Where base.py contains GenericDocument, xas.py inherits from base.py, and the others inherit from xas.py?
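
To make the idea concrete, a rough sketch of the inheritance chain, assuming the schemas remain pydantic models; the shape of GenericDocument and the class names below are illustrative, not the actual aimmdb definitions:

# Sketch of the inheritance chain only; in the refactor each class would live
# in its own file under schemas/ (base.py, xas.py, nmc.py, feff.py).
import pydantic


class GenericDocument(pydantic.BaseModel):  # schemas/base.py
    uid: str
    metadata: dict


class XASDocument(GenericDocument):  # schemas/xas.py
    pass  # XAS-specific fields and validators


class NMCDocument(XASDocument):  # schemas/nmc.py
    pass  # NMC-specific fields


class FEFFDocument(XASDocument):  # schemas/feff.py
    pass  # FEFF-specific fields (see the FEFF schema issue)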

@danielballan @jmaruland thoughts?

If dataset (or value in general) does not exist, no warning provided

@kleinhenz I have come across an interesting quirk when using the client.

from tiled.client import from_uri
CLIENT = from_uri("https://aimm.lbl.gov/api")

CLIENT["dataset"]["mmc"]
# <Node {'element', 'uid', 'sample', 'edge'}>
# As expected

CLIENT["dataset"]["dddd"]
# <Node {'element', 'uid', 'sample', 'edge'}>
# ?

Similarly,

CLIENT["dataset"]["nmc"]["edge"]["of-tomorrow"]
# <Node {'element', 'uid', 'sample'}>

I think it might be prudent to warn the user when they've queried a key-value pair that doesn't make sense. What do you think?

Also, is this a possible tiled issue or an aimmdb one?
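
As a client-side workaround in the meantime, one option is to check a key against the node's actual children before descending, since indexing alone apparently does not raise (this assumes iterating the node lists its valid keys):

from tiled.client import from_uri

CLIENT = from_uri("https://aimm.lbl.gov/api")

key = "dddd"
valid_keys = list(CLIENT["dataset"])  # enumerate the children of this node
if key not in valid_keys:
    raise KeyError(f"{key!r} is not a valid dataset")
node = CLIENT["dataset"][key]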

Pin deployment to tiled v0.1.0a74

I believe (but have not verified) that the aimmdb deployment on Spin tracks tiled main, using the latest container. This can be pragmatic when a project is moving quickly, but it's not generally good practice. We should pin aimmdb to a specific version of tiled and update intentionally, with coordinated changes to aimmdb.

Deploy Prometheus and Grafana containers

Today the AIMM server became unusable because of a transient database problem on Spin. (Or so it seems: we do not have thorough evidence for this.)

At NSLS-II we deploy Prometheus and Grafana to monitor availability and performance. We receive notifications (alarms) when there are outages. It would be useful to grow a historical record of aimmdb uptime and to know proactively when it is down.

Separate GraphQL Service

Some of the challenges to introducing GraphQL to the AIMM project include:

  • Many of the datasets in AIMM do not contain metadata sufficiently rich to justify building a full GraphQL implementation for them
  • GraphQL tools favor having a single endpoint for downloading the schema and issuing queries. Making a single endpoint for all of the experimental and synthetic data in AIMMDB would involve a lot of work at startup
  • GraphQL is not appropriate for downloading datasets, at least compared to tiled. GraphQL in this context is appropriate for finding and returning metadata

At the tooling update call, @danielballan brought up the idea of a middle ground that might allow us to demonstrate the utility of GraphQL. Some of the data sets in the AIMMDB (especially synthetic) do have rich metadata. We could:

  • Create a separate GraphQL service specifically for that dataset.
  • Have a GraphQL schema for just those datasets
  • The return fields can return a tiled URL to each dataset, letting client code be responsible for doing the data fetching in tiled and outside of GraphQL

This seems like a nice middle ground that builds a GraphQL implementation that could be useful for users trying to query the datasets that do have rich metadata structures. If this design proves useful, we could consider it for other datasets that have richer metadata structures. It also allows us to do this without force-fitting GraphQL into tiled or AIMMDB code directly.

I'm submitting this and assigning to @kleinhenz for consideration and comment.

Replace the "normalization" scheme with Larch

At Eli's request we will be replacing our postprocessing operators used for normalization with the scheme that is used in Larch. For example,

import numpy as np

from larch import Group as xafsgroup
from larch.xafs import pre_edge, autobk, mback, xftf
from larch import Interpreter

_larch = Interpreter()

# mu and energy are the raw absorption and energy arrays for a single entry
a = xafsgroup()
a.mu = np.array(mu)
a.energy = np.array(energy)
pre_edge(a, group=a, _larch=_larch)

def flatten(group):
    step_index = int(np.argwhere(group.energy > group.e0)[0])
    zeros = np.zeros(step_index)
    ones = np.ones(group.energy.shape[0] - step_index)
    step = np.concatenate((zeros, ones), axis=0)
    diffline = (group.post_edge - group.pre_edge) / group.edge_step
    group.flat = group.norm + step * (1 - diffline)

flatten(a)
a.energy  # x-axis
a.flat    # y-axis

@zleung9 FYI. I'll be working on this a bit with one of Eli's students.

A FEFF schema

In this issue, I'll outline the plan for constructing a schema for FEFF data. We wish to store FEFF data for two purposes:

  1. Medium term storage
  2. For use as an intermediary for storing jobs that have not yet been run

Point 2 is the more interesting one here. I would like the FEFF schema to allow for two "states" of completeness.

Pending calculation: the data would consist of an empty data frame, with just the column names. metadata would contain just the information required for submitting a job.

Complete calculation: the data will now contain the actual spectral data/FEFF output. metadata will contain output logs in addition to everything contained in the pending calculations.

Schema plan

Instead of one schema for both incomplete and complete jobs, let's have two schemas, one for completed FEFF jobs and one for incomplete jobs. I will detail below (lots of edits).

The data

Completed FEFF jobs

FEFF9 spectral output is quite simple. It consists of columnar data with the following columns:

  • omega
  • e
  • k
  • mu
  • mu0
  • chi

Each column simply contains floats. This should be quite straightforward to implement.

Incomplete FEFF jobs

The DataFrame will have the same columns but will be trivially empty.
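
For reference, the pending-state frame is just the column names with zero rows:

import pandas as pd

FEFF_COLUMNS = ["omega", "e", "k", "mu", "mu0", "chi"]

# Pending calculation: correct columns, no data yet.
pending = pd.DataFrame(columns=FEFF_COLUMNS)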

The metadata

Note that complete and incomplete FEFF jobs will be linked by a metadata field analogous to sample_id. I think we can actually just call it sample_id. For example, a molecule-site pair will have one entry in the incomplete database and one in the complete (once the job is done); these two data points will be linked by this sample_id. always required

Common metadata that will be searchable:

  • The XAS edge, of course. Just like the experimental data, this is an XDIElement (edge+element pair). Though I do wish to reference #21, as I feel the name XDIElement is misleading... for now we'll stick with it. always required
  • identifier: string. This can mean a few things, but in particular, for molecules it could mean the SMILES string. It's important that this be searchable because a single molecule may have multiple absorbing sites and therefore multiple FEFF spectra. always required
  • absorbing_site_index: int, zero indexed; always required
  • calculation_type: string, either XANES or EXAFS. always required

Completed FEFF jobs

  • feff.out: string (output file read as a string); always required

Incomplete FEFF jobs

  • feff.inp: string (input file read as a string, or perhaps can be decomposed into different blocks); always required
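
Putting the required fields together, a hedged pydantic sketch of the two metadata schemas (class and field names are illustrative; feff.inp/feff.out are spelled with underscores here only to make them valid Python identifiers, and the XDIElement pair is flattened into two strings):

from typing import Literal

import pydantic


class FEFFMetadataBase(pydantic.BaseModel):
    sample_id: str                # links the incomplete and complete entries
    element: str                  # absorbing element (XDIElement)
    edge: str                     # XAS edge (XDIElement)
    identifier: str               # e.g. a SMILES string for molecules
    absorbing_site_index: int     # zero indexed
    calculation_type: Literal["XANES", "EXAFS"]


class IncompleteFEFFMetadata(FEFFMetadataBase):
    feff_inp: str                 # input file read as a string


class CompleteFEFFMetadata(FEFFMetadataBase):
    feff_out: str                 # output file read as a string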

Comments

@danielballan I know this might not be exactly what you had in mind as far as aimmdb's use cases are concerned, but I would love your feedback on this. We'll be using it for dynamic querying of completed FEFF spectra for inverse design of molecules, and for Mike's really cool frontend GUI for visualizing XAS.

If this idea works we can duplicate the principle for e.g. Gaussian and do geometry optimization.

Finally, this does have a multi-modal aspect, since for a given molecule we'll compute e.g. the C, N and O XANES and use them all for multi-modal structure refinement.

500 error on node export

In [6]: c['heald']['ChambersID'].export('data.h5')
---------------------------------------------------------------------------
HTTPStatusError                           Traceback (most recent call last)
<ipython-input-6-8a4c8cfc150d> in <module>
----> 1 c['heald']['ChambersID'].export('data.h5')

~/Repos/bnl/tiled/tiled/client/node.py in export(self, filepath, format)
    488 
    489         """
--> 490         return export_util(
    491             filepath,
    492             format,

~/Repos/bnl/tiled/tiled/client/utils.py in export_util(file, format, get, link, params)
     80                 suffix[1:] for suffix in Path(file).suffixes
     81             )  # e.g. "csv"
---> 82         content = get(link, params={"format": format, **params})
     83         with open(file, "wb") as buffer:
     84             buffer.write(content)

~/Repos/bnl/tiled/tiled/client/context.py in get_content(self, path, accept, stream, revalidate, **kwargs)
    433             # No cache, so we can use the client straightforwardly.
    434             response = self._send(request, stream=stream)
--> 435             handle_error(response)
    436             if response.headers.get("content-encoding") == "blosc":
    437                 import blosc

~/Repos/bnl/tiled/tiled/client/utils.py in handle_error(response)
     20         return
     21     try:
---> 22         response.raise_for_status()
     23     except httpx.RequestError:
     24         raise  # Nothing to add in this case; just raise it.

~/miniconda3/envs/py38/lib/python3.8/site-packages/httpx/_models.py in raise_for_status(self)
   1506         error_type = error_types.get(status_class, "Invalid status code")
   1507         message = message.format(self, error_type=error_type)
-> 1508         raise HTTPStatusError(message, request=request, response=self)
   1509 
   1510     def json(self, **kwargs: typing.Any) -> typing.Any:

HTTPStatusError: Server error '500 Internal Server Error' for url 'https://aimm-staging.lbl.gov/node/full/heald/ChambersID?format=h5'
For more information check: https://httpstatuses.com/500

Get aimmdb image-publishing workflow working

  • Ensure tiled version is pinned, so we have to opt in to newer tiled versions.
  • Pin Spin to specific hash or tag so that Spin does not auto-update.
  • Un-break workflow. Change context from docker/tiled to deploy/spin/docker/tiled.

Revisit the use of "dataset"

We use "dataset" to group all the NMC data, but elsewhere we use it to group different batches of data (iss, iss-raw). It seems like we need mechanisms for grouping data by its {source, origin, contributor, batch?} and other for grouping data that we are interested in analyzing together (NMC).

Build the ingestion pipeline

Building the ingestion pipeline

We are working with Eli to develop a pipeline for uploading his XAS beam line data into aimmdb. Particularly, we want to accomplish the following with this issue:

Summary

  • Our endpoint (for now) will be a .dat file which contains comments starting with # (some of which are critical pieces of metadata), and otherwise columnar, space-delimited data.
  • Using the existing xas schema, each channel (basically, each column other than energy) will be read into aimmdb; the energy column is self-explanatory, and the mu column will be one of the many channels. The channel that is chosen will be indicated in the metadata as measurement_type. Eli's code below provides a good starting point. It's somewhat pseudocode and some work needs to be done.
import numpy as np
import pandas as pd


MEASUREMENT_INSTRUCTIONS = {
    "transmission": {
        "name": "transmission",
        "numerator": "it",
        "denominator": "i0",
        "log": True,
        "invert": True,
        "col_name": "mu_trans",
    },
    "fluorescence": {
        "name": "fluorescence",
        "numerator": "iff",
        "denominator": "i0",
        "log": False,
        "invert": False,
        "col_name": "mu_fluo",
    },
}


def extract_mu(path, measurement_kind):

    # The .dat files are space-delimited with "#" comment lines; exact
    # header handling still needs to be worked out.
    df = pd.read_csv(path, sep=r"\s+", comment="#")

    measurement_description = MEASUREMENT_INSTRUCTIONS[measurement_kind]

    energy = df["energy"]

    mu = (
        df[measurement_description["numerator"]]
        / df[measurement_description["denominator"]]
    )

    if measurement_description["log"]:
        mu = np.log10(mu)

    if measurement_description["invert"]:
        mu = -mu

    # Store the derived channel under its canonical column name.
    df[measurement_description["col_name"]] = mu

    # Also read the metadata from the file, including all commented lines, but
    # we need to pick out the particularly important databroker unique id
    metadata = ...

    # process data frame...

    return df, metadata

Specific steps

  • Create a module aimmdb.ingest
  • Create a particular file aimmdb/ingest/eli.py (we'll rename this to the name of Eli's beam line later)
  • Create a single function (ingest) which takes a single path as an argument and returns a pd.DataFrame (the data) and dict (metadata); a sketch follows after this list.
  • Don't forget that the pd.DataFrame columns must be energy and mu. The actual column we use for mu will change depending on the channel we're looking at.
  • In Eli's examples, we only have "transmission" and "fluorescence". Eli has provided instructions (code above) on how to process these particular types of data and how they should be represented in aimmdb
  • We MUST document every type of processing we do (see above code) before it gets uploaded into aimmdb. I recommend a README file in aimmdb.ingest for now, until we move to a more standard documentation solution.
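
A hedged sketch of the ingest entry point, reusing Eli's MEASUREMENT_INSTRUCTIONS and extract_mu from above (the exact metadata handling is still to be decided):

import pandas as pd


def ingest(path, measurement_kind="transmission"):
    """Read one beamline .dat file and return (data, metadata) for aimmdb."""
    instructions = MEASUREMENT_INSTRUCTIONS[measurement_kind]
    df, metadata = extract_mu(path, measurement_kind)

    # The xas schema requires exactly the columns "energy" and "mu".
    data = df[["energy", instructions["col_name"]]].rename(
        columns={instructions["col_name"]: "mu"}
    )

    # Record which channel was used to produce mu.
    metadata["measurement_type"] = measurement_kind
    return data, metadata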

Add outlier rejection to postprocessing operations

Denis also requested we add outlier rejection to postprocessing. This could be incorporated into an AverageData operator, or there could be two separate operators for averaging with/without outlier rejection.

Determining outliers

First, the trimmed mean and trimmed standard deviation will be calculated at each energy point (see here).
For each spectrum the following will then be calculated:
(1 / N) * sum[((trimmed_mean - spectrum) / trimmed_stddev)**2]
where N is the number of energy points. This value is essentially a measure of how many trimmed standard deviations the spectrum typically deviates from the trimmed mean, and it can be compared to a threshold to determine whether a given spectrum is an outlier in the group (a sketch follows after the notes below).

A few notes from Denis:

  • Different amounts of data can be trimmed during calculation. Typically trimming the top and bottom 20% works well and can be used by default.
  • A threshold value of ~10-25 typically works well for outlier determination. By default we can use 10 as a conservative threshold.
  • This method works best when the number of spectra is ≥ 10. Some kind of warning should be presented if it is run on a set of fewer than 10 spectra.
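
A hedged numpy/scipy sketch of the calculation described above, using the defaults from Denis's notes (the function name and input layout are assumptions):

import warnings

import numpy as np
from scipy.stats import mstats


def find_outliers(spectra, trim=0.2, threshold=10.0):
    """Flag outlier spectra in a group.

    spectra: 2D array of shape (n_spectra, n_energy_points), all on a
    common energy grid. Returns a boolean array marking the outliers.
    """
    spectra = np.asarray(spectra)
    if spectra.shape[0] < 10:
        warnings.warn("Outlier rejection works best with >= 10 spectra.")

    # Trimmed statistics at each energy point (top/bottom `trim` fraction cut).
    trimmed_mean = mstats.trimmed_mean(spectra, limits=(trim, trim), axis=0)
    trimmed_std = mstats.trimmed_std(spectra, limits=(trim, trim), axis=0)

    # Mean squared deviation from the trimmed mean, in units of the trimmed std.
    deviation = np.mean(((trimmed_mean - spectra) / trimmed_std) ** 2, axis=1)
    return deviation > threshold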

Add new postprocessing grid alignment schema

Denis (from the ISS beamline) requested a new schema be added to the postprocessing operations for aligning energy grids from multiple spectra. Currently there is a StandardizeGrid UnaryOperator, which operates on a single spectrum and standardizes the energy grid to uniformly spaced values (i.e., new_grid = np.linspace(self.x0, self.xf, self.nx)).

However, XAS spectra are often not sampled on uniformly spaced grids, because different regions of the spectrum carry different amounts of information.

This new schema will work in the following way:

  • operates on a group of spectra (MimoOperator)
  • select one "master energy grid" from the group (this could be the first spectrum passed by default)
  • align all spectra in the group to the "master grid" via InterpolatedUnivariateSpline
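
A minimal sketch of that behavior (the function name and input layout are assumptions; the real version would live behind the MimoOperator abstraction):

import numpy as np
from scipy.interpolate import InterpolatedUnivariateSpline


def align_to_master_grid(spectra):
    """Align a group of spectra onto a shared energy grid.

    spectra: list of (energy, mu) array pairs. The first spectrum's grid is
    used as the master grid by default, as proposed above.
    """
    master_energy = np.asarray(spectra[0][0])
    aligned = []
    for energy, mu in spectra:
        spline = InterpolatedUnivariateSpline(energy, mu)
        aligned.append(spline(master_energy))
    return master_energy, np.array(aligned)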

Re-implement validation classes

In the original AIMMDB, validation callables were added so that we could test the validity of different types of XAS files on import. They feel a lot like the currently supported tiled versions of validation, but are configured differently. When we migrated the database from old to modern tiled, we skipped the step of reimplementing them. The code for them is still in the code base, but they are not called anywhere.

Meanwhile, the validation written for aimmdb mostly made it to tiled. So, let's do what we need to do to refactor what was in the old aimmdb and make it available in the new tiled-based aimmdb. This could be as simple as modifying the config file and making sure that the Python files are available in a place where tiled picks them up. We might also think about sending a PR to tiled with an example config for validators.

For some context, tiled has the concept of specs which define the specification of a type of data, and a validator can be configured for a spec.
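
As a loosely hedged illustration only, the config addition might look roughly like the snippet below; the exact key names should be checked against the current tiled documentation, and the spec name and module path are placeholders:

specs:
  - spec: XAS
    validator: aimmdb.validation:validate_xas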

Support Text Search

It would be nice for the AIMM Tiled implementations to support full text search. I think this requires a couple of things:

  • Enhance the adapters to register text search
  • Add a text index to Mongo. As a pattern, databroker adds indexes on init of the database update tool (which is actually suitcase). If the index is already there, no error is thrown.
  • As a bonus to enhance the search, we could add a field to common called "description" and add to that a tokenized version of the dataset name. So As-K-1 would become As K 1. This would make As, K and 1 searchable with Mongo's text search.
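
A hedged pymongo sketch of the index and the tokenized description field (the connection details and collection name are illustrative):

from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["aimm"]
collection = db["samples"]  # collection name is illustrative

# Creating the index is idempotent: if it already exists, no error is raised.
collection.create_index([("description", "text")])

# Tokenize the dataset name into the searchable description, e.g. "As-K-1" -> "As K 1".
name = "As-K-1"
collection.update_many({"name": name}, {"$set": {"description": name.replace("-", " ")}})

# Mongo text search can then match any of the tokens.
results = collection.find({"$text": {"$search": "As"}})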

Confirm that hard XAS data from NSLS-II has been ingested

This is item 5c in the Google doc. Data owner is Eli.

Data comprises Co, Ni, Mn K-edge of NMC. This may already be ingested as iss / iss-raw (processed and unprocessed, respectively). It should be moved under the ncm dataset, though it should retain some indication of the batch that it came from under some other metadata.

Quality of life: use @cache instead of global variable

@kleinhenz I'm just perusing the code and making issues as I go, btw. Hope this is ok. Just trying to understand it better and help improve where I can 😊

_ELEMENT_DATA = None

Nitpick: I think you can do something like

import importlib.resources
import json
from functools import cache


@cache
def get_element_data():
    fname = importlib.resources.files("aimmdb") / "data" / "elements.json"
    with open(fname, "r") as f:
        data = json.load(f)
    return data

instead of using a global. Fewer lines of code and safer.
