flymet's Issues

Flexible way to store experimental design

For loading of the PyMT data into the pipeline, we need to do the following:

  1. Define a way to flexibly specify the experimental design, e.g. whether it's a case-control study, which comparisons to make, what factors are used, and so on.

  2. Change the Django views to use (1) rather than hardcoding the design in the code.

This could be good for FlyMet as well, since it will make it more flexible.

It means any generic analysis function we develop in the future can be easily available for FlyMet too, e.g. heatmaps, omics integration etc.
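One possible shape for point (1) is a declarative design description that the views read instead of hardcoded comparisons. A hypothetical sketch (none of these keys or sample names come from the existing codebase; they only illustrate the idea):

```python
# Hypothetical experimental-design structure: study type, factors,
# group -> sample mapping, and the comparisons to run over those groups.
experiment_design = {
    "study_type": "case-control",
    "factors": ["tissue", "life_stage"],
    "groups": {
        "brain_adult": ["F_Brain_1", "F_Brain_2"],
        "whole_adult": ["F_Whole_1", "F_Whole_2"],
    },
    "comparisons": [
        {"case": "brain_adult", "control": "whole_adult"},
    ],
}

def samples_for_comparison(design, comparison):
    """Return (case_samples, control_samples) for one comparison entry."""
    return (design["groups"][comparison["case"]],
            design["groups"][comparison["control"]])

case, control = samples_for_comparison(experiment_design,
                                       experiment_design["comparisons"][0])
```

A generic analysis function would then iterate over `design["comparisons"]` rather than assuming FlyMet's tissue/life-stage layout.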

Missing loguru dependencies on Windows

A fresh installation of the Pipenv virtual environment throws the following error when trying to run manage.py:

  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\loguru\_file_sink.py", line 13, in <module>
    from ._ctime_functions import get_ctime, set_ctime
  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\loguru\_ctime_functions.py", line 4, in <module>
    import win32_setctime
ModuleNotFoundError: No module named 'win32_setctime'

and

  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\loguru\_colorama.py", line 25, in should_wrap
    from colorama.win32 import winapi_test
ModuleNotFoundError: No module named 'colorama'

It seems that the packages win32_setctime and colorama are not installed as part of the loguru dependencies, for whatever reason.
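loguru declares these two packages as Windows-only dependencies via environment markers, so a lock file generated on another platform can miss them. A likely workaround (package names taken from the error messages above, not verified against this project's Pipfile) is to pin them explicitly:

```toml
[packages]
# Windows-only runtime deps that loguru expects but the lock file missed
win32-setctime = {version = "*", sys_platform = "== 'win32'"}
colorama = {version = "*", sys_platform = "== 'win32'"}
```

Re-running `pipenv install` after adding these should pull them into the virtualenv on Windows only.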

Add details of permanent files needed for installation

Add details of the files needed for installation and how the scripts that generate them are run - e.g. chebi_ontology_df_PERMANENT.pkl and chebi_relation_dict.pkl, and the fact that they require .owl and .tsv files, respectively, to be created if absent.

Make the code more generic

At the moment, the sample metadata stored in the database is specific to FlyMet. It assumes the following info is provided: lifestage, tissue, mutant. For other datasets, this info won't be available and different types of metadata will be present instead. I would like to make the following enhancement to the pipeline so that it can also load non-FlyMet datasets.

In particular, here is the code that will be modified:

In the population script

def populate_samples(sample_csv):
    '''
    Give the sample CSV file to populate the samples.
    KMcL: Working but need to consider the filepath.
    '''
    sample_details = np.genfromtxt(sample_csv, delimiter=',', dtype=str)[2:]
    logger.debug("sd_type %s" % sample_details)
    for sample in sample_details:
        sample_serializer = SampleSerializer(
            data={"name": sample[0], "group": sample[1], "life_stage": sample[2], "tissue": sample[3],
                  "mutant": sample[4]})
        if sample_serializer.is_valid():
            db_sample = sample_serializer.save()
            logger.debug("sample saved %s" % db_sample.name)
        else:
            logger.debug(sample_serializer.errors)

In the model

class Sample(models.Model):
    """
    Model class defining an instance of an experimental Sample including the tissue and life-stage from which it came
    """
    # Here the sample name is unique as this is important for processing FlyMet data
    name = models.CharField(max_length=250, unique=True, blank=False)
    life_stage = models.CharField(max_length=250, blank=False)
    tissue = models.CharField(max_length=250)
    group = models.CharField(max_length=250, blank=True, null=True)
    mutant = models.CharField(max_length=250, blank=True, null=True)

    def __str__(self):
        """
        Method to return a representation of the Sample
        """

        return "Sample " + self.name

and in the serialiser

class SampleSerializer(serializers.ModelSerializer):
    class Meta:
        model = Sample
        fields = ('name', 'life_stage', 'group', 'tissue', 'mutant')
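One way to decouple the model from FlyMet-specific columns is to treat everything after the sample name as free-form metadata (stored, say, in a key-value table or a JSONField rather than fixed `life_stage`/`tissue`/`mutant` columns). A minimal sketch of the parsing side, using the stdlib only; the column names in the example CSV are illustrative:

```python
import csv
import io

def parse_samples(csv_text):
    """Read a sample CSV generically: the first column is the sample name,
    every remaining column becomes an arbitrary metadata key/value pair.
    Sketch only - not the real population script."""
    reader = csv.DictReader(io.StringIO(csv_text))
    samples = []
    for row in reader:
        name = row.pop("name")
        samples.append({"name": name, "metadata": dict(row)})
    return samples

# FlyMet-style header shown, but any set of columns would work unchanged.
csv_text = """name,group,life_stage,tissue,mutant
F_Brain_1,brain_f,F,Brain,NONE
"""
samples = parse_samples(csv_text)
```

The serializer would then accept the metadata dict as-is instead of enumerating FlyMet's fields.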

Make pipeline faster

Notes on some things that could be done to make the pre-processing pipeline faster. Will keep adding to this as I go.

  1. Speed up get_chebi_id.
  2. Speed up construct_all_peak_df.
  3. Speed up remove_duplicates by vectorising it?
  4. Speed up populate_peaks_cmpds_annots to make it insert in batch.
  5. Speed up populate_peaksamples to make it insert in batch.
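For (4) and (5), Django's `bulk_create` can replace per-row `save()` calls. A generic chunking helper (plain Python, so it runs outside Django) sketches the batching; the model name in the comment is assumed, not taken from the codebase:

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Inside Django this would become, roughly (model name hypothetical):
#   for batch in chunked(peak_sample_rows, 500):
#       PeakSample.objects.bulk_create(batch)
batches = list(chunked(range(5), 2))
```

Batching cuts the number of round trips to the database from one per row to one per batch.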

Can't access peak explorer

Got the following error message:

Stack trace:

Traceback (most recent call last):
  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\django\core\handlers\exception.py", line 34, in inner
    response = get_response(request)
  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\django\core\handlers\base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\django\core\handlers\base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\django\utils\decorators.py", line 130, in _wrapped_view
    response = view_func(request, *args, **kwargs)
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\views.py", line 471, in peak_explorer
    cache.set('my_group_df', cmpd_selector.get_group_df(peaks), 60*18000)

Exception Type: NameError at /met_explore/peak_explorer/All
Exception Value: name 'cmpd_selector' is not defined
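The NameError means `cmpd_selector` is never defined in `peak_explorer` before the `cache.set` call. A hypothetical sketch of the shape of the fix (the class body here is a stub standing in for the real `met_explore.compound_selection.CompoundSelector`, which is not reproduced here):

```python
class CompoundSelector:
    """Stub standing in for met_explore.compound_selection.CompoundSelector."""
    def get_group_df(self, peaks):
        # The real method builds per-group data for the cache;
        # this placeholder just echoes its input.
        return {"groups": peaks}

# Likely fix: instantiate the selector before it is used in the view.
cmpd_selector = CompoundSelector()
group_df = cmpd_selector.get_group_df(["peak_1"])
```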

Suspicious warning when running the pipeline

To check later. Got this when running

$ python reset_data.py && python manage.py migrate && python manage.py initialisedb data/tissues_life_stages_no_whole.csv data/67_peak_cmpd_export_1.json data/67_peak_int_export.json

It seems like the NaN values should be removed.

[screenshot of the warning]

List index out of range

Not sure why I'm now getting this when accessing the pathway explorer:

Traceback (most recent call last):
  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\django\core\handlers\exception.py", line 47, in inner
    response = get_response(request)
  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\django\core\handlers\base.py", line 179, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\views.py", line 622, in pathway_explorer
    view_df, pals_min, pals_mean, pals_max = get_pals_view_data()
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\views.py", line 954, in get_pals_view_data
    pals_df = get_cache_df()
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\pathway_analysis.py", line 61, in get_cache_df
    cache.set('pals_df', get_pals_df(), 60 * 180000)
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\pathway_analysis.py", line 48, in get_pals_df
    ds = get_cache_ds()
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\pathway_analysis.py", line 38, in get_cache_ds
    cache.set('pals_ds', get_pals_ds(), 60 * 180000)
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\pathway_analysis.py", line 26, in get_pals_ds
    fly_exp_design = get_pals_experimenal_design()
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\pathway_analysis.py", line 385, in get_pals_experimenal_design
    control_dict = cmpd_selector.get_list_view_column_names(controls)
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\compound_selection.py", line 234, in get_list_view_column_names
    sample = Sample.objects.filter(group=g)[0]  # Get the first sample of this group.
  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\django\db\models\query.py", line 325, in __getitem__
    return qs._result_cache[0]

Exception Type: IndexError at /met_explore/pathway_explorer
Exception Value: list index out of range

Created using the following command on the performance branch:

$ python reset_data.py && python manage.py migrate && python manage.py initialisedb data\tissues_life_stages_no_whole.csv data\67_peak_cmpd_export_1.json data\67_peak_int_export.json

The resulting database and pickled data from running the above command are attached.

db.sqlite3.zip

data.zip
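The crash is `Sample.objects.filter(group=g)[0]` indexing an empty queryset, i.e. some group in the experimental design has no matching samples. Django's `QuerySet.first()` returns `None` instead of raising; the equivalent defensive pattern in plain Python:

```python
def first_or_none(items):
    """Return the first element, or None if the sequence is empty
    (mirrors Django's QuerySet.first())."""
    for item in items:
        return item
    return None

# In get_list_view_column_names this would become, roughly:
#   sample = Sample.objects.filter(group=g).first()
#   if sample is None:
#       ...log/skip the group instead of crashing...
result_empty = first_or_none([])
result_full = first_or_none(["F_Whole_1", "F_Whole_2"])
```

That only papers over the symptom, though; it's still worth checking why the group has no samples in this database.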

Separate PiMP pre-processing from database loading in initialisedb command

Not sure if it's already the case now, but it would be great if, inside the initialisedb script, we could cleanly separate (1) the code that deals with the pre-processing of PiMP data from (2) the code that loads the pre-processed data into the database.

Ideally for part (1), we have a single object or method that uses only pandas or plain Python. It takes as input the paths to the sample CSV, intensity and annotation JSON files from PiMP. The output is a list of cleaned peaks alongside their high-confidence compound annotations. This way it can be used in other workflows as well to clean imported PiMP data, e.g. in WebOmics.

Part (2) would then take the output from part (1) and populate the Django database with it. I guess that will be unique to this project and won't really be re-used elsewhere. However, if the code is kept neatly in one place, it will be easier to optimise the database-loading performance later on.

Q: can I use the PeakSelector and CompoundSelector to do only part (1) @kmcluskey, even without having a database at all? From the code, it seems that this isn't entirely possible yet, but with small changes it might be.
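The part-(1) entry point could look like the sketch below; the function name, parameters, and return shape are all illustrative, not the real pipeline API:

```python
import json

def preprocess_pimp(sample_csv_path, annotation_json_path, intensity_json_path):
    """Hypothetical part-(1) entry point: stdlib/pandas only, no Django.
    Returns cleaned peaks plus high-confidence annotations, ready for a
    separate part-(2) loader to write into the database."""
    with open(annotation_json_path) as f:
        annotations = json.load(f)
    with open(intensity_json_path) as f:
        intensities = json.load(f)
    # ... peak cleaning and duplicate removal would happen here ...
    return {"peaks": intensities, "annotations": annotations}
```

With this split, initialisedb reduces to `preprocess_pimp(...)` followed by a single database-population call, and WebOmics could call `preprocess_pimp` on its own.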

Code duplication

Code marked:

??KMCL*** This code is copied directly from the peak_explorer - can this be made reusable??

in pathway_search.js is taken directly from peak_explorer.js - check whether this would be easy to move to a common place and reuse.

Improve Pathway Explorer performance

Noticed a number of issues when trying to access the pathway explorer for the first time using the test data.

  1. It's slow when retrieving the initial data. From console prints, I see:
Returning the metabolite data took: 139.86491279999998 S
...
Returning the pals data took: 285.96137580000004 S

So in total it took nearly 5 minutes until the pathway results are ready, which is a long wait.

  2. A number of Django socket errors appeared on the console (see pathways.txt). I guess this is because of (1): it took too long for the results to return.

  3. Even when the results have finished being computed, nothing shows up in the table. I think this is because of (2), since the connection has timed out. We need to click away and return to the page for the results to show up (once they've been cached).

One easy solution would be to precompute everything beforehand and store the results somewhere. It seems that Django caching is used, but that is only triggered when the page is accessed, so it's a bit too late. I guess we can do this pathway computation as part of the initial data loading, since it only needs to be done once.
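The precompute idea can be sketched as a warm-up step run once at load time rather than on the first page view. A minimal plain-Python sketch; the pals/metabolite function names in the comment are assumptions:

```python
_cache = {}

def warm_cache(jobs):
    """Run each expensive job once up front and memoise the result, so the
    first page request becomes a dictionary lookup instead of a recompute."""
    for key, fn in jobs.items():
        _cache[key] = fn()

def get_cached(key):
    return _cache[key]

# In the app this could be called from a management command run right after
# initialisedb, e.g. (function names hypothetical):
#   warm_cache({"pals_df": get_pals_df, "metabolite_df": get_metabolite_data})
warm_cache({"answer": lambda: 42})
```

In practice the Django cache (with a long timeout) would replace the module-level dict, but the sequencing is the point: compute at load time, serve from cache thereafter.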

General code tidy-up

  1. Use the logger from loguru rather than importing the root logging module (which we would need to configure further). The default loguru logger is already formatted to show the timestamp, originating line number, etc. Debugging is easier when we know the line number.

  2. Also get rid of all print statements and use the logger instead.

  3. Get rid of import * in the code. It's usually better to import exactly what we need.
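With loguru, point (1) is just `from loguru import logger`. For comparison, the stdlib-only equivalent needs the one-off configuration mentioned above to get the same module/function/line detail; a minimal sketch (the log message is illustrative, and the StringIO stream is only there so the output can be inspected):

```python
import io
import logging

# Configure a logger whose format includes module, function and line number,
# similar to loguru's default output.
stream = io.StringIO()                      # capture output for demonstration
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(asctime)s | %(levelname)s | %(module)s:%(funcName)s:%(lineno)d - %(message)s"))
logger = logging.getLogger("flymet.demo")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)

logger.debug("sample saved %s", "F_Brain_1")   # replaces a bare print(...)
output = stream.getvalue()
```

Either way, points (2) and (3) then amount to replacing `print(...)` with `logger.debug(...)` and enumerating imports explicitly.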

Tooltip error

On the peak comparisons page the first couple of tooltip headers have extra text.

Fix comparison calculation

If there are no peaks for whole flies, make sure that this does not flatten the data present in other tissues, e.g. anserine found in eyes but not in whole flies.

Speed up Pathway page

The Pathway page is REALLY slow to load the first time round - this can cause the server to time out.

Compare results male v female

Sue would like a comparison page of male versus female. Ideally, users could choose to compare any two sets of results.

Fix peak click - pathway search

When clicking on a peak in the Pathway search, you can't view the results on long pages without scrolling back up, e.g. the TCA cycle.
