flymet's Issues

Flexible way to store experimental design

For loading of the PyMT data into the pipeline, we need to do the following:

  1. Define a way to flexibly specify the experimental design, e.g. whether it's a case-control study, which comparisons to make, what factors are used, and so on.

  2. Change the Django views to use (1) rather than hardcoding the design in the code.

This could be good for FlyMet as well, since it will make it more flexible.

It means any generic analysis function we develop in the future can be easily available for FlyMet too, e.g. heatmaps, omics integration etc.
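One possible shape for point (1) is a declarative design description that the views read instead of hardcoded comparisons. A hypothetical sketch (none of these keys or sample names come from the existing codebase; they only illustrate the idea):

```python
# Hypothetical experimental-design structure: study type, factors,
# group -> sample mapping, and the comparisons to run over those groups.
experiment_design = {
    "study_type": "case-control",
    "factors": ["tissue", "life_stage"],
    "groups": {
        "brain_adult": ["F_Brain_1", "F_Brain_2"],
        "whole_adult": ["F_Whole_1", "F_Whole_2"],
    },
    "comparisons": [
        {"case": "brain_adult", "control": "whole_adult"},
    ],
}

def samples_for_comparison(design, comparison):
    """Return (case_samples, control_samples) for one comparison entry."""
    return (design["groups"][comparison["case"]],
            design["groups"][comparison["control"]])

case, control = samples_for_comparison(experiment_design,
                                       experiment_design["comparisons"][0])
```

A generic analysis function would then iterate over `design["comparisons"]` rather than assuming FlyMet's tissue/life-stage layout.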

Missing loguru dependencies on Windows

A fresh installation of the Pipenv virtual environment throws the following error when trying to run manage.py:

  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\loguru\_file_sink.py", line 13, in <module>
    from ._ctime_functions import get_ctime, set_ctime
  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\loguru\_ctime_functions.py", line 4, in <module>
    import win32_setctime
ModuleNotFoundError: No module named 'win32_setctime'

and

  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\loguru\_colorama.py", line 25, in should_wrap
    from colorama.win32 import winapi_test
ModuleNotFoundError: No module named 'colorama'

It seems that the packages win32_setctime and colorama are not installed as part of the loguru dependencies, for whatever reason.
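loguru declares these two packages as Windows-only dependencies via environment markers, so a lock file generated on another platform can miss them. A likely workaround (package names taken from the error messages above, not verified against this project's Pipfile) is to pin them explicitly:

```toml
[packages]
# Windows-only runtime deps that loguru expects but the lock file missed
win32-setctime = {version = "*", sys_platform = "== 'win32'"}
colorama = {version = "*", sys_platform = "== 'win32'"}
```

Re-running `pipenv install` after adding these should pull them into the virtualenv on Windows only.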

Add details of permanent files needed for installation

Add details of the files needed for installation and how the scripts that generate them are run - e.g. chebi_ontology_df_PERMANENT.pkl and chebi_relation_dict.pkl, and the fact that they require .owl and .tsv files, respectively, to be created if absent.

Make the code more generic

At the moment, the sample metadata stored in the database is specific to FlyMet. It assumes the following info is provided: lifestage, tissue, mutant. For other datasets, this info won't be available and different types of metadata will be present instead. I would like to make the following enhancement to the pipeline so that it can also load non-FlyMet datasets.

In particular, here is the code that will be modified:

In the population script

def populate_samples(sample_csv):
    '''
    Give the sample CSV file to populate the samples.
    KMcL: Working but need to consider the filepath.
    '''
    sample_details = np.genfromtxt(sample_csv, delimiter=',', dtype=str)[2:]
    logger.debug("sd_type %s" % sample_details)
    for sample in sample_details:
        sample_serializer = SampleSerializer(
            data={"name": sample[0], "group": sample[1], "life_stage": sample[2], "tissue": sample[3],
                  "mutant": sample[4]})
        if sample_serializer.is_valid():
            db_sample = sample_serializer.save()
            logger.debug("sample saved %s" % db_sample.name)
        else:
            logger.debug(sample_serializer.errors)

In the model

class Sample(models.Model):
    """
    Model class defining an instance of an experimental Sample including the tissue and life-stage from which it came
    """
    # Here the sample name is unique as this is important for processing FlyMet data
    name = models.CharField(max_length=250, unique=True, blank=False)
    life_stage = models.CharField(max_length=250, blank=False)
    tissue = models.CharField(max_length=250)
    group = models.CharField(max_length=250, blank=True, null=True)
    mutant = models.CharField(max_length=250, blank=True, null=True)

    def __str__(self):
        """
        Method to return a representation of the Sample
        """

        return "Sample " + self.name

and in the serialiser

class SampleSerializer(serializers.ModelSerializer):
    class Meta:
        model = Sample
        fields = ('name', 'life_stage', 'group', 'tissue', 'mutant')
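One way to decouple the model from FlyMet-specific columns is to treat everything after the sample name as free-form metadata (stored, say, in a key-value table or a JSONField rather than fixed `life_stage`/`tissue`/`mutant` columns). A minimal sketch of the parsing side, using the stdlib only; the column names in the example CSV are illustrative:

```python
import csv
import io

def parse_samples(csv_text):
    """Read a sample CSV generically: the first column is the sample name,
    every remaining column becomes an arbitrary metadata key/value pair.
    Sketch only - not the real population script."""
    reader = csv.DictReader(io.StringIO(csv_text))
    samples = []
    for row in reader:
        name = row.pop("name")
        samples.append({"name": name, "metadata": dict(row)})
    return samples

# FlyMet-style header shown, but any set of columns would work unchanged.
csv_text = """name,group,life_stage,tissue,mutant
F_Brain_1,brain_f,F,Brain,NONE
"""
samples = parse_samples(csv_text)
```

The serializer would then accept the metadata dict as-is instead of enumerating FlyMet's fields.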

Make pipeline faster

Notes on some things that could be done to make the pre-processing pipeline faster. Will keep adding to this as I go.

  1. Speed up get_chebi_id.
  2. Speed up construct_all_peak_df.
  3. Speed up remove_duplicates by vectorising it?
  4. Speed up populate_peaks_cmpds_annots to make it insert in batch.
  5. Speed up populate_peaksamples to make it insert in batch.
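For (4) and (5), Django's `bulk_create` can replace per-row `save()` calls. A generic chunking helper (plain Python, so it runs outside Django) sketches the batching; the model name in the comment is assumed, not taken from the codebase:

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch

# Inside Django this would become, roughly (model name hypothetical):
#   for batch in chunked(peak_sample_rows, 500):
#       PeakSample.objects.bulk_create(batch)
batches = list(chunked(range(5), 2))
```

Batching cuts the number of round trips to the database from one per row to one per batch.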

Can't access peak explorer

Got the following error message:

Stack trace:

Traceback (most recent call last):
  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\django\core\handlers\exception.py", line 34, in inner
    response = get_response(request)
  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\django\core\handlers\base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\django\core\handlers\base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\django\utils\decorators.py", line 130, in _wrapped_view
    response = view_func(request, *args, **kwargs)
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\views.py", line 471, in peak_explorer
    cache.set('my_group_df', cmpd_selector.get_group_df(peaks), 60*18000)

Exception Type: NameError at /met_explore/peak_explorer/All
Exception Value: name 'cmpd_selector' is not defined
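The NameError means `cmpd_selector` is never defined in `peak_explorer` before the `cache.set` call. A hypothetical sketch of the shape of the fix (the class body here is a stub standing in for the real `met_explore.compound_selection.CompoundSelector`, which is not reproduced here):

```python
class CompoundSelector:
    """Stub standing in for met_explore.compound_selection.CompoundSelector."""
    def get_group_df(self, peaks):
        # The real method builds per-group data for the cache;
        # this placeholder just echoes its input.
        return {"groups": peaks}

# Likely fix: instantiate the selector before it is used in the view.
cmpd_selector = CompoundSelector()
group_df = cmpd_selector.get_group_df(["peak_1"])
```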

Suspicious warning when running the pipeline

To check later. Got this when running

$ python reset_data.py && python manage.py migrate && python manage.py initialisedb data/tissues_life_stages_no_whole.csv data/67_peak_cmpd_export_1.json data/67_peak_int_export.json

It seems like the NaN values should be removed.

[screenshot of the warning]

List index out of range

Not sure why I'm now getting this when accessing the pathway explorer:

Traceback (most recent call last):
  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\django\core\handlers\exception.py", line 47, in inner
    response = get_response(request)
  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\django\core\handlers\base.py", line 179, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\views.py", line 622, in pathway_explorer
    view_df, pals_min, pals_mean, pals_max = get_pals_view_data()
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\views.py", line 954, in get_pals_view_data
    pals_df = get_cache_df()
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\pathway_analysis.py", line 61, in get_cache_df
    cache.set('pals_df', get_pals_df(), 60 * 180000)
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\pathway_analysis.py", line 48, in get_pals_df
    ds = get_cache_ds()
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\pathway_analysis.py", line 38, in get_cache_ds
    cache.set('pals_ds', get_pals_ds(), 60 * 180000)
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\pathway_analysis.py", line 26, in get_pals_ds
    fly_exp_design = get_pals_experimenal_design()
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\pathway_analysis.py", line 385, in get_pals_experimenal_design
    control_dict = cmpd_selector.get_list_view_column_names(controls)
  File "C:\Users\joewa\Work\git\FlyMet\met_explore\compound_selection.py", line 234, in get_list_view_column_names
    sample = Sample.objects.filter(group=g)[0]  # Get the first sample of this group.
  File "C:\Users\joewa\.virtualenvs\FlyMet-fLQaCvEH\lib\site-packages\django\db\models\query.py", line 325, in __getitem__
    return qs._result_cache[0]

Exception Type: IndexError at /met_explore/pathway_explorer
Exception Value: list index out of range

Created using the following command on the performance branch:

$ python reset_data.py && python manage.py migrate && python manage.py initialisedb data\tissues_life_stages_no_whole.csv data\67_peak_cmpd_export_1.json data\67_peak_int_export.json

The resulting database and pickled data from running the above command are attached.

db.sqlite3.zip

data.zip
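The crash is `Sample.objects.filter(group=g)[0]` indexing an empty queryset, i.e. some group in the experimental design has no matching samples. Django's `QuerySet.first()` returns `None` instead of raising; the equivalent defensive pattern in plain Python:

```python
def first_or_none(items):
    """Return the first element, or None if the sequence is empty
    (mirrors Django's QuerySet.first())."""
    for item in items:
        return item
    return None

# In get_list_view_column_names this would become, roughly:
#   sample = Sample.objects.filter(group=g).first()
#   if sample is None:
#       ...log/skip the group instead of crashing...
result_empty = first_or_none([])
result_full = first_or_none(["F_Whole_1", "F_Whole_2"])
```

That only papers over the symptom, though; it's still worth checking why the group has no samples in this database.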

Separate PiMP pre-processing from database loading in initialisedb command

Not sure if it's already the case now, but it would be great if, inside the initialisedb script, we could cleanly separate (1) the code that deals with the pre-processing of PiMP data from (2) the code that loads the pre-processed data into the database.

Ideally for part (1), we have a single object or method that uses only pandas or plain Python. It takes as input the paths to the sample CSV, intensity and annotation JSON files from PiMP. The output is a list of cleaned peaks alongside their high-confidence compound annotations. This way it can be used in other workflows as well to clean imported PiMP data, e.g. in WebOmics.

Part (2) would then take the output from part (1) and populate the Django database with it. I guess that will be unique to this project and won't really be re-used elsewhere. However, if the code is kept neatly in one place, it will be easier to optimise the database-loading performance later on.

Q: can I use the PeakSelector and CompoundSelector to do only part (1) @kmcluskey, even without having a database at all? From the code, it seems that this isn't entirely possible yet, but with small changes it might be.
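The part-(1) entry point could look like the sketch below; the function name, parameters, and return shape are all illustrative, not the real pipeline API:

```python
import json

def preprocess_pimp(sample_csv_path, annotation_json_path, intensity_json_path):
    """Hypothetical part-(1) entry point: stdlib/pandas only, no Django.
    Returns cleaned peaks plus high-confidence annotations, ready for a
    separate part-(2) loader to write into the database."""
    with open(annotation_json_path) as f:
        annotations = json.load(f)
    with open(intensity_json_path) as f:
        intensities = json.load(f)
    # ... peak cleaning and duplicate removal would happen here ...
    return {"peaks": intensities, "annotations": annotations}
```

With this split, initialisedb reduces to `preprocess_pimp(...)` followed by a single database-population call, and WebOmics could call `preprocess_pimp` on its own.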

Code duplication

Code marked:

??KMCL*** This code is copied directly from the peak_explorer - can this be made reusable??

in pathway_search.js is taken directly from peak_explorer.js - check whether this would be easy to move to a common place and reuse.

Improve Pathway Explorer performance

Noticed a number of issues when trying to access the pathway explorer for the first time using the test data.

  1. It's slow when retrieving the initial data. From console prints, I see:
Returning the metabolite data took: 139.86491279999998 S
...
Returning the pals data took: 285.96137580000004 S

So in total it took nearly 5 minutes until the pathway results are ready, which is a long wait.

  2. A number of Django socket errors appeared on the console (see pathways.txt). I guess this is because of (1): it took too long for the results to return.

  3. Even when the results have finished being computed, nothing shows up in the table. I think this is because of (2), since the connection has timed out. We need to click away and return to the page for the results to show up (once they've been cached).

One easy solution would be to precompute everything beforehand and store the results somewhere. It seems that Django caching is used, but that is only triggered when the page is accessed, so it's a bit too late. I guess we can do this pathway computation as part of the initial data loading, since it only needs to be done once.
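The precompute idea can be sketched as a warm-up step run once at load time rather than on the first page view. A minimal plain-Python sketch; the pals/metabolite function names in the comment are assumptions:

```python
_cache = {}

def warm_cache(jobs):
    """Run each expensive job once up front and memoise the result, so the
    first page request becomes a dictionary lookup instead of a recompute."""
    for key, fn in jobs.items():
        _cache[key] = fn()

def get_cached(key):
    return _cache[key]

# In the app this could be called from a management command run right after
# initialisedb, e.g. (function names hypothetical):
#   warm_cache({"pals_df": get_pals_df, "metabolite_df": get_metabolite_data})
warm_cache({"answer": lambda: 42})
```

In practice the Django cache (with a long timeout) would replace the module-level dict, but the sequencing is the point: compute at load time, serve from cache thereafter.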

General code tidy-up

  1. Use the logger from loguru rather than importing the root logging module (which we would need to configure further). The default loguru logger is already formatted to show the timestamp, originating line number, etc. Debugging is easier when we know the line number.

  2. Also get rid of all print statements and use the logger instead.

  3. Get rid of import * in the code. It's usually better to import exactly what we need.
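With loguru, point (1) is just `from loguru import logger`. For comparison, the stdlib-only equivalent needs the one-off configuration mentioned above to get the same module/function/line detail; a minimal sketch (the log message is illustrative, and the StringIO stream is only there so the output can be inspected):

```python
import io
import logging

# Configure a logger whose format includes module, function and line number,
# similar to loguru's default output.
stream = io.StringIO()                      # capture output for demonstration
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    "%(asctime)s | %(levelname)s | %(module)s:%(funcName)s:%(lineno)d - %(message)s"))
logger = logging.getLogger("flymet.demo")
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)

logger.debug("sample saved %s", "F_Brain_1")   # replaces a bare print(...)
output = stream.getvalue()
```

Either way, points (2) and (3) then amount to replacing `print(...)` with `logger.debug(...)` and enumerating imports explicitly.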

Tooltip error

On the peak comparisons page the first couple of tooltip headers have extra text.

Fix comparison calculation

If there are no peaks for whole flies, make sure that this does not flatten the data present in other tissues, e.g. anserine found in eyes but not in whole flies.

Speed up Pathway page

The Pathway page is REALLY slow to load the first time round - this can cause the server to time out.

Compare results male v female

Sue would like a comparison page of male versus female. Ideally, users could choose to compare any two sets of results.

Fix peak click - pathway search

When clicking on a peak in the Pathway search, you can't view the results on long pages without scrolling back up, e.g. the TCA cycle.
