moshi4 / pymsaviz

MSA(Multiple Sequence Alignment) visualization python package for sequence analysis

Home Page: https://moshi4.github.io/pyMSAviz

License: MIT License

Shell 1.31% Python 98.69%

bioinformatics genomics matplotlib msa multiple-sequence-alignment python sequence-alignment sequence-analysis visualization

pymsaviz's Introduction

Software

pyCirclize
pyGenomeViz
pyMSAviz
ANIclustermap
COGclassifier
phyTreeViz
pybarrnap

pymsaviz's Issues

get specific positions from MSA

Hi, what if I need to extract and display only specific positions, say 30, 40, 50 and 60, from an MSA? Is there a direct way to do this rather than writing a new FASTA file?
Also, it would be great if you could add functionality to load a Newick-format tree and show the phylogeny on the left-hand side of the header.
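
For the column-selection part of this question, the underlying operation can be sketched in plain Python without touching the disk. This is only an illustration under stated assumptions: `select_columns` and the sample sequences are hypothetical names and data, not part of pyMSAviz, and positions are treated as 1-based alignment columns.

```python
def select_columns(alignment, positions):
    """Return a new alignment that keeps only the given 1-based columns.

    `alignment` maps sequence names to aligned strings of equal length.
    This helper is hypothetical and not part of pyMSAviz.
    """
    length = len(next(iter(alignment.values())))
    for pos in positions:
        if not 1 <= pos <= length:
            raise ValueError(f"position {pos} is outside 1..{length}")
    # Convert each 1-based position to a 0-based string index.
    return {
        name: "".join(seq[pos - 1] for pos in positions)
        for name, seq in alignment.items()
    }


# Made-up example data, short enough to check by eye.
msa = {
    "seq1": "MKT-AYIAKQR",
    "seq2": "MKTQAY-AKQR",
}
subset = select_columns(msa, [1, 4, 7])
print(subset)
```

The resulting strings still have to reach the plotting step somehow. Whether `MsaViz` accepts an in-memory alignment object instead of a file path should be checked against the pyMSAviz documentation rather than assumed from this sketch.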

msa without gaps

I think a function to draw an MSA without gaps is needed.
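
Until such a function exists, gap columns could be stripped before plotting. The following is a minimal sketch, assuming that "without gaps" means dropping alignment columns that contain the gap character. `drop_gap_columns` and its `require_all` switch are hypothetical and not part of pyMSAviz.

```python
def drop_gap_columns(sequences, gap_char="-", require_all=False):
    """Remove gap columns from a list of equal-length aligned strings.

    require_all=False drops every column that contains at least one gap.
    require_all=True drops only columns that consist entirely of gaps.
    """
    keep = []
    # zip(*sequences) iterates over the alignment column by column.
    for index, column in enumerate(zip(*sequences)):
        n_gaps = column.count(gap_char)
        is_gap_column = n_gaps == len(column) if require_all else n_gaps > 0
        if not is_gap_column:
            keep.append(index)
    return ["".join(seq[i] for i in keep) for seq in sequences]


# Made-up example alignment.
aligned = ["MK-TA-", "MKQT--", "M-QTA-"]
print(drop_gap_columns(aligned))
print(drop_gap_columns(aligned, require_all=True))
```

Dropping every column with a single gap can remove most of a divergent alignment, so the all-gap mode is usually the safer default when the goal is only to tidy the figure.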

How do you show a more complete description of the aligned entries?

Hello,

First of all I want to say that I really like this python package. I think it is really neat!

I am currently having the problem that the labels for the various sequences are cut off.
According to how I understand the code, if I have a fasta file containing aligned sequence records with the format

>Genus1 species1 strain1 | accession_number1
----M----AD---A
>Genus2 species2 strain2 | accession_number2
----M----SD---A
etc.

Then when I run the package on that particular fasta file, only the Genus is shown on the left of each sequence.
Since I have files with many species of the same genus, I also cannot use the sorted=True setting when creating the MsaViz object, as I then get a "Duplicate values" error.

If I switch the format of my alignment files around in order to have the first thing be the accession_number, the sorted=True setting functions as intended but I am left with a figure showing only the accession_numbers as a description, which makes interpreting the results hard.

Current format:

>accession_number1 | Genus1 species1 strain1
----M----AD---A
>accession_number2 | Genus2 species2 strain2 
----M----SD---A
etc.

Am I doing something wrong currently, or is there some setting to ensure that the full title of the sequence record is shown?
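
One possible workaround, sketched here as an assumption rather than a confirmed pyMSAviz feature: Biopython's FASTA parser takes the first whitespace-delimited word of a header as the record ID, so if the labels are derived from that ID, joining the header words with underscores would make the whole description a single, unique token. The helpers `header_to_label` and `relabel_fasta` are hypothetical.

```python
import re


def header_to_label(header, seen):
    """Turn a FASTA header into one whitespace-free token, unique within `seen`."""
    label = re.sub(r"\s+", "_", header.lstrip(">").strip())
    count = seen.get(label, 0)
    seen[label] = count + 1
    # Append a numeric suffix to the second and later occurrences of a label.
    return label if count == 0 else f"{label}_{count + 1}"


def relabel_fasta(text):
    """Rewrite every header line of FASTA text, leaving sequence lines unchanged."""
    seen = {}
    out = []
    for line in text.splitlines():
        if line.startswith(">"):
            out.append(">" + header_to_label(line, seen))
        else:
            out.append(line)
    return "\n".join(out) + "\n"


fasta = (
    ">Genus1 species1 strain1 | accession_number1\n"
    "----M----AD---A\n"
    ">Genus2 species2 strain2 | accession_number2\n"
    "----M----SD---A\n"
)
print(relabel_fasta(fasta))
```

With headers rewritten this way, sorting by ID should no longer collide on a shared genus name, although long labels may still need a wider figure to remain readable.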

Minimal working example assuming switched fasta files:


import os

import numpy as np
import pymsaviz
from Bio import SeqIO


def make_pymsaviz_plot(path, name, outname, min_gap_length=5, gap_fraction=0.05, gap_char="-", variable_consensus=0.4, variable_char="x", show_count=True, show_consensus=True, color_scheme="Clustal", sorted=False):
    # create the input and output file names
    infile = os.path.join(path, name)
    outfile = os.path.join(path, outname)

    # parse the input fasta file into an array of sequences
    sequences = []
    for record in SeqIO.parse(infile, "fasta"):
        sequences.append(record.seq)

    # for every position in the alignment, count the number of gap characters
    gap_count = np.zeros(len(sequences[0]))

    for sequence in sequences:
        for i, aa in enumerate(sequence):
            if aa == gap_char:
                gap_count[i] += 1


    # get the continuous stretches of gaps
    gap_stretches = []
    # start at -1 so that a gap stretch beginning at the first column is not missed
    start = -1
    end = -1
    for i, count in enumerate(gap_count):
        if(count/len(sequences) > (1-gap_fraction)):
            end = i
        else:
            if end > start:
                gap_stretches.append((start+2, end+1))
            start = i
            end = i
    if end > start:
        gap_stretches.append((start+2, end+1))

    # remove the stretches of gaps that are too short
    gap_stretches = [stretch for stretch in gap_stretches if stretch[1] - stretch[0] > min_gap_length]
    print(gap_stretches)

    # create a new fasta file with the gap_stretches removed
    protein_accessions = []
    genus_species = []

    with open(outfile + "_gap_trimmed.fasta", "w") as f:
        for record in SeqIO.parse(infile, "fasta"):

            # get the protein accession and genus species
            protein_accessions.append(record.description.split("|")[0].strip())
            genus_species.append(record.description.split("|")[1])

            new_seq = ""
            for i, aa in enumerate(record.seq):
                if not any([i >= stretch[0]-1 and i <= stretch[1]-1 for stretch in gap_stretches]):
                    new_seq += aa
            f.write(">" + record.description + "\n")
            f.write(new_seq + "\n")

    # create a pymsaviz object from both the non-trimmed and the trimmed fasta file
    msa = pymsaviz.MsaViz(infile, show_count=show_count, show_consensus=show_consensus, sort=sorted, color_scheme=color_scheme)
    msa_trimmed = pymsaviz.MsaViz(outfile + "_gap_trimmed.fasta", show_count=show_count, show_consensus=show_consensus, sort=sorted, color_scheme=color_scheme)

    # add annotations to the pymsaviz object
    for gap in gap_stretches:
        msa.add_text_annotation(gap, "Gap Region", text_color="black", range_color="red")
    msa.savefig(outfile + "_gap.png")

    # add markers to the trimmed pymsaviz object to flag highly non-conserved positions
    high_variability = []
    identity_list = msa_trimmed._get_consensus_identity_list()

    for position, identity in enumerate(identity_list, 1):
        if identity < variable_consensus:
            high_variability.append(position)
    msa_trimmed.add_markers(high_variability, marker=variable_char, color="red")
    msa_trimmed.savefig(outfile + "_gap_trimmed.png")

    return msa, msa_trimmed

Thank you in advance for your help.

[Feature Request] new color scheme based on similarity (a la Uniprot)

Hi,

This tool is super! I was wondering if the tool can also have a color scheme based on the similarity of the characters, like the example below:

Question - Save plot without displaying the plot interactively?

Hi! Thank you for creating pyMSAviz, I've been actively using it to quickly view some MSAs I've generated. While looping over the MSAs and saving each one with the .savefig() function, I noticed that the plots are always displayed interactively, and I quickly ran out of memory when doing this automatically for a large number of files. Apologies for any ignorance, but is there a way to save the plot without displaying it interactively? The way I am reading and saving my MSAs is shown below:

mv = MsaViz(rha_file, show_label=False, color_scheme='Clustal', show_consensus=True)
mv.savefig(f'../figures/{output_directory}_MSA_figures/{format_rha_file}/{format_rha_file}_RHA_MSA_figure.png')

Thank you!
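
A general Matplotlib pattern that may help here, offered as a sketch rather than a confirmed pyMSAviz answer: select the non-interactive `Agg` backend before creating any figures, and close each figure explicitly after saving so that memory is released inside the loop. The helper `save_and_close` and the toy line plots are hypothetical stand-ins. With pyMSAviz, the figure returned by `mv.plotfig()` could presumably be passed to the same helper.

```python
import os
import tempfile

import matplotlib

# Select the non-interactive backend before any figure is created.
matplotlib.use("Agg")
import matplotlib.pyplot as plt


def save_and_close(fig, path, dpi=100):
    """Write the figure to disk, then release it so figures do not pile up."""
    fig.savefig(path, dpi=dpi)
    plt.close(fig)


# Stand-in loop: three trivial figures saved to a temporary directory.
out_dir = tempfile.mkdtemp()
for i in range(3):
    fig, ax = plt.subplots()
    ax.plot([0, 1], [0, i])
    save_and_close(fig, os.path.join(out_dir, f"figure_{i}.png"))

print(sorted(os.listdir(out_dir)))
```

In a Jupyter notebook the inline backend renders figures on its own terms, so closing each figure after saving tends to matter more there than the backend choice.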

matplotlib minimum version

With matplotlib version 3.5.3 and pymsaviz version 0.4.0 installed, this example from the docs gave an AttributeError:

from pymsaviz import MsaViz, get_msa_testdata

msa_file = get_msa_testdata("HIGD2A.fa")
mv = MsaViz(msa_file)
fig = mv.plotfig()

Extract from stack trace:

... pymsaviz/msaviz.py) in plotfig(self, dpi) ...
--> 420         fig.set_layout_engine("tight")
    421         gs = GridSpec(nrows=len(plot_ax_types), ncols=1, height_ratios=y_size_list)
    422         gs.update(left=0, right=1, bottom=0, top=1, hspace=0, wspace=0)

AttributeError: 'Figure' object has no attribute 'set_layout_engine'

Installing matplotlib 3.6.0 fixed the issue so perhaps the minimum version (e.g. matplotlib = ">=3.5.2" ) may need updating?
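
The report is consistent with `Figure.set_layout_engine` having been introduced in Matplotlib 3.6. A simple guard against this kind of mismatch is to compare version strings before calling the newer API. The sketch below uses hypothetical helpers (`parse_version`, `meets_minimum`) and only handles plain numeric versions. The `packaging` library is the more robust choice for real projects.

```python
import re


def parse_version(version):
    """Parse the leading numeric components of a version string into a tuple."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    # Pad to three components so that "3.6" compares equal to "3.6.0".
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Return True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("3.5.3", "3.6.0"))
print(meets_minimum("3.6.0", "3.6.0"))
```

The same comparison shows why a declared minimum of 3.5.2 would let the failing 3.5.3 installation through.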
