
Comments (10)

mgcizzu commented on September 7, 2024

I also just fixed the L-probe example file so that it should now work consistently, and updated the pandas requirements again (it turns out I had changed them in a test branch and never merged the changes into the main branch).
Please keep finding bugs!
Marco

from lee_2023.

mgcizzu commented on September 7, 2024

I haven't changed the Biopython requirements, as you only get a warning rather than an error. I'll try to fix it in a later release, when I have a bit more time :)


mgcizzu commented on September 7, 2024

Thanks Ines,
we'll try to fix the Biopython issue in the next update.
Can you please post your matplotlib error?
Thx!
Marco


Boehmin commented on September 7, 2024

Hi Marco,

the matplotlib error was
ModuleNotFoundError: No module named 'matplotlib'

I just installed matplotlib and it worked fine.

I have now managed to run through the whole notebook (I can post these issues separately if preferred, or change the title of this issue, but here they are summed up for now):
I had the same pandas error as here. I pulled a fresh copy from git yesterday. First I tried changing .append to pd.concat in probedesign.py, which did not work. I then forced pandas==1.1.5, which fixed it, as suggested in the linked issue.
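
Assuming the error in question is the missing `DataFrame.append` method (it was deprecated in pandas 1.4 and removed in pandas 2.0), the usual replacement is `pd.concat`. The following is a minimal sketch with made-up rows, not the actual code from probedesign.py:

```python
import pandas as pd

# Made-up example data; the column names mirror those used in build_plps.
gene_names_ID = pd.DataFrame([{"gene": "GeneA", "Lbar_ID": "201"}])
new_row = {"gene": "GeneB", "Lbar_ID": "202"}

# Old style (fails on pandas >= 2.0):
#   gene_names_ID = gene_names_ID.append(new_row, ignore_index=True)
# Replacement: wrap the dict in a one-row DataFrame and concatenate.
gene_names_ID = pd.concat(
    [gene_names_ID, pd.DataFrame([new_row])], ignore_index=True
)
print(gene_names_ID)
```

Such a change only takes effect if the notebook actually imports the edited file.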

Since I still had to input the cutadapt file path, it took me a while to realise that the notebook was loading probedesign.py from the installed .egg module, so editing the probedesign.py file in the repository had little effect. I had not worked with modules/.egg files before, so it was quicker for me to do the following:
swap the following line
from PLP_directRNA_design import probedesign as plp
for

import sys
# Add the folder path to sys.path
sys.path.insert(0, '/user/pathto/PLP_directRNA_design')
import probedesign as plp

Not sure if there is an easier way (or if this might have broken things?).
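
One way to confirm which copy of a module Python would load is to ask the import system for its origin. This is a small sketch using only the standard library; `module_origin` is a hypothetical helper name, and `json` is just a stand-in for the package name used in the notebook:

```python
import importlib.util

def module_origin(name):
    """Return the file a top-level module would be imported from, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin

# 'json' is a stand-in; in the notebook one would pass "PLP_directRNA_design".
print(module_origin("json"))
```

If the printed path points inside a `.egg` directory, the installed copy is being used rather than the working copy. Reinstalling in development mode (`pip install -e .` from the repository root) is a common alternative to editing `sys.path`, assuming the repository provides a standard setup script.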

Another issue I had was in the "Assign Genes to Barcode" section, when I tested how="start" on=LbarID. Since it wasn't entirely clear to me whether to use the LbarID or a different variable/value, I tried all variations of "LbarID"/LbarID, numbers, column IDs, even barcode ID etc. (maybe a full example could help?), until I figured out that the issue was in Lprobe_Ver2.csv. For clarity, this is the error I received:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/miniconda3/envs/probedesign/lib/python3.9/site-packages/pandas/core/indexes/base.py:3361, in Index.get_loc(self, key, method, tolerance)
   3360 try:
-> 3361     return self._engine.get_loc(casted_key)
   3362 except KeyError as err:

File ~/miniconda3/envs/probedesign/lib/python3.9/site-packages/pandas/_libs/index.pyx:76, in pandas._libs.index.IndexEngine.get_loc()

File ~/miniconda3/envs/probedesign/lib/python3.9/site-packages/pandas/_libs/index.pyx:108, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:5198, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:5206, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'number'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[21], line 2
      1 customizedlib=r"C:\Users\sergio.salas\Documents\PhD\projects\gene_design\hower_example_5\assigned_gene_LID.csv"
----> 2 probes=plp.build_plps(path,specific_seqs_final,L_probe_library,plp_length,how='start',on="Lbar_ID")

File ~/02-spatial_transcriptomics/01-dataanalysis/08-ISS_processing/PLP_directRNA_design/PLP_directRNA_design/probedesign.py:510, in build_plps(path, specific_seqs_final, L_probe_library, plp_length, how, on)
    508 n=0
    509 for g in gname:
--> 510     gene_names_ID = gene_names_ID.append({"gene": g, "idseq" : np.array(sbh.loc[sbh['number']==ID+n,'ID_Seq'])[0], "Lbar_ID" : str(np.array(sbh.loc[sbh['number']==ID+n,'Lbar_ID'])[0]), "AffyID" : np.array(sbh.loc[sbh['number']==ID+n,'L_Affy_ID'])[0] }, ignore_index=True)
    511     n=n+1
    512 gene_names_ID2=gene_names_ID.set_index("gene", drop = False)

File ~/miniconda3/envs/probedesign/lib/python3.9/site-packages/pandas/core/frame.py:3458, in DataFrame.__getitem__(self, key)
   3456 if self.columns.nlevels > 1:
   3457     return self._getitem_multilevel(key)
-> 3458 indexer = self.columns.get_loc(key)
   3459 if is_integer(indexer):
   3460     indexer = [indexer]

File ~/miniconda3/envs/probedesign/lib/python3.9/site-packages/pandas/core/indexes/base.py:3363, in Index.get_loc(self, key, method, tolerance)
   3361         return self._engine.get_loc(casted_key)
   3362     except KeyError as err:
-> 3363         raise KeyError(key) from err
   3365 if is_scalar(key) and isna(key) and not self.hasnans:
   3366     raise KeyError(key)

KeyError: 'number'

I changed the values of the Lbar_ID rows ("LbarID_0") in the .csv to a number (201>) and changed sbh['number'] to sbh['Lbar_ID'], as shown below:

for g in gname:
            gene_names_ID = gene_names_ID.append({"gene": g, "idseq" : np.array(sbh.loc[sbh['Lbar_ID']==ID+n,'ID_Seq'])[0], "Lbar_ID" : str(np.array(sbh.loc[sbh['Lbar_ID']==ID+n,'Lbar_ID'])[0]), "AffyID" : np.array(sbh.loc[sbh['Lbar_ID']==ID+n,'L_Affy_ID'])[0] }, ignore_index=True)

Now it runs.
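
A quick column check before calling build_plps makes this kind of KeyError easier to diagnose. The sketch below assumes that `sbh` is built from the L-probe library file and needs the four columns that appear in the traceback; `missing_columns` is a hypothetical helper and the table contents are made up:

```python
import pandas as pd

# Columns accessed on `sbh` in the traceback above.
REQUIRED_COLUMNS = ["number", "ID_Seq", "Lbar_ID", "L_Affy_ID"]

def missing_columns(library, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from the table."""
    return [col for col in required if col not in library.columns]

# Made-up stand-in for the L-probe library; the real file would be read with pd.read_csv.
L_probe_library = pd.DataFrame(
    {"ID_Seq": ["ACGT"], "Lbar_ID": ["LbarID_0"], "L_Affy_ID": ["affy_0"]}
)
print(missing_columns(L_probe_library))
```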

Additionally, I was wondering if you have ever had issues with duplicate sequences? I tried this pipeline with a random gene (Zfp85) and got 6 PLP sequences, 2 of which are duplicates.
The results are attached here:
good_targetsfinal.csv
designed_PLPs_final.csv

Thank you!
Ines


mgcizzu commented on September 7, 2024

Hi Ines,
interesting, the pandas issue should have been fixed by a previous update. I'll double check.
The same goes for the redundant probes: we had solved this in a previous version of our code, but it somehow made it in here.
Give me a few days to go through the code.
M.


mgcizzu commented on September 7, 2024

OK, here I am.
This is for both @Boehmin and @Sverreg (commenting on the issue mentioned here).
You likely get duplicate sequences (or sequences within a ±20 nt range, which overlap and should be excluded; you can check this in the position column) because your gene is a bad substrate for the probes: either it's too short or it doesn't comply well with our GC requirement.
When the search doesn't find a number of targets equal to or greater than the one you specified (default=5), it automatically returns all the targets.
Here's the relevant code snippet from the select_sequences function in PLP_design.py:

if ele.shape[0]<number_of_selected:
    selec=ele
else:
    for num in range(0,number_of_selected):
        if ele.shape[0]>0:
            randomlist = random.sample(range(0, ele.shape[0]), 1)
            sele=ele.iloc[randomlist,:]
            try:
                seleall=pd.concat([seleall,sele])
            except:
                seleall=sele
            exclude=list(range(int(sele['Position']-20),int(sele['Position']+20)))
            ele=ele.loc[~ele['Position'].isin(exclude),:]
    selec=seleall
selected2=pd.concat([selected2,selec])
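
To make the fallback behaviour concrete, here is a minimal, self-contained sketch of the same selection idea on plain integer positions. It is a simplification and not the code from the repository: `select_positions` and its `window` and `seed` parameters are hypothetical names, and the positions are made up.

```python
import random

def select_positions(positions, number_of_selected=5, window=20, seed=None):
    """Pick up to number_of_selected positions, excluding neighbours of each pick."""
    rng = random.Random(seed)
    remaining = list(positions)
    # Fallback branch: too few candidates, so everything is returned unfiltered.
    if len(remaining) < number_of_selected:
        return remaining
    chosen = []
    for _ in range(number_of_selected):
        if not remaining:
            break
        pick = rng.choice(remaining)
        chosen.append(pick)
        # Drop candidates in [pick - window, pick + window), like the `exclude` range above.
        remaining = [p for p in remaining if not (pick - window <= p < pick + window)]
    return chosen

# Only 4 candidates for 5 requested targets: duplicates and overlaps pass through.
print(select_positions([100, 100, 105, 300], number_of_selected=5))
# Enough candidates: the picks are spaced at least `window` apart.
print(select_positions([0, 10, 50, 100, 150, 200, 250], number_of_selected=3, seed=1))
```

In this sketch, the overlap filter only runs in the second branch, which matches the explanation that a gene with too few candidate targets can return duplicate or overlapping sequences.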

I'd suggest running the search again for these genes, relaxing the GC content a bit or removing the requirement for a terminal G/C. Maybe that will fix the issue. Keep in mind that it is sometimes impossible to design "good" probes against some genes. You can still try your chances, design them manually, and they might work...
Please let me know if my explanation doesn't make sense.

Cheers,
Marco


Boehmin commented on September 7, 2024

Hi Marco,

thank you for the explanation! I'll keep that in mind and will give this a go. Just to check, would lowering the target requirement to 3 or 4 potentially also help?

Cheers,
Ines


mgcizzu commented on September 7, 2024

Hi Ines,
regarding your last question: I think it's wise to have 5 probes (targets) per gene if possible, as this ensures high detection efficiency. While setting the target requirement to 3 or 4 would help with the issue above, I'd still try to design 5 and do a manual check to remove duplicates and overlaps.
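
Such a manual check could look roughly like the sketch below. It assumes a table with a `Position` column (as in the selection code) and a `Sequence` column, which is a hypothetical name; the rows are made up, and `drop_duplicates_and_overlaps` is not part of the package.

```python
import pandas as pd

def drop_duplicates_and_overlaps(targets, window=20):
    """Remove duplicate sequences, then keep targets at least `window` nt apart."""
    unique = (
        targets.drop_duplicates(subset="Sequence")
        .sort_values("Position")
        .reset_index(drop=True)
    )
    keep = []
    last_kept = None
    for i, pos in enumerate(unique["Position"]):
        # Greedy pass: keep a target only if it is far enough from the last kept one.
        if last_kept is None or pos - last_kept >= window:
            keep.append(i)
            last_kept = pos
    return unique.iloc[keep]

# Made-up targets for a single gene.
targets = pd.DataFrame(
    {"Position": [100, 100, 110, 200], "Sequence": ["AAA", "AAA", "CCC", "GGG"]}
)
cleaned = drop_duplicates_and_overlaps(targets)
print(cleaned)
```

The greedy pass keeps the first target in each overlapping group; a different rule (for example, keeping the candidate with the best GC content) might be preferable in practice.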
Cheers and sorry for the late reply!
Marco


Boehmin commented on September 7, 2024

Hi Marco,

thanks for the tip. I'll test this again.
Maybe you can answer another question. I am trying to design probes that are as species-agnostic as possible between mouse and human (following the description in the pre-print).
Do I understand correctly that I should:

  1. Run the extract and align sequences for mouse & human
  2. Concatenate results?
  3. Run plp.select_sequences() on mouse + human extracted 30kmers (run this on mouse & human separately or combined?)
  4. Run plp.map_sequences() on selected sequences above, keep only 30mers found in both species, run this twice; once against human, once against mouse
  5. Continue normally to end like in the tutorial notebook
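
As a rough illustration of the "keep only 30mers found in both species" idea in step 4 (not a confirmed part of the pipeline), the shared sequences could be found with a set intersection. `shared_kmers` is a hypothetical helper, and the short sequences below are made-up placeholders rather than real 30mers:

```python
def shared_kmers(mouse_seqs, human_seqs):
    """Return the sequences present in both collections, sorted for reproducibility."""
    return sorted(set(mouse_seqs) & set(human_seqs))

# Made-up placeholder sequences; real input would be the selected 30mers per species.
mouse = ["ACGTAC", "TTGACA", "GGCATC"]
human = ["TTGACA", "CCCGTA", "ACGTAC"]
print(shared_kmers(mouse, human))
```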

I'm also slowly going through the ISS analysis notebooks, so I will start a separate issue should I run into bugs there. :)


Boehmin commented on September 7, 2024

Hi @mgcizzu,

one more question, is the anchor sequence in the notebook the primer sequence? Since you use a pseudo-anchor as described in the supplementary, I assume the anchor in this notebook is the complementary sequence of your RCA primer?
Thank you!
