Hello, I managed to complete the pipeline with your provided test da

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Thanks <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-u

Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

Pipeline works on your test dataset but fails for experimental dataset - ValueError: Bin edges must be unique: array about scib-pipeline HOT 10 CLOSED

stemangiola commented on August 15, 2024

Pipeline works on your test dataset but fails for experimental dataset - ValueError: Bin edges must be unique: array

from scib-pipeline.

Comments (10)

mumichae commented on August 15, 2024 1

Yes, I got that error too. This is a bug in scib's LISI function, when printing the warning about not having enough neighbours. This usually happens in smaller datasets and shouldn't be that case in larger ones.

I'm currently addressing it here. Feel free to update scib in your python environment through pip as follows:

pip install git+https://github.com/theislab/scib.git@lisi_warning_bug

from scib-pipeline.

LuckyMD commented on August 15, 2024

Hi @stemangiola,

It looks like you haven't filtered out genes that are not expressed, or only in very few cells. That's leading to many genes with the same mean expression. It might be worth filtering out genes that are not expressed in more than 50 cells or so.

from scib-pipeline.

stemangiola commented on August 15, 2024

Thanks @LuckyMD ,

I have filtered only the 2000 most variable gene before using the pipeline, however, the error remains (also with other errors of integration algorithms, quite cryptic). Again, everything works with your test dataset

I don't know if looking at my dataset you would be able to understand what could be "wrong".

Thanks a lot.

Input file https://wehieduau-my.sharepoint.com/:u:/g/personal/mangiola_s_wehi_edu_au/Ecr0-pQrqrJJkWGj1VNwj8EBGS16rMHaOg9-kSV5ZBgK_w?e=lsKrDG

Config file

ROOT: data_pbmc
r_env : scib-R4
py_env : scib-pipeline-R4

timing: false
unintegrated_metrics: false

FEATURE_SELECTION:
  #hvg: 2000
  full_feature: 0

SCALING:
  - unscaled
  #- scaled

METHODS:
# python methods
  bbknn:
    output_type: knn
  combat:
    output_type: full
  #desc:
  #  output_type: embed
#  mnn:
#    output_type: full
  #saucie:
  #  output_type:
  #    - full
  #    - embed
  scanorama:
    output_type:
      - embed
      - full
  scanvi:
    output_type: embed
    no_scale: true
    use_celltype: true
  scgen:
    output_type: full
    use_celltype: true
  scvi:
    no_scale: true
    output_type: embed
  #trvae:
  #  no_scale: true
  # output_type:
  #    - embed
  #    - full
  # trvaep:
  #   no_scale: true
  #   output_type:
  #     - embed
  #     - full
# R methods
  #conos: # temporary directory issue
  #  R: true
  #  output_type: knn
  fastmnn:
    R: true
    output_type:
      - embed
      - full
  harmony:
    R: true
    output_type: embed
  liger:
    no_scale: true
    R: true
    output_type: embed
  seurat:
    R: true
    output_type: full
  seuratrpca:
      R: true
      output_type: full

DATA_SCENARIOS:
  run_pbmc:
    batch_key: batch
    label_key: celltype
    organism: human
    assay: expression
    file: data/pbmc_simplified.h5ad

Environment files

name: scib-pipeline-R4
channels:
  - conda-forge
  - bioconda
dependencies:
  - python>=3.7
  - r-base=4.0
  - pip
  - rpy2=3.4.2
  - snakemake>=5
  - networkx
  - icu
  - GraphViz
  - igraph
  - gsl
  - openblas
  - llvmlite
  - libgcc-ng
  - anndata>=0.8
  - anndata2ri
  - r-essentials
  - r-devtools
  - r-stringi
  - bioconductor-scater
  - bioconductor-scran
  # Methods
  - scvi=0.6.7
  - scanorama=1.7.0
#  - mnnpddy=0.1.9.5
  - bbknn=1.3.9
  - r-seurat=3
  - r-spatstat=1.64
  - pip:
    - scib==1.0.1
    - trvaep==0.1.0
#    - git+https://github.com/pinin4fjords/mnnpy.git
    - scgen==1.1.5
#    - trvae==1.1.2
#    - desc==2.0.3
#    - keras==2.2.4

name: scib-R4
channels:
  - bioconda
  - conda-forge
dependencies:
  - python>=3.7
  - r-base=4.0
  - r-essentials
  - gsl
  - r-optparse
  - r-remotes
  - r-devtools
  - r-biocmanager
  - r-cairo
  # Seurat dependencies
  - bioconductor-multtest
  - r-sdmtools
  - r-stringi
  - r-metap
  - r-seurat=3
  - r-spatstat=1.64
  # batchelor
  - bioconductor-batchelor
  # Liger dependencies
  - r-liger=0.5.0
  - openjdk
  - r-hdf5r
  - r-hmisc

from scib-pipeline.

stemangiola commented on August 15, 2024

Hello,

I managed to complete the pipeline. I had to filter heavily all genes with median abundance > 0 across cells, although present in many cells (I have ~60k cells).

I was left with ~500 genes. I will test how relaxed I can be in the filtering, however it seems a bit strange that 2000 variable genes with which I can perform Seurat analyses with no problem, are not transcribed enough for the completion of the scib pipeline.

from scib-pipeline.

LuckyMD commented on August 15, 2024

Agreed that this shouldn't be the case. The error occurs within the sc.pp.highly_variable_genes() function from scanpy. This runs highly variable genes per batch. It must be the case that in one of your batches many of your 2000 genes are very lowly expressed and therefore you can't compute the highly variable genes in that batch based on mean filtering. You can either subset to genes that are highly variable in many batches (using sc.pp.highly_variable_genes() or the equivalent scib.pp function with the batch= parameter set), or you can remove the problematic batch. Finally, you can also turn off the HVG metric that is the source of this error.

from scib-pipeline.

mumichae commented on August 15, 2024

Hi @stemangiola,

just another thing I wanted to ask about your data, since it wasn't explicitly stated in the README. The pipeline expects log scaled data in adata.X, so are you working with normalised and log-transformed data? If not, I could imaging that being the cause of the duplicate bins during HVG computation.

from scib-pipeline.

stemangiola commented on August 15, 2024

Thanks,

Reading the test data you provide with the data simulation script, I see that there are 2 assays: X and counts and they are identical. Does it sound right?

The distribution of the counts is not logaritmic

y@assays@data$X %>% as.numeric %>% summary
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.000   2.000   2.657   4.000 171.000

so I used this data of yours directly.

Now I understand this might be wrong. So what steps do I have to do exactly?

integer counts as raw data
log1p transform (I assume I have to add 1 before logging)
scaling across genes?
scaling (aka normalising) across samples?

from scib-pipeline.

mumichae commented on August 15, 2024

You're correct, the test data hasn't been log-scaled. I'll fix that.

As for preprocessing, we recommend QC, normalisation when you have multiple batches, and log1p transformation, as described in the Best practices for scRNAseq analysis (tutorial). For normalisation across samples, you can use your method of choice. The scib package provides normalisation in scran through rpy2. The scanpy function sc.pp.log1p already does log + 1 transformation.

Scaling across genes is something that the pipeline does provide. You can include gene scaling to the key scaled in the config file.
e.g.

scaled:
  - scaled
  - unscaled

The usefulness of gene scaling likely to depends on the dataset you're using

from scib-pipeline.

stemangiola commented on August 15, 2024

Thanks.

I encounter another error that pops in and out without clear reason (given the lack of changes in the input)

Traceback (most recent call last):
  File "scripts/metrics/metrics.py", line 263, in <module>
    trajectory_=trajectory_
  File "/home/users/allstaff/mangiola.s/.conda/envs/scib-pipeline-R4/lib/python3.7/site-packages/scib/metrics/metrics.py", line 344, in metrics
    verbose=verbose
  File "/home/users/allstaff/mangiola.s/.conda/envs/scib-pipeline-R4/lib/python3.7/site-packages/scib/metrics/lisi.py", line 293, in clisi_graph
    verbose=verbose
  File "/home/users/allstaff/mangiola.s/.conda/envs/scib-pipeline-R4/lib/python3.7/site-packages/scib/metrics/lisi.py", line 433, in lisi_graph_py
    count)
  File "/home/users/allstaff/mangiola.s/.conda/envs/scib-pipeline-R4/lib/python3.7/multiprocessing/pool.py", line 276, in starmap
    return self._map_async(func, iterable, starmapstar, chunksize).get()
  File "/home/users/allstaff/mangiola.s/.conda/envs/scib-pipeline-R4/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
numpy.core._exceptions.UFuncTypeError: ufunc 'add' did not contain a loop with signature matching types (dtype('int64'), dtype('<U26')) -> None

[Fri Apr 29 19:01:01 2022]
Error in rule metrics_single:
    jobid: 30
    output: /stornext/Bioinf/data/bioinf-data/Papenfuss_lab/projects/mangiola.s/PostDoc/covid19pbmc/software/scib-pipeline/data_pbmc/run_pbmc/metrics/unscaled/full_feature/liger_embed.csv
    shell:

        conda run -n scib-pipeline-R4 python scripts/metrics/metrics.py -u data/pbmc_simplified.h5ad -i /stornext/Bioinf/data/bioinf-data/Papenfuss_lab/projects/mangiola.s/PostDoc/covid19pbmc/software/scib-pipeline/data_pbmc/run_pbmc/integration/unscaled/full_feature/R/liger.h5ad          -o /stornext/Bioinf/data/bioinf-data/Papenfuss_lab/projects/mangiola.s/PostDoc/covid19pbmc/software/scib-pipeline/data_pbmc/run_pbmc/metrics/unscaled/full_feature/liger_embed.csv -m liger          -b batch -l celltype --type embed          --hvgs 0 --organism human --assay expression -v

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Do you have idea about the reason?

from scib-pipeline.

mumichae commented on August 15, 2024

scib 1.0.2 is out now and should have fixed that bug

from scib-pipeline.

Pipeline works on your test dataset but fails for experimental dataset - ValueError: Bin edges must be unique: array about scib-pipeline HOT 10 CLOSED

Comments (10)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent