symlink-seqs's Introduction

symlink-seqs

Create fastq symlinks for selected samples in sequencer output directories based on project ID from SampleSheet files.

Usage

usage: symlink-seqs [-h] [-p PROJECT_ID] [-r RUN_ID] [-i IDS_FILE] [-s] [-c CONFIG] [--copy] [--csv] [-o OUTDIR]

optional arguments:
  -h, --help                                Show this help message and exit.
  -p PROJECT_ID, --project-id PROJECT_ID
  -r RUN_ID, --run-id RUN_ID
  -i IDS_FILE, --ids-file IDS_FILE          File of sample IDs (one sample ID per line)
  -s, --simplify-sample-id                  Simplify filenames of symlinks to include only sample-id_R{1,2}.fastq.gz
  -c CONFIG, --config CONFIG                Config file (json format).
  --copy                                    Create copies instead of symlinks.
  --csv                                     Print csv-format summary of fastq file paths for each sample to stdout.
  -o OUTDIR, --outdir OUTDIR                Output directory, where symlinks (or copies) will be created.

If you add the -s (or --simplify-sample-id) flag, then the filenames of the symlinks will be simplified to only sample-id_R1.fastq.gz, instead of the original sample-id_S01_L001_R1_001.fastq.gz as they are named in the original sequencer output directory.
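The renaming described above can be sketched with a regular expression. This is only an illustration, not the tool's actual code: the function name `simplify_fastq_name` and the exact pattern are assumptions based on the example filenames in this README.

```python
import re

# Assumed Illumina-style name: <sample-id>_S<n>_L<lane>_R<read>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<sample_id>.+)_S\d+_L\d{3}_(?P<read>R[12])_001\.fastq\.gz$"
)


def simplify_fastq_name(filename):
    """Return '<sample-id>_R<read>.fastq.gz'; leave non-matching names unchanged."""
    match = FASTQ_NAME.match(filename)
    if match is None:
        return filename
    return f"{match.group('sample_id')}_{match.group('read')}.fastq.gz"


print(simplify_fastq_name("sample-id_S01_L001_R1_001.fastq.gz"))
```

Names that do not fit the assumed pattern are returned as-is, so an unexpected file is never silently renamed.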

Adding the --copy flag will create copies of the files instead of symlinks. Be aware of the extra data storage implications of creating copies.

When the --csv flag is used, neither symlinks nor copies will be created. Instead, a csv file with the following fields will be printed to standard output:

ID,R1,R2

...where ID is the sample ID, R1 is the path to the R1 fastq file, and R2 is the path to the R2 fastq file.
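As a rough sketch of how such a summary could be produced with Python's standard `csv` module: the function name `write_csv_summary` and the shape of the input dictionary are hypothetical, and only the `ID,R1,R2` header comes from the description above.

```python
import csv
import sys


def write_csv_summary(fastq_paths_by_sample, out=sys.stdout):
    """Write one ID,R1,R2 row per sample to the given file-like object.

    fastq_paths_by_sample is assumed to map a sample ID to a dict
    with optional 'R1' and 'R2' keys holding fastq paths.
    """
    writer = csv.DictWriter(out, fieldnames=["ID", "R1", "R2"], lineterminator="\n")
    writer.writeheader()
    for sample_id in sorted(fastq_paths_by_sample):
        paths = fastq_paths_by_sample[sample_id]
        writer.writerow({
            "ID": sample_id,
            "R1": paths.get("R1", ""),
            "R2": paths.get("R2", ""),
        })
```

A missing read is written as an empty field in this sketch, so single-end samples still produce a well-formed row.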

If a run ID is supplied, then only samples from that run will be symlinked. If no sample IDs file is supplied, then all samples on that run will be symlinked.

Configuration

The tool reads a config file from ~/.config/symlink-seqs/config.json by default. An alternative config file can be provided using the -c (or --config) flag.

A minimal config file must include a list of directories where sequencing run dirs can be found, under the key sequencing_run_parent_dirs:

{
	"sequencing_run_parent_dirs": [
		"/path/to/sequencer-01/output",
		"/path/to/sequencer-02/output",
		"/path/to/sequencer-03/output"
	]
}

Additional settings may be added to the config:

{
	"sequencing_run_parent_dirs": [
		"/path/to/sequencer-01/output",
		"/path/to/sequencer-02/output",
		"/path/to/sequencer-03/output"
	],
	"simplify_sample_id": true
}

Key                         Required?  Value Type     Description
sequencing_run_parent_dirs  True       List of paths
simplify_sample_id          False      Boolean
copy                        False      Boolean        When set to true, make copies instead of symlinks
csv                         False      Boolean        When set to true, print a csv summary of fastq files per sample
outdir                      False      Path           Directory to create symlinks or copies under

The file must be in valid JSON format.
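A minimal sketch of loading and checking such a config is shown below. The function name `load_config` and the error messages are assumptions. Only the default path and the required `sequencing_run_parent_dirs` key come from this README.

```python
import json
import os

DEFAULT_CONFIG_PATH = os.path.expanduser("~/.config/symlink-seqs/config.json")


def load_config(path=DEFAULT_CONFIG_PATH):
    """Load the JSON config and check that the one required key is present."""
    with open(path, "r") as f:
        config = json.load(f)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(config, dict):
        raise ValueError("config must be a JSON object")
    run_parent_dirs = config.get("sequencing_run_parent_dirs")
    if not isinstance(run_parent_dirs, list) or not run_parent_dirs:
        raise ValueError(
            "config must include a non-empty list under 'sequencing_run_parent_dirs'"
        )
    return config
```

Optional keys such as `copy` or `outdir` are passed through untouched, so the caller can decide how they interact with command-line flags.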

symlink-seqs's People

Contributors

dfornika

symlink-seqs's Issues

Error when no `--ids-file` supplied for NextSeq run

When no sample ID list is supplied via the `--ids-file` flag, and a NextSeq run ID is supplied, we run into this error:

Traceback (most recent call last):
  File "symlink-seqs", line 627, in <module>
    main(args)
  File "symlink-seqs", line 601, in main
    fastq_paths = get_fastq_paths(config, run_dir, sample_ids)
  File "symlink-seqs", line 501, in get_fastq_paths
    for sample in selected_samples:
  File "symlink-seqs", line 480, in <lambda>
    selected_samples = filter(lambda x: x['sample_id'] in sample_ids, candidate_samples)
TypeError: argument of type 'NoneType' is not iterable

Note that when no --ids-file is supplied, the value of sample_ids is None here:

symlink-seqs/symlink-seqs

Lines 580 to 582 in 718c9b9

sample_ids = None
if args.ids_file is not None:
    sample_ids = parse_ids_file(args.ids_file)

...then passed to the get_fastq_paths() function here:

fastq_paths = get_fastq_paths(config, run_dir, sample_ids)

...then the membership test (`in sample_ids`) fails here because None is not iterable:

selected_samples = filter(lambda x: x['sample_id'] in sample_ids, candidate_samples)
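One possible fix, sketched under the assumption that "no IDs file" should mean "select every sample", is to skip the filter when `sample_ids` is None. The helper name `select_samples` is hypothetical and is not taken from the tool's source.

```python
def select_samples(candidate_samples, sample_ids=None):
    """Return the samples whose 'sample_id' is in sample_ids.

    When sample_ids is None (no --ids-file supplied), every candidate
    sample is returned instead of raising a TypeError.
    """
    if sample_ids is None:
        return list(candidate_samples)
    wanted = set(sample_ids)  # set lookup avoids rescanning a long ID list
    return [s for s in candidate_samples if s["sample_id"] in wanted]
```

Note that an empty list of IDs still selects nothing. Only the explicit None case is treated as "all samples".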

For NextSeq runs, use most recent Analysis subdir

We're currently assuming that the fastq files for NextSeq runs are under Analysis/1/Data/fastq:

https://github.com/BCCDC-PHL/symlink-illumina-fastq-by-project-id/blob/8c4e570483054b9d8239aa9fc9b8e2b66ac42064/symlink_illumina_fastq_by_project_id.py#L158-L159

But in cases where demultiplexing has been repeated, additional Analysis sub-directories are created:

Analysis/1/Data/fastq
Analysis/2/Data/fastq
Analysis/3/Data/fastq
...etc

In those cases, the most recent Analysis sub-directory is the one that we should link to.
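A sketch of how the latest sub-directory might be chosen is shown below. It assumes that the highest-numbered Analysis sub-directory is the most recent one, which this issue implies but does not state outright. The function name `find_latest_analysis_fastq_dir` is hypothetical.

```python
import os


def find_latest_analysis_fastq_dir(run_dir):
    """Return <run_dir>/Analysis/<N>/Data/fastq for the highest N, or None.

    Assumes higher-numbered Analysis sub-directories are more recent.
    """
    analysis_dir = os.path.join(run_dir, "Analysis")
    if not os.path.isdir(analysis_dir):
        return None
    numbered = [
        name for name in os.listdir(analysis_dir)
        if name.isdigit() and os.path.isdir(os.path.join(analysis_dir, name))
    ]
    if not numbered:
        return None
    latest = max(numbered, key=int)  # numeric comparison, so 10 sorts after 9
    return os.path.join(analysis_dir, latest, "Data", "fastq")
```

Comparing the names as integers matters once a run has been demultiplexed ten or more times, since a plain string sort would place "10" before "2".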

Support finding demultiplexing-specific SampleSheet for new MiSeq directory structure

In the new MiSeq directory structure, there are two SampleSheets:

  • At the top-level of the run output directory
  • For a specific Alignment (demultiplexing): Alignment_<N>/<DATE_TIMESTAMP>/SampleSheetUsed.csv

...but we're currently only looking for a SampleSheet named SampleSheet.csv in the top-level of the run dir.

symlink-seqs/symlink-seqs

Lines 462 to 464 in eb38c15

if sequencer_type == 'miseq':
    fastq_subdir = find_miseq_fastq_subdir(run_dir)
    samplesheet_path = os.path.join(run_dir, 'SampleSheet.csv')

We should instead be looking for the SampleSheet for the most recent demultiplexing output dir.
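One way this lookup could work is sketched below. It assumes that "most recent" means the highest Alignment number, with ties broken by the lexically greatest timestamp directory name, and it falls back to the top-level SampleSheet.csv. The function name `find_latest_samplesheet` and the ordering rule are assumptions, not the tool's confirmed behaviour.

```python
import glob
import os


def _alignment_sort_key(samplesheet_path):
    """Order by Alignment number, then by timestamp directory name."""
    timestamp_dir = os.path.dirname(samplesheet_path)
    alignment_dir = os.path.dirname(timestamp_dir)
    suffix = os.path.basename(alignment_dir).split("_", 1)[1]
    number = int(suffix) if suffix.isdigit() else -1
    return (number, os.path.basename(timestamp_dir))


def find_latest_samplesheet(run_dir):
    """Return the demultiplexing-specific SampleSheet if any, else the top-level one."""
    pattern = os.path.join(run_dir, "Alignment_*", "*", "SampleSheetUsed.csv")
    candidates = glob.glob(pattern)
    if candidates:
        return max(candidates, key=_alignment_sort_key)
    fallback = os.path.join(run_dir, "SampleSheet.csv")
    return fallback if os.path.exists(fallback) else None
```

Runs that use the older directory structure have no Alignment_<N> directories, so they fall through to the top-level SampleSheet.csv as before.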

Convert underscores in SampleSheet sample IDs to dashes, to match fastq files

When a sample ID in an Illumina SampleSheet includes underscores (_), they are automatically converted to dashes (-) in the fastq filename. This causes a mismatch when trying to match up SampleSheet entries with fastq files in this tool.

We can compensate for this by automatically converting underscores to dashes in the sample IDs prior to matching.
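The conversion could look something like the sketch below. Both helper names are hypothetical, and the prefix-matching rule (sample ID followed by an underscore) is an assumption about how fastq files are matched to SampleSheet entries.

```python
def normalize_sample_id(sample_id):
    """Mirror the underscore-to-dash conversion applied to fastq filenames."""
    return sample_id.replace("_", "-")


def fastq_matches_sample(fastq_filename, samplesheet_sample_id):
    """True if the fastq filename starts with the normalized sample ID plus '_'."""
    prefix = normalize_sample_id(samplesheet_sample_id) + "_"
    return fastq_filename.startswith(prefix)
```

Requiring the trailing underscore keeps a sample ID from matching files that belong to a longer ID sharing the same prefix.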

Exclude failed runs

When I run symlink-seqs on a sample list in which some samples were on a run that failed and was then repeated, the symlinks generated for those samples currently point to the first (failed) run. Is it possible to exclude failed runs from the symlinking? At the moment the workaround is to identify the samples that were on the failed run, then run symlink-seqs again with a separate sample list, using the -r flag to specify the run on which they were successfully sequenced.

Symlink all samples on a run when a run ID is supplied with no sample IDs file

When a --run-id is supplied with no --ids-file, we get this error:

Traceback (most recent call last):
  File "symlink-seqs", line 618, in <module>
    main(args)
  File "symlink-seqs", line 577, in main
    sample_ids = parse_ids_file(args.ids_file)
  File "symlink-seqs", line 404, in parse_ids_file
    with open(ids_file_path, 'r') as f:
TypeError: expected str, bytes or os.PathLike object, not NoneType

We should support a mode where all samples on a run are symlinked when a run ID is supplied with no sample IDs file.

Support new MiSeq output directory structure

After upgrading to Windows 10, our MiSeq instruments are using a different output directory structure.

This tool should check which directory structure is used for the run before attempting to locate and symlink fastq files. This should be done automatically so the user doesn't need to specify which directory structure is used.

Crash on NextSeq run when `Analysis` dir is not present in top-level of run dir

If any NextSeq run is missing an Analysis directory at the top-level of the run dir, we get an error:

FileNotFoundError: [Errno 2] No such file or directory: '/path/to/run/Analysis'

...and the script terminates. That may cause some symlinks not to be created, if the script terminates before reaching those samples.

We should skip the run if there is no Analysis dir at the top-level for NextSeq runs.
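A minimal sketch of the skip is shown below, assuming the tool iterates over a list of candidate NextSeq run directories. The generator name `nextseq_runs_with_analysis` and the warning text are hypothetical.

```python
import logging
import os


def nextseq_runs_with_analysis(run_dirs):
    """Yield only the run dirs that contain a top-level Analysis directory."""
    for run_dir in run_dirs:
        if os.path.isdir(os.path.join(run_dir, "Analysis")):
            yield run_dir
        else:
            # Log and move on, so that later runs are still processed.
            logging.warning("Skipping %s: no Analysis directory found", run_dir)
```

Logging the skipped run rather than dropping it silently leaves a trace for anyone wondering why a sample was not symlinked.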

Samples not symlinked if project ID is empty in SampleSheet

We often want to symlink all libraries for a specific project, but sometimes we just want to grab all the libraries on a run or just a list of arbitrary libraries regardless of the project. But we're currently excluding any libraries that have an empty project ID:

if 'sample_project' in sample and sample['sample_project'] != "":

if 'project_name' in sample and sample['project_name'] != "":

if 'project_id' in sample and sample['project_id'] != "":

We shouldn't exclude those libraries.
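One way to express this is sketched below: filter on project only when a project ID was actually requested. The function names are hypothetical. The three key names are the ones checked in the snippets above.

```python
PROJECT_KEYS = ("sample_project", "project_name", "project_id")


def get_project(sample):
    """Return the first non-empty project field of a sample, or ''."""
    for key in PROJECT_KEYS:
        value = sample.get(key)
        if value:
            return value
    return ""


def filter_by_project(samples, project_id=None):
    """Keep every sample when no project is requested, including those
    with an empty project field; otherwise keep exact matches only."""
    if project_id is None:
        return list(samples)
    return [s for s in samples if get_project(s) == project_id]
```

With this arrangement, an empty project field only excludes a library when the user has asked for a specific project.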

`IndexError: list index out of range` in `find_miseq_fastq_subdir()`

Ran into this error recently while attempting to create symlinks:

Traceback (most recent call last):
  File "symlink-seqs", line 648, in <module>
    main(args)
  File "symlink-seqs", line 622, in main
    fastq_paths = get_fastq_paths(config, run_dir, sample_ids, args.project_id)
  File "symlink-seqs", line 459, in get_fastq_paths
    fastq_subdir = find_miseq_fastq_subdir(run_dir)
  File "symlink-seqs", line 150, in find_miseq_fastq_subdir
    fastq_dir = fastq_dirs[-1]
IndexError: list index out of range
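The IndexError means `fastq_dirs` was empty when it was indexed with `[-1]`. A defensive sketch is shown below. The helper name `pick_latest_dir` is hypothetical, and returning None so that the caller can skip the run is one possible design choice, not the tool's confirmed behaviour.

```python
def pick_latest_dir(fastq_dirs):
    """Return the last directory in sorted order, or None if there are none."""
    if not fastq_dirs:
        return None  # caller can skip the run instead of crashing
    return sorted(fastq_dirs)[-1]
```

The caller would then check for None before building fastq paths, much like the handling of a missing directory elsewhere in the tool.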

Calling with no parameters will symlink every FASTQ

When running symlink-seqs with no additional command line arguments, the tool will begin symlinking every FASTQ on the system to the current working directory instead of displaying a help message. Not a major issue, but can be troublesome for newer users (and experienced users who forget).

I believe this stems from having default values for both the --config and --outdir arguments. I think the solution is to remove the default='.' from the --outdir argument and add required=True.

Old:

parser.add_argument('-o', '--outdir', default='.', help="Output directory, where symlinks (or copies) will be created.")

New:

parser.add_argument('-o', '--outdir', required=True, help="Output directory, where symlinks (or copies) will be created.")

Add csv mode

Add a mode where instead of creating symlinks, a csv will be written to stdout with the fields:

ID
R1
R2

These csv files are compatible with several of our pipelines that take input via a --samplesheet_input mode.
