
flowio's Introduction

FlowIO


Overview

FlowIO is a Python library for reading and writing Flow Cytometry Standard (FCS) files. It has zero external dependencies and is compatible with Python 3.7+.

FlowIO retrieves event data exactly as it is encoded in the FCS file: as a 1-dimensional list without separating the events into channels or performing any preprocessing (e.g. applying gain). Metadata stored in the FCS file is available as a dictionary via the 'text' attribute. Basic attributes are also available for commonly accessed properties. For example, the channel count can be used to easily convert the event data to a multi-column NumPy array:

import flowio
import numpy

fcs_data = flowio.FlowData('example.fcs')
npy_data = numpy.reshape(fcs_data.events, (-1, fcs_data.channel_count))

For higher level interaction with flow cytometry data, including GatingML and FlowJo 10 support, see the related FlowKit project.

Installation

The recommended way to install FlowIO is via the pip command:

pip install flowio

Or, if you prefer, you can install from the GitHub source:

git clone https://github.com/whitews/flowio
cd flowio
pip install .

Documentation

The FlowIO API documentation is available on ReadTheDocs. If you have any questions about FlowIO or find any bugs, please submit an issue to the GitHub repository.

Changelogs

Changelogs for each version are available in the GitHub repository's releases.

flowio's People

Contributors

andreas-wilm, jcgkitten, lgrozinger, louis-pujol, matt-faria, tristan-ranff, whitews, zbjornson

flowio's Issues

Removing a channel

Hi

I am looking to remove certain channels from an .fcs file. I have been reading through the code, and it is not entirely clear to me how I would proceed to do that using your library. Is it possible?

Thanks in advance!
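One possible approach, sketched below under a few assumptions: reshape the raw events with NumPy, drop the unwanted column, then write a new file with flowio.create_fcs. The create_fcs argument order has varied between FlowIO versions and the channel names here are placeholders, so treat this as illustrative rather than the library's official recipe.

import flowio
import numpy

fcs = flowio.FlowData('example.fcs')
events = numpy.reshape(fcs.events, (-1, fcs.channel_count))

# Hypothetical: drop the third channel (zero-based column index 2)
drop_index = 2
keep = [i for i in range(fcs.channel_count) if i != drop_index]
trimmed = events[:, keep]

# Labels for the remaining channels; in real code these would come from
# the file's metadata (the $PnN keywords in fcs.text).
channel_names = ['FSC-A', 'SSC-A', 'FL1-A']  # placeholder names

with open('example_trimmed.fcs', 'wb') as fh:
    # create_fcs exists in FlowIO, but check the API docs of your installed
    # version for the exact signature before relying on this call.
    flowio.create_fcs(fh, trimmed.flatten(), channel_names)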

Better handling of data section byte offset discrepancy

The FlowData class uses the HEADER values to locate the TEXT section. The TEXT segment is then parsed, and the values in the TEXT metadata are used to locate the DATA section. However, there are reports of rare FCS files where the HEADER and TEXT data offset locations differ. FlowIO should raise a custom exception when there is a discrepancy in the data section byte offset location. Note, however, that for large files (>99,999,999 bytes) a different HEADER value is expected, because the HEADER offset fields are limited to 8 bytes.

Proposed solution:

By default, FlowIO will use the TEXT data offsets and load a file normally when the HEADER & TEXT agree or if the HEADER value is 0 for large files.

If the two values disagree and the discrepancy is not due to the large-file scenario, FlowIO will raise an exception.

For the case where the user wants to ignore a discrepancy and use the TEXT value, they can force loading of the file with a new ignore_offset_discrepancy option (which defaults to False).

For the case where the user wants to ignore a discrepancy and use the HEADER value, they can force loading of the file with a new use_header_offsets option (which also defaults to False).

Setting both ignore_offset_discrepancy & use_header_offsets to True will be equivalent to setting only use_header_offsets to True.
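A minimal usage sketch of the options proposed above. The option names are taken from this issue; verify that they exist in your installed FlowIO version before relying on them.

import flowio

# Load a file whose HEADER and TEXT data offsets disagree.
# 'offset_mismatch.fcs' is a hypothetical file name.
fcs = flowio.FlowData(
    'offset_mismatch.fcs',
    ignore_offset_discrepancy=True,  # trust the TEXT offsets despite the mismatch
)

# Or trust the HEADER offsets instead:
fcs = flowio.FlowData('offset_mismatch.fcs', use_header_offsets=True)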

Getting event data as an Int but expecting a float?

Hi there,

I'm sure this is going to be trivial, but I just can't seem to get my head around it:

Opening up a standard FCS2.0 file in FlowIO results in values which appear as integers (312, 412 etc).

Opening the same file in FlowKit seems to result in the values being imported as floats with reasonable precision which is what I would expect. I lose quite a bit of information when importing with FlowIO vs FlowKit because of this.

As I said, I'm sure this is something trivial, like a missing transform argument, but I was hoping I could be pointed in the right direction. Thanks!

Open files containing multiple datasets

Hey,

The current release doesn't allow users to choose which dataset they want to load if a file contains multiple datasets, or am I missing something?

It seems like all the functions are already prepared to read a selected dataset, but there is no check for the nextdata keyword to recognize whether more than one dataset is available.
I could try to open a pull request for this, but wanted to check in with you first.

Best wishes
Max
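A minimal sketch of how one could detect additional data sets by checking the $NEXTDATA keyword in the parsed TEXT segment. How FlowData.text normalizes keyword names (prefix and case) is an assumption here, hence the defensive lookup.

import flowio

fcs = flowio.FlowData('multi_dataset.fcs')  # hypothetical file name

# $NEXTDATA holds the byte offset of the next data set; 0 means there is
# only one. Normalize the keys in case the '$' prefix or casing differs.
text = {k.lstrip('$').lower(): v for k, v in fcs.text.items()}
next_offset = int(text.get('nextdata', 0))

if next_offset != 0:
    print('This file contains more than one data set.')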

Clarify parsing of examples/fcs_files/100715.fcs

In R with FlowCore, the file 100715.fcs has sensible values if using the truncate_max_range option:

library(flowCore)
ff <- read.FCS("100715.fcs", truncate_max_range = FALSE)
print(summary(ff))
            FSC-A     FSC-H        SSC-A       B515-A       R780-A       R710-A
Min.     23406.00  27008.50    -8.014621    -67.28254    -67.11903    -44.55855
1st Qu.  34158.00  33993.69   168.284748   1988.90176    528.52808   1127.26749
Median   41644.25  40842.88   219.902458   2851.75439    898.31769   1732.62372
Mean     44672.32  43189.90   331.918749   3203.20944   1253.38825   2342.41606
3rd Qu.  50878.94  49323.50   278.990265   3344.79010   1515.64224   2637.69879
Max.    262143.50 256543.75 46248.464844 261572.65625 261566.53125 261455.73438
             R660-A      V800-A       V655-A      V585-A       V450-A
Min.       -79.8198   -110.4093    -66.27671   -110.4727    -28.87656
1st Qu.    738.0033   1302.3726   1136.13980   2360.6066   1785.75531
Median    1100.6382   1879.1382   1725.22186   3601.5178   2450.64038
Mean      1507.4539   2668.0656   2587.16628   4795.2020   2889.03029
3rd Qu.   1521.9344   2623.6032   2211.24213   4681.3687   3266.69647
Max.    261584.7656 261410.8906 261585.40625 261586.5312 253481.90625
             G780-A       G710-A       G660-A       G610-A       G560-A
Min.      -110.5278    -89.50339    -51.95771    -61.93935    -33.25663
1st Qu.   2011.9602   1462.61804   1812.38644   1411.36789   1940.51962
Median    3073.3077   2159.56616   2705.18494   2117.23547   2812.36877
Mean      4140.8475   2629.49203   3336.95553   2536.82272   3581.05912
3rd Qu.   4898.1101   2993.53485   3459.33716   2751.39990   4092.05743
Max.    261539.9219 261563.79688 261581.21875 261537.01562 261576.21875
Warning message:
No '$PnE' keyword available for the following channels: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
Using '0,0' as default.

With FlowIO (and also fcsparser), the raw values are way more variable, missing the correction of truncate_max_range:

import pandas as pd
import numpy as np
import flowio

fcs_data = flowio.FlowData('100715.fcs')
npy_data = np.reshape(fcs_data.events, (-1, fcs_data.channel_count))
df_describe = pd.DataFrame(npy_data)
df_describe.describe()
                 0             1             2             3             4   \
count  6.498900e+04  6.498500e+04  6.498000e+04  6.497800e+04  6.498300e+04   
mean            inf           inf           inf           inf           inf   
std             inf           inf           inf           inf           inf   
min   -1.186825e+29 -1.186825e+29 -1.186825e+29 -1.186825e+29 -1.186825e+29   
25%   -1.483879e+00 -1.483879e+00 -1.483879e+00 -1.483879e+00 -1.483879e+00   
50%   -4.832434e-02 -4.832434e-02 -4.832434e-02 -4.820263e-02 -4.832434e-02   
75%    1.710476e+02  1.722441e+02  1.485543e+02  1.969365e+02  1.921069e+02   
max    3.393232e+38  3.286675e+38  3.379840e+38  3.339942e+38  3.339699e+38   

                 5             6             7             8             9   \
count  6.498400e+04  6.498500e+04  6.498600e+04  6.498100e+04  6.498800e+04   
mean            inf           inf           inf           inf           inf   
std             inf           inf           inf           inf           inf   
min   -1.186825e+29 -1.186825e+29 -1.186825e+29 -1.186825e+29 -1.186825e+29   
25%   -1.483879e+00 -1.483879e+00 -1.483879e+00 -1.483879e+00 -1.483879e+00   
50%   -4.832434e-02 -4.832434e-02 -4.820570e-02 -4.832434e-02 -4.821291e-02   
75%    1.402682e+02  1.932709e+02  2.060314e+02  1.809365e+02  1.839689e+02   
max    3.335518e+38  3.379818e+38  3.401980e+38  3.379732e+38  3.393125e+38   

                 10            11            12            13            14  \
count  6.497800e+04  6.498600e+04  6.498200e+04  6.497900e+04  6.498600e+04   
mean            inf           inf           inf           inf           inf   
std             inf           inf           inf           inf           inf   
min   -1.186825e+29 -1.186825e+29 -1.186825e+29 -1.186825e+29 -1.186825e+29   
25%   -1.483879e+00 -1.483879e+00 -1.483879e+00 -1.483879e+00 -1.483879e+00   
50%   -4.832434e-02 -4.819864e-02 -4.832434e-02 -4.832434e-02 -4.820315e-02   
75%    1.917931e+02  2.158215e+02  1.689365e+02  2.039365e+02  2.217711e+02   
max    3.379818e+38  3.180088e+38  3.326649e+38  3.393111e+38  3.261503e+38   

                 15  
count  6.498800e+04  
mean            inf  
std             inf  
min   -1.186825e+29  
25%   -1.483879e+00  
50%   -4.817861e-02  
75%    2.440290e+02  
max    3.299823e+38  

It seems to be some mismatch between 64- and 32-bit integer types, which should raise a warning. Is there a parsing option similar to truncate_max_range in Python?
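FlowIO does not expose a truncate_max_range-style option in this issue's context, but as a rough post-processing sketch one could clip each column to the channel range declared by the $PnR keywords, loosely analogous to flowCore's truncate_max_range. The TEXT key lookup below is an assumption about how the keywords are stored.

import numpy as np

# Assumes npy_data and fcs_data come from the snippet above.
text = {k.lstrip('$').lower(): v for k, v in fcs_data.text.items()}
for i in range(fcs_data.channel_count):
    # $PnR is the declared range for channel n; fall back to no clipping.
    pnr = float(text.get('p%dr' % (i + 1), np.inf))
    npy_data[:, i] = np.clip(npy_data[:, i], None, pnr)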

Support for FCS3.2

Hey @whitews,
there is the new FCS 3.2 standard, which has some new keywords, especially
PnDATATYPE. This allows single columns to have a different datatype than the one set in the DATATYPE keyword. See 3.3.41 on page 41. Do you already have plans to support that? It might be a problem because, as I understand it, the values in the array used must all be of the same type.

I will also think about a solution but wanted to check in with you first.

Best wishes
Max
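This is not FlowIO's current behaviour, just a sketch of one way mixed per-column datatypes could be represented in Python: a NumPy structured array instead of a single flat, homogeneous list. The channel names and dtypes below are hypothetical.

import numpy as np

# Hypothetical FCS 3.2 layout: two float channels plus an integer Time
# channel, as could be declared via per-parameter PnDATATYPE keywords.
events = np.zeros(
    4,
    dtype=[('FSC-A', np.float32), ('SSC-A', np.float32), ('Time', np.uint32)],
)

events['FSC-A'] = [1.5, 2.25, 3.0, 4.75]
events['Time'] = [10, 20, 30, 40]
print(events.dtype)  # each column keeps its own datatype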

Issue loading fcs files

I'm just starting to look into implementing a Python-based interpretation of FCS files using FlowKit. However, I'm having trouble right at the beginning, with FlowIO unable to load the FCS files I'm working with.
I'm getting the warning:

UserWarning: text in segment does not start and end with delimiter

and later:

error                                     Traceback (most recent call last)
<ipython-input-5-bca60f0715fe> in <module>
----> 1 fd = flowio.FlowData('G11.fcs')

~/anaconda2/envs/analyzefacsmore/lib/python3.6/site-packages/flowio/flowdata.py in __init__(self, filename)
     81             d_start,
     82             d_stop,
---> 83             self.text)
     84
     85         try:

~/anaconda2/envs/analyzefacsmore/lib/python3.6/site-packages/flowio/flowdata.py in __parse_data(self, offset, start, stop, text)
    192                 stop,
    193                 data_type.lower(),
--> 194                 order)
    195         else:  # ascii
    196             data = self.__parse_ascii_data(

~/anaconda2/envs/analyzefacsmore/lib/python3.6/site-packages/flowio/flowdata.py in __parse_float_data(self, offset, start, stop, data_type, order)
    256
    257         tmp = unpack('%s%d%s' % (order, num_items, data_type),
--> 258                      self.__read_bytes(offset, start, stop))
    259         return tmp
    260

error: unpack requires a buffer of 277676 bytes

I'm consistently getting this error for all the FCS files we are producing (from an Attune; I'm not sure of the software version). The FlowIO load works great for an example FCS file from the FlowKit examples.

I uploaded one file as an example.
issue_file.zip

FlowData.write_fcs attempts to write strings to file opened in binary mode

Issue:

FlowData.write_fcs will fail in Python3

Reproduce:

Running the following under Python3.X:

    data = flowio.FlowData(fcsfile)
    data.write_fcs(fcsfile, extra=annotations)

will result in:

  File "/home/campus.ncl.ac.uk/b8051106/.local/lib/python3.8/site-packages/flowio/flowdata.py", line 396, in write_fcs
    fh.write('FCS3.1')
TypeError: a bytes-like object is required, not 'str'
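The underlying problem and likely fix can be reproduced in isolation: writing a str to a file opened in binary mode fails under Python 3, while writing encoded bytes works. A minimal sketch independent of FlowIO:

# Writing a str to a binary-mode file raises a TypeError in Python 3.
with open('demo.fcs', 'wb') as fh:
    try:
        fh.write('FCS3.1')  # TypeError: a bytes-like object is required, not 'str'
    except TypeError as err:
        print(err)
    fh.write('FCS3.1'.encode('utf-8'))  # encoding the string first works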

Missing license

Hi,
Would it be possible to add a license file so we can use your package?

Cheers,
Andrea

Question about flow cytometry data

I have some flow cytometry data in LMD format. Is it possible to work with it in Python using FlowIO, or should I look for another library?
Thanks for helping me!

Parsing an FCS file with variable int sizes raises an exception (and it could parse much faster)

When parsing an FCS file that contains int data with variable lengths, this line raises an exception in Python 3:

unused_bit_widths[i])

because a map object is not subscriptable in Python 3. Possibly a holdover from converting from Python 2 to 3?

OS: windows, but should happen on everything.

This is pretty easy to fix; but a secondary issue is that parsing each value individually is very slow (multiple minutes for a decent sized file). I've got a fix for both (which also simplifies that part of the code); will make a pull request!
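The Python 2 vs. Python 3 difference behind the error, shown in isolation (this is just the pattern, not FlowIO's actual code):

# In Python 2, map() returned a list; in Python 3 it returns a lazy
# iterator, which cannot be indexed.
unused_bit_widths = map(int, ['16', '32', '16'])
try:
    unused_bit_widths[0]
except TypeError as err:
    print(err)  # 'map' object is not subscriptable

# Wrapping the map in list() restores indexing.
unused_bit_widths = list(map(int, ['16', '32', '16']))
print(unused_bit_widths[1])  # 32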

Incorrect estimation of data_stop

I am using FlowIO to parse an FCS 3.0 file generated by an iQue. When I parse the file, no event data is returned even though there are 72374 events across 32 channels. Looking at the code, I believe this is because in the file's metadata end_data = begin_data, which makes the estimation of data_start and data_stop in __calc_data_item_count incorrect. If I instead set data_stop equal to data_start + event_count * num_channels * 4 - 1, then I am able to read out the event data correctly. FCSParser also produces the expected behavior on this file.

Is this an edge case, and if so, is there some way to account for it via a parameter (i.e. estimate data_stop from the event count), or am I not instantiating the object correctly?
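The workaround described above, written out as a small calculation. The factor of 4 assumes 4-byte (32-bit) values, and the data_start offset is hypothetical; the names mirror the issue rather than FlowIO's internals.

# Reconstruct the data segment end offset when the TEXT metadata reports
# end_data == begin_data. Assumes 4 bytes per stored value.
event_count = 72374
num_channels = 32
bytes_per_value = 4
data_start = 2048  # hypothetical begin-data byte offset

data_stop = data_start + event_count * num_channels * bytes_per_value - 1
print(data_stop)  # 9265919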

write_fcs mixes up PNN and PNS labels

When using write_fcs to save a FlowData object to file, the PnN and PnS labels are mixed up if there are more than 9 channels.

This is because the channel dictionary is sorted in lexicographic order in this method.
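The root cause in isolation: sorting channel numbers as strings places '10' and '11' before '2', so labels get paired with the wrong channels once there are more than 9. A minimal sketch of the pitfall and the numeric-sort fix (not FlowIO's actual code):

channel_numbers = [str(i) for i in range(1, 12)]

# Lexicographic (string) sort: '10' and '11' sort before '2'.
print(sorted(channel_numbers))
# ['1', '10', '11', '2', '3', '4', '5', '6', '7', '8', '9']

# Sorting numerically keeps channels in acquisition order.
print(sorted(channel_numbers, key=int))
# ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11']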

Missing set zeroes if data offsets are greater than header max data size

Hi Scott,

I just came across a FlowKit-created file that couldn't be read by FlowCore because checkOffset failed (it could be read with FlowKit, though). After some debugging, we found that the data segment offset was larger than 99,999,999. In that case, the HEADER start and end offsets should be set to zero according to the standard (see below). This limit is not checked in create_fcs.py, but the code contains the following comment:

# TODO: set zeroes if data offsets are greater than header max data size

It's an easy fix. I will try to issue a PR over the coming days.

From https://www.genepattern.org/attachments/fcs_3_1_standard.pdf:

    FCS 3.1 maintains support introduced in FCS 3.0 for data sets larger than 99,999,999 bytes.
    When any portion of a segment falls outside the 99,999,999 byte limit, '0's are substituted in the
    HEADER for that segment's begin and end byte offset. The byte offsets for begin DATA, end
    DATA, begin ANALYSIS, end ANALYSIS (begin and end supplemental TEXT if appropriate) will
    then only be found as keyword-value pairs in the primary TEXT segment. Note, when a segment
    is contained completely within the first 99,999,999 bytes of a data set, the byte offsets for that
    segment will be duplicated in the TEXT segment as keyword values. Note also, if the ANALYSIS
    offsets in the HEADER are zero, the $BEGINANALYSIS and $ENDANALYSIS keywords must be
    checked to determine if an ANALYSIS segment is present.
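A sketch of the rule quoted above, as one might apply it when formatting the HEADER offset fields (illustrative only, not the actual create_fcs.py code):

def header_offset_fields(begin, end, max_header_offset=99999999):
    # Per FCS 3.0/3.1, if any portion of the segment falls beyond byte
    # 99,999,999, both HEADER fields are written as '0' and the real
    # offsets appear only as keywords in the TEXT segment.
    if end > max_header_offset:
        begin, end = 0, 0
    # HEADER offsets are fixed-width, 8-character, right-justified ASCII.
    return str(begin).rjust(8), str(end).rjust(8)

print(header_offset_fields(256, 1000000))    # ('     256', ' 1000000')
print(header_offset_fields(256, 150000000))  # ('       0', '       0')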
