
Comments (3)

proccaserra commented on June 21, 2024

Hi @ptth222 thanks for the detailed report.

The behaviour you observed comes from the fact that, in the ISA-Tab format, the s_ (study) file can only have one Source Name node (the graph should start with a Source node) and one Sample Name node.
However, multiple Sample Name nodes are allowed in the a_ (assay) files, to allow for aliquoting and fractions of a sample.

Can you open a PR so we can review and acknowledge your contribution to the code base?
Many thanks, and hi to Hunter.

from isa-api.

ptth222 commented on June 21, 2024

I assume for the PR you are talking about issue #501. I have created a PR for that issue.

For this issue I am not convinced that the study file can only have 1 Sample Name node and assay files can have multiple. If anything it actually seems like the opposite. At the very least, the documentation and code are at odds with what you have said and these things should be reconciled.

First, I will reiterate what the documentation says:

The last sentence here https://isa-specs.readthedocs.io/en/latest/isatab.html#study-table-file suggests to me that it should be possible:
"Node properties, such as Characteristics (for Material nodes), Parameter Value (for Process nodes) and additional Name columns for special cases of Process node to disambiguate Protocol REF entries of MUST follow the named node of context."

This is specifically saying there can be additional Name columns in the study file.

Secondly, as I said previously, you can generate a study file with multiple Sample Name columns using the converter from JSON to Tab. If you look at a part of the code in the write_study_table_files function you can see that the code specifically counts Sample Name nodes.

        sample_in_path_count = 0
        protocol_in_path_count = 0
        longest_path = _longest_path_and_attrs(paths, s_graph.indexes)
        
        for node_index in longest_path:
            node = s_graph.indexes[node_index]
            if isinstance(node, Source):
                olabel = "Source Name"
                columns.append(olabel)
                columns += flatten(
                    map(lambda x: get_characteristic_columns(olabel, x),
                        node.characteristics))
                columns += flatten(
                    map(lambda x: get_comment_column(
                        olabel, x), node.comments))
            elif isinstance(node, Process):
                olabel = "Protocol REF.{}".format(protocol_in_path_count)
                columns.append(olabel)
                protocol_in_path_count += 1
                if node.executes_protocol.name not in protnames.keys():
                    protnames[node.executes_protocol.name] = protrefcount
                    protrefcount += 1
                columns += flatten(map(lambda x: get_pv_columns(olabel, x),
                                       node.parameter_values))
                if node.date is not None:
                    columns.append(olabel + ".Date")
                if node.performer is not None:
                    columns.append(olabel + ".Performer")
                columns += flatten(
                    map(lambda x: get_comment_column(
                        olabel, x), node.comments))

            elif isinstance(node, Sample):
                olabel = "Sample Name.{}".format(sample_in_path_count)
                columns.append(olabel)
                sample_in_path_count += 1
                columns += flatten(
                    map(lambda x: get_characteristic_columns(olabel, x),
                        node.characteristics))
                columns += flatten(
                    map(lambda x: get_comment_column(
                        olabel, x), node.comments))
                columns += flatten(map(lambda x: get_fv_columns(olabel, x),
                                       node.factor_values))

The write_assay_table_files function, however, does not count Sample Name nodes. You can see that at one point it did, but that code has been commented out:

    for study_obj in inv_obj.studies:
        for assay_obj in study_obj.assays:
            a_graph = assay_obj.graph
            if a_graph is None:
                break
            protrefcount = 0
            protnames = dict()

            def flatten(current_list):
                return [item for sublist in current_list for item in sublist]

            columns = []

            # start_nodes, end_nodes = _get_start_end_nodes(a_graph)
            paths = _all_end_to_end_paths(
                a_graph, [x for x in a_graph.nodes()
                          if isinstance(a_graph.indexes[x], Sample)])
            if len(paths) == 0:
                log.info("No paths found, skipping writing assay file")
                continue
            if _longest_path_and_attrs(paths, a_graph.indexes) is None:
                raise IOError(
                    "Could not find any valid end-to-end paths in assay graph")
            for node_index in _longest_path_and_attrs(paths, a_graph.indexes):
                node = a_graph.indexes[node_index]
                if isinstance(node, Sample):
                    olabel = "Sample Name"
                    # olabel = "Sample Name.{}".format(sample_in_path_count)
                    # sample_in_path_count += 1
                    columns.append(olabel)
                    columns += flatten(
                        map(lambda x: get_comment_column(olabel, x),
                            node.comments))
                    if write_factor_values:
                        columns += flatten(
                            map(lambda x: get_fv_columns(olabel, x),
                                node.factor_values))

                elif isinstance(node, Process):
                    olabel = "Protocol REF.{}".format(
                        node.executes_protocol.name)
                    columns.append(olabel)
                    if node.executes_protocol.name not in protnames.keys():
                        protnames[node.executes_protocol.name] = protrefcount
                        protrefcount += 1
                    if node.date is not None:
                        columns.append(olabel + ".Date")
                    if node.performer is not None:
                        columns.append(olabel + ".Performer")
                    columns += flatten(map(lambda x: get_pv_columns(olabel, x),
                                           node.parameter_values))
                    if node.executes_protocol.protocol_type:
                        oname_label = get_column_header(
                            node.executes_protocol.protocol_type.term,
                            protocol_types_dict
                        )
                        if oname_label is not None:
                            columns.append(oname_label)
                        elif node.executes_protocol.protocol_type.term.lower() \
                                in protocol_types_dict["nucleic acid hybridization"][SYNONYMS]:
                            columns.extend(
                                ["Hybridization Assay Name",
                                 "Array Design REF"])
                    columns += flatten(
                        map(lambda x: get_comment_column(olabel, x),
                            node.comments))
                    for output in [x for x in node.outputs if
                                   isinstance(x, DataFile)]:
                        columns.append(output.label)
                        columns += flatten(
                            map(lambda x: get_comment_column(output.label, x),
                                output.comments))

                elif isinstance(node, Material):
                    olabel = node.type
                    columns.append(olabel)
                    columns += flatten(
                        map(lambda x: get_characteristic_columns(olabel, x),
                            node.characteristics))
                    columns += flatten(
                        map(lambda x: get_comment_column(olabel, x),
                            node.comments))

                elif isinstance(node, DataFile):
                    pass  # handled in process
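
To make the contrast between the two excerpts concrete, here is a minimal sketch of the two labelling strategies in isolation. It is not the isatools code: the `Sample` and `Process` classes, both function names, and the example path are hypothetical stand-ins, and the Protocol REF label is simplified to a fixed string.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """Hypothetical stand-in for a sample node."""
    name: str


@dataclass
class Process:
    """Hypothetical stand-in for a process node."""
    name: str


def study_style_labels(path):
    """Number every Sample column, as the study excerpt does."""
    labels = []
    sample_in_path_count = 0
    for node in path:
        if isinstance(node, Sample):
            labels.append("Sample Name.{}".format(sample_in_path_count))
            sample_in_path_count += 1
        else:
            labels.append("Protocol REF")  # simplified label
    return labels


def assay_style_labels(path):
    """Give every Sample column the same label, as the assay excerpt does."""
    labels = []
    for node in path:
        if isinstance(node, Sample):
            labels.append("Sample Name")
        else:
            labels.append("Protocol REF")  # simplified label
    return labels


# A path with two samples, e.g. a sample and an aliquot of it.
path = [Sample("s1"), Process("aliquoting"), Sample("s1.a")]

print(study_style_labels(path))  # ['Sample Name.0', 'Protocol REF', 'Sample Name.1']
print(assay_style_labels(path))  # ['Sample Name', 'Protocol REF', 'Sample Name']
```

With numbered labels the two sample columns stay distinguishable; with a single shared label any later step that keys on the label alone cannot tell them apart.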

I also modified an example to have multiple Sample Name columns in an assay file, and it does not get converted to JSON correctly. No error is raised, but the samples from the second Sample Name column do not appear anywhere in the JSON.

It should also be noted that having multiple Sample Name columns in the study or assay files does not produce any validation error or warning.
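
If such a warning were wanted, one possible shape for the check is sketched below. This is only an illustration using the Python standard library: `count_sample_name_columns` and `warn_on_multiple_sample_names` are hypothetical helpers, not part of the isatools validator, and the sketch assumes the table is tab-delimited with the header on the first line.

```python
import csv
import io


def count_sample_name_columns(tab_text):
    """Count header cells equal to 'Sample Name' in a tab-delimited table."""
    reader = csv.reader(io.StringIO(tab_text), delimiter="\t")
    header = next(reader, [])
    return sum(1 for col in header if col.strip() == "Sample Name")


def warn_on_multiple_sample_names(tab_text, filename="(unknown)"):
    """Return a warning string if more than one Sample Name column exists."""
    count = count_sample_name_columns(tab_text)
    if count > 1:
        return "{}: found {} 'Sample Name' columns".format(filename, count)
    return None


# Hypothetical table content with two Sample Name columns.
example = (
    "Source Name\tProtocol REF\tSample Name\tProtocol REF\tSample Name\n"
    "src1\tcollection\ts1\taliquoting\ts1.a\n"
)
print(warn_on_multiple_sample_names(example, "s_example.txt"))
```

Whether this should be a warning in study files only, in both file types, or not at all depends on which reading of the specification is the intended one.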

I hope I have demonstrated how both the documentation and code contradict what you said about Sample Name columns. If you are confident about how multiple Sample Names are supposed to work, then the documentation and code, both validation and conversion, should be changed to reflect that. If what I have shown is correct, however, then the ProcessSequenceFactory code needs to be changed to look for more than 1 Sample Name column. It may need to be changed regardless because it finds 1 set of samples from the study file and uses that as ground truth for the assays, so if there are new samples in an assay (due to having multiple Sample Name columns) they won't be found. Either way, the code and behavior do not agree with what you have said and it needs to be reconciled. I don't mind trying to make the code changes myself, but I do need to be sure about how things are supposed to function.
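
The concern about samples that exist only in an assay file can be illustrated with a small standalone sketch. It does not use ProcessSequenceFactory or any isatools API: `sample_names` and `assay_only_samples` are hypothetical helpers, and the two tables are made-up tab-delimited content.

```python
import csv
import io


def sample_names(tab_text):
    """Collect the values of every 'Sample Name' column in a tab-delimited table."""
    rows = list(csv.reader(io.StringIO(tab_text), delimiter="\t"))
    if not rows:
        return set()
    header, body = rows[0], rows[1:]
    indexes = [i for i, col in enumerate(header) if col.strip() == "Sample Name"]
    return {row[i] for row in body for i in indexes if i < len(row) and row[i]}


def assay_only_samples(study_text, assay_text):
    """Return sample names that appear in the assay table but not in the study table."""
    return sample_names(assay_text) - sample_names(study_text)


# Hypothetical study table with one sample, and an assay table that
# introduces an aliquot through a second Sample Name column.
study = "Source Name\tSample Name\nsrc1\ts1\n"
assay = "Sample Name\tProtocol REF\tSample Name\ns1\taliquoting\ts1.a\n"

print(assay_only_samples(study, assay))  # {'s1.a'}
```

Any name returned here is a sample that a parser relying solely on the study file's sample set would have no record of.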


ptth222 commented on June 21, 2024

I worked on another project for a bit, but now I am moving back to the one that involves this package. I really need this to be resolved so that I can move forward. Please consider this a gentle reminder. If a meeting would be better, then I would be happy to meet.
