I am trying to figure out how to create valid ISA-Tab/ISA-JSON with a more complex sam

Are Multiple "<entity> Name" Columns Allowed? about isa-api HOT 3 OPEN

ptth222 commented on June 21, 2024

Are Multiple "<entity> Name" Columns Allowed?

from isa-api.

Comments (3)

proccaserra commented on June 21, 2024

Hi @ptth222 thanks for the detailed report.

The behaviour you observed is down to the fact that, in the ISA-Tab format, the s_ (study) file can only have one Source Name node (the graph should start with a Source node) and one Sample Name node.
However, multiple Sample Name nodes are allowed in the a_ (assay) files to allow for aliquoting and fractions of a sample.

Can you open a PR so we can review and acknowledge your contribution to the code base?
Many thanks, and hi to Hunter.

ptth222 commented on June 21, 2024

I assume for the PR you are talking about issue #501. I have created a PR for that issue.

For this issue I am not convinced that the study file can only have 1 Sample Name node and assay files can have multiple. If anything it actually seems like the opposite. At the very least, the documentation and code are at odds with what you have said and these things should be reconciled.

First, I will reiterate what the documentation says:

The last sentence here https://isa-specs.readthedocs.io/en/latest/isatab.html#study-table-file suggests to me that it should be possible:
"Node properties, such as Characteristics (for Material nodes), Parameter Value (for Process nodes) and additional Name columns for special cases of Process node to disambiguate Protocol REF entries of MUST follow the named node of context."

This is specifically saying there can be additional Name columns in the study file.

Secondly, as I said previously, you can generate a study file with multiple Sample Name columns using the converter from JSON to Tab. If you look at a part of the code in the write_study_table_files function you can see that the code specifically counts Sample Name nodes.

        sample_in_path_count = 0
        protocol_in_path_count = 0
        longest_path = _longest_path_and_attrs(paths, s_graph.indexes)
        
        for node_index in longest_path:
            node = s_graph.indexes[node_index]
            if isinstance(node, Source):
                olabel = "Source Name"
                columns.append(olabel)
                columns += flatten(
                    map(lambda x: get_characteristic_columns(olabel, x),
                        node.characteristics))
                columns += flatten(
                    map(lambda x: get_comment_column(
                        olabel, x), node.comments))
            elif isinstance(node, Process):
                olabel = "Protocol REF.{}".format(protocol_in_path_count)
                columns.append(olabel)
                protocol_in_path_count += 1
                if node.executes_protocol.name not in protnames.keys():
                    protnames[node.executes_protocol.name] = protrefcount
                    protrefcount += 1
                columns += flatten(map(lambda x: get_pv_columns(olabel, x),
                                       node.parameter_values))
                if node.date is not None:
                    columns.append(olabel + ".Date")
                if node.performer is not None:
                    columns.append(olabel + ".Performer")
                columns += flatten(
                    map(lambda x: get_comment_column(
                        olabel, x), node.comments))

            elif isinstance(node, Sample):
                olabel = "Sample Name.{}".format(sample_in_path_count)
                columns.append(olabel)
                sample_in_path_count += 1
                columns += flatten(
                    map(lambda x: get_characteristic_columns(olabel, x),
                        node.characteristics))
                columns += flatten(
                    map(lambda x: get_comment_column(
                        olabel, x), node.comments))
                columns += flatten(map(lambda x: get_fv_columns(olabel, x),
                                       node.factor_values))
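
To make the counting behaviour in the excerpt above easier to see in isolation, here is a minimal sketch. It is not the isa-api code: the `Source` and `Sample` classes are hypothetical stand-ins and `study_style_labels` is a name I made up. It only mirrors the `sample_in_path_count` logic.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical stand-in for a source node (not the isa-api class)."""
    name: str


@dataclass
class Sample:
    """Hypothetical stand-in for a sample node (not the isa-api class)."""
    name: str


def study_style_labels(path):
    """Label nodes along a path, numbering each Sample as the excerpt does."""
    labels = []
    sample_in_path_count = 0
    for node in path:
        if isinstance(node, Source):
            labels.append("Source Name")
        elif isinstance(node, Sample):
            # Each Sample on the path gets its own numbered label.
            labels.append("Sample Name.{}".format(sample_in_path_count))
            sample_in_path_count += 1
    return labels


path = [Source("src1"), Sample("sam1"), Sample("sam1_aliquot")]
print(study_style_labels(path))
```

With two samples on one path, each gets a distinct label, which is why the study writer appears to anticipate more than one Sample Name column.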

The write_assay_table_files function, however, does not count Sample Name nodes. You can see that it once did, but that code has been commented out:

    for study_obj in inv_obj.studies:
        for assay_obj in study_obj.assays:
            a_graph = assay_obj.graph
            if a_graph is None:
                break
            protrefcount = 0
            protnames = dict()

            def flatten(current_list):
                return [item for sublist in current_list for item in sublist]

            columns = []

            # start_nodes, end_nodes = _get_start_end_nodes(a_graph)
            paths = _all_end_to_end_paths(
                a_graph, [x for x in a_graph.nodes()
                          if isinstance(a_graph.indexes[x], Sample)])
            if len(paths) == 0:
                log.info("No paths found, skipping writing assay file")
                continue
            if _longest_path_and_attrs(paths, a_graph.indexes) is None:
                raise IOError(
                    "Could not find any valid end-to-end paths in assay graph")
            for node_index in _longest_path_and_attrs(paths, a_graph.indexes):
                node = a_graph.indexes[node_index]
                if isinstance(node, Sample):
                    olabel = "Sample Name"
                    # olabel = "Sample Name.{}".format(sample_in_path_count)
                    # sample_in_path_count += 1
                    columns.append(olabel)
                    columns += flatten(
                        map(lambda x: get_comment_column(olabel, x),
                            node.comments))
                    if write_factor_values:
                        columns += flatten(
                            map(lambda x: get_fv_columns(olabel, x),
                                node.factor_values))

                elif isinstance(node, Process):
                    olabel = "Protocol REF.{}".format(
                        node.executes_protocol.name)
                    columns.append(olabel)
                    if node.executes_protocol.name not in protnames.keys():
                        protnames[node.executes_protocol.name] = protrefcount
                        protrefcount += 1
                    if node.date is not None:
                        columns.append(olabel + ".Date")
                    if node.performer is not None:
                        columns.append(olabel + ".Performer")
                    columns += flatten(map(lambda x: get_pv_columns(olabel, x),
                                           node.parameter_values))
                    if node.executes_protocol.protocol_type:
                        oname_label = get_column_header(
                            node.executes_protocol.protocol_type.term,
                            protocol_types_dict
                        )
                        if oname_label is not None:
                            columns.append(oname_label)
                        elif node.executes_protocol.protocol_type.term.lower() \
                                in protocol_types_dict["nucleic acid hybridization"][SYNONYMS]:
                            columns.extend(
                                ["Hybridization Assay Name",
                                 "Array Design REF"])
                    columns += flatten(
                        map(lambda x: get_comment_column(olabel, x),
                            node.comments))
                    for output in [x for x in node.outputs if
                                   isinstance(x, DataFile)]:
                        columns.append(output.label)
                        columns += flatten(
                            map(lambda x: get_comment_column(output.label, x),
                                output.comments))

                elif isinstance(node, Material):
                    olabel = node.type
                    columns.append(olabel)
                    columns += flatten(
                        map(lambda x: get_characteristic_columns(olabel, x),
                            node.characteristics))
                    columns += flatten(
                        map(lambda x: get_comment_column(olabel, x),
                            node.comments))

                elif isinstance(node, DataFile):
                    pass  # handled in process

I also modified an example to have multiple Sample Name columns in an assay file, and it does not get converted to JSON correctly. No error is raised, but the samples from the second Sample Name column do not appear anywhere in the JSON.

It should also be noted that multiple Sample Name columns in the study or assay files do not produce any sort of validation error or warning.
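
As a rough idea of the kind of check I would expect, here is a sketch that counts repeated name columns in a tab-delimited table and returns a warning message. It uses only the Python standard library and is not part of the isa-api validator; the function names, the example table, and the message text are all my own assumptions.

```python
import csv
import io


def count_name_columns(table_text, header="Sample Name"):
    """Count the columns in a tab-delimited table whose header equals `header`."""
    reader = csv.reader(io.StringIO(table_text), delimiter="\t")
    headers = next(reader, [])  # an empty table yields no headers
    return sum(1 for h in headers if h.strip() == header)


def warn_on_repeated_name_columns(table_text, header="Sample Name"):
    """Return a warning string if `header` occurs more than once, else None."""
    count = count_name_columns(table_text, header)
    if count > 1:
        return "Found {} '{}' columns".format(count, header)
    return None


# Hypothetical assay table with two Sample Name columns.
assay_table = (
    "Sample Name\tProtocol REF\tSample Name\tProtocol REF\tRaw Data File\n"
    "sam1\taliquoting\tsam1a\tsequencing\tdata1.txt\n"
)
print(warn_on_repeated_name_columns(assay_table))
```

Whether this should be a warning or an error depends on which reading of the specification turns out to be the intended one.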

I hope I have demonstrated how both the documentation and code contradict what you said about Sample Name columns. If you are confident about how multiple Sample Names are supposed to work, then the documentation and code, both validation and conversion, should be changed to reflect that. If what I have shown is correct, however, then the ProcessSequenceFactory code needs to be changed to look for more than 1 Sample Name column. It may need to be changed regardless because it finds 1 set of samples from the study file and uses that as ground truth for the assays, so if there are new samples in an assay (due to having multiple Sample Name columns) they won't be found. Either way, the code and behavior do not agree with what you have said and it needs to be reconciled. I don't mind trying to make the code changes myself, but I do need to be sure about how things are supposed to function.
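
To illustrate what I mean by looking for more than one Sample Name column, here is a sketch that collects the values from every such column and reports assay samples that are absent from the study table. It is standard-library Python only and does not reflect how ProcessSequenceFactory is actually implemented; `samples_by_column`, `samples_missing_from_study`, and the example tables are hypothetical.

```python
import csv
import io


def samples_by_column(table_text, header="Sample Name"):
    """Return one list of cell values per column whose header equals `header`."""
    rows = list(csv.reader(io.StringIO(table_text), delimiter="\t"))
    if not rows:
        return []
    # Use positions, not names, so that repeated headers stay distinct.
    positions = [i for i, h in enumerate(rows[0]) if h.strip() == header]
    return [[row[i] for row in rows[1:] if i < len(row)] for i in positions]


def samples_missing_from_study(study_text, assay_text):
    """List assay sample names that do not occur in any study Sample Name column."""
    study = {name for col in samples_by_column(study_text) for name in col}
    assay = {name for col in samples_by_column(assay_text) for name in col}
    return sorted(assay - study)


# Hypothetical tables: the assay introduces an aliquot the study never names.
study_table = "Source Name\tProtocol REF\tSample Name\nsrc1\tcollection\tsam1\n"
assay_table = "Sample Name\tProtocol REF\tSample Name\nsam1\taliquoting\tsam1a\n"
print(samples_missing_from_study(study_table, assay_table))
```

I index columns by position here because a reader keyed on header names, such as `csv.DictReader`, keeps only one value per repeated header.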

ptth222 commented on June 21, 2024

I worked on another project for a bit, but now I am moving back to the one that involves this package. I really need this to be resolved so that I can move forward. Please consider this a gentle reminder. If a meeting would be better, then I would be happy to meet.
