uofs-pulse-binfo / rawphenotypes

A Tripal module for storing raw phenotypic data. Specifically meant to help researchers contribute raw data, visualize summaries and download for further analysis.

CSS 9.53% PHP 73.53% JavaScript 16.94%

tripal phenotypes trait loader raw-data

rawphenotypes's Issues

Days till calculator in generated spreadsheet not readable

Reduce font size of text since the spreadsheet writer uses a default width (option to alter width not implied in documentation).

Download: invalid input syntax for tripal_job progress

When downloading a file I get:

2018-04-23 11:22:07: Calling: rawpheno_trpdownload_generate_file(Array)
Generating CSV File: /var/www/dev/fresh/sites/default/files/tripal/tripal_downloads/rawpheno_csv2018Apr23_1524504124.csv
0% complete...
0.10288065843621% complete...
Job execution failed: SQLSTATE[22P02]: Invalid text representation: 7 ERROR: invalid input syntax for [error]
integer: "0.10288065843621"
LINE 1: UPDATE tripal_jobs SET progress='0.10288065843621' WHERE job...
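
The log suggests that the progress column only accepts an integer, while a fractional percentage is being written to it. A minimal sketch of computing a whole-number percentage follows. It is written in Python for illustration only (the module itself is PHP), and `progress_percent` is a hypothetical helper, not a function from the module or from Tripal.

```python
def progress_percent(rows_done: int, rows_total: int) -> int:
    """Return progress as a whole number between 0 and 100."""
    if rows_total <= 0:
        # Nothing to process: report 0 rather than dividing by zero.
        return 0
    # Integer division keeps the result an int, so no fractional
    # value can reach an integer database column.
    percent = (rows_done * 100) // rows_total
    return max(0, min(100, percent))
```

With one row done out of 972, the fractional value would be about 0.10288; the sketch reports 0 instead.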

Updates to support HTTPS

Issue #36 - Create a function to fetch support email address.

Create an API function that retrieves the support email. That API function in rawphenotypes would retrieve a default email and then call the alter function. KP nodes would then implement the alter function and supply their own email.

Spaces in data

Trim leading and trailing spaces in data before saving to database. Especially for plant prop headers Plot, Entry, Rep etc.

Errors when the module is installed.

When installing the module I see the following errors:

$ drush en rawpheno
The following extensions will be enabled: rawpheno
Do you really want to continue? (y/n): y
CRITICAL (RAW PHENOTYPES): Chado/Tripal failed to insert cvterm (traits)
[site http://default] [TRIPAL CRITICAL] [RAW PHENOTYPES] Chado/Tripal failed to insert cvterm (traits)
CRITICAL (RAW PHENOTYPES): Chado/Tripal failed to insert cvterm (traits)
[site http://default] [TRIPAL CRITICAL] [RAW PHENOTYPES] Chado/Tripal failed to insert cvterm (traits)
CRITICAL (RAW PHENOTYPES): Chado/Tripal failed to insert cvterm (traits)
[site http://default] [TRIPAL CRITICAL] [RAW PHENOTYPES] Chado/Tripal failed to insert cvterm (traits)
CRITICAL (RAW PHENOTYPES): Chado/Tripal failed to insert cvterm (traits)
[site http://default] [TRIPAL CRITICAL] [RAW PHENOTYPES] Chado/Tripal failed to insert cvterm (traits)
CRITICAL (RAW PHENOTYPES): Chado/Tripal failed to insert cvterm (traits)
[site http://default] [TRIPAL CRITICAL] [RAW PHENOTYPES] Chado/Tripal failed to insert cvterm (traits)
CRITICAL (RAW PHENOTYPES): Chado/Tripal failed to insert cvterm (traits)
[site http://default] [TRIPAL CRITICAL] [RAW PHENOTYPES] Chado/Tripal failed to insert cvterm (traits)
rawpheno was enabled successfully.                                                                                                                                                                                                   [ok]
Custom table, 'rawpheno_rawdata_mview' ,  created successfully.                                                                                                                                                                      [status]
Materialized view 'rawpheno_rawdata_summary' created

Fail to add column header in manage project assets

An error happened on knowpulse at directory "Home » Administration » Tripal » Extensions» Manage Projects" when I tried to add a column header to one project (Lentil Diversity Panel Biomass or LR-11 Flowering Time) .

The webpage showed "Error: The website encountered an unexpected error. Please try again later." after I filled in blanks and submitted.

Other two functions "ADD EXISTING COLUMN HEADER" AND "ADD USER" on this webpage work fine.

Issue #24 - Allow user to download environment data in download.

Add an option to include environment data. The idea is that, per project + location, the environment data (when available) will be archived/zipped/compressed/packaged together with the raw phenotypic data generated by this form.

A separate tab for environment data lists all files and allows an admin to add more files for a project and location. An additional piece of information, the year, is requested; it becomes part of the filename and distinguishes environment data files from one year to another.

To implement: Update generate_file() function to fetch environment data file based on selected project + location combination and archive. To establish relationship between environment data file and project+location, a custom table containing the following fields would be necessary.

environment_data_id (serial) primary key, project_id fk, location (varchar), year (varchar), rank/sequence/version (varchar) and file_id fk. The rank/sequence/version field is a series number for each file, in case there are two or more environment data files for a given project + location and year.

Support for NA and four-digit year YYYY in Planting Date.

Dates instead of Days to

We have requested "Days to" for many of our AGILE Phenotypes. However, some data collectors are still recording dates. This is going to cause issues with the validator. Kirstin has requested we think about automatically converting a date to "days to", with the argument that this would be less error-prone than the data collector converting them all.

However, we are concerned because Excel does a lot of auto correcting of dates that is not only hard to predict but can also cause data collection errors. Do we really want to support this?
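
For reference, the conversion under discussion is a simple date difference once both dates are unambiguous. The sketch below is illustrative Python, not module code; `days_to` is a hypothetical helper, and it assumes the planting date and the recorded date have already been parsed into real dates, which is exactly the part Excel auto-correction makes risky.

```python
from datetime import date


def days_to(planting_date: date, observed_date: date) -> int:
    """Return the number of days from planting to the observed event."""
    delta = (observed_date - planting_date).days
    if delta < 0:
        # An event recorded before planting almost certainly signals
        # a mangled or mistyped date, so refuse to convert it.
        raise ValueError("observed date is earlier than planting date")
    return delta
```

Rejecting negative results catches only one class of corrupted dates; a swapped day and month that still lands after planting would pass silently.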

Allow abbreviation of not applicable N/A, NA, N.A. in Plot, Replicate, Location and other phenotypes.

Accept all variations (case-insensitive) of not applicable NA, N/A and N.A. in Plot, Replicate, Location
and other phenotypes.
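
A minimal sketch of such a check, in Python for illustration (the module is PHP). The set of accepted spellings is taken from this issue; `is_not_applicable` is a hypothetical name.

```python
# Accepted spellings of "not applicable", stored in lower case.
NOT_APPLICABLE = {"na", "n/a", "n.a."}


def is_not_applicable(value: str) -> bool:
    """Return True if the cell holds any accepted 'not applicable' form."""
    # Trim surrounding whitespace and ignore case before comparing.
    return value.strip().lower() in NOT_APPLICABLE
```

An explicit set is used rather than a loose pattern so that only the listed variations are accepted.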

plant/plot records only inserted (not re-used)

Expected Behaviour: Each unique combination of plot, germplasm, year, rep and location should have one record in the pheno_plant table.

Current Behaviour: There is one entry in the pheno_plant table per row/file combination.

Example

There are two researchers working on the same field trial (the same set of plots). One is taking data for traits 1-5 and the other is taking data for traits 6-10. This data is collected in two files (one per researcher) and uploaded independently.

Expected: On download the supervisor expects to have a single row for a given plot with data for traits 1-10. This means the underlying data should be attached to a single pheno_plant record.

Current: The download file has two rows for a given plot. The first has data for traits 1-5 with empty data for 6-10. The second has data for 6-10 with empty data for 1-5.
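
The expected behaviour amounts to a look-up-or-create keyed on the unique combination listed above. A sketch in Python under assumed data shapes (each uploaded row is a dict with `plot`, `germplasm`, `year`, `rep`, `location` and a `traits` dict); `merge_uploads` is a hypothetical name and no database access is modelled.

```python
def merge_uploads(uploads):
    """Merge rows from several files into one record per unique plant/plot."""
    plants = {}
    for rows in uploads:
        for row in rows:
            key = (row["plot"], row["germplasm"], row["year"],
                   row["rep"], row["location"])
            # Reuse the existing record for this key if there is one,
            # otherwise create it, then attach this file's measurements.
            plants.setdefault(key, {}).update(row["traits"])
    return plants
```

With two files covering the same plot, the result holds a single record carrying the traits from both.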

Validation does not occur in Step 1 for newly added traits which do not specify unit

When reviewing Pull request #32, I realized that allowing column headers to omit the unit in the format (unit) makes it difficult to validate that the unit actually makes sense. For example, cm = integer, date actually reflects a date, and so on. Sometimes this has resulted in an error during step 3 as in #32, but this is not always the case. Regardless, we have discussed the issue of whether or not we want to allow this kind of flexibility in the first place. Concerns that arose were:

Implementing validation at the second stage, or redesigning the entire flow of the module will take up too much time
"Yelling" at users who are uploading bonus traits does not feel right, and may discourage further use of the module or encourage abandonment at the second stage
What happens if the same "bonus" trait is uploaded twice? It will then be picked up by validation in step 1. Now what happens if it was fine the first time but not the second time? Not only can this discourage the user, but we shudder at the thought of the data being fixed only after a first attempt was successfully made, resulting in heterogeneous values for this trait.

We propose the following solution to address all 3 concerns. This will occur during step 3:

Check if the trait is a newly-defined trait. If yes:
   Validate the unit. If validation passes:
      Save values
   Else if validation fails:
      Ignore values, but send an email to the administrator detailing the problem trait
Else if not a new trait:
   Save values

Thus, this issue can be solved by confirming the unit manually by the admin (or asking a local expert) or even contacting the original phenotyper for clarification (as the notification will be immediate), but the remainder of the data still gets saved.
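
The proposed flow can be sketched as follows. This is illustrative Python, not module code: `handle_trait` is a hypothetical name, and the unit check, the save step and the admin notification are passed in as callables because their real implementations are not described here.

```python
def handle_trait(trait, values, is_new_trait, unit_is_valid, save, notify_admin):
    """Apply the step 3 decision flow to one trait column."""
    if is_new_trait and not unit_is_valid(trait, values):
        # Newly defined trait whose values do not fit its unit:
        # skip the values and report the problem trait to the admin.
        notify_admin(trait)
        return "ignored"
    # Either an existing trait, or a new trait that passed validation.
    save(trait, values)
    return "saved"
```

Only the failing trait is skipped, so the remainder of the upload is unaffected.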

Make Tripal v3 compatible

To be made Tripal v3 compatible this module simply needs the dependency in the .info file changed from tripal_core to tripal.

This is due to the main change between Tripal v2 and Tripal v3 being Nodes => Entities and this module does not interact with Tripal Nodes.

Make rawphenotypes pages easily accessible to data collectors.

It has been pointed out that raw phenotypes data collectors struggle to locate links when working with the module. To address this issue, add/relocate links relevant to rawphenotypes module to a section of KP where it can be easily seen or accessed.

Summary page update - wrap text in location

The Location column header stores the location of a field trial. The module does not have a uniform way of encoding the value in this column, so in one project it shows only the country and in another it shows the region/city plus the country. This issue will implement a standard format for location by wrapping the text onto two rows: ideally the first line shows the country and the second line shows the region/city. For example, Saskatoon, Canada

CANADA
Saskatoon

or Cordoba, Spain

SPAIN
Cordoba
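
A sketch of the two-line format in Python, for illustration only. It assumes the stored value has the form "region/city, country" with the country after the last comma; `format_location` is a hypothetical name, and upper-casing a value that has no comma is an assumption.

```python
def format_location(location: str) -> str:
    """Return the country in upper case on line 1 and the region/city on line 2."""
    # Split on the last comma so a region that itself contains
    # a comma stays together on the second line.
    region, comma, country = location.rpartition(",")
    if not comma:
        # Only one part was stored; treat it as the country.
        return location.strip().upper()
    return f"{country.strip().upper()}\n{region.strip()}"
```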

Download Data Enhancements.

Allow heatmap elements and select fields in data summary page to be data download filter options.
Create filter by year option in data download page.
Create filter by RIL option in data download page.

Backup fails to save the file when Measurements tab is renamed.

Currently the Backup fails to save the file when Measurements tab is renamed. This is not okay as the file should always be saved during backup, regardless of validation failing.

Number of reps in Lodging (1st; Scale: 1-5) upright - lodged triggers validation error

Reps in the lodging trait trigger a validation error - trait not properly formatted.

Whitespace between words in column header --not recognized

When a user adds extra space between words in the column header for any trait (essential, optional or new), the trait should still be recognized. For example, Planting Date written with two spaces between the words should match Planting Date. This should match for all column types throughout all stages: stage 1 validation, new trait detection and when loading the data in stage 3.

There is a partial fix for this in 40c2090#diff-b6f1b3636044514f9512c37ba5031205R938 but it is still showing errors.
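
One way to make the match tolerant is to normalize both headers before comparing them. The sketch below is illustrative Python, not the module's PHP; `normalize_header` and `headers_match` are hypothetical names.

```python
def normalize_header(header: str) -> str:
    """Collapse runs of whitespace to single spaces and trim both ends."""
    # str.split() with no arguments splits on any run of whitespace,
    # which in Python includes tabs and non-breaking spaces.
    return " ".join(header.split())


def headers_match(uploaded: str, expected: str) -> bool:
    """Compare two column headers after whitespace normalization."""
    return normalize_header(uploaded) == normalize_header(expected)
```

Applying the same normalization at every stage (validation, new trait detection and loading) would keep the stages consistent with each other.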

WSOD of Download page

I'm experiencing a WSOD on the download page (phenotypes/raw/download) of my Tripal2 KnowPulse clone. It appears to be related to an empty location being passed to rawpheno_download_load_traits and fed directly into the SQL query. This results in the following PDO Exception:

Trait summary barchart labels incorrect

The trait barchart currently says the y-axis is the "number of germplasm" when what is actually represented is the "number of plots". Furthermore, it says the x-axis is the average, which is misleading. It only averages if there is more than one measurement for a unique trait/plot combination, which is highly unlikely. It should be "Average Observed Measurements per Plot".

Country name as location

The current summary chart shows location as the name of the country (e.g. India, Spain). This is vague and might cause confusion when there are multiple trials in the same country.

Suggest a naming format that includes the specific town/city and the country.
For example: Sutherland Saskatoon, Canada
Central Ferry Washington, USA
Fill Location in advance when downloading data collection spreadsheet file.
Support the use of GPS coordinates.

Non-Microsoft Excel spreadsheet (eg. from LibreOffice) fails validation

Non-Microsoft Excel spreadsheet (eg. from LibreOffice) fails validation originally reported in #31 by @Jiu9Shen.

From @carolyncaron

@Jiu9Shen confirmed that your test spreadsheets prompt an error because of no content.
However, other test cases that contain data did not prompt errors during validation when expected. :-(
I think this is a difficult problem for you to debug without easy access to Linux or LibreOffice. And, since it is not urgent, I suggest we create a separate issue for this bug and ask @Jiu9Shen to make an attempt at it once we are upgraded to Tripal 3.

Download returns empty file when location has comma symbol eg. Saskatchewan, Canada

The function uses a comma to separate the selected location(s). When a location itself contains a comma (for example, to include city or country information), it is interpreted as multiple unrelated values when it should be treated as one.
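
One possible way around this is to pass the selection in an encoding that does not depend on a delimiter appearing in the data. The sketch below uses a JSON list, in Python for illustration; it is an assumption about a possible fix, not a description of how the module resolved it, and both function names are hypothetical.

```python
import json


def encode_locations(locations):
    """Serialize the selected locations without relying on a separator."""
    return json.dumps(locations)


def decode_locations(encoded):
    """Recover the original list of locations from the encoded string."""
    return json.loads(encoded)
```

A round trip returns each location intact, whereas joining and splitting on a comma would turn "Saskatchewan, Canada" into two values.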

Issue #27 - Implement a hook to search new stock names, ignore column header(s) and provide support email

Allow other modules to alter some variables via drupal_alter().

This will be used to search germplasm with tokens for particular project.
Ignore a set of column header(s)
Provide a support email in rawphenotypes block.

Non-breaking spaces in column headers cause issues

This bug was found while testing PR #58. Pre-existing traits will not be detected by the system if a non-breaking space is present at the beginning or end of the column header for those traits.

Should we reconsider separating multiple values for a single phenotype with a comma?

Currently, the module allows multiple values for a single phenotypic observation in the cases where multiple phenotypers may be uploading for the same project, and it does this by appending new values separated by a comma.

Derek suggests that commas, especially where numerical values occur, can be confusing for users who download the data down the road, since other parts of the world use commas in place of decimal points. For example, 1st value = 2,3 and 2nd value = 2,6 would result in 2,3,2,6. Additionally, comments may also become hard to understand or separate.

He suggests we can use semicolons instead, which are R-friendly as well as human-readable. Thus, the previous example would look like 2,3;2,6.

Any thoughts?
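
For illustration, appending with a semicolon could look like the following Python sketch; `append_value` is a hypothetical helper and the module's actual storage code is PHP.

```python
def append_value(existing: str, new: str, separator: str = ";") -> str:
    """Append a new observation to the stored value for a phenotype."""
    if not existing:
        # First observation: store it as is, with no separator.
        return new
    return f"{existing}{separator}{new}"
```

Using the example above, appending 2,6 to 2,3 gives 2,3;2,6, which can be split back into the two original values.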

Speed up Histogram in raw data summary page

Speed up histogram in raw data summary page.

Sorting of projects in the download page should be alphabetical

We've received feedback that the way projects are currently sorted in the dropdown appears "random" and is not intuitive. Reynold already addressed this in PR #65 but it might be buggy - instead of trying to fix it we think we should opt for the classic alphabetical sort as it looks to be the most intuitive option.

Currently (before PR #65) the default is set to "All Projects". PR #65 now sets the default to the project that has data uploaded most recently. We want to change this to "Select a Project" to force the user to choose when there are 2 or more. Otherwise, if there is only a single project it will be selected by default. :-)
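
The requested behaviour can be sketched as below, in Python for illustration. `project_options` is a hypothetical name, and the case-insensitive ordering is an assumption about what "alphabetical" should mean here.

```python
def project_options(project_names):
    """Return the dropdown options and the option selected by default."""
    # Sort alphabetically, ignoring case.
    names = sorted(project_names, key=str.casefold)
    if len(names) == 1:
        # A single project is selected automatically.
        return names, names[0]
    # Otherwise force the user to choose with a placeholder option.
    placeholder = "Select a Project"
    return [placeholder] + names, placeholder
```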

Download is resulting in "Failed -No file"

When I download data (I have tried every combination of select all, select one, etc.) on the KnowPulse production site, once the file is generated and I click the link Chrome says "Failed - No file".

This worked on our development site... I double checked permissions of the tripal_downloads file directory so it's not that. I also checked and the file is not there. The only error in the Drupal log is that the file isn't there and the apache error log is silent.

Show environment data option by default, omit All projects option and sort projects option.

Show environment data option by default and enable or disable option based on filter combination selected in download page.
Remove All Projects option in select project select box and
Sort project options based on the recent data uploaded and/or by planting year.

Upgrade to Tripal 4 + Drupal 9

Work on this has been started by @reynoldtan on branch 9.x-2.x. We have a PR open (#83) for testing of the current upgrade and would appreciate feedback on this issue or the PR if you are interested in using this module for Tripal 4.

THIS ISSUE SHOULD NOT BE CLOSED UNTIL

Tripal 4 stable is released
9.x-2.x branch is made the default branch
all documentation has been upgraded to the Drupal 9 / Tripal 4 / Raw Phenotypes 2.x

Update tripal_get_cvterm() and tripal_get_cv() in Rawphenotypes.

Update the API that uses tripal_get_cvterm() and tripal_get_cv() (the implementation returns an empty result in Tripal 3).

Stock Names Look-up should be restricted to an organism.

We need to check that when the stock_id for a given row is looked up, we restrict the query to the organism. Currently, if you try to load data for any of the breeding program crosses, the system returns a validation error telling you they are not unique when in fact they are unique within an organism.

Each phenotyping project should only collect data for a single organism. Therefore, we can solve this bug by saving the organism for a project in the projectprop table and then looking it up when validating or loading a raw phenotype dataset. We should add form elements to the admin manage projects interface to allow setting of the organism.
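
The effect of restricting the look-up can be sketched as follows. This is illustrative Python over in-memory records, not the module's Chado query; `find_stock_ids` is a hypothetical name, and the field names mirror the columns of chado.stock.

```python
def find_stock_ids(stocks, name, organism_id):
    """Return the stock_ids matching a name within one organism."""
    return [
        stock["stock_id"]
        for stock in stocks
        # Matching on name alone can return one stock per organism;
        # adding the project's organism keeps the match unique.
        if stock["name"] == name and stock["organism_id"] == organism_id
    ]
```

A name shared by stocks in two organisms then resolves to a single stock_id once the project's organism is supplied.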

Speed up summary chart page load time

Rawdata summary page takes a while ( > 2 mins) to load with under 500K phenotypes.

Error in deleting column header in manage project assets

An error happened on knowpulse at directory "Home » Administration » Tripal » Extensions» Manage Projects" when I tried to delete column headers (R7 Traits: Canopy Height (1st; cm) and R7 Traits: Canopy Height (2nd; cm)) in one project (Lentil Diversity Panel Biomass).

After I clicked the delete button on the webpage, a KnowPulse notice appeared and asked: "Are you sure to delete this column header?". It leads to an error page when I choose yes. The column headers I tried to delete still existed after several tries. However, they disappeared after several minutes without any further action.

The headers I tried to delete are EXISTING COLUMN HEADERS.

Upload failing in Step3: "Germplasm doesn't exist" when it does

When uploading a test (file: Test-2-NoErrors.xlsx):

step1: successfully finds all germplasm giving me a beautiful green checkmark beside all germplasm exist.
step2: no new traits
step3: when job is run on the command-line it fails with the following output:

2018-04-23 13:19:14: Calling: rawpheno_load_spreadsheet(63, a:0:{}, 1548, a:4:{i:0;s:5:"Entry";i:1;s:8:"Location";i:2;s:4:"Plot";i:3;s:3:"Rep";})
0% complete...
WD rawpheno: Uploading Phenoypic Data: Germplasm doesn't exist (name=Nugget; row=2) [error]
CRITICAL (RAWPHENO): Uploading Phenoypic Data: Germplasm doesn't exist (name=Nugget; row=2)
[site http://default] [TRIPAL CRITICAL] [RAWPHENO] Uploading Phenoypic Data: Germplasm doesn't exist (name=Nugget; row=2)
WD rawpheno: [CODE 103] Failed to load phenoypic data (job 7895) [error]
CRITICAL (RAWPHENO): [CODE 103] Failed to load phenoypic data (job 7895)
[site http://default] [TRIPAL CRITICAL] [RAWPHENO] [CODE 103] Failed to load phenoypic data (job 7895)
Drush command terminated abnormally due to an unrecoverable error. [error]

NOTE: The germplasm does exist:

kp3_fresh=# SELECT * FROM chado.stock WHERE name~'ugget';
stock_id | dbxref_id | organism_id | name | uniquename | description | type_id | is_obsolete
----------+-----------+-------------+--------+-------------+-------------+---------+-------------
8110 | 1910939 | 4 | Nugget | KP:GERM8110 | | 3683 | f
(1 row)

AJAX Http request error

Unsure what is causing this, but report of this error started when moving host to HTTPS.

Related Drupal community discussion:
https://www.drupal.org/node/1232416

Summary Barchart incorrectly states "No Data"

I found this bug during the Tripal 3 upgrade but it is not related to the Tripal version.

Symptom: The summary barchart displayed at phenotypes/raw when a trait is chosen, says there is no data when there is. This was experienced consistently for all traits on a KP clone site but not on the production site.

Instructions page: Icon position issue

On the instructions page the icons for collect/backup/submit data are overlaying the vertical tab pane instead of underneath.

See http://knowpulse.usask.ca/portal/phenotypes/raw/instructions (image attached).

Whitespace between words in column header (essential, subset etc.)

This will handle all column headers. A partial fix exists for a similar issue, but it only covers new column headers.

Warn users not to use barchart for publication

The barchart provided on phenotypes/raw for a specific trait utilizes raw data and as such should never be used in publication. This should be made more clear to researchers by including a disclaimer on the chart.

The following chart uses raw data and, as such, should never be used in publication. It is meant to give you a quick visual and to identify problems such as outliers to aid you in your analysis.

Collecting data for plots that are segregating.

Some of our plots are segregating for specific phenotypes. For example, some plots might be segregating for flower colour (e.g. white/purple) or days to flower (e.g. 42/59 days). In these cases, some data collectors will record both phenotypes they observed (as shown in brackets above). Kirstin feels the loader should handle this.

Upload/Backup Validation: long process for in page load.

See c46f04b.

There is a concern with the increase in max execution time needed for validation of long files. This causes a long AJAX upload spinner with no progress reported to the user. Furthermore, since it is dependent upon the size of the file, at some point the files will likely reach a size that breaks this.

One option is to move it into a Tripal Job. This pulls validation out of the page load and allows us to provide progress reporting to the user.

One Concern (@Reynold):
This step can be repeated many times, and having to register a job, wait in the job queue and execute it as a Tripal job each time might cause unnecessary wait time for the user and might not give as quick a response as what we currently have.

Warning on Upload page

There has been some confusion between the upload and backup pages since they look so similar. As such, it might be prudent to add a warning to the upload page indicating that this should only be done once per dataset once data collection has been completed. It would be helpful to point them to the backups page.

Documentation needs updating

We need to update the README to demonstrate the following functionality:

The dashboard on the front page
Reflect the environmental data option in the download screenshot as well as mention this functionality

Update the wiki to show:

How to upload environmental data files
How to set the email address for email support
Show tips for various drupal hooks to customize the module (examples: ignore column(s), add prefixes/suffixes to stock names)

Is there anything I missed?

Issue #31 - Page redirect, Non-MS Excel spreadsheet, Stage 2 and 3 errors.

Admin pages redirect to page not found in:
1. Create a project
2. Delete header, user and environment data file
Similar trait not suggested in Stage 2 - Describe Trait
Non-Microsoft Excel spreadsheet (eg. from LibreOffice) fails validation
Extra spaces in column headers fail Stage 3 - Save Spreadsheet

Gap in x axis

There appears to be a gap along the x axis of the Trait histogram. I believe this is caused by the extra space I've added to the heatmap to accommodate multi-line location names.

Reynold Requested Review.

Changes: cb18bae...master

Check: #3, #4, #5.