Comments (8)
Hi @jjfarrell, it looks like the path to the chain file was not passed properly in the expression. You’ll want to use string interpolation:
liftover_expr = f"lift_over_coordinates(contigName, start, end, {chain_file}, .99)"
I’ll clarify this in the docs. Thanks for bringing this up!
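For example, with a hypothetical path, the interpolated string looks like this (note that the value of chain_file is inserted verbatim into the SQL text):

chain_file = "/path/to/b37ToHg38.over.chain"  # hypothetical path for illustration
liftover_expr = f"lift_over_coordinates(contigName, start, end, {chain_file}, .99)"
print(liftover_expr)
# lift_over_coordinates(contigName, start, end, /path/to/b37ToHg38.over.chain, .99)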
Thanks for the quick response @karenfeng!
However, I am now getting a different error message after adding the f-string and {chain_file} interpolation. Both error messages point to (line 1, pos 46).
Traceback (most recent call last):
File "/share/pkg.7/spark/2.4.3/install/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/share/pkg.7/spark/2.4.3/install/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.functions.expr.
: org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input '/' expecting {'SELECT', 'FROM', 'ADD', 'AS', ... (long SQL keyword list elided) ..., IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 46)
== SQL ==
lift_over_coordinates(contigName, start, end, /restricted/projectnb/casa/wgs.hg38/benchmark.giab/b37ToHg38.over.chain, .99)
----------------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:241)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:117)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseExpression(ParseDriver.scala:44)
at org.apache.spark.sql.functions$.expr(functions.scala:1366)
at org.apache.spark.sql.functions.expr(functions.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:745)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/restricted/projectnb/casa/wgs.hg38/benchmark.giab/glow_liftover.py", line 22, in
input_with_lifted_df = input_df.select('contigName', 'start', 'end').withColumn('lifted', expr(liftover_expr))
File "/share/pkg.7/spark/2.4.3/install/python/lib/pyspark.zip/pyspark/sql/functions.py", line 675, in expr
File "/share/pkg.7/spark/2.4.3/install/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in call
File "/share/pkg.7/spark/2.4.3/install/python/lib/pyspark.zip/pyspark/sql/utils.py", line 73, in deco
pyspark.sql.utils.ParseException: "\nextraneous input '/' expecting {'SELECT', 'FROM', 'ADD', 'AS', ... (long SQL keyword list elided) ..., IDENTIFIER, BACKQUOTED_IDENTIFIER}(line 1, pos 46)\n\n== SQL ==\nlift_over_coordinates(contigName, start, end, /restricted/projectnb/casa/wgs.hg38/benchmark.giab/b37ToHg38.over.chain, .99)\n----------------------------------------------^^^\n"
Whoops, I forgot to wrap the chain file string in single-quotes to express it as a literal. Can you try this?
liftover_expr = f"lift_over_coordinates(contigName, start, end, '{chain_file}', .99)"
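A minimal sketch of the full pattern, using a hypothetical path (input_df is the DataFrame from earlier in the thread):

from pyspark.sql.functions import expr

chain_file = "/path/to/b37ToHg38.over.chain"  # hypothetical path
liftover_expr = f"lift_over_coordinates(contigName, start, end, '{chain_file}', .99)"
# The single quotes make the interpolated path a SQL string literal:
# lift_over_coordinates(contigName, start, end, '/path/to/b37ToHg38.over.chain', .99)
lifted_df = input_df.withColumn('lifted', expr(liftover_expr))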
I tried that, and it also did not work. Next, I tried eliminating the expression, which also did not work.
Finally, I found that the errors were being generated by the syntax in the select part of the statement. The column names needed to be quoted: 'contigName', 'start', 'end'
input_with_lifted_df = input_df.select('contigName', 'start', 'end').withColumn('lifted', glow.lift_over_coordinates('contigName', 'start', 'end', chain_file, 0.99))
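For anyone hitting the same problem, here is a self-contained sketch of this working column-based pattern, with hypothetical paths:

import glow
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark = glow.register(spark)  # per current Glow docs; older releases registered in place and returned None

chain_file = '/path/to/b37ToHg38.over.chain'  # hypothetical path
input_df = spark.read.format('vcf').load('/path/to/input.vcf.gz')  # hypothetical path

# Column names are passed as quoted strings; the chain file path and
# minimum match ratio are plain Python arguments.
input_with_lifted_df = (
    input_df.select('contigName', 'start', 'end')
            .withColumn('lifted',
                        glow.lift_over_coordinates('contigName', 'start', 'end',
                                                   chain_file, 0.99))
)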
Now that I was past this issue, I tried the variant liftover:
output_df = glow.transform('lift_over_variants', input_df, chain_file=chain_file, reference_file=reference_file)
That triggered the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o63.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 8, scc-q20.scc.bu.edu, executor 14): htsjdk.tribble.TribbleException: Badly formed variant context at location chr1:596697; getEnd() was 596797 but this VariantContext contains an END key with value 532177
The original b37 vcf has a deletion here:
1 532077 ACATTCATGCTCACTCATACACACCCAGATCATATATACACTCGTGCACACATTCACACTCATACACACCCAAATCATACTCACATTCATGCACACATGTT A
SVLEN=-100;;SVTYPE=DEL;END=532177;sizecat=100to299;
The liftover to hg38 should look like this:
chr1 596697 REF=ACATTCATGCTCACTCATACACACCCAGATCATATATACACTCGTGCACACATTCACACTCATACACACCCAAATCATACTCACATTCATGCACACATGTT
ALT=A
INFO Fields
SVLEN=-100; SVTYPE=DEL;END=596797;sizecat=100to299;
The error message suggests the transformer is not updating the INFO/END field from 532177 to 596797, so an error is triggered because the END coordinate falls before the new start. An incorrect INFO/END will cause problems with tabix and other downstream programs.
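If the transform completes on a given input, one quick sanity check is to look for records whose END precedes the new start. This is a sketch that assumes Glow's VCF reader surfaces the END key as a flattened INFO_END column (verify the column name against your schema):

# Hypothetical check: any row where INFO_END < start indicates an
# END key that was not updated during liftover.
output_df.filter('INFO_END IS NOT NULL AND INFO_END < start') \
         .select('contigName', 'start', 'INFO_END') \
         .show()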
To replicate this, I am using the GIAB SV VCF. It can be downloaded from here:
ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_SVs_Integration_v0.6/HG002_SVs_Tier1_v0.6.vcf.gz
Good to hear you got the syntax issues ironed out - I'll clean up the docs to reflect this usage. I believe that the issue may be resolved by upgrading our dependency on Picard (broadinstitute/picard#1469). I'll try updating our build to see if this resolves the problem.
@karenfeng I tested a relatively new version of GATK LiftoverVcf (gatk/4.1.7.0) and found the error was still there. So I submitted the issue to the GATK GitHub repository, where it was migrated to the Picard repo. Here is the issue:
broadinstitute/picard#1552
So it would be best to hold off on the Picard upgrade until a new release is available that resolves it.
Thanks for investigating, @jjfarrell. I'll keep an eye on the thread, and we'll bump our dependency when the issue is resolved.
Why was this closed? Was the issue resolved?