glow's Introduction

An open-source toolkit for large-scale genomic analyses

Glow is an open-source toolkit to enable bioinformatics at biobank-scale and beyond.


Easy to get started

The toolkit includes building blocks to perform common analyses right away, as in the example below:

  • Load VCF, BGEN, and Plink files into distributed DataFrames
  • Perform quality control and data manipulation with built-in functions
  • Perform variant normalization and liftOver
  • Perform genome-wide association studies
  • Integrate with Spark ML libraries for population stratification
  • Parallelize command line tools to scale existing workflows
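
A minimal Python sketch of getting started (assuming an active SparkSession named spark; the VCF path is a placeholder):

import glow
glow.register(spark)

# load variants into a distributed DataFrame and compute per-variant call statistics
df = spark.read.format("vcf").load("genotypes.vcf")
df.selectExpr("contigName", "start", "call_summary_stats(genotypes) as qc").show()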

Built to scale

Glow makes genomic data work with Spark, the leading engine for working with large structured datasets. It fits natively into the ecosystem of tools that have enabled thousands of organizations to scale their workflows. Glow bridges the gap between bioinformatics and the Spark ecosystem.

Flexible

Glow works with datasets in common file formats like VCF, BGEN, and Plink as well as high-performance big data standards. You can write queries using the native Spark SQL APIs in Python, SQL, R, Java, and Scala. The same APIs allow you to bring your genomic data together with other datasets such as electronic health records, real world evidence, and medical images. Glow makes it easy to parallelize existing tools and libraries implemented as command line tools or Pandas functions.
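
As a hedged sketch of that kind of integration (assuming a VCF DataFrame df like the one above; the ehr DataFrame and its sampleId join key are assumptions for illustration), the genotypes array can be exploded to one row per sample and joined against another dataset:

from pyspark.sql import functions as F

# one row per (variant, sample) pair
per_sample = df.select("contigName", "start", F.explode("genotypes").alias("g")) \
    .select("contigName", "start", "g.sampleId", "g.calls")
joined = per_sample.join(ehr, on="sampleId")  # hypothetical EHR/phenotype table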

Building and Testing

This project is built using sbt and Java 8.

To build and run Glow, you must install conda and activate the environment in python/environment.yml.

conda env create -f python/environment.yml
conda activate glow

When the environment file changes, you must update the environment:

conda env update -f python/environment.yml

Start an sbt shell using the sbt command.

Note: the SBT projects below are built against Spark 3.5.1 / Scala 2.12.15 by default. To change the Spark or Scala version, set the environment variables SPARK_VERSION and SCALA_VERSION.
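
For example (the version numbers here are purely illustrative):

export SPARK_VERSION=3.4.1
export SCALA_VERSION=2.13.8
sbt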

To compile the main code:

compile

To run all Scala tests:

core/test

To test a specific suite:

core/testOnly *VCFDataSourceSuite

To run all Python tests:

python/test

These tests will run with the same Spark classpath as the Scala tests.

To test a specific Python test file:

python/pytest python/test_render_template.py

When using the pytest key, all arguments are passed directly to the pytest runner.
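
For example, to select a single test by name with verbose output (the test name is hypothetical):

python/pytest -v -k test_my_case python/test_render_template.py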

To run documentation tests:

docs/test

To run the Scala, Python and documentation tests:

test

To run Scala tests against the staged Maven artifact with the current stable version:

stagedRelease/test

Testing code on a Databricks cluster

You can use the build script to create artifacts that you can install on a Databricks cluster.

To build Python and Scala artifacts:

bin/build --scala --python

To build only Python (no sbt installation required):

bin/build --python

To install the artifacts on a Databricks cluster after building:

bin/build --python --scala --install MY_CLUSTER_ID

IntelliJ Tips

To run Python unit tests from inside IntelliJ, you must:

  • Open the "Terminal" tab in IntelliJ
  • Activate the glow conda environment (conda activate glow)
  • Start an sbt shell from inside the terminal (sbt)

The "sbt shell" tab in IntelliJ will NOT work since it does not use the glow conda environment.

To run test or testOnly in remote debug mode with IntelliJ IDEA, set the remote debug configuration in IntelliJ to 'Attach to remote JVM' mode with a specific port number (the default port 5005 is used here), and then modify testJavaOptions in build.sbt to include:

"-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"

glow's Issues

Possible mistake in GWAS demo notebooks

I think this step in the WGR demo notebook has a mistake:

wgr_gwas = variant_df.join(adjusted_phenotypes, ['contigName']).select(
  ...
  expand_struct(linear_regression_gwas( 
    variant_df.values,
    adjusted_phenotypes.values,
    # ** This appears to get flattened as row-major and then reshaped as column-major internally **
    lit(covariates.to_numpy())
  )))

Here's a short program to show what made me think that:

import glow
from glow import linear_regression_gwas, expand_struct
import numpy as np
from pyspark.ml.linalg import DenseMatrix
from pyspark.sql.session import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as F

spark = SparkSession.builder\
    .config('spark.jars.packages', 'io.projectglow:glow_2.11:0.5.0')\
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")\
    .getOrCreate()
glow.register(spark)

# Create arrays to test with
np.random.seed(0)
x = np.array([ # Covariates
    [1, -1],
    [2, -2],
    [3, -3],
    [4, -4.],
])
g = np.array([0., 1., 2., 0.])
b = np.array([0., 1.])
# The `beta` for g should be very close to 1
y = g + np.dot(x, b) + np.random.normal(scale=.01, size=g.size)

#### Version 1
# Pass the covariates directly with correct ordering
dm = DenseMatrix(numRows=x.shape[0], numCols=x.shape[1], values=x.ravel(order='F').tolist())
np.testing.assert_equal(x, dm.toArray()) # `dm` is definitely defined correctly
spark.createDataFrame([Row(genotypes=g.tolist(), phenotypes=y.tolist(), covariates=dm)])\
    .select(expand_struct(linear_regression_gwas('genotypes', 'phenotypes', 'covariates')))\
    .show()
+------------------+--------------------+--------------------+
|              beta|       standardError|              pValue|
+------------------+--------------------+--------------------+
|0.9948866298677059|0.001300287478677...|8.320427901051293E-4|
+------------------+--------------------+--------------------+
# *Correct result*

#### Version 2
# Pass the covariates raveled as row-major and let DenseMatrix (incorrectly)
# interpret the concatenated rows as columns
dm = DenseMatrix(numRows=x.shape[0], numCols=x.shape[1], values=x.ravel(order='C').tolist())
print(dm.toArray())
[[ 1.  3.]
 [-1. -3.]
 [ 2.  4.]
 [-2. -4.]]
spark.createDataFrame([Row(genotypes=g.tolist(), phenotypes=y.tolist(), covariates=dm)])\
    .select(expand_struct(linear_regression_gwas('genotypes', 'phenotypes', 'covariates')))\
    .show()
+------------------+-------------------+--------------------+
|              beta|      standardError|              pValue|
+------------------+-------------------+--------------------+
|-2.382793056158103|0.20221757585071454|0.053898082087957454|
+------------------+-------------------+--------------------+
# * Incorrect result, beta not ~= 1

#### Version 3
# Pass the covariates array as a literal the same way it is done in the demo notebook
spark.createDataFrame([Row(genotypes=g.tolist(), phenotypes=y.tolist())])\
    .select(expand_struct(linear_regression_gwas('genotypes', 'phenotypes', F.lit(x))))\
    .show()
+------------------+-------------------+--------------------+
|              beta|      standardError|              pValue|
+------------------+-------------------+--------------------+
|-2.382793056158103|0.20221757585071454|0.053898082087957454|
+------------------+-------------------+--------------------+
# * Incorrect result when doing what the notebook does, beta not ~= 1

Version 3, using F.lit(x) like the demo notebook, sees the same incorrectly ordered covariates as version 2 (and produces the wrong result). Perhaps there is a better way to fix it, but F.lit(covariates.to_numpy()) should actually be something like F.lit(covariates.to_numpy().T.ravel(order='C').reshape(covariates.shape)). Here's an example that works as expected:

x_weird = x.T.ravel(order='C').reshape(x.shape)
print(x_weird)
spark.createDataFrame([Row(genotypes=g.tolist(), phenotypes=y.tolist())])\
    .select(expand_struct(linear_regression_gwas('genotypes', 'phenotypes', F.lit(x_weird))))\
    .show()
[[ 1.  2.]
 [ 3.  4.]
 [-1. -2.]
 [-3. -4.]]
+------------------+--------------------+--------------------+
|              beta|       standardError|              pValue|
+------------------+--------------------+--------------------+
|0.9948866298677059|0.001300287478677...|8.320427901051293E-4|
+------------------+--------------------+--------------------+

Hopefully there's a more graceful way to handle that internally.

Support exploded genotype structs in utility functions

It would be cool if all methods that work on ArrayType(StructType([calls_schema])) genotypes could also work on explode(genotypes), i.e. a bare StructType([calls_schema]).

Example:

  • genotype_states( <ArrayType(StructType([calls_schema]))> ) works
  • genotype_states( <StructType([calls_schema])> ) does not

Of course, this only applies to methods that work on each sample separately.
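
Until then, a possible workaround sketch in Python (assuming the glow.functions bindings and a DataFrame df with a genotypes column) is to wrap the exploded struct back into a single-element array:

from pyspark.sql import functions as F
from glow.functions import genotype_states

exploded = df.select(F.explode("genotypes").alias("genotype"))
# re-wrap the struct so the array-based utility function applies
states = exploded.select(genotype_states(F.array("genotype")).alias("states"))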

WGR covariate regularization

I noticed that the REGENIE preprint mentions covariate orthogonalization (implicitly removing covariate regularization) while this implementation uses constant alpha values of 1 here:

np.concatenate([np.ones(n_cov), np.ones(XtX.shape[1] - n_cov) * a]) for a in alpha_values

Is there any reason that shouldn't be np.zeros(n_cov) instead of 1's? I'm curious if choosing a different model was intentional here since it would otherwise seem to simplify a good bit of the code if you could just project the covariates out from the start.
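
In code, the suggested change would look like this (a sketch of the quoted expression with the covariate penalties zeroed):

np.concatenate([np.zeros(n_cov), np.ones(XtX.shape[1] - n_cov) * a]) for a in alpha_values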

WGR small inconsistencies and possible improvements

Hi,

I was looking through the preprint and some of the code for the new WGR functionality and I love the work you guys have done on it -- thank you for sharing that all! The work that went into the documentation is much appreciated too, especially the plain-text descriptions (an all too uncommon bridge between methods of a paper and the code itself IMO).

I had a few thoughts/suggestions that I thought I would share in looking at the code:

  • Matrix Inversion
    • Have you considered using something other than LU decomposition (via np.linalg.inv) for inverting (XtX + λI)? Even though the whole genotype matrix would rarely be singular, I think it is reasonable to expect that certain blocks of it may be
      • I think one good solution could be to swap inv for pinv to use an SVD solver instead (see the sketch after this list) in:
        return np.column_stack([(np.linalg.inv(XtX + np.diag(d)) @ XtY) for d in diags])
      • Some experiments on small datasets produced the exact same results this way when all blocks are non-singular
      • Another could be to use the fact that (XtX + λI) is symmetric and switch to Cholesky decomposition
      • I'm not very familiar with the performance characteristics of each, but I know the computational complexity of pseudo-inverses and Cholesky can be lower than LU, and even if they don't make for a big improvement in speed, I think the stability and robustness to a single block with perfectly correlated columns would be worth it
  • Genotype Scaling
  • Reduced Block Scaling
    • It appears that the stage 1 predictions are standardized on a per-sample-block basis rather than across all samples, as stated in the preprint and in the C++ code at: https://github.com/rgcgithub/regenie/blob/master/src/Step1_Models.cpp#L250
      • Was this intentional? I can see how that would be a difficult thing to change given that it happens in apply_model under different contexts, but I thought I would point it out since it could make for pretty dramatic differences when small sample block sizes are used (and be less stable if stddev approaches 0 in a block)
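
As a concrete sketch of the pinv swap suggested in the Matrix Inversion bullet above (same names and shapes as the quoted line):

return np.column_stack([(np.linalg.pinv(XtX + np.diag(d)) @ XtY) for d in diags])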

Thanks again!

Improve GWAS functions in Glow

  • Report contingency table for binary traits
  • Ensure that we're exposing all necessary association test result statistics
  • Performance optimization, particularly for multiple phenotypes

WGR incorrect default alphas in RidgeRegression

In the second-stage regression for WGR, the default alphas generated here based on the number of distinct headers include the label values:

self.alphas = generate_alphas(blockdf)

This means that the alphas become a function of the number of outcomes (i.e. labels), which I assume isn't what was intended, right? It seems odd to me that the results for one outcome would change depending on whether it was fit alone or with others, so I think the alphas should probably be based only on the number of features for the regression, as in stage 1 -- number of variant blocks * number of stage 1 alphas (not then also * number of outcomes).

Lowercase annotation names

Currently Glow 0.3.0 keeps INFO field annotation names in the form they appear in VCFs while flattening them into separate DataFrame columns. E.g.:

root
 |-- contigName: string (nullable = true)
 |-- start: long (nullable = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAlleles: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- splitFromMultiAllelic: boolean (nullable = true)
 |-- calls: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: integer (containsNull = true)
 |-- VARIANT_CLASS: string (nullable = true)
 |-- Feature_type: string (nullable = true)
 |-- Consequence: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- IMPACT: string (nullable = true)
 |-- BIOTYPE: string (nullable = true)
 |-- CADD_RAW: string (nullable = true)
 |-- CADD_PHRED: string (nullable = true)
 |-- SIFT: string (nullable = true)
 |-- PolyPhen: string (nullable = true)
 |-- CLIN_SIG: string (nullable = true)
 |-- SOMATIC: string (nullable = true)
 |-- PHENO: string (nullable = true)

This causes a compilation error during pattern matching. E.g.:

df.rdd
.distinct
.map {
  // VARIANT_CLASS|Feature_type|Consequence|IMPACT|BIOTYPE|SIFT|PolyPhen|CADD_RAW|CADD_PHRED
  case Row(contigName: String, start: Long, referenceAllele: String, alternateAlleles: mutable.WrappedArray[String],
           splitFromMultiAllelic: Boolean, calls: mutable.WrappedArray[mutable.WrappedArray[Int]], VARIANT_CLASS: String,
           Feature_type: String, Consequence: mutable.WrappedArray[String], IMPACT: String, BIOTYPE: String,
           SIFT: String, PolyPhen: String, CADD_RAW: String, CADD_PHRED: String) => ....

This fails with a "Pattern variables must start with a lower-case letter. (SLS 8.1.1.)" error.

Sure, there is a workaround (renaming the columns), but wouldn't it be better to avoid the problem in the first place and lowercase all the column names?
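
For reference, the renaming workaround is a one-liner in Python (a sketch; a Scala equivalent would fold withColumnRenamed over df.columns):

df_lower = df.toDF(*[c.lower() for c in df.columns])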

Choose smaller datatypes for BGEN and maybe VCF datasources

Currently when reading BGEN files, we represent the genotype probabilities as doubles and the ploidy as an int. However, the probabilities are typically only 8 bits in the BGEN files. Although the representation is not IEEE, I believe that a single precision float is sufficient. Ploidy should be represented as a short since it is typically 2.

Simple merge utility

Sometimes people want to use Glow to merge a bunch of single sample VCFs, typically from array data. The basic logic can be expressed with a group by on (contigName, start, end, referenceAllele, alternateAlleles) and then a collect_list on the genotypes and custom aggregations for different INFO fields.

We could streamline this process by adding a transformer that handles the common case.
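
A minimal sketch of that basic logic in Python (INFO-field aggregations omitted; assumes a DataFrame df of single-sample rows):

from pyspark.sql import functions as F

merged = df.groupBy("contigName", "start", "end", "referenceAllele", "alternateAlleles") \
    .agg(F.flatten(F.collect_list("genotypes")).alias("genotypes"))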

Reading UKBB bgen and writing vcf errors

I tried reading the 500,000 sample UK Biobank chr22 bgen file and writing it out as a vcf file.
input_df = spark.read.format("bgen") \
    .load(bgen_file, sampleFilePath=bgen_sample_file) \
    .cache()

path = '/project/casa/benchmark/ukbb_chr22.b38.vcf.gz'
input_df.write.format("bigvcf").save(path)

I am getting a number of errors. Are there any optimal configurations for executor memory or other job submit parameters that might help?

20/07/26 22:18:13 WARN execution.CacheManager: Asked to cache already cached data.
20/07/26 22:27:35 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 8 for reason Container from a bad node: container_e2453_1584403376502_0692_01_000015 on host: scc-q03.scc.bu.edu. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143.
Killed by external signal
.
20/07/26 22:27:35 ERROR cluster.YarnScheduler: Lost executor 8 on scc-q03.scc.bu.edu: Container from a bad node: container_e2453_1584403376502_0692_01_000015 on host: scc-q03.scc.bu.edu. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143.
Killed by external signal
.
20/07/26 22:27:35 WARN scheduler.TaskSetManager: Lost task 2.0 in stage 0.0 (TID 11, scc-q03.scc.bu.edu, executor 8): ExecutorLostFailure (executor 8 exited caused by one of the running tasks) Reason: Container from a bad node: container_e2453_1584403376502_0692_01_000015 on host: scc-q03.scc.bu.edu. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143.
Killed by external signal
.
[...]
20/07/26 22:30:00 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 4 for reason Container from a bad node: container_e2453_1584403376502_0692_01_000008 on host: scc-q05.scc.bu.edu. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143.
Killed by external signal
.
20/07/26 22:30:00 ERROR cluster.YarnScheduler: Lost executor 4 on scc-q05.scc.bu.edu: Container from a bad node: container_e2453_1584403376502_0692_01_000008 on host: scc-q05.scc.bu.edu. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143.
Killed by external signal
.

error reading vcf

After installing from source at current master (0.1.3-SNAPSHOT), I start a shell with

pyspark --packages io.projectglow:glow_2.11:0.1.3-SNAPSHOT

which seems to start fine, but when trying to read a VCF I get the following error:

>>> import glow
>>> glow.register(spark)
>>> df = spark.read.format('vcf').load("test1.vcf")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/paschalj/glow/v7/venv/lib/python3.6/site-packages/pyspark/sql/readwriter.py", line 166, in load
    return self._df(self._jreader.load(path))
  File "/home/paschalj/glow/v7/venv/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/home/paschalj/glow/v7/venv/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/home/paschalj/glow/v7/venv/lib/python3.6/site-packages/pyspark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o35.load.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider io.projectglow.bgen.BgenFileFormat could not be instantiated
	at java.util.ServiceLoader.fail(ServiceLoader.java:232)
	at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
	at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
	at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
	at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:250)
	at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:248)
	at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
	at scala.collection.TraversableLike.filter(TraversableLike.scala:262)
	at scala.collection.TraversableLike.filter$(TraversableLike.scala:262)
	at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:630)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/FileFormat$class
	at io.projectglow.bgen.BgenFileFormat.<init>(BgenFileFormat.scala:42)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at java.lang.Class.newInstance(Class.java:442)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
	... 29 more
Caused by: java.lang.ClassNotFoundException: org.apache.spark.sql.execution.datasources.FileFormat$class
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
	... 36 more

liftOver issue: adjusting the genotypes array when alleles are swapped

I tested Glow to liftOver a GRCh37 BGEN file to GRCh38. Ideally, sites with swapped alleles should get corrected and the genotypes array should get adjusted accordingly. However, it appears that the “posteriorProbabilities” in the genotypes array are not swapped for sites where “INFO_SwappedAlleles” is true.

Support for block-gzipped fasta files

Trying to run the "normalize_variants" transformer with a block-gzipped + indexed fasta file results in the following error:

Py4JJavaError: An error occurred while calling o376.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 35, localhost, executor driver): htsjdk.samtools.SAMException: Indexed block-compressed FASTA file cannot be handled: /s/project/variantDatabase/gtex/v8/hg19/dna.fa.gz
	at htsjdk.samtools.reference.IndexedFastaSequenceFile.<init>(IndexedFastaSequenceFile.java:84)
	at htsjdk.samtools.reference.IndexedFastaSequenceFile.<init>(IndexedFastaSequenceFile.java:98)
	at io.projectglow.sql.expressions.NormalizeVariantExpr$.doVariantNormalization(NormalizeVariantExpr.scala:47)
	at io.projectglow.sql.expressions.NormalizeVariantExpr.doVariantNormalization(NormalizeVariantExpr.scala)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:255)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:836)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:123)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

Support for Spark 3.0

Since Spark 3.0 is right around the corner, can you folks comment on your plans to support it?

Databricks notebook docs

We have notebooks in the docs with commands that can only be run in Databricks (e.g. dbutils, display). We should add some docs to explain this.

netlib native dependency issue

When I run:

pyspark --packages io.projectglow:glow_2.11:0.1.0

I get a dependency error:

py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.io.FileNotFoundException: File file:/home/paschalj/.ivy2/jars/com.github.fommil.netlib_netlib-native_ref-osx-x86_64-1.1.jar does not exist

I attempted to compile Spark with the needed libraries:

mvn -Pyarn -Phadoop-2.7 -Pnetlib-lgpl -DskipTests clean package

as described here:
https://github.com/Mega-DatA-Lab/SpectralLDA-Spark/wiki/Compile-Spark-with-Native-BLAS-LAPACK-Support

but so far I still get that error, though I have not validated that those libraries are properly working with my Spark install.
I'm using Ubuntu 18.04.
Any suggestions much appreciated!
Thanks!

Record CONTIG header lines as metadata on contig column

Some users have hit issues with the pipe transformer or when reading VCFs written by Glow because we do not pass on the contig header lines from the original VCF. A simple fix would be to add these as column metadata, probably for the contigName column since they're logically related and people rarely drop contigName.

Some other options could be:

  • Add a sequenceDictionary argument when writing VCFs or using the pipe transformer that accepts a VCF or .dict file. This argument would only be valid when inferring the output header.
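
As a sketch of the column-metadata idea (the contigHeaderLines key and the header line shown are hypothetical), Spark already allows attaching metadata to a column via alias:

from pyspark.sql import functions as F

df = df.withColumn(
    "contigName",
    F.col("contigName").alias(
        "contigName",
        metadata={"contigHeaderLines": ["##contig=<ID=chr1,length=248956422>"]}))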

Poor performance for QC filtering on 1KG data

Hi, I am trying to run some basic QC on a 1KG dataset with ~25M variants and 629 samples and I'm having issues making this work efficiently with Glow. These same operations take 35 seconds with PLINK, about 20 minutes with Hail, and at least 3 hours with Glow (I killed the jobs after that long so I'm not sure how long it would have ultimately taken). Is there some way I can rewrite this more efficiently?

// Load a PLINK dataset generated as a part of Marees et al. 2018 tutorial
// and write as parquet (see: https://www.ncbi.nlm.nih.gov/pubmed/29484742)
val dataFile = "ALL.2of4intersection.20100804.genotypes_no_missing_IDs"
ss.read.format("plink")
    .option("famDelimiter", "\t")
    .option("bimDelimiter", "\t")
    .load(data_dir / dataFile + ".bed" toString)
    .write.parquet(data_dir / dataFile + ".parquet" toString)

val df = ss.read.parquet(data_dir / dataFile + ".parquet" toString)

// QC Functions
def count(df: DataFrame) = (df.count, df.select(size(col("genotypes"))).first.getAs[Int](0)) // (n_variants, n_samples)

def filterBySampleCallRate(threshold: Double)(df: DataFrame): DataFrame = { 
    df
        // Cross join original dataset with single-row data frame containing a map like (sampleId -> QC stats)
        .crossJoin(
            df
            .selectExpr("sample_call_summary_stats(genotypes, referenceAllele, alternateAlleles) as qc")
            .selectExpr("map_from_arrays(qc.sampleId, qc) as qc")
        )
        // For each row, filter the genotypes array (which has one element per sampleId) based on QC map lookup
        .selectExpr("*", s"filter(genotypes, g -> qc[g.sampleId].callRate >= ${threshold}) as filtered_genotypes")
        // Remove intermediate fields 
        .drop("qc", "genotypes").withColumnRenamed("filtered_genotypes", "genotypes")
        // Ensure that the original dataset schema was preserved
        .transform(d => {assert(d.schema.equals(df.schema)); d})
}

def filterByVariantCallRate(threshold: Double)(df: DataFrame): DataFrame = { 
    df
        .selectExpr("*", "call_summary_stats(genotypes) as qc")
        .filter(col("qc.callRate") >= threshold)
        .drop("qc")
        .transform(d => {assert(d.schema.equals(df.schema)); d})
}

val df_qc1 = df
    .transform(filterByVariantCallRate(threshold=.8))
    .transform(filterBySampleCallRate(threshold=.8))
    .transform(filterByVariantCallRate(threshold=.98))
    .transform(filterBySampleCallRate(threshold=.98))


println(count(df_qc1))

I'm using the latest version of Glow, but I'm running on my own snapshot build with the fix for #135 (which is necessary for the above to run). I'm also using Spark 2.4.4 in local mode with a 64G heap size (on a 128G, 16-core machine). I know local mode is not a typical use case, but I don't see what the problem is here since all 16 cores are at nearly 100% utilization for the full 3 hours. It's also not GC pressure, since only ~32G of the heap is actually being used and, according to the Spark UI, the task time to GC time ratio is 4.6 h to 1.4 min for one of these jobs after a couple of hours.

Do you have any ideas what might be going on here? Thank you!

Bug in VCF reader: Wrong start position

Hi, I'm failing to work with glow because it reports some variants at the wrong positions:
[screenshot: Glow reports the variant with start=12513179]
It's a SNP, so its start position should be 12513180 instead of 12513179:

#CHROM	POS	ID	REF	ALT	QUAL	FILTER
X	12513180	X_12513180_G_A_b37	G	A	4027.47	PASS

Below is an MCVE.
I attached the corresponding test.vcf file.

import pyspark
from pyspark.sql import SparkSession
import glow
import os
import pandas as pd

import pyspark.sql.functions as f

os.environ['PYSPARK_SUBMIT_ARGS'] = " ".join([
    "--packages io.projectglow:glow_2.11:0.2.0,org.bdgenomics.adam:adam-assembly-spark2_2.12:0.29.0,io.delta:delta-core_2.11:0.4.0",
    "--driver-memory %sk" % MEM,
    "--executor-memory %sk" % MEM,
    "pyspark-shell"
])
os.environ['PYSPARK_SUBMIT_ARGS']

## define custom display Head to simplify looking at spark DataFrames.
def displayHead(df, nrows = 5):
    return df.limit(nrows).toPandas()

spark = (
    SparkSession.builder
    .appName('abc')
    .config("spark.local.dir", os.environ.get("TMP"))
    .config("spark.sql.execution.arrow.enabled", "true")
    .config("spark.sql.shuffle.partitions", "2001")
    .config("spark.driver.maxResultSize", "48G")
    .config("spark.databricks.io.cache.enabled", "true")
    .config("spark.databricks.io.cache.maxDiskUsage", "50G")
    .getOrCreate()
)
glow.register(spark)

df = (
    spark
    .read
    .format('vcf')
    .option("includeSampleIds", True)
    .load(os.path.expanduser("~/test.vcf"))
    .drop("names")
)

displayHead(df)

Transitive dependencies of htsjdk:2.20.1 unresolvable with sbt

sbt:glow> test
[info] Updating core...
[info] Checking 35 Scala sources...
[info] Checking 80 Scala sources...
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	::          UNRESOLVED DEPENDENCIES         ::
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	:: org.apache.commons#commons-jexl;2.1.1: org.apache.commons#commons-jexl;2.1.1!commons-jexl.pom(pom.original) origin location must be absolute: file:/Users/foo/.m2/repository/org/apache/commons/commons-jexl/2.1.1/commons-jexl-2.1.1.pom
[warn] 	:: org.xerial.snappy#snappy-java;1.1.4: org.xerial.snappy#snappy-java;1.1.4!snappy-java.pom(pom.original) origin location must be absolute: file:/Users/foo/.m2/repository/org/xerial/snappy/snappy-java/1.1.4/snappy-java-1.1.4.pom
[warn] 	:: gov.nih.nlm.ncbi#ngs-java;2.9.0: gov.nih.nlm.ncbi#ngs-java;2.9.0!ngs-java.pom(pom.original) origin location must be absolute: file:/Users/foo/.m2/repository/gov/nih/nlm/ncbi/ngs-java/2.9.0/ngs-java-2.9.0.pom
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn]
[warn] 	Note: Unresolved dependencies path:
[warn] 		org.apache.commons:commons-jexl:2.1.1
[warn] 		  +- com.github.samtools:htsjdk:2.20.1
[warn] 		  +- org.seqdoop:hadoop-bam:7.9.2
[warn] 		  +- org.bdgenomics.adam:adam-core-spark2_2.11:0.28.0
[warn] 		  +- org.bdgenomics.adam:adam-apis-spark2_2.11:0.28.0 (working/glow/build.sbt#L129)
[warn] 		  +- io.projectglow:glow_2.11:0.2.0-SNAPSHOT
[warn] 		org.xerial.snappy:snappy-java:1.1.4
[warn] 		  +- com.github.samtools:htsjdk:2.20.1
[warn] 		  +- org.seqdoop:hadoop-bam:7.9.2
[warn] 		  +- org.bdgenomics.adam:adam-core-spark2_2.11:0.28.0
[warn] 		  +- org.bdgenomics.adam:adam-apis-spark2_2.11:0.28.0 (working/glow/build.sbt#L129)
[warn] 		  +- io.projectglow:glow_2.11:0.2.0-SNAPSHOT
[warn] 		gov.nih.nlm.ncbi:ngs-java:2.9.0
[warn] 		  +- com.github.samtools:htsjdk:2.20.1
[warn] 		  +- org.seqdoop:hadoop-bam:7.9.2
[warn] 		  +- org.bdgenomics.adam:adam-core-spark2_2.11:0.28.0
[warn] 		  +- org.bdgenomics.adam:adam-apis-spark2_2.11:0.28.0 (working/glow/build.sbt#L129)
[warn] 		  +- io.projectglow:glow_2.11:0.2.0-SNAPSHOT
[error] sbt.librarymanagement.ResolveException: unresolved dependency: org.apache.commons#commons-jexl;2.1.1: org.apache.commons#commons-jexl;2.1.1!commons-jexl.pom(pom.original) origin location must be absolute: file:/Users/foo/.m2/repository/org/apache/commons/commons-jexl/2.1.1/commons-jexl-2.1.1.pom
[error] unresolved dependency: org.xerial.snappy#snappy-java;1.1.4: org.xerial.snappy#snappy-java;1.1.4!snappy-java.pom(pom.original) origin location must be absolute: file:/Users/foo/.m2/repository/org/xerial/snappy/snappy-java/1.1.4/snappy-java-1.1.4.pom
[error] unresolved dependency: gov.nih.nlm.ncbi#ngs-java;2.9.0: gov.nih.nlm.ncbi#ngs-java;2.9.0!ngs-java.pom(pom.original) origin location must be absolute: file:/Users/foo/.m2/repository/gov/nih/nlm/ncbi/ngs-java/2.9.0/ngs-java-2.9.0.pom
[error] 	at sbt.internal.librarymanagement.IvyActions$.resolveAndRetrieve(IvyActions.scala:332)
[error] 	at sbt.internal.librarymanagement.IvyActions$.$anonfun$updateEither$1(IvyActions.scala:208)
[error] 	at sbt.internal.librarymanagement.IvySbt$Module.$anonfun$withModule$1(Ivy.scala:239)
[error] 	at sbt.internal.librarymanagement.IvySbt.$anonfun$withIvy$1(Ivy.scala:204)
[error] 	at sbt.internal.librarymanagement.IvySbt.sbt$internal$librarymanagement$IvySbt$$action$1(Ivy.scala:70)
[error] 	at sbt.internal.librarymanagement.IvySbt$$anon$3.call(Ivy.scala:77)
[error] 	at xsbt.boot.Locks$GlobalLock.withChannel$1(Locks.scala:95)
[error] 	at xsbt.boot.Locks$GlobalLock.xsbt$boot$Locks$GlobalLock$$withChannelRetries$1(Locks.scala:80)
[error] 	at xsbt.boot.Locks$GlobalLock$$anonfun$withFileLock$1.apply(Locks.scala:99)
[error] 	at xsbt.boot.Using$.withResource(Using.scala:10)
[error] 	at xsbt.boot.Using$.apply(Using.scala:9)
[error] 	at xsbt.boot.Locks$GlobalLock.ignoringDeadlockAvoided(Locks.scala:60)
[error] 	at xsbt.boot.Locks$GlobalLock.withLock(Locks.scala:50)
[error] 	at xsbt.boot.Locks$.apply0(Locks.scala:31)
[error] 	at xsbt.boot.Locks$.apply(Locks.scala:28)
[error] 	at sbt.internal.librarymanagement.IvySbt.withDefaultLogger(Ivy.scala:77)
[error] 	at sbt.internal.librarymanagement.IvySbt.withIvy(Ivy.scala:199)
[error] 	at sbt.internal.librarymanagement.IvySbt.withIvy(Ivy.scala:196)
[error] 	at sbt.internal.librarymanagement.IvySbt$Module.withModule(Ivy.scala:238)
[error] 	at sbt.internal.librarymanagement.IvyActions$.updateEither(IvyActions.scala:193)
[error] 	at sbt.librarymanagement.ivy.IvyDependencyResolution.update(IvyDependencyResolution.scala:20)
[error] 	at sbt.librarymanagement.DependencyResolution.update(DependencyResolution.scala:56)
[error] 	at sbt.internal.LibraryManagement$.resolve$1(LibraryManagement.scala:45)
[error] 	at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$12(LibraryManagement.scala:93)
[error] 	at sbt.util.Tracked$.$anonfun$lastOutput$1(Tracked.scala:68)
[error] 	at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$19(LibraryManagement.scala:106)
[error] 	at scala.util.control.Exception$Catch.apply(Exception.scala:224)
[error] 	at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$11(LibraryManagement.scala:106)
[error] 	at sbt.internal.LibraryManagement$.$anonfun$cachedUpdate$11$adapted(LibraryManagement.scala:89)
[error] 	at sbt.util.Tracked$.$anonfun$inputChanged$1(Tracked.scala:149)
[error] 	at sbt.internal.LibraryManagement$.cachedUpdate(LibraryManagement.scala:120)
[error] 	at sbt.Classpaths$.$anonfun$updateTask$5(Defaults.scala:2561)
[error] 	at scala.Function1.$anonfun$compose$1(Function1.scala:44)
[error] 	at sbt.internal.util.$tilde$greater.$anonfun$$u2219$1(TypeFunctions.scala:40)
[error] 	at sbt.std.Transform$$anon$4.work(System.scala:67)
[error] 	at sbt.Execute.$anonfun$submit$2(Execute.scala:269)
[error] 	at sbt.internal.util.ErrorHandling$.wideConvert(ErrorHandling.scala:16)
[error] 	at sbt.Execute.work(Execute.scala:278)
[error] 	at sbt.Execute.$anonfun$submit$1(Execute.scala:269)
[error] 	at sbt.ConcurrentRestrictions$$anon$4.$anonfun$submitValid$1(ConcurrentRestrictions.scala:178)
[error] 	at sbt.CompletionService$$anon$2.call(CompletionService.scala:37)
[error] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
[error] 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[error] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[error] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[error] 	at java.lang.Thread.run(Thread.java:748)

bigvcf writer creates header only output

I merged some vcfs and produced a df with content like this:

[Row(contigName='1', start=1557, referenceAllele='A', genotypes=[Row(sampleId='7WV160085_DH0003', calls=[0, 1]), Row(sampleId='7WV160085_DH0008', calls=[0, 1]), Row(sampleId='7WV160085_DH0025', calls=[0, 0])], alternateAlleles=['T'], end=None, refAllele='A'),
Row(contigName='1', start=1734, referenceAllele='C', genotypes=[Row(sampleId='7WV160085_DH0003', calls=[0, 1]), Row(sampleId='7WV160085_DH0008', calls=[0, 0]), Row(sampleId='7WV160085_DH0025', calls=[0, 0])], alternateAlleles=['CA'], end=None, refAllele='C')]

Looks ok to me, but trying to save it as bigvcf results in an empty file with partial headers.

.write.format("bigvcf").save("data_test/gvcf_sample_start_less2000")

.write.format("vcf").save("data_test/gvcf_sample_start_less2000") throws a error about empty partitions

In lieu of Cannoli/ADAM/Avocado?

Hello,

It seems the major differences between Glow and Cannoli/ADAM/Avocado are the machine-learning-oriented features with MLflow plus Delta Lake.

So will Glow make the Cannoli/ADAM/Avocado pipeline obsolete down the road?

Thanks!

Glow transform functions drop supplemental columns.

Currently, transform functions like 'normalize_variants' drop data in columns that are not part of the core VCF/BGEN schema by setting the values to null. Only columns of strings having the "INFO_" prefix are maintained if they are not part of the core schema. It would be ideal if Glow transformers could maintain these supplemental columns from the input DataFrame.

Issue with Glow liftOver function

Overall, the Glow liftOver function seems to work as expected, but there is an issue when the REF/ALT alleles are swapped. Below is an example:

VCF entry:
22 16849933 . C T . . .

Output from running Picard LiftoverVcf (see details below):
22 16369271 . T C . PASS OriginalContig=22;OriginalStart=16849933;SwappedAlleles

Output after running the Glow liftOver:

val df_liftOver = Glow.transform("lift_over_variants", df_b37, Map("reference_file" -> referencePath, "chain_file" -> chain)).drop("genotypes")

[screenshot: Glow liftOver output for the site above]

It appears that the Glow liftOver fails due to a “MismatchedRefAllele”.

However, the desired behavior would be as follows:

  • Liftover the site to the new genomic coordinates (22:16369271)
  • Swap the reference and alternate alleles
  • Set INFO_SwappedAlleles to ‘true’

Memory utilization of variant normalization

The current implementation of normalize_variants is rather memory intensive, presumably because of the interfacing with HTSJDK. Optimizing this function to use more pure Spark machinery could yield a nice performance improvement.

Support for TileDB-VCF?

Hi, is it possible to use GenomicsDB with glow?
Somewhere I saw glow using VariantContext, so it should not be a problem to use GenomicsDB as well, should it?

Add htslib >= 1.10 as requirement

Hi, with htslib v1.9 I get the following error:

Py4JJavaError: An error occurred while calling o192.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 64, localhost, executor driver): java.lang.NoSuchMethodError: htsjdk.variant.vcf.VCFUtils.parseVcfDouble(Ljava/lang/String;)D
	at io.projectglow.vcf.VariantContextToInternalRowConverter$$anonfun$io$projectglow$vcf$VariantContextToInternalRowConverter$$updateInfoField$1$$anonfun$liftedTree2$1$1.apply(VariantContextToInternalRowConverter.scala:394)
	at io.projectglow.vcf.VariantContextToInternalRowConverter$$anonfun$io$projectglow$vcf$VariantContextToInternalRowConverter$$updateInfoField$1$$anonfun$liftedTree2$1$1.apply(VariantContextToInternalRowConverter.scala:394)
	at io.projectglow.vcf.VariantContextToInternalRowConverter.io$projectglow$vcf$VariantContextToInternalRowConverter$$makeArray(VariantContextToInternalRowConverter.scala:273)
	at io.projectglow.vcf.VariantContextToInternalRowConverter.io$projectglow$vcf$VariantContextToInternalRowConverter$$getAttributeArray(VariantContextToInternalRowConverter.scala:304)
	at io.projectglow.vcf.VariantContextToInternalRowConverter$$anonfun$io$projectglow$vcf$VariantContextToInternalRowConverter$$updateInfoField$1.liftedTree2$1(VariantContextToInternalRowConverter.scala:394)
	at io.projectglow.vcf.VariantContextToInternalRowConverter$$anonfun$io$projectglow$vcf$VariantContextToInternalRowConverter$$updateInfoField$1.apply$mcV$sp(VariantContextToInternalRowConverter.scala:390)
	at io.projectglow.vcf.VariantContextToInternalRowConverter.tryWithWarning(VariantContextToInternalRowConverter.scala:167)
	at io.projectglow.vcf.VariantContextToInternalRowConverter.io$projectglow$vcf$VariantContextToInternalRowConverter$$updateInfoField(VariantContextToInternalRowConverter.scala:364)
 [...]

Updating to htslib v1.10.2 solved this problem.

v0.6.x Release Date

Will you guys be pushing a v0.6 release soon? It would be great to be able to use the WGR functions from the examples (at least reshape_for_gwas and transform_loco) that don't exist in the latest packaged release on PyPI (v0.5.0).

[Feature request] Make pipe transformer lazy

It would be useful if one could run the pipe transformer in a lazy fashion by specifying the expected output schema beforehand.

The text output already has a pre-defined output schema (data) and could therefore be run lazily in each case.

Struct transformations, add_struct_fields example fails.

Running the add_struct_fields example from the documentation seems to work at first, but the example fails when calling .show().

Setting up the initial DataFrame:

from pyspark.sql import Row
row_one = Row(Row(str_col='foo', int_col=1, bool_col=True))
row_two = Row(Row(str_col='bar', int_col=2, bool_col=False))
base_df = spark.createDataFrame([row_one, row_two], schema=['base_col'])
base_df.selectExpr("subset_struct(base_col, 'str_col', 'bool_col') as subsetted_col").show()
added_col_df = base_df.selectExpr("add_struct_fields(base_col, 'float_col', 3.14, 'rev_str_col', \
    reverse(base_col.str_col)) as added_col")
added_col_df.show()

Calling .show() gives me the following error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-37-1fafef989fbe> in <module>
----> 1 added_col_df.show()

/opt/modules/i12g/anaconda/3-2018.12/envs/glow-test/lib/python3.7/site-packages/pyspark/sql/dataframe.py in show(self, n, truncate, vertical)
    378         """
    379         if isinstance(truncate, bool) and truncate:
--> 380             print(self._jdf.showString(n, 20, vertical))
    381         else:
    382             print(self._jdf.showString(n, int(truncate), vertical))

/opt/modules/i12g/anaconda/3-2018.12/envs/glow-test/lib/python3.7/site-packages/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

/opt/modules/i12g/anaconda/3-2018.12/envs/glow-test/lib/python3.7/site-packages/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/opt/modules/i12g/anaconda/3-2018.12/envs/glow-test/lib/python3.7/site-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o1212.showString.
: java.lang.UnsupportedOperationException: Cannot evaluate expression: addstructfields(input[0, struct<bool_col:boolean,int_col:bigint,str_col:string>, true], float_col, 3.14, rev_str_col, reverse(input[0, struct<bool_col:boolean,int_col:bigint,str_col:string>, true].str_col))
	at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:261)
	at io.projectglow.sql.expressions.AddStructFields.doGenCode(glueExpressions.scala:52)
	at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
	at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
	at org.apache.spark.sql.catalyst.expressions.Cast.doGenCode(Cast.scala:660)
	at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:108)
	at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:105)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:105)
	at org.apache.spark.sql.catalyst.expressions.Cast.genCode(Cast.scala:655)
	at org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:155)
	at org.apache.spark.sql.execution.ProjectExec$$anonfun$6.apply(basicPhysicalOperators.scala:60)
	at org.apache.spark.sql.execution.ProjectExec$$anonfun$6.apply(basicPhysicalOperators.scala:60)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
	at scala.collection.AbstractTraversable.map(Traversable.scala:104)
	at org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:60)
	at org.apache.spark.sql.execution.CodegenSupport$class.constructDoConsumeFunction(WholeStageCodegenExec.scala:216)
	at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:187)
	at org.apache.spark.sql.execution.InputAdapter.consume(WholeStageCodegenExec.scala:374)
	at org.apache.spark.sql.execution.InputAdapter.doProduce(WholeStageCodegenExec.scala:403)
	at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:90)
	at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:85)
	at org.apache.spark.sql.execution.InputAdapter.produce(WholeStageCodegenExec.scala:374)
	at org.apache.spark.sql.execution.ProjectExec.doProduce(basicPhysicalOperators.scala:45)
	at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:90)
	at org.apache.spark.sql.execution.CodegenSupport$$anonfun$produce$1.apply(WholeStageCodegenExec.scala:85)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.CodegenSupport$class.produce(WholeStageCodegenExec.scala:85)
	at org.apache.spark.sql.execution.ProjectExec.produce(basicPhysicalOperators.scala:35)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doCodeGen(WholeStageCodegenExec.scala:544)
	at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:598)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:339)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
	at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2550)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2764)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:254)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:291)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:745)

Glow documentation

Consider updating the documentation: move the "Variant Normalization" section from "Tertiary Analysis" to "Variant Data Manipulation"

Source in package org.bdgenomics.adam.rdd not necessary

VCFMetadataLoader does not appear to use ADAMContext, so I'm not sure why it needs to be in the org.bdgenomics.adam.rdd package

package org.bdgenomics.adam.rdd

import htsjdk.variant.vcf.VCFHeader
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.seqdoop.hadoop_bam.util.{VCFHeaderReader, WrapSeekable}

object VCFMetadataLoader {

  // Currently private in ADAM (ADAMContext).
  def readVcfHeader(config: Configuration, pathName: String): VCFHeader = {
    val is = WrapSeekable.openPath(config, new Path(pathName))
    val header = VCFHeaderReader.readHeaderFrom(is)
    is.close()
    header
  }
}

Generalizing variant normalization beyond VCF/BGEN representations

The variant normalization function is currently implemented against the VCF- and BGEN-derived schema, which uses contigName, start, end, ref, and alt columns. However, there are common use cases for normalizing site lists that are not directly derived from a VCF or BGEN. For example, it would be reasonable to normalize sites from an external resource such as dbSNP or ExAC, which often use a CPRA (chromosome, position, ref, alt) representation. It would be good for normalization to work with these other representations as well.
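Until that lands, one can rename a CPRA table into the VCF-derived column layout and run the existing transform. A minimal sketch, assuming a hypothetical sites.csv with chrom/pos/ref/alt columns and a local reference FASTA:

from pyspark.sql import functions as F
import glow

# Map a hypothetical CPRA site list onto the VCF-derived schema before
# normalizing. The column names chrom/pos/ref/alt and sites.csv are assumptions.
sites = spark.read.csv("sites.csv", header=True, inferSchema=True)
vcf_like = sites.select(
    F.col("chrom").alias("contigName"),
    (F.col("pos") - 1).alias("start"),                  # VCF pos is 1-based; start is 0-based
    (F.col("pos") - 1 + F.length("ref")).alias("end"),  # end = start + len(ref)
    F.col("ref").alias("referenceAllele"),
    F.array(F.col("alt")).alias("alternateAlleles"))
normalized = glow.transform(
    "normalize_variants", vcf_like,
    reference_genome_path="/path/to/reference.fa")      # hypothetical path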

Hello, there is a problem with starting pyspark

The current environment:
Ubuntu 18.04 / Spark 2.4.5 / Python 3.7.5 / Scala 2.11.12 / OpenJDK 64-Bit Server VM, Java 1.8.0_252
spark-shell can use the Glow component normally, but in pyspark, calling glow.register(spark) throws the following exception:
File "<stdin>", line 1, in <module> File "/usr/local/lib/python3.7/dist-packages/glow/glow.py", line 53, in register session._jvm.io.projectglow.Glow.register(session._jsparkSession) File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ File "/usr/local/spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:io.projectglow.Glow.register. : java.lang.NoSuchMethodError: scala.Function1.$init$(Lscala/Function1;)V at io.projectglow.sql.GlowSQLExtensions.<init>(SqlExtensionProvider.scala:39) at io.projectglow.GlowBase.register(Glow.scala:39) at io.projectglow.Glow.register(Glow.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244) at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357) at py4j.Gateway.invoke(Gateway.java:282) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748)
I opened an issue yesterday, and your answer suggested it may be a Maven or Scala version problem. However, I used the Spark binary distribution, and the Scala version matches the scala-2.11.12 version written in your build.sbt. I sincerely hope someone can help me solve this problem.
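For what it's worth, java.lang.NoSuchMethodError: scala.Function1.$init$ usually indicates a Scala binary mismatch, e.g. a Glow jar built for Scala 2.12 loaded into a Scala 2.11 build of Spark. A minimal sketch of pinning a matching artifact when building the session from a script (the coordinates below are an example, not a prescription):

from pyspark.sql import SparkSession
import glow

# Pull in the Scala-2.11 Glow artifact so the JVM-side classes match a
# Scala 2.11 build of Spark. Version and coordinates are illustrative only.
spark = (SparkSession.builder
         .config("spark.jars.packages", "io.projectglow:glow_2.11:0.3.0")
         .getOrCreate())
glow.register(spark)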

PlinkFileFormat fails with error "Cannot seek to a negative offset"

Hi, I was working with some 1kg data and saw that the PLINK reader is hitting an issue whose cause I can't quite pinpoint. I'm hoping it might be more immediately obvious to someone more familiar with that code.

I'm fetching a VCF dataset and converting it to PLINK as follows:

wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20100804/ALL.2of4intersection.20100804.genotypes.vcf.gz
plink --vcf ALL.2of4intersection.20100804.genotypes.vcf.gz \
  --make-bed --out ALL.2of4intersection.20100804.genotypes

When reading this via Glow:

spark
  .read.format("plink")
  .load("ALL.2of4intersection.20100804.genotypes.bed")
  .count

I encounter this (full stack trace):

20/01/16 13:14:54 INFO PlinkFileFormat$: hlsUsage:[plinkRead,{"includeSampleIds":true,"mergeFidIid":true}]
20/01/16 13:15:02 INFO PlinkFileFormat: Reading variants [0, 849480]
20/01/16 13:15:02 INFO PlinkFileFormat: Reading variants [849480, 1698959]
20/01/16 13:15:02 INFO PlinkFileFormat: Reading variants [1698959, 2548438]
20/01/16 13:15:06 INFO PlinkFileFormat: Reading variants [2548438, 3397918]
20/01/16 13:15:06 INFO PlinkFileFormat: Reading variants [3397918, 4247397]
20/01/16 13:15:06 INFO PlinkFileFormat: Reading variants [4247397, 5096876]
20/01/16 13:15:07 INFO PlinkFileFormat: Reading variants [5096876, 5946356]
20/01/16 13:15:07 INFO PlinkFileFormat: Reading variants [5946356, 6795835]
20/01/16 13:15:07 INFO PlinkFileFormat: Reading variants [6795835, 7645314]
20/01/16 13:15:07 INFO PlinkFileFormat: Reading variants [8494793, 9344273]
20/01/16 13:15:07 INFO PlinkFileFormat: Reading variants [9344273, 10193752]
20/01/16 13:15:07 INFO PlinkFileFormat: Reading variants [7645314, 8494793]
20/01/16 13:15:07 INFO PlinkFileFormat: Reading variants [10193752, 11043231]
20/01/16 13:15:08 INFO PlinkFileFormat: Reading variants [11043231, 11892711]
20/01/16 13:15:08 INFO PlinkFileFormat: Reading variants [11892711, 12742190]
20/01/16 13:15:08 INFO PlinkFileFormat: Reading variants [12742190, 13591669]
20/01/16 13:15:17 INFO PlinkFileFormat: Reading variants [25484379, 52671826]
20/01/16 13:15:17 ERROR Executor: Exception in task 30.0 in stage 0.0 (TID 30)
java.io.EOFException: Cannot seek to a negative offset
	at org.apache.hadoop.fs.FSInputChecker.seek(FSInputChecker.java:399)
	at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:62)
	at org.apache.hadoop.fs.ChecksumFileSystem$FSDataBoundedInputStream.seek(ChecksumFileSystem.java:330)
	at io.projectglow.plink.PlinkFileFormat$$anonfun$buildReader$1.apply(PlinkFileFormat.scala:118)
	at io.projectglow.plink.PlinkFileFormat$$anonfun$buildReader$1.apply(PlinkFileFormat.scala:90)
	at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:148)

My first thought was that this might be related to the fact that many of the rsids are missing (i.e. "."), but that doesn't seem to be the issue. Hail can read the data w/o problems and reports that ~10M of ~25M are missing. Hail also shows that the following transformation, which adds unique ids, produces a dataset with all-unique rsids; that dataset still fails with the same error above when read by Glow:

plink --bfile ALL.2of4intersection.20100804.genotypes \
--set-missing-var-ids @:#[b37]\$1,\$2 --make-bed \
--out ALL.2of4intersection.20100804.genotypes.uniqids

It is worth noting though that after removing variants and samples with low call rates and MAF via PLINK, I can read the result successfully w/ Glow:

spark.read
  .format("plink")
  // This dataset contains only ~5.8M variants out of the original ~25M
  .load("ALL.2of4intersection.20100804.genotypes.uniqids.filtered.bed")
  .drop("genotypes").show(5)

+----------+-----------------+--------+-----+-----+---------------+----------------+
|contigName|            names|position|start|  end|referenceAllele|alternateAlleles|
+----------+-----------------+--------+-----+-----+---------------+----------------+
|         1|     [rs58108140]|     0.0|10582|10583|              G|             [A]|
|         1|[1:11508[b37]A,G]|     0.0|11507|11508|              G|             [A]|
|         1|[1:15820[b37]G,T]|     0.0|15819|15820|              G|             [T]|
|         1|[1:16257[b37]C,G]|     0.0|16256|16257|              G|             [C]|
|         1|[1:16378[b37]C,T]|     0.0|16377|16378|              C|             [T]|
+----------+-----------------+--------+-----+-----+---------------+----------------+

Does anybody have thoughts on what else could cause this?

Environment: Spark 2.4.4, Glow 0.2.0
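One hedged observation: the jump to "Reading variants [25484379, 52671826]" in the log, together with the negative seek, is consistent with 32-bit overflow in the byte-offset arithmetic for large .bed files. A sketch of the arithmetic (the 629-sample count is an assumption for this dataset, used only for illustration):

def bed_offset_as_int32(variant_idx: int, n_samples: int) -> int:
    """Per-variant byte offset into a PLINK .bed file, wrapped to a signed
    32-bit integer to illustrate how an overflow could occur."""
    block_size = (n_samples + 3) // 4        # one byte packs four genotypes
    offset = 3 + variant_idx * block_size    # 3 magic bytes at the file start
    return (offset + 2**31) % 2**32 - 2**31  # simulate 32-bit Int arithmetic

# Assumed figures: ~629 samples (158-byte blocks) and a variant index near
# the end of a ~25M-variant file.
print(bed_offset_as_int32(25_484_379, 629))  # negative => "Cannot seek to a negative offset"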

plink bed schema not compatible with VCF when using pipe transformer

When reading a PLINK binary ped (bed) file into a Spark DataFrame and passing it through the pipe transformer as VCF, I get the error "java.lang.IllegalArgumentException: Illegal base [0] seen in the allele", which suggests a compatibility issue between the PLINK bed and VCF schemas.

df_bed = spark.read.format("plink").load("dbfs:/tmp/[email protected]/plink/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.bed")

scriptFile = r"""#!/bin/sh
set -e
cat -
"""
cmd = json.dumps(["bash", "-c", scriptFile])

df_vcf = glow.transform('pipe', 
                                   df_bed.drop("position"), 
                                   cmd=cmd, 
                                   input_formatter='vcf',
                                   in_vcf_header='infer',
                                   output_formatter='vcf')

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 28.0 failed 4 times, most recent failure: Lost task 0.3 in stage 28.0 (TID 136, 10.126.239.152, executor 0): java.lang.IllegalArgumentException: Illegal base [0] seen in the allele
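If the illegal base is PLINK's missing-allele code, filtering those rows out before piping may sidestep the failure. A hedged sketch (.bim files use "0" for a missing allele, which is not a legal VCF base):

from pyspark.sql import functions as F

# Drop variants whose alleles carry PLINK's missing-allele code "0"
# before handing the DataFrame to the VCF input formatter.
clean_bed = df_bed.where(
    (F.col("referenceAllele") != "0") &
    ~F.array_contains(F.col("alternateAlleles"), "0"))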

Hello, there was an exception during the installation

When running the glow.register(spark) command, the following exception occurs:
File "<stdin>", line 1, in <module> File "/usr/local/lib/python3.7/dist-packages/glow/glow.py", line 53, in register session._jvm.io.projectglow.Glow.register(session._jsparkSession) File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__ File "/usr/local/spark/python/pyspark/sql/utils.py", line 63, in deco return f(*a, **kw) File "/usr/local/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling z:io.projectglow.Glow.register. : java.lang.NoSuchMethodError: scala.Function1.$init$(Lscala/Function1;)V
Running environment: Python 3.7 (installed via pip), Spark 2.4.5

I don't know which part of the configuration is wrong and keeps it from starting; I hope someone can help solve this.

Missing steps in sample QC demo notebook

Hi, in the sample-qc-demo notebook, are there steps missing between the "Build map of sampleId -> QC stats" cell and the "Show per-sample QC metrics" cell?

I don't think display(qc.selectExpr("explode(qc) as (sampleId, per_sample_qc)").selectExpr("sampleId", "expand_struct(per_sample_qc)")) is possible otherwise (at least not with Glow 0.2.0 and Spark 2.4.4).

Variant Liftover exception

When running the following lines to perform a variant liftover:

liftover_expr = "lift_over_coordinates(contigName, start, end, chain_file, .99)"
input_with_lifted_df = input_df.select('contigName', 'start', 'end').withColumn('lifted', expr(liftover_expr))

I am getting this error. Any suggestions?

Traceback (most recent call last):
  File "/share/pkg.7/spark/2.4.3/install/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/share/pkg.7/spark/2.4.3/install/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o58.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve '`chain_file`' given input columns: [contigName, start, end]; line 1 pos 46;
'Project [contigName#0, start#1L, end#2L, 'lift_over_coordinates(contigName#0, start#1L, end#2L, 'chain_file, 0.99) AS lifted#733]
+- Project [contigName#0, start#1L, end#2L]
   +- Relation[contigName#0,start#1L,end#2L,names#3,referenceAllele#4,alternateAlleles#5,qual#6,filters#7,splitFromMultiAllelic#8,INFO_NumTechs#9,INFO_HG003_GT#10,INFO_DistMin#11,INFO_TRgt10k#12,INFO_MendelianError#13,INFO_segdup#14,INFO_ClusterMaxEditDist#15,INFO_HG2count#16,INFO_BREAKSIMLENGTH#17,INFO_PBcalls#18,INFO_PBexactcalls#19,INFO_REPTYPE#20,INFO_Illexactcalls#21,INFO_TenXcalls#22,INFO_Illcalls#23,... 28 more fields] vcf

This line runs fine just before it, so the chain file itself is not the problem:
output_df = input_df.withColumn('lifted', glow.lift_over_coordinates('contigName', 'start','end', chain_file, 0.99))

The head of the chain file looks like this:

chain 20851231461 1 249250621 + 10000 249240621 chr1 248956422 + 10000 248946422 2
167376  50041   80290
40302   253649  288020
1044699 1       2
3716    0       3
1134    4       18
3377    0       1
7258    1       1
27      1       1

It looks like it is complaining about the first line.
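Actually, the AnalysisException points at the SQL expression rather than the chain file's contents: inside the expr() string, chain_file is parsed as a column reference. A hedged fix is to splice the Python variable in as a SQL string literal:

# chain_file is the same Python variable that works in the
# glow.lift_over_coordinates call above.
liftover_expr = f"lift_over_coordinates(contigName, start, end, '{chain_file}', .99)"
input_with_lifted_df = (input_df
                        .select('contigName', 'start', 'end')
                        .withColumn('lifted', expr(liftover_expr)))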

NoClassDefFoundError: org/apache/logging/log4j/core/appender/AbstractAppender

I encountered a problem when using commit 1700969 as an assembly jar. This works with io.projectglow:glow_2.11:0.3.0.
Am I missing some classes in my assembly jar?

Please just ignore this if you do not want to support bleeding-edge problems 😃

tests/test_glow.py::test_read_vcf 20/04/17 00:44:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
                                                                                
tests/test_glow.py:21 (test_read_vcf)
glow_session = <pyspark.sql.session.SparkSession object at 0x7f4dfad579d0>

    def test_read_vcf(glow_session: pyspark.sql.SparkSession):
        path = "resources/glow_example.vcf"
        # ref_genome =
    
        df = (
            glow_session
                .read
                .format('vcf')
                .option("includeSampleIds", True)
                .load(path)
        )
    
        # %%
    
        # normalized_variants_df = glow.transform(
        #     "normalize_variants",
        #     df,
        #     reference_genome_path=ref_genome,
        #     mode='split_and_normalize'
        # )
        # df = glow_session.read.format("parquet").load("resources/example.vcf")
>       displayHead(df)

tests/test_glow.py:43: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
firefly/utils.py:18: in displayHead
    return df.limit(nrows).toPandas()
/opt/anaconda/envs/default/lib/python3.7/site-packages/pyspark/sql/dataframe.py:2226: in toPandas
    pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
/opt/anaconda/envs/default/lib/python3.7/site-packages/pyspark/sql/dataframe.py:563: in collect
    sock_info = self._jdf.collectToPython()
/opt/anaconda/envs/default/lib/python3.7/site-packages/py4j/java_gateway.py:1286: in __call__
    answer, self.gateway_client, self.target_id, self.name)
/opt/anaconda/envs/default/lib/python3.7/site-packages/pyspark/sql/utils.py:98: in deco
    return f(*a, **kw)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

answer = 'xro80'
gateway_client = <py4j.java_gateway.GatewayClient object at 0x7f4dfad383d0>
target_id = 'o73', name = 'collectToPython'

    def get_return_value(answer, gateway_client, target_id=None, name=None):
        """Converts an answer received from the Java gateway into a Python object.
    
        For example, string representation of integers are converted to Python
        integer, string representation of objects are converted to JavaObject
        instances, etc.
    
        :param answer: the string returned by the Java gateway
        :param gateway_client: the gateway client used to communicate with the Java
            Gateway. Only necessary if the answer is a reference (e.g., object,
            list, map)
        :param target_id: the name of the object from which the answer comes from
            (e.g., *object1* in `object1.hello()`). Optional.
        :param name: the name of the member from which the answer comes from
            (e.g., *hello* in `object1.hello()`). Optional.
        """
        if is_error(answer)[0]:
            if len(answer) > 1:
                type = answer[1]
                value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
                if answer[1] == REFERENCE_TYPE:
                    raise Py4JJavaError(
                        "An error occurred while calling {0}{1}{2}.\n".
>                       format(target_id, ".", name), value)
E                   py4j.protocol.Py4JJavaError: An error occurred while calling o73.collectToPython.
E                   : java.lang.NoClassDefFoundError: org/apache/logging/log4j/core/appender/AbstractAppender
E                   	at java.lang.ClassLoader.defineClass1(Native Method)
E                   	at java.lang.ClassLoader.defineClass(ClassLoader.java:763)
E                   	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
E                   	at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
E                   	at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
E                   	at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
E                   	at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
E                   	at java.security.AccessController.doPrivileged(Native Method)
E                   	at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
E                   	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
E                   	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
E                   	at java.lang.ClassLoader.loadClass(ClassLoader.java:411)
E                   	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
E                   	at org.apache.logging.log4j.core.config.plugins.util.PluginRegistry.decodeCacheFiles(PluginRegistry.java:181)
E                   	at org.apache.logging.log4j.core.config.plugins.util.PluginRegistry.loadFromMainClassLoader(PluginRegistry.java:119)
E                   	at org.apache.logging.log4j.core.config.plugins.util.PluginManager.collectPlugins(PluginManager.java:132)
E                   	at org.apache.logging.log4j.core.pattern.PatternParser.<init>(PatternParser.java:129)
E                   	at org.apache.logging.log4j.core.pattern.PatternParser.<init>(PatternParser.java:110)
E                   	at org.apache.logging.log4j.core.layout.PatternLayout.createPatternParser(PatternLayout.java:216)
E                   	at org.apache.logging.log4j.core.layout.PatternLayout.<init>(PatternLayout.java:128)
E                   	at org.apache.logging.log4j.core.layout.PatternLayout.<init>(PatternLayout.java:56)
E                   	at org.apache.logging.log4j.core.layout.PatternLayout$Builder.build(PatternLayout.java:376)
E                   	at org.apache.logging.log4j.core.config.DefaultConfiguration.<init>(DefaultConfiguration.java:58)
E                   	at org.apache.logging.log4j.core.LoggerContext.<init>(LoggerContext.java:70)
E                   	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.locateContext(ClassLoaderContextSelector.java:145)
E                   	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:70)
E                   	at org.apache.logging.log4j.core.selector.ClassLoaderContextSelector.getContext(ClassLoaderContextSelector.java:57)
E                   	at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:142)
E                   	at org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:41)
E                   	at org.apache.logging.log4j.LogManager.getContext(LogManager.java:175)
E                   	at org.apache.logging.log4j.LogManager.getLogger(LogManager.java:426)
E                   	at org.broadinstitute.hellbender.utils.Utils.<clinit>(Utils.java:77)
E                   	at org.broadinstitute.hellbender.utils.SimpleInterval.validatePositions(SimpleInterval.java:61)
E                   	at org.broadinstitute.hellbender.utils.SimpleInterval.<init>(SimpleInterval.java:37)
E                   	at io.projectglow.vcf.FilterInterval.liftedTree1$1(TabixIndexHelper.scala:93)
E                   	at io.projectglow.vcf.FilterInterval.<init>(TabixIndexHelper.scala:92)
E                   	at io.projectglow.vcf.TabixIndexHelper$.parseFilter(TabixIndexHelper.scala:186)
E                   	at io.projectglow.vcf.TabixIndexHelper$.makeFilteredInterval(TabixIndexHelper.scala:402)
E                   	at io.projectglow.vcf.VCFFileFormat.buildReader(VCFFileFormat.scala:147)
E                   	at org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues(FileFormat.scala:130)
E                   	at org.apache.spark.sql.execution.datasources.FileFormat.buildReaderWithPartitionValues$(FileFormat.scala:121)
E                   	at org.apache.spark.sql.execution.datasources.TextBasedFileFormat.buildReaderWithPartitionValues(FileFormat.scala:170)
E                   	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD$lzycompute(DataSourceScanExec.scala:394)
E                   	at org.apache.spark.sql.execution.FileSourceScanExec.inputRDD(DataSourceScanExec.scala:385)
E                   	at org.apache.spark.sql.execution.FileSourceScanExec.doExecute(DataSourceScanExec.scala:468)
E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:173)
E                   	at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:211)
E                   	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
E                   	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:208)
E                   	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:169)
E                   	at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:313)
E                   	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:405)
E                   	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
E                   	at org.apache.spark.sql.Dataset.$anonfun$collectToPython$1(Dataset.scala:3310)
E                   	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3472)
E                   	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100)
E                   	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
E                   	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87)
E                   	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3468)
E                   	at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:3307)
E                   	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
E                   	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
E                   	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
E                   	at java.lang.reflect.Method.invoke(Method.java:498)
E                   	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
E                   	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
E                   	at py4j.Gateway.invoke(Gateway.java:282)
E                   	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
E                   	at py4j.commands.CallCommand.execute(CallCommand.java:79)
E                   	at py4j.GatewayConnection.run(GatewayConnection.java:238)
E                   	at java.lang.Thread.run(Thread.java:748)
E                   Caused by: java.lang.ClassNotFoundException: org.apache.logging.log4j.core.appender.AbstractAppender
E                   	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
E                   	at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
E                   	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
E                   	at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
E                   	... 71 more

/opt/anaconda/envs/default/lib/python3.7/site-packages/py4j/protocol.py:328: Py4JJavaError

Assertion failed

split_multiallelic should convert arrays to scalars

Hi, would it be possible to have split_multiallelic also adjust the table schema?

I know this implies some work to also adjust every other method that assumes arrays due to the multi-allelic representation.
However, this would be very helpful for reading VCF files directly in the correct schema.
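As a stopgap, once variants have been split, each per-alt array holds exactly one element, so selected INFO fields can be flattened by hand. A minimal sketch, assuming split_df is the already-split DataFrame and using two fields from the schema below:

from pyspark.sql import functions as F

# Flatten single-element per-alt arrays into scalars; extend the list of
# fields as needed for the schema at hand.
flat_df = (split_df
           .withColumn("INFO_AC", F.col("INFO_AC").getItem(0))
           .withColumn("INFO_AF", F.col("INFO_AF").getItem(0)))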

For example, when I read the gnomAD 2.1.1 VCF I get the following schema:

root
 |-- contigName: string (nullable = true)
 |-- start: long (nullable = true)
 |-- end: long (nullable = true)
 |-- names: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- referenceAllele: string (nullable = true)
 |-- alternateAlleles: array (nullable = false)
 |    |-- element: string (containsNull = true)
 |-- qual: double (nullable = true)
 |-- filters: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- splitFromMultiAllelic: boolean (nullable = false)
 |-- INFO_non_neuro_AC_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_nhomalt_oth_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AN_nfe_female: integer (nullable = true)
 |-- INFO_controls_AN_amr_female: integer (nullable = true)
 |-- INFO_non_topmed_AN_fin_female: integer (nullable = true)
 |-- INFO_controls_AN_eas_male: integer (nullable = true)
 |-- INFO_controls_AN_nfe_onf: integer (nullable = true)
 |-- INFO_rf_positive_label: boolean (nullable = true)
 |-- INFO_controls_AN_fin: integer (nullable = true)
 |-- INFO_non_neuro_nhomalt_eas_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AN_nfe_est: integer (nullable = true)
 |-- INFO_variant_type: string (nullable = true)
 |-- INFO_controls_AN_eas: integer (nullable = true)
 |-- INFO_AF_oth_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AC_asj_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AN_amr_male: integer (nullable = true)
 |-- INFO_nhomalt_asj: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AF_nfe_nwe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_nhomalt_asj_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_nhomalt_oth: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AF_asj_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AN_nfe_seu: integer (nullable = true)
 |-- INFO_controls_nhomalt_afr_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_faf95: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_nhomalt_amr_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AF_afr_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AF_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_nhomalt_nfe_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AN_afr_male: integer (nullable = true)
 |-- INFO_non_neuro_AN_fin_female: integer (nullable = true)
 |-- INFO_controls_AF_amr_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AF_amr_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AF_amr_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_nhomalt_amr: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_nhomalt_afr_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_nhomalt_nfe_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AF_fin: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_ab_hist_alt_bin_freq: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- INFO_AN_raw: integer (nullable = true)
 |-- INFO_faf95_nfe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AF: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_nhomalt_oth: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_pab_max: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_nhomalt_nfe_onf: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_nhomalt_eas: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AF_fin_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_nhomalt_asj_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_nhomalt_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN_amr_female: integer (nullable = true)
 |-- INFO_non_neuro_faf99_nfe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AC_oth_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_asj: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN_fin: integer (nullable = true)
 |-- INFO_AC_nfe_onf: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AC_amr: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_raw: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_nfe_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_asj_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_eas_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_nhomalt_popmax: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_dp_hist_alt_bin_freq: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- INFO_AN_male: integer (nullable = true)
 |-- INFO_nhomalt_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_nhomalt_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_nhomalt_afr: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AC_nfe: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AC_eas_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_eas_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AC: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AC: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN_amr: integer (nullable = true)
 |-- INFO_AC_afr: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_gq_hist_alt_bin_freq: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- INFO_non_topmed_AC_oth: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AF_asj_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_VQSR_culprit: string (nullable = true)
 |-- INFO_non_topmed_AC_fin_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_nfe_nwe: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AN_nfe: integer (nullable = true)
 |-- INFO_vep: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- INFO_non_topmed_AF_nfe_est: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_faf95_amr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_segdup: boolean (nullable = true)
 |-- INFO_allele_type: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- INFO_non_neuro_AF_nfe_nwe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_nhomalt_amr_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AF_nfe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AF_eas: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AC_asj_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN_nfe_female: integer (nullable = true)
 |-- INFO_non_topmed_AN_oth_male: integer (nullable = true)
 |-- INFO_non_topmed_AC_nfe_onf: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_nhomalt_fin_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_nhomalt_raw: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_nhomalt_afr_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AF_raw: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_nhomalt_amr: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AF_nfe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_nhomalt_raw: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_afr_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AF_afr_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AN_nfe_est: integer (nullable = true)
 |-- INFO_non_topmed_nhomalt_nfe_onf: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AF_eas_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AC_raw: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_nhomalt_fin_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_nhomalt_nfe_nwe: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AC_oth_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN_eas: integer (nullable = true)
 |-- INFO_non_topmed_AN_male: integer (nullable = true)
 |-- INFO_non_topmed_nhomalt_nfe_seu: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AF_eas_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_faf95_afr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_SOR: double (nullable = true)
 |-- INFO_controls_AC: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AN_afr: integer (nullable = true)
 |-- INFO_controls_AN_asj: integer (nullable = true)
 |-- INFO_non_topmed_AF_popmax: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AN_eas: integer (nullable = true)
 |-- INFO_controls_AN_male: integer (nullable = true)
 |-- INFO_non_neuro_AN_asj_female: integer (nullable = true)
 |-- INFO_controls_AN_amr_male: integer (nullable = true)
 |-- INFO_non_topmed_AC_nfe_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_nhomalt_nfe_est: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_age_hist_het_n_smaller: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AF_nfe_onf: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_nhomalt_amr_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AN_popmax: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_nhomalt: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AN_eas_male: integer (nullable = true)
 |-- INFO_non_topmed_AN_nfe_male: integer (nullable = true)
 |-- INFO_non_neuro_AF_nfe_seu: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_faf95_nfe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AF_nfe_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AN_oth_male: integer (nullable = true)
 |-- INFO_AF_nfe_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AC_eas: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_eas: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_fin_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_nhomalt_oth_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN_oth_male: integer (nullable = true)
 |-- INFO_non_neuro_AN_nfe_nwe: integer (nullable = true)
 |-- INFO_controls_nhomalt_nfe_nwe: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_nhomalt_nfe_seu: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_nhomalt_amr: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AN_popmax: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_nhomalt_fin_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AF_nfe_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_nhomalt_nfe_est: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_faf99_amr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AC_nfe_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_nhomalt_nfe_onf: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN_fin_male: integer (nullable = true)
 |-- INFO_non_topmed_AF_asj_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AC_nfe_nwe: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_ReadPosRankSum: double (nullable = true)
 |-- INFO_non_topmed_AF_oth_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AC_amr_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AN_fin: integer (nullable = true)
 |-- INFO_controls_nhomalt_oth_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_nhomalt_nfe_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_nhomalt_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AN: integer (nullable = true)
 |-- INFO_AC_popmax: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AF_afr_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AF_afr_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AF_nfe_est: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AC_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AC_fin_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AF_oth_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AN_female: integer (nullable = true)
 |-- INFO_non_neuro_AN_eas_male: integer (nullable = true)
 |-- INFO_nhomalt_asj_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AN_oth: integer (nullable = true)
 |-- INFO_AF_asj: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AN_afr_female: integer (nullable = true)
 |-- INFO_non_topmed_faf99_nfe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AF_afr_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AN_oth_female: integer (nullable = true)
 |-- INFO_non_topmed_nhomalt_eas_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AF_oth_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_nhomalt_nfe_est: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_popmax: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- INFO_has_star: boolean (nullable = true)
 |-- INFO_non_neuro_AC_nfe_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN_afr: integer (nullable = true)
 |-- INFO_non_topmed_nhomalt_fin_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_nhomalt_fin: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_faf99_eas: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_rf_train: boolean (nullable = true)
 |-- INFO_controls_AN_oth: integer (nullable = true)
 |-- INFO_nhomalt_oth_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_nhomalt_fin_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AF_nfe_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_nhomalt_afr_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AF_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_nonpar: boolean (nullable = true)
 |-- INFO_decoy: boolean (nullable = true)
 |-- INFO_AF_nfe_nwe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AC_nfe: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AF_oth_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AN_fin_female: integer (nullable = true)
 |-- INFO_AC_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_afr: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_asj_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_nhomalt_eas_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AF_nfe_nwe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AF_popmax: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AN_nfe_est: integer (nullable = true)
 |-- INFO_non_neuro_AF_nfe_onf: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_nhomalt_popmax: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AF_amr_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_faf95_amr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AN_asj: integer (nullable = true)
 |-- INFO_age_hist_hom_n_larger: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_nhomalt_fin: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_faf99_afr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AF_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AF_amr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_nhomalt_oth_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN_nfe: integer (nullable = true)
 |-- INFO_non_neuro_faf99_afr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AF_nfe_onf: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AN_nfe_seu: integer (nullable = true)
 |-- INFO_non_neuro_nhomalt_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_nhomalt_asj_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AC_oth_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AN_fin_male: integer (nullable = true)
 |-- INFO_non_topmed_AF_afr_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_nhomalt_asj_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_faf99_eas: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_InbreedingCoeff: double (nullable = true)
 |-- INFO_controls_AF_nfe_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AF: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AF_nfe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AN_afr_female: integer (nullable = true)
 |-- INFO_age_hist_hom_n_smaller: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN_raw: integer (nullable = true)
 |-- INFO_nhomalt_nfe_onf: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AF_popmax: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_faf99_eas: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AN_nfe_female: integer (nullable = true)
 |-- INFO_nhomalt_nfe_est: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AF_raw: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_faf95_amr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_nhomalt_raw: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AF_asj: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_nhomalt_eas: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN: integer (nullable = true)
 |-- INFO_controls_AC_amr: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_nhomalt_fin_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_amr_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_faf95_afr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AF_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AN_asj_female: integer (nullable = true)
 |-- INFO_controls_nhomalt_afr_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_nhomalt_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_nhomalt_popmax: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AC_amr_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_faf99_afr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AC_amr_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AN_oth_female: integer (nullable = true)
 |-- INFO_nhomalt_fin_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_afr_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AN: integer (nullable = true)
 |-- INFO_controls_nhomalt_afr: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_rf_tp_probability: double (nullable = true)
 |-- INFO_non_neuro_AN_amr_male: integer (nullable = true)
 |-- INFO_AC_eas_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AF_oth: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AC_fin: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AF_eas: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_nhomalt_amr_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AN_amr: integer (nullable = true)
 |-- INFO_non_topmed_popmax: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- INFO_AF_eas_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AN: integer (nullable = true)
 |-- INFO_nhomalt_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_nfe_est: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AC_afr: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_afr: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_nhomalt_oth_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_amr_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AF_eas_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_FS: double (nullable = true)
 |-- INFO_non_topmed_faf99_eas: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_nhomalt_eas_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AN_amr_male: integer (nullable = true)
 |-- INFO_AN_nfe_nwe: integer (nullable = true)
 |-- INFO_non_topmed_faf95_eas: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AF_oth: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AN_fin_male: integer (nullable = true)
 |-- INFO_non_topmed_AC_asj: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_rf_negative_label: boolean (nullable = true)
 |-- INFO_non_topmed_AC_raw: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_nhomalt_afr: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AF_eas: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_nhomalt_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_nhomalt_fin: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN_nfe_male: integer (nullable = true)
 |-- INFO_controls_nhomalt_amr_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AF_fin_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_nhomalt_asj: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_nhomalt_asj_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AF_afr_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_DP: integer (nullable = true)
 |-- INFO_non_neuro_AF_eas_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AF_amr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_nhomalt_afr_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_gq_hist_all_bin_freq: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- INFO_controls_AN_afr_male: integer (nullable = true)
 |-- INFO_AN_nfe_seu: integer (nullable = true)
 |-- INFO_controls_AF_nfe_est: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AF_asj_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AC_fin_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_VQSLOD: double (nullable = true)
 |-- INFO_AF_fin: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AN_nfe_male: integer (nullable = true)
 |-- INFO_AN_afr_female: integer (nullable = true)
 |-- INFO_non_topmed_AC_eas: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_fin: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AF_fin_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AC_popmax: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AF_nfe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AF_amr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AC_fin: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_nhomalt: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AN_female: integer (nullable = true)
 |-- INFO_non_topmed_AC_eas_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN_nfe_onf: integer (nullable = true)
 |-- INFO_non_topmed_nhomalt_nfe_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN_oth_female: integer (nullable = true)
 |-- INFO_non_topmed_AN_eas: integer (nullable = true)
 |-- INFO_non_neuro_AN_nfe_seu: integer (nullable = true)
 |-- INFO_controls_nhomalt_amr: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_ClippingRankSum: double (nullable = true)
 |-- INFO_faf99_amr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AF_oth: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AN_oth: integer (nullable = true)
 |-- INFO_AF_nfe_est: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_faf95_eas: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AF_amr_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AN_raw: integer (nullable = true)
 |-- INFO_AN_afr_male: integer (nullable = true)
 |-- INFO_controls_faf99: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AN_eas_male: integer (nullable = true)
 |-- INFO_non_neuro_faf95_amr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_nhomalt_eas_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_nhomalt_amr_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_nfe: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_nhomalt_asj: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_nfe_nwe: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_nhomalt: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_faf99: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_nhomalt_nfe_nwe: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AC_asj_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AF_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_faf95_afr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AF_afr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AC_oth: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AF_eas_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AF_amr_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AC_raw: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_popmax: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- INFO_controls_AF_popmax: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_nhomalt_afr_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AC_nfe_seu: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AF_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_age_hist_het_bin_freq: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- INFO_non_topmed_AC_nfe_nwe: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_VQSR_NEGATIVE_TRAIN_SITE: boolean (nullable = true)
 |-- INFO_nhomalt_asj_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_amr_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_nhomalt_fin_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_nhomalt_eas_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AF_afr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_nhomalt: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AN_nfe_onf: integer (nullable = true)
 |-- INFO_non_neuro_AF_fin: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AF_oth_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_nhomalt_amr_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_nfe_est: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AF_nfe_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AN_asj_female: integer (nullable = true)
 |-- INFO_non_neuro_AC_afr_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_nhomalt_eas: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AF_oth_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AC_asj: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_faf95: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_faf99: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_nhomalt_afr: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AC_eas_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_faf99_amr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AC_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_faf99_afr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_nhomalt_nfe_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AN_asj_male: integer (nullable = true)
 |-- INFO_controls_nhomalt_nfe_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_nhomalt_afr_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AF_fin_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_VQSR_POSITIVE_TRAIN_SITE: boolean (nullable = true)
 |-- INFO_controls_AF_fin_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_rf_label: string (nullable = true)
 |-- INFO_non_neuro_faf95_nfe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AC_fin_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN_nfe_est: integer (nullable = true)
 |-- INFO_non_topmed_AC_amr_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AF_asj_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_BaseQRankSum: double (nullable = true)
 |-- INFO_non_neuro_faf99: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_nhomalt_oth_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_nhomalt_oth_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_lcr: boolean (nullable = true)
 |-- INFO_AN_asj: integer (nullable = true)
 |-- INFO_non_neuro_AC_amr: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AC_afr_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AF_fin_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AN_eas_female: integer (nullable = true)
 |-- INFO_non_neuro_AN_oth: integer (nullable = true)
 |-- INFO_AF_nfe_onf: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AF_asj_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AF: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AF_nfe_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_age_hist_het_n_larger: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AC_nfe: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN_afr_female: integer (nullable = true)
 |-- INFO_age_hist_hom_bin_freq: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- INFO_MQ: double (nullable = true)
 |-- INFO_AN_nfe_male: integer (nullable = true)
 |-- INFO_non_neuro_AC_nfe_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_faf99_nfe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_nhomalt_raw: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AN_amr_female: integer (nullable = true)
 |-- INFO_controls_AF_asj: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AF_afr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AN_eas_female: integer (nullable = true)
 |-- INFO_QD: double (nullable = true)
 |-- INFO_non_topmed_AF_raw: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AC_fin_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_nfe_seu: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_asj_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AC_nfe_seu: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_nhomalt_oth: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_amr_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AC_asj_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_nhomalt_nfe: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AF_afr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AN_female: integer (nullable = true)
 |-- INFO_non_neuro_AN_asj: integer (nullable = true)
 |-- INFO_controls_AC_popmax: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_nhomalt_nfe_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_faf95_eas: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AN_afr: integer (nullable = true)
 |-- INFO_dp_hist_alt_n_larger: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_nhomalt_nfe: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_nhomalt_fin: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AC_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AF_oth_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_nhomalt_nfe_seu: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_transmitted_singleton: boolean (nullable = true)
 |-- INFO_non_topmed_AC_afr_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AF_oth: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_faf95_nfe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AC_asj_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_nhomalt_eas_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AN_amr_female: integer (nullable = true)
 |-- INFO_non_topmed_nhomalt_nfe: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AF_nfe_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_was_mixed: boolean (nullable = true)
 |-- INFO_AN_fin_female: integer (nullable = true)
 |-- INFO_non_topmed_nhomalt_amr_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AN_asj_male: integer (nullable = true)
 |-- INFO_non_neuro_AF_fin_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AC_eas: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_nhomalt_nfe_seu: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AF_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_nhomalt_eas_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_faf95_afr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AF_asj_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AF_eas_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AC_afr_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AN_fin: integer (nullable = true)
 |-- INFO_non_neuro_AN_asj_male: integer (nullable = true)
 |-- INFO_non_topmed_AF_nfe_seu: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AF_asj_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_n_alt_alleles: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_nhomalt_nfe_nwe: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AN_nfe_nwe: integer (nullable = true)
 |-- INFO_controls_AC_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AN_asj_male: integer (nullable = true)
 |-- INFO_non_neuro_AN_afr_male: integer (nullable = true)
 |-- INFO_controls_AC_nfe_onf: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AC_amr: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_nfe_onf: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AN_female: integer (nullable = true)
 |-- INFO_non_neuro_nhomalt_asj_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AN_nfe: integer (nullable = true)
 |-- INFO_AF_nfe_seu: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_faf95: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AC_nfe_est: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_oth_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AF_amr_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_MQRankSum: double (nullable = true)
 |-- INFO_controls_AC_oth_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AC_nfe_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AN_nfe_onf: integer (nullable = true)
 |-- INFO_AF_afr_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_faf99_amr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_nhomalt_nfe: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_nhomalt_popmax: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AF_eas_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AN_popmax: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AN_nfe_female: integer (nullable = true)
 |-- INFO_non_topmed_AC_popmax: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_nhomalt_oth: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AN_eas_female: integer (nullable = true)
 |-- INFO_AC_asj: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AN_nfe: integer (nullable = true)
 |-- INFO_non_topmed_AN_nfe_nwe: integer (nullable = true)
 |-- INFO_non_neuro_faf95: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AF_eas: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_dp_hist_all_bin_freq: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- INFO_non_neuro_AN_popmax: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_nfe_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AC_oth: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_fin_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_nhomalt_eas: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_fin_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AN_amr: integer (nullable = true)
 |-- INFO_non_topmed_AC_nfe_est: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AC_afr_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AF_raw: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_dp_hist_all_n_larger: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AC_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_popmax: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- INFO_controls_AC_eas_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_faf99_nfe: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AF_female: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AC_fin: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AN_male: integer (nullable = true)
 |-- INFO_non_topmed_AN_oth_female: integer (nullable = true)
 |-- INFO_AC_oth_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_oth: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_neuro_AF: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_AC_afr_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AF_fin_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AF_asj: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AN_eas_female: integer (nullable = true)
 |-- INFO_non_neuro_nhomalt_asj: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_non_topmed_AF_fin: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_topmed_AN_afr: integer (nullable = true)
 |-- INFO_controls_AF_amr_male: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AF_nfe_seu: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AC_oth_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AN_amr: integer (nullable = true)
 |-- INFO_non_topmed_AN_asj_female: integer (nullable = true)
 |-- INFO_AN_fin_male: integer (nullable = true)
 |-- INFO_non_topmed_AF_amr: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_non_neuro_AC_eas_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_nhomalt_nfe_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AC_oth_male: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_AC_nfe_female: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_faf95_eas: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- INFO_controls_AC_nfe_seu: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- INFO_controls_AN_raw: integer (nullable = true)
 |-- INFO_AN_oth_male: integer (nullable = true)
 |-- INFO_OLD_MULTIALLELIC: string (nullable = true)
 |-- genotypes: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- sampleId: string (nullable = true)

With hundreds of INFO fields parsed as array columns like this, it's quite a hassle to convert every column to a scalar by hand.
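
If the sites have already been split to one alternate allele per row (for example with Glow's split_multiallelics transformer), the array columns can be flattened programmatically instead of by hand. The sketch below is one possible approach, not a Glow API: it assumes df is the DataFrame whose schema is printed above, walks the schema, and replaces each array-typed INFO_* column with its first element.

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType

def flatten_info_arrays(df):
    # Replace every array-typed INFO_* column with its first element.
    # element_at is 1-indexed and returns null for empty arrays, so
    # missing values stay missing instead of raising an error.
    exprs = []
    for field in df.schema.fields:
        if field.name.startswith("INFO_") and isinstance(field.dataType, ArrayType):
            exprs.append(F.element_at(F.col(field.name), 1).alias(field.name))
        else:
            exprs.append(F.col(field.name))
    return df.select(*exprs)

flat_df = flatten_info_arrays(df)  # df: the gnomAD DataFrame from above (assumed)
flat_df.printSchema()              # the INFO_* arrays are now scalars

Note that this is only safe after splitting multiallelic variants: on a multiallelic row each array holds one value per alternate allele, and keeping only the first element would silently drop data.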
