openai / blocksparse

Efficient GPU kernels for block-sparse matrix multiplication and convolution

Home Page: https://blog.openai.com/block-sparse-gpu-kernels/

License: MIT License

Makefile 0.34% Python 22.96% C++ 21.17% Cuda 40.95% CSS 11.33% Shell 0.01% C 3.25%

blocksparse's Introduction

Status: Active (under active development, breaking changes may occur)

Blocksparse

The blocksparse package contains TensorFlow Ops and corresponding GPU kernels for block-sparse matrix multiplication. Also included are related ops like edge bias, sparse weight norm and layer norm.

To learn more, see the launch post on the OpenAI blog.

Prerequisites

First, you need at least one Nvidia GPU. For best performance, we recommend using a Pascal or Maxwell generation GPU -- this is the full list of features by GPU type:

GPU Family	BSMatMul-ASM	BSMatMul-CudaC	BSConv
Kepler	-	X	-
Maxwell	X (fastest)	X	X
Pascal	X (fastest)	X	X
Volta	-	X (fastest)	-

Note that BSMatMul-CudaC only supports feature_axis=0, while BSMatMul-ASM only supports feature_axis=1.
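
The support table and the feature_axis note above can be encoded as a small lookup, which may help when a script has to pick a feature_axis for the GPU it runs on. This is only a sketch: KERNEL_SUPPORT, KERNEL_FEATURE_AXIS and usable_feature_axes are hypothetical names and not part of blocksparse.

```python
# Hypothetical helper, not part of blocksparse: encodes the support table above.
KERNEL_SUPPORT = {
    "Kepler":  {"BSMatMul-CudaC"},
    "Maxwell": {"BSMatMul-ASM", "BSMatMul-CudaC", "BSConv"},
    "Pascal":  {"BSMatMul-ASM", "BSMatMul-CudaC", "BSConv"},
    "Volta":   {"BSMatMul-CudaC"},
}

# Per the note above: CudaC kernels need feature_axis=0, ASM kernels feature_axis=1.
KERNEL_FEATURE_AXIS = {"BSMatMul-CudaC": 0, "BSMatMul-ASM": 1}


def usable_feature_axes(gpu_family):
    """Return the sorted feature_axis values usable for matmul on a GPU family."""
    kernels = KERNEL_SUPPORT.get(gpu_family, set())
    return sorted(KERNEL_FEATURE_AXIS[k] for k in kernels if k in KERNEL_FEATURE_AXIS)


print(usable_feature_axes("Kepler"))  # [0]
print(usable_feature_axes("Pascal"))  # [0, 1]
```

Families that are not listed in the table return an empty list rather than a guess.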

Additionally, you need:

A working Linux installation (we run Ubuntu 16.04) with the Nvidia drivers for your GPU.
CUDA 8 (in /usr/local/cuda)
Python 3.5 or newer, or 2.7 or newer
TensorFlow 1.4.0 or newer, with GPU support (e.g. pip install tensorflow-gpu)
CUDA 9 and Volta will work if you update the build targets (-gencode=arch=compute_70,code=sm_70) and also build TensorFlow from source.
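
The build-target flag mentioned above follows a regular pattern, so it can be generated from a compute capability. A minimal sketch, assuming you only need the flag string; gencode_flag is a hypothetical helper, and where the flag has to be placed in the Makefile is not covered here.

```python
def gencode_flag(major, minor):
    """Build an nvcc -gencode flag for one compute capability, e.g. (7, 0)."""
    cc = "{}{}".format(major, minor)
    return "-gencode=arch=compute_{0},code=sm_{0}".format(cc)


print(gencode_flag(7, 0))  # -gencode=arch=compute_70,code=sm_70
```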

Installation

pip install blocksparse

Usage

This example performs a block-sparse matrix multiplication:

from blocksparse.matmul import BlocksparseMatMul
import tensorflow as tf
import numpy as np

hidden_size = 4096
block_size = 32
minibatch_size = 64

# Create a (random) sparsity pattern
sparsity = np.random.randint(2, size=(hidden_size//block_size,hidden_size//block_size))

# Initialize the sparse matrix multiplication object
bsmm = BlocksparseMatMul(sparsity, block_size=block_size)

# Input to graph
x = tf.placeholder(tf.float32, shape=[None, hidden_size])

# Initialize block-sparse weights
w = tf.get_variable("w", bsmm.w_shape, dtype=tf.float32)

# Block-sparse matrix multiplication
y = bsmm(x, w)

# Run
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
result = sess.run([y], feed_dict = {x: np.ones((minibatch_size,hidden_size), dtype='float32')})
print(result)

For a more involved example using block-sparse ops to train a language model, see examples/.
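
To sanity-check results from the usage example, it can help to compare against a dense NumPy computation. The sketch below is not part of blocksparse. It assumes that layout[c, k] == 1 links input block c to output block k, and that the weight blocks are stored in row-major order of the nonzero layout entries; neither assumption is confirmed by this README, and dense_reference is a hypothetical name.

```python
import numpy as np


def dense_reference(x, w_blocks, layout, block_size):
    """Dense equivalent of y = bsmm(x, w) for the (N, C) activation layout.

    ASSUMPTIONS: layout is indexed (c, k) and w_blocks holds one tile per
    nonzero layout entry, in row-major order of those entries.
    """
    n_c, n_k = layout.shape
    dense_w = np.zeros((n_c * block_size, n_k * block_size), dtype=w_blocks.dtype)
    coords = zip(*np.nonzero(layout))  # row-major (c, k) pairs
    for block, (c, k) in zip(w_blocks, coords):
        dense_w[c * block_size:(c + 1) * block_size,
                k * block_size:(k + 1) * block_size] = block
    return np.dot(x, dense_w)


layout = np.array([[1, 0], [1, 1]])
block_size = 4
w_blocks = np.ones((int(layout.sum()), block_size, block_size), dtype=np.float32)
x = np.ones((3, layout.shape[0] * block_size), dtype=np.float32)
y = dense_reference(x, w_blocks, layout, block_size)
print(y.shape)  # (3, 8)
```

With all-ones inputs, each output feature equals the number of input features connected to it, which makes mismatches easy to spot.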

Development

If you're interested in hacking on the ops and kernels, go ahead and build from source:

git clone git@github.com:openai/blocksparse.git
cd blocksparse

make compile
pip install dist/*.whl

# test it if you like
test/blocksparse_matmul_test.py
test/blocksparse_conv_test.py

If your CUDA is not in /usr/local/cuda or you have several versions, e.g. both /usr/local/cuda-8.0 and /usr/local/cuda-9.0, set CUDA_HOME to the base path to use when running make compile.
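
When the build is driven from a script, CUDA_HOME can be set in the environment passed to make. A minimal sketch, assuming the default path named above; build_env is a hypothetical helper, and the commented subprocess call shows one way it might be used.

```python
import os


def build_env(cuda_home=None, base_env=None):
    """Return a copy of the environment with CUDA_HOME set for `make compile`."""
    env = dict(os.environ if base_env is None else base_env)
    if cuda_home is not None:
        env["CUDA_HOME"] = cuda_home
    # Fall back to the default location mentioned in the prerequisites.
    env.setdefault("CUDA_HOME", "/usr/local/cuda")
    return env


env = build_env("/usr/local/cuda-9.0")
print(env["CUDA_HOME"])  # /usr/local/cuda-9.0
# To build: subprocess.check_call(["make", "compile"], env=env)
```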

API Documentation:

blocksparse.matmul

class BlocksparseMatMul(object)

    def __init__(self, layout, block_size=32, feature_axis=1)
    """
    layout: a 2d array of ones and zeros specifying the block layout
    block_size: values 32, 16, 8 supported
    feature_axis: when block_size is less than 32 memory access becomes far more efficient
                  with a (C,N) activation layout
    """

    # shape helpers for generating tensors (N=minibatch)
    self.w_shape
    def i_shape(self, N)
    def o_shape(self, N)

    # return the coordinates (c,k) in the layout that corresponds to a given block id
    def block_coord(self, block)

    # experimental ortho init
    def ortho_init(self)

    # in practice, identity_init + layernorm is all you need for initialization
    # with gpu=True the init is performed by kernel on the device
    def identity_init(self, gpu=False)

    # To implement weight normalization.  In practice, layernorm works much better.
    def l2_normalize(self, W, gain=None, epsilon=1e-6, dtype=np.float32)

    def __call__(self, I, W, dw_dtype=tf.float32)
    """
    Execute the op.  Note that the weight variable is independent from the bsmm object.
    This allows multiple weights to be tied to the same bsmm layout.

    dw_dtype: allows control over dw precision format.
    """

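The shape helpers above can be approximated in plain Python to see what a given layout implies. This is a sketch under assumptions, not the library code: it assumes the layout is indexed (c, k), that weights are stored as one (block_size, block_size) tile per nonzero entry, and that feature_axis=1 means (N, C) activations while feature_axis=0 means (C, N). bsmm_shapes is a hypothetical name.

```python
def bsmm_shapes(layout, block_size, feature_axis, n):
    """Sketch of the shapes implied by a block layout (hypothetical helper).

    ASSUMPTIONS: layout indexed (c, k); one (block_size, block_size) tile per
    nonzero entry; feature_axis=1 -> (N, C) activations, 0 -> (C, N).
    """
    n_blocks = sum(sum(row) for row in layout)
    c_dim = len(layout) * block_size
    k_dim = len(layout[0]) * block_size
    w_shape = (n_blocks, block_size, block_size)
    i_shape = (n, c_dim) if feature_axis == 1 else (c_dim, n)
    o_shape = (n, k_dim) if feature_axis == 1 else (k_dim, n)
    return w_shape, i_shape, o_shape


print(bsmm_shapes([[1, 0], [1, 0]], block_size=8, feature_axis=0, n=1))
# ((2, 8, 8), (16, 1), (16, 1))
```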

def group_param_grads(param_grad, group_size=8, cast32=True)
"""
param_grad: the tensorflow parameter gradient for a given bsmm weight variable (returned from tf.gradients)
group_size: desired group size, up to 8 supported

This causes the tf graph to be rewritten so that weight grad matmuls from different time steps
(and shared weights across time) are combined into a more efficient single matmul.
"""

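The idea behind grouping weight-gradient matmuls can be illustrated with dense matrices: summing one small matmul per time step gives the same result as a single matmul over the concatenated minibatch axis. The NumPy sketch below only demonstrates that identity; it is not how group_param_grads rewrites the graph, and all names in it are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)
steps, n, c, k = 4, 3, 5, 6
xs = [rng.randn(n, c) for _ in range(steps)]   # activations per time step
dys = [rng.randn(n, k) for _ in range(steps)]  # output gradients per time step

# One small weight-gradient matmul per time step, then a sum.
dw_separate = sum(np.dot(x.T, dy) for x, dy in zip(xs, dys))

# The same result from a single matmul over the concatenated minibatch axis.
dw_grouped = np.dot(np.concatenate(xs).T, np.concatenate(dys))

print(np.allclose(dw_separate, dw_grouped))  # True
```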

class SparseProj(object):
    def __init__(self, nhidden, nproj=None, proj_stride=None, block_size=32, gather_lut=None)
    """
    Experimental class to support dense-to-sparse and sparse-to-dense projections.
    Basically the same as the tensorflow ops but faster and supports alternate precision formats.
    They assume a unique 1 to 1 mapping so atomics need not be used on backward ops.
    """

    def gather(self, x)
    def scatter(self, x)
    def scatter_add(self, x, y)
    def scatter_mul(self, x, y)
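
The gather and scatter projections can be pictured with plain Python lists. This sketch assumes a lookup table of unique indices, which mirrors the 1 to 1 mapping mentioned in the docstring; the function names and the lut argument are illustrative and do not reproduce the SparseProj API.

```python
def gather(x, lut):
    """Pick x[i] for each index i in lut (dense-to-sparse projection)."""
    return [x[i] for i in lut]


def scatter(y, lut, size):
    """Place y[j] at position lut[j] in a zero vector (sparse-to-dense)."""
    if len(set(lut)) != len(lut):
        raise ValueError("lut must map to unique positions")
    out = [0.0] * size
    for j, i in enumerate(lut):
        out[i] = y[j]  # unique targets, so plain writes never collide
    return out


lut = [4, 0, 2]
x = [10.0, 11.0, 12.0, 13.0, 14.0]
g = gather(x, lut)
print(g)                         # [14.0, 10.0, 12.0]
print(scatter(g, lut, len(x)))   # [10.0, 0.0, 12.0, 0.0, 14.0]
```

Because every target position is written at most once, no accumulation is needed, which is the property that lets a GPU kernel avoid atomics.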

blocksparse.conv

class BlocksparseConv(object):
    def __init__(self, BCK, TRS, DHW, MPQ=None, strides=(1,1,1), dilates=(1,1,1), padding="SAME", edge_bias=False)
    """
    BCK: (                                             # block(B)/input(C)/output(K) feature dims
             ( (c0, c1, c2, ...), (k0, k1, k2, ...) ), # block 0 c,k are indices into C,K dims
             ( (c0, c1, c2, ...), (k0, k1, k2, ...) ), # block 1
             ( (c0, c1, c2, ...), (k0, k1, k2, ...) ), # block 2 ...
         )
    TRS: (T,R,S) or (R,S) or (S,)         - filter spatial size dims
    DHW: (D,H,W) or (H,W) or (W,)         - input image spatial size dims
    MPQ: (M,P,Q) or (P,Q) or (Q,) or None - output image spatial size dims (used for ambiguous dims in strided transpose conv)
    strides: (1,1,1) or (1,1) or (1,)
    dilates: (1,1,1) or (1,1) or (1,)
    padding: (1,1,1) or (1,1) or (1,) or "SAME" or "VALID"
    edge_bias: True/False
    """

    # shape helpers for setting up variables or test tensors
    def edge_bias_shape(self)
    def f_shape(self, block=None)
    def i_shape(self, N)
    def o_shape(self, N)

    # execute op passing in param variables and input
    def __call__(self, F, I, edge_bias=None):

    # for implementing weight norm
    def l2_normalize(self, F, gain=None, epsilon=1e-6, dtype=np.float32):

class BlocksparseDeconv(BlocksparseConv)
    def __init__(self, BCK, TRS, DHW, MPQ=None, strides=(1,1,1), dilates=(1,1,1), padding="SAME", edge_bias=False)
    """
    Deconvolution.  Same params as above.
    """

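One simple way to build a BCK argument is to split the input and output channels into contiguous groups, similar to a grouped convolution. The helper below is hypothetical and only constructs the nested tuples described in the docstring; whether blocks may overlap, and which block sizes the kernels accept, is not stated here and is left open.

```python
def make_bck(c_total, k_total, n_blocks):
    """Split C input and K output channels into n_blocks contiguous groups.

    Hypothetical helper: builds the nested-tuple BCK structure only and does
    not call blocksparse.
    """
    if c_total % n_blocks or k_total % n_blocks:
        raise ValueError("channel counts must be divisible by n_blocks")
    c_step, k_step = c_total // n_blocks, k_total // n_blocks
    return tuple(
        (tuple(range(b * c_step, (b + 1) * c_step)),
         tuple(range(b * k_step, (b + 1) * k_step)))
        for b in range(n_blocks)
    )


print(make_bck(4, 6, 2))
# (((0, 1), (0, 1, 2)), ((2, 3), (3, 4, 5)))
```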
def cwise_linear(x, a=None, b=None)
"""
In the NCHW tensor format, tensorflow is extremely slow at implementing simple broadcasting ops on the middle C dim.
This lets you do:
    y = a*x + b
    y = a*x
    y = x + b

Where a and b are of shape (1,C,1,1)
This is useful for ops like weight norm.
"""
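
For reference, the three forms listed for cwise_linear correspond to ordinary NumPy broadcasting over the channel dimension. The sketch below is a plain NumPy stand-in for checking values, not the GPU op; cwise_linear_reference is a hypothetical name.

```python
import numpy as np


def cwise_linear_reference(x, a=None, b=None):
    """NumPy reference for y = a*x + b with a, b of shape (1, C, 1, 1)."""
    y = x
    if a is not None:
        y = y * a
    if b is not None:
        y = y + b
    return y


x = np.ones((2, 3, 4, 4), dtype=np.float32)                # NCHW
a = np.arange(1, 4, dtype=np.float32).reshape(1, 3, 1, 1)  # per-channel scale
b = np.full((1, 3, 1, 1), 0.5, dtype=np.float32)           # per-channel shift
y = cwise_linear_reference(x, a, b)
print(y[0, :, 0, 0])  # [1.5 2.5 3.5]
```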

blocksparse.ew

# same as tf ops but generally more efficient and allow custom precision formats
def        add(x, y, name=None)
def   multiply(x, y, name=None)
def   subtract(x, y, name=None)
def     divide(x, y, name=None)
def    maximum(x, y, name=None)
def    minimum(x, y, name=None)

def   negative(x,    name=None)
def reciprocal(x,    name=None)
def     square(x,    name=None)
def       sqrt(x,    name=None)
def        exp(x,    name=None)
def        log(x,    name=None)
def    sigmoid(x,    name=None)
def       tanh(x,    name=None)
def       relu(x,    name=None)
def       elu (x, alpha=1.0, name=None)

# here args can be the 4 independent gate tensors or
# a single merged gate tensor (which gets split in 4 internally)
def fused_lstm_gates(c, *args, name=None)

def split4(x)
def concat4(x0, x1, x2, x3)
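
As a reminder of what a fused LSTM gate op computes, here is the standard LSTM cell update for scalar values. The gate order (input, forget, output, update) and the returned pair are assumptions made for illustration; this README does not say which order fused_lstm_gates expects or what exactly it returns.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_gates_reference(c, i, f, o, u):
    """Standard LSTM cell update for scalar gate pre-activations.

    ASSUMPTION: the gate order (input, forget, output, update) is chosen for
    illustration only.
    """
    c_next = sigmoid(f) * c + sigmoid(i) * math.tanh(u)
    h_next = sigmoid(o) * math.tanh(c_next)
    return c_next, h_next


c_next, h_next = lstm_gates_reference(c=1.0, i=0.0, f=0.0, o=0.0, u=0.0)
print(c_next, h_next)
```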

# A custom cast op to help explore novel precision formats
def float_cast(x, dtype, dx_dtype=None)

# a much faster (and non-deterministic) dropout op
# also supports novel precision formats
def dropout(x, keep_prob=0.8, mask=None)

# an op to be used in tf.gradients when adding together multiple contributions of a gradient.
# note that only 8 inputs are supported as you'd never want a single op to consume all possible inputs
# before it starts executing in the graph (and hence reducing the memory footprint)
def add_n8(xs, name=None)
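
When more than 8 gradient contributions have to be summed, one option is to add them in groups of at most 8 and then add the partial sums. The sketch below shows that grouping on plain numbers; add_n_chunked is a hypothetical helper, and in a TensorFlow graph each group would presumably be one add_n8 call.

```python
def add_n_chunked(xs, chunk=8):
    """Sum a list of values by repeatedly adding groups of at most `chunk`."""
    xs = list(xs)
    if not xs:
        raise ValueError("need at least one input")
    while len(xs) > 1:
        xs = [sum(xs[i:i + chunk]) for i in range(0, len(xs), chunk)]
    return xs[0]


print(add_n_chunked(range(1, 21)))  # 210
```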

blocksparse.norms

def layer_norm(x, g, b, axis=1, epsilon=1e-6, relu=False)
"""
Very fast layernorm to support both bsmm feature_axis activation layouts.
Also includes optional integrated relu (applied at the end)
"""

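A plain-Python reference for layer norm over a single feature vector can be useful for checking values. It assumes the usual definition (biased variance, epsilon added inside the square root), which this README does not spell out; layer_norm_reference is a hypothetical name.

```python
import math


def layer_norm_reference(x, g, b, epsilon=1e-6, relu=False):
    """Layer norm over one feature vector x with per-feature gain g and bias b."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance (assumed)
    rstd = 1.0 / math.sqrt(var + epsilon)
    y = [gi * (v - mean) * rstd + bi for v, gi, bi in zip(x, g, b)]
    # Optional relu applied at the very end, as described in the docstring.
    return [max(v, 0.0) for v in y] if relu else y


y = layer_norm_reference([1.0, 2.0, 3.0], g=[1.0] * 3, b=[0.0] * 3, relu=True)
print(y)
```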
# basic batch norm ops for the NCHW layout
def batch_norm(x, g, b, epsilon=1e-6)
def batch_norm_inference(x, g, b, m, v, epsilon=1e-6)

blocksparse's Issues

Issues in building from source

Configuration:
Operating System: Linux Ubuntu 16.04
Python version: 3.5.2
Tensorflow version: 1.12.0
Cuda version: 9.0
GPU: TITAN X (Pascal)

Command to Reproduce
make compile

Problem:
The build command fails with the following errors:

ptxas /tmp/tmpxft_00007d26_00000000-5_blocksparse_hgemm_cn_op_gpu.compute_70.ptx, line 252; error   : Illegal modifier '.m8n32k16' for instruction 'wmma.mma'
ptxas /tmp/tmpxft_00007d26_00000000-5_blocksparse_hgemm_cn_op_gpu.compute_70.ptx, line 713; error   : Illegal modifier '.m8n32k16' for instruction 'wmma.mma'
ptxas /tmp/tmpxft_00007d26_00000000-5_blocksparse_hgemm_cn_op_gpu.compute_70.ptx, line 1186; error   : Illegal modifier '.m8n32k16' for instruction 'wmma.mma'
ptxas /tmp/tmpxft_00007d26_00000000-5_blocksparse_hgemm_cn_op_gpu.compute_70.ptx, line 1661; error   : Illegal modifier '.m8n32k16' for instruction 'wmma.mma'
ptxas fatal   : Ptx assembly aborted due to errors
Makefile:106: recipe for target 'build/blocksparse_hgemm_cn_op_gpu.cu.o' failed
make: *** [build/blocksparse_hgemm_cn_op_gpu.cu.o] Error 255

pip install blocksparse fails too and results in #7
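
The .m8n32k16 shape for wmma.mma appears to have been added after CUDA 9.0 (in CUDA 9.1), which would be consistent with a 9.0 ptxas rejecting it. A minimal sketch for extracting the toolkit release from the text printed by nvcc --version; cuda_release is a hypothetical helper and the sample line is illustrative.

```python
import re


def cuda_release(nvcc_version_text):
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_version_text)
    if match is None:
        raise ValueError("no CUDA release found in nvcc output")
    return int(match.group(1)), int(match.group(2))


sample = "Cuda compilation tools, release 9.0, V9.0.176"  # illustrative line
release = cuda_release(sample)
print(release, release >= (9, 1))  # (9, 0) False
```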

blocksparse_ops.so: undefined symbol: __cudaPushCallConfiguration

Traceback (most recent call last):
File "example.py", line 1, in <module>
from blocksparse.matmul import BlocksparseMatMul
File "/home/shs/tensorflow/lib/python3.6/site-packages/blocksparse/__init__.py", line 3, in <module>
from blocksparse.utils import (
File "/home/shs/tensorflow/lib/python3.6/site-packages/blocksparse/utils.py", line 17, in <module>
_op_module = tf.load_op_library(os.path.join(data_files_path, 'blocksparse_ops.so'))
File "/home/shs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/shs/tensorflow/lib/python3.6/site-packages/blocksparse/blocksparse_ops.so: undefined symbol: __cudaPushCallConfiguration

My installation environment: Ubuntu 18.04, python 3.6, cuda 10.0, TensorFlow 1.13.1
Has anyone installed it successfully? If so, I would like to know your installation environment.

bug report - in matrix-vector multiplication

When multiplying a k-by-m block-sparse matrix with an m-by-n dense matrix, the result is a vector of zeros if n == 1.
I am using tf-1.12 with both CUDA 9.0 and 9.2 installed; I am not sure which one is used.

code:

from blocksparse.matmul import BlocksparseMatMul
import tensorflow as tf
import numpy as np

hidden_size = 16
block_size = 8
minibatch_size = 1
sparsity = np.array([[1, 0], [1, 0]])

# Create a (random) sparsity pattern
# sparsity = np.random.randint(2, size=(hidden_size//block_size,hidden_size//block_size))
# sparsity = np.zeros(shape=(4,4))
# sparsity[0,0] = 1
# sparsity[0,1] = 1
# sparsity[1,0] = 1
# sparsity[1,1] = 1

# Initialize the sparse matrix multiplication object
bsmm = BlocksparseMatMul(sparsity, block_size=block_size, feature_axis=0)

# Input to graph
x = tf.placeholder(tf.float32, shape=[hidden_size, None])
w = tf.placeholder(tf.float32, shape=bsmm.w_shape)

# x_data = np.random.randn(hidden_size, minibatch_size).astype(np.float32)
x_data = np.ones([hidden_size, minibatch_size], dtype='float32')
a,b,c = bsmm.w_shape
# w_data = np.random.randn(a,b,c).astype(np.float32)
w_data = np.ones(bsmm.w_shape, 'float32')

# Block-sparse matrix multiplication
y = bsmm(x, w)

# Run
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
y_ = sess.run([y], feed_dict={x:x_data, w:w_data})
print(y_[0])
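
For comparison, the result one would expect from this script with all-ones inputs can be worked out by hand. The sketch below assumes that layout[c][k] == 1 connects input block c to output block k, following the (c, k) convention in the API documentation; expected_output is a hypothetical helper.

```python
def expected_output(layout, block_size, n):
    """Expected y of shape (K, N) for all-ones x and w with feature_axis=0.

    ASSUMPTION: layout[c][k] == 1 connects input block c to output block k.
    """
    n_k = len(layout[0])
    y = []
    for k in range(n_k):
        # Each nonzero block in column k contributes block_size ones per output.
        total = float(block_size * sum(row[k] for row in layout))
        y.extend([[total] * n for _ in range(block_size)])
    return y


y = expected_output([[1, 0], [1, 0]], block_size=8, n=1)
print(y[0], y[8])  # [16.0] [0.0]
```

Under that assumption the first 8 outputs should be 16 and the remaining 8 should be 0, so an all-zero vector would indeed be wrong.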

Efficient block-sparse attention kernels of Sparse Transformer

CPU implementation?

Hi!

Are you aware of a CPU implementation for this technique? Trying to reimplement this paper: https://arxiv.org/pdf/1802.08435.pdf

Thanks!

XLA (TPU) Support

Hey there,

Was wondering if there are plans to support TPUs in the future via XLA kernels?

Has anyone correctly built from source and run `test/blocksparse_matmul_test.py`?

Could you please share your exact setup and how you did it?
I have been struggling to resolve this error since compiling blocksparse from source.

If you managed to get pip install to work, please share that as well.

tensorflow.python.framework.errors_impl.NotFoundError: /home/ruiwang/anaconda3/envs/spinningup/lib/python3.6/site-packages/blocksparse/blocksparse_ops.so: undefined symbol: _ZNK10tensorflow8OpKernel4nameB5cxx11Ev
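
The B5cxx11 fragment in the missing symbol is GCC's ABI tag for the C++11 std::string ABI. A common cause of this kind of error is that the op library was compiled with a different _GLIBCXX_USE_CXX11_ABI setting than the installed TensorFlow binary, although that is not confirmed for this report. A minimal sketch for spotting the tag in a symbol name; uses_cxx11_abi is a hypothetical helper.

```python
def uses_cxx11_abi(mangled_symbol):
    """Return True if a mangled C++ symbol carries the GCC `cxx11` ABI tag."""
    return "B5cxx11" in mangled_symbol


symbol = "_ZNK10tensorflow8OpKernel4nameB5cxx11Ev"
print(uses_cxx11_abi(symbol))  # True
```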

Feature request: Support Windows

The subject says it all. Would it be possible?

how to use the BlocksparseMatMul

When I run /example/simple.py, I get different results with the old and the new blocksparse. The shape of the result has changed.
y=bsmm(x,w)
x's shape = (64,4096)
and w's shape = (4096,4096)
The old blocksparse gives y's shape = (64,4096),
but now it is (4096,4096).
So I would like to know how to use BlocksparseMatMul after last month's update.

Conv Layer Documentation?

Thanks so much for this module!
How would a user decide the number and sizes of the blocks they're using for the BCK in the convolution? I've looked at the conv test.py and am still confused. I'd really appreciate a simple example, thank you!

no attribute for multi-processing

Do you support mpiexec? :-(
The op module has no attribute 'allreduce_nccl'.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
--sparcity matrix : 160 X 160
--sparcity matrix : 160 X 160
Traceback (most recent call last):
File "/storage/blocksparse/examples/transformer/enwik8_psy.py", line 498, in <module>
train_loss, train_op, gn, ns = model(X, Y, loss_scale, train=True)
File "/storage/blocksparse/examples/transformer/enwik8_psy.py", line 365, in model
loss = bs.allreduce(loss)
File "/usr/local/lib/python3.6/dist-packages/blocksparse/nccl.py", line 42, in allreduce
ret = _op_module.allreduce_nccl(x, op_num=op_counter, sync_size=sync_size, num_comms=num_comms, prereduce=prereduce, logfile=logfile, name=name)
AttributeError: module 'b4313feb600581ca30dcf30777c2167e' has no attribute 'allreduce_nccl'
Traceback (most recent call last):
File "/storage/blocksparse/examples/transformer/enwik8_psy.py", line 498, in <module>
train_loss, train_op, gn, ns = model(X, Y, loss_scale, train=True)
File "/storage/blocksparse/examples/transformer/enwik8_psy.py", line 365, in model
loss = bs.allreduce(loss)
File "/usr/local/lib/python3.6/dist-packages/blocksparse/nccl.py", line 42, in allreduce
ret = _op_module.allreduce_nccl(x, op_num=op_counter, sync_size=sync_size, num_comms=num_comms, prereduce=prereduce, logfile=logfile, name=name)
AttributeError: module 'b4313feb600581ca30dcf30777c2167e' has no attribute 'allreduce_nccl'

Colab Notebook to run blocksparse

I would like to understand this awesome work better!

I opened a Colab Notebook to test blocksparse:
https://colab.research.google.com/drive/1F7VofDAAXhwi46DX-HTmk1Hhq6XZB69g

But unfortunately I faced two problems:

1. When I run ./blocksparse/examples/transformer/enwik8.py, the loss goes to NaN:

Starting epoch 0
Not including 1 sequences
Number of minibatches this epoch: 8789
train iteration: 0, loss: 5.53922, bits per byte: 7.99140 ns:0.16947 gn:5.90084
train iteration: 200, loss: nan, bits per byte: nan ns:0.00000 gn:nan
train iteration: 400, loss: nan, bits per byte: nan ns:0.00000 gn:nan
train iteration: 600, loss: nan, bits per byte: nan ns:0.00000 gn:nan

2. When I run ./blocksparse/test/blocksparse_matmul_test.py

ERROR: testBlocksparseMatMul (__main__.BlocksparseMatMulTest)

Traceback (most recent call last):
File "./blocksparse/test/blocksparse_matmul_test.py", line 346, in testBlocksparseMatMul
y = bsmm(y, w2, dw_dtype=dtF, bench=repeat) # (bench and j==depth-1) (bench and j==0)
TypeError: __call__() got an unexpected keyword argument 'dw_dtype'

Ran 2 tests in 1.389s

FAILED (errors=1, skipped=1)

Can anyone help?

The position encoding of enwik8.py is not right.

According to the paper,

For text and audio, we used two-dimensional attention embeddings, where d_attn = 2 and the index corresponds to each position’s row and column index in a matrix of width equal to the stride.

This isn't reflected in 'p_emb' of enwik8.py.
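
The row and column indices described in the quoted passage can be computed directly from a sequence position. A minimal sketch of that mapping, assuming zero-based positions; position_to_row_col is a hypothetical helper and does not reproduce how enwik8.py builds its embeddings.

```python
def position_to_row_col(position, stride):
    """Map a sequence position to (row, column) in a matrix of width `stride`."""
    return divmod(position, stride)


print([position_to_row_col(p, stride=4) for p in (0, 3, 4, 9)])
# [(0, 0), (0, 3), (1, 0), (2, 1)]
```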

AttributeError: module '5fa89fc3154996733eabb433e18fa62f' has no attribute 'dw_matmul_large_n'

I installed python 3.6 and tensorflow 1.14. When I use blocksparse, it fails with this error:

File "/home/admin1/git/treeson/attention/onsets_frames_transcription/thumt/layers/sparse_attention.py", line 4, in <module>
from blocksparse import BlocksparseTransformer
File "/usr/local/lib/python3.6/dist-packages/blocksparse/__init__.py", line 15, in <module>
dw_matmul_large_n = _op_module.dw_matmul_large_n
AttributeError: module '5fa89fc3154996733eabb433e18fa62f' has no attribute 'dw_matmul_large_n'

What's wrong with it?

install from source

I am really interested in the new paper, but this library is so hard to install or to build from source. Any update for the new blocksparse?

Tesla K80 Compilation

Hi, when I try to run the test code on an Amazon EC2 instance (P2 instances have Nvidia Tesla K80's, which are Kepler architectures), it gives me the following error:

(manarprojenv) ubuntu@ip-172-31-47-48:~/scratch_manar$ python blocksparse_scripy.py
2018-04-21 07:30:01.823848: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2018-04-21 07:30:01.920509: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-04-21 07:30:01.920894: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1344] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
totalMemory: 11.17GiB freeMemory: 11.10GiB
2018-04-21 07:30:01.920929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1423] Adding visible gpu devices: 0
2018-04-21 07:30:02.235420: I tensorflow/core/common_runtime/gpu/gpu_device.cc:911] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-04-21 07:30:02.235492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:917] 0
2018-04-21 07:30:02.235501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:930] 0: N
2018-04-21 07:30:02.235863: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 10761 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
2018-04-21 07:30:02.251499: W tensorflow/core/framework/op_kernel.cc:1273] OP_REQUIRES failed at blocksparse_matmul_op.cc:208 : Internal: invalid resource handle
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/envs/manarprojenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1327, in _do_call
return fn(*args)
File "/home/ubuntu/anaconda3/envs/manarprojenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1312, in _run_fn
options, feed_dict, fetch_list, target_list, run_metadata)
File "/home/ubuntu/anaconda3/envs/manarprojenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1420, in _call_tf_sessionrun
status, run_metadata)
File "/home/ubuntu/anaconda3/envs/manarprojenv/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: invalid resource handle
[[Node: BlocksparseMatMul_000000 = BlocksparseMatmul[C=4096, K=4096, alpha=1, axis=1, bench=0, beta=0, blocks=8246, bshift=5, dtype_dw=DT_FLOAT, dtype_w=DT_FLOAT, dtype_x=DT_FLOAT, dtype_y=DT_FLOAT, locks=0, locks_dx=0, segments=128, segments_dx=128, shared=624, shared_dx=624, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_1, w/read, BlocksparseMatMul/fprop_lut/_3, BlocksparseMatMul/bprop_lut/_5, BlocksparseMatMul/updat_lut/_7)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "blocksparse_scripy.py", line 27, in <module>
result = sess.run([y], feed_dict = {x: np.ones((minibatch_size,hidden_size), dtype='float32')})
File "/home/ubuntu/anaconda3/envs/manarprojenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 905, in run
run_metadata_ptr)
File "/home/ubuntu/anaconda3/envs/manarprojenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1140, in _run
feed_dict_tensor, options, run_metadata)
File "/home/ubuntu/anaconda3/envs/manarprojenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
run_metadata)
File "/home/ubuntu/anaconda3/envs/manarprojenv/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: invalid resource handle
[[Node: BlocksparseMatMul_000000 = BlocksparseMatmul[C=4096, K=4096, alpha=1, axis=1, bench=0, beta=0, blocks=8246, bshift=5, dtype_dw=DT_FLOAT, dtype_w=DT_FLOAT, dtype_x=DT_FLOAT, dtype_y=DT_FLOAT, locks=0, locks_dx=0, segments=128, segments_dx=128, shared=624, shared_dx=624, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_1, w/read, BlocksparseMatMul/fprop_lut/_3, BlocksparseMatMul/bprop_lut/_5, BlocksparseMatMul/updat_lut/_7)]]

Caused by op 'BlocksparseMatMul_000000', defined at:
File "blocksparse_scripy.py", line 22, in <module>
y = bsmm(x, w)
File "/home/ubuntu/anaconda3/envs/manarprojenv/lib/python3.6/site-packages/blocksparse/matmul.py", line 383, in __call__
shared=self.fprop_shared, shared_dx=self.bprop_shared, bench=bench, name=name
File "<string>", line 650, in blocksparse_matmul
File "/home/ubuntu/anaconda3/envs/manarprojenv/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 328, in apply_op
op_type_name, name, **keywords)
File "/home/ubuntu/anaconda3/envs/manarprojenv/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/ubuntu/anaconda3/envs/manarprojenv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3290, in create_op
op_def=op_def)
File "/home/ubuntu/anaconda3/envs/manarprojenv/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1654, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InternalError (see above for traceback): invalid resource handle
[[Node: BlocksparseMatMul_000000 = BlocksparseMatmul[C=4096, K=4096, alpha=1, axis=1, bench=0, beta=0, blocks=8246, bshift=5, dtype_dw=DT_FLOAT, dtype_w=DT_FLOAT, dtype_x=DT_FLOAT, dtype_y=DT_FLOAT, locks=0, locks_dx=0, segments=128, segments_dx=128, shared=624, shared_dx=624, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_1, w/read, BlocksparseMatMul/fprop_lut/_3, BlocksparseMatMul/bprop_lut/_5, BlocksparseMatMul/updat_lut/_7)]]

(manarprojenv) ubuntu@ip-172-31-47-48:~/scratch_manar$

I understand that this probably is an issue with the flags in the installation, but I'm not sure what flags I should use at all. I've tried following online and adding correct gencodes: http://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/
but nothing seems to work for the Tesla K80. Can you give a set of flags that work for older cards? Should I also change the major, minor = 3,7 as well?

Thanks so much!

fp16 error on GeForce RTX 2080 Ti

Hi, I'm trying to run BlocksparseMatMul with feature_axis=1 on GeForce RTX 2080 Ti that supports fp16, but I get:

Gated blocksparse matmul currently only supported on fp16 tensorcores. [[node BlocksparseMatMul_000000 (defined at <string>:2247)

I installed the package using pip.
I use CUDA 10 and tensorflow 1.13.

Could you help?

DataLossError (see above for traceback): Checksum does not match while restoring the model

I trained a model with fp16, recompute and multiple processes, and saved it with 'saver.save'. But when I restore the model, it raises tensorflow.python.framework.errors_impl.DataLossError: Checksum does not match: stored 1987019600 vs. calculated on the restored bytes 906132536.

Block sparsity on input features

Thanks for the great effort!

I wonder if this library only supports block sparsity on weights? Or also the input features?

For example, can Sparse Convs operate over dense weight but (block) spatially sparse input feature maps?

tensorflow.python.framework.errors_impl.NotFoundError: libtensorflow_framework.so: cannot open shared object file: No such file or directory

I ran the sparse attention code on a P100, but it fails.

System information

OS Platform and Distribution: ubuntu 18.04
TensorFlow installed from: pip install
TensorFlow version: 1.14.0(gpu)
Python version: 3.6.5
GCC: 7.4.0
CUDA version: 10.0
GPU model and memory: P100

Here's the traceback

Traceback (most recent call last):
  File "/home/zyh/sparse_attention-master/attention.py", line 4, in <module>
    from blocksparse import BlocksparseTransformer
  File "/root/anaconda3/lib/python3.6/site-packages/blocksparse/__init__.py", line 3, in <module>
    from blocksparse.utils import (
  File "/root/anaconda3/lib/python3.6/site-packages/blocksparse/utils.py", line 16, in <module>
    _op_module = tf.load_op_library(os.path.join(data_files_path, 'blocksparse_ops.so'))
  File "/root/anaconda3/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: libtensorflow_framework.so: cannot open shared object file: No such file or directory

I searched for libtensorflow_framework.so with find . -name libtensorflow_framework.so, but it does not exist. I then found libtensorflow_framework.so.1 in /root/anaconda3/lib/python3.6/site-packages/tensorflow/, so I copied libtensorflow_framework.so.1 to libtensorflow_framework.so and appended its directory to LD_LIBRARY_PATH with export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:"path/to/your/libtensorflow".
It still does not work.

Please help.
Thanks in advance
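
To see which variants of the framework library an installation actually ships, the TensorFlow package directory can be listed for matching file names. A minimal stdlib sketch; find_framework_libs is a hypothetical helper and the path is the one from the traceback above, so it will differ on other machines.

```python
import glob
import os


def find_framework_libs(tf_dir):
    """List libtensorflow_framework shared objects found directly in tf_dir."""
    pattern = os.path.join(tf_dir, "libtensorflow_framework.so*")
    return sorted(os.path.basename(p) for p in glob.glob(pattern))


# Example path taken from the report above; adjust to the local install.
print(find_framework_libs("/root/anaconda3/lib/python3.6/site-packages/tensorflow"))
```

An empty list means no matching file sits directly in that directory.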

Is there a plan for a pytorch wrapper?

simple example not working

I am using cuda-8.0, tensorflow 1.4.1 on ubuntu 16.04 and when trying to run the example, I get the following:

This is running on a gtx 960m, when running on a gtx 1060 it runs fine, is the 960m supported?

(ml) hyz@hyz-XPS-15-9550:~/prog/blocksparse/examples$ python simple.py
2017-12-08 11:06:57.447347: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2017-12-08 11:06:57.560860: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2017-12-08 11:06:57.561241: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 960M major: 5 minor: 0 memoryClockRate(GHz): 1.0975
pciBusID: 0000:01:00.0
totalMemory: 1.96GiB freeMemory: 1.55GiB
2017-12-08 11:06:57.561274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 960M, pci bus id: 0000:01:00.0, compute capability: 5.0)
2017-12-08 11:06:59.663068: W tensorflow/core/framework/op_kernel.cc:1192] Internal: invalid resource handle
Traceback (most recent call last):
File "/home/hyz/ml/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/home/hyz/ml/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/home/hyz/ml/lib/python3.5/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InternalError: invalid resource handle
[[Node: BlocksparseMatMul_000000 = BlocksparseMatmul[C=4096, K=4096, alpha=1, axis=1, bench=0, beta=0, blocks=8168, bshift=5, dtype_dw=DT_FLOAT, dtype_w=DT_FLOAT, dtype_x=DT_FLOAT, dtype_y=DT_FLOAT, locks=0, locks_dx=0, segments=128, segments_dx=128, shared=608, shared_dx=656, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_1, w/read, BlocksparseMatMul/fprop_lut/_3, BlocksparseMatMul/bprop_lut/_5, BlocksparseMatMul/updat_lut/_7)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "simple.py", line 27, in <module>
result = sess.run([y], feed_dict = {x: np.ones((minibatch_size,hidden_size), dtype='float32')})
File "/home/hyz/ml/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/home/hyz/ml/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/home/hyz/ml/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/home/hyz/ml/lib/python3.5/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: invalid resource handle
[[Node: BlocksparseMatMul_000000 = BlocksparseMatmul[C=4096, K=4096, alpha=1, axis=1, bench=0, beta=0, blocks=8168, bshift=5, dtype_dw=DT_FLOAT, dtype_w=DT_FLOAT, dtype_x=DT_FLOAT, dtype_y=DT_FLOAT, locks=0, locks_dx=0, segments=128, segments_dx=128, shared=608, shared_dx=656, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_1, w/read, BlocksparseMatMul/fprop_lut/_3, BlocksparseMatMul/bprop_lut/_5, BlocksparseMatMul/updat_lut/_7)]]

Caused by op 'BlocksparseMatMul_000000', defined at:
File "simple.py", line 22, in <module>
y = bsmm(x, w)
File "/home/hyz/prog/blocksparse/blocksparse/matmul.py", line 383, in __call__
shared=self.fprop_shared, shared_dx=self.bprop_shared, bench=bench, name=name
File "", line 631, in blocksparse_matmul
File "/home/hyz/ml/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 328, in apply_op
op_type_name, name, **keywords)
File "/home/hyz/ml/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/hyz/ml/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/home/hyz/ml/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

InternalError (see above for traceback): invalid resource handle
[[Node: BlocksparseMatMul_000000 = BlocksparseMatmul[C=4096, K=4096, alpha=1, axis=1, bench=0, beta=0, blocks=8168, bshift=5, dtype_dw=DT_FLOAT, dtype_w=DT_FLOAT, dtype_x=DT_FLOAT, dtype_y=DT_FLOAT, locks=0, locks_dx=0, segments=128, segments_dx=128, shared=608, shared_dx=656, _device="/job:localhost/replica:0/task:0/device:GPU:0"](_arg_Placeholder_0_0/_1, w/read, BlocksparseMatMul/fprop_lut/_3, BlocksparseMatMul/bprop_lut/_5, BlocksparseMatMul/updat_lut/_7)]]

where is the implementation of Factorized self-attention?

Hi,
Good job.
I am looking for the code that implements factorized self-attention in the enwiki8 example. Could you point me to the relevant lines? I could not find them; the example appears only to mask out the future blocks.

nvcc fatal : Unsupported gpu architecture 'compute_60'

I tried make compile. It went smoothly until I hit:

nvcc fatal : Unsupported gpu architecture 'compute_60'
Makefile:105: recipe for target 'build/batch_norm_op_gpu.cu.o' failed
make: *** [build/batch_norm_op_gpu.cu.o] Error 1

Could you please help?
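
This error usually means the nvcc found on the PATH predates the requested architecture: compute_60 (Pascal) targets were introduced with CUDA 8.0, the version the README lists as a prerequisite. A minimal sketch of that check follows. The function names are hypothetical, the table covers only three architectures, and the sample banner string is illustrative rather than taken from this report.

```python
import re

# Minimum CUDA toolkit release that accepts each -gencode architecture.
# Only a few entries are listed here; extend as needed.
MIN_CUDA_FOR_ARCH = {
    "compute_60": (8, 0),   # Pascal
    "compute_70": (9, 0),   # Volta
    "compute_75": (10, 0),  # Turing
}

def parse_nvcc_release(version_text):
    """Extract (major, minor) from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", version_text)
    if match is None:
        raise ValueError("no 'release X.Y' found in nvcc output")
    return int(match.group(1)), int(match.group(2))

def supports_arch(version_text, arch):
    """Return True if this nvcc release is new enough for `arch`."""
    return parse_nvcc_release(version_text) >= MIN_CUDA_FOR_ARCH[arch]

# Illustrative banner from an older toolkit (assumed, not from the report).
sample = "Cuda compilation tools, release 7.5, V7.5.17"
print(supports_arch(sample, "compute_60"))
```

Running `nvcc --version` and `which nvcc` shows which toolkit the Makefile is actually picking up.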

Getting undefined symbol: _ZN10tensorflow7strings6StrCatERKNS0_8AlphaNumES3_S3_S3_

With TF 1.5rc0, CUDA 9.1 on commit 8095147 the library builds/installs successfully (with 'make compile', pip install dist/*.whl), but I'm getting:


~/n/blocksparse/examples (master) $ python simple.py 
Traceback (most recent call last):
  File "simple.py", line 1, in <module>
    from blocksparse.matmul import BlocksparseMatMul
  File "~/.local/lib/python2.7/site-packages/blocksparse/matmul.py", line 13, in <module>
    import blocksparse.ewops as ew
  File "/home/dchichkov/.local/lib/python2.7/site-packages/blocksparse/ewops.py", line 17, in <module>
    _op_module = tf.load_op_library(os.path.join(data_files_path, 'blocksparse_ops.so'))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/load_library.py", line 56, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename, status)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: 
/~/.local/lib/python2.7/site-packages/blocksparse/blocksparse_ops.so: undefined symbol: 
_ZN10tensorflow7strings6StrCatERKNS0_8AlphaNumES3_S3_S3_

Also, Python 2.7 raises a syntax error on def fused_lstm_gates(c, *args, name=None) (see #8), and I guess swapping the order of *args and the keyword default would be an incorrect fix.
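
Keyword-only arguments after *args are Python 3 syntax, which is why Python 2.7 rejects this line at compile time. A common way to write a signature that both interpreters accept is to take **kwargs and pop the name out of it. A minimal sketch follows; fused_gates_compat is a hypothetical stand-in that only echoes its arguments and is not the blocksparse implementation.

```python
def fused_gates_compat(c, *args, **kwargs):
    """Accept `name` as a keyword on both Python 2 and Python 3.

    Hypothetical stand-in: it returns its inputs so the calling convention
    can be checked, and does no actual LSTM gate computation.
    """
    name = kwargs.pop("name", None)
    if kwargs:
        # Mirror the error a keyword-only signature would raise.
        raise TypeError("unexpected keyword arguments: %s" % sorted(kwargs))
    return c, args, name


print(fused_gates_compat("c", "i", "f", "o", name="gates"))
```

Swapping the order instead (name=None before *args) would make the first positional gate argument bind to name, which is why that reordering changes behavior.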

can it support cpu?

can blocksparse support gpu?

Error installing blocksparse on Python2.7

Unable to import blocksparse using python2.7

pip install -v blocksparse

Gives following error:

Compiling /tmp/pip-install-Cylls1/blocksparse/blocksparse/ewops.py ...
File "/tmp/pip-install-Cylls1/blocksparse/blocksparse/ewops.py", line 162
def fused_lstm_gates(c, *args, name=None):
^
SyntaxError: invalid syntax

Saving checkpoints on enwiki8 model.

Hi and thank you for this wonderful package.

Looking at the code for the enwiki8 transformer model, you don't seem to save the model or its weights anywhere for later use. As a result, the model cannot be used later for generation.

Could you sketch a checkpointing strategy (what needs to be saved and how), or is there a reason for not doing so in the first place?

Thanks in advance!

SyntaxError: invalid syntax

I installed blocksparse by following the instructions, and tested with the example. I got the following error:

Traceback (most recent call last):
File "test.py", line 1, in <module>
from blocksparse.matmul import BlocksparseMatMul
File "/usr/local/lib/python2.7/dist-packages/blocksparse/matmul.py", line 13, in <module>
import blocksparse.ewops as ew
File "/usr/local/lib/python2.7/dist-packages/blocksparse/ewops.py", line 162
def fused_lstm_gates(c, *args, name=None):
^
SyntaxError: invalid syntax

I am working on an AWS p2 instance with Ubuntu 16, tensorflow 1.4.1, python 2.7, cuda-8.0, cudnn-6.0. The installation does not report any problem, but I have to put __init__.py in the blocksparse package folder, otherwise import blocksparse does not work.

I also tried many other ways, including python 3, and none worked.

"About" text is misleading, should specify that this is an NVidia-only repository

Efficient GPU kernels for block-sparse matrix multiplication and convolution

I clicked on this thinking it was a general library, maybe OpenCL, scrolled down, and got a bit peeved.
The only code here is written in non-portable CUDA and non-portable GPU assembly; NVidia cards are required unless you do HIP conversion which is no longer necessarily the most efficient kernel.

I might be getting one of their workstation cards later this year to get the best of both worlds, but NVidia aren't the only GPUs; for general purpose compute the current 170% more expensive model gets beaten by a 7900XTX. I have no brand preference. Actually, if the drivers don't clash, I plan on having an Arc A770 and an A6000 in this machine alongside the 7900XTX by the end of the year to get the best of everything for 3D rendering, or use the low-power Arc for inference since it's as fast as the 7900XTX (with the XTX under the fastest way of running anything, Shark) and both are faster than the A6000 for that. The NVidia should render some scenes faster, though, and will probably still be the easiest way to do local training given how many libraries assume CUDA and how long it will take them to make the slight changes required to fix that. Anyway, my point is that tagging this correctly as "Efficient NVidia CUDA / assembly kernels for..." would be the user-friendly thing to do.

Upgrade to tensorflow 2.0

Hi all,

Has anybody tried to upgrade this project to tensorflow 2.0?

AFAIK one of the main issues is that the cuda_stream.h header was removed in TF 2.0 (see also #40). Instead of passing a CUstream directly when writing an op, users must now pass a GPUDevice object (probably to decouple from the CUDA dependency).

I tried to patch the code for this change but failed. Has anyone else had any luck?

Add prerequisites / hardware requirements for the LSTM sample

I tried running the LSTM example on my workstation with 24GB of GPU memory and a 512-hidden-unit RNN, and it still runs out of memory. Could you please add the prerequisites / hardware requirements needed to reproduce your results for the LSTM example?

Also I would suggest adding a download path for the dataset, which I assume is:
https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset
https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-103-raw-v1.zip

direct solve support? (LU)

does blocksparse support solution of sparse linear systems,

ie solving

 Ax=b

for x where A is a sparse matrix and x, b are vectors?

If not, is this functionality in the pipeline?

Thanks

libcudart.so.9.0: cannot open shared object file: No such file or directory

I am trying to recreate the example and get the following error on the line from blocksparse.matmul import BlocksparseMatMul:

---------------------------------------------------------------------------
NotFoundError                             Traceback (most recent call last)
<ipython-input-1-1dea216b89b0> in <module>()
----> 1 from blocksparse.matmul import BlocksparseMatMul
      2 import tensorflow as tf
      3 import numpy as np

~/anaconda/lib/python3.6/site-packages/blocksparse/matmul.py in <module>()
     11 from tensorflow.python.framework import ops
     12 from tensorflow.python.ops.init_ops import Initializer
---> 13 import blocksparse.ewops as ew
     14 
     15 data_files_path = tf.resource_loader.get_data_files_path()

~/anaconda/lib/python3.6/site-packages/blocksparse/ewops.py in <module>()
     15 
     16 data_files_path = tf.resource_loader.get_data_files_path()
---> 17 _op_module = tf.load_op_library(os.path.join(data_files_path, 'blocksparse_ops.so'))
     18 # for x in dir(_op_module):
     19 #     print(x)

~/anaconda/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py in load_op_library(library_filename):
     54   """
     55   with errors_impl.raise_exception_on_not_ok_status() as status:
---> 56     lib_handle = py_tf.TF_LoadLibrary(library_filename, status)
     57 
     58   op_list_str = py_tf.TF_GetOpList(lib_handle)

~/anaconda/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py in __exit__(self, type_arg, value_arg, traceback_arg)
    471             None, None,
    472             compat.as_text(c_api.TF_Message(self.status.status)),
--> 473             c_api.TF_GetCode(self.status.status))
    474     # Delete the underlying status object from memory otherwise it stays alive
    475     # as there is a reference to status from this from the traceback due to

NotFoundError: libcudart.so.9.0: cannot open shared object file: No such file or directory

I believe I have all the prerequisites:

Python 3.6.2
CUDA Version 8.0.61 (from /usr/local/cuda/version.txt)
tensorflow-gpu (1.4.1)
Ubuntu 16.04

I am running an AWS p2.xlarge instance; it uses a single Kepler GPU (K80).

Edit:

Tried this again on another instance that uses Maxwell architecture, since it is recommended (GPU+ at paperspace.com).

Apart from different GPU, the only other difference on that instance is Python 3.6.3.

Still get the same error.
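
The loader message says the installed blocksparse_ops.so was linked against the CUDA 9.0 runtime (libcudart.so.9.0), while the environment listed above has CUDA 8.0, so changing the GPU would not be expected to help. A minimal sketch for checking which runtime the dynamic loader can actually find follows. The helper name can_load is hypothetical, and the library names are the usual Linux sonames.

```python
import ctypes

def can_load(library_name):
    """Return True if the dynamic loader can open `library_name`."""
    try:
        ctypes.CDLL(library_name)
    except OSError:
        return False
    return True

if __name__ == "__main__":
    # Compare the runtime that is installed with the one the .so asks for.
    for name in ("libcudart.so.8.0", "libcudart.so.9.0"):
        print(name, "loadable" if can_load(name) else "not found")
```

If only the 8.0 runtime is loadable, the options are to install a matching CUDA toolkit or to build blocksparse from source against the installed one.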

Error importing blocksparse

Hi,
I've been trying to install blocksparse on a GCP instance with a P100 GPU, but I have not been able to get past this error.
Here's how I did it:

conda install -c anaconda tensorflow-gpu
pip install blocksparse

>>> import blocksparse
>>> from blocksparse.matmul import BlocksparseMatMul
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/johnsmith/anaconda3/envs/temp/lib/python3.6/site-packages/blocksparse/matmul.py", line 13, in <module>
    import blocksparse.ewops as ew
  File "/home/johnsmith/anaconda3/envs/temp/lib/python3.6/site-packages/blocksparse/ewops.py", line 17, in <module>
    _op_module = tf.load_op_library(os.path.join(data_files_path, 'blocksparse_ops.so'))
  File "/home/johnsmith/anaconda3/envs/temp/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 58, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename, status)
  File "/home/johnsmith/anaconda3/envs/temp/lib/python3.6/site-packages/tensorflow/python/framework/errors_impl.py", line 516, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.NotFoundError: /home/johnsmith/anaconda3/envs/temp/lib/python3.6/site-packages/blocksparse/blocksparse_ops.so: undefined symbol: _ZN10tensorflow20OpKernelConstruction21CtxFailureWithWarningENS_6StatusE

I also tried this on an AWS p2.xlarge instance, and got the same error.

AttributeError: module 'tensorflow' has no attribute 'resource_loader'

Ubuntu 18.04 LTS clean install and fully updated, after a pip3 install of tensorflow-gpu and install of blocksparse from source I am getting the following error:

Python 3.6.8 (default, Jan 14 2019, 11:02:34) 
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import blocksparse
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/gperry/DeepFlight/OpenAI/blocksparse/blocksparse/__init__.py", line 3, in <module>
    from blocksparse.utils import (
  File "/home/gperry/DeepFlight/OpenAI/blocksparse/blocksparse/utils.py", line 15, in <module>
    data_files_path = tf.resource_loader.get_data_files_path()
AttributeError: module 'tensorflow' has no attribute 'resource_loader'

problem with tensor core when using float16

When doing matrix multiplication with blocksparse on tensor cores, I ran into a problem.
The code is below.

By doing this, I get the answer below.

I get different answers from blocksparse (1.13.1) with float16 and from the TensorFlow matmul function.
When I use the same code with blocksparse (1.0.0), which uses CUDA cores, I get the right answer.
I have tried many different tests with tensor cores, and they all give the wrong answer, so I really wonder why.
Am I using the blocksparse API in the wrong way?
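
One thing worth ruling out when comparing float16 and float32 results is ordinary rounding: half-precision results are not expected to match bit for bit, and the gap grows when partial sums are also kept in half precision. The sketch below emulates this with the standard library only. The function names and the input values are hypothetical, and it says nothing about how the blocksparse kernels accumulate.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]

def dot_fp16_accumulate(xs, ws):
    """Dot product with fp16 inputs and an fp16 running sum."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc = to_fp16(acc + to_fp16(to_fp16(x) * to_fp16(w)))
    return acc

def dot_wide_accumulate(xs, ws):
    """Dot product with fp16 inputs but a wider running sum."""
    acc = 0.0
    for x, w in zip(xs, ws):
        acc += to_fp16(x) * to_fp16(w)
    return acc

xs = [0.1] * 4096
ws = [1.0] * 4096
narrow = dot_fp16_accumulate(xs, ws)
wide = dot_wide_accumulate(xs, ws)
print("fp16 accumulate:", narrow)
print("wide accumulate:", wide)
```

If the observed differences are much larger than this kind of rounding can explain, or appear even for small exactly representable inputs such as all ones, rounding is unlikely to be the cause.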

cuda_stream.h: No such file or directory

When attempting to build blocksparse from source on Ubuntu 18.04 LTS:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "tensorflow/python/pywrap_tensorflow.py", line 25, in <module>
    from tensorflow.python.platform import self_check
  File "tensorflow/python/platform/self_check.py", line 27, in <module>
    raise ImportError("Could not import tensorflow. Do not import tensorflow "
ImportError: Could not import tensorflow. Do not import tensorflow from its source directory; change directory to outside the TensorFlow source tree, and relaunch your Python interpreter from there.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "tensorflow/python/pywrap_tensorflow.py", line 25, in <module>
    from tensorflow.python.platform import self_check
  File "tensorflow/python/platform/self_check.py", line 27, in <module>
    raise ImportError("Could not import tensorflow. Do not import tensorflow "
ImportError: Could not import tensorflow. Do not import tensorflow from its source directory; change directory to outside the TensorFlow source tree, and relaunch your Python interpreter from there.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "tensorflow/python/pywrap_tensorflow.py", line 25, in <module>
    from tensorflow.python.platform import self_check
  File "tensorflow/python/platform/self_check.py", line 27, in <module>
    raise ImportError("Could not import tensorflow. Do not import tensorflow "
ImportError: Could not import tensorflow. Do not import tensorflow from its source directory; change directory to outside the TensorFlow source tree, and relaunch your Python interpreter from there.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "tensorflow/python/pywrap_tensorflow.py", line 25, in <module>
    from tensorflow.python.platform import self_check
  File "tensorflow/python/platform/self_check.py", line 27, in <module>
    raise ImportError("Could not import tensorflow. Do not import tensorflow "
ImportError: Could not import tensorflow. Do not import tensorflow from its source directory; change directory to outside the TensorFlow source tree, and relaunch your Python interpreter from there.
mkdir -p build
g++ -std=c++11 -O3 -fPIC -DGOOGLE_CUDA=1 -D_GLIBCXX_USE_CXX11_ABI= -I./build -I/usr/local/cuda/include -I/tensorflow/include -I/tensorflow/include/external/nsync/public -I/external/local_config_cuda/cuda -I/usr/local -c src/batch_norm_op.cc -o build/batch_norm_op.o
src/batch_norm_op.cc:2:10: fatal error: tensorflow/core/framework/op.h: No such file or directory
 #include "tensorflow/core/framework/op.h"
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
Makefile:117: recipe for target 'build/batch_norm_op.o' failed
make: *** [build/batch_norm_op.o] Error 1

I have the cloned tensorflow repo in the same directory, and it seems to conflict with importing tensorflow while running make compile in that directory.

Here's the error after moving the tensorflow source repo out of the directory:

$ make compile
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "tensorflow/python/pywrap_tensorflow.py", line 25, in <module>
    from tensorflow.python.platform import self_check
  File "tensorflow/python/platform/self_check.py", line 27, in <module>
    raise ImportError("Could not import tensorflow. Do not import tensorflow "
ImportError: Could not import tensorflow. Do not import tensorflow from its source directory; change directory to outside the TensorFlow source tree, and relaunch your Python interpreter from there.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "tensorflow/python/pywrap_tensorflow.py", line 25, in <module>
    from tensorflow.python.platform import self_check
  File "tensorflow/python/platform/self_check.py", line 27, in <module>
    raise ImportError("Could not import tensorflow. Do not import tensorflow "
ImportError: Could not import tensorflow. Do not import tensorflow from its source directory; change directory to outside the TensorFlow source tree, and relaunch your Python interpreter from there.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "tensorflow/python/pywrap_tensorflow.py", line 25, in <module>
    from tensorflow.python.platform import self_check
  File "tensorflow/python/platform/self_check.py", line 27, in <module>
    raise ImportError("Could not import tensorflow. Do not import tensorflow "
ImportError: Could not import tensorflow. Do not import tensorflow from its source directory; change directory to outside the TensorFlow source tree, and relaunch your Python interpreter from there.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "tensorflow/__init__.py", line 24, in <module>
    from tensorflow.python import pywrap_tensorflow  # pylint: disable=unused-import
  File "tensorflow/python/__init__.py", line 49, in <module>
    from tensorflow.python import pywrap_tensorflow
  File "tensorflow/python/pywrap_tensorflow.py", line 25, in <module>
    from tensorflow.python.platform import self_check
  File "tensorflow/python/platform/self_check.py", line 27, in <module>
    raise ImportError("Could not import tensorflow. Do not import tensorflow "
ImportError: Could not import tensorflow. Do not import tensorflow from its source directory; change directory to outside the TensorFlow source tree, and relaunch your Python interpreter from there.
mkdir -p build
g++ -std=c++11 -O3 -fPIC -DGOOGLE_CUDA=1 -D_GLIBCXX_USE_CXX11_ABI= -I./build -I/usr/local/cuda/include -I/tensorflow/include -I/tensorflow/include/external/nsync/public -I/external/local_config_cuda/cuda -I/usr/local -c src/batch_norm_op.cc -o build/batch_norm_op.o
src/batch_norm_op.cc:2:10: fatal error: tensorflow/core/framework/op.h: No such file or directory
 #include "tensorflow/core/framework/op.h"
          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.
Makefile:117: recipe for target 'build/batch_norm_op.o' failed
make: *** [build/batch_norm_op.o] Error 1
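
The repeated ImportError comes from Python finding a local tensorflow/ source directory before the installed package, since the current directory is searched first when the build runs python -c. The empty prefix in -I/tensorflow/include also suggests that the path query returned nothing. A minimal sketch for detecting this situation before running make follows; the function name is hypothetical.

```python
import os

def shadows_installed_package(package_name, directory="."):
    """Return True if `directory` holds a package or module that Python
    would import as `package_name` ahead of the installed copy."""
    package_init = os.path.join(directory, package_name, "__init__.py")
    module_file = os.path.join(directory, package_name + ".py")
    return os.path.isfile(package_init) or os.path.isfile(module_file)

if __name__ == "__main__":
    if shadows_installed_package("tensorflow"):
        print("A local tensorflow/ directory shadows the installed package.")
    else:
        print("No local tensorflow package in the current directory.")
```

Since the second log still shows the relative path tensorflow/__init__.py, it may be worth running this check from the exact directory where make compile is invoked.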

Better documentation

Is it possible to document how kernels are implemented?

Keras wrapper

Great project! A few unrelated questions you might be able to answer.

I've been working on a related approach and I wonder how you think it might compare. Instead of providing a connectivity pattern for small blocks and optimising this at a lower level, I figured I should get similar benefits from using larger blocks (say 128; batch×128×128 is still a respectable amount of work for a GPU) and connecting those with some configurable pattern at the computation-graph level.

I am pretty sure your approach blows this naive approach out of the water for small blocks; but it would add some flexibility benefits; like different-sized blocks. But if this works efficiently down to 8x8 blocks then that latter argument is mostly moot as well.

Personally I was mostly inspired by grouped-convolutions, and the desire to have a similar control over the ratio of neural-bandwidth-to-number-of-connections, for dense networks too. By extension of the work on grouped convolutions, I have been thinking about such architectures in a way where most layers are straightforward diagonal, block-to-block operations, periodically interspersed with 'lateral', or off-diagonal connections.

But I suppose with your approach there is really no performance benefit in doing so, and this architectural question is subsumed by the tuning of the proportion of vertical/lateral, or diagonal/off-diagonal connections. That said, it could still be that restricting the lateral flow of information and having connected blocks 'do their own thing' for multiple layers is a beneficial strategy. Analogous to the work on multigrid/multiscale convnets, it seems likely that having fully connected blocks at a range of bandwidth-scales, with restricted off-diagonal communication between them, would effectively separate the signal into a high-level contextual descriptor for the low-bandwidth paths, and delegate the detail to the higher-bandwidth blocks. That'd be pretty cool from an unsupervised learning perspective.

Anyway, no shortage of potential applications I think, and kudos to openai for releasing this early instead of milking it like an academic researcher would. The application I am working on atm is proprietary, but I think work such as this will be pivotal to getting people to look at anything other than convnets again. With fractal/resnet like architectures giving unlimited depth, and work like this giving essentially unlimited width, I think a lot of possibilities are opening up.

Personally, I was experimenting with hypercube and tree-like topologies as connectivity patterns (both of which have log(n_blocks) longest path length). Have you compared these types of structured connectivity to the more random connectivity patterns you explore? Is there a benefit to the randomness, or is it just a good vehicle for demonstrating the generality of the approach?
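
To make the hypercube idea concrete, here is a minimal sketch of a block-connectivity mask in which block i is connected to itself and to every block whose index differs from i in exactly one bit. The function name is hypothetical and the construction is only one way to read "hypercube topology". The result has the same shape as the 0/1 sparsity array in the README example, so it could presumably be wrapped in np.array and used in its place.

```python
def hypercube_mask(n_blocks):
    """Return an n_blocks x n_blocks 0/1 mask with hypercube connectivity.

    Each block connects to itself and to its log2(n_blocks) neighbours,
    so any block can reach any other in at most log2(n_blocks) layers.
    """
    if n_blocks < 1 or n_blocks & (n_blocks - 1):
        raise ValueError("n_blocks must be a power of two")
    dims = n_blocks.bit_length() - 1
    mask = [[0] * n_blocks for _ in range(n_blocks)]
    for i in range(n_blocks):
        mask[i][i] = 1                    # diagonal (block-to-block) path
        for d in range(dims):
            mask[i][i ^ (1 << d)] = 1     # lateral path: flip one index bit
    return mask


for row in hypercube_mask(8):
    print(row)
```

Each row has 1 + log2(n_blocks) nonzero blocks, so the density falls off quickly as the number of blocks grows.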

Concluding my ramblings: I hope to find the time to try this out soon! Which leads me to another question: are you aware of a battle-tested Keras wrapper? And if not, do you foresee any gotchas in writing one, or should it be straightforward? If I get around to it, I'll make sure to open a PR for it.

Tensorflow error

I am trying to run a model for semantic role labeling. While running the prediction model I encountered the below error. I am not sure how to fix it.

Traceback (most recent call last):
File "decoder.py", line 14, in <module>
from srl_model import SRLModel
File "/Users/ajaysingh/Downloads/ms/study_masters/Spring_20/NLP/Assignment2/lsgn-master/srl_model.py", line 15, in <module>
from model_utils import *
File "/Users/ajaysingh/Downloads/ms/study_masters/Spring_20/NLP/Assignment2/lsgn-master/model_utils.py", line 5, in <module>
import srl_ops
File "/Users/ajaysingh/Downloads/ms/study_masters/Spring_20/NLP/Assignment2/lsgn-master/srl_ops.py", line 4, in <module>
srl_op_library = tf.load_op_library("./srl_kernels.so")
File "/Users/ajaysingh/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/load_library.py", line 61, in load_op_library
lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: dlopen(./srl_kernels.so, 6): image not found

What should I do to fix this issue?

bug report - in matrix-vector multiplication

When multiplying a k-by-m (block-sparse) matrix with an m-by-n (dense) matrix, if n==1, the result is a vector of zeros.
I am using tf-1.12 with both cuda 9.0 and 9.2 installed; I am not sure which one is used.
code:

`
from blocksparse.matmul import BlocksparseMatMul
import tensorflow as tf
import numpy as np

hidden_size = 16
block_size = 8
minibatch_size = 1

sparsity = np.random.randint(2, size=(hidden_size//block_size,hidden_size//block_size))
bsmm = BlocksparseMatMul(sparsity, block_size=block_size, feature_axis=0)

x = tf.placeholder(tf.float32, shape=[hidden_size, None])
w = tf.placeholder(tf.float32, shape=bsmm.w_shape)

x_data = np.ones([hidden_size, minibatch_size], dtype='float32')
a,b,c = bsmm.w_shape
w_data = np.ones(bsmm.w_shape, 'float32')
y = bsmm(x, w)

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
y_ = sess.run([y], feed_dict={x:x_data, w:w_data})
print(y_[0])
`
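
For comparison, a dense reference computation shows what the result should look like with all-ones weights and inputs. This is a minimal dependency-free sketch with hypothetical helper names and a fixed example mask instead of the random one above. It assumes mask[i][j] couples output block i to input block j; under the transposed convention the counts are taken per column instead, but the result is still nonzero wherever the mask has a block.

```python
def dense_from_blocks(mask, block_size, value=1.0):
    """Expand a 0/1 block mask into a dense matrix filled with `value`."""
    n_rows = len(mask) * block_size
    n_cols = len(mask[0]) * block_size
    return [
        [value if mask[r // block_size][c // block_size] else 0.0
         for c in range(n_cols)]
        for r in range(n_rows)
    ]

def matvec(matrix, vector):
    """Plain dense matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

mask = [[1, 0],
        [1, 1]]          # fixed example pattern (assumed, not from the report)
block_size = 8
w_dense = dense_from_blocks(mask, block_size)
x = [1.0] * (len(mask[0]) * block_size)
y = matvec(w_dense, x)
print(y)
```

Each output entry equals block_size times the number of nonzero blocks in its block row, so an all-zero output for n==1 cannot be correct unless the random mask happened to be empty.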

ValueError: Invalid aggregation_method specified 3.

The test blocksparse_matmul_bench.py fails with the following error:

Traceback (most recent call last):
  File "./blocksparse_matmul_bench.py", line 100, in <module>
    d = tf.gradients(y, [x, w], e, aggregation_method=3)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 533, in gradients
    out_grads = _AggregatedGrads(grads, op, loop_state, aggregation_method)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 820, in _AggregatedGrads
    aggregation_method)
ValueError: Invalid aggregation_method specified 3.

According to the official TensorFlow documentation (https://www.tensorflow.org/api_docs/python/tf/AggregationMethod), aggregation_method=3 is not a valid value.

So what value should aggregation_method be set to?

If I remove the aggregation_method argument and run the test again, a TypeError occurs:

Traceback (most recent call last):
  File "blocksparse_matmul_bench.py", line 101, in <module>
    d = tf.gradients(y, [x, w], e)# aggregation_method=3)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 533, in gradients
    out_grads = _AggregatedGrads(grads, op, loop_state, aggregation_method)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 872, in _AggregatedGrads
    out_grads[i] = _MultiDeviceAddN(out_grad)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gradients_impl.py", line 767, in _MultiDeviceAddN
    summands.append(math_ops.add_n(tensors))
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/math_ops.py", line 2000, in add_n
    return gen_math_ops._add_n(inputs, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/ops/gen_math_ops.py", line 220, in _add_n
    "AddN", inputs=inputs, name=name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 598, in _apply_op_helper
    param_name=input_name)
  File "/usr/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 60, in _SatisfiesTypeConstraint
    ", ".join(dtypes.as_dtype(x).name for x in allowed_list)))
TypeError: Value passed to parameter 'inputs' has DataType bfloat16 not in list of allowed values: float32, float64, int64, int32, uint8, uint16, int16, int8, complex64, complex128, qint8, quint8, qint32, float16, variant

Error when importing blocksparse: `undefined symbol: _Z11GetCountSMsv`

I am having trouble importing your blocksparse library after compiling and pip-installing the built package. I am compiling with CUDA 10 and the latest TF built from source, so I imagine this could be the issue, since you mention using CUDA 8 and TF 1.4.0. I'd like to get it working with the latest TF, as the rest of our pipelines and models are built on the latest version.

I found several references to GetCountSMs() and GetCountSMsVersion() in the source code, but could not find anything called _Z11GetCountSMsv.

Thanks in advance for any help

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/blocksparse/blocksparse/__init__.py", line 3, in <module>
    from blocksparse.utils import (
  File "/home/ubuntu/blocksparse/blocksparse/utils.py", line 17, in <module>
    _op_module = tf.load_op_library(os.path.join(data_files_path, 'blocksparse_ops.so'))
  File "/home/ubuntu/.conda/envs/bs/lib/python3.7/site-packages/tensorflow_core/python/framework/load_library.py", line 61, in load_op_library
    lib_handle = py_tf.TF_LoadLibrary(library_filename)
tensorflow.python.framework.errors_impl.NotFoundError: /home/ubuntu/blocksparse/blocksparse/blocksparse_ops.so: undefined symbol: _Z11GetCountSMsv
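
_Z11GetCountSMsv is the mangled C++ name (Itanium ABI, as used by g++ on Linux) of GetCountSMs(): _Z marks a mangled name, 11 is the length of the identifier, and the trailing v means the function takes no arguments. So the loader is reporting that the compiled definition of GetCountSMs() is missing from the .so, not that a differently named function is needed. The sketch below decodes only this simplest form (no namespaces, templates, or non-void parameters). The helper name is hypothetical, and c++filt handles the general case.

```python
import re

def demangle_simple(symbol):
    """Decode `_Z<len><name><params>` for a plain, non-nested function.

    Returns None for anything this sketch does not understand.
    """
    match = re.match(r"_Z(\d+)", symbol)
    if match is None:
        return None
    length = int(match.group(1))
    start = match.end()
    name = symbol[start:start + length]
    params = symbol[start + length:]
    if len(name) != length or not params:
        return None
    # 'v' alone encodes an empty (void) parameter list.
    args = "" if params == "v" else "..."
    return "%s(%s)" % (name, args)


print(demangle_simple("_Z11GetCountSMsv"))
```

A missing definition like this often points to a source file that was not compiled or linked into blocksparse_ops.so for the configuration being built.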

has no attribute 'allreduce_nccl'

When I run the enwik8.py example on multiple GPUs, it fails with the error "has no attribute 'allreduce_nccl'" in blocksparse.nccl.allreduce.

test/blocksparse_conv_test.py failed and example/simple.py sometimes raised an invalid memory access error

System information

OS Platform and Distribution: Linux Ubuntu 18.04
TensorFlow version: 1.13.1 (with GPU support)
Python version: 3.7.7
CUDA/cuDNN version: 10.0 / 7
GPU: Tesla T4

Encountered problem
I tried both pip install blocksparse and building from source. After installation, I can run import blocksparse in Python and pass most tests. However, when I run test/blocksparse_conv_test.py, the following error occurred.

(tf13) ubuntu@xxx:~/blocksparse$ python test/blocksparse_conv_test.py
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:From /home/ubuntu/anaconda3/lib/python3.7/contextlib.py:82: TensorFlowTestCase.test_session (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use `self.session()` or `self.cached_session()` instead.
2020-07-19 15:22:55.214905: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-07-19 15:22:55.236910: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-07-19 15:22:55.237482: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55b7bd771c50 executing computations on platform Host. Devices:
2020-07-19 15:22:55.237509: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-07-19 15:22:55.362344: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-19 15:22:55.363172: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.65GiB
2020-07-19 15:22:55.363193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-07-19 15:22:55.393925: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-19 15:22:55.393972: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2020-07-19 15:22:55.393981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2020-07-19 15:22:55.394077: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14241 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2020-07-19 15:22:55.395613: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x55b7bbfe59e0 executing computations on platform CUDA. Devices:
2020-07-19 15:22:55.395639: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5

test1
2020-07-19 15:22:55.429514: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at blocksparse_conv_op.cc:320 : Internal: device kernel image is invalid
ERROR:tensorflow:device kernel image is invalid
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]

Caused by op 'test1/F4B4/BlocksparseConv', defined at:
  File "test/blocksparse_conv_test.py", line 213, in <module>
    tf.test.main()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/test.py", line 64, in main
    return _googletest.main(argv)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 100, in main
    benchmark.benchmarks_main(true_main=main_wrapper)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/benchmark.py", line 371, in benchmarks_main
    true_main()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 99, in main_wrapper
    return app.run(main=g_main, argv=args)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 70, in g_main
    return unittest_main(argv=argv)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/main.py", line 101, in __init__
    self.runTests()
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/main.py", line 271, in runTests
    self.result = testRunner.run(self.test)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/runner.py", line 176, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/case.py", line 676, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/case.py", line 628, in run
    testMethod()
  File "test/blocksparse_conv_test.py", line 126, in testBlocksparseConv
    op   = bs_conv_op(devF, devI)
  File "/home/ubuntu/blocksparse/blocksparse/conv.py", line 511, in __call__
    dimF=F.get_shape().as_list(), fshare=self.fshared, bshare=self.bshared, debug=self.debug
  File "<string>", line 471, in blocksparse_conv
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): device kernel image is invalid
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]

Es
======================================================================
ERROR: testBlocksparseConv (__main__.BlocksparseConvTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1334, in _do_call
    return fn(*args)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1319, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1407, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: device kernel image is invalid
         [[{{node test1/F4B4/BlocksparseConv}}]]
         [[{{node test1/F4B4/BlocksparseConv}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "test/blocksparse_conv_test.py", line 127, in testBlocksparseConv
    devO = sess.run( op )
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/test_util.py", line 1368, in run
    return super(ErrorLoggingSession, self).run(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: device kernel image is invalid
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]

Caused by op 'test1/F4B4/BlocksparseConv', defined at:
  File "test/blocksparse_conv_test.py", line 213, in <module>
    tf.test.main()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/test.py", line 64, in main
    return _googletest.main(argv)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 100, in main
    benchmark.benchmarks_main(true_main=main_wrapper)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/benchmark.py", line 371, in benchmarks_main
    true_main()
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 99, in main_wrapper
    return app.run(main=g_main, argv=args)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/platform/googletest.py", line 70, in g_main
    return unittest_main(argv=argv)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/main.py", line 101, in __init__
    self.runTests()
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/main.py", line 271, in runTests
    self.result = testRunner.run(self.test)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/runner.py", line 176, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 84, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/suite.py", line 122, in run
    test(result)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/case.py", line 676, in __call__
    return self.run(*args, **kwds)
  File "/home/ubuntu/anaconda3/lib/python3.7/unittest/case.py", line 628, in run
    testMethod()
  File "test/blocksparse_conv_test.py", line 126, in testBlocksparseConv
    op   = bs_conv_op(devF, devI)
  File "/home/ubuntu/blocksparse/blocksparse/conv.py", line 511, in __call__
    dimF=F.get_shape().as_list(), fshare=self.fshared, bshare=self.bshared, debug=self.debug
  File "<string>", line 471, in blocksparse_conv
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): device kernel image is invalid
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]
         [[node test1/F4B4/BlocksparseConv (defined at <string>:471) ]]


----------------------------------------------------------------------
Ran 2 tests in 0.231s

FAILED (errors=1, skipped=1)

In addition, an invalid memory access sometimes occurs when running examples/simple.py. Here is the output from a run without the error.

(tf13) ubuntu@xxx:~/blocksparse$ python examples/simple.py
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:From /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-07-19 15:23:58.994318: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-07-19 15:23:59.016917: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-07-19 15:23:59.017474: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56341a1f6330 executing computations on platform Host. Devices:
2020-07-19 15:23:59.017505: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-07-19 15:23:59.122639: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-19 15:23:59.123458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.65GiB
2020-07-19 15:23:59.123478: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-07-19 15:23:59.152687: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-19 15:23:59.152724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2020-07-19 15:23:59.152735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2020-07-19 15:23:59.152835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14241 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2020-07-19 15:23:59.154217: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x5634190aa650 executing computations on platform CUDA. Devices:
2020-07-19 15:23:59.154239: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
[array([[-0.00464108, -0.00446517, -0.00446705, ..., -0.00433037,
        -0.00435545, -0.00431154],
       [ 0.00696341,  0.00687434,  0.00675924, ...,  0.00679887,
         0.00693929,  0.00719775],
       [ 0.01524079,  0.01537668,  0.01533529, ...,  0.01533816,
         0.01512151,  0.01528387],
       ...,
       [-0.00238256, -0.00245797, -0.0022754 , ..., -0.00224203,
        -0.00239737, -0.00237827],
       [-0.00508011, -0.00536294, -0.00516913, ..., -0.00537378,
        -0.00533525, -0.00540836],
       [ 0.01230985,  0.01257054,  0.01233936, ...,  0.01226609,
         0.012429  ,  0.01214379]], dtype=float32)]

And here is the output when the error appears.

(tf13) ubuntu@xxx:~/blocksparse$ python examples/simple.py
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:526: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:527: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:528: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:529: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:530: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
/home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/dtypes.py:535: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.
  np_resource = np.dtype([("resource", np.ubyte, 1)])
WARNING:tensorflow:From /home/ubuntu/anaconda3/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-07-19 15:24:31.054902: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-07-19 15:24:31.076918: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2499995000 Hz
2020-07-19 15:24:31.077469: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56258d1e4480 executing computations on platform Host. Devices:
2020-07-19 15:24:31.077494: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): <undefined>, <undefined>
2020-07-19 15:24:31.176438: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-07-19 15:24:31.177252: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: Tesla T4 major: 7 minor: 5 memoryClockRate(GHz): 1.59
pciBusID: 0000:00:1e.0
totalMemory: 14.75GiB freeMemory: 14.65GiB
2020-07-19 15:24:31.177274: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-07-19 15:24:31.208119: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-07-19 15:24:31.208164: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
2020-07-19 15:24:31.208176: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
2020-07-19 15:24:31.208278: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 14241 MB memory) -> physical GPU (device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5)
2020-07-19 15:24:31.209716: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x56258c098570 executing computations on platform CUDA. Devices:
2020-07-19 15:24:31.209739: I tensorflow/compiler/xla/service/service.cc:158]   StreamExecutor device (0): Tesla T4, Compute Capability 7.5
2020-07-19 15:24:31.685492: E tensorflow/stream_executor/cuda/cuda_event.cc:48] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2020-07-19 15:24:31.685539: F tensorflow/core/common_runtime/gpu/gpu_event_mgr.cc:274] Unexpected Event status: 1
Aborted (core dumped)

I guess that those problems are due to my TensorFlow and CUDA version. Could anyone help me? Thanks a lot!
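
One thing that can be checked from the logs above: TensorFlow reports the Tesla T4 as compute capability 7.5, and the feature table in the project README only lists Kepler, Maxwell, Pascal and Volta, with BSConv marked as available on Maxwell and Pascal only. The sketch below looks a compute capability up in that table. The support matrix is transcribed from the README; the helper names (gpu_family, readme_support) are hypothetical and not part of blocksparse, and the mapping from capability numbers to family names is general CUDA knowledge rather than something stated in this thread. It does not prove the cause of either error; it only shows that this GPU falls outside the documented configurations.

```python
# Feature support per GPU family, transcribed from the blocksparse README table.
# True means the README marks the kernel as available for that family.
README_SUPPORT = {
    "Kepler":  {"BSMatMul-ASM": False, "BSMatMul-CudaC": True, "BSConv": False},
    "Maxwell": {"BSMatMul-ASM": True,  "BSMatMul-CudaC": True, "BSConv": True},
    "Pascal":  {"BSMatMul-ASM": True,  "BSMatMul-CudaC": True, "BSConv": True},
    "Volta":   {"BSMatMul-ASM": False, "BSMatMul-CudaC": True, "BSConv": False},
}


def gpu_family(major, minor):
    """Map a CUDA compute capability (major, minor) to an architecture family.

    Hypothetical helper for illustration; not part of blocksparse.
    """
    if major == 3:
        return "Kepler"
    if major == 5:
        return "Maxwell"
    if major == 6:
        return "Pascal"
    if (major, minor) == (7, 5):
        return "Turing"
    if major == 7:
        return "Volta"
    return "unknown"


def readme_support(major, minor):
    """Return (family, support row); the row is None if the README has no entry."""
    family = gpu_family(major, minor)
    return family, README_SUPPORT.get(family)


# The logs above report "Tesla T4 ... compute capability: 7.5".
family, support = readme_support(7, 5)
if support is None:
    print(f"{family}: not listed in the README support table")
else:
    print(f"{family}: BSConv documented as supported = {support['BSConv']}")
```

The README also says that CUDA 9 and Volta require updating the build targets (-gencode=arch=compute_70,code=sm_70) and building TensorFlow from source. Whether an analogous target for compute capability 7.5 resolves the errors reported here is not confirmed in this thread.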

Why does it occupy less GPU memory when I use bs.bias_relu?

How to train example/lstm/train.py?

Q1: When I set the sparsity parameter, the program crashes with a core dump, for example with
python train.py --sparsity=ba_10 or
python train.py --sparsity=bae_10_10

Only --sparsity=dense works.
Q2: python --lstm_type=lstm fails with an error; the code has not been updated.
Q3: python --lstm_type=rnn is faster than python --lstm_type=scottbrain. Why?