Comments (15)
Sorry for the delay.
Right now, ISAAC's test suite uses prime numbers for M, N and K, but Caffe calls GEMM on corner cases (M=1, K=1 or N=1). This has caused some issues in the past.
I'll add tests for corner cases ASAP so I can pinpoint the problem. The only hardware I can test this on is a Broadwell 5500U iGPU. What hardware have you been using? I haven't been using Beignet; I'll try that too.
Thanks for the report!
from triton.
@ptillet A Broadwell 5500U is fine for reproducing this issue. Beignet's performance is not as good as the OpenCL SDK's, but its test suite pass rate is very good. Please use Beignet from git master. Thanks for your support!
from triton.
GEMM-xT fails for M=1. That's probably the cause of all your issues. I'm on it :)
from triton.
@gongzg I have just pushed two small fixes for BLAS: one for an uninitialized variable in BLAS-2 and one for the BLAS-3 failure when M==1. Does it help?
from triton.
@ptillet The crash is fixed, but the test failures remain.
from triton.
I have fixed more dot() corner cases in dbfaef8. There were still some problems left for the cases where SGEMV degenerates to SDOT or SAXPY... Hopefully this should fix some more issues with caffe.
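For readers unfamiliar with these degenerate shapes, here is a minimal C++ sketch of what "SGEMV degenerates to SDOT or SAXPY" means. It is not ISAAC's code: `gemv_ref`, `dot_ref` and `axpy_ref` are hypothetical reference helpers, and the sketch assumes a row-major matrix and the simplified operation y = A*x (no alpha/beta scaling).

```cpp
#include <cstddef>

// Hypothetical reference helpers, for illustration only (not ISAAC's API).

// y = A * x, with A stored row-major as rows x cols.
constexpr void gemv_ref(std::size_t rows, std::size_t cols,
                        const float* a, const float* x, float* y) {
    for (std::size_t i = 0; i < rows; ++i) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < cols; ++j) acc += a[i * cols + j] * x[j];
        y[i] = acc;
    }
}

// Returns sum_i a[i] * b[i].
constexpr float dot_ref(std::size_t n, const float* a, const float* b) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

// y += alpha * a
constexpr void axpy_ref(std::size_t n, float alpha, const float* a, float* y) {
    for (std::size_t i = 0; i < n; ++i) y[i] += alpha * a[i];
}

// rows == 1: the single output element is dot(A[0, :], x).
constexpr bool one_row_gemv_is_dot() {
    const float a[3] = {1.0f, 2.0f, 3.0f};
    const float x[3] = {4.0f, 5.0f, 6.0f};
    float y[1] = {0.0f};
    gemv_ref(1, 3, a, x, y);
    return y[0] == dot_ref(3, a, x);
}

// cols == 1: y equals x[0] * A[:, 0], i.e. an AXPY into a zeroed y.
constexpr bool one_col_gemv_is_axpy() {
    const float a[3] = {1.0f, 2.0f, 3.0f};
    const float x[1] = {2.0f};
    float y_gemv[3] = {0.0f, 0.0f, 0.0f};
    float y_axpy[3] = {0.0f, 0.0f, 0.0f};
    gemv_ref(3, 1, a, x, y_gemv);
    axpy_ref(3, x[0], a, y_axpy);
    for (std::size_t i = 0; i < 3; ++i)
        if (y_gemv[i] != y_axpy[i]) return false;
    return true;
}
```

A kernel generator that tiles over both matrix dimensions has to handle both shapes, because one of its loops then has a single iteration.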
from triton.
@ptillet There are still many failures. I picked one case, shown below:
build/test/test.testbin --gtest_filter=NetTest/2.TestSharedWeightsResume
It fails with the latest ISAAC. If I switch only GEMM to ViennaCL's implementation and keep ISAAC for the other math functions, it passes. For your reference. Thanks.
from triton.
I've fixed some bugs in the master branch. Now all the OpenCL caffe tests pass on my machine.
from triton.
@ptillet I dug into reduce_2d::generate_impl and found a barrier issue.
element_wise_loop_1D(stream, p_.fetch_policy, (reduction_type_==REDUCE_ROWS)?p_.simd_width:1, "r", upper.str(), "$GLOBAL_IDX_1", "$GLOBAL_SIZE_1", device, [&](unsigned int cwidth)
{
...
stream << "#pragma unroll" << std::endl;
stream << "for($SIZE_T stride = " << p_.local_size_0/2 << "; stride >0; stride /=2)" << std::endl;
stream << "{" << std::endl;
stream.inc_tab();
stream << "$LOCAL_BARRIER;" << std::endl;
stream << "if (lidx < stride)" << std::endl;
stream << "{" << std::endl;
stream.inc_tab();
}
As you can see, the $LOCAL_BARRIER sits inside the outer loop. One example of the generated loop header is:
for(unsigned int r = get_global_id(1)*4; r < (M +3)/4*4/4*4; r += get_global_size(1)*4)
Here r depends on the work-item ID, so its value differs between work items, and not all work items run the loop for the same number of iterations. This violates the barrier usage rules defined in the OpenCL spec and causes Intel's OpenCL compiler to hang. Please refer to the following description of the barrier function in the OpenCL spec.
Description
All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier. This function must be encountered by all work-items in a work-group executing the kernel.
If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier.
If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.
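To make the divergence concrete, here is a small host-side C++ sketch (an illustration, not the generated kernel). It assumes a simplified loop of the form `for (r = gid; r < bound; r += global_size)` in one dimension; `trip_count` and `group_is_uniform` are hypothetical helper names.

```cpp
#include <cstddef>

// Number of iterations executed by the work item with global id `gid` for
//   for (r = gid; r < bound; r += global_size) { ... barrier ... }
constexpr std::size_t trip_count(std::size_t gid, std::size_t bound,
                                 std::size_t global_size) {
    return gid < bound ? (bound - gid + global_size - 1) / global_size : 0;
}

// True if every work item of work group `group` runs the loop the same
// number of times, which is what a barrier inside the loop body requires.
constexpr bool group_is_uniform(std::size_t group, std::size_t local_size,
                                std::size_t bound, std::size_t global_size) {
    const std::size_t first = trip_count(group * local_size, bound, global_size);
    for (std::size_t lid = 1; lid < local_size; ++lid)
        if (trip_count(group * local_size + lid, bound, global_size) != first)
            return false;
    return true;
}
```

With local_size = 8, global_size = 16 and bound = 10, work group 1 holds global ids 8 to 15. Ids 8 and 9 run one iteration and ids 10 to 15 run none, so only part of the group would reach the barrier.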
from triton.
Thanks for investigating the generated kernels! The latest commit did introduce a synchronization issue not caught by the unit tests (dammit!). I'll try to fix this tonight.
For reference, here is how ISAAC normally handles this issue:
The upper-bound on the outer loop for r should be rounded up to the next multiple of local_size_1, to ensure that every work group does enter all the iterations of the loop:
for(size_t r = get_global_id(1); r < (M + LOCAL_SIZE_1 - 1)/LOCAL_SIZE_1*LOCAL_SIZE_1; r += get_global_size(1))
All bounds checks are done in this loop in parts that do not include a barrier.
The entire work group enters not only the first loop but also all iterations of the second one:
for(size_t stride = LOCAL_SIZE_0/2; stride >0; stride /=2)
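The rounding scheme described above can be sketched on the host side in C++ as follows. This is a simplified model under stated assumptions, not ISAAC's generator: it assumes global_size is a multiple of local_size, and `round_up`, `LoopStats` and `run_work_item` are hypothetical names.

```cpp
#include <cstddef>

// Smallest multiple of `multiple` that is >= value.
constexpr std::size_t round_up(std::size_t value, std::size_t multiple) {
    return (value + multiple - 1) / multiple * multiple;
}

struct LoopStats {
    std::size_t iterations;  // iterations entered (each would reach the barrier)
    std::size_t in_bounds;   // iterations that pass the r < M bounds check
};

// Emulates the outer loop for one work item with global id `gid`.
constexpr LoopStats run_work_item(std::size_t gid, std::size_t M,
                                  std::size_t local_size,
                                  std::size_t global_size) {
    LoopStats s = {0, 0};
    for (std::size_t r = gid; r < round_up(M, local_size); r += global_size) {
        if (r < M) ++s.in_bounds;  // guarded work: contains no barrier
        ++s.iterations;            // the barrier would be executed here
    }
    return s;
}
```

With M = 10, local_size = 8 and global_size = 16, the bound becomes 16. The work item with global id 10 now enters one iteration like the rest of its group, but its bounds check fails, so it does no work.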
from triton.
Just pushed a quick and dirty fix that seems to resolve the synchronization issues on my machine, at the price of a performance hit. Will investigate and try to get a proper fix ASAP.
from triton.
@ptillet The latest fix works. Thanks for your quick fix.
from triton.
@ptillet Unfortunately, there are still some failures on my machine. One of them is:
build/test/test.testbin --gtest_filter=InnerProductLayerTest/2.TestBackwardTranspose
It passes with 6ac5e1f....
No GPU hang this time.
from triton.
Interesting, the test fails randomly on my machine, about half of the time. Maybe an uninitialized variable... Looking into it...
from triton.
fd5c6d3 seems to have taken care of it. There is one more issue I'm fixing with the Intel OCL Driver -- there's a segfault on deinitialization, and I suspect it causes some tests to crash.
from triton.