Comments (5)
I am actually a little bit puzzled. I was repeatedly running a 600,000 x 100 distance calculation in a function that used neither rm() nor gc():
stressme <- function(){
  iter <- 0
  repeat {
    res <- distance(gpublocks,gpufacilities, method="sqEuclidean")
    result <- res[,]
    iter <- iter + 1
    if (iter > 2000){
      break
    }
    print(iter)
  }
  return("OK")
}
This morning this code stalled the machine after 3 iterations. What happens is that the GUI becomes unresponsive to any interaction. The mouse pointer still moves around but no mouse actions can be triggered, nor does the system accept keyboard inputs. This roughly matched my prior experiences.
Since then I did some additional processing and decided to run the loop with and without rm(), gc() calls. Turns out that it worked in either case. I completed a run with 2000 iterations without problems. I did notice though that the GUI became sluggish. Doing secondary things on the machine became a pain.
I then decided to do some speed comparisons with pdist and regular matrices, not distance and gpuDistances. I used system.time() for 10 iterations, then decided to fire up the gpuR loop above for 10 iterations as well. It stalled right after the first iteration, and I had to reboot.
Only thing in the log I found so far:
Mar 1 16:05:03 Turbine watchdogd[256]: [watchdog_daemon] @(_wd_daemon_service_thread) - service (com.apple.WindowServer) reported as unresponsive
Mar 1 16:05:08 Turbine Safari[792]: tcp_connection_tls_session_error_callback_imp 12 __tcp_connection_tls_session_callback_write_block_invoke.434 error 22
Mar 1 16:05:08 Turbine spindump[394]: Saved userspace_watchdog_timeout.spin report for WindowServer version ??? (???) to /Library/Logs/DiagnosticReports/WindowServer_2016-03-01-160508_Turbine.userspace_watchdog_timeout.spin
Mar 1 16:05:08 Turbine watchdogd[256]: [watchdog_daemon] @(__wd_service_report_unresponsive_block_invoke) - spindump gathered for (com.apple.WindowServer) at (/Library/Logs/DiagnosticReports/WindowServer_2016-03-01-160508_Turbine.userspace_watchdog_timeout.spin)
Mar 1 16:05:28 Turbine watchdogd[256]: [watchdog_daemon] @(_wd_daemon_service_thread) - service (com.apple.WindowServer) reported as unresponsive
Mar 1 16:05:33 Turbine spindump[394]: Saved userspace_watchdog_timeout.spin report for WindowServer version ??? (???) to /Library/Logs/DiagnosticReports/WindowServer_2016-03-01-160533_Turbine.userspace_watchdog_timeout.spin
Mar 1 16:05:33 Turbine watchdogd[256]: [watchdog_daemon] @(__wd_service_report_unresponsive_block_invoke) - spindump gathered for (com.apple.WindowServer) at (/Library/Logs/DiagnosticReports/WindowServer_2016-03-01-160533_Turbine.userspace_watchdog_timeout.spin)
So I am now leaning towards assuming either a hardware defect or an OS problem. I'll see if I can find some other Mac user to replicate this. I'll also run the dist function in a loop to see if that is affected as well. Maybe that will offer some new clues. I'll also run loops with rm() and gc() to see if the problem gets "fixed" that way.
One more question - what size matrices could be reasonably expected to compute successfully and in what time? I did notice some sharp dropoff in performance per distance computed using 600,000 x 100 vs 100,000 x 10.
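As a rough guide to the sizes asked about here, the memory footprint of a dense result matrix can be estimated with simple arithmetic. The sketch below assumes one stored value per block/facility pair and 8 bytes per value for double precision (4 for single precision). `result_mb` is a hypothetical helper, not part of gpuR, and it says nothing about running time.

```r
# Rough size of a dense n x m result matrix in megabytes (1 MB = 1e6 bytes).
# bytes_per_value is an assumption: 8 for double precision, 4 for single.
result_mb <- function(n, m, bytes_per_value = 8) {
  n * m * bytes_per_value / 1e6
}

result_mb(600000, 100)      # 480 MB per result in double precision
result_mb(600000, 100, 4)   # 240 MB per result in single precision
result_mb(100000, 10)       # 8 MB per result in double precision
```

Under these assumptions, a handful of 600,000 x 100 results that have not yet been freed would already occupy more than a gigabyte of device memory.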
from gpur.
At this point I have been able to run code like this
stressme <- function(){
  iter <- 0
  repeat {
    gpublocks <- gpuR::vclMatrix( as.matrix( mblock[sample(1:nrow(mblock), 500000,replace=FALSE),] ))
    gpufacilities <- gpuR::vclMatrix( as.matrix( mblock[sample(1:nrow(mblock), 100,replace=FALSE),] ))
    res <- distance(gpublocks,gpufacilities, method="sqEuclidean")
    result <- res[,]
    #rm(res)
    #gc()
    #rm(result)
    #gc()
    #rm(gpublocks)
    #gc()
    #rm(gpufacilities)
    #gc()
    iter <- iter + 1
    if (iter > 1000){
      break
    }
    print(iter)
  }
  return("OK")
}
for hours at a time with and without the comments around garbage collection. And the only issue I have encountered was this message upon interrupting the code execution:
Error in gpuR::vclMatrix(as.matrix(mblock[sample(1:nrow(mblock), 5e+05, :
error in evaluating the argument 'data' in selecting a method for function 'vclMatrix': Error in base::try(res, TRUE) : object 'res' not found
Again, when not removing objects with rm() and calling gc(), there is a definite impact on GUI responsiveness while executing the loop.
The best tool I have found to monitor the GPU behavior is iStat. From what I can tell the gpuR code is not surprisingly executed on the GPU running the display. Down the road I'll see if I can mess with this.
One odd behavior was that allocating several large matrices using gpuR::vclMatrix() initially showed memory decrease on the GPU. But then memory use seemed to stay constant, indicative of some sort of buffer using main memory. Again, all I have is the iStat report on this.
So all of this may be pointing back to some weird platform issue in my setup.
Regarding size of matrices - I do realize that what I am trying to do is really pushing it. Ideally I would like to run 11,000,000 x (small number, possibly 1 only) distances, keeping the larger matrix in GPU memory. Hence the tests with around 600,000 x 10 pairs. I actually do not want to work with all the distances, only those falling under a certain small threshold. The idea is to see how this performs compared to traditional spatial indexing.
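The thresholding idea described here can be sketched on the CPU in base R. This is an illustration only: it does not use gpuR, the function names `sq_euclid` and `below_threshold` are hypothetical, and the sample data is random. It computes all squared Euclidean distances between the rows of a large matrix and a small one, then keeps only the pairs under a threshold.

```r
# Squared Euclidean distances between rows of `blocks` (n x d) and rows of
# `facilities` (m x d), using |a|^2 + |b|^2 - 2 * a.b
sq_euclid <- function(blocks, facilities) {
  d2 <- outer(rowSums(blocks^2), rowSums(facilities^2), "+") -
    2 * tcrossprod(blocks, facilities)
  pmax(d2, 0)  # clamp tiny negative values caused by rounding
}

# Return only the (block, facility) pairs whose squared distance is below
# `threshold`, as a data frame of row indices and distances.
below_threshold <- function(blocks, facilities, threshold) {
  d2 <- sq_euclid(blocks, facilities)
  hits <- which(d2 < threshold, arr.ind = TRUE)
  data.frame(block = hits[, "row"],
             facility = hits[, "col"],
             sq_dist = d2[hits])
}

set.seed(1)
blocks <- matrix(runif(2000), ncol = 2)      # 1000 random points
facilities <- matrix(runif(10), ncol = 2)    # 5 random points
near <- below_threshold(blocks, facilities, threshold = 0.01)
head(near)
```

Note that this version still materializes the full n x m matrix before filtering, so it does not by itself reduce peak memory; it only shows the shape of the result that would be compared against a spatial index.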
Maybe it is time to close this issue? At least until more concrete evidence of a bug in gpuR comes up again?
from gpur.
@vsmaier I see now. I was mistaken in thinking you were doing all pairwise comparisons for the 600,000 matrix. It shouldn't be a problem with the comparisons between the 600K and 100 objects. I have made some minor changes to more efficiently handle temporary objects within the dist/distance functions. However, you still will likely need to use the rm() and gc() functions.
I have duplicated the memory issue using your stressme function. My GPU quickly runs out of memory after 3 or 4 iterations. The distance computation is simply processing too fast for the R garbage collector to run in between runs. Explicitly removing the object and calling garbage collection frees up the GPU memory. I have confirmed this using watch nvidia-smi (specific to NVIDIA GPUs memory monitoring) on my Ubuntu system. However, once I add in the rm() and gc() it runs without problem through 2000 iterations.
stressme <- function(){
  iter <- 0
  repeat {
    res <- distance(gpublocks,gpufacilities, method="sqEuclidean")
    result <- res[,]
    rm(res)
    gc()
    iter <- iter + 1
    if (iter > 2000){
      break
    }
    print(iter)
  }
  return("OK")
}
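The reason rm() alone is not enough can be illustrated in base R with a finalizer, which R runs only when the garbage collector actually collects an object. This is an analogy under stated assumptions: the environment below is a stand-in for a device buffer, `make_buffer` and `on_free` are hypothetical names, and whether gpuR releases GPU memory through exactly this mechanism is assumed rather than confirmed here.

```r
released <- 0  # counts how many stand-in "buffers" have been freed

# Cleanup hook, standing in for code that would release device memory.
on_free <- function(e) released <<- released + 1

# Create a stand-in buffer whose cleanup runs only at garbage collection.
make_buffer <- function() {
  buf <- new.env()
  reg.finalizer(buf, on_free)
  buf
}

res <- make_buffer()
rm(res)          # drops the last reference; cleanup is not guaranteed to have run yet
invisible(gc())  # forces a collection, which runs the pending finalizer
released         # now counts the buffer freed above
```

In a tight loop, several such objects can pile up between collections, which is consistent with the GPU filling up after a few iterations unless gc() is called explicitly.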
The watchdog errors in the Turbine log are, however, outside my expertise. I am primarily a Linux guy. I work with Macs and try to make sure there is support, but that is the extent of my knowledge. That is something you will need to troubleshoot elsewhere. I would be happy to hear about any solution though.
Let me know if the most recent changes work (I have only applied them to vclMatrix classes ATM). If this fixes things I will begin a few more updates.
Also, with AMD you may be able to check GPU memory usage with something like aticonfig --odgc --odgt which I found here. Just a thought to perhaps help.
from gpur.
@vsmaier did this address your problem? If I don't hear back I will assume the issue has been addressed and I will close this issue as it is fully operational on my end.
from gpur.
Yes, please close the issue. Forcing collection seems to be sufficient.
Thanks
from gpur.