
Comments (39)

sonots avatar sonots commented on May 22, 2024

I will look into this later.

from cupy.

sonots avatar sonots commented on May 22, 2024

Can you paste code to reproduce this?

from cupy.

hiro4bbh avatar hiro4bbh commented on May 22, 2024

Sorry, my code causing the problem is for my research, so I can't paste it.
I have no idea which part causes this problem.
So, if there is any way to debug this, please tell me and I will try to solve this.

from cupy.

sonots avatar sonots commented on May 22, 2024

In Cython, import pdb; pdb.set_trace() is not available, so just use print debugging. You can build cupy like this:

pip uninstall cupy  # just to make sure another cupy is not installed
python setup.py install

and run your program.

from cupy.

sonots avatar sonots commented on May 22, 2024

Hmm, I cannot reproduce it.

import cupy

cupy.cuda.set_allocator(cupy.cuda.MemoryPool().malloc)
x = cupy.array([1,2,3,4], dtype=cupy.float64)
y = cupy.array([1,2,3,4], dtype=cupy.float64)
z = float(cupy.sum((x - y)**2, dtype=cupy.float64))

from cupy.

sonots avatar sonots commented on May 22, 2024

Let me take a note. cupy.cuda.runtime.free should be invoked only when free_all_blocks is invoked

self.free_all_blocks()
(free_all_blocks removes references from _free list, then Memory.__dealloc__ is called)

from cupy.

hiro4bbh avatar hiro4bbh commented on May 22, 2024

Thank you for your rapid replies.

I will try to create PoC code from my framework and do print debug tomorrow on PC in our lab (please wait ...).

I realized that:

  • Normal forward steps in training do not seem to trigger it.
  • Permuting the rows of a design matrix (computing a permutation index array with numpy.random.permutation and indexing a cupy array with it) seems to trigger it (causing cudaErrorIllegalAddress?).
  • Adam updates in backward steps seem to trigger it (without causing cudaErrorIllegalAddress).

from cupy.

sonots avatar sonots commented on May 22, 2024

The traceback line

  File "cupy\cuda\memory.pyx", line 358, in cupy.cuda.memory.PooledMemory.free
TypeError: 'NoneType' object is not callable

tells us that self.pool, which should be a weakref.ref to a SingleDeviceMemoryPool, was None at this line:

pool = self.pool()

347     def __dealloc__(self):
348         if self.ptr != 0:
349             self.free()
350
351     cpdef free(self):
352         """Frees the memory buffer and returns it to the memory pool.
353
354         This function actually does not free the buffer. It just returns the
355         buffer to the memory pool for reuse.
356
357         """
358         pool = self.pool()
359         if pool and self.ptr != 0:
360             pool.free(self.ptr, self.size)
361         self.ptr = 0
362         self.size = 0
363         self.device = None

According to http://cython.readthedocs.io/en/latest/src/userguide/special_methods.html#finalization-method-dealloc,

You need to be careful what you do in a __dealloc__() method. By the time your __dealloc__() method is called, the object may already have been partially destroyed and may not be in a valid state as far as Python is concerned, so you should avoid invoking any Python operations which might touch the object. In particular, don't call any other methods of the object or do anything which might cause the object to be resurrected. It's best if you stick to just deallocating C data.

so it looks like the pool object can potentially be None in __dealloc__().

This looks like a potential bug which also exists in cupy 1.0.1.
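
A minimal pure-Python sketch of a more defensive free(), for readers following the reasoning above. It is not the cupy implementation: Pool and PooledBlock are hypothetical stand-ins, and the guard only assumes that the weak reference attribute itself may already be None, or that its referent may already be gone, when free() runs.

```python
import weakref


class Pool:
    """Hypothetical stand-in for a memory pool; records returned blocks."""

    def __init__(self):
        self.returned = []

    def free(self, ptr, size):
        self.returned.append((ptr, size))


class PooledBlock:
    """Hypothetical stand-in for a pooled allocation holding a weak pool reference."""

    def __init__(self, pool, ptr, size):
        self.pool = weakref.ref(pool)
        self.ptr = ptr
        self.size = size

    def free(self):
        pool_ref = self.pool
        # Guard 1: the attribute holding the weakref may already be cleared.
        # Guard 2: calling a weakref returns None once the referent is gone.
        pool = pool_ref() if pool_ref is not None else None
        if pool is not None and self.ptr != 0:
            pool.free(self.ptr, self.size)
        self.ptr = 0
        self.size = 0
```

With both guards, free() degrades to a no-op instead of raising TypeError when the pool reference is unavailable. Whether silently skipping the return is acceptable depends on who owns the underlying memory, which this sketch does not model.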

from cupy.

sonots avatar sonots commented on May 22, 2024

BTW: When I put raise RuntimeError like below:

351     cpdef free(self):
352         """Frees the memory buffer and returns it to the memory pool.
353
354         This function actually does not free the buffer. It just returns the
355         buffer to the memory pool for reuse.
356
357         """
358         raise RuntimeError('yay!')
359         pool = self.pool()

I got the following error messages, which are similar to the reported ones:

Exception ignored in: 'cupy.cuda.memory.PooledMemory.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/memory.pyx", line 358, in cupy.cuda.memory.PooledMemory.free (cupy/cuda/memory.cpp:7482)
RuntimeError: yay!
Exception ignored in: 'cupy.cuda.memory.Memory.__dealloc__'
Traceback (most recent call last):
  File "cupy/cuda/runtime.pyx", line 222, in cupy.cuda.runtime.free (cupy/cuda/runtime.cpp:3776)
  File "cupy/cuda/runtime.pyx", line 130, in cupy.cuda.runtime.check_status (cupy/cuda/runtime.cpp:2262)
cupy.cuda.runtime.CUDARuntimeError: cudaErrorIllegalAddress: an illegal memory access was encountered

Now, I am wondering how to fix this problem because cython's cdef classes do not have __del__().

from cupy.

sonots avatar sonots commented on May 22, 2024

@hiro4bbh I wrote an experimental patch master...sonots:fix_317. This patch uses __del__() instead of __dealloc__() to release an object. Could you try it?

Build as:

git remote add sonots https://github.com/sonots/cupy
git remote update
git checkout -b fix_317 sonots/fix_317
git clean -fdx
python setup.py install

from cupy.

hiro4bbh avatar hiro4bbh commented on May 22, 2024

Thank you for your patch.

I applied the patch as you described, and then I got the following error message many times:

Traceback (most recent call last):
  File "cupy\cuda\memory.pyx", line 349, in cupy.cuda.memory.PooledMemory.__del__
  File "cupy\cuda\memory.pyx", line 360, in cupy.cuda.memory.PooledMemory.free
  File "cupy\cuda\memory.pyx", line 485, in cupy.cuda.memory.SingleDeviceMemoryPool.free
  File "cupy\cuda\memory.pyx", line 501, in cupy.cuda.memory.SingleDeviceMemoryPool.free
ValueError: list.remove(x): x not in list
Exception ignored in: <bound method PooledMemory.__del__ of <cupy.cuda.memory.PooledMemory object at 0x0000020C17A87A20>>

The number of times this error happens changes on each run... Are some free lists corrupted by some chunk operations? I think there are no multithreaded operations...

When cupy.cuda.set_allocator is called, some calculations fail (NaNs appear, or the loss does not decrease in some cases when the random permutations change). cupy.cuda.set_pinned_memory_allocator accelerates the calculations, and some of the error messages no longer appear. Does the pinned memory allocator affect this problem?

I will look into the details while preparing PoC code.

from cupy.

sonots avatar sonots commented on May 22, 2024

Thank you for trying. Hmm, I will investigate.

cupy.cuda.set_pinned_memory_allocator is used to cache pinned host (CPU) memory, not GPU memory. cupy.cuda.memory is not a module for pinned memory, so the pinned memory allocator is probably not related to this problem.

from cupy.

sonots avatar sonots commented on May 22, 2024

It looks like __del__() is working, whereas __dealloc__() was not working well. That is progress, although we still have another strange behavior.

from cupy.

hiro4bbh avatar hiro4bbh commented on May 22, 2024

When I implement Adam with cupy.ElementwiseKernel, some of the error messages go away, though they still appear occasionally... Furthermore, when I extract the code from my framework, no error happens at all...

Maybe this error comes from the free-list manipulation performed on memory allocation/deallocation, so it would be difficult to write stable PoC code succinctly (some parts of my framework may affect it).

I couldn't create PoC code yet, but I will keep trying and inspect the implementation.

from cupy.

hiro4bbh avatar hiro4bbh commented on May 22, 2024

While trying to create stable PoC code, I realized that cudaErrorIllegalAddress no longer happens. I think that cupy should use __dealloc__ only for pure-C data structures, so @sonots's patch is helpful. Thank you.

In some cases, when cupy.ElementwiseKernel is used, no error happens, so I suspect that some small allocations may corrupt the free list (x * y causes one allocation for the result, etc.).

from cupy.

sonots avatar sonots commented on May 22, 2024

Let me make sure. Do you mean you still get ValueError: list.remove(x): x not in list although you do not get cudaErrorInvalidDevicePointer?

from cupy.

hiro4bbh avatar hiro4bbh commented on May 22, 2024

Yes. I got ValueError: list.remove(x): x not in list sometimes (I can't figure out the pattern...). However, I didn't get cudaErrorInvalidDevicePointer anymore.

from cupy.

sonots avatar sonots commented on May 22, 2024

Okay, thanks.

from cupy.

sonots avatar sonots commented on May 22, 2024

@hiro4bbh could you do me a favor?

I added debug prints in sonots@6a6732a (this commit is pushed to the sonots/fix_317 branch).

Could you run your program with this and paste the result? Please note that the output will be huge, so pasting it in a separate gist would be better. If the log is too large to paste, it is okay to filter it down to only the "malloc" and "free" lines.

from cupy.

hiro4bbh avatar hiro4bbh commented on May 22, 2024

Thanks for your patch.
I will try your patch next week, because I can't use a CUDA PC until then.
Sorry for my late reply...

from cupy.

hiro4bbh avatar hiro4bbh commented on May 22, 2024

I tried @sonots patch on a CUDA machine.

I got the exceptions (fix_317_failed_stdout.txt) and fix_317 logs (about 1MB, fix_317_failed_malloc_free.txt). The links are at my gist.

I think that we can ignore RuntimeError: reentrant call inside <_io.BufferedWriter name='<stdout>'>. Thus the problem is only ValueError: list.remove(x): x not in list.

from cupy.

sonots avatar sonots commented on May 22, 2024

Thanks! But, it seems the last line of fix_317_failed_malloc_free.txt is broken like:

fix_317 free(ptr=81726034432, si

Were you able to paste the entire log up to the last line, where the error occurred?
Hmm, Gist may not be a good place to paste it.
Can you send the logs via email? My email address is available here: https://github.com/sonots.

from cupy.

sonots avatar sonots commented on May 22, 2024

One more thing. I changed the log line of malloc as:

fix_317 malloc(size=512) ptr=38930 PooledMemory=<cupy.cuda.memory.PooledMemory object at 0x7f88482002e8>

Could you pull fix_317 branch again? Thank you for your cooperation.

from cupy.

sonots avatar sonots commented on May 22, 2024

Thank you for your email!!

from cupy.

sonots avatar sonots commented on May 22, 2024

In the logs you sent via email, I could not find ValueError: list.remove(x): x not in list:

C:\Program Files\Python36\lib\importlib\_bootstrap.py:205: ImportWarning: can't resolve package from __spec__ or __package__, falling back on __name__ and __path__
  return f(*args, **kwds)
Exception ignored in: <bound method PooledMemory.__del__ of <cupy.cuda.memory.PooledMemory object at 0x00000204360EF2E8>>
Traceback (most recent call last):
  File "cupy\cuda\memory.pyx", line 349, in cupy.cuda.memory.PooledMemory.__del__
  File "cupy\cuda\memory.pyx", line 360, in cupy.cuda.memory.PooledMemory.free
  File "cupy\cuda\memory.pyx", line 491, in cupy.cuda.memory.SingleDeviceMemoryPool.free
  File "cupy\cuda\memory.pyx", line 492, in cupy.cuda.memory.SingleDeviceMemoryPool.free
RuntimeError: reentrant call inside <_io.BufferedWriter name='<stdout>'>
Exception ignored in: <bound method PooledMemory.__del__ of <cupy.cuda.memory.PooledMemory object at 0x0000020436696F28>>
Traceback (most recent call last):
  File "cupy\cuda\memory.pyx", line 349, in cupy.cuda.memory.PooledMemory.__del__
  File "cupy\cuda\memory.pyx", line 360, in cupy.cuda.memory.PooledMemory.free
  File "cupy\cuda\memory.pyx", line 491, in cupy.cuda.memory.SingleDeviceMemoryPool.free
  File "cupy\cuda\memory.pyx", line 500, in cupy.cuda.memory.SingleDeviceMemoryPool.free
RuntimeError: reentrant call inside <_io.BufferedWriter name='<stdout>'>

Did you get the ValueError actually?

from cupy.

sonots avatar sonots commented on May 22, 2024

This is just a progress note. I tried to reproduce the problem by generating Python code from the logs, as below:

import re

print('import cupy')
print('pool = cupy.cuda.MemoryPool()')

for line in open('fix_317_failed_all.txt', 'r'):
    # fix_317 malloc(size=512) ptr=81719733760 PooledMemory=<cupy.cuda.memory.PooledMemory object at 0x0000020435305BE0>
    if line.startswith('fix_317 malloc'):
        line = line.replace('fix_317 malloc(', '')
        line = line.replace(')', '')
        line = re.sub(r'PooledMemory object.*$', '', line)
        items = line.split(' ')
        d = {}
        for item in items:
            k, v = item.split('=')
            d[k] = v
        print('m{} = pool.malloc({})'.format(d['ptr'], d['size']))
    # fix_317 free(ptr=81723916288, size=24064)
    elif line.startswith('fix_317 free'):
        line = line.replace('fix_317 free(', '')
        line = line.replace(')', '')
        items = line.split(', ')
        d = {}
        for item in items:
            k, v = item.split('=')
            d[k] = v
        print('del m{}'.format(d['ptr']))

The generated code is:

import cupy
pool = cupy.cuda.MemoryPool()
m81719721984 = pool.malloc(512)
m81719722496 = pool.malloc(512)
m81719723008 = pool.malloc(512)
m81719723520 = pool.malloc(512)
m81719724032 = pool.malloc(512)
m81721819136 = pool.malloc(12288)
m81719724544 = pool.malloc(512)
m81721831424 = pool.malloc(12288)
m81723916288 = pool.malloc(24064)
m81719725056 = pool.malloc(512)
del m81723916288
m81723916288 = pool.malloc(512)
del m81719725056
m81719725056 = pool.malloc(512)
m81723916800 = pool.malloc(512)
[omitted]

But I still cannot reproduce it.

from cupy.

hiro4bbh avatar hiro4bbh commented on May 22, 2024

Sorry, I think that I extracted the log of not-problematic code. I will extract the log of the problematic one.

Please wait a moment...

from cupy.

sonots avatar sonots commented on May 22, 2024

Thank you for new logs.

from cupy.

sonots avatar sonots commented on May 22, 2024

Hmm, unfortunately, I could not reproduce it from the replay. I will investigate more.

from cupy.

sonots avatar sonots commented on May 22, 2024

$ grep -C 2 -n '81721835008' ~/fix_317_failed_all.txt

1349:fix_317   [pop best-fit free_list] ptr=81721835008 size=512 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0A7BC10>
1350:fix_317   [split size=512] ptr=81721835008 size=512 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0A7BC10>
1351:fix_317   [push in_use] ptr=81721835008 size=512 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0A7BC10>
1352-fix_317 malloc(size=512) ptr=81721835008 PooledMemory=<cupy.cuda.memory.PooledMemory object at 0x00000240B0A44A20>
--
2121-fix_317 free(ptr=81721834496, size=512)
2122-fix_317   [pop in_use] ptr=81721834496 size=512 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0A7B3F0>
2123:fix_317   [remove next free_list] ptr=81721835008 size=512 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0A7BC10>
2124-fix_317 free(ptr=81721833984, size=512)

This log shows that 81721835008 was in_use, but free(ptr=81721834496, size=512) tried to remove 81721835008 from free_list, and the error occurred.

I tried to reproduce this in my environment, but I still cannot.
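
To make the failure mode concrete, here is a toy best-fit pool with chunk splitting and neighbor merging. It is a sketch under assumptions, not cupy's code: Chunk and ToyPool are hypothetical names, and the structure is only inferred from the log labels ([pop best-fit free_list], [split], [remove next free_list], [merged]). It shows that free() removes a neighbor from free_list whenever it believes the neighbor is free, so any disagreement between that belief and the list contents surfaces as ValueError: list.remove(x): x not in list.

```python
class Chunk:
    """A contiguous region; prev/next link neighbors in address order."""

    def __init__(self, ptr, size):
        self.ptr = ptr
        self.size = size
        self.prev = None
        self.next = None
        self.in_use = False


class ToyPool:
    def __init__(self, base, size):
        self.free_list = [Chunk(base, size)]
        self.in_use = {}

    def malloc(self, size):
        candidates = [c for c in self.free_list if c.size >= size]
        if not candidates:
            raise MemoryError("toy pool exhausted")
        chunk = min(candidates, key=lambda c: c.size)  # best fit
        self.free_list.remove(chunk)
        if chunk.size > size:
            # Split: the tail becomes a new free chunk right after this one.
            rest = Chunk(chunk.ptr + size, chunk.size - size)
            rest.prev, rest.next = chunk, chunk.next
            if chunk.next is not None:
                chunk.next.prev = rest
            chunk.next = rest
            chunk.size = size
            self.free_list.append(rest)
        chunk.in_use = True
        self.in_use[chunk.ptr] = chunk
        return chunk.ptr

    def free(self, ptr):
        chunk = self.in_use.pop(ptr)
        chunk.in_use = False
        nxt = chunk.next
        if nxt is not None and not nxt.in_use:
            # Raises ValueError if nxt is believed free but is not in the list.
            self.free_list.remove(nxt)
            chunk.size += nxt.size
            chunk.next = nxt.next
            if nxt.next is not None:
                nxt.next.prev = chunk
        prv = chunk.prev
        if prv is not None and not prv.in_use:
            self.free_list.remove(prv)
            prv.size += chunk.size
            prv.next = chunk.next
            if chunk.next is not None:
                chunk.next.prev = prv
            chunk = prv
        self.free_list.append(chunk)
```

In normal use the two bookkeeping structures stay in sync and freed neighbors merge back into one chunk. If something flips a neighbor's state between the check and the removal (for example another malloc or free running in the middle of this one), the remove call fails in the way the log above suggests.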

from cupy.

sonots avatar sonots commented on May 22, 2024

@hiro4bbh could you tell me which Python and Cython versions you used?

from cupy.

sonots avatar sonots commented on May 22, 2024

I think there is no multithreaded operations...

Are you actually running in multiple threads? I found weird logs like the following:

  1684 fix_317 malloc(size=512) ptr=81723928576 PooledMemory=<cupy.cuda.memory.PooledMemory object at 0x00000240B0A44D68>
  1685 fix_317   [pop best-fit free_list] ptr=81723929088 size=11264 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0C31798>
  1686 fix_317   [split size=512] ptr=81723929088 size=11264 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0C31798>fix_317 free(ptr=81721839616, size=512)
  1687 fix_317   [pop in_use] ptr=81721839616 size=512 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0A7A4C0>
  1688 fix_317   [remove prev free_list] ptr=81721835520 size=4096 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0C31A08>
  1689 fix_317   [merged] ptr=81721835520 size=4608 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0C31CE0>
  1690 fix_317   [push free_list] ptr=81721835520 size=4608 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0C31CE0>
  1691 fix_317 free(ptr=81721840128, size=512)
  1692 fix_317   [pop in_use] ptr=81721840128 size=512 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0A7A938>
  1693 fix_317   [remove prev free_list] ptr=81721835520 size=4608 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0C31CE0>
  1694 fix_317   [merged] ptr=81721835520 size=5120 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0C31D48>
  1695 fix_317   [push free_list] ptr=81721835520 size=5120 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0C31D48>
  1696 fix_317 free(ptr=81721840640, size=512)
  1697 fix_317   [pop in_use] ptr=81721840640 size=512 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0C31118>
  1698 fix_317   [remove prev free_list] ptr=81721835520 size=5120 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0C31D48>
  1699 fix_317   [merged] ptr=81721835520 size=5632 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0C31DB0>
  1700 fix_317   [push free_list] ptr=81721835520 size=5632 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0C31DB0>
  1701
  1702 fix_317   [push remaining free_list] ptr=81723929600 size=10752 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0A7AEE8>
  1703 fix_317   [push in_use] ptr=81723929088 size=512 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0A7AD48>
  1704 fix_317 malloc(size=512) ptr=81723929088 PooledMemory=<cupy.cuda.memory.PooledMemory object at 0x00000240AE7B56A0>

where

  1685 fix_317   [pop best-fit free_list] ptr=81723929088 size=11264 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0C31798>
  1686 fix_317   [split size=512] ptr=81723929088 size=11264 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0C31798>fix_317 free(ptr=81721839616, size=512)
  1702 fix_317   [push remaining free_list] ptr=81723929600 size=10752 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0A7AEE8>
  1703 fix_317   [push in_use] ptr=81723929088 size=512 chunk=<cupy.cuda.memory.Chunk object at 0x00000240B0A7AD48>
  1704 fix_317 malloc(size=512) ptr=81723929088 PooledMemory=<cupy.cuda.memory.PooledMemory object at 0x00000240AE7B56A0>

are consecutive logs.
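
Interleaving like this can be searched for mechanically. A minimal sketch, assuming only that every debug line starts with a single fix_317 marker as in the excerpts above; find_interleaved is a hypothetical helper name, not part of the patch.

```python
MARKER = "fix_317"  # log prefix used in the excerpts above


def find_interleaved(lines):
    """Return (line_number, text) pairs where the marker appears more than once.

    A second marker on one line means another log call started writing
    before the previous line was terminated.
    """
    hits = []
    for number, text in enumerate(lines, start=1):
        if text.count(MARKER) > 1:
            hits.append((number, text.rstrip("\n")))
    return hits
```

It can be run over an open log file, for example find_interleaved(open(path)) where path points at the captured log. A hit only shows that two log calls overlapped; it does not say whether they came from different threads or from a nested call in the same thread.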

from cupy.

sonots avatar sonots commented on May 22, 2024

Added codes to print thread_ids on the fix_317 branch.

from cupy.

sonots avatar sonots commented on May 22, 2024

hiro4bbh says...

Python on Windows 10 x64 is version 3.6.2, and Cython is version 0.26.

Also, I got logs from hiro4bbh-san. It seems only one thread was running when the error occurred. The later part of the logs shows a different thread in use, but that is probably not related to the error.

from cupy.

sonots avatar sonots commented on May 22, 2024

I tried with the same Python and Cython versions, but I could not reproduce it. I now suspect the Windows environment, but I do not have one ...

from cupy.

sonots avatar sonots commented on May 22, 2024

I am not sure whether this helps, but I made a thread-safe implementation at master...sonots:fix_317. Can you try it?
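
As an illustration of what "thread-safe" can mean for free-list bookkeeping, here is a minimal sketch that serializes every list operation with a lock. The thread does not show how the actual patch does it, so LockedFreeList and its methods are hypothetical.

```python
import threading


class LockedFreeList:
    """Hypothetical free list whose operations are serialized by one lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._free = []

    def push(self, ptr):
        with self._lock:
            self._free.append(ptr)

    def remove(self, ptr):
        """Remove ptr if present; return whether it was there.

        The membership check and the removal happen under the same lock,
        so no other thread can change the list between the two steps.
        """
        with self._lock:
            if ptr in self._free:
                self._free.remove(ptr)
                return True
            return False

    def __len__(self):
        with self._lock:
            return len(self._free)
```

A plain lock like this only protects against calls from different threads. It does not help if a pool operation is re-entered from the same thread while the lock is held, which would need a different design.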

from cupy.

hiro4bbh avatar hiro4bbh commented on May 22, 2024

I tried several cases to reproduce the bug with your previous patch, but I couldn't reproduce it.

I will try your latest patch. If there is no problem, I will use that version. I will report how your latest patch works.

Thank you for your patch!

from cupy.

hiro4bbh avatar hiro4bbh commented on May 22, 2024

I confirmed that your latest patch didn't fail. I couldn't reproduce the bug.

Thank you!

from cupy.

sonots avatar sonots commented on May 22, 2024

Fixed via #381 and #382.

from cupy.
