
Comments (14)

mzwiessele commented on August 15, 2024

Yes, that is right. We are making use of the data attribute for in-memory views of parameters. It seems that this will not be possible in the future. This also appears to be the source of the memory issues reported on GPy (SheffieldML/GPy#341, SheffieldML/GPy#51), though this is not confirmed.

Not using this in-memory view would significantly increase turnaround time during optimization, as getting and setting parameters would have to go through Python loops.

Can you think of a fix for this issue, or do you know of another way to get in-memory views of arrays?

See stackoverflow discussions about this:
https://stackoverflow.com/questions/23650796/numpy-set-array-memory
https://stackoverflow.com/questions/39446199/in-numpy-how-do-i-set-array-bs-data-to-reference-the-data-of-array-a
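For concreteness, the pattern in question looks roughly like this (illustrative names only; on newer numpy releases the assignment may already warn or fail outright):

import numpy as np

# A flat buffer owned by a parent object, and a child array that should act
# as an in-memory view onto part of it.
parent = np.arange(6, dtype=float)
child = np.zeros(3)

# The deprecated pattern: repoint `child` at `parent`'s memory by reassigning
# its .data attribute - this is what triggers the "inherently unsafe" warning.
child.data = parent[3:].data

child[0] = 42.0
print(parent)  # writes through `child` are now visible in `parent`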

g-bassetto commented on August 15, 2024

An alternative idea would be to decouple the description of a parameter from its actual data storage.
The external interface of a Param object would still need to match that of ndarray, but internally it would delegate all operations to a proper ndarray. When the parameter is connected to some parent object, all we would need to do is replace its internal storage with the appropriate view into the parent's storage (taking care to copy the old values to the new storage before discarding the old one).
This solution would keep the advantage of the old system, which avoids looping through parameters because all operations still happen in memory, while getting rid of the deprecation warning. I am willing to give it a try.
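A rough sketch of that idea (ParamProxy and rebind are hypothetical names, not the existing paramz API):

import numpy as np

class ParamProxy:
    """Hypothetical parameter object that delegates storage to an ndarray."""

    def __init__(self, values):
        self._storage = np.array(values, dtype=float)

    # The ndarray-like interface forwards to whatever storage is current.
    def __getitem__(self, item):
        return self._storage[item]

    def __setitem__(self, item, value):
        self._storage[item] = value

    def rebind(self, new_storage):
        """Swap the backing storage, copying the current values into it first."""
        new_storage = np.asarray(new_storage)
        assert new_storage.shape == self._storage.shape
        new_storage[...] = self._storage   # preserve the old values
        self._storage = new_storage        # discard the old buffer

# When the parameter is connected to a parent, its storage becomes a view
# into the parent's contiguous buffer, so operations stay in memory.
parent_buffer = np.zeros(5)
p = ParamProxy([1.0, 2.0])
p.rebind(parent_buffer[3:])
p[0] = 7.0
print(parent_buffer)  # [0. 0. 0. 7. 2.]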

darthdeus commented on August 15, 2024

Is there any reasonable workaround at the moment? Or should we just silence the warning in the meantime?
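For example, assuming it is raised as a plain DeprecationWarning (worth checking against the exact message your numpy version emits), something like this would hide just that warning:

import warnings

# Suppress only numpy's warning about assigning to ndarray.data, leaving all
# other DeprecationWarnings visible; adjust the pattern to the message text
# your numpy version actually produces.
warnings.filterwarnings(
    "ignore",
    message=r".*[Aa]ssigning the 'data' attribute.*",
    category=DeprecationWarning,
)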

mzwiessele commented on August 15, 2024

darthdeus commented on August 15, 2024

@mzwiessele Is there an upstream issue in numpy that we're waiting on to be fixed, so it can be tracked here?

mzwiessele commented on August 15, 2024

g-bassetto commented on August 15, 2024

Hi there,

I think I have worked out a possible solution for this problem, but I am not sure it is easy to integrate into the existing framework.
The idea is very simple: enrich an np.ndarray with a field to which, if set, all calls are delegated; if the field is not set, the array falls back to invoking the method on itself.
I paste a short example below to explain what I mean:

import numpy as np
import logging
logging.basicConfig(level=logging.DEBUG)

BUFFER = '__my_buffer__'

class MyArray(np.ndarray):
    
    @property
    def data(self):
        try:
            return getattr(self, BUFFER)
        except AttributeError:
            return super().data
        
    @data.setter
    def data(self, value):
        value = np.asarray(value)
        logging.debug(f'self: {self}; other: {value}')
        assert value.shape == self.shape
        value[...] = self[...]
        logging.debug(f'self: {self}; other: {value}')
        setattr(self, BUFFER, value)
        logging.debug(f'self: {self}; other: {value}')
        
    def __getitem__(self, item):
        if getattr(self, BUFFER, self) is self:
            return super().__getitem__(item)
        else:
            return getattr(self, BUFFER).__getitem__(item)
            
    def __setitem__(self, item, value):
        if getattr(self, BUFFER, self) is self:
            return super().__setitem__(item, value)
        else:
            getattr(self, BUFFER).__setitem__(item, value)
                  
    def __iadd__(self, other):
        if getattr(self, BUFFER, self) is self:
            return super().__iadd__(other)
        else:
            getattr(self, BUFFER).__iadd__(other)
            return self
        
    # ... and so on

buf = np.ones(5)
arr = np.zeros(3).view(MyArray)
assert np.all(arr == 0)
# arr.data = buf[1:2]  # raises assertion error because the sizes are different
arr.data = buf[2:]  # this is ok because the sizes match
assert np.all(arr[:] == 0) 
assert np.all(arr == 0)
assert np.all(buf[:2] == 1) and np.all(buf[2:] == 0)

arr += 2
assert np.all(arr[:] == 2)  # this is ok
# assert np.all(arr == 2)   # this is not ok
assert np.all(buf == [1, 1, 2, 2, 2])

Although this seems reasonable to me, I am not sure it won't break some internal mechanics of the framework, of which I honestly don't have an in-depth understanding yet (I get the overall mechanics of the system, but I am still trying to wrap my head around the logic behind the implementation of IndexOperations).

ekalosak commented on August 15, 2024

This is my impression, but it could use some correction:

The issue appears to boil down to having a pointer "*A" (the ndarray.data attribute) to the first element of the array that gets changed (to "*B") while other objects may hold views of or pointers to the original *A. [1] The memory at the location indicated by the first pointer, now no longer referenced by the original ndarray, may be garbage collected. Objects holding copies of the *A pointer side-step Python's built-in memory management - which can cause a segfault when *A is dereferenced. This is why we're given that "inherently unsafe" warning: we may get a segfault if another object holding the original pointer *A tries to access memory at that location after we've reassigned the array's contents to memory starting at *B.
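For comparison, the mechanism that makes ordinary numpy views safe is exactly the reference described in [1]: a view stores a reference to its base array, which keeps the underlying memory alive, and reassigning .data side-steps that bookkeeping. A small illustration:

import numpy as np

a = np.arange(6)
b = a[2:]            # `b` is a view and does not own its memory

print(b.base is a)   # True: the view keeps a reference to `a`, so `a`'s
                     # buffer cannot be garbage collected while `b` exists

del a                # dropping the name is fine; `b.base` still holds on
b[0] = 99
print(b)             # [99  3  4  5]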

It took some reading to understand @g-bassetto 's suggestion, and it makes some sense to me: essentially, we should provide a wrapper around the ndarray.data buffer that keeps it stable while allowing us to get and set the array's contents. Does anyone see any issues with this approach? Not only would it get rid of the warning, it appears to address the root cause of the warning by more carefully handling the .data attribute.

The only change I might propose is the addition of some initialization of the buffer that would do away with the if getattr(self, BUFFER, self) is self conditions replicated across the magic methods.
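Something along these lines, for instance (only a sketch; MyArray2 and _buffer are made-up names, and it relies on __array_finalize__ running for every new instance, including views, so the delegate is always present):

import numpy as np

class MyArray2(np.ndarray):
    """Variant of MyArray above with the delegate always initialized, so the
    magic methods never need the getattr(self, BUFFER, self) is self check."""

    def __array_finalize__(self, obj):
        # Called for every new instance (explicit construction, views, ...).
        # Start out delegating to a plain ndarray view of our own buffer;
        # assigning .data later simply swaps this delegate for another array.
        self._buffer = self.view(np.ndarray)

    @property
    def data(self):
        return self._buffer

    @data.setter
    def data(self, value):
        value = np.asarray(value)
        assert value.shape == self.shape
        value[...] = self._buffer[...]   # carry the current values over
        self._buffer = value

    def __getitem__(self, item):
        return self._buffer[item]

    def __setitem__(self, item, value):
        self._buffer[item] = value

    def __iadd__(self, other):
        self._buffer += other
        return self
    # ... and so on

buf = np.ones(5)
arr = np.zeros(3).view(MyArray2)
arr.data = buf[2:]
arr += 2
print(buf)  # [1. 1. 2. 2. 2.], mirroring the example above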

Sources:

  1. https://numpy.org/doc/stable/reference/arrays.interface.html specifically "A reference to the object exposing the array interface must be stored by the new object if the memory area is to be secured."

Edit:
@mzwiessele I'm not terribly sure I understand why the ndarray.data attribute is being set at all - why not simply update the arrays themselves? Is the desire to keep a contiguous memory map of the parameters' arrays?

ekalosak commented on August 15, 2024

Also, the link to the numpy mailing list thread above is broken for me, here's a currently working link for anyone else having the same issue: http://numpy-discussion.10968.n7.nabble.com/deprecating-assignment-to-ndarray-data-td42176.html

ekalosak commented on August 15, 2024

Why not do an in-place modification of the memory underlying the array, e.g.

import numpy as np

a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# forgive the crude approach: copy b's elements into a's buffer one at a
# time through the memoryview exposed by ndarray.data
for i, bval in enumerate(b.data):
    a.data[i] = bval

Or, even simpler, why not just do a raw copy? Would something like the following work? (from https://github.com/sods/paramz/blob/master/paramz/core/parameter_core.py#L290)

            # pi.param_array.data = parray[pislice].data 
            pi.param_array[:] = parray[pislice][:]

I'm guessing this approach incurs the performance issue discussed towards the top of the thread, though, by using python loops.

It seems like we might have two additional approaches:

  1. Implement the assign_slice operation for numpy (currently a NULL pointer, not implemented [1])
  2. Avoid the .data attribute altogether

The second approach is simply

            # pi.param_array.data = parray[pislice].data 
            pi.param_array = parray[pislice]

By assigning the .data attribute, we implicitly assume that no other object holds a (post-.data-reassignment) stale pointer to the memory underlying the original array. I wonder whether this assumption can't be leveraged by simply reassigning pi.param_array directly to parray[pislice]. In this case, there's no python-level [:] copy of values one by one - the same behavior as reassigning the pi.param_array.data buffer to the memory at parray[pislice] is achieved.
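One caveat with that direct reassignment, sketched here with made-up names: plain assignment only rebinds the name in the current scope, so any other object that already holds the old param_array keeps pointing at the old, now-detached buffer - which is precisely the situation the .data trick was meant to avoid.

import numpy as np

parray = np.arange(6, dtype=float)   # the parent's contiguous storage
param_array = np.zeros(2)            # the parameter's own array
stale_ref = param_array              # some other object kept a reference

# Plain rebinding: the local name now points at a view of the parent, but
# `stale_ref` still points at the old, detached buffer.
param_array = parray[4:]
param_array[0] = 7.0

print(parray)     # [0. 1. 2. 3. 7. 5.]  - updated through the new view
print(stale_ref)  # [0. 0.]              - unchanged and now out of sync

# An in-place copy keeps every existing reference consistent, at the price
# of not sharing memory with the parent afterwards.
stale_ref[:] = parray[4:]
print(stale_ref)  # [7. 5.]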

EDIT: If I'm not too misled, the second solution I've sketched out matches up with @g-bassetto's second solution.

Sources:

  1. https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/sequence.c#L75

g-bassetto commented on August 15, 2024

@ekalosak your second approach matches my second one, as you guessed.
As far as I understand, the reason Param objects are of type ndarray is to leverage the numpy implementation that already exists. At first glance, the Param class is a kind of structured array. For this reason, a specific Param instance is meant to be a view on a subset of the parameters of a larger parameter hierarchy associated with some model. The reason they are using the ndarray.data attribute is that there is no other way to change the address a view points to (please correct me if I am wrong).
From an implementation point of view, having all the parameters of the root model of the hierarchy live in a contiguous piece of memory is very convenient: most existing optimization tools that operate on scalar cost functions of vector variables will work out of the box, as these tools often assume that the parameters lie contiguously in memory.
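As a toy example of that convenience (hypothetical names, not paramz internals): the root owns one flat vector that an optimizer can consume directly, while each parameter is just a view into a slice of it.

import numpy as np

# The root model owns one flat, contiguous parameter vector...
theta = np.array([0.5, 1.0, 2.0, 3.0])

# ...and the individual parameters are views into slices of it.
kernel_variance = theta[0:1]
kernel_lengthscales = theta[1:3]
noise_variance = theta[3:4]

# An optimizer working on the flat vector updates it in place...
theta *= 0.1

# ...and every sub-parameter sees the update without any copying.
print(kernel_lengthscales)  # [0.1 0.2]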

The reason I think decoupling the Param class from ndarray is a good idea is that it would make it possible to change the back-end at runtime. The first idea that comes to mind is being able to swap between numpy and pytorch, for example, in case one's model requires some heavy work to be carried out on a GPU. I am imagining the Param class acting as a proxy to the actual parameter array, rather than being the parameter array itself, and both my option number 2 and @ekalosak's are going in this direction.

Do you think that what I said is sensible?

ekalosak commented on August 15, 2024

I think what you've said is sensible, but it might be out of scope for this narrow issue - though if you see the current issue as a stepping stone to backend agnosticism, that just adds motivation to solve it.

If I've understood correctly, there's an issue with simply reassigning pi.param_array = parray[pislice] directly: when we change pi.param_array through a plain Python re-assignment, the memory underlying the new data will be in a different, non-contiguous location:

[diagram: after re-assignment the parameter's memory ends up at a different, non-contiguous location from the parent array]

If the Param were decoupled from the np.ndarray, I sense we might be able to maintain memory contiguity, but it's not clear how we'd do it aside from directly modifying the array's values in place as is done now. We could gain more confidence that there aren't any stale views of the data that could cause segfaults, but at the same time, decoupling seems like a potentially unnecessary additional level of abstraction.

ekalosak commented on August 15, 2024

At the same time, @g-bassetto , I wonder if it wouldn't be easier to just implement sq_ass_slice in numpy. Then we could just use param_array[slice] = pi.param_array without the subroutine falling back on python loops. It's wild to me that slice overwriting is not implemented in the C backend...
(https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/sequence.c#L75)

EDIT
Having run some rough benchmarking...

import numpy as np
import time

N = 10**6
a = np.empty(N)
b = np.empty(N)
c = np.ones(N)

t0a = time.time()
for i in range(N):   # element-by-element copy via a Python-level loop
    a[i] = c[i]
t1a = time.time()
tda = t1a - t0a

t0b = time.time()
b[:] = c             # vectorized slice assignment
t1b = time.time()
tdb = t1b - t0b

print(f"a: {tda}\nb: {tdb}")

Results:

a: 0.23286128044128418
b: 0.0038118362426757812

I don't quite understand why we wouldn't use the built-in slice overwriting. It seems like the a.data = b.data trick is unnecessary altogether? There's no slowdown.
