Comments (14)
Yes, that is right. We are using the `data` attribute for in-memory views of parameters. It seems that this will not be possible in the future. It also appears to be the source of memory issues reported on GPy (SheffieldML/GPy#341, SheffieldML/GPy#51), though this is not confirmed.
Not using these in-memory views significantly increases turnaround time during optimization, as getting and setting parameters then has to go through Python loops.
Can you think of a fix for this issue, or do you know of another way to create in-memory views of arrays?
See the Stack Overflow discussions about this:
https://stackoverflow.com/questions/23650796/numpy-set-array-memory
https://stackoverflow.com/questions/39446199/in-numpy-how-do-i-set-array-bs-data-to-reference-the-data-of-array-a
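For context, the behavior the `data` assignment tries to emulate, writing through an existing view so that a parent buffer sees the change, can be sketched with plain numpy views (the names here are illustrative, not paramz API):

```python
import numpy as np

# Parent storage and an in-memory view into it.
parent = np.zeros(6)
view = parent[2:5]           # shares memory with parent

view[:] = [1.0, 2.0, 3.0]    # in-place write: parent sees the change
assert np.shares_memory(parent, view)
assert parent[2] == 1.0

# Rebinding the name instead creates a fresh array: the link is lost.
view = np.array([7.0, 8.0, 9.0])
assert not np.shares_memory(parent, view)
assert parent[2] == 1.0      # parent is unaffected
```

The hard part, and the reason `.data` assignment was used, is attaching an *existing* array object to a new buffer after its creation, which plain views cannot do.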
from paramz.
An alternative idea would be to decouple the description of a parameter from its actual data storage. The external interface of a `Param` object would still need to match that of `ndarray`, but internally it would delegate all operations to a proper `ndarray`. When the parameter is connected to some parent object, all we would need to do is replace its internal storage with the appropriate view from the parent storage (taking care to copy the old values into the new storage before discarding the old one).
This solution would keep the advantage of the old system, which avoids looping through parameters because all operations are still done in-memory, but it would get rid of the deprecation warning. I am willing to give it a try.
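To make the idea concrete, here is a minimal sketch of what such a decoupled parameter might look like. `ParamProxy` and `relink` are made-up names for illustration, not a proposed paramz interface:

```python
import numpy as np

class ParamProxy:
    """Sketch of a parameter that delegates storage to an internal ndarray."""

    def __init__(self, values):
        self._storage = np.asarray(values, dtype=float)

    def __getitem__(self, idx):
        return self._storage[idx]

    def __setitem__(self, idx, value):
        self._storage[idx] = value

    @property
    def values(self):
        return self._storage

    def relink(self, new_storage):
        # Copy the old values into the new view, then swap the storage.
        new_storage[...] = self._storage
        self._storage = new_storage

# A parent buffer adopts the parameter: the proxy now writes into it.
parent = np.zeros(5)
p = ParamProxy([1.0, 2.0])
p.relink(parent[1:3])
p[0] = 9.0
assert parent[1] == 9.0 and parent[2] == 2.0
```

Writes through the proxy land directly in the parent buffer, so no Python loop over parameters is needed and the `.data` attribute is never touched.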
Is there any reasonable workaround at the moment? Or should we just silence the warning in the meantime?
@mzwiessele Is there an upstream issue in numpy that we're waiting on to be fixed, so it can be tracked here?
Hi there,
I think I have worked out a possible solution to this problem, but I am not sure it is easy to integrate into the existing framework.
The idea is very simple: it consists of enriching an `np.ndarray` with a field to which, if set, all calls are delegated; if the field is not set, the method falls back to invoking itself.
I paste a short example below to explain what I mean:
```python
import numpy as np
import logging

logging.basicConfig(level=logging.DEBUG)

BUFFER = '__my_buffer__'

class MyArray(np.ndarray):

    @property
    def data(self):
        try:
            return getattr(self, BUFFER)
        except AttributeError:
            return super().data

    @data.setter
    def data(self, value):
        value = np.asarray(value)
        logging.debug(f'self: {self}; other: {value}')
        assert value.shape == self.shape
        value[...] = self[...]
        logging.debug(f'self: {self}; other: {value}')
        setattr(self, BUFFER, value)
        logging.debug(f'self: {self}; other: {value}')

    def __getitem__(self, item):
        if getattr(self, BUFFER, self) is self:
            return super().__getitem__(item)
        else:
            return getattr(self, BUFFER).__getitem__(item)

    def __setitem__(self, item, value):
        if getattr(self, BUFFER, self) is self:
            return super().__setitem__(item, value)
        else:
            getattr(self, BUFFER).__setitem__(item, value)

    def __iadd__(self, other):
        if getattr(self, BUFFER, self) is self:
            return super().__iadd__(other)
        else:
            getattr(self, BUFFER).__iadd__(other)
            return self

    # ... and so on


buf = np.ones(5)
arr = np.zeros(3).view(MyArray)
assert np.all(arr == 0)
# arr.data = buf[1:2]  # raises AssertionError because the sizes are different
arr.data = buf[2:]     # this is ok because the sizes match
assert np.all(arr[:] == 0)
assert np.all(arr == 0)
assert np.all(buf[:2] == 1) and np.all(buf[2:] == 0)
arr += 2
assert np.all(arr[:] == 2)  # this is ok
# assert np.all(arr == 2)   # this is not ok
assert np.all(buf == [1, 1, 2, 2, 2])
```
Although this seems reasonable to me, I am not sure it will not break some internal mechanics of the framework, of which I honestly do not yet have an in-depth understanding (I get the overall mechanics of the system, but I am still trying to wrap my head around the logic behind the implementation of `IndexOperations`).
This is my impression, but it could use some correction:
The issue appears to boil down to having a pointer `*A` (the `ndarray.data` attribute) to the first element of the array that is changed (to `*B`) while other objects may hold views of, or pointers to, the original `*A`. [1] The memory at the location indicated by the first pointer, no longer referenced by the original ndarray, may be garbage collected. Objects holding copies of the `*A` pointer side-step Python's built-in memory management, which can cause a segfault when `*A` is dereferenced. This is why we are given that "inherently unsafe" warning: we may get a segfault if another object holding the original pointer `*A` tries to access memory at that location after we have reassigned the array's contents to memory starting at `*B`.
It took some reading to understand @g-bassetto 's suggestion, and it makes sense to me: essentially, we provide a wrapper around the `ndarray.data` buffer that keeps it stable while still allowing us to get and set the array's contents. Does anyone see any issues with this approach? Not only would it get rid of the warning, it appears to address the root cause of the warning by handling the `.data` attribute more carefully.
The only change I might propose is adding some initialization of the buffer that would do away with the `if getattr(self, BUFFER, self) is self` conditions replicated across the magic methods.
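One way that initialization could look, as a sketch only (`BufferedArray` and `_buffer` are illustrative names, and the value-copying step of the original proposal is omitted for brevity): `__array_finalize__` runs on every construction path of an ndarray subclass, so the attribute is always defined and the conditional dispatch collapses into a single delegation target:

```python
import numpy as np

class BufferedArray(np.ndarray):
    # __array_finalize__ runs for every construction path (explicit view,
    # slicing, templates), so _buffer is always defined and the magic
    # methods never need a getattr check.
    def __array_finalize__(self, obj):
        self._buffer = None

    @property
    def target(self):
        # Delegate to the external buffer if set; otherwise fall back to
        # self viewed as a plain ndarray (avoids infinite recursion).
        buf = self._buffer
        return buf if buf is not None else self.view(np.ndarray)

    def __getitem__(self, item):
        return self.target[item]

    def __setitem__(self, item, value):
        self.target[item] = value

buf = np.ones(5)
arr = np.zeros(3).view(BufferedArray)
arr._buffer = buf[2:]   # attach external storage
arr[0] = 7.0            # writes land in buf
assert buf[2] == 7.0
```

The getter/setter pair from the earlier example would set `_buffer` the same way, after the shape check and value copy.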
Sources:
- https://numpy.org/doc/stable/reference/arrays.interface.html specifically "A reference to the object exposing the array interface must be stored by the new object if the memory area is to be secured."
Edit:
@mzwiessele I'm not entirely sure I understand why the `ndarray.data` attribute is being set at all. Why not simply update the arrays themselves? Is the desire to keep a contiguous memory map of the parameters' arrays?
Also, the link to the numpy mailing list thread above is broken for me, here's a currently working link for anyone else having the same issue: http://numpy-discussion.10968.n7.nabble.com/deprecating-assignment-to-ndarray-data-td42176.html
Why not do an in-place modification of the memory underlying the array, e.g.

```python
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])
# forgive the pseudocode
for i, bbyte in enumerate(b.data.memory_map):
    a.data.memory_map[i] = bbyte
```
Or, even simpler, why not just do a raw copy? Would something like the following work? (from https://github.com/sods/paramz/blob/master/paramz/core/parameter_core.py#L290)

```python
# pi.param_array.data = parray[pislice].data
pi.param_array[:] = parray[pislice][:]
```

I'm guessing this approach incurs the performance issue discussed towards the top of the thread, though, by using Python loops.
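As a side note (not from the thread): numpy also exposes this overwrite as an explicit function, `np.copyto`, which performs the same element-wise copy as the slice assignment without touching `.data`:

```python
import numpy as np

src = np.array([4.0, 5.0, 6.0])
dst = np.zeros(3)

# Equivalent to dst[:] = src, written as an explicit buffer copy.
np.copyto(dst, src)
assert np.all(dst == src)
```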
It seems like we might have two additional approaches:
- Implement the `assign_slice` operation for numpy (currently a NULL pointer, not implemented [1])
- Avoid the `.data` attribute altogether

The second approach is simply

```python
# pi.param_array.data = parray[pislice].data
pi.param_array = parray[pislice]
```

By assigning the `.data` attribute, we implicitly assume no other object holds a stale (post-`.data`-reassignment) pointer to the memory underlying the original array. I wonder whether this assumption can't be leveraged by simply reassigning `pi.param_array` directly to `parray[pislice]`. In this case, no Python loop (`[:]`) is used to copy values one by one; the same behavior as reassigning the `pi.param_array.data` buffer to the memory at `parray[pislice]` is achieved.
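The trade-off of this direct reassignment can be shown without any `.data` tricks. This is an illustrative sketch (`param_array` and `stale_ref` are my names; `stale_ref` stands in for any object that captured the old array):

```python
import numpy as np

parray = np.arange(6, dtype=float)   # parent parameter buffer
old = np.zeros(2)                    # stands in for pi.param_array
stale_ref = old                      # some other object kept a reference

# Rebinding: the name now points at a view of the parent buffer...
param_array = parray[1:3]
assert np.shares_memory(param_array, parray)

# ...but holders of the old array silently stop tracking the parameter.
param_array[0] = 42.0
assert stale_ref[0] == 0.0           # the stale copy never sees the update
```

Unlike the `.data` trick this can never segfault (Python's refcount keeps the old buffer alive), but every existing reference to the old array must be rebound as well.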
EDIT: If I'm not too misled, the second solution I've sketched out matches @g-bassetto 's second solution.
Sources:
@ekalosak your second approach matches my second one, as you guessed.
As far as I understand, the reason `Param` objects are of type `ndarray` is to leverage the numpy implementation that already exists. At first glance, the `Param` class is a kind of structured array. For this reason, a specific `Param` instance is meant to be a view on a subset of the parameters of a larger parameter hierarchy associated with some model. The reason they are using the `ndarray.data` attribute is that there is no other way to change the address a view points to (please correct me if I am wrong).
From an implementation point of view, having all the parameters of the root model of the hierarchy exist within a contiguous piece of memory is very convenient: most existing optimization tools that operate on scalar cost functions of vector variables will work out of the box, as these tools often assume that the parameters lie contiguously in memory.
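The contiguity point can be made concrete: keep one flat vector and hand out views of it, so an optimizer can update every parameter with a single write (the names here are illustrative, not paramz's layout):

```python
import numpy as np

# One flat, contiguous parameter vector for the whole model.
theta = np.zeros(6)

# Individual parameters are views into it, e.g. a 2x2 weight
# matrix and a 2-vector bias.
weights = theta[:4].reshape(2, 2)
bias = theta[4:]
assert np.shares_memory(theta, weights) and np.shares_memory(theta, bias)

# An optimizer step writes the flat vector once; all views update.
theta[:] = np.arange(6, dtype=float)
assert weights[1, 1] == 3.0 and bias[0] == 4.0
```

This is exactly the arrangement a scipy-style optimizer expects: it sees only `theta`, while the model reads its parameters through the views.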
The reason I think decoupling the `Param` class from `ndarray` is a good idea is that it would make it possible to change the back-end at runtime. The first idea that comes to mind is being able to swap between numpy and pytorch, for example, in case one's model requires some heavy work to be carried out on a GPU. I imagine the `Param` class acting as a proxy to the actual parameter array, rather than being itself the parameter array, and both my and @ekalosak 's option number 2 go in this direction.
Do you think that what I said is sensible?
I think what you've said is sensible, but it might be out of scope for this narrow issue; then again, if you see the current issue as a stepping stone to backend agnosticism, that just adds motivation to solve it.
If I've understood correctly, there's an issue with simply reassigning `pi.param_array = parray[pislice]` directly: when we change `pi.param_array` via a direct Python re-assignment, the memory underlying the new data will be in a different, non-contiguous location.
If the `Param` were decoupled from the `np.ndarray`, I can see that we might be able to maintain memory contiguity, but it's not clear how we'd do it aside from directly modifying the array's values in place, as is done now. We would gain more confidence that there aren't any stale views of the data that could cause segfaults, but at the same time, decoupling seems like a potentially unnecessary additional level of abstraction.
At the same time, @g-bassetto , I wonder if it wouldn't be easier to just implement `sq_ass_slice` in numpy. Then we could just use `param_array[slice] = pi.param_array` without the subroutine falling back on Python loops. It's wild to me that slice overwriting is not implemented in the C backend...
(https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/sequence.c#L75)
EDIT
Having run some rough benchmarking...

```python
import numpy as np
import time

N = 10**6
a = np.empty(N)
b = np.empty(N)
c = np.ones(N)

# Element-by-element copy through a Python loop.
t0a = time.time()
for i in range(N):
    a[i] = c[i]
t1a = time.time()
tda = t1a - t0a

# Slice assignment.
t0b = time.time()
b[:] = c
t1b = time.time()
tdb = t1b - t0b

print(f"a: {tda}\nb: {tdb}")
```

Results:

```
a: 0.23286128044128418
b: 0.0038118362426757812
```

I don't quite understand why we wouldn't use the built-in slice overwriting. It seems like the `a.data = b.data` trick is unnecessary altogether? There's no slowdown.