juliadata / pooledarrays.jl Goto Github PK

View Code? Open in Web Editor NEW

47.0 8.0 13.0 139 KB

A pooled representation for arrays with few unique elements

License: Other

Julia 100.00%

pooledarrays.jl's Introduction

PooledArrays.jl

A pooled representation of arrays for purposes of compression when there are few unique elements.

Installation: at the Julia REPL, import Pkg; Pkg.add("PooledArrays")

Usage:

Working with PooledArray objects does not differ from working with general AbstractArray objects, with two exceptions:

If you hold mutable objects in PooledArray it is not allowed to modify them after they are stored in it.
In multi-threaded context it is not safe to assign values that are not already present in a PooledArray's pool from one thread while either reading or writing to the same array from another thread.

Keeping in mind these two restrictions, as a user, the only thing you need to learn is how to create PooledArray objects. This is accomplished by passing an AbstractArray to the PooledArray constructor:

julia> using PooledArrays

julia> PooledArray(["a" "b"; "c" "d"])
2×2 PooledMatrix{String, UInt32, Matrix{UInt32}}:
 "a"  "b"
 "c"  "d"

PooledArray performs compression by storing an array of reference integers and a mapping from integers to its elements in a dictionary. In this way, if the size of the reference integer is smaller than the size of the actual elements the resulting PooledArray has a smaller memory footprint than the equivalent Array. By default UInt32 is used as a type of reference integers. However, you can specify the reference integer type you want to use by passing it as a second argument to the constructor. This is usually done when you know that you will have only a few unique elements in the PooledArray.

julia> PooledArray(["a", "b", "c", "d"], UInt8)
4-element PooledVector{String, UInt8, Vector{UInt8}}:
 "a"
 "b"
 "c"
 "d"

Alternatively you can pass the compress and signed keyword arguments to the PooledArray constructor to automatically select the reference integer type. When you pass compress=true then the reference integer type is chosen to be the smallest type that is large enough to hold all unique values in array. When you pass signed=true the reference type is signed (by default it is unsigned).

julia> PooledArray(["a", "b", "c", "d"]; compress=true, signed=true)
4-element PooledVector{String, Int8, Vector{Int8}}:
 "a"
 "b"
 "c"
 "d"

Maintenance: PooledArrays is maintained collectively by the JuliaData collaborators. Responsiveness to pull requests and issues can vary, depending on the availability of key collaborators.

Related Packages

pooledarrays.jl's People

Stargazers

Watchers

Forkers

tkelman kafisatz asbisen nalimilan oxinabox piever stjordanis sl-solution chrstphrbrns mrufsvold kragol lilithhafner

pooledarrays.jl's Issues

implement compactpool! and compactpool functions

Currently there is no API to compact pool of PooledArray. This can be needed in two cases:

you removed many values from a PooledArray and want to have a smaller pool
you subsetted a PooledArray thus dropping many unique values

This should be implemented after #56 is merged.

PooledArray is potentially not thread safe

See the following example:

julia> Threads.nthreads()
8

julia> using PooledArrays

julia> x = PooledArray(zeros(10^6));

julia> Threads.@threads for i in 1:10^6
           x[i] = i
       end
ERROR: TaskFailedException:
BoundsError: attempt to access 228263-element Array{Float64,1} at index [13310]

I think we should make setindex! thread safe. Any opinions?

revpool type

Looking at the code maybe revpool could be defined as Vector{T}. I do not see any place where we take advantage of the fact that it is a Dict.

@nalimilan - this would influence the performance of JuliaData/DataFrames.jl#1703 I think.

PooledArrays.jl 0.6 release

There has been a year since the last patch release of PooledArrays.jl I think with the changes we have we should aim to make a 0.6 release. We have some important breaking changes already on master + several issues/PRs open.

I think it would be good to make a release. I will comment on the open issues/PRs to make a decision if we want to include them or not in 0.6 release.

CC @nalimilan + @quinnj

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

implement reference offset

A number of binary formats contain dictionary encoded data with the references as 0-indexed integers. PooledArrays currently can't be used to wrap these because, if I understand correctly, the references are always 1-based. This means that to deserialize a format using PooledArrays you necessarily have to copy all the references.

I suggest we add an offset::Int field so this can be handled more generally. This would be added getindex. The most obvious difficulty with implementing this is that currently zeros give undefined values. Perhaps this can be circumvented at compile time with a new type parameter.

Anyway, before I try to implement this, has anyone given in consideration? How is arrow (which I seem to remember is 0-based) deal with this?

constuctor to create PooledArray sharing a pool from another PooledArray

In the case where you have 2+ PooledArrays and you want them to share a single pool, it would be useful to have a constructor that's something like

foo = PooledArray(rand(["a","b","c"], 5000))

bar = PooledArray(rand(["a","b","c"], 100), foo)

where bar is a PooledArray of 100 elements that share the pool of foo. Therefore, if one was to change foo.pool[1] = "zebra", every occurrence of "a" in bar would become "zebra"

The constructor would look something like

PooledArray(x, y::PooledArray)

I'm not sure of what a clean and efficient process for that would look like. Perhaps since the pool already exists, replacing every occurrence of a unique element with the corresponding pool index?

Error tagging new release

The URL of this package does not match that stored in METADATA.jl.
cc: @julia-tagbot[bot]

`pure=true` default to align with Base

I see why the pure keyword was added, but having pure=false contradicts Julia's behavior, which assumes mapped functions are pure. For instance:

julia> using SparseArrays

julia> A = sparse([1, 1, 2, 3], [1, 3, 2, 3], [3, 1, 2, 3])
3×3 SparseMatrixCSC{Int64, Int64} with 4 stored entries:
 3  ⋅  1
 ⋅  2  ⋅
 ⋅  ⋅  3

julia> map(x -> x^2 + rand(), A)
3×3 SparseMatrixCSC{Float64, Int64} with 9 stored entries:
 9.54687  0.85208  1.86548
 0.85208  4.31839  0.85208
 0.85208  0.85208  9.26578

This default behavior is also inconsistent with most programmers' expectations, as map is a functional construct and therefore tends to assume side-effect free functions. This could lead to bugs; it also means that any code using map on a vector of unknown type can't take advantage of the performance enhancements provided by PooledArrays, since pure is not a keyword for the map method in base. As a result, I propose defaulting to pure=true.

Change the default type of ref

Currently we have:

julia> x = PooledArray(zeros(Int, 10^6));
julia> copyto!(x, 1:10^6);
ERROR: You're using a PooledArray with ref type UInt8, which can only hold 255 values,
and you just tried to add the 256th reference.  Please change the ref type
to a larger int type, or use the default ref type (UInt32).

I think we should use UInt32 as the default as in practice limiting up to 255 levels by default is a small value. If someone wants such a limit to save memory I would have it as an opt-in rather than the default.

@quinnj - what does CSV.jl use as a default?

Change order of type parameters

Having PooledArray{T, R, N} is confusing since we have AbstractArray{T, N}. The day we make a breaking release we should change this.

DataAPI.levels method?

In the Arrow.jl package we added a method

function DataAPI.levels(x::DictEncoded)
    rp = DataAPI.refpool(x)   # may contain missing values
    Missing <: eltype(rp) || return rp
    convert(AbstractArray{nonmissingtype(eltype(rp))}, deleteat!(rp, ismissing.(rp)))
end

Should there be a similar method for PooledArray?

At present the default method for levels ends up sorting the pool.

similar(pa, 0) stopped working

When working on a PR for Gadfly.jl (GiovineItalia/Gadfly.jl#1521), I noticed that the update for PooledArrays.jl is breaking. I'm not sure whether this behavior is expected, so decided to check in.

With PooledArrays v1.1:

julia> using PooledArrays

julia> pa = PooledArray([1]);

julia> similar(pa, 0)
0-element PooledVector{Int64, UInt32, Vector{UInt32}}

With PooledArrays v1.2:

julia> similar(pa, 0)
ERROR: ArgumentError: collection must be non-empty
Stacktrace:
 [1] _extrema_itr
   @ ./operators.jl:483 [inlined]
 [2] _extrema_dims
   @ ./multidimensional.jl:1698 [inlined]
 [3] #extrema#488
   @ ./multidimensional.jl:1685 [inlined]
 [4] extrema
   @ ./multidimensional.jl:1685 [inlined]
 [5] PooledVector{Int64, UInt32, Vector{UInt32}}(rs::PooledArrays.RefArray{Vector{UInt32}}, invpool::Dict{Int64, UInt32}, pool::Vector{Int64}, refcount::Base.Threads.Atomic{Int64})
   @ PooledArrays ~/.julia/packages/PooledArrays/mXlIN/src/PooledArrays.jl:46
 [6] PooledArray
   @ ~/.julia/packages/PooledArrays/mXlIN/src/PooledArrays.jl:78 [inlined]
 [7] similar(pa::PooledVector{Int64, UInt32, Vector{UInt32}}, S::Type, dims::Tuple{Int64})
   @ PooledArrays ~/.julia/packages/PooledArrays/mXlIN/src/PooledArrays.jl:293
 [8] similar(a::PooledVector{Int64, UInt32, Vector{UInt32}}, dims::Int64)
   @ Base ./abstractarray.jl:735
 [9] top-level scope
   @ REPL[5]:1

What is the relation to CategoricalArrays?

https://github.com/JuliaData/CategoricalArrays.jl

Do they both try to do the same thing, i.e. replace PooledDataArray?

cc @nalimilan.

`map` applies function to unused pool levels

map applies the function to all values in the pool, whether they are used or not. This is notably a problem if you replace missing values:

julia> x = PooledArray([missing, 1])
2-element PooledArray{Union{Missing, Int64},UInt8,1,Array{UInt8,1}}:
  missing
 1

julia> x[1] = 0
0

julia> map(Int, x)
ERROR: MethodError: no method matching Int64(::Missing)

This is a difficult problem to fix, as it would require going over all values to check which ones are not used. One possibility would be to wrap the call in try... catch, store a special value in case of error, and check at the end whether that value was used. If so, call the function again to trigger the error.

Precompilation error on Julia 1.0

Running using PooledArrays on Julia 1.0 gives the following error:

[ Info: Precompiling PooledArrays [2dfb63ee-cc39-5dd5-95bd-886bf059d720]
ERROR: LoadError: UndefVarError: endof not defined
Stacktrace:
 [1] getproperty(::Module, ::Symbol) at ./sysimg.jl:13
 [2] top-level scope at none:0
 [3] include at ./boot.jl:317 [inlined]
 [4] include_relative(::Module, ::String) at ./loading.jl:1038
 [5] include(::Module, ::String) at ./sysimg.jl:29
 [6] top-level scope at none:2
 [7] eval at ./boot.jl:319 [inlined]
 [8] eval(::Expr) at ./client.jl:389
 [9] top-level scope at ./none:3
in expression starting at /Users/dilum/.julia/packages/PooledArrays/T5B1A/src/PooledArrays.jl:141
ERROR: Failed to precompile PooledArrays [2dfb63ee-cc39-5dd5-95bd-886bf059d720] to /Users/dilum/.julia/compiled/v1.0/PooledArrays/vi11X.ji.
Stacktrace:
 [1] error(::String) at ./error.jl:33
 [2] macro expansion at ./logging.jl:313 [inlined]
 [3] compilecache(::Base.PkgId, ::String) at ./loading.jl:1184
 [4] _require(::Base.PkgId) at ./logging.jl:311
 [5] require(::Base.PkgId) at ./loading.jl:852
 [6] macro expansion at ./logging.jl:311 [inlined]
 [7] require(::Module, ::Symbol) at ./loading.jl:834

add insert!

It should be easy to add insert! support for PooledVector.

improve median for pooled vectors

If length of pool is much smaller than the number of entries we can run the following (working code):

function median_fast(x::PooledVector)
    n = length(x)
    p = sortperm(x.pool)
    counts = zeros(Int, length(p))
    for v in x.refs
        counts[v] += 1
    end
    cum = 0
    for (j,i) in enumerate(p)
        cum += counts[i]
        if isodd(n)
            if cum >= div(n + 1, 2)
                return middle(x.pool[i])
            end
        else
            if cum >= div(n, 2)
                if cum == div(n, 2)
                    return middle(x.pool[i], x.pool[p[j+1]])
                else
                    return middle(x.pool[i])
                end
            end
        end
    end
    error("unreachable reached")
end

Example timings:

julia> x = PooledArray(rand(1:100, 10^8));

julia> @time median_fast(x);
  0.063734 seconds (4 allocations: 1.781 KiB)

julia> @time median_fast(x);
  0.070556 seconds (4 allocations: 1.781 KiB)

julia> @time median(x);
  0.931585 seconds (3 allocations: 762.940 MiB, 9.86% gc time)

julia> @time median(x);
  0.955555 seconds (3 allocations: 762.940 MiB, 14.29% gc time)

Constructor that accepts an iterator?

It would be useful to have a constructor that accepts an general iterator, as

most collection types have this
it would be especially useful to avoid allocating a large array fully of duplicates that is just going to be freed once i make the PooledArray.

deleteat! missing

Is it feasible to add deleteat! method to PooledVector or it would break some design assumptions?

CC @nalimilan. This is related to JuliaData/CSV.jl#435.

Auto bumping ref type to the next larger int type

Hi there. Great package!

Is there any reason not to make it smarter by automatically bumping to the next larger int type? I can do that manually but thought it could be a nice feature.

julia> x = PooledArray(fill(1,300));

julia> typeof(x)
PooledArrays.PooledArray{Int64,UInt8,1,Array{UInt8,1}}

julia> x[1:300] = 1:300
ERROR: You're using a PooledArray with ref type UInt8, which can only hold 255 values,
and you just tried to add the 256th reference.  Please change the ref type
to a larger int type, or use the default ref type (UInt32).

Redesing of PooledArray internals

Things to do:

treat missing as a special value that is not pooled, probably with level 0. This would work the same as in CategoricalArrays.jl; the benefit is that two PooledArrays differing only in the fact if they allow Missing or not could share pool
add locking for setindex! but make sure that we support batch operations of adding levels (both in setindex! and in e.g. copyto!); this will allow to fully drop Copy-On-Write and never copy pool and invpool by default; tentatively unsafe_setindex! would be an alternative that does not use lock
stress in documentation that using invpool is not safe if potentially other threads are modifying it (this should not be a problem)
add droplevels! to DataAPI.jl and to PooledArrays.jl (this requires also a change in CategoricalArrays.jl); this function would reduce pool and invpool to only used levels and also at the same time make a fresh copy of them (as a way to detach pool and invpool between PooledArray-s)

I think this design is better than global pool. It will still cost us a bit in H2O benchmarks, but at least we avoid a global pool that is not reclaimable.

@nalimilan + @quinnj : any additional comments on this?

bug with empty constructors

In src/PooledArrays.jl at lines 202-203, we find the following code

@inline PooledArray(t::Type) = PooledArray(Array(t,0))
@inline PooledArray(t::Type, r::Type) = PooledArray(Array(t,0), r)

which does not work as there is no method Array(::Type,::Int).

I guess the intent was more like Array{t}(undef,0) (or just t[]) instead of Array(t,0).

performance of getindex

A bottleneck in fast join in DataFrames.jl is:

PooledArray(RefArray(getindex(A.refs, I)), copy(A.invpool))

as copy(A.invpool) is very expensive for large pools (also it leaves pool much larger than needed usually). I see two options:

do not copy the poll (and share it as in PooledArray we do not allow removing levels from pool)
if the new array is short relative to the old one allocate a fresh pool

This case will hit us (and probably already significantly hits us but we did not know about it in H2O benchmarks which heavily use PooledArrays)

CC @nalimilan @quinnj

Purity assumption of map

Originally posted by @bkamins in #44 (comment)

Also the problem is that function passed to map does not have to be pure, in which case you should apply it to each element of the array.
Related is: #36.

By the functional definition of map it kinda should be pure.
But that isn't what julia has, so it's probably not great to assume that.
It might be nice though if there was a pure_map (in DataAPI maybe?) that is documented to be assuming that the function is pure.
And that falls back to map if not overloaded (e.g. by PooledArray), or possibly even for large arrays to a memorized version of map (could even go so far as to do a little tuning step to workout how large)

Release version 0.5.2

@JuliaRegistrator register()

Constructor that accepts `refs` and `pool`

Sometimes I'm starting with an existing array of Ints (refs), and a pool vector.

Is there a friendly way to make a PooledArray out of that? I couldn't find one - seems you have to go via the basic constructor with invpool = Dict(k => i for (k, i) in pool), then this gets turned back to pool in the constructor! A PooledArray(refs, pool) constructor would be useful.

Add examples

I eould love to see one or two simple examples of the functionality of this package in the readme or a docs page.

ipermute!!

On julia 0.7 I get an error regarding your definition (or extension) of permute!!
I do not know what that function does. It seems you could simply delete it (for 0.7)

MethodError since 0.2.1

I try to load a table that I saved with a previous version of JuliaDB. It contains a column that is of type Int8. I get:

MethodError: no method matching PooledArrays.PooledArray(::PooledArrays.RefArray{Array{UInt8,1}}, ::Array{Int8,1})
[1] deserialize(::SerializationState{IOStream}, ::Type{MemPool.MMSer{PooledArrays.PooledArray}}) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\MemPool\src\io.jl:10
[2] handle_deserialize(::SerializationState{IOStream}, ::Int32) at .\serialize.jl:685
[3] collect_to!(::Array{Array{T,1} where T,1}, ::Base.Generator{UnitRange{Int64},JuliaDB.##30#32{SerializationState{IOStream}}}, ::Int64, ::Int64) at .\array.jl:508
[4] collect_to!(::Array{Array{Int32,1},1}, ::Base.Generator{UnitRange{Int64},JuliaDB.##30#32{SerializationState{IOStream}}}, ::Int64, ::Int64) at .\array.jl:518
[5] collect(::Base.Generator{UnitRange{Int64},JuliaDB.##30#32{SerializationState{IOStream}}}) at .\array.jl:476
[6] mmread(::Type{IndexedTables.Columns}, ::SerializationState{IOStream}, ::Bool) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\JuliaDB\src\serialize.jl:57
[7] deserialize(::SerializationState{IOStream}, ::Type{MemPool.MMSer{IndexedTables.Columns}}) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\MemPool\src\io.jl:10
[8] handle_deserialize(::SerializationState{IOStream}, ::Int32) at .\serialize.jl:685
[9] mmread(::Type{IndexedTables.NextTable}, ::SerializationState{IOStream}, ::Bool) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\JuliaDB\src\serialize.jl:86
[10] deserialize(::SerializationState{IOStream}, ::Type{MemPool.MMSer{IndexedTables.NextTable}}) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\MemPool\src\io.jl:10
[11] handle_deserialize(::SerializationState{IOStream}, ::Int32) at .\serialize.jl:685
[12] open(::Base.Serializer.#deserialize, ::String) at .\iostream.jl:152
[13] load(::String) at C:\Users\Max\AppData\Local\JuliaPro-0.6.2.1\pkgs-0.6.2.1\v0.6\JuliaDB\src\io.jl:185

I am using Julia 0.6.2 and:

- MemPool                       0.0.11
- PooledArrays                  0.2.1
- JuliaDB                       0.8.4

Downgrading JuliaDB to 0.8.1 (and hence PooledArrays to 0.1.1) resolves the issue, using version 0.8.2 the issue is back.

Constructor for PooledArray

In the example below, what is the best way to define y without allocating x

Is this how it should be done?

using DataFrames
x=repeat(vcat("someString"),1_000_000));
y = DataFrames.PooledArrays.PooledArray(x)

invpool=Dict{String,UInt8}("someString"=>1)
pool=DataFrames.PooledArrays._invert(invpool)
refs=repeat(vcat(UInt8(1)),1_000_000)
y2 = DataFrames.PooledArray(DataFrames.PooledArrays.RefArray(refs),invpool,pool)

@assert isequal(y,y2)

view of PooledArray

CategoricalArrays.jl handles DataAPI.refarray, DataAPI.refvalue, and DataAPI.refpool correctly for views. I propose to add the same for PooledArrays.jl (now views return nothing from DataAPI.refpool).

Disallow mutable eltype in PooledArray

I think we should disallow allowing mutable elements of PooledArray.

The reason of this proposal is the following problem:

julia> x = PooledArray([[1]])
1-element PooledArray{Array{Int64,1},UInt8,1,Array{UInt8,1}}:
 [1]

julia> push!(x[1], 2)
2-element Array{Int64,1}:
 1
 2

julia> x.invpool
Dict{Array{Int64,1},UInt8} with 1 entry:
  [1, 2] => 0x01

julia> x.invpool[[1,2]]
ERROR: KeyError: key [1, 2] not found

Alternatively we should change how invpool is handled if mutable objects are stored in PooledArray or explicitly warn that it is not allowed to mutate them.

ambiguity in vcat

We currently have the following design issue:

julia> vcat([1,2,3], PooledArray([1,2,3]))
ERROR: MethodError: vcat(::Vector{Int64}, ::PooledVector{Int64, UInt32, Vector{UInt32}}) is ambiguous.

julia> vcat(PooledArray([1,2,3]), [1,2,3])
ERROR: MethodError: vcat(::PooledVector{Int64, UInt32, Vector{UInt32}}, ::Vector{Int64}) is ambiguous.