Comments (6)

GPSnoopy commented on June 14, 2024

A few ideas / todos:

  • User can pass a mapping handler (interface + default base class?).
  • Different handler for read and write.
  • Have a base implementation that by default dispatches to LogicalRead.GetDirectReader() and LogicalRead.GetConverter() (and the same for the writer).
  • Existing converters should be made public so that users can reuse as much of them as possible.
  • Row-oriented API should support this too.

Open questions:

  • Should the handlers be passed to file reader/writer ctors or only when calling LogicalColumn() on a column reader/writer?

from parquetsharp.

GPSnoopy commented on June 14, 2024

I've created PR #185 to implement some of the aforementioned ideas. It is a work in progress: reader only, no unit tests yet. But it should be a good illustration of what I had in mind.

When asking for the LogicalColumnReader, the converter factory can be explicitly passed by the user (otherwise the default one is used). Will add some unit tests to show how this can be done to handle user-provided types. In theory, it also allows the user to change how some of the existing types are handled as well.

m00ngoose commented on June 14, 2024

My example above could also be solved by recursive flattening. It wouldn't cover every use-case of custom (de)serializers, but perhaps it would be more useful / easier to use?

GPSnoopy commented on June 14, 2024

I have made some progress on the PR. Here is a code example.

[Test]
public static void TestReadConverter()
{
    using var buffer = new ResizableBuffer();

    // Write regular float values to the file.
    using (var output = new BufferOutputStream(buffer))
    {
        using var fileWriter = new ParquetFileWriter(output, new Column[] {new Column<float>("values")});
        using var groupWriter = fileWriter.AppendRowGroup();
        using var columnWriter = groupWriter.NextColumn().LogicalWriter<float>();

        columnWriter.WriteBatch(new[] {1f, 2f, 3f});
        fileWriter.Close();
    }

    // Read back the float values using a custom user-type.
    using (var input = new BufferReader(buffer))
    {
        using var fileReader = new ParquetFileReader(input)
        {
            LogicalTypeFactory = new ReadTypeFactory(),
            LogicalReadConverterFactory = new ReadConverterFactory()
        };
        using var groupReader = fileReader.RowGroup(0);
        using var columnReader = groupReader.Column(0).LogicalReader<VolumeInDollars>();

        var expected = new[] {new VolumeInDollars(1f), new VolumeInDollars(2f), new VolumeInDollars(3f)};
        var values = columnReader.ReadAll(checked((int) groupReader.MetaData.NumRows));

        Assert.AreEqual(expected, values);
    }
}

[StructLayout(LayoutKind.Sequential)]
private readonly struct VolumeInDollars : IEquatable<VolumeInDollars>
{
    public VolumeInDollars(float value)
    {
        Value = value;
    }

    public readonly float Value;

    public bool Equals(VolumeInDollars other)
    {
        return Value.Equals(other.Value);
    }

    public override string ToString()
    {
        return $"VolumeInDollars({Value})";
    }
}

/// <summary>
/// A logical type factory that supports our user custom type.
/// </summary>
private sealed class ReadTypeFactory : LogicalTypeFactory
{
    public override (Type physicalType, Type logicalType) GetSystemTypes(ColumnDescriptor descriptor, Type columnLogicalTypeHint)
    {
        // We have to use the column name to know what type to expose if we don't get the column hint.
        // The column logical type hint is given to us if we use the row-oriented API or a ParquetFileWriter
        // with the Column[] ctor argument.
        columnLogicalTypeHint ??= descriptor.Path.ToDotVector().First() == "values" ? typeof(VolumeInDollars) : null;
        return base.GetSystemTypes(descriptor, columnLogicalTypeHint);
    }
}

/// <summary>
/// A read converter factory that supports our custom type.
/// </summary>
private sealed class ReadConverterFactory : LogicalReadConverterFactory
{
    public override Delegate GetDirectReader<TLogical, TPhysical>()
    {
        // Optional: the following is an optimisation and not strictly needed (but helps with speed).
        // Since VolumeInDollars is bitwise identical to float, we can read the values in-place.
        if (typeof(TLogical) == typeof(VolumeInDollars))
        {
            return LogicalRead.GetDirectReader<VolumeInDollars, float>();
        }

        return base.GetDirectReader<TLogical, TPhysical>();
    }

    public override Delegate GetConverter<TLogical, TPhysical>(ColumnDescriptor columnDescriptor, ColumnChunkMetaData columnChunkMetaData)
    {
        // VolumeInDollars is bitwise identical to float, so we can reuse the native converter.
        if (typeof(TLogical) == typeof(VolumeInDollars))
        {
            return LogicalRead.GetNativeConverter<VolumeInDollars, float>();
        }

        return base.GetConverter<TLogical, TPhysical>(columnDescriptor, columnChunkMetaData);
    }
}

The ReadTypeFactory is needed so that ParquetSharp knows which logical type a column reader exposes when it is created. In this example, that is decided using the column name. If you use the row-oriented API, I don't expect this will be needed.

The ReadConverterFactory is always needed, as it tells ParquetSharp how to convert the Parquet physical type into your C# type. GetDirectReader(), however, is only an optimisation and doesn't have to be overridden by the user for correctness.
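
To make the "bitwise identical to float" point concrete, here is a rough standalone sketch. It is not ParquetSharp code and makes no claim about how the library implements its direct reader; it only uses the standard .NET MemoryMarshal.Cast to show that a buffer of floats can be viewed as VolumeInDollars values without copying. The VolumeDemo class and its Reinterpret method are hypothetical names made up for this illustration.

```csharp
using System;
using System.Runtime.InteropServices;

// Same layout as in the test above: a single float field and nothing else.
[StructLayout(LayoutKind.Sequential)]
public readonly struct VolumeInDollars
{
    public VolumeInDollars(float value)
    {
        Value = value;
    }

    public readonly float Value;
}

// Hypothetical helper, only for illustration.
public static class VolumeDemo
{
    // Views the float buffer as VolumeInDollars values. No copy is made:
    // the returned span points at the same memory as the input array.
    public static Span<VolumeInDollars> Reinterpret(float[] physicalValues)
    {
        return MemoryMarshal.Cast<float, VolumeInDollars>(physicalValues.AsSpan());
    }
}

public static class Program
{
    public static void Main()
    {
        var physical = new[] {1f, 2f, 3f};
        var logical = VolumeDemo.Reinterpret(physical);

        Console.WriteLine(logical.Length);   // 3
        Console.WriteLine(logical[1].Value); // 2
    }
}
```

This only works because the struct has exactly the same size and layout as a float. A user type that needs a real conversion (for example a different unit or an extra field) cannot be read in place and has to go through a converter instead.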

GPSnoopy commented on June 14, 2024

Should be partially addressed by PR #185. Row-oriented API is to be done as part of a second PR (the first one was getting rather large already).

GPSnoopy commented on June 14, 2024

TODO:

  • Expose factories in row-oriented API
  • "This PR changed the public api in at least two places (CreateSchemaNode, GetSystemTypes)" -> add back the default overload.
