Comments (6)
A few ideas / todos:
- User can pass a mapping handler (interface + default base class?).
- Different handler for read and write.
- Have a base implementation that by default dispatches to LogicalRead.GetDirectReader() and LogicalReader.GetConverter() (and same for writer).
- Existing converters ought to be made public so that users can reuse as much of them as possible.
- Row-oriented API should support this too.
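To make the "interface + default base class" idea concrete, here is a minimal sketch of the shape such a handler could take. Every name in it (IReadHandler, DefaultReadHandler, PriceReadHandler, Price) is hypothetical and not part of ParquetSharp, and the physical type is fixed to float to keep it short.

```csharp
using System;

// Hypothetical handler interface: returns a delegate that maps a physical
// float value to the requested logical type.
public interface IReadHandler
{
    Func<float, TLogical> GetConverter<TLogical>();
}

// Hypothetical default base class: handles the built-in case and is meant
// to be subclassed for user types.
public class DefaultReadHandler : IReadHandler
{
    public virtual Func<float, TLogical> GetConverter<TLogical>()
    {
        if (typeof(TLogical) == typeof(float))
        {
            // Identity conversion for the native type.
            return (Func<float, TLogical>) (object) new Func<float, float>(v => v);
        }

        throw new NotSupportedException($"No converter for {typeof(TLogical)}");
    }
}

// Hypothetical user type wrapping a float.
public readonly struct Price
{
    public Price(float value) { Value = value; }
    public readonly float Value;
}

// User handler: adds support for Price and defers everything else to the base.
public sealed class PriceReadHandler : DefaultReadHandler
{
    public override Func<float, TLogical> GetConverter<TLogical>()
    {
        if (typeof(TLogical) == typeof(Price))
        {
            return (Func<float, TLogical>) (object) new Func<float, Price>(v => new Price(v));
        }

        return base.GetConverter<TLogical>();
    }
}
```

With this shape, a user only overrides the cases they care about, and any type they do not handle still goes through the default dispatch.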
Open questions:
- Should the handlers be passed to file reader/writer ctors or only when calling LogicalColumn() on a column reader/writer?
from parquetsharp.
I've created PR #185 to implement some of the aforementioned ideas. It is a work in progress: reader only, no unit tests yet. But it should be a good illustration of what I had in mind.
When asking for the LogicalColumnReader, the converter factory can be explicitly passed by the user (otherwise the default one is used). Will add some unit tests to show how this can be done to handle user-provided types. In theory, it also allows the user to change how some of the existing types are handled as well.
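The "explicit factory, otherwise the default" behaviour described above can be sketched with an optional parameter. The types below are hypothetical stand-ins written for this illustration; they are not the ParquetSharp classes and do not reflect the actual signatures in the PR.

```csharp
using System;

// Hypothetical stand-in for a converter factory.
public class ConverterFactory
{
    // Shared default instance, used when the caller passes nothing.
    public static readonly ConverterFactory Default = new ConverterFactory();

    public virtual string Describe() => "default";
}

// Hypothetical user-provided factory.
public sealed class CustomConverterFactory : ConverterFactory
{
    public override string Describe() => "custom";
}

// Hypothetical stand-in for a column reader.
public sealed class ColumnReaderSketch
{
    // The factory is optional: null means "use the default one".
    public string LogicalReader(ConverterFactory factory = null)
    {
        return (factory ?? ConverterFactory.Default).Describe();
    }
}
```

Calling LogicalReader() with no argument uses the default factory, while LogicalReader(new CustomConverterFactory()) routes through the user's overrides.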
My example above could also be solved by recursive flattening. It wouldn't cover every use-case of custom (de)serializers, but perhaps it would be more useful / easier to use?
I have made some progress on the PR. Here is a code example.
[Test]
public static void TestReadConverter()
{
    using var buffer = new ResizableBuffer();

    // Write regular float values to the file.
    using (var output = new BufferOutputStream(buffer))
    {
        using var fileWriter = new ParquetFileWriter(output, new Column[] {new Column<float>("values")});
        using var groupWriter = fileWriter.AppendRowGroup();
        using var columnWriter = groupWriter.NextColumn().LogicalWriter<float>();

        columnWriter.WriteBatch(new[] {1f, 2f, 3f});
        fileWriter.Close();
    }

    // Read back the float values using a custom user-type.
    using (var input = new BufferReader(buffer))
    {
        using var fileReader = new ParquetFileReader(input)
        {
            LogicalTypeFactory = new ReadTypeFactory(),
            LogicalReadConverterFactory = new ReadConverterFactory()
        };
        using var groupReader = fileReader.RowGroup(0);
        using var columnReader = groupReader.Column(0).LogicalReader<VolumeInDollars>();

        var expected = new[] {new VolumeInDollars(1f), new VolumeInDollars(2f), new VolumeInDollars(3f)};
        var values = columnReader.ReadAll(checked((int) groupReader.MetaData.NumRows));

        Assert.AreEqual(expected, values);
    }
}

[StructLayout(LayoutKind.Sequential)]
private readonly struct VolumeInDollars : IEquatable<VolumeInDollars>
{
    public VolumeInDollars(float value)
    {
        Value = value;
    }

    public readonly float Value;

    public bool Equals(VolumeInDollars other)
    {
        return Value.Equals(other.Value);
    }

    public override string ToString()
    {
        return $"VolumeInDollars({Value})";
    }
}

/// <summary>
/// A logical type factory that supports our user custom type.
/// </summary>
private sealed class ReadTypeFactory : LogicalTypeFactory
{
    public override (Type physicalType, Type logicalType) GetSystemTypes(ColumnDescriptor descriptor, Type columnLogicalTypeHint)
    {
        // We have to use the column name to know what type to expose if we don't get the column hint.
        // The column logical type hint is given to us if we use the row-oriented API or a ParquetFileWriter
        // with the Column[] ctor argument.
        columnLogicalTypeHint ??= descriptor.Path.ToDotVector().First() == "values" ? typeof(VolumeInDollars) : null;
        return base.GetSystemTypes(descriptor, columnLogicalTypeHint);
    }
}

/// <summary>
/// A read converter factory that supports our custom type.
/// </summary>
private sealed class ReadConverterFactory : LogicalReadConverterFactory
{
    public override Delegate GetDirectReader<TLogical, TPhysical>()
    {
        // Optional: the following is an optimisation and not strictly needed (but helps with speed).
        // Since VolumeInDollars is bitwise identical to float, we can read the values in-place.
        if (typeof(TLogical) == typeof(VolumeInDollars))
        {
            return LogicalRead.GetDirectReader<VolumeInDollars, float>();
        }

        return base.GetDirectReader<TLogical, TPhysical>();
    }

    public override Delegate GetConverter<TLogical, TPhysical>(ColumnDescriptor columnDescriptor, ColumnChunkMetaData columnChunkMetaData)
    {
        // VolumeInDollars is bitwise identical to float, so we can reuse the native converter.
        if (typeof(TLogical) == typeof(VolumeInDollars))
        {
            return LogicalRead.GetNativeConverter<VolumeInDollars, float>();
        }

        return base.GetConverter<TLogical, TPhysical>(columnDescriptor, columnChunkMetaData);
    }
}

The ReadTypeFactory is needed to know what type a column-reader is when we create it. In this example, it's done using the column name. If you use the row-oriented API, I don't expect this will be needed.
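The name-based lookup can be isolated from ParquetSharp to show the fallback order: use the caller's type hint when there is one, otherwise consult the column name. This is only a sketch; ColumnTypeHints, Resolve and the Volume struct are hypothetical names introduced for the illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical user type standing in for VolumeInDollars.
public readonly struct Volume
{
    public Volume(float value) { Value = value; }
    public readonly float Value;
}

public static class ColumnTypeHints
{
    // Hypothetical lookup table: column name -> logical type to expose.
    private static readonly Dictionary<string, Type> ByColumnName = new Dictionary<string, Type>
    {
        {"values", typeof(Volume)}
    };

    // Prefer the caller's hint; otherwise fall back to the column name.
    // Returns null when neither applies, leaving the decision to the default logic.
    public static Type Resolve(string columnName, Type hint)
    {
        if (hint != null)
        {
            return hint;
        }

        return ByColumnName.TryGetValue(columnName, out var type) ? type : null;
    }
}
```

A table keyed by column name scales a little better than a chain of string comparisons once more than one column needs a custom type.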
The ReadConverterFactory is always needed, as it tells ParquetSharp how to convert the Parquet physical type to your C# type. GetDirectReader(), however, is only an optimisation and doesn't have to be overridden by the user for correctness.
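To illustrate why a struct that is bitwise identical to float can skip per-value conversion, the sketch below reinterprets a float buffer as the user type with MemoryMarshal.Cast. It is a standalone illustration of the idea, not a description of how ParquetSharp implements GetDirectReader(); the ReinterpretDemo class is a hypothetical name.

```csharp
using System;
using System.Runtime.InteropServices;

// Same layout as in the example above: a single float field, sequential layout.
[StructLayout(LayoutKind.Sequential)]
public readonly struct VolumeInDollars
{
    public VolumeInDollars(float value) { Value = value; }
    public readonly float Value;
}

public static class ReinterpretDemo
{
    // Views the float buffer as VolumeInDollars without converting each element,
    // then copies the view into a new array for the caller.
    public static VolumeInDollars[] Reinterpret(float[] physical)
    {
        ReadOnlySpan<VolumeInDollars> view =
            MemoryMarshal.Cast<float, VolumeInDollars>(physical.AsSpan());

        return view.ToArray();
    }
}
```

This only works because both types have the same size and contain no references; a user type with a different layout would need an element-by-element converter instead.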
Should be partially addressed by PR #185. Row-oriented API is to be done as part of a second PR (the first one was getting rather large already).
TODO:
- Expose factories in row-oriented API
- "This PR changed the public api in at least two places (CreateSchemaNode, GetSystemTypes)" -> add back the default overload.