ParquetSharp

Introduction

ParquetSharp is a cross-platform .NET library for reading and writing Apache Parquet files.

ParquetSharp is implemented in C# as a PInvoke wrapper around Apache Parquet C++ to provide high performance and compatibility. Check out ParquetSharp.DataFrame if you need convenient integration with .NET DataFrames.

Supported platforms:

Chip    Linux   Windows   macOS
x64     ✓       ✓         ✓
arm64   ✓       ✗         ✓

Why use Parquet?

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Relative to CSV files, Parquet executes queries 34x faster while taking up 87% less space. Source

Quickstart

The following examples show how to write and then read a Parquet file with three columns representing a timeseries of object-value pairs. These use the low-level API, which is the recommended API for working with native .NET types and closely maps to the API of Apache Parquet C++. For reading and writing data in the Apache Arrow format, an Arrow based API is also provided.

How to write a Parquet File:

var timestamps = new DateTime[] { /* ... */ };
var objectIds = new int[] { /* ... */ };
var values = new float[] { /* ... */ };

var columns = new Column[]
{
    new Column<DateTime>("Timestamp"),
    new Column<int>("ObjectId"),
    new Column<float>("Value")
};

using var file = new ParquetFileWriter("float_timeseries.parquet", columns);
using var rowGroup = file.AppendRowGroup();

using (var timestampWriter = rowGroup.NextColumn().LogicalWriter<DateTime>())
{
    timestampWriter.WriteBatch(timestamps);
}
using (var objectIdWriter = rowGroup.NextColumn().LogicalWriter<int>())
{
    objectIdWriter.WriteBatch(objectIds);
}
using (var valueWriter = rowGroup.NextColumn().LogicalWriter<float>())
{
    valueWriter.WriteBatch(values);
}

file.Close();

How to read a Parquet file:

using var file = new ParquetFileReader("float_timeseries.parquet");

for (int rowGroup = 0; rowGroup < file.FileMetaData.NumRowGroups; ++rowGroup) {
    using var rowGroupReader = file.RowGroup(rowGroup);
    var groupNumRows = checked((int) rowGroupReader.MetaData.NumRows);

    var groupTimestamps = rowGroupReader.Column(0).LogicalReader<DateTime>().ReadAll(groupNumRows);
    var groupObjectIds = rowGroupReader.Column(1).LogicalReader<int>().ReadAll(groupNumRows);
    var groupValues = rowGroupReader.Column(2).LogicalReader<float>().ReadAll(groupNumRows);
}

file.Close();
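For the Arrow-based API mentioned above, the following is a minimal read sketch rather than the canonical usage: it assumes the FileReader class from the ParquetSharp.Arrow namespace and an async calling context, and reads the quickstart file as Apache Arrow record batches.

using ParquetSharp.Arrow;

using var fileReader = new FileReader("float_timeseries.parquet");
using var batchReader = fileReader.GetRecordBatchReader();

// ReadNextRecordBatchAsync returns null once all row groups have been read.
while (await batchReader.ReadNextRecordBatchAsync() is { } batch)
{
    using (batch)
    {
        // Each batch is an Apache.Arrow.RecordBatch covering a chunk of rows.
    }
}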

Documentation

For more detailed information on how to use ParquetSharp, see the documentation included with the project repository.

Rationale

We desired a Parquet implementation with the following properties:

  • Cross platform (originally Windows and Linux - but now also macOS).
  • Callable from .NET Core.
  • Good performance.
  • Well maintained.
  • Close to official Parquet reference implementations.

Not finding an existing solution meeting these requirements, we decided to implement a .NET wrapper around apache-parquet-cpp (now part of Apache Arrow) starting at version 1.4.0. The library tries to stick closely to the existing C++ API, although it does provide higher level APIs to facilitate its usage from .NET. The user should always be able to access the lower-level API.

Performance

The following benchmarks can be reproduced by running ParquetSharp.Benchmark.csproj. The relative performance of ParquetSharp 10.0.1 is compared to Parquet.NET 4.6.2, an alternative open-source .NET library that is fully managed. The Decimal tests focus purely on handling the C# decimal type, while the TimeSeries tests benchmark three columns of the types {int, DateTime, float}. Results are from a Ryzen 5900X on Linux 6.2.7 using the dotnet 6.0.14 runtime.

If performance is a concern for you, we recommend benchmarking your own workloads and testing different encodings and compression methods. For example, disabling dictionary encoding for floating point columns can often significantly improve performance.

               Decimal (Read)   Decimal (Write)   TimeSeries (Read)   TimeSeries (Write)
Parquet.NET    1.0x             1.0x              1.0x                1.0x
ParquetSharp   4.0x Faster      3.0x Faster       2.8x Faster         1.5x Faster
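As a hedged illustration of this kind of tuning (the column name follows the quickstart above; which settings are worth changing depends entirely on your data):

var columns = new Column[] { new Column<float>("Value") };

using var properties = new WriterPropertiesBuilder()
    .Compression(Compression.Snappy) // choose a compression codec
    .DisableDictionary("Value")      // plain-encode the float column instead of dictionary encoding
    .Build();

using var file = new ParquetFileWriter("tuned.parquet", columns, properties);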

Known Limitations

Because this library is a thin wrapper around the Parquet C++ library, misuse can cause native memory access violations.

Typically this can arise when attempting to access an instance whose owner has been disposed. Because some objects and properties are exposed by Parquet C++ via regular pointers (instead of consistently using std::shared_ptr), dereferencing these after the owner class instance has been destructed will lead to an invalid pointer access.
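A contrived example of this failure mode, using the quickstart file from above:

LogicalColumnReader<float> danglingReader;

using (var file = new ParquetFileReader("float_timeseries.parquet"))
using (var rowGroup = file.RowGroup(0))
{
    danglingReader = rowGroup.Column(2).LogicalReader<float>();
}

// The file and row group readers have now been disposed, so reading from
// danglingReader here may dereference freed native memory.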

As only 64-bit runtimes are available, ParquetSharp cannot be referenced by a 32-bit project. For example, using the library from F# Interactive requires running fsiAnyCpu.exe rather than fsi.exe.

Building

Dev Container

ParquetSharp can be built and tested within a dev container. This is probably the easiest way to get started, as all the C++ dependencies are prebuilt into the container image.

GitHub Codespaces

If you have a GitHub account, you can simply open ParquetSharp in a new GitHub Codespace by clicking on the green "Code" button at the top of this page.

Choose the "unspecified" CMake kit when prompted and let the C++ configuration run. Once done, you can build the C++ code via the "Build" button in the status bar at the bottom.

You can then build the C# code by right-clicking the ParquetSharp solution in the Solution Explorer on the left and choosing "Build". The Test Explorer will then get populated with all the C# tests too.

Visual Studio Code

If you want to work locally in Visual Studio Code, all you need is to have Docker and the Dev Containers extension installed.

Simply open up your copy of ParquetSharp in VS Code and click "Reopen in container" when prompted. Once the project has been opened, you can follow the GitHub Codespaces instructions above.

Podman and SELinux workarounds

Using the dev container on a Linux system with podman and SELinux requires some workarounds.

You'll need to edit .devcontainer/devcontainer.json and add the following lines:

  "remoteUser": "root",
  "containerUser": "root",
  "workspaceMount": "",
  "runArgs": ["--volume=${localWorkspaceFolder}:/workspaces/${localWorkspaceFolderBasename}:Z"],
  "containerEnv": { "VCPKG_DEFAULT_BINARY_CACHE": "/home/vscode/.cache/vcpkg/archives" }

This configures the container to run as the root user, because when you run podman as a non-root user your user id is mapped to root in the container, and files in the workspace folder will be owned by root.

The workspace mount command is also modified to add the :Z suffix, which tells podman to relabel the volume to allow access to it from within the container.

Finally, setting the VCPKG_DEFAULT_BINARY_CACHE environment variable makes the root user in the container use the vcpkg cache of the vscode user.

CLI

If the CLI is how you roll, then you can install the Dev Container CLI tool and issue the following command in your copy of ParquetSharp to get up and running:

devcontainer up

Build the C++ code and run the C# tests with:

devcontainer exec ./build_unix.sh
devcontainer exec dotnet test csharp.test

Native

Building ParquetSharp natively requires the following dependencies:

  • A modern C++ compiler toolchain
  • .NET SDK 7.0
  • Apache Arrow (15.0.2)

For building Arrow (including Parquet) and its dependencies, we recommend using Microsoft's vcpkg. The build scripts will use an existing vcpkg installation if either the VCPKG_INSTALLATION_ROOT or VCPKG_ROOT environment variable is defined, otherwise vcpkg will be downloaded into the build directory.

Windows

Building ParquetSharp on Windows requires Visual Studio 2022 (17.0 or higher).

Open a Visual Studio Developer PowerShell and run the following commands to build the C++ code and run the C# tests:

build_windows.ps1
dotnet test csharp.test

cmake must be available in the PATH for the build script to succeed.

Unix

Build the C++ code and run the C# tests with:

./build_unix.sh
dotnet test csharp.test

Known Issues

An issue that may occur when building ParquetSharp locally using build_windows.ps1 is Visual Studio not being detected by CMake:

CMake Error at CMakeLists.txt:2 (project):
  Generator

    Visual Studio 17 2022

  could not find any instance of Visual Studio.

This is a known issue: (1) (2). It can be solved by ensuring that all required Visual Studio Build Tools are properly installed and that the relevant version of Visual Studio is available, and finally rebooting the machine. Another potential solution is to reinstall Visual Studio with the required build tools.

When building, you may come across the following problem with Microsoft.Cpp.Default.props:

error MSB4019: The imported project "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\Microsoft.Cpp.Default.props" was not found. Confirm that the expression in the Import declaration "C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\MSBuild\Microsoft\VC\v170\\Microsoft.Cpp.Default.props" is correct, and that the file exists on disk.

To resolve this, make sure that the "Desktop development with C++" option is selected when installing Visual Studio Build Tools. If installation is successful, the required directory and files should be present.

Another common issue is the following:

CMake Error at CMakeLists.txt:2 (project):
  The CMAKE_C_COMPILER:

    C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.37.32822/bin/Hostx64/x64/cl.exe

  is not a full path to an existing compiler tool.

CMake Error at CMakeLists.txt:2 (project):
  The CMAKE_CXX_COMPILER:

    C:/Program Files/Microsoft Visual Studio/2022/Community/VC/Tools/MSVC/14.37.32822/bin/Hostx64/x64/cl.exe

  is not a full path to an existing compiler tool.

This is also related to installed Visual Studio modules. Make sure to install "C++/CLI support for build tools" from the list of optional components for Desktop development with C++ for the relevant version of Visual Studio.

For any other build issues, please open a new discussion.

Contributing

We welcome new contributors! We will happily receive PRs for bug fixes or small changes. If you're contemplating something larger please get in touch first by opening a GitHub Issue describing the problem and how you propose to solve it.

License

Copyright 2018-2023 G-Research

Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

parquetsharp's Issues

Writing sparse Decimals and DateTimes

I'm having trouble finding my way around writing sparse larger binary types.

I have an array of DateTime or decimal values with an accompanying array of defLevels.

I keep getting:
Unhandled exception. System.InvalidCastException: Unable to cast object of type 'ParquetSharp.ColumnWriter'1[System.Int64]' to type 'ParquetSharp.ColumnWriter'1[System.DateTime]'

when trying to write with WriteBatch(nItems, defLevels, null, dates: Array<DateTime>).

What am I doing wrong?
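For reference, the logical column writer handles definition levels automatically when the element type is nullable, which avoids the physical-writer cast above. A minimal sketch, not taken from the issue thread (column name and data are illustrative):

using var file = new ParquetFileWriter("sparse.parquet", new Column[] { new Column<DateTime?>("Timestamp") });
using var rowGroup = file.AppendRowGroup();

using (var writer = rowGroup.NextColumn().LogicalWriter<DateTime?>())
{
    // Nulls in the array become definition level 0; no explicit defLevels are needed.
    writer.WriteBatch(new DateTime?[] { DateTime.UtcNow, null, DateTime.UtcNow });
}

file.Close();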

ColumnChunkMetaData is susceptible to premature GC collection

This is because the underlying C++ object is held via a std::unique_ptr and is therefore owned only by the ColumnChunkMetaData instance itself. If the ColumnChunkMetaData instance gets disposed, the underlying structure gets deleted even if the FileReader/RowGroupReader is still around.

For example:
var stats = reader.MetaData.GetColumnChunkMetaData(0).Statistics;
The object returned by GetColumnChunkMetaData(0) can be disposed and finalized at any time by the GC.

All public methods/properties of ColumnChunkMetaData need to have a GC.KeepAlive(this) appended to ensure this does not happen.

Do review other classes for this weakness (although in theory, all of them are susceptible).
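In the meantime, a user-side mitigation is to avoid chaining the calls and instead keep the owning object alive for as long as its children are in use. A sketch following the snippet above:

// Hold the ColumnChunkMetaData in a scoped variable so it stays reachable
// (and is only disposed) after the Statistics object is no longer needed.
using var columnChunkMetaData = reader.MetaData.GetColumnChunkMetaData(0);
var stats = columnChunkMetaData.Statistics;
// ... use stats only while columnChunkMetaData is still in scope ...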

Support to RowGroup::sort_columns

The Parquet format supports defining a set of columns that describes the sort order of the data in a row group.

This set of columns is called "sorting_columns", and it's defined in the parquet-format project containing format specifications and Thrift definitions of metadata required to properly read Parquet files. A link to this definition can be found here: https://github.com/apache/parquet-format/blob/01971a532e20ff8e5eba9d440289bfb753f0cf0b/src/main/thrift/parquet.thrift#L818

I see that in ParquetSharp, RowGroupWriter has quite a similar structure to parquet-format::RowGroup, but it does not contain "sorting_columns". Is it correct to say that they are equivalent? If so, is there any plan to add support for this structure of columns?

I also noticed that the ColumnDescriptor class contains a property called SortOrder, which differs from parquet-format::SortingColumn (being just an enum). Can ColumnDescriptor.SortOrder be used as an alternative way to describe the sort columns of the data in the RowGroup?
I was not able to find a property similar to SortOrder in parquet-format::ColumnMetaData, which gives me the impression that the sorting columns definition (originally specified in parquet-format::RowGroup) is instead defined in ColumnDescriptor.

ParquetSharp 2.0.2 writes parquet files SPARK and pyarrow cannot open

I recently upgraded to ParquetSharp 2.0.2, but found that both Spark and PyArrow were unable to read the parquet files which it wrote. Spark would throw an lz4 decompression error, while PyArrow would (depending on the version) return errors like:

ArrowIOError: Couldn't deserialize thrift: don't know what type:
Deserializing page header failed.

FastParquet would (depending on the version) return errors like:

line 71, in lz4_decompress
return lz4.block.decompress(data, uncompressed_size=uncompressed_size)
TypeError: an integer is required (got type NoneType)

All files were written using default (Snappy) compression.

Reverting to ParquetSharp 2.0.1 and rewriting all parquet files resolved all my issues.

PrimitiveNode.Equals() should not compare DecimalMetadata

The call to LogicalType.Equals() now implies comparison of the decimal data if the type is Decimal.

Also DecimalMetaData naively compares its tuple values. But in truth, if both left and right have IsSet = false, the other properties should be ignored.

Memory pressure and performance on string-heavy parquet

I have noticed some extreme memory pressure and performance issues when working with string-heavy parquet. Looking at the column reader, it UTF-8 decodes every string instance. Perhaps a better strategy would be to keep an instance cache from ByteArray to string?

ParquetRowWriter is not double-Dispose safe

Internally, calling Close() on the RowGroupWriter is not safe once the parent file writer has been disposed of.

The fix is likely to set the RowGroupWriter to null within the ParquetRowWriter.Dispose() method.

Empty string and byte[] cause NullReferenceException

Amend TestLogicalTypeRoundtrip with:

                new ExpectedColumn
                {
                    Name = "string_field",
                    PhysicalType = PhysicalType.ByteArray,
                    LogicalType = LogicalType.Utf8,
                    Values = Enumerable.Range(0, NumRows).Select(i => i % 9 == 0 ? i % 18 == 0 ? null : "" : $"Hello, {i}!").ToArray(),
                    NullCount = (NumRows + 17) / 18,
                    NumValues = NumRows - (NumRows + 17) / 18,
                    Min = "",
                    Max = "Hello, 98!",
                    Converter = StringConverter
                },

Same for byte[].

The code will break around the ByteArray conversion to string and (most likely) byte[] as well, since the ByteArray Pointer is null when Length is zero.

Query if column type is supported

Feature request by user.

For the moment, the only way to know if a particular logical type is supported by the library is to create the column writer, which will throw if the type is not supported.

Adding support for custom (de)serializers

Looking at

_directReader = LogicalRead<TLogical, TPhysical>.GetDirectReader();

I particularly have in mind strong type-aliases, eg.

public readonly struct UserId {
    public readonly int UserId;
    public UserId(int userId) => UserId = userId;
}

public class User {
    public readonly UserId UserId;
    public readonly string Name;
    public readonly int Age;
    public User(UserId userId, string name, int age) {
        UserId = userId;
        Name = name;
        Age = age;
    }
}

then we'll trip on the UserId field, though it's quite obvious we should use int as the physical type. This is particularly painful for the row-oriented api, where you need to create a whole new class to hold the (physical) values. It would be nice to be able to seamlessly round-trip to/from User without an intermediate type.

I therefore propose that the user be allowed to provide custom (de)serializers.

Add UUID support

ParquetSharp has already consumed an Arrow version supporting the UUID logical type, but there's no support for it in ParquetSharp yet, i.e. in ColumnDescriptor.cs

Corrupt parquet files will be created for some float data

I created a parquet file with different columns. Two float data columns get corrupted (it is no longer possible to open them with Apache Drill or Parquet.NET, for example; they fail with an index out of bounds exception). If I change the data type to double in C#, the parquet file is generated correctly. I am writing the files exactly as documented under "Low-level API".

Update: 99% of generated parquet files are working, but some of them are failing (reproducibly). Maybe there is some inconsistency in float32 handling between C# and C++?

parquet_output.zip

Writing buffered row groups with plain encoding leads to corrupt data.

See buffered unit tests in TestPhysicalTypeRoundTrip.
(current in branch https://github.com/G-Research/ParquetSharp/tree/BufferedRowGroupTest)

It appears that a recently introduced upstream bug (Arrow 0.16) causes file/data corruption when writing a large number of rows using buffered row groups and plain encoding.

The problem does not seem to happen if any of the following is true:

  • using normal unbuffered row groups,
  • using dictionary encoding,
  • writing a small number of rows (less than ~50K, ballpark figure).

There also seems to be a smaller, unrelated issue where the list of Encodings returned from the column chunk metadata contains duplicate entries (Plain encoding in these tests).

Attempt at investigating and fixing the upstream issue here:
https://github.com/G-Research/gr-oss/issues/136

ParquetRowWriter throws when writing a Decimal

Whenever I write Decimals with ParquetRowWriter I get an exception.

ParquetSharp.ParquetException : class parquet::ParquetException (message: 'Invalid DECIMAL precision: -1. Precision must be a number between 1 and 38 inclusive')
at ParquetSharp.ExceptionInfo.Check(IntPtr exceptionInfo)
at ParquetSharp.Column.CreateSchemaNode(Type type, String name, LogicalType logicalTypeOverride, Int32 length, Int32 precision, Int32 scale)
at ParquetSharp.ParquetFileWriter.<>c.b__8_0(Column c)
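For context, the low-level API lets you specify the decimal precision and scale explicitly via a LogicalType override, which avoids the precision of -1 seen above. A hedged sketch (column name illustrative; ParquetSharp stores C# decimals with a precision of 29):

var columns = new Column[]
{
    // precision 29, scale 3
    new Column<decimal>("Amount", LogicalType.Decimal(29, 3))
};

using var file = new ParquetFileWriter("decimals.parquet", columns);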

Add DeepClone method to Schema.Node

Due to schema node lifetimes being bound to the parquet file reader, it is useful to be able to deep-clone them. The following code should probably be a set of virtual methods on Node, GroupNode and PrimitiveNode.

        private static Node DeepClone(Node node)
        {
            switch (node)
            {
                case GroupNode groupNode:
                    return DeepClone(groupNode);
                case PrimitiveNode primitiveNode:
                    return DeepClone(primitiveNode);
                default:
                    throw new ArgumentOutOfRangeException($"unknown node type {node.GetType()}");
            }
        }

        private static GroupNode DeepClone(GroupNode groupNode)
        {
            return new GroupNode(
                groupNode.Name,
                groupNode.Repetition,
                groupNode.Fields.Select(DeepClone).ToArray(),
                groupNode.LogicalType);
        }

        private static PrimitiveNode DeepClone(PrimitiveNode primitiveNode)
        {
            var decimalMetadata = primitiveNode.DecimalMetadata;

            return new PrimitiveNode(
                primitiveNode.Name,
                primitiveNode.Repetition,
                primitiveNode.PhysicalType,
                primitiveNode.LogicalType,
                primitiveNode.TypeLength,
                decimalMetadata.Precision,
                decimalMetadata.Scale);
        }

Support for outputting Parquet from a ReadOnlySpan<T>

Currently LogicalColumnWriter requires you to pass an array into WriteBatch.

This is fine for use cases where the quantity of data to be written is known at the start of data collection, since you can buffer your columns in arrays. But for use cases where the quantity is unknown, you have to use an expandable type that will inevitably have some unused array space left over when it comes time to write out the resultant Parquet file.

In most cases you'd work around this by copying your data into an exact-fit array before passing it to the Parquet library (ToArray). However, this is resource-prohibitive in memory-constrained environments when working with large data sets.

Internally, ParquetSharp uses Span types to avoid unnecessary copies. There's even a (private) method to take a ReadOnlySpan and sink it to Parquet format.

It would be very helpful if a variant of the WriteBatch API were exposed that accepted a ReadOnlySpan<T> directly to avoid this extra memory copy. For my current use case array support would be unnecessary, but in general it would be nice to have.

Is it possible to use the Delta encodings?

Hi,

I wanted to experiment with the delta encodings to see if they offer any advantage over the RleDictionary encoding that it seems to use on everything by default.

I am creating a WriterPropertiesBuilder and set .Version(ParquetVersion.PARQUET_2_0).

Then, I disable the dictionary and set the appropriate encoding. In this case, node.Name is "Time", which is a INT64 Timestamp and I wanted to try the DeltaBinaryPacked encoding.

builder.DisableDictionary(node.Name);
builder.Encoding(node.Name, ParquetSharp.Encoding.DeltaBinaryPacked);

This results in a ParquetException: class parquet::ParquetException (message: 'Not yet implemented: Selected encoding is not supported.')

This seems to come from here: https://github.com/apache/parquet-cpp/blob/80e110c823c5631ce4a4f0a5da486e759219f1e3/src/parquet/column_writer.cc#L552

Is there a way around this, or is it just a limitation of the underlying c++ implementation?

Make SchemaNodes interned to allow reference equality

If you attempt to do something like this:

public static int GetColumnIndexFromName(ParquetFileReader reader, string name)
{
    foreach (var i in Enumerable.Range(0, reader.FileMetaData.Schema.NumColumns))
    {
        var column = reader.FileMetaData.Schema.Column(i);
        var node = column.SchemaNode;
        while (node.Parent != reader.FileMetaData.Schema.SchemaRoot)
        {
            node = node.Parent;
        }
        if (node.Name == name)
        {
            return i;
        }
    }
    throw new KeyNotFoundException($"Failed to find column {name}");
}

It will fail to work as intended because the nodes are not reference-equal.
Workarounds include checking the node name ("schema") or the FieldId (0), but those are specific to the schema root. It would be nice to be able to just test the nodes for equality, including via reference equality.

Add full list support

Currently the lists supported are only arbitrarily nested primitive types, while the full spec allows arbitrary nested records.

Row based read access

Hi,

Can you post some code to show how you would read a parquet file row by row?

Regards
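For reference, a minimal sketch using the row-oriented API (ParquetSharp.RowOriented), assuming the three-column file from the quickstart above:

using ParquetSharp.RowOriented;

using var rowReader = ParquetFile.CreateRowReader<(DateTime Timestamp, int ObjectId, float Value)>("float_timeseries.parquet");

for (var rowGroup = 0; rowGroup < rowReader.FileMetaData.NumRowGroups; ++rowGroup)
{
    // ReadRows materialises one row group as an array of tuples.
    foreach (var row in rowReader.ReadRows(rowGroup))
    {
        Console.WriteLine($"{row.Timestamp}: {row.ObjectId} = {row.Value}");
    }
}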

Consider making the ByteBuffer public

Hi,

I wanted to suggest making ByteBuffer.cs public instead of internal, and possibly allowing inheritance.

Context is that I have several TBs of gzipped csvs to turn into parquet, and in order to save on time and memory, I skip allocating the UTF8 bytes as strings entirely. There did not seem to be a good way to use the Logical writers to submit a ReadOnlySpan<byte> for a string-type column, so I had to go for the physical writers and use ByteArray instead.

I basically had to build my own version of the ByteBuffer ( I added some deduping / re-using the same ByteArray to further save on memory ), and as such I was thinking it would be lovely if I could have just extended your ByteBuffer class instead. It would certainly make it easier for others who just want to prepare a bunch of a UTF8 bytes / ReadOnlySpan<byte> to write to a column.

Cheers, and many thanks for the great library. It performs very well.

Move CI to GitHub actions

  • We often time out after 1h build on AppVeyor.
  • GitHub actions are free for public projects.
  • Hardware specs look better.
  • CI description files are within the repo.

Columnar Metadata

I cannot find a way to add custom columnar metadata. I can read it, but I can't figure out how to write it. Am I missing something?

Long filename support

ParquetFileReader when used with a long string path fails:

ParquetSharp.ParquetException : class parquet::ParquetStatusException (message: 'IOError: Failed to open local file '<very long filename>'. Detail: [Windows error 3] The system cannot find the path specified.
')
   at ParquetSharp.ExceptionInfo.Check(IntPtr exceptionInfo)
   at ParquetSharp.ParquetFileReader..ctor(String path, ReaderProperties readerProperties)
   ...

Looks like ParquetSharp eventually calls https://github.com/apache/arrow/blob/88e3267ad09ca62643b7fe5ffd98eb03b29728fc/cpp/src/arrow/util/io_util.cc#L838 which is calling _wsopen_s on Windows, so https://docs.microsoft.com/en-us/cpp/c-runtime-library/path-field-limits?view=vs-2019 appears relevant, so it is just a case of needing to prepend the filename with \\?\?
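A workaround sketch along those lines (the \\?\ extended-length prefix is a Windows convention; whether the underlying Arrow call accepts it is an assumption worth verifying):

// Illustrative only: prepend the extended-length prefix to a fully qualified path.
var longPath = @"C:\some\very\long\path\data.parquet";
var extendedPath = @"\\?\" + System.IO.Path.GetFullPath(longPath);

using var reader = new ParquetFileReader(extendedPath);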

Recursive row reading/writing

Copied from #177

public readonly struct UserId {
    public readonly int UserId;
    public UserId(int userId) => UserId = userId;
}

public class User {
    public readonly UserId UserId;
    public readonly string Name;
    public readonly int Age;
    public User(UserId userId, string name, int age) {
        UserId = userId;
        Name = name;
        Age = age;
    }
}

Another way of being able to round-trip this using the row-oriented API is by automatically flattening nested classes.

libthrift dynamically linked

ldd ParquetSharpNative.so
linux-vdso.so.1 (0x00007ffff02f7000)
libthrift.so.0.13.0 => not found
libthriftz.so.0.13.0 => not found
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f5702e90000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f5702b00000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f5702760000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f5702540000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f5702140000)
/lib64/ld-linux-x86-64.so.2 (0x00007f5703800000)

Interest in support for automatic column mapping in RowOriented reader?

Based on this issue (#72), it seems like the row-oriented reader is not really intended to be used as a core part of the library (more of an example).

Is there any interest in a PR to add some minor enhancements to the row oriented API? Specifically, I needed to pass some parquet data to and from a component that doesn't guarantee column order is preserved. When I read it back, I needed to map the columns to my TTuple type by name, rather than by position. I know that can't work for all possible usages of parquet but it was convenient.

Creating a Parquet file with an empty row group doesn't work anymore

It seems that the upgrade from Arrow 0.15.1 to Arrow 0.16 has caused a regression when writing a Parquet file with an empty row group (i.e. a row group with zero rows).

The following C# .NET Core 3.1 code works with ParquetSharp 2.0.2, but fails with 2.1.0-beta1 when reading the file.

public static void Main()
{
    try
    {
        WriteEmptyRowGroup();
        ReadEmptyRowGroup();
    }
    catch (Exception exception)
    {
        Console.WriteLine("ERROR: " + exception);
    }
}

private static void WriteEmptyRowGroup()
{
    var columns = new Column[]
    {
        new Column<int>("Id"), 
        new Column<float>("Value")
    };

    using var fileWriter = new ParquetFileWriter("empty_row_group.parquet", columns);
    using var groupWriter = fileWriter.AppendRowGroup();

    // Uncomment to get the actual write exception
    //groupWriter.Close();
}

private static void ReadEmptyRowGroup()
{
    using var fileReader = new ParquetFileReader("empty_row_group.parquet");
    using var groupReader = fileReader.RowGroup(0);

    if (groupReader.MetaData.NumRows != 0) throw new InvalidDataException($"expected 0 rows (got {groupReader.MetaData.NumRows})");
    if (groupReader.MetaData.NumColumns != 2) throw new InvalidDataException($"expected 2 columns (got {groupReader.MetaData.NumColumns}");

    using var idReader = groupReader.Column(0);
    using var valueReader = groupReader.Column(1);
}

Calling groupWriter.Close() or fileWriter.Close() in either Arrow 0.15.1 and 0.16 bubbles up an exception that otherwise gets gobbled by the C++ destructors: 'Only -1 out of 2 columns are initialized'. The difference is that in Arrow 0.15.1 the generated file is still readable, while in 0.16 it is not.

This is not the first attempt to address this issue. See the following ticket:
https://issues.apache.org/jira/browse/PARQUET-1378

IMHO There are several issues here:

  • The generated file should be readable.
  • No exception should be thrown when calling Close() on an empty row group. This is not an exceptional case; creating an empty row group is logically correct (e.g. a table with no rows, an empty array, an empty collection, etc.).
  • This has clearly regressed, which means there are not enough unit tests around this area in Arrow.
  • Bonus for unit tests around Parquet files with zero columns.

Add WriteRowGroup API call

It would be useful, instead of having to manage RowGroups via calls to StartNewRowGroup(), to be able to write an entire group at once using a method with this signature:
void WriteRowGroup(IEnumerable<T> data)
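In the meantime, something close can be sketched as an extension method over the row-oriented writer, assuming its StartNewRowGroup and WriteRow methods:

using System.Collections.Generic;
using ParquetSharp.RowOriented;

public static class ParquetRowWriterExtensions
{
    // Start a fresh row group and write every row of the supplied sequence into it.
    public static void WriteRowGroup<TTuple>(this ParquetRowWriter<TTuple> writer, IEnumerable<TTuple> data)
    {
        writer.StartNewRowGroup();
        foreach (var row in data)
        {
            writer.WriteRow(row);
        }
    }
}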

ParquetSharp can produce invalid files without throwing an exception when run without enough memory.

Context: We saw 2 reports in the last week of invalid Parquet files being written without any exception reported by the library. This is especially annoying from the perspective of jobs on a compute farm, because we interpret the process as being successful, therefore assuming the parquet file is valid, therefore causing all jobs which consume this file to fail.

We are using the AppendBufferedRowGroup API, and were restricting the memory via job objects.

I have tested this locally, with a parquet-writing process that runs child processes attached to job objects, ramping up the memory limit until it stopped throwing OutOfMemory exceptions. As expected, the first few files produced from this approach are invalid. The most obvious explanation is that an exception thrown in the Dispose method of either ColumnWriter or RowGroupWriter is swallowed.

This is consistent with my experiment, as the first N tests threw OOM exceptions (because it's not swallowed in AppendRowGroup or WriteBatch), then there was a set which produced broken files silently (when all the Writes succeeded, but the Dispose did not), then all future ones were valid parquet files.

I could definitely produce more evidence towards reproducing this bug, namely:

  • The dodgy parquet files produced
  • The repro code

Please push me for this if you think these would be helpful.

Provide a utility method to reverse map from the 'original' column name to the column index.

Naive use of something like reader.FileMetaData.Schema.ColumnIndex("name") does not work based on the names provided when originally adding columns, except for trivial instances where no Dremel shredding was needed.
As an example, for adding a List type column via something like:

writer = new ParquetFileWriter(
    "example.parquet",
    new Column[] {new Column<double[]>("values")},
    new WriterPropertiesBuilder().Build());

This will result in it being shredded into "values.list.item". The resulting name of the node directly associated with this column is then "item". Lookup via a ColumnPath of "values.list.item" would work, but I believe this is confusing to the end user of the API. They specified a column with the name "values", so they would expect that asking for the index of a column called "values" would return the right index, not fail.

Here is a work around that works for the simple list columns. I haven't verified it against anything more complex

public static int GetColumnIndexFromName(ParquetFileReader reader, string name)
{
    foreach (var i in Enumerable.Range(0, reader.FileMetaData.Schema.NumColumns))
    {
        var column = reader.FileMetaData.Schema.Column(i);
        var node = column.SchemaNode;
        while (node.Parent.FieldId != 0)
        {
            node = node.Parent;
        }
        if (node.Name == name)
        {
            return i;
        }
    }
    throw new KeyNotFoundException($"Failed to find column {name}");
}
