Introduction

ParquetSharp is a cross-platform .NET library for reading and writing Apache Parquet files.

It is implemented in C# as a PInvoke wrapper around Apache Parquet C++ to provide high performance and compatibility. Check out ParquetSharp.DataFrame if you need convenient integration with .NET DataFrames.

Supported platforms:

Chip Linux Windows macOS
x64
arm64

Quickstart

The following examples show how to write and then read a Parquet file with three columns representing a timeseries of object-value pairs. These use the low-level API, which is the recommended API and closely maps to the API of Apache Parquet C++.

Writing a Parquet File:

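using System;
using ParquetSharp;

// One entry per row; the three arrays should all have the same length.
var timestamps = new DateTime[] { /* ... */ };
var objectIds = new int[] { /* ... */ };
var values = new float[] { /* ... */ };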

var columns = new Column[]
{
    new Column<DateTime>("Timestamp"),
    new Column<int>("ObjectId"),
    new Column<float>("Value")
};

using var file = new ParquetFileWriter("float_timeseries.parquet", columns);
using var rowGroup = file.AppendRowGroup();

using (var timestampWriter = rowGroup.NextColumn().LogicalWriter<DateTime>())
{
    timestampWriter.WriteBatch(timestamps);
}
using (var objectIdWriter = rowGroup.NextColumn().LogicalWriter<int>())
{
    objectIdWriter.WriteBatch(objectIds);
}
using (var valueWriter = rowGroup.NextColumn().LogicalWriter<float>())
{
    valueWriter.WriteBatch(values);
}

file.Close();
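
The input arrays in the writing example are elided. For experimenting, they could be filled with made-up data along the following lines. This is only a sketch: the `SampleData` helper, the start date, the one-second spacing and the value formula are all assumptions chosen for illustration, and none of it is part of ParquetSharp.

```csharp
using System;

// Hypothetical helper (not part of ParquetSharp): builds one row per
// (timestamp, object) pair, matching the three-column layout used above.
static class SampleData
{
    public static (DateTime[] Timestamps, int[] ObjectIds, float[] Values) Create(
        int timestampCount, int objectCount)
    {
        var start = new DateTime(2020, 1, 1); // arbitrary start date
        var rows = timestampCount * objectCount;

        var timestamps = new DateTime[rows];
        var objectIds = new int[rows];
        var values = new float[rows];

        for (var t = 0; t < timestampCount; ++t)
        {
            for (var o = 0; o < objectCount; ++o)
            {
                var i = t * objectCount + o;
                timestamps[i] = start.AddSeconds(t); // one tick per second
                objectIds[i] = o;
                values[i] = t + o / 10f;             // arbitrary made-up value
            }
        }

        return (timestamps, objectIds, values);
    }
}
```

Calling `SampleData.Create(2, 3)` yields three arrays of six entries each, which can then be passed to the `WriteBatch` calls shown above.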

Reading the file back:

using var file = new ParquetFileReader("float_timeseries.parquet");

for (int rowGroup = 0; rowGroup < file.FileMetaData.NumRowGroups; ++rowGroup)
{
    using var rowGroupReader = file.RowGroup(rowGroup);
    var groupNumRows = checked((int) rowGroupReader.MetaData.NumRows);

    // Read each column of this row group in full into a managed array.
    var groupTimestamps = rowGroupReader.Column(0).LogicalReader<DateTime>().ReadAll(groupNumRows);
    var groupObjectIds = rowGroupReader.Column(1).LogicalReader<int>().ReadAll(groupNumRows);
    var groupValues = rowGroupReader.Column(2).LogicalReader<float>().ReadAll(groupNumRows);
}

file.Close();

Documentation

For more detailed information on how to use ParquetSharp, see the project's documentation.

Rationale

We desired a Parquet implementation with the following properties:

  • Cross platform (originally Windows and Linux - but now also macOS).
  • Callable from .NET Core.
  • Good performance.
  • Well maintained.
  • Close to official Parquet reference implementations.

Not finding an existing solution meeting these requirements, we decided to implement a .NET wrapper around apache-parquet-cpp (now part of Apache Arrow) starting at version 1.4.0. The library tries to stick closely to the existing C++ API, although it does provide higher-level APIs to facilitate its usage from .NET. The user should always be able to access the lower-level API.

Performance

The following benchmarks can be reproduced by running ParquetSharp.Benchmark.csproj. They compare ParquetSharp 2.4.0-beta1 against Parquet.NET 3.8.6, an alternative open-source .NET library that is fully managed; results are expressed relative to Parquet.NET. The Decimal tests focus purely on handling the C# decimal type, while the TimeSeries tests benchmark three columns of types int, DateTime and float. Results are from a Ryzen 5950X on Windows 10.

               Decimal (Read)   Decimal (Write)   TimeSeries (Read)   TimeSeries (Write)
Parquet.NET    1.0x             1.0x              1.0x                1.0x
ParquetSharp   4.7x Faster      3.7x Faster       2.9x Faster         8.5x Faster

Known Limitations

Because this library is a thin wrapper around the Parquet C++ library, misuse can cause native memory access violations.

Typically this can arise when attempting to access an instance whose owner has been disposed. Because some objects and properties are exposed by Parquet C++ via regular pointers (instead of consistently using std::shared_ptr), dereferencing these after the owner class instance has been destructed will lead to an invalid pointer access.
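
One way to stay clear of this is to copy whatever is needed into plain managed values while the owning objects are still alive, and to let nothing that wraps native state escape their scope. The sketch below uses only the reader calls shown in the Quickstart; the `RowCounts` helper itself is hypothetical and is meant as an illustration of the pattern rather than a recommended API.

```csharp
using System;
using ParquetSharp;

// Hypothetical helper: returns the row count of every row group as plain
// managed values, so nothing backed by native memory outlives its owner.
static class RowCounts
{
    public static long[] Read(string path)
    {
        using var file = new ParquetFileReader(path);
        var counts = new long[file.FileMetaData.NumRowGroups];

        for (var i = 0; i < counts.Length; ++i)
        {
            // The row group reader and its metadata are only touched here,
            // while both the reader and the file are still alive.
            using var rowGroupReader = file.RowGroup(i);
            counts[i] = rowGroupReader.MetaData.NumRows;
        }

        file.Close();
        return counts; // safe to use after the file has been disposed
    }
}
```

By contrast, returning `rowGroupReader.MetaData` or the reader itself from such a method would hand the caller an object whose owner has already been disposed.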

As only 64-bit runtimes are available, ParquetSharp cannot be referenced by a 32-bit project. For example, using the library from F# Interactive requires running fsiAnyCpu.exe rather than fsi.exe.
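
An application that may be launched in either mode could check the process bitness up front and fail with a readable message. The following is a minimal sketch using the standard `Environment.Is64BitProcess` property; the `RuntimeCheck` helper and its message are assumptions made for illustration and are not part of ParquetSharp.

```csharp
using System;

// Hypothetical guard (not part of ParquetSharp): call it before the first
// use of the library to get a clear error in a 32-bit process.
static class RuntimeCheck
{
    // Overload taking the flag explicitly, which keeps the check testable.
    public static void EnsureSixtyFourBit(bool is64BitProcess)
    {
        if (!is64BitProcess)
        {
            throw new PlatformNotSupportedException(
                "ParquetSharp only ships 64-bit native binaries; run this application as a 64-bit process.");
        }
    }

    // Checks the current process.
    public static void EnsureSixtyFourBit() =>
        EnsureSixtyFourBit(Environment.Is64BitProcess);
}
```

Calling `RuntimeCheck.EnsureSixtyFourBit()` at start-up is enough; it does nothing in a 64-bit process.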

In the 5.0.X versions, reading nested structures was introduced. However, information about where a null sits in the nesting is lost when reading an optional column inside an optional struct: ParquetSharp does not yet indicate whether the column itself or the enclosing struct is null.
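
The ambiguity can be illustrated without any Parquet code. The sketch below models an optional struct with one optional field using ordinary C# types; `Inner` and `NestedNulls` are hypothetical names, and the `Flatten` method only mimics the effect described above. It is not how ParquetSharp reads data.

```csharp
using System;
using System.Linq;

// Hypothetical model of an optional struct holding one optional field.
sealed class Inner
{
    public int? Value { get; set; }
}

static class NestedNulls
{
    // Collapses each row to its leaf value. A null struct and a struct
    // whose field is null both come out as null, so they look the same.
    public static int?[] Flatten(Inner[] rows) =>
        rows.Select(row => row?.Value).ToArray();
}
```

Flattening the rows `null`, `new Inner { Value = null }` and `new Inner { Value = 1 }` produces `null`, `null`, `1`: the first two rows can no longer be told apart.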

Building

Building ParquetSharp for Windows requires the following dependencies:

  • Visual Studio 2022 (17.0 or higher)
  • Apache Arrow (8.0.0)

For building Arrow (including Parquet) and its dependencies, we recommend using Microsoft's vcpkg. The build scripts will use an existing vcpkg installation if either the VCPKG_INSTALLATION_ROOT or the VCPKG_ROOT environment variable is defined; otherwise vcpkg will be downloaded into the build directory. Note that the Windows build needs to be run in a Visual Studio Developer PowerShell for the build script to succeed.

Windows (Visual Studio 2022 Win64 solution)

> build_windows.ps1
> dotnet build csharp.test --configuration=Release

Linux and macOS (Makefile)

> ./build_unix.sh
> dotnet build csharp.test --configuration=Release

We have had to write our own FindPackage macros for most of the dependencies to get us going - it clearly needs more love and attention and is likely to be redundant with some vcpkg helper tools.

Contributing

We welcome new contributors! We will happily receive PRs for bug fixes or small changes. If you're contemplating something larger, please get in touch first by opening a GitHub Issue describing the problem and how you propose to solve it.

License

Copyright 2018-2021 G-Research

Licensed under the Apache License, Version 2.0 (the "License"); you may not use these files except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
