
arrow-js-ffi's Introduction

Hi there! I'm Kyle 👋

I'm a software engineer passionate about fast geospatial data science.

I'm primarily developing the GeoArrow and GeoParquet ecosystems because I believe they portend a massive shift for faster, more interoperable geospatial data analysis.

Python:

| Project | Role | Description |
| --- | --- | --- |
| geopolars | author | Geospatial extensions for the Polars DataFrame library. |
| lonboard | author | Python library for fast, interactive geospatial vector data visualization in Jupyter. |
| geoarrow-rust | author | A Python library implementing the GeoArrow specification with efficient spatial operations. |
| suncalc-py | author | A Python port of suncalc.js for calculating sun position and sunlight phases. |
| pymartini | author | A Cython port of Martini for fast RTIN terrain mesh generation. |
| pydelatin | author | Python bindings to hmm for fast terrain mesh generation. |
| quantized-mesh-encoder | author | A fast Python Quantized Mesh encoder. |
| usgs-topo-tiler | author | Python package to read Web Mercator map tiles from USGS Historical Topographic Maps. |
| keplergl_cli | author | One-line geospatial data visualization using Kepler.gl. |

JavaScript:

| Project | Role | Description |
| --- | --- | --- |
| parquet-wasm | author | Rust-based WebAssembly bindings to read and write Apache Parquet data. |
| @geoarrow/deck.gl-layers | author | deck.gl layers for rendering GeoArrow data. |
| geoarrow-wasm | author | Efficient, vectorized geospatial operations in WebAssembly. |
| arrow-js-ffi | author | Zero-copy reading of Arrow data from WebAssembly. |
| geoarrow-js | author | TypeScript implementation of GeoArrow. |
| deck.gl | contributor | WebGL2 powered visualization framework. |
| deck.gl-raster | author | deck.gl layers and WebGL modules for client-side satellite imagery analysis. |

Rust:

| Project | Role | Description |
| --- | --- | --- |
| geoarrow-rs | author | A Rust implementation of the GeoArrow specification and bindings to GeoRust algorithms for efficient spatial operations on GeoArrow memory. |
| geopolars | author | Geospatial extensions for the Polars DataFrame library. |
| geo-index | author | A Rust crate for packed, static, zero-copy spatial indexes. |
| arrow-wasm | author | Building block library for using Apache Arrow in Rust WebAssembly modules. |

Specifications:

| Project | Role | Description |
| --- | --- | --- |
| geoarrow | core contributor | Specification for storing geospatial data in Apache Arrow. |
| geoparquet | core contributor | Specification for storing geospatial vector data (point, line, polygon) in Parquet. |

Other:

| Project | Role | Description |
| --- | --- | --- |
| National Scenic Trails Guide | author | A website and data tools for exploring and navigating the Pacific Crest Trail. After hiking the PCT, this project was the core of my effort to transition to a career in geospatial software engineering. |
| all-transit | author | Visualization of all transit routes in continental U.S. |
| vscode-jupyter-python | author | Run automatically-inferred Python code blocks in the VS Code Jupyter extension. |

arrow-js-ffi's People

Contributors

dependabot[bot], domoritz, kylebarron


arrow-js-ffi's Issues

Maximum call stack size exceeded after passing data to `new Table` after it is returned from `parseRecordBatch`

Hi,
I've been trying to set up a small example that converts data from Parquet to JSON.
So far everything works except the step after parseRecordBatch: once I generate the record batches and pass them to the apache-arrow Table constructor, it fails with "Maximum call stack size exceeded" (screenshot attached).
image

I am not sure whether I am doing something wrong when getting wasmArrowTable from parquet.toFFI, or whether the data is simply too large (I'm using your geolocation data from the parquet-wasm repo).
This is a combination of multiple examples that I could find and that made some sense, so I am sorry if I am butchering the code.
I am also pretty new to Parquet and Arrow.

Here is all the demo code needed to reproduce it. It uses Bun, but the generateData function should be easy to convert to whatever you need to reproduce it.

import { dir } from '@stricjs/utils';
import { Router } from '@stricjs/router';
import { Table, tableFromIPC } from "apache-arrow";
import { readParquet, _memory } from "parquet-wasm/node/arrow2";
import { parseRecordBatch } from "arrow-js-ffi";

const generateData = async () => {
    const path = "./public/geo.parquet"
    const file = await Bun.file(path);
    const buff = await file.arrayBuffer()
    const bytes = new Uint8Array(buff);
    console.log(bytes.length);
    const prq = readParquet(bytes)
    const wasmArrowTable = prq.toFFI();
    console.log("numBatches", wasmArrowTable.numBatches())

    const recordBatches = [];
    for (let i = 0; i < wasmArrowTable.numBatches(); i++) {
        const recordBatch = parseRecordBatch(
            _memory().buffer,
            wasmArrowTable.arrayAddr(i),
            wasmArrowTable.schemaAddr(),
            true
        );
        recordBatches.push(recordBatch);
    }

    console.log(recordBatches[0])
    const table = new Table(recordBatches[0]);
    wasmArrowTable.drop();
}

export default new Router()
  .get('/', () => new Response('Hi'))
  .get('/test', async () => {
    await generateData()
    new Response("DONE")
  })
  .get('/make', async () => {
    await makeTable()
    new Response("DONE")
  })
  .get('/*', dir('./public'));

Explore "nanoarrow-js"

Arrow JS is a big library! It's not really a tenable dependency for a very bundle-size-conscious library or application.

This is actually the same story as in C/C++/Python. The C++ Arrow library got so big that many projects didn't want to depend on it. That's why nanoarrow was created: a super minimal library that works with the C Data Interface representation of Arrow arrays.

I think there's definitely potential for a low level Arrow library in JS, that hews very closely to the C Data Interface.

Data structures would be essentially the JS counterpart of C Data Interface structs. All array data (no matter the logical type) would be a Uint8Array, that could later be viewed as another type or as strings.

Because array data are all Uint8Arrays, it means an array could either be "owned" in JS memory or "viewed" from wasm memory. So the memory safety wouldn't be great, but this is JS after all!
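
To make the idea above a bit more concrete, here is a minimal sketch of what such a structure could look like. Everything in it is an assumption for illustration: `NanoArray` and `viewInt32` are hypothetical names, not an existing API, and a plain `ArrayBuffer` stands in for wasm memory. It only shows that the same `Uint8Array`-based shape can hold either owned bytes or a view, and that a view follows later writes to the underlying memory.

```typescript
// Hypothetical minimal array, loosely mirroring the C Data Interface's ArrowArray.
// The names here are illustrative assumptions, not an existing library API.
interface NanoArray {
  format: string; // C Data Interface format string, e.g. "i" for int32
  length: number;
  nullCount: number;
  buffers: (Uint8Array | null)[]; // raw bytes; buffers[0] is the validity bitmap
  children: NanoArray[];
}

// Reinterpret raw bytes as int32 values without copying.
function viewInt32(bytes: Uint8Array, length: number): Int32Array {
  // Int32Array views require a 4-byte-aligned byteOffset.
  if (bytes.byteOffset % 4 !== 0) {
    throw new Error("buffer is not 4-byte aligned");
  }
  return new Int32Array(bytes.buffer, bytes.byteOffset, length);
}

// "Owned": the bytes live in their own JS ArrayBuffer.
const ownedBytes = new Uint8Array(new Int32Array([1, 2, 3]).buffer);

// "Viewed": the bytes are a window into a larger memory.
// A plain ArrayBuffer stands in for WebAssembly.Memory's buffer here.
const fakeWasmMemory = new ArrayBuffer(64);
new Int32Array(fakeWasmMemory, 16, 3).set([10, 20, 30]);
const viewedBytes = new Uint8Array(fakeWasmMemory, 16, 12);

const ownedArray: NanoArray = {
  format: "i", length: 3, nullCount: 0, buffers: [null, ownedBytes], children: [],
};
const viewedArray: NanoArray = {
  format: "i", length: 3, nullCount: 0, buffers: [null, viewedBytes], children: [],
};

const ownedValues = viewInt32(ownedArray.buffers[1]!, ownedArray.length);
const viewedValues = viewInt32(viewedArray.buffers[1]!, viewedArray.length);

// The safety caveat: a view observes later writes to the underlying memory.
new Int32Array(fakeWasmMemory, 16, 1)[0] = 99;
console.log(Array.from(ownedValues), Array.from(viewedValues));
```

Plain objects and free functions are used here rather than classes, in keeping with the functional-API preference for bundle size.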

It would make sense to have toArrowJS and fromArrowJS functions that convert to and from Arrow JS arrays/Data instances.

An emphasis should be placed on a functional API instead of a class API to keep bundle size low.

Ideally, this would allow high-performance programs to rely on Arrow memory without fear of a huge bundle size impact! But this would be complementary to, not competitive with, Arrow JS.

Add `parseSchema` method

This should parallel parseRecordBatch: it should take in an ArrowSchema struct, assert that it is of a struct type, then unpack the internal fields to return an Arrow JS Schema.
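
As a rough sketch of the shape this could take, the code below reads an ArrowSchema struct out of a byte buffer, asserts that its format string is `+s` (struct), and unpacks the children. It makes several assumptions: the byte offsets assume a wasm32 layout (4-byte pointers, 8-byte aligned int64), `parseSchemaSketch` and the helper names are hypothetical, a hand-built `ArrayBuffer` stands in for wasm memory, and it returns plain `{ name, format }` objects rather than an Arrow JS Schema so that it stays dependency-free.

```typescript
// Byte offsets inside ArrowSchema, assuming wasm32: 4-byte pointers, int64 aligned to 8.
const FORMAT_OFFSET = 0;
const NAME_OFFSET = 4;
const N_CHILDREN_OFFSET = 24;
const CHILDREN_OFFSET = 32;

interface FieldSketch { name: string; format: string; }

// Read a NUL-terminated UTF-8 string starting at ptr.
function readCString(buffer: ArrayBuffer, ptr: number): string {
  if (ptr === 0) return ""; // treat address 0 as a NULL pointer
  const bytes = new Uint8Array(buffer);
  let end = ptr;
  while (end < bytes.length && bytes[end] !== 0) end++;
  return new TextDecoder().decode(bytes.subarray(ptr, end));
}

function parseSchemaSketch(buffer: ArrayBuffer, ptr: number): FieldSketch[] {
  const view = new DataView(buffer);
  const format = readCString(buffer, view.getUint32(ptr + FORMAT_OFFSET, true));
  if (format !== "+s") throw new Error(`expected struct format "+s", got "${format}"`);
  // n_children is an int64; this sketch reads only its low 32 bits.
  const nChildren = view.getUint32(ptr + N_CHILDREN_OFFSET, true);
  const childrenPtr = view.getUint32(ptr + CHILDREN_OFFSET, true);
  const fields: FieldSketch[] = [];
  for (let i = 0; i < nChildren; i++) {
    const childPtr = view.getUint32(childrenPtr + i * 4, true);
    fields.push({
      name: readCString(buffer, view.getUint32(childPtr + NAME_OFFSET, true)),
      format: readCString(buffer, view.getUint32(childPtr + FORMAT_OFFSET, true)),
    });
  }
  return fields;
}

// Demo: hand-build a struct schema with two children in zero-filled memory.
const schemaMemory = new ArrayBuffer(256);
const schemaView = new DataView(schemaMemory);
function putCString(ptr: number, text: string): number {
  // The zero-filled memory already supplies the NUL terminator.
  new Uint8Array(schemaMemory).set(new TextEncoder().encode(text), ptr);
  return ptr;
}
function putSchema(ptr: number, formatPtr: number, namePtr: number,
                   nChildren: number, childrenPtr: number): number {
  schemaView.setUint32(ptr + FORMAT_OFFSET, formatPtr, true);
  schemaView.setUint32(ptr + NAME_OFFSET, namePtr, true);
  schemaView.setUint32(ptr + N_CHILDREN_OFFSET, nChildren, true);
  schemaView.setUint32(ptr + CHILDREN_OFFSET, childrenPtr, true);
  return ptr;
}
const idFieldPtr = putSchema(64, putCString(16, "i"), putCString(12, "id"), 0, 0);
const nameFieldPtr = putSchema(112, putCString(28, "u"), putCString(20, "name"), 0, 0);
schemaView.setUint32(160, idFieldPtr, true); // children[0]
schemaView.setUint32(164, nameFieldPtr, true); // children[1]
const rootPtr = putSchema(176, putCString(8, "+s"), 0, 2, 160);

const parsedFields = parseSchemaSketch(schemaMemory, rootPtr);
console.log(parsedFields);
```

A real implementation would also read flags and metadata, recurse into nested children, and map each format string to an Arrow JS data type.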

Improved tree shaking?

I previously found that using `import * as arrow` meant that esbuild couldn't tree-shake at all (see geoarrow/geoarrow-js#20).

So maybe this lib should use path imports? It necessarily has to import every type, because it doesn't know what data type the C struct will be.

Offsets with nested buffers

I originally tried to parse an array of strings, and the last item in the list was always null. Does the offsets array always need to be length + 1 in arrow?
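
For context, Arrow's variable-size layouts (strings, binary, lists) store `length + 1` offsets, because element `i` spans `offsets[i]` to `offsets[i + 1]`. The sketch below illustrates this for a string array; `decodeUtf8Array` is a hypothetical helper written for this example, and it ignores the validity bitmap for brevity.

```typescript
// Decode an Arrow-style UTF-8 array from its offsets and values buffers.
// Element i occupies values[offsets[i] .. offsets[i + 1]), so `length`
// elements need `length + 1` offsets. Nulls are ignored in this sketch.
function decodeUtf8Array(offsets: Int32Array, values: Uint8Array, length: number): string[] {
  if (offsets.length < length + 1) {
    throw new Error(`need ${length + 1} offsets for ${length} elements, got ${offsets.length}`);
  }
  const decoder = new TextDecoder();
  const out: string[] = [];
  for (let i = 0; i < length; i++) {
    out.push(decoder.decode(values.subarray(offsets[i], offsets[i + 1])));
  }
  return out;
}

const stringValues = new TextEncoder().encode("foobarbaz");
const stringOffsets = new Int32Array([0, 3, 6, 9]); // 3 elements, 4 offsets
const decodedStrings = decodeUtf8Array(stringOffsets, stringValues, 3);
console.log(decodedStrings);
```

If only `length` offsets are read, the end of the last element is unknown, which would be consistent with the last item coming back wrong.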

running wasm build fails

Hey Kyle

I am attempting to build the rust-wasm portion with wasm-pack as indicated:

➜  rust-wasm git:(main) wasm-pack build --target web
[INFO]: 🎯  Checking for the Wasm target...
[INFO]: 🌀  Compiling to Wasm...
   Compiling rust-wasm v0.1.0 (/Users/anirudh.vyas/rust/Polars-wasm-mwe/rust-wasm)
error[E0609]: no field `0` on type `Result<Row<'_>, PolarsError>`
  --> src/lib.rs:42:22
   |
42 |         for j in row.0.iter() {
   |                      ^

For more information about this error, try `rustc --explain E0609`.
error: could not compile `rust-wasm` due to previous error
Error: Compiling your crate to WebAssembly failed
Caused by: failed to execute `cargo build`: exited with exit status: 101
  full command: "cargo" "build" "--lib" "--release" "--target" "wasm32-unknown-unknown"

Do you know what I am doing wrong?

Writing to FFI structs

I think it makes sense to explore writing to Wasm memory.

  • writeField(field: arrow.Field, malloc: (length: number) => number): number would take a field and a malloc function. It would write the field manually into wasm memory using malloc, returning at the end a pointer to the written struct.
  • This requires a memory copy, but that is still better than going through IPC, which requires a single memory buffer.
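
A rough sketch of the writing side follows, under stated assumptions. It uses a hypothetical `SimpleField` type instead of `arrow.Field` to stay dependency-free, a bump allocator over a plain `ArrayBuffer` in place of a real wasm `malloc`, and the wasm32 ArrowSchema layout (48 bytes, format pointer at offset 0, name pointer at offset 4). It leaves `release` and `private_data` null, which a real producer could not do, since a null `release` marks the struct as already released.

```typescript
type Malloc = (length: number) => number;

// Hypothetical stand-in for arrow.Field, for illustration only.
interface SimpleField { name: string; format: string; }

const ARROW_SCHEMA_SIZE = 48; // sizeof(ArrowSchema), assuming wasm32

// Copy a string into memory as NUL-terminated UTF-8 and return its address.
function writeCString(buffer: ArrayBuffer, malloc: Malloc, text: string): number {
  const bytes = new TextEncoder().encode(text);
  const ptr = malloc(bytes.length + 1);
  const target = new Uint8Array(buffer, ptr, bytes.length + 1);
  target.set(bytes);
  target[bytes.length] = 0;
  return ptr;
}

// Write a childless field as an ArrowSchema struct and return its address.
function writeFieldSketch(buffer: ArrayBuffer, field: SimpleField, malloc: Malloc): number {
  const formatPtr = writeCString(buffer, malloc, field.format);
  const namePtr = writeCString(buffer, malloc, field.name);
  const ptr = malloc(ARROW_SCHEMA_SIZE);
  const view = new DataView(buffer, ptr, ARROW_SCHEMA_SIZE);
  // Zero the struct: metadata, flags, children, dictionary, release stay null/0.
  for (let i = 0; i < ARROW_SCHEMA_SIZE; i += 4) view.setUint32(i, 0, true);
  view.setUint32(0, formatPtr, true); // format
  view.setUint32(4, namePtr, true); // name
  return ptr;
}

// Demo allocator: a bump allocator returning 8-byte-aligned addresses.
const fieldMemory = new ArrayBuffer(256);
let nextFree = 8; // keep address 0 unused so it can mean NULL
const bumpMalloc: Malloc = (length) => {
  const ptr = nextFree;
  nextFree += Math.ceil(length / 8) * 8;
  if (nextFree > fieldMemory.byteLength) throw new Error("out of memory");
  return ptr;
};

const fieldPtr = writeFieldSketch(fieldMemory, { name: "id", format: "i" }, bumpMalloc);
console.log("wrote ArrowSchema at", fieldPtr);
```

With real wasm memory, a call to `malloc` can grow the memory and detach the old `ArrayBuffer`, so the buffer would need to be re-fetched after each allocation rather than passed in once as it is here.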

A good reference here, especially in relation to Rust, is Radu Matei's great blog post: https://radu-matei.com/blog/practical-guide-to-wasm-memory/#passing-arrays-to-rust-webassembly-modules
