Comments (5)

Eugene-Mark commented on July 18, 2024

@sharpe5 Hi sharpe5, could you provide some sample Parquet files compressed with LZ4 or another compression codec? I need them for testing.

from bigdata-file-viewer.

Eugene-Mark commented on July 18, 2024

Good point. I've marked your comment as an enhancement. Thanks for your contribution.

sharpe5 commented on July 18, 2024

Here you go:

type=blockStream,rowCount=1000,compression=LZ4.zip

GitHub accepts .zip attachments, so the .parquet file is zipped; unzip it to get the .parquet file. There should be 6 columns of random doubles, a few thousand rows.

Anything else, let me know!

sharpe5 commented on July 18, 2024

C++ code to create said file (demo only; some helper functions are not shown). The Arrow Parquet library was installed using vcpkg. Compiles with MSVC and gcc.

void demo3()
{
	using namespace std;
	using namespace fmt;
	using namespace System::Diagnostics;

	print("Demo 3: Open a file, flush blocks of rows to it until done:\n");
	
	{
		print("  - Test:\n");
		double r1 { drand() };
		print("    - r1={}\n", r1);
	}

	//const int maxRows = 1'000'000;
	const int maxRows = 500;
	vector<tuple<double, double, double, double, double, double>> rows;	
	{	
		rows.reserve(maxRows);

		print("  - Creating raw data:\n");
		Stopwatch sw = Stopwatch::StartNew();
		for (int i=0;i<maxRows;i++)
		{
			rows.push_back({drand(), drand(), drand(), drand(), drand(), drand()});
		}
		sw.Stop();
		print("    - rows.size(): {}\n", rows.size());
		print("    - Done: {} milliseconds\n", sw.Elapsed().TotalMilliseconds());
	}
	
	shared_ptr<arrow::Table> arrowTable;
	{
		const vector<string> names ={"col1", "col2", "col3", "col4", "col5", "col6"};		
		print("  - Creating Parquet table:\n");
		Stopwatch sw = Stopwatch::StartNew();
		if (!arrow::stl::TableFromTupleRange(arrow::default_memory_pool(), rows, names, &arrowTable).ok()) 
		{
			// Error handling code should go here.
			print("    - Error when creating table.\n");
			return;
		}
		sw.Stop();
		print("    - Done: {} milliseconds\n", sw.Elapsed().TotalMilliseconds());
	}

	string filepath;
	{
		std::shared_ptr<arrow::io::FileOutputStream> outfile;
		const string filename=format("type=blockStream,rowCount={},compression=LZ4.parquet",maxRows * 2); // As we are writing two chunks (see below).

		print("  - Write Parquet table:\n");
		Stopwatch sw = Stopwatch::StartNew();
		// Open the output file once, truncating rather than appending.
		PARQUET_ASSIGN_OR_THROW(outfile, arrow::io::FileOutputStream::Open(filename, /*append=*/false));

		parquet::WriterProperties::Builder propertiesBuilder;
		propertiesBuilder.compression(parquet::Compression::LZ4);
		const auto properties = propertiesBuilder.build();

		// https://stackoverflow.com/questions/45572962/how-can-i-write-streaming-row-oriented-data-using-parquet-cpp-without-buffering
		std::unique_ptr<parquet::arrow::FileWriter> writer;
		PARQUET_THROW_NOT_OK(parquet::arrow::FileWriter::Open(*(arrowTable->schema()), ::arrow::default_memory_pool(), outfile, properties, parquet::default_arrow_writer_properties(), &writer));

		const int chunkSize = static_cast<int>(rows.size());
		// Demonstrates writing data in blocks: the same table is written twice.
		PARQUET_THROW_NOT_OK(writer->WriteTable(*arrowTable, chunkSize));
		PARQUET_THROW_NOT_OK(writer->WriteTable(*arrowTable, chunkSize));
		PARQUET_THROW_NOT_OK(writer->Close());
		sw.Stop();

		print("    - Compression: LZ4\n");
		print("    - Block size: {}\n", chunkSize);
		print("    - Done: {} milliseconds\n", sw.Elapsed().TotalMilliseconds());
		const string dir = System::IO::Directory::GetCurrentDirectoryAlt();
		filepath = Path::Combine(dir, filename);
	}

	{
		print("  - Output file: {}\n", filepath);
	}
}
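
The demo above calls a few helpers that are not shown (drand(), Stopwatch). What follows is only a guess at what they might look like, so the snippet can be compiled by others: it assumes drand() returns a uniform random double in [0, 1) and that Stopwatch mimics the .NET class of the same name. The TimeSpan type is a hypothetical name introduced here, and none of this is the original implementation. Path::Combine and Directory::GetCurrentDirectoryAlt are not covered. Requires C++17 for the nested namespace.

```cpp
#include <chrono>
#include <random>

// Hypothetical stand-in for the undefined drand():
// assumed to return a uniform random double in [0, 1).
inline double drand()
{
	static std::mt19937_64 engine{std::random_device{}()};
	static std::uniform_real_distribution<double> dist{0.0, 1.0};
	return dist(engine);
}

namespace System::Diagnostics
{
	// Hypothetical result type so that sw.Elapsed().TotalMilliseconds() compiles.
	class TimeSpan
	{
	public:
		explicit TimeSpan(std::chrono::steady_clock::duration d) : d_(d) {}

		double TotalMilliseconds() const
		{
			return std::chrono::duration<double, std::milli>(d_).count();
		}

	private:
		std::chrono::steady_clock::duration d_;
	};

	// Minimal stopwatch loosely modelled on the .NET Stopwatch (an assumption).
	class Stopwatch
	{
	public:
		// The clock starts when the object is constructed.
		static Stopwatch StartNew() { return Stopwatch{}; }

		// Freezes the elapsed time; calling Stop() again has no effect.
		void Stop()
		{
			if (running_)
			{
				end_ = std::chrono::steady_clock::now();
				running_ = false;
			}
		}

		// Time since start, up to Stop() if it was called, otherwise up to now.
		TimeSpan Elapsed() const
		{
			const auto end = running_ ? std::chrono::steady_clock::now() : end_;
			return TimeSpan{end - start_};
		}

	private:
		std::chrono::steady_clock::time_point start_{std::chrono::steady_clock::now()};
		std::chrono::steady_clock::time_point end_{};
		bool running_{true};
	};
}
```

With these in scope, the `Stopwatch sw = Stopwatch::StartNew();` and `sw.Elapsed().TotalMilliseconds()` calls in the demo should compile unchanged. steady_clock is used rather than system_clock so the timing is not affected by wall-clock adjustments.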

Eugene-Mark commented on July 18, 2024

Closing this issue since it has been open for years; will reopen if the feature makes it onto the roadmap.
