
csv's Introduction

[DEPRECATED APRIL 2020]

This library is now deprecated. Check out a second implementation of this library here: https://github.com/p-ranav/csv2.

Reading CSV files

Simply include reader.hpp and you're good to go.

#include <csv/reader.hpp>

To start parsing CSV files, create a csv::Reader object and call .read(filename).

csv::Reader foo;
foo.read("test.csv");

This .read method is non-blocking. The reader spawns multiple threads to tokenize the file stream and build a "list of dictionaries". While the reader is busy, you can start post-processing the rows it has parsed so far using this iterator pattern:

while(foo.busy()) {
  if (foo.ready()) {
    auto row = foo.next_row();  // Each row is a csv::unordered_flat_map (github.com/martinus/robin-hood-hashing)
    auto foo = row["foo"];      // You can use it just like an std::unordered_map
    auto bar = row["bar"];
    // do something
  }
}

If instead you'd like to wait for all the rows to be processed, you can call .rows(), a convenience method that executes the above while loop:

auto rows = foo.rows();           // blocks until the CSV is fully processed
for (auto& row : rows) {          // Example: [{"foo": "1", "bar": "2"}, {"foo": "3", "bar": "4"}, ...] 
  auto foo = row["foo"];
  // do something
}

Dialects

This csv library comes with three standard dialects:

Name      | Description
excel     | The excel dialect defines the usual properties of an Excel-generated CSV file
excel_tab | The excel_tab dialect defines the usual properties of an Excel-generated TAB-delimited file
unix      | The unix dialect defines the usual properties of a CSV file generated on UNIX systems, i.e. using '\n' as the line terminator and quoting all fields
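
To use one of these, select it by name before reading. The snippet below is a minimal sketch; it assumes the reader exposes the .use_dialect(name) selector that appears in the issue reports further down, and uses a hypothetical report.tsv file:

#include <csv/reader.hpp>

int main() {
  csv::Reader csv;
  csv.use_dialect("excel_tab");   // built-in TAB-delimited, Excel-style dialect
  csv.read("report.tsv");
  for (auto& row : csv.rows()) {
    // each row is keyed by the header names of report.tsv
  }
}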

Configuring Custom Dialects

Custom dialects can be constructed with .configure_dialect(...):

csv::Reader csv;
csv.configure_dialect("my fancy dialect")
  .delimiter(", ")
  .quote_character('"')
  .double_quote(true)
  .skip_initial_space(false)
  .trim_characters(' ', '\t')
  .ignore_columns("foo", "bar")
  .header(true)
  .skip_empty_rows(true);

csv.read("foo.csv");
for (auto& row : csv.rows()) {
  // do something
}
Property | Data Type | Description
delimiter | std::string | specifies the character sequence that separates fields (aka columns). Default = ","
quote_character | char | specifies a one-character string to use as the quoting character. Default = '"'
double_quote | bool | controls the handling of quotes inside fields. If true, two consecutive quotes are interpreted as one. Default = true
skip_initial_space | bool | specifies how to interpret whitespace immediately following a delimiter; if false, whitespace immediately after a delimiter is treated as part of the following field. Default = false
trim_characters | std::vector<char> | specifies the list of characters to trim from every value in the CSV. Default = {} (nothing trimmed)
ignore_columns | std::vector<std::string> | specifies the list of columns to ignore. These columns are stripped during parsing. Default = {} (no columns ignored)
header | bool | indicates whether the file includes a header row. If true, the first row in the file is a header row, not data. Default = true
column_names | std::vector<std::string> | specifies the list of column names. This is useful when the first row of the CSV isn't a header. Default = {}
skip_empty_rows | bool | specifies how empty rows should be interpreted. If true, empty rows are skipped. Default = false

The line terminator is '\n' by default. The reader uses std::getline and strips '\r' from line endings, so for now the line terminator is not configurable in custom dialects.
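
For illustration only, the stripping amounts to something like the sketch below; this is not the library's actual implementation, just the idea:

#include <fstream>
#include <string>

// Read one line and drop a trailing '\r' left over from CRLF line endings.
bool get_logical_line(std::ifstream& stream, std::string& line) {
  if (!std::getline(stream, line)) return false;
  if (!line.empty() && line.back() == '\r')
    line.pop_back();
  return true;
}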

Multi-character Delimiters

Consider this strange, messed up log file:

[Thread ID] :: [Log Level] :: [Log Message] :: {Timestamp}
04 :: INFO :: Hello World ::             1555164718
02        :: DEBUG :: Warning! Foo has happened                :: 1555463132

To parse this file, simply configure a new dialect that splits on "::" and trims whitespace, braces, and bracket characters.

csv::Reader csv;
csv.configure_dialect("my strange dialect")
  .delimiter("::")
  .trim_characters(' ', '[', ']', '{', '}');   

csv.read("test.csv");
for (auto& row : csv.rows()) {
  auto thread_id = row["Thread ID"];    // "04"
  auto log_level = row["Log Level"];    // "INFO"
  auto message = row["Log Message"];    // "Hello World"
  // do something
}

Ignoring Columns

Consider the following CSV. Let's say you don't care about the columns age and gender. Here, you can use .ignore_columns and provide a list of columns to ignore.

name, age, gender, email, department
Mark Johnson, 50, M, [email protected], BA
John Stevenson, 35, M, [email protected], IT
Jane Barkley, 25, F, [email protected], MGT

You can configure the dialect to ignore these columns like so:

csv::Reader csv;
csv.configure_dialect("ignore meh and fez")
  .delimiter(", ")
  .ignore_columns("age", "gender");

csv.read("test.csv");
auto rows = csv.rows();
// Your rows are:
// [{"name": "Mark Johnson", "email": "[email protected]", "department": "BA"},
//  {"name": "John Stevenson", "email": "[email protected]", "department": "IT"},
//  {"name": "Jane Barkley", "email": "[email protected]", "department": "MGT"}]

No Header?

Sometimes you have CSV files with no header row:

9 52 1
52 91 0
91 135 0
135 174 0
174 218 0
218 260 0
260 301 0
301 341 0
341 383 0
...

If you want to prevent the reader from parsing the first row as a header, simply:

  • Set .header to false
  • Provide a list of column names with .column_names(...)

csv.configure_dialect("no headers")
  .header(false)
  .column_names("foo", "bar", "baz");

The CSV rows will now look like this:

[{"foo": "9", "bar": "52", "baz": "1"}, {"foo": "52", "bar": "91", "baz": "0"}, ...]

If .column_names is not called, then the reader simply generates dictionary keys like so:

[{"0": "9", "1": "52", "2": "1"}, {"0": "52", "1": "91", "2": "0"}, ...]

Dealing with Empty Rows

Sometimes you have to deal with a CSV file that has empty lines; either in the middle or at the end of the file:

a,b,c
1,2,3

4,5,6

10,11,12



Here's how this gets parsed by default:

csv::Reader csv;
csv.read("inputs/empty_lines.csv");
auto rows = csv.rows();
// [{"a": 1, "b": 2, "c": 3}, {"a": "", "b": "", "c": ""}, {"a": "4", "b": "5", "c": "6"}, {"a": "", ...}]

If you don't care for these empty rows, simply call .skip_empty_rows(true)

csv::Reader csv;
csv.configure_dialect()
  .skip_empty_rows(true);
csv.read("inputs/empty_lines.csv");
auto rows = csv.rows();
// [{"a": 1, "b": 2, "c": 3}, {"a": "4", "b": "5", "c": "6"}, {"a": "10", "b": "11", "c": "12"}]

Reading first N rows

If you know exactly how many rows to parse, you can help out the reader by using the .read(filename, num_rows) overloaded method. This saves the reader from trying to figure out the number of lines in the CSV file. You can use this method to parse the first N rows of the file instead of parsing all of it.

csv::Reader foo;
foo.read("bar.csv", 1000);
auto rows = foo.rows();

Note: Do not provide a num_rows greater than the actual number of rows in the file; the reader will loop forever.
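
If you are unsure of the row count, one defensive option is to count the newlines yourself and clamp num_rows before calling .read(filename, num_rows). A sketch (remember that the count includes the header row when one is present):

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Count the '\n' characters so num_rows never exceeds what the file holds.
std::size_t count_lines(const std::string& filename) {
  std::ifstream stream(filename);
  return std::count(std::istreambuf_iterator<char>(stream),
                    std::istreambuf_iterator<char>(), '\n');
}

You can then pass something like std::min<std::size_t>(1000, count_lines("bar.csv")) as the second argument, subtracting one for the header row if the file has one.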

Performance Benchmark

// benchmark.cpp
#include <csv/reader.hpp>
#include <string>
#include <string_view>
#include <vector>

void parse(const std::string& filename) {
  csv::Reader foo;
  foo.read(filename);
  std::vector<csv::unordered_flat_map<std::string_view, std::string>> rows;
  while (foo.busy()) {
    if (foo.ready()) {
      auto row = foo.next_row();
      rows.push_back(row);
    }
  }
}

int main() {
  parse("test.csv");   // point this at the dataset being benchmarked
}
$ g++ -pthread -std=c++17 -O3 -Iinclude/ -o test benchmark.cpp
$ time ./test

Each test is run 30 times on an Intel(R) Core(TM) i7-6650U @ 2.20 GHz CPU.
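
For reference, a run can also be timed from within the program; the sketch below wraps the parse() function above with std::chrono and averages over 30 runs (an illustration, not the exact harness used here):

#include <chrono>

// Call parse() `runs` times and return the mean wall-clock time in seconds.
double average_seconds(const std::string& filename, int runs = 30) {
  using clock = std::chrono::steady_clock;
  double total = 0.0;
  for (int i = 0; i < runs; ++i) {
    auto start = clock::now();
    parse(filename);   // parse() as defined in benchmark.cpp above
    total += std::chrono::duration<double>(clock::now() - start).count();
  }
  return total / runs;
}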

Here are the average-case execution times:

Dataset                            | File Size | Rows      | Cols | Time
Demographic Statistics By Zip Code | 27 KB     | 237       | 46   | 0.026s
Simple 3-column CSV                | 14.1 MB   | 761,817   | 3    | 0.523s
Majestic Million                   | 77.7 MB   | 1,000,000 | 12   | 1.972s
Crimes 2001 - Present              | 1.50 GB   | 6,846,406 | 22   | 32.411s

Writing CSV files

Simply include writer.hpp and you're good to go.

#include <csv/writer.hpp>

To start writing CSV files, create a csv::Writer object and provide a filename:

csv::Writer foo("test.csv");

Constructing a writer spawns a worker thread that is ready to start writing rows. Using .configure_dialect, configure the dialect to be used by the writer. This is where you can specify the column names:

foo.configure_dialect()
  .delimiter(", ")
  .column_names("a", "b", "c");

Now it's time to write rows. You can do this in multiple ways:

foo.write_row("1", "2", "3");                                     // parameter packing
foo.write_row({"4", "5", "6"});                                   // std::vector
foo.write_row(std::map<std::string, std::string>{                 // std::map
  {"a", "7"}, {"b", "8"}, {"c", "9"} });
foo.write_row(std::unordered_map<std::string, std::string>{       // std::unordered_map
  {"a", "7"}, {"b", "8"}, {"c", "9"} });
foo.write_row(csv::unordered_flat_map<std::string, std::string>{  // csv::unordered_flat_map
  {"a", "7"}, {"b", "8"}, {"c", "9"} });

You can also omit one or more values dynamically when using maps:

foo.write_row(std::map<std::string, std::string>{                 // std::map
  {"a", "7"}, {"c", "9"} });                                      // omitting "b"
foo.write_row(std::unordered_map<std::string, std::string>{       // std::unordered_map
  {"b", "8"}, {"c", "9"} });                                      // omitting "a"
foo.write_row(csv::unordered_flat_map<std::string, std::string>{  // csv::unordered_flat_map
  {"a", "7"}, {"b", "8"} });                                      // omitting "c"

Finally, once you're done writing rows, call .close() to stop the worker thread and close the file stream.

foo.close();

Here's an example writing 3 million lines of CSV to a file:

csv::Writer foo("test.csv");
foo.configure_dialect()
  .delimiter(", ")
  .column_names("a", "b", "c");

for (long i = 0; i < 3000000; i++) {
  auto x = std::to_string(i % 100);
  auto y = std::to_string((i + 1) % 100);
  auto z = std::to_string((i + 2) % 100);
  foo.write_row(x, y, z);
}
foo.close();

The above code takes about 1.8 seconds to execute on my Surface Pro 4.

Steps For Contributors

Contributions are welcome, have a look at the CONTRIBUTING.md document for more information.

git clone https://github.com/p-ranav/csv.git
cd csv
git submodule update --init --recursive
mkdir build
cd build
cmake .. -DCSV_BUILD_TESTS=ON
cmake --build . --config Debug
ctest --output-on-failure -C Debug

Steps For Users

git clone https://github.com/p-ranav/csv.git
cd csv
mkdir build
cd build
cmake ../.
sudo make install

License

The project is available under the MIT license.

csv's People

Contributors

amirmasoudabdol, interruping, jwillikers, p-ranav


csv's Issues

non-multi-threaded mode

I would prefer to default to threads off, for simplicity and for error propagation, but I think csv looks like a good project. Looking around at a few of the sources, it seems this might not be too hard to make happen?

What am I doing wrong?

Hello!
What I want:

  1. I want to create a CSV log file containing a list of exceptions, so it can easily be browsed and analysed in Calc or Excel.
  2. I want to write a unit test for saving the exception list (in my CSV file).

What I did:
I wrote saveAsCSV() and loadFromCSV() functions:

void Exceptions::saveAsCSV(std::wstring aPathToLogFile)
{
    csv::Writer lLogCSV(gToNarrow(aPathToLogFile));
    lLogCSV.use_dialect("excel");
    lLogCSV.configure_dialect()
        .delimiter(", ")
        .column_names("Type", "Code", "Message", "Component",
            "File", "Line", "Function");

    for(std::list<ExceptionPtr>::iterator i(mExceptions.begin())
         ; i != mExceptions.end(); ++i)
    {
        lLogCSV.write_row(std::to_string((*i)->type())
            , std::to_string((*i)->code())
            , gToNarrow((*i)->message())
            , gToNarrow((*i)->component())
            , gToNarrow((*i)->file())
            , std::to_string((*i)->line())
            , gToNarrow((*i)->function())
            );
    }
    lLogCSV.close();
}
std::list<ExceptionPtr> Exceptions::loadFromCSV(std::wstring aPathToLogFile)
{
    dNotifyInfo(fr(tr("Try to load CSV file: {0}"), aPathToLogFile), eFileSystem);
    std::list<ExceptionPtr> lResult;

    csv::Reader lCSVFile;
    lCSVFile.use_dialect("excel");
    std::string lPathToFileName(gToNarrow(aPathToLogFile));
    lCSVFile.read(lPathToFileName);

    while(lCSVFile.busy())
    {
        if(lCSVFile.ready())
        {
            auto lRow = lCSVFile.next_row();
            std::cout << lRow["Type"]
                    << ", " << lRow["Code"]
                    << ", " << lRow["Message"]
                    << ", " << lRow["Component"]
                    << ", " << lRow["File"]
                    << ", " << lRow["Line"]
                    << ", " << lRow["Funciton"]
                    << std::endl << std::flush;

            ExceptionPtr lEx(new Exception(static_cast<Exception::Type>(
                std::stoi(lRow["Type"]))
                , std::stoull(lRow["Code"])
                , gToWide(lRow["Message"])
                , gToWide(lRow["Component"])
                , gToWide(lRow["File"])
                , std::stoi(lRow["Line"])
                , gToWide(lRow["Funciton"])
                ));
            lResult.push_back(lEx);
        }
        else
            std::this_thread::sleep_for(100ms);
    }

    return lResult;
}

The saveAsCSV() function works as expected and generates this file:

Type, Code, Message, Component, File, Line, Function
0, 1, My exception 1, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 54, virtual void ExceptionsTest::run()
0, 1, My exception 2, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 60, virtual void ExceptionsTest::run()
0, 1, My exception 3, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 66, virtual void ExceptionsTest::run()
1, 2, My warning 1, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 70, virtual void ExceptionsTest::run()
1, 2, My warning 2, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 71, virtual void ExceptionsTest::run()
1, 2, My warning 3, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 72, virtual void ExceptionsTest::run()
2, 3, My info 1, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 74, virtual void ExceptionsTest::run()
2, 3, My info 2, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 75, virtual void ExceptionsTest::run()
2, 3, My info 3, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 76, virtual void ExceptionsTest::run()

But I am unable to load anything from the CSV file. The output is:

START of sub-test: Test saving exceptions to CSV.
Information! Code: 2, Message:
Try to load CSV file: /var/tmp/ExceptionTestsExceptionsTest.csv
0, , , , , , 
An exception occurred! Message:
stoull
Press <RETURN> to close this window...

Note: In the loadFromCSV() function I strictly followed the introductory example from your README.md, so I have no idea how to program it correctly.
Please help me.

P.S. If this is not a suitable place for such questions, please advise me where I should ask.

Thank you and best regards.

Issue with the CSV header

I'm trying to use the csv::Writer as a backend for my Persistent Manager class but for some reason I cannot get it to write the header. Basically, I have something like this:

class PM {
	class Writer;
}

class Writer{

	using namespace std;

	string filename;
	unique_ptr<csv::Writer> writer;

	Writer(string filename) filename(filename) {
		writer = make_unique<csv::Writer>(filename);
		writer->configure_dialect().delimiter(cols);
	}

	~Write(){
		writer->close();
	}

	void write(map<string, string> row) {
               // I override the method to accept vector<string> as well.
		writer->configure_dialect().column_names(cols);

		writer->write_row(row);
	}
}

Do you see any particular issue with this? I'm honestly not sure what's wrong; everything looks fine to me.

Not working correctly on a file with too many lines

Environment

  • Ubuntu 18.04
  • CLion 2019.1.4

Issue

I tested the program on a CSV file, 20k_rows_data.csv.txt, with 20K lines, and the program does not work correctly. (I renamed the file with a .txt extension because GitHub issues do not support uploading .csv files.)

#include <csv/reader.hpp>
#include <iostream>
#include <string>

int main() {
  csv::Reader csv;
  csv.read("../tests/inputs/20k_rows_data.csv.txt");
  auto rows = csv.rows();
  auto cols = csv.cols();
  int row_count = 0;
  for (auto row : rows) {
    std::string s = std::to_string(++row_count);
    for (auto col : cols) {
      s += ' ' + (std::string)(row[col]);
    }
    std::cout << s << std::endl;
  }
}

Part of the output looks like this (copied from my console):

5332     
5333     
5334 1 1 1 1 1
5335     
5336     
5337 1 1 1 1 1
5338     
5339     
5340     
5341 1 1 1 1 1
5342     
5343     

Note that the outputs are not the same each time I run it.

Overload write for void write_row(unordered_flat_map<std::string_view, std::string> row_map)

Is it possible to make it simpler to read data and then write that same data?

The read function reads in rows as unordered_flat_map<std::string_view, std::string>, but the write function expects unordered_flat_map<std::string, std::string>. This appears to require manual conversion between the types, making something like converting a CSV file to a TSV file more difficult than I had hoped.
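
Until such an overload exists, the manual conversion mentioned above looks roughly like this: copy each std::string_view-keyed row into one of the string-keyed map types the writer already accepts (std::map shown here). A sketch, assuming a reader and writer configured as in the README:

#include <csv/reader.hpp>
#include <csv/writer.hpp>
#include <map>
#include <string>

// Copy every row from the reader to the writer, converting the key type on the way.
void copy_rows(csv::Reader& reader, csv::Writer& writer) {
  while (reader.busy()) {
    if (reader.ready()) {
      auto row = reader.next_row();            // keys are std::string_view
      std::map<std::string, std::string> out;  // the writer accepts string-keyed maps
      for (auto& kv : row)
        out[std::string(kv.first)] = kv.second;
      writer.write_row(out);
    }
  }
}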

Support reading directly from std::istream

I have a use case where I want to parse the first part of a CSV file with my own parser and the rest of it with your parser. Currently, I will have to re-parse most of the file with your parser because it only takes a filename as an argument. I have a std::istream exactly at the position where I want to pick up parsing with your parser. Would it be possible to support reading from a std::istream?

Option to append to file when writing

This is same use case as #16 except for writing to a file. An option to have the writer append to an existing file would be very helpful. Even better might be adding a csv::Writer constructor which takes a std::ostream.

Implement Dialect.doublequote Dialect.escapechar

Dialect.doublequote
Controls how instances of quotechar appearing inside a field should themselves be quoted. When True, the character is doubled. When False, the escapechar is used as a prefix to the quotechar. It defaults to True.

On output, if doublequote is False and no escapechar is set, Error is raised if a quotechar is found in a field.

Dialect.escapechar
A one-character string used by the writer to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False. On reading, the escapechar removes any special meaning from the following character. It defaults to None, which disables escaping.
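
For clarity, the two strategies escape a field containing the quote character differently. The standalone sketch below shows the difference using plain string manipulation; it is not part of this library's API:

#include <iostream>
#include <string>

// doublequote = true: embedded quote characters are doubled.
std::string quote_doubled(const std::string& field, char quote = '"') {
  std::string out(1, quote);
  for (char c : field) {
    if (c == quote) out += quote;    // " becomes ""
    out += c;
  }
  out += quote;
  return out;
}

// doublequote = false: embedded quotes are prefixed with the escape character.
std::string quote_escaped(const std::string& field, char quote = '"', char escape = '\\') {
  std::string out(1, quote);
  for (char c : field) {
    if (c == quote) out += escape;   // " becomes \"
    out += c;
  }
  out += quote;
  return out;
}

int main() {
  std::cout << quote_doubled("say \"hi\"") << '\n';   // prints: "say ""hi"""
  std::cout << quote_escaped("say \"hi\"") << '\n';   // prints: "say \"hi\""
}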
