
csv's Introduction

[DEPRECATED APRIL 2020]

This library is now deprecated. Check out a second implementation of this library here: https://github.com/p-ranav/csv2.

Reading CSV files

Simply include reader.hpp and you're good to go.

#include <csv/reader.hpp>

To start parsing CSV files, create a csv::Reader object and call .read(filename).

csv::Reader foo;
foo.read("test.csv");

This .read method is non-blocking. The reader spawns multiple threads to tokenize the file stream and build a "list of dictionaries". While the reader is busy, you can start post-processing the rows it has parsed so far using this iterator pattern:

while(foo.busy()) {
  if (foo.ready()) {
    auto row = foo.next_row();  // Each row is a csv::unordered_flat_map (github.com/martinus/robin-hood-hashing)
    auto foo = row["foo"];      // You can use it just like an std::unordered_map
    auto bar = row["bar"];
    // do something
  }
}

If instead you'd like to wait for all the rows to be processed, you can call .rows(), a convenience method that executes the above while loop:

auto rows = foo.rows();           // blocks until the CSV is fully processed
for (auto& row : rows) {          // Example: [{"foo": "1", "bar": "2"}, {"foo": "3", "bar": "4"}, ...] 
  auto foo = row["foo"];
  // do something
}

Dialects

This csv library comes with three standard dialects:

Name      | Description
excel     | The excel dialect defines the usual properties of an Excel-generated CSV file
excel_tab | The excel_tab dialect defines the usual properties of an Excel-generated TAB-delimited file
unix      | The unix dialect defines the usual properties of a CSV file generated on UNIX systems, i.e. using '\n' as the line terminator and quoting all fields
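
To use one of these, select it by name before reading. The snippet below is a minimal sketch; it assumes the reader exposes the .use_dialect(name) selector that appears in the issue reports further down, and uses a hypothetical report.tsv file:

#include <csv/reader.hpp>

int main() {
  csv::Reader csv;
  csv.use_dialect("excel_tab");   // built-in TAB-delimited, Excel-style dialect
  csv.read("report.tsv");
  for (auto& row : csv.rows()) {
    // each row is keyed by the header names of report.tsv
  }
}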

Configuring Custom Dialects

Custom dialects can be constructed with .configure_dialect(...):

csv::Reader csv;
csv.configure_dialect("my fancy dialect")
  .delimiter(", ")
  .quote_character('"')
  .double_quote(true)
  .skip_initial_space(false)
  .trim_characters(' ', '\t')
  .ignore_columns("foo", "bar")
  .header(true)
  .skip_empty_rows(true);

csv.read("foo.csv");
for (auto& row : csv.rows()) {
  // do something
}
Property | Data Type | Description
delimiter | std::string | specifies the character sequence that separates fields (aka columns). Default = ","
quote_character | char | specifies a one-character string to use as the quoting character. Default = '"'
double_quote | bool | controls the handling of quotes inside fields. If true, two consecutive quotes are interpreted as one. Default = true
skip_initial_space | bool | specifies how to interpret whitespace immediately following a delimiter; if false, whitespace immediately after a delimiter is treated as part of the following field. Default = false
trim_characters | std::vector<char> | specifies the list of characters to trim from every value in the CSV. Default = {} (nothing trimmed)
ignore_columns | std::vector<std::string> | specifies the list of columns to ignore. These columns are stripped during parsing. Default = {} (no columns ignored)
header | bool | indicates whether the file includes a header row. If true, the first row in the file is a header row, not data. Default = true
column_names | std::vector<std::string> | specifies the list of column names. This is useful when the first row of the CSV isn't a header. Default = {}
skip_empty_rows | bool | specifies how empty rows should be interpreted. If true, empty rows are skipped. Default = false

The line terminator is '\n' by default. The reader uses std::getline and strips '\r' from line endings, so for now the line terminator is not configurable in custom dialects.
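
For illustration only, the stripping amounts to something like the sketch below; this is not the library's actual implementation, just the idea:

#include <fstream>
#include <string>

// Read one line and drop a trailing '\r' left over from CRLF line endings.
bool get_logical_line(std::ifstream& stream, std::string& line) {
  if (!std::getline(stream, line)) return false;
  if (!line.empty() && line.back() == '\r')
    line.pop_back();
  return true;
}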

Multi-character Delimiters

Consider this strange, messed up log file:

[Thread ID] :: [Log Level] :: [Log Message] :: {Timestamp}
04 :: INFO :: Hello World ::             1555164718
02        :: DEBUG :: Warning! Foo has happened                :: 1555463132

To parse this file, simply configure a new dialect that splits on "::" and trims whitespace, braces, and bracket characters.

csv::Reader csv;
csv.configure_dialect("my strange dialect")
  .delimiter("::")
  .trim_characters(' ', '[', ']', '{', '}');   

csv.read("test.csv");
for (auto& row : csv.rows()) {
  auto thread_id = row["Thread ID"];    // "04"
  auto log_level = row["Log Level"];    // "INFO"
  auto message = row["Log Message"];    // "Hello World"
  // do something
}

Ignoring Columns

Consider the following CSV. Let's say you don't care about the columns age and gender. Here, you can use .ignore_columns and provide a list of columns to ignore.

name, age, gender, email, department
Mark Johnson, 50, M, [email protected], BA
John Stevenson, 35, M, [email protected], IT
Jane Barkley, 25, F, [email protected], MGT

You can configure the dialect to ignore these columns like so:

csv::Reader csv;
csv.configure_dialect("ignore meh and fez")
  .delimiter(", ")
  .ignore_columns("age", "gender");

csv.read("test.csv");
auto rows = csv.rows();
// Your rows are:
// [{"name": "Mark Johnson", "email": "[email protected]", "department": "BA"},
//  {"name": "John Stevenson", "email": "[email protected]", "department": "IT"},
//  {"name": "Jane Barkley", "email": "[email protected]", "department": "MGT"}]

No Header?

Sometimes you have CSV files with no header row:

9 52 1
52 91 0
91 135 0
135 174 0
174 218 0
218 260 0
260 301 0
301 341 0
341 383 0
...

If you want to prevent the reader from parsing the first row as a header, simply:

  • Set .header to false
  • Provide a list of column names with .column_names(...)

csv.configure_dialect("no headers")
  .header(false)
  .column_names("foo", "bar", "baz");

The CSV rows will now look like this:

[{"foo": "9", "bar": "52", "baz": "1"}, {"foo": "52", "bar": "91", "baz": "0"}, ...]

If .column_names is not called, then the reader simply generates dictionary keys like so:

[{"0": "9", "1": "52", "2": "1"}, {"0": "52", "1": "91", "2": "0"}, ...]

Dealing with Empty Rows

Sometimes you have to deal with a CSV file that has empty lines; either in the middle or at the end of the file:

a,b,c
1,2,3

4,5,6

10,11,12



Here's how this gets parsed by default:

csv::Reader csv;
csv.read("inputs/empty_lines.csv");
auto rows = csv.rows();
// [{"a": 1, "b": 2, "c": 3}, {"a": "", "b": "", "c": ""}, {"a": "4", "b": "5", "c": "6"}, {"a": "", ...}]

If you don't care for these empty rows, simply call .skip_empty_rows(true)

csv::Reader csv;
csv.configure_dialect()
  .skip_empty_rows(true);
csv.read("inputs/empty_lines.csv");
auto rows = csv.rows();
// [{"a": 1, "b": 2, "c": 3}, {"a": "4", "b": "5", "c": "6"}, {"a": "10", "b": "11", "c": "12"}]

Reading first N rows

If you know exactly how many rows to parse, you can help out the reader by using the .read(filename, num_rows) overloaded method. This saves the reader from trying to figure out the number of lines in the CSV file. You can use this method to parse the first N rows of the file instead of parsing all of it.

csv::Reader foo;
foo.read("bar.csv", 1000);
auto rows = foo.rows();

Note: Do not provide a num_rows greater than the actual number of rows in the file; the reader will loop forever.
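
If you are unsure of the row count, one defensive option is to count the newlines yourself and clamp num_rows before calling .read(filename, num_rows). A sketch (remember that the count includes the header row when one is present):

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Count the '\n' characters so num_rows never exceeds what the file holds.
std::size_t count_lines(const std::string& filename) {
  std::ifstream stream(filename);
  return std::count(std::istreambuf_iterator<char>(stream),
                    std::istreambuf_iterator<char>(), '\n');
}

You can then pass something like std::min<std::size_t>(1000, count_lines("bar.csv")) as the second argument, subtracting one for the header row if the file has one.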

Performance Benchmark

// benchmark.cpp
#include <csv/reader.hpp>
#include <string>
#include <string_view>
#include <vector>

void parse(const std::string& filename) {
  csv::Reader foo;
  foo.read(filename);
  std::vector<csv::unordered_flat_map<std::string_view, std::string>> rows;
  while (foo.busy()) {
    if (foo.ready()) {
      auto row = foo.next_row();
      rows.push_back(row);
    }
  }
}

int main() {
  parse("test.csv");   // point this at the dataset being benchmarked
}
$ g++ -pthread -std=c++17 -O3 -Iinclude/ -o test benchmark.cpp
$ time ./test

Each test is run 30 times on an Intel(R) Core(TM) i7-6650U @ 2.20 GHz CPU.
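
For reference, a run can also be timed from within the program; the sketch below wraps the parse() function above with std::chrono and averages over 30 runs (an illustration, not the exact harness used here):

#include <chrono>

// Call parse() `runs` times and return the mean wall-clock time in seconds.
double average_seconds(const std::string& filename, int runs = 30) {
  using clock = std::chrono::steady_clock;
  double total = 0.0;
  for (int i = 0; i < runs; ++i) {
    auto start = clock::now();
    parse(filename);   // parse() as defined in benchmark.cpp above
    total += std::chrono::duration<double>(clock::now() - start).count();
  }
  return total / runs;
}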

Here are the average-case execution times:

Dataset                            | File Size | Rows      | Cols | Time
Demographic Statistics By Zip Code | 27 KB     | 237       | 46   | 0.026s
Simple 3-column CSV                | 14.1 MB   | 761,817   | 3    | 0.523s
Majestic Million                   | 77.7 MB   | 1,000,000 | 12   | 1.972s
Crimes 2001 - Present              | 1.50 GB   | 6,846,406 | 22   | 32.411s

Writing CSV files

Simply include writer.hpp and you're good to go.

#include <csv/writer.hpp>

To start writing CSV files, create a csv::Writer object and provide a filename:

csv::Writer foo("test.csv");

Constructing a writer spawns a worker thread that is ready to start writing rows. Using .configure_dialect, configure the dialect to be used by the writer. This is where you can specify the column names:

foo.configure_dialect()
  .delimiter(", ")
  .column_names("a", "b", "c");

Now it's time to write rows. You can do this in multiple ways:

foo.write_row("1", "2", "3");                                     // parameter packing
foo.write_row({"4", "5", "6"});                                   // std::vector
foo.write_row(std::map<std::string, std::string>{                 // std::map
  {"a", "7"}, {"b", "8"}, {"c", "9"} });
foo.write_row(std::unordered_map<std::string, std::string>{       // std::unordered_map
  {"a", "7"}, {"b", "8"}, {"c", "9"} });
foo.write_row(csv::unordered_flat_map<std::string, std::string>{  // csv::unordered_flat_map
  {"a", "7"}, {"b", "8"}, {"c", "9"} });

You can also omit one or more values dynamically when using maps:

foo.write_row(std::map<std::string, std::string>{                 // std::map
  {"a", "7"}, {"c", "9"} });                                      // omitting "b"
foo.write_row(std::unordered_map<std::string, std::string>{       // std::unordered_map
  {"b", "8"}, {"c", "9"} });                                      // omitting "a"
foo.write_row(csv::unordered_flat_map<std::string, std::string>{  // csv::unordered_flat_map
  {"a", "7"}, {"b", "8"} });                                      // omitting "c"

Finally, once you're done writing rows, call .close() to stop the worker thread and close the file stream.

foo.close();

Here's an example writing 3 million lines of CSV to a file:

csv::Writer foo("test.csv");
foo.configure_dialect()
  .delimiter(", ")
  .column_names("a", "b", "c");

for (long i = 0; i < 3000000; i++) {
  auto x = std::to_string(i % 100);
  auto y = std::to_string((i + 1) % 100);
  auto z = std::to_string((i + 2) % 100);
  foo.write_row(x, y, z);
}
foo.close();

The above code takes about 1.8 seconds to execute on my Surface Pro 4.

Steps For Contributors

Contributions are welcome, have a look at the CONTRIBUTING.md document for more information.

git clone https://github.com/p-ranav/csv.git
cd csv
git submodule update --init --recursive
mkdir build
cd build
cmake .. -DCSV_BUILD_TESTS=ON
cmake --build . --config Debug
ctest --output-on-failure -C Debug

Steps For Users

git clone https://github.com/p-ranav/csv.git
cd csv
mkdir build
cd build
cmake ../.
sudo make install

License

The project is available under the MIT license.

csv's People

Contributors

amirmasoudabdol, interruping, jwillikers, p-ranav


csv's Issues

non-multi-threaded mode

I would prefer to default to threads off, for simplicity and for error propagation, but I think csv looks like a good project. Looking around at a few of the sources, it seems this might not be too hard to make happen?

What am I doing wrong?

Hello!
What I want:

  1. I want to create a CSV log file containing a list of exceptions, so it can easily be browsed and analysed in Calc or Excel.
  2. I want to write a unit test for saving the exception list (in my CSV file).

What I did:
I wrote saveAsCSV() and loadFromCSV() functions:

void Exceptions::saveAsCSV(std::wstring aPathToLogFile)
{
    csv::Writer lLogCSV(gToNarrow(aPathToLogFile));
    lLogCSV.use_dialect("excel");
    lLogCSV.configure_dialect()
        .delimiter(", ")
        .column_names("Type", "Code", "Message", "Component",
            "File", "Line", "Function");

    for(std::list<ExceptionPtr>::iterator i(mExceptions.begin())
         ; i != mExceptions.end(); ++i)
    {
        lLogCSV.write_row(std::to_string((*i)->type())
            , std::to_string((*i)->code())
            , gToNarrow((*i)->message())
            , gToNarrow((*i)->component())
            , gToNarrow((*i)->file())
            , std::to_string((*i)->line())
            , gToNarrow((*i)->function())
            );
    }
    lLogCSV.close();
}
std::list<ExceptionPtr> Exceptions::loadFromCSV(std::wstring aPathToLogFile)
{
    dNotifyInfo(fr(tr("Try to load CSV file: {0}"), aPathToLogFile), eFileSystem);
    std::list<ExceptionPtr> lResult;

    csv::Reader lCSVFile;
    lCSVFile.use_dialect("excel");
    std::string lPathToFileName(gToNarrow(aPathToLogFile));
    lCSVFile.read(lPathToFileName);

    while(lCSVFile.busy())
    {
        if(lCSVFile.ready())
        {
            auto lRow = lCSVFile.next_row();
            std::cout << lRow["Type"]
                    << ", " << lRow["Code"]
                    << ", " << lRow["Message"]
                    << ", " << lRow["Component"]
                    << ", " << lRow["File"]
                    << ", " << lRow["Line"]
                    << ", " << lRow["Funciton"]
                    << std::endl << std::flush;

            ExceptionPtr lEx(new Exception(static_cast<Exception::Type>(
                std::stoi(lRow["Type"]))
                , std::stoull(lRow["Code"])
                , gToWide(lRow["Message"])
                , gToWide(lRow["Component"])
                , gToWide(lRow["File"])
                , std::stoi(lRow["Line"])
                , gToWide(lRow["Funciton"])
                ));
            lResult.push_back(lEx);
        }
        else
            std::this_thread::sleep_for(100ms);
    }

    return lResult;
}

The saveAsCSV() function works as expected and generates this file:

Type, Code, Message, Component, File, Line, Function
0, 1, My exception 1, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 54, virtual void ExceptionsTest::run()
0, 1, My exception 2, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 60, virtual void ExceptionsTest::run()
0, 1, My exception 3, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 66, virtual void ExceptionsTest::run()
1, 2, My warning 1, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 70, virtual void ExceptionsTest::run()
1, 2, My warning 2, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 71, virtual void ExceptionsTest::run()
1, 2, My warning 3, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 72, virtual void ExceptionsTest::run()
2, 3, My info 1, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 74, virtual void ExceptionsTest::run()
2, 3, My info 2, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 75, virtual void ExceptionsTest::run()
2, 3, My info 3, EnergoKodInstrumentyTest, /home/szyk/!-EnergoKod/!-Libs/EnergoKodInstrumenty/Tests/Src/ExceptionsTest.cpp, 76, virtual void ExceptionsTest::run()

But I am unable to load anything from the CSV file. The output is:

START of sub-test: Test saving exceptions to CSV.
Information! Code: 2, Message:
Try to load CSV file: /var/tmp/ExceptionTestsExceptionsTest.csv
0, , , , , , 
An exception occurred! Message:
stoull
Press <RETURN> to close this window...

Note: In the loadFromCSV() function I strictly followed the introductory example from your README.md, so I have no idea how to program it correctly.
Please help me.

P.S. If this is not a suitable place for such questions, please advise me where I should ask.

Thank you and best regards.

Issue with the CSV header

I'm trying to use the csv::Writer as a backend for my Persistent Manager class but for some reason I cannot get it to write the header. Basically, I have something like this:

class PM {
	class Writer;
}

class Writer{

	using namespace std;

	string filename;
	unique_ptr<csv::Writer> writer;

	Writer(string filename) filename(filename) {
		writer = make_unique<csv::Writer>(filename);
		writer->configure_dialect().delimiter(cols);
	}

	~Write(){
		writer->close();
	}

	void write(map<string, string> row) {
               // I override the method to accept vector<string> as well.
		writer->configure_dialect().column_names(cols);

		writer->write_row(row);
	}
}

Do you see any particular issue with this? I'm honestly not sure what's wrong; everything looks fine to me.

Not working correctly on a file with too many lines

Environment

  • Ubuntu 18.04
  • CLion 2019.1.4

Issue

I tested the program on a CSV file, 20k_rows_data.csv.txt, with 20K lines, and the program does not work correctly. (I renamed the file with a .txt extension because GitHub issues do not support uploading .csv files.)

#include <csv/reader.hpp>
#include <iostream>
#include <string>

int main() {
  csv::Reader csv;
  csv.read("../tests/inputs/20k_rows_data.csv.txt");
  auto rows = csv.rows();
  auto cols = csv.cols();
  int row_count = 0;
  for (auto row : rows) {
    std::string s = std::to_string(++row_count);
    for (auto col : cols) {
      s += ' ' + (std::string)(row[col]);
    }
    std::cout << s << std::endl;
  }
}

Part of the output looks like this (copied from my console):

5332     
5333     
5334 1 1 1 1 1
5335     
5336     
5337 1 1 1 1 1
5338     
5339     
5340     
5341 1 1 1 1 1
5342     
5343     

Note that the outputs are not the same each time I run it.

Overload write for void write_row(unordered_flat_map<std::string_view, std::string> row_map)

Is it possible to make it simpler to read data and then write that same data?

The read function reads in rows as unordered_flat_map<std::string_view, std::string>, but the write function expects unordered_flat_map<std::string, std::string>. This appears to require manual conversion between the types, making something like converting a CSV file to a TSV file more difficult than I had hoped.
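
Until such an overload exists, the manual conversion mentioned above looks roughly like this: copy each std::string_view-keyed row into one of the string-keyed map types the writer already accepts (std::map shown here). A sketch, assuming a reader and writer configured as in the README:

#include <csv/reader.hpp>
#include <csv/writer.hpp>
#include <map>
#include <string>

// Copy every row from the reader to the writer, converting the key type on the way.
void copy_rows(csv::Reader& reader, csv::Writer& writer) {
  while (reader.busy()) {
    if (reader.ready()) {
      auto row = reader.next_row();            // keys are std::string_view
      std::map<std::string, std::string> out;  // the writer accepts string-keyed maps
      for (auto& kv : row)
        out[std::string(kv.first)] = kv.second;
      writer.write_row(out);
    }
  }
}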

Support reading directly from std::istream

I have a use case where I want to parse the first part of a CSV file with my own parser and the rest of it with your parser. Currently, I will have to re-parse most of the file with your parser because it only takes a filename as an argument. I have a std::istream exactly at the position where I want to pick up parsing with your parser. Would it be possible to support reading from a std::istream?

Option to append to file when writing

This is same use case as #16 except for writing to a file. An option to have the writer append to an existing file would be very helpful. Even better might be adding a csv::Writer constructor which takes a std::ostream.

Implement Dialect.doublequote Dialect.escapechar

Dialect.doublequote
Controls how instances of quotechar appearing inside a field should themselves be quoted. When True, the character is doubled. When False, the escapechar is used as a prefix to the quotechar. It defaults to True.

On output, if doublequote is False and no escapechar is set, Error is raised if a quotechar is found in a field.

Dialect.escapechar
A one-character string used by the writer to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False. On reading, the escapechar removes any special meaning from the following character. It defaults to None, which disables escaping.
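
For clarity, the two strategies escape a field containing the quote character differently. The standalone sketch below shows the difference using plain string manipulation; it is not part of this library's API:

#include <iostream>
#include <string>

// doublequote = true: embedded quote characters are doubled.
std::string quote_doubled(const std::string& field, char quote = '"') {
  std::string out(1, quote);
  for (char c : field) {
    if (c == quote) out += quote;    // " becomes ""
    out += c;
  }
  out += quote;
  return out;
}

// doublequote = false: embedded quotes are prefixed with the escape character.
std::string quote_escaped(const std::string& field, char quote = '"', char escape = '\\') {
  std::string out(1, quote);
  for (char c : field) {
    if (c == quote) out += escape;   // " becomes \"
    out += c;
  }
  out += quote;
  return out;
}

int main() {
  std::cout << quote_doubled("say \"hi\"") << '\n';   // prints: "say ""hi"""
  std::cout << quote_escaped("say \"hi\"") << '\n';   // prints: "say \"hi\""
}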
