aotimme / gocsv
Command-line CSV processing utility.
License: MIT License
The broader logic of Cap implicitly allows the user to leave the --names flag empty and specify only --default-name:
If names happens to be an empty slice, then numNames is 0. The first half of the predicate is true, but if defaultName is not empty then Cap doesn't error out:
Lines 49 to 52 in f9d4372
Then, with numNames being 0, the code flows through the else branch and builds the header exclusively from defaultName:
Lines 60 to 71 in f9d4372
And I would like this behavior: if I have a headerless CSV with any number of columns that I want to feed into another GoCSV command, I want to be able to easily cap it with --default-name=Col.
But, prior to that, Cap tries to create the slice of strings, names, on the assumption that the --names flag is not an empty string:
Lines 37 to 40 in f9d4372
The call to GetArrayFromCsvString(...) will panic if the passed string is empty.
I propose skipping that call when --names is empty, so that --default-name alone is enough.
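A minimal sketch of the idea, using strings.Split as a stand-in for GetArrayFromCsvString; the function name, error message, and column-numbering scheme are illustrative, not the actual Cap source:

```go
package main

import (
	"fmt"
	"strings"
)

// buildCapHeader sketches the proposed logic: when --names is empty,
// skip parsing entirely and fall back to numbered default names.
func buildCapHeader(namesString, defaultName string, numColumns int) ([]string, error) {
	var names []string
	if namesString != "" { // guard: don't parse an empty --names string
		names = strings.Split(namesString, ",")
	}
	if len(names) == 0 && defaultName == "" {
		return nil, fmt.Errorf("must specify --names and/or --default-name")
	}
	header := make([]string, numColumns)
	for i := 0; i < numColumns; i++ {
		if i < len(names) {
			header[i] = names[i]
		} else {
			header[i] = fmt.Sprintf("%s%d", defaultName, i+1)
		}
	}
	return header, nil
}

func main() {
	header, err := buildCapHeader("", "Col", 3)
	fmt.Println(header, err)
}
```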
gocsv add assumes the input has a header row, which isn't spelled out in the documentation; it took me a minute to figure out why this wasn't working.
A sample CSV:
% cat test.csv
1a
2a
3a
Naive attempt:
% gocsv add -t 'foo' test.csv
1a,
2a,foo
3a,foo
cap then add:
% gocsv cap -names 'C1' test.csv | gocsv add -t 'foo'
C1,
1a,foo
2a,foo
3a,foo
that works, so finally:
% gocsv cap -names 'C1' test.csv | gocsv add -t 'foo' | gocsv behead
1a,foo
2a,foo
3a,foo
So, maybe spell out that add assumes a header, and to pipe from cap
if your data doesn't already have one?
Hello there, I have a CSV file with "\x01" as the delimiter and I want to use gocsv to change it to another delimiter such as "," or "\t".
I have tried head input | gocsv delim -i "\x01" -o "\t" > output
but nothing changed.
What should I do?
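My guess (an assumption, not confirmed in the thread) is that the shell is passing the four literal characters \x01 rather than the control character; in bash you can pass the real byte with ANSI-C quoting, i.e. -i $'\x01'. The conversion itself is straightforward, as this Go sketch with encoding/csv shows (illustrative, not gocsv's implementation):

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strings"
)

// convertDelim re-reads CSV data with one delimiter and re-writes it
// with another. Illustrative sketch, not gocsv's code.
func convertDelim(input string, in, out rune) (string, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = in // e.g. '\x01'
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	w.Comma = out // e.g. '\t'
	rows, err := r.ReadAll()
	if err != nil {
		return "", err
	}
	if err := w.WriteAll(rows); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	got, err := convertDelim("a\x01b\n1\x012\n", '\x01', '\t')
	fmt.Printf("%q %v\n", got, err)
}
```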
Hi,
I am trying to install gocsv on my Ubuntu 14.04 instance but am experiencing permission issues.
I was executing the following command:
/bin/bash <(curl -s https://raw.githubusercontent.com/DataFoxCo/gocsv/latest/scripts/install-latest-darwin-amd64.sh)
That means it is trying to write to /usr/local/bin/gocsv.
What I did was create a folder gocsv in /usr/local/bin and then change the permissions as below:
sudo mkdir -p /usr/local/bin/gocsv
sudo chmod -R 777 gocsv
/bin/bash <(curl -s https://raw.githubusercontent.com/DataFoxCo/gocsv/latest/scripts/install-latest-darwin-amd64.sh)
sudo chown -R mtaziz:mtaziz gocsv
Following the above, it installed successfully, but when I run gocsv help, it says Permission denied.
Any help in this regard would be highly appreciated.
Thank you.
I'm in a situation where I want to take an existing CSV and add a few empty columns with specific names to it. The CSV is an inventory of website URLs, and I want to add some columns to it like "Up To Date", "Reviewed", "Delete", etc. for an audit spreadsheet. I am currently accomplishing this with --template like so:
gocsv template -t " " --name "Reviewed"
but it seems a little hacky to have to specify a value there, when all I really want is a new empty column with a specific name. It would be nice to be able to do something like
gocsv add --name "Reviewed"
and be done with it.
Thanks for a great tool!
If I have a csv file such as:
Test file: test.csv
cat test.csv
ContactName,EmailAddress
Run the following:
../gocsv select --columns ContactName test.csv
panic: Could not find header "ContactName"
goroutine 1 [running]:
main.GetIndicesForColumnsOrPanic(0xc42000a240, 0x2, 0x2, 0xc420062550, 0x1, 0x1, 0xffffffffffffffff, 0x1, 0x1)
/Users/alden/gocsv/src/utils.go:18 +0xc1
main.SelectColumns(0x5b4800, 0xc42000a220, 0xc420062550, 0x1, 0x1)
/Users/alden/gocsv/src/select.go:106 +0x18d
main.(*SelectSubcommand).Run(0xc42000a160, 0xc42000e180, 0x1, 0x1)
/Users/alden/gocsv/src/select.go:43 +0x133
main.main()
/Users/alden/gocsv/src/gocsv.go:96 +0x22d
As can be seen, select cannot find the first column.
If I change test.csv to:
cat test.csv
fred,ContactName,EmailAddress
Then run
../gocsv select --columns ContactName test.csv
ContactName
Now that ContactName is the second column, the select works.
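One likely cause (my assumption; the report doesn't confirm it) is a UTF-8 byte-order mark at the start of the file: the invisible BOM bytes get glued onto the first header name, so a plain comparison against "ContactName" fails for the first column only. A quick Go illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// stripBOM removes a leading UTF-8 byte-order mark. A BOM-prefixed header
// reads as "\uFEFFContactName", so an exact string match against
// "ContactName" fails for the first column only.
func stripBOM(s string) string {
	return strings.TrimPrefix(s, "\uFEFF")
}

func main() {
	header := "\uFEFFContactName" // what a naive reader sees in column one
	fmt.Println(header == "ContactName")           // false: BOM bytes differ
	fmt.Println(stripBOM(header) == "ContactName") // true
}
```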
The program jq offers an option to output the raw value, which is really useful in bash scripting. It would be nice if gocsv also had this, to output a column without CSV formatting (quotes, and newlines as \n) to allow querying and extracting data.
Every error produced by gocsv also dumps a stack trace to the console.
The stack trace is great for debugging but annoying when using the tool and making simple mistakes. It makes it hard to see the actual error message.
It would be nice if the stack trace were suppressed unless a switch (--debug) is added to the command line.
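A minimal sketch of the idea, converting a panic into a plain error message and appending the stack trace only on request; the function and flag name are illustrative, not a patch against gocsv's main:

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// runSafely converts a panic into a plain error message, appending the
// stack trace only when debugMode (e.g. a --debug switch) is set.
func runSafely(f func(), debugMode bool) (errMsg string, failed bool) {
	defer func() {
		if r := recover(); r != nil {
			errMsg = fmt.Sprint("Error: ", r)
			if debugMode {
				errMsg += "\n" + string(debug.Stack())
			}
			failed = true
		}
	}()
	f()
	return
}

func main() {
	msg, _ := runSafely(func() { panic(`Could not find header "ContactName"`) }, false)
	fmt.Println(msg) // concise error, no goroutine dump
}
```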
Hi,
Can I suggest an improvement to the release practice? Currently, the only release (described as 'Latest Release') dates from 2016. The documented way to make a release is to edit this release and update the binaries, which I think is not ideal. It is not clear how outdated the binaries actually are, which already caused confusion. Not to mention older releases are lost.
If we make new tagged releases from time to time, it should still be possible to refer to the latest release, e.g. in the macOS install script, as this page details.
Although I don't have permission to actually make a release, I'd be happy to submit a pull request with some changes. Here's where I think updates might be required:
Hello,
First, many thanks for this great tool!
I face an issue using the sql subcommand with column names that contain square brackets:
# echo -e "id,f[0],f[1]\n1,2,3" | gocsv sql -q "select * from stdin" -
# Error: unrecognized token: "]"
Is there a solution?
When a CSV file has an escaped quote, gocsv returns a parse error.
$ cat sample.csv | gocsv filter -c 2 -eq "af4ba48d_wp[af4ba48d_wp] @ localhost []" | gocsv filter -c 5 -eq "Query" | gocsv select -c 6
Error: parse error on line 3, column 140: extraneous or missing " in quoted-field
SELECT option_value FROM wp_options WHERE option_name = 'disabled_hit_count' LIMIT 1
SHOW FULL COLUMNS FROM `wp_options`
I'm not sure if quote escapes like this are something you would want to handle, as the standard way to escape quotes in CSV is to just repeat the quote twice, like "", as is done in Excel. This CSV could easily be translated to that with sed 's/\\\\\\"/""/g', but I thought I'd mention it in case it's something you see fit to handle, as the MySQL CSV engine seems to escape quotes in that manner.
I forgot to pass the column switch to the join and got an index out of range.
It should state that a required switch is missing.
gocsv join --left services_with_id.csv iahproducts.csv
panic: runtime error: index out of range
goroutine 1 [running]:
main.GetArrayFromCsvString(0x0, 0x0, 0xc42000a132, 0x594eca, 0x5)
/Users/alden/gocsv/src/utils.go:147 +0x254
main.(*JoinSubcommand).Run(0xc42000a120, 0xc42000e170, 0x2, 0x2)
/Users/alden/gocsv/src/join.go:51 +0x75
main.main()
/Users/alden/gocsv/src/gocsv.go:96 +0x22d
Would like to be able to specify how many rows to skip before performing a task, for situations where an export I am editing has more than one header row.
I've downloaded gocsv-windows-4.0-amd64.zip
and when I extract gocsv.exe, it triggers Sophos' malware detection.
The malware in question is identified as CXrep/MalGo-A. Is there an alternative?
When using the stack subcommand, if the file headers of any of the subsequent files (after the 1st) don't match up, print a message with the filename that is failing.
Provide a switch that allows the input of filenames to be directory based. If a directory has more files than is allowed on the command line (when using * for example) the command fails (as expected).
e.g. this fails
gocsv stack /my/dir/of/10M_files/*.csv
So the only way to get around this is to:
find /my/dir/of/10M_files/ -name "*.csv" -exec gocsv stack "{}" ";"
go get github.com/DataFoxCo/gocsv
now results in:
src/github.com/DataFoxCo/gocsv/cmd/xlsx.go:105:27: sheet.Rows undefined (type *xlsx.Sheet has no field or method Rows)
Cannot convert file from stdin
GoCSV is currently using Sprig v3.1.0. The latest Sprig is v3.2.1, which notably has math functions for floats (3.1.0 only has integer math functions).
When using gocsv's split subcommand, the output file names have suffixes such as -1.csv, etc., so if there are more than 9 the filenames don't sort properly. It would be nice to optionally zero-pad the number in the suffix with enough zeroes that they sort properly.
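The padding itself is a one-liner once the total number of chunks is known; a sketch (the "-NN.csv" suffix shape is assumed from the issue, not taken from gocsv's code):

```go
package main

import (
	"fmt"
	"strconv"
)

// paddedSuffix zero-pads the chunk index so output filenames sort
// lexically: out-01.csv sorts before out-12.csv.
func paddedSuffix(base string, index, total int) string {
	width := len(strconv.Itoa(total)) // digits needed for the largest index
	return fmt.Sprintf("%s-%0*d.csv", base, width, index)
}

func main() {
	fmt.Println(paddedSuffix("out", 1, 12))  // out-01.csv
	fmt.Println(paddedSuffix("out", 12, 12)) // out-12.csv
}
```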
I work almost exclusively with semicolon-delimited CSVs and it's a bit frustrating to always have to use gocsv delim before and after the actual commands. In some cases, like joins, gocsv delim isn't even really an option. Would you consider supporting an alternative default delimiter, perhaps set via an environment variable? Would you review a pull request that adds this feature?
Since I needed this feature quickly, I set the default delimiter to semicolon in a fork, but it would of course be nicer to make it configurable in the main repository. I think tabs are also very common as delimiter and that would be another use case.
The join syntax doesn't seem to quite match the sql syntax.
When I 'left' join a CSV file I'm expecting to get rows from the left table only if they match a row in the right-hand table.
In fact I get every row in the left table regardless of whether they are in the right table or not.
This perhaps seems like it might make sense for csv files, but then what does the 'outer' join method do?
When converting an XLSX to CSV, if there are 5 columns, you'd expect the output to be:
column_01,column_02,column_03,column_04,column_05
text,text,text,text,text
However if the last column is blank you get this:
text,text,text,text
It is missing that last comma delimiter, and running other operations like select will error at that row due to an incorrect number of columns.
How do I use unique and then save the result as a named CSV file?
I'm using the --regex and --repl args.
The problem is that I need to preserve the original column.
What I really want is to output the result of the replace into a new column.
I use the gocsv join subcommand with 2 CSV files, but only with a single-column condition.
If more than one column is to be joined on, it fails.
How do I join 2 CSV files on a multiple-column condition?
The following function has two issues:
Lines 175 to 184 in 5924c92
Sometimes I want to squeeze as many columns onto the screen as will fit, and I try gocsv view -w 1 file.csv, like:
go run main.go view -w 1 << EOF
Col1-really-long-name,Col2-really-long-name
1,2
3,4
EOF
I naively expect something like:
+---+---+
| C | C |
+---+---+
| 1 | 2 |
+---+---+
| 3 | 4 |
+---+---+
Instead GoCSV panics:
panic: runtime error: slice bounds out of range [:-2]
...
getTruncatedLine assumes that width is greater than or equal to 3 when it tries to truncate an extra 3 chars (but not runes) to make room for the ellipsis, return line[:width-3] + "...", leading to a negative high bound for the re-slice operation.
I think if we want to keep the ellipsis, then view's Run function should guard against a width of 1 or 2:
Lines 34 to 38 in 5924c92
and the documentation should be updated to say --max-width must be a minimum of 3 (since the user would probably be confused or disappointed that 0 did nothing).
Those fixes won't actually give me the expected output I shared, but I understand that's not possible without other changes, so what I actually want is not part of this issue.
As for "(but not runes)" in the explanation above...
go run main.go view -w 15 << EOF
"Foobarbaz 日本のルーン",Col2-really-long-name
1,2
3,4
EOF
I expect:
+-----------------+-----------------+
| Foobarbaz 日本... | Col2-really-... |
+-----------------+-----------------+
| 1 | 2 |
+-----------------+-----------------+
| 3 | 4 |
+-----------------+-----------------+
Instead:
+-----------------+-----------------+
| Foobarbaz �... | Col2-really-... |
+-----------------+-----------------+
| 1 | 2 |
+-----------------+-----------------+
| 3 | 4 |
+-----------------+-----------------+
While getTruncatedLine correctly checks for length with utf8.RuneCountInString, when it comes to truncating it operates on the (UTF-8-encoded) string, so its slice bounds are off: return line[:width-3].
line could be converted to a slice of runes, re-sliced to width, then converted back to string.
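Putting both fixes together, a sketch of a rune-safe truncation (my version with illustrative names, not a patch against the actual source):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateLine shortens line to at most width runes, ending with "..." when
// truncation happens. Widths below 3 leave no room for the ellipsis, so we
// fall back to a hard cut; ideally the caller rejects --max-width < 3.
func truncateLine(line string, width int) string {
	if utf8.RuneCountInString(line) <= width {
		return line
	}
	runes := []rune(line) // slice runes, not bytes, so multi-byte chars survive
	if width < 3 {
		return string(runes[:width])
	}
	return string(runes[:width-3]) + "..."
}

func main() {
	fmt.Println(truncateLine("Foobarbaz 日本のルーン", 15)) // Foobarbaz 日本...
	fmt.Println(truncateLine("short", 15))                 // short
}
```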
I've been using left/right outer joins and need to identify rows where the join didn't match.
I can do this with a filter and a regex but that's a bit painful.
It would be nice to have a filter that explicitly matches a blank/empty column.
e.g.
gocsv filter --columns a,b --empty
I think the semantics of empty should evaluate to true even if the field contains spaces.
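The proposed semantics are simple to express; a sketch (function name mine, not gocsv's code):

```go
package main

import (
	"fmt"
	"strings"
)

// isEmptyField reports whether a field is empty or whitespace-only,
// matching the proposed semantics of a --empty filter.
func isEmptyField(s string) bool {
	return strings.TrimSpace(s) == ""
}

func main() {
	fmt.Println(isEmptyField("   ")) // true: spaces still count as empty
	fmt.Println(isEmptyField("x"))   // false
}
```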
It took me a little while to work out what was going wrong, and this may be a little hard to fix.
Essentially I habitually add a space after a comma.
The result for me was that when specifying columns to the --columns switch I was getting the error 'too many files'.
I eventually worked out that the --columns switch looks for the first space to determine the end of the column list.
This isn't documented and the resulting error was non-obvious.
This either needs to be explicitly documented or the command line changed to deal with spaces.
It looks like changing the command-line parsing could only be done in a non-backward-compatible way, so it might be better to just highlight this fact in the doco.
I am on Windows 10 and ran
gocsv sql --query "SELECT column FROM sample" sample.csv
but in vain.
The CMD window shows
Error: Binary was compiled with 'CGO_ENABLED=0', go-sqlite3 requires cgo to work. This is a stub
It would be great if gocsv could convert csv input into json output :-)
The doco doesn't provide any description of how to stipulate a column name when the same column name appears in multiple CSV files, or even when the same column name appears twice in a single file.
For instance, I'm joining two files that have a common column name.
The contents of these two columns may be different.
I am looking to run a filter after joining the two files but can't work out how to stipulate the second of the two column names to apply the filter to.
Firstly, thanks for this tool - I like it, very useful. One question: the clean option introduces a new empty line at the end of the file.
input:
A,B,,,
0,0.8570,,,
499,0.8570,,,
999,0.9021,,,
1499,0.9498,,,
1999,1.0000,,,
2499,1.0527,,,
2999,1.0528,,,
3499,1.0528,,,
becomes:
A,B
0,0.8570
499,0.8570
999,0.9021
1499,0.9498
1999,1.0000
2499,1.0527
2999,1.0528
3499,1.0528
Is this intentional and by design?
The fix for #54 changes the delim subcommand's default behavior concerning the input and output delimiters.
The subcommand used to do a run-time check: if the parsed rune for either in or out was not the zero value, the delimiter was set to that rune; if the rune was the zero value, the csv.Reader's default Comma value (,) was used.
#54 changed the delimiter parsing to error out immediately if no delimiter was explicitly set: the csv.Reader's default Comma value no longer matters.
Now, the delim subcommand requires both the in and out delimiters to be explicitly set. That breaks at least one existing alias/script for me. I'd like delim to return to having a default behavior that just works. Setting default values for the input and output flags seems like a sensible correction to me:
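A sketch of what I mean, giving both flags a comma default so an unset flag falls back to the old behavior instead of erroring out; the flag wiring and names here are illustrative, not the actual delim source:

```go
package main

import (
	"flag"
	"fmt"
)

// delimFlags parses input/output delimiter flags with "," as the default,
// so leaving either unset just works. Illustrative, not gocsv's code.
func delimFlags(args []string) (in, out string, err error) {
	fs := flag.NewFlagSet("delim", flag.ContinueOnError)
	inFlag := fs.String("input", ",", "input delimiter")
	outFlag := fs.String("output", ",", "output delimiter")
	if err := fs.Parse(args); err != nil {
		return "", "", err
	}
	return *inFlag, *outFlag, nil
}

func main() {
	in, out, _ := delimFlags([]string{"--output", "\t"})
	fmt.Printf("%q -> %q\n", in, out)
}
```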
Thank you for developing this, it seems to be a very useful tool. However, when converting an xlsx file to csv with gocsv xlsx, it would be great to have the option to specify the encoding of the output csv file.
Also, when doing a batch conversion of multiple files, it would be nice to be able to specify the destination directory for the output files instead of creating a directory for the CSVs.
Feature request: It would be amazing to be able to perform the following analysis with gocsv - unless of course I've missed something!
https://www.fireeye.com/blog/threat-research/2012/11/indepth-data-stacking.html
Would you be open to having gocsv be buildable and installable with go get?
I would like to request the ability to run basic regex transformation on a column.
Something along the format of:
--regexreplace "regex string to match" "replacement regex"
For example, there are a few main ways I generally use my regexes - 2 of which use capture groups:
Ex. 1 : I have a column that has only datafox urls and I would like, for any cell that matches the regex, to only give me the id or slug at the end:
--regexreplace "^http://datafox.com/.*/" ""
Replacement with an empty string
--regexreplace "(http://datafox.com/.*/)(\w{24})" "$2"
Replacement with the second capture group
--regexreplace "(last name), (first name)" "$2 $1"
Formatting but swapping capture groups and adding a space in between - where all characters are literal except for the capture group # reference
--regexreplace ".*@gmail.com|.*@aol.com|.*@hotmail.com" ""
Another example of replacement with empty string so that only cells without those matches would remain
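All four examples map directly onto Go's regexp package, which uses the same $1/$2 group references; a sketch (function name mine, mirroring the proposed --regexreplace "match" "replacement" shape):

```go
package main

import (
	"fmt"
	"regexp"
)

// regexReplace applies one match/replacement pair to a cell value,
// mirroring the proposed --regexreplace "match" "replacement" flags.
func regexReplace(cell, match, repl string) string {
	return regexp.MustCompile(match).ReplaceAllString(cell, repl)
}

func main() {
	// Ex. 2: keep only the trailing id via the second capture group.
	fmt.Println(regexReplace("http://datafox.com/x/abcdefabcdefabcdefabcdef",
		`(http://datafox.com/.*/)(\w{24})`, "$2"))
	// Ex. 3: swap "last, first" into "first last".
	fmt.Println(regexReplace("Doe, Jane", `(\w+), (\w+)`, "$2 $1"))
}
```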