Comments (4)
Here is the testing that I did for this ticket.
Table Definition
CREATE TABLE test.dsbulk (
pk int PRIMARY KEY,
c1 text
);
Input CSV file contents. Note that the first value is wrapped in single quotes and the second is not.
$ cat dsbulk.csv
pk,c1
1,'First line.\nSecond line.'
2,First line.\nSecond line.
DSBulk version
$ ./dsbulk --version
DataStax Bulk Loader v1.7.0
DSBulk execution command and output
$ ./dsbulk load -k test -t dsbulk -header true -url ./dsbulk.csv
Operation directory: /Users/madhavan.sridharan/Data/Tools/DSBulk/dsbulk-1.7.0/bin/logs/LOAD_20201014-143653-997467
total | failed | rows/s | p50ms | p99ms | p999ms | batches
2 | 0 | 7 | 5.89 | 8.59 | 8.59 | 1.00
Operation LOAD_20201014-143653-997467 completed successfully in less than one second.
Last processed positions can be found in positions.txt
Result from CQLSH
cqlsh:test> select * from test.dsbulk ;
pk | c1
----+-----------------------------
1 | 'First line.\nSecond line.'
2 | First line.\nSecond line.
(2 rows)
DSBulk unload operation activity
$ ./dsbulk unload -k test -t dsbulk -header true -url ./dsbulk_unload
Operation directory: /Users/madhavan.sridharan/Data/Tools/DSBulk/dsbulk-1.7.0/bin/logs/UNLOAD_20201014-144238-583471
total | failed | rows/s | p50ms | p99ms | p999ms
2 | 0 | 4 | 7.62 | 12.98 | 12.98
Operation UNLOAD_20201014-144238-583471 completed successfully in less than one second.
Output of the unloaded csv files
$ ls dsbulk_unload/output-00000
output-000001.csv output-000002.csv
$ cat dsbulk_unload/output-00000*.csv
pk,c1
2,First line.\nSecond line.
pk,c1
1,'First line.\nSecond line.'
Java Program using Java Driver 4.9.0
I also tried inserting a record with DataStax Java Driver 4.9.0
using the code below, and still couldn't reproduce this behavior:
import java.net.InetSocketAddress;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.ResultSet;

public class DAT617 {
    public static void main(String... args) {
        // Change the contact point and datacenter to match your environment
        try (CqlSession session = CqlSession.builder()
                .addContactPoint(new InetSocketAddress("localhost", 9042))
                .withLocalDatacenter("dc1")
                .build()) {
            // "\\n" in Java source is a backslash followed by 'n', not a line break
            ResultSet rs = session.execute("insert into test.dsbulk(pk,c1) values (3,'First line.\\nSecond line.')");
        }
    }
}
CQLSH output of the 3rd record inserted into the test table
cqlsh:test> select * from test.dsbulk ;
pk | c1
----+-----------------------------
1 | 'First line.\nSecond line.'
2 | First line.\nSecond line.
3 | First line.\nSecond line.
(3 rows)
DSBulk unload command run and the output csv file contents
$ ./dsbulk unload -k test -t dsbulk -header true -url ./dsbulk_unload
Operation directory: /Users/madhavan.sridharan/Data/Tools/DSBulk/dsbulk-1.7.0/bin/logs/UNLOAD_20201014-150212-234300
total | failed | rows/s | p50ms | p99ms | p999ms
3 | 0 | 9 | 1.41 | 1.49 | 1.49
Operation UNLOAD_20201014-150212-234300 completed successfully in less than one second.
The contents of the unloaded CSV files are below:
$ cat dsbulk_unload/output-00000*.csv
pk,c1
1,'First line.\nSecond line.'
pk,c1
3,First line.\nSecond line.
pk,c1
2,First line.\nSecond line.
FWIW, I tested this against DSE 6.7.10. I don't think DSBulk is doing anything wrong here. Do we have a minimal reproducible example for this behavior?
This has been discussed internally and it appears there was just some misunderstanding about how newline characters are processed by CQLSH and DSBulk. All is good now.
Hi @adutra, I see this issue is still present in versions 1.7 and 1.8 running against a 5.1.14 cluster. The cqlsh COPY TO command copies the exact contents, but the dsbulk unload command interprets or omits escaped characters, such as \0xf and \r\n, when present in a text column.
Hi @ibeatmybrothers, do you have a reproducer? We discussed this internally and I wasn't able to see any defect in how DSBulk exports such characters. Thanks!
(Note: the example reported above by @msmygit did not contain actual line breaks, only escaped \n sequences. These are not converted by DSBulk in any way.)
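To make the distinction in the note concrete, here is a minimal sketch in plain Java; it needs no driver or cluster. The class name NewlineDemo and the helper hasLineBreak are hypothetical names chosen for illustration only. The first constant uses the same Java literal as the DAT617 program earlier in the thread, which yields a backslash followed by the letter n; the second contains a real line-feed character, which is probably what a reproducer for this ticket would need to store.

```java
class NewlineDemo {
    // Same Java literal as in the DAT617 program: "\\n" compiles to two
    // characters, a backslash and the letter 'n'. No line break is stored.
    static final String ESCAPED = "First line.\\nSecond line.";

    // "\n" in Java source compiles to a single line-feed character (U+000A).
    static final String ACTUAL = "First line.\nSecond line.";

    // Hypothetical helper: true only if the value holds a real line feed.
    static boolean hasLineBreak(String value) {
        return value.indexOf('\n') >= 0;
    }

    public static void main(String[] args) {
        System.out.println("escaped: length=" + ESCAPED.length()
                + ", line break=" + hasLineBreak(ESCAPED));
        System.out.println("actual:  length=" + ACTUAL.length()
                + ", line break=" + hasLineBreak(ACTUAL));
    }
}
```

The two values differ by one character in length, and only the second contains a line break. Rows 1 to 3 in the test above all hold the first kind of value, so they do not exercise how DSBulk handles real line breaks on unload.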