Comments (12)
Oops, actually, upon reading your error message it looks like you're using simple authentication instead of Kerberos. In that case, no, skein has the same bug under simple authentication. I've been putting off fixing it in favor of other things, but I'll track that down today. Should be a simple fix.
I also tried pyarrow, but it complained about not being able to load libhdfs
Pyarrow should work fine on any system, but you may need to set some environment variables, see: https://arrow.apache.org/docs/python/filesystems.html#hadoop-file-system-hdfs
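As a rough illustration of the environment variables mentioned above, here is a minimal pre-flight check. It assumes the variable names described in the linked Arrow documentation (HADOOP_HOME, JAVA_HOME, CLASSPATH, plus the optional ARROW_LIBHDFS_DIR). The helper `missing_libhdfs_vars` is hypothetical and not part of pyarrow.

```python
import os

# Variables the Arrow docs describe for libhdfs.
# ARROW_LIBHDFS_DIR is optional, so it is not checked here.
REQUIRED_VARS = ("HADOOP_HOME", "JAVA_HOME", "CLASSPATH")


def missing_libhdfs_vars(environ=None):
    """Return the required variables that are unset or empty (hypothetical helper)."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_libhdfs_vars()
    if missing:
        print("Set these before using pyarrow's HDFS support:", ", ".join(missing))
    else:
        print("libhdfs environment looks complete")
```

Running this inside the container before importing pyarrow may help tell a missing-variable problem apart from a genuinely absent libhdfs library.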
from knit.
What version of pandas? 0.23.0 introduced a couple of bugs when using compression: pandas-dev/pandas#21144 and pandas-dev/pandas#17778, and I think the fixes for those introduced another bug, with a fix being included in 0.23.2.
@jcrist, do containers created by skein have credentials to write to HDFS as the initiating user?
Yes. The delegation token for the default filesystem is provided in each container, and picked up automatically by libhdfs (the backend for pyarrow's hdfs reader), and maybe libhdfs3 (haven't tested). Currently skein doesn't handle delegation token renewal, so this will stop working after the token expires, but that should only matter for long-running jobs (> 1 day).
Thanks @jcrist.
@yuriy-davygora, would you like to try using skein?
The only reason I tried it with knit was that I wanted to have a quick and easy test, but if it does not work, then I'll try skein.
One more question, though: when I set the permission 777 for everything, I received another error (see the update to my original question). I was using hdfs3 at the time. I also tried pyarrow, but it complained about not being able to load libhdfs, and I could not get it to run with libhdfs3. Is this again a knit-related issue, or is this something else?
@TomAugspurger, did something change in pandas to_csv? Calling .getvalue() seems to mean it's assuming the file-like is some BytesIO. This looks like an hdfs3 issue, but I think it showed up in one of the other file-system backends.
Ah, correct, this seems to be in the ZIP branch (https://github.com/pandas-dev/pandas/blob/master/pandas/io/formats/csvs.py#L176), but the dask file object also handles compression internally. Not sure what to do about that. @yuriy-davygora, you could try passing compression= to to_csv and trying various parameters.
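A minimal sketch of that suggestion, under the assumption that an in-memory buffer stands in for the dask/hdfs3 file object (which is not reproduced here). It passes `compression=None` explicitly so pandas only writes text, and compresses separately with the standard library if a compressed file is still wanted. The sample DataFrame is made up for illustration.

```python
import gzip
import io

import pandas as pd

# Illustrative data only.
df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})

# Ask pandas not to compress, so it just writes text to the file-like object.
buf = io.StringIO()
df.to_csv(buf, index=False, compression=None)
csv_text = buf.getvalue()

# Compress outside pandas if compressed output is still needed.
compressed = gzip.compress(csv_text.encode("utf-8"))
```

Whether this avoids the error with a real hdfs3 or dask file object depends on the pandas version in use, so treat it as a starting point for experimenting.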
@TomAugspurger: conda installed 0.23.1 automatically; I did not specify the pandas version explicitly. I will try 0.23.2 tomorrow.
@jcrist Thank you for your answer, I will give pyarrow another go tomorrow.
@yuriy-davygora, dask-yarn (https://dask-yarn.readthedocs.io/en/latest/) has been released and now uses Skein (https://jcrist.github.io/skein/index.html), a more robust library for python/yarn interaction. The above permissions issue has been addressed there.
As for the to_csv issue, I'm not sure what needs to be done here. This has nothing to do with hadoop stuff necessarily, and is more a bug(?) in dask's bytes handling/to_csv functions. @TomAugspurger, @martindurant any ideas on what, if anything, needs to be done here?