sciruby / daru-io

daru-io is a plugin gem to the existing daru gem, which aims to add support to Importing DataFrames from / Exporting DataFrames to multiple formats.

Home Page: http://www.rubydoc.info/github/athityakumar/daru-io/master/

License: MIT License

Ruby 100.00%

importer exporter data-analysis ruby-gem ruby parser daru

daru-io's Issues

Module to_csv : Export to .csv.gz format

Followed from SciRuby/daru#127
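
As a rough illustration of what a .csv.gz export could involve, here is a stdlib-only sketch. The helper name `csv_gz_bytes` is hypothetical and is not part of daru or daru-io; a real exporter would start from a Daru::DataFrame rather than plain arrays.

```ruby
require 'csv'
require 'zlib'

# Hypothetical helper (not a daru-io API): serialises the header and rows
# to CSV text, then gzip-compresses that text in memory.
def csv_gz_bytes(headers, rows)
  csv_text = CSV.generate do |csv|
    csv << headers
    rows.each { |row| csv << row }
  end
  Zlib.gzip(csv_text)
end
```

The returned bytes could then be written with something like `File.binwrite('out.csv.gz', csv_gz_bytes(%w[a b], [[1, 2]]))`.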

Add tests for linkages between daru & daru-io calls

These tests are required to ensure that daru calls are redirected to the appropriate daru-io calls. For example, I manually found one such linkage bug while trying out Importer calls from Daru::DataFrame. It seems I had missed making the changes in PR #52, and subsequently added them in fd08213 before the release of v0.1.0 (fortunately).

CLI usage : Convert one format to another

Being a general purpose conversion library, I think daru-io could use some CLI functionality like -

$ daru-io /path/to/import/file.csv /path/to/export/file.json
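
One possible shape for such a command is to pick a reader and a writer from the file extensions. The sketch below uses only the standard library; the `READERS` / `WRITERS` tables and the `convert` method are hypothetical names, and a real CLI would presumably dispatch to the daru-io Importers and Exporters instead.

```ruby
require 'csv'
require 'json'

# Hypothetical lookup tables keyed by file extension (assumption: records
# are passed around as an array of hashes, not as a Daru::DataFrame).
READERS = {
  '.csv'  => ->(path) { CSV.read(path, headers: true).map(&:to_h) },
  '.json' => ->(path) { JSON.parse(File.read(path)) }
}.freeze

WRITERS = {
  '.json' => ->(records, path) { File.write(path, JSON.pretty_generate(records)) },
  '.csv'  => lambda do |records, path|
    CSV.open(path, 'w') do |csv|
      csv << records.first.keys unless records.empty?
      records.each { |record| csv << record.values }
    end
  end
}.freeze

# Reads `from` with the reader for its extension and writes the records
# to `to` with the writer for its extension.
def convert(from, to)
  reader = READERS.fetch(File.extname(from)) { |ext| raise ArgumentError, "cannot read #{ext}" }
  writer = WRITERS.fetch(File.extname(to))   { |ext| raise ArgumentError, "cannot write #{ext}" }
  writer.call(reader.call(from), to)
end

# Only runs when invoked as a script with exactly two paths.
convert(*ARGV) if $PROGRAM_NAME == __FILE__ && ARGV.size == 2
```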

Testing with multiple dependency version

Redirecting from @zverok's suggestion on PR #28 regarding the narrow version dependency set by roo. In general, it'd be good to test the IO modules against multiple versions of the IO dependency gems (like roo, spreadsheet, etc.).

Block support for CSV Importer

Redirecting from SciRuby/daru#413

Better :convert_comma for CSV Exporter

As per PR #34, the :convert_comma option, when set to true, works as follows -

str =~ /^\d+./ ? str.tr('.',',') : str

This works mostly, but seems a bit fragile and could be more battle-tested.
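
To make the fragility concrete: the pattern above matches any string that starts with digits followed by any character, so free text such as a street address is also rewritten. The sketch below compares it with a stricter pattern. `STRICT`, `LOOSE` and this standalone `convert_comma` method are names made up for illustration, and the stricter regex is only one assumption about what "a decimal number" should mean here.

```ruby
# Pattern quoted in the issue: digits at the start of a line, then ANY character.
LOOSE = /^\d+./

# Assumed stricter alternative: the whole string must be an optionally
# negative decimal number with digits on both sides of the dot.
STRICT = /\A-?\d+\.\d+\z/

# Hypothetical standalone version of the conversion, parameterised by pattern.
def convert_comma(str, pattern = STRICT)
  str =~ pattern ? str.tr('.', ',') : str
end

convert_comma('3.14')                   # => "3,14"
convert_comma('12 St. John Rd.')        # => "12 St. John Rd."
convert_comma('12 St. John Rd.', LOOSE) # => "12 St, John Rd,"
```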

Writing transactions

An idea for far future, extracted from discussion:

...support for some sort of "transactions" for writing could be useful, like

transaction writing to that db: df1 to table1, df2 to table2, then commit;
transaction writing to that CSV file: df1, then df2, then flush;
transaction writing to that XLSX file: df1 to sheet1, df2 to sheet2, then save.

Fix merged PRs according to review

PR #16 - Porting existing modules of daru to daru-io
PR #21 - Adds JSON importer

Better distinction between method arguments in Importers

Suggested by @zverok in PR #52

Currently, owing to the restriction due to automatic monkey-patching of daru-io modules into daru, the Importers are designed like this.

#! Usage from daru
df = Daru::DataFrame.read_csv(path, col_sep: ' ', other_opts)

#! is linked to daru-io like
inst = Daru::IO::Importers::CSV.read(path)
df = inst.call(col_sep: ' ', other_opts)

But daru-io could use a better set of arguments for these methods, to ensure that a file is read only ONCE, and then called for a dataframe with other options.

df = Daru::DataFrame.read_csv(path, col_sep: ' ', other_opts)

#! should rather be linked to daru-io like
inst = Daru::IO::Importers::CSV.read(path, col_sep: ' ')
df = inst.call(other_opts)

In general, the path and all file-parsing arguments should be provided in the read method, while post-reading arguments can be provided in the call method.
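
A minimal sketch of that split, using only the standard library. `SketchCSVImporter` is a hypothetical class, not the actual Daru::IO::Importers::CSV, and the `order:` option is just an assumed example of a post-reading argument.

```ruby
require 'csv'

# Hypothetical importer illustrating the read / call split.
class SketchCSVImporter
  # Parsing happens exactly once here, with the path and parsing options.
  def self.read(path, **parse_opts)
    new(CSV.read(path, headers: true, **parse_opts))
  end

  def initialize(table)
    @table = table
  end

  # Post-reading options only: the file is not touched again.
  # `order` (assumed option) selects and orders the columns to return.
  def call(order: nil)
    columns = order || @table.headers
    columns.map { |name| [name, @table[name]] }.to_h
  end
end
```

With this shape, `inst = SketchCSVImporter.read(path, col_sep: ' ')` can be followed by several `inst.call(...)` invocations with different options, all served from the single parse.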

Handle empty rows in CSV Importer

Redirecting from SciRuby/daru#367. Similar changes are to be added to daru-io too.

Module to_json

Followed from this issue tracker.

Read .csv.gz files

Followed from SciRuby/daru#127
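
For reference, reading gzipped CSV needs nothing beyond the standard library. The `read_csv_gz` helper below is a hypothetical name and returns plain hashes; an actual importer would build a Daru::DataFrame from the parsed rows.

```ruby
require 'csv'
require 'zlib'
require 'stringio'

# Hypothetical helper (not a daru-io API): decompresses a gzip stream from
# any IO-like object and parses the result as CSV with a header row.
def read_csv_gz(io)
  text = Zlib::GzipReader.new(io).read
  CSV.parse(text, headers: true).map(&:to_h)
end
```

For a file on disk, the IO could come from `File.open(path, 'rb')`.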

Boolean converter for CSV Importer

Redirecting from SciRuby/daru#373
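
One way this might be approached is through the custom converters that Ruby's CSV library already accepts. The sketch below is an assumption about the desired behaviour (case-insensitive "true" / "false" only); `BOOLEAN_CONVERTER` and `parse_with_booleans` are hypothetical names.

```ruby
require 'csv'

# Hypothetical converter: maps "true" / "false" (any case, surrounding
# whitespace ignored) to Ruby booleans and leaves other fields untouched.
BOOLEAN_CONVERTER = lambda do |field|
  case field.strip.downcase
  when 'true'  then true
  when 'false' then false
  else field
  end
end

# Parses CSV text with a header row, applying the boolean converter.
def parse_with_booleans(text)
  CSV.parse(text, headers: true, converters: [BOOLEAN_CONVERTER]).map(&:to_h)
end
```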

Add support for importing CSV files with only headers

Support for CSV files with only headers. Redirecting from this issue : SciRuby/daru#349
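
The expected result is presumably a frame with the named columns and zero rows. A stdlib-only sketch of that behaviour follows; `csv_to_columns` is a hypothetical helper and returns a hash of column name to values rather than a DataFrame.

```ruby
require 'csv'

# Hypothetical helper: treats the first row as the header and collects
# each column's values. A header-only file yields empty columns.
def csv_to_columns(text)
  header, *body = CSV.parse(text)
  return {} if header.nil?

  header.each_with_index.map { |name, i| [name, body.map { |row| row[i] }] }.to_h
end

csv_to_columns("a,b,c\n") # => {"a"=>[], "b"=>[], "c"=>[]}
```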

Excel Exporter - gem dependency

Redirecting from PR #37.

Currently, the Excel Exporter depends on the 'Spreadsheet' gem, which supports only the .xls format. Support for the .xlsx format has to be provided by some other gem like rubyxl / axlsx.

RSRuby doesn't work with Rails

Though RSRuby gem (and RDS / RData IO modules) work properly on Travis CI within a gem, it faces this error when used with Rails (Passenger / Rack) -

Error: C stack usage  17589078384920 is too close to the limit
Error: C stack usage  17589078384968 is too close to the limit
Error: C stack usage  17589078384872 is too close to the limit
Fatal error: unable to initialize the JIT

A similar issue has been posted on Stack Overflow and in the issue tracker of the RSRuby gem repo.

However, this small hack seems to make it work.

It should be possible to specify an index column in a CSV

It should be possible to specify an index column present in a CSV when loading a DataFrame from CSV rather than having to give an Array or Daru::Index.
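
To illustrate the request, the sketch below pulls one named column out of the parsed CSV to act as the index. `csv_with_index` and its `index:` keyword are hypothetical; this is not an existing option of the daru CSV reader.

```ruby
require 'csv'

# Hypothetical helper: parses CSV text with a header row and separates
# the column named by `index` from the remaining data columns.
def csv_with_index(text, index:)
  table = CSV.parse(text, headers: true)
  raise ArgumentError, "no column named #{index}" unless table.headers.include?(index)

  data_columns = (table.headers - [index]).map { |name| [name, table[name]] }.to_h
  { index: table[index], columns: data_columns }
end
```

The two parts could then be handed to something like `Daru::DataFrame.new(result[:columns], index: result[:index])`.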

Add yard-junk to Travis CI builds

Link to yard-junk repository : https://github.com/zverok/yard-junk

Markdown files

README.md : A well-detailed README with usage examples for partial & full requires, is required for a better tomorrow. Specifically remember to add badges from Travis and Waffle. See this and this for other badges. Also, merge PR #20 whenever the README has become well-maintained.
CONTRIBUTING.md : Guidelines to contribute, for fellow open-source developers.
LICENSE.md: License has currently been set as MIT.
CODE_OF_CONDUCT.md

Badges -

Waffle :
Travis :
Inch CI :
YARD Doc :
Codeclimate :
MIT License :

Auto-generate Importer-Exporter markdown templates

As mentioned by @zverok in PR #64, the README seems to be quite crowded with examples and useful information of ALL Importer-Exporter modules. Rather, the corresponding links in the LOC in the README can be linked to corresponding module/{FORMAT}_IMPORTER.md. These module/{FORMAT}_IMPORTER.md can preferably be generated by a Rake task, via ERB templates.

Old text format importer

I am not sure what this format is properly called (investigate?), but it is pretty common for scientific and international standardization data. Example (the official Unicode tables and official timezone tables are also published in this format):

# Note: characters with PROSGEGRAMMENI are actually titlecase, not uppercase!

1F80; 1F80; 1F88; 1F08 0399; # GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
1F81; 1F81; 1F89; 1F09 0399; # GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI
1F82; 1F82; 1F8A; 1F0A 0399; # GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI
1F83; 1F83; 1F8B; 1F0B 0399; # GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI
1F84; 1F84; 1F8C; 1F0C 0399; # GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI
1F85; 1F85; 1F8D; 1F0D 0399; # GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI

I.e., it is a bit like CSV with a ; separator, but:

# is comment;
spaces after/before separator are ignored;
empty lines are ignored.
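
The three rules above can be sketched in a few lines of plain Ruby. `parse_semicolon_table` is a hypothetical name, and note one assumption: trailing comments are simply discarded here, although in the Unicode files they carry the character name and an importer might want to keep them as an extra column.

```ruby
# Hypothetical parser for the format described above:
#   * everything from '#' to the end of the line is a comment;
#   * whitespace around the ';' separator is ignored;
#   * empty lines (including comment-only lines) are skipped.
def parse_semicolon_table(text)
  text.each_line
      .map { |line| line.sub(/#.*/, '').strip }
      .reject(&:empty?)
      .map { |line| line.split(';').map(&:strip) }
end

sample = <<~DATA
  # Note: a comment-only line

  1F80; 1F80; 1F88; 1F08 0399; # GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI
DATA

parse_semicolon_table(sample) # => [["1F80", "1F80", "1F88", "1F08 0399"]]
```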

It will be a nice showcase to have those "standard" data parsed out-of-the-box.

Port existing IO modules from daru to daru-io

Post GSoC: Steal like an artist

There are some gems gaining popularity recently, whose task could be solved (probably with more grace!) with daru+daru-io.

Let's look at them and consider what useful ideas we can borrow: sometimes for new features, sometimes for showcases. List will probably grow!

SpreadsheetArchitect -- ActiveRecord addon to export models to Excel. Could be done by daru-io in its current state. So, it is a matter of probably writing a blog post demonstrating our approaches to those problems ;) A really good chance to showcase daru-io, because a lot of people are talking about the gem recently.
Xport -- also an AR-to-Excel exporter. Unlike the above gem, it also allows setting up cell styles, which we still can't do (but probably should?)
Saxlsx from the same author -- (claims to be) a really quick Xlsx parser. Probably can be integrated into the Xlsx importer? (Before integration, some measurements should be devised and checked, to understand if it is worth it → and generally speaking, speed tests for exporters and importers are probably an idea for another GitHub issue)
Trick for fast importing CSV into Postgres (IDK if it is really useful for us, just leaving it here)
Cloudxls -- cloud (?) XLS-creation service. Don't use it, just look at their examples and what they are advertising (about conversion of "messy CSS" to "pretty Excel")

Modules from_avro and to_avro

Allow symbol to CSV Converter

Changes from PR SciRuby/daru#445 have to be mirrored here.

Metaprogramming to automate things

After having gone through a bit of meta-programming with Ruby, I feel that we can use this concept for at least 2 purposes -

Auto-initializing class variables (as we're moving to keyword arguments)

#! Should this class be called Base?
module Daru
  module IO
    module Importers
      class Importer
        def initialize(**args)
          args.each do |k,v|
            instance_variable_set("@#{k}", v)
            define_singleton_method(k) { instance_variable_get("@#{k}") }
          end
        end
      end
    end
  end
end

#! lib/daru/io/importers/format.rb
module Daru
  module IO
    module Importers
      class Format < Importer
        def call
          # do importer specific stuff here
        end
      end
    end
  end
end

#! Use case (note that ALL arguments have to be keyword arguments)
df = Daru::IO::Importers::Format.new(path: '/path/to/format', **other_opts).call
# or, for connection-based formats:
df = Daru::IO::Importers::Format.new(connection: 'connection', **other_opts).call

Linking Daru::DataFrame#from_{format} to Daru::IO::Importers::{Format}

#! daru/io/importers/linkages.rb
module Daru
  class DataFrame
    class << self
      def register_importer(function, instance)
        define_singleton_method(function) { |*args| instance.new(*args).call }
      end

      def register_all_importers
        importers = Daru::IO::Importers
        klasses   = importers.constants.select {|c| importers.const_get(c).is_a? Class}
        klasses.each do |klass|
          method_name = "from_#{klass.downcase}".to_sym
          register_importer method_name, Object.const_get("Daru::IO::Importers::#{klass}")
        end     
      end
    end
  end
end

Daru::DataFrame.register_all_importers
# Use Daru::DataFrame.register_importer for partial requires, yay!
# Note that for libraries like rcsv, the call changes to Daru::DataFrame.from_rcsv(...)

I'm positive about both of these changes. I'd like to know if I've left out any other place(s) where metaprogramming can be used, or if there's any problem with this methodology.

Module from_html

Handle Mongo & Redis timeouts?

@prasunanand - Thanks for pointing this out during the code review conference; I had missed handling this error. But thinking about it, I'm not sure whether it should be handled.

For example, Mongo raises a TimeOut error when no results are obtained in 30 seconds. So, it anyway has to wait for 30 seconds (in case of error) to raise the TimeOut error. So, handling this wouldn't make the tests faster in case Mongo isn't installed (as it'll always take 30 seconds before reporting a TimeOut error). Similar issue with Redis too.
Also, TimeOut error is quite communicative to the user.

Installation of Redis & Mongo should definitely be added in the README. But, should the TimeOut error be handled or left to be raised?

Template files for Issues & Pull Requests

This is what I roughly have in mind, as an initial draft.

ISSUE TEMPLATE

Description

Thanks for opening this issue. Add a brief description of what this issue is, and how to recreate it. Do tag the relevant issue(s) and PR(s) below.

Relevant Issues : (optional)
Relevant PRs : (optional)
Type of issue :
- New IO module request
- Bug in existing IO module
- Clean-up :
  - Refactoring
  - Code quality
  - Test(s)
  - Documentation

PULL REQUEST TEMPLATE

Description

Thanks for contributing this Pull Request. Add a brief description of what this Pull Request does. Do tag the relevant issue(s) and PR(s) below.

Relevant Issues : (compulsory, read the Contribution Guidelines)
Relevant PRs : (optional)
Type of change(s) handled in this Pull Request :
- New IO module request
- Bug in existing IO module
- Clean-up :
  - Refactoring
  - Code quality
  - Test(s)
  - Documentation

Add to_html

Reflect changes from this Pull Request - SciRuby/daru#366

Support for Excelx Exporter

As mentioned in issue #43, gems like rubyxl / axlsx can be used for exporting to .xlsx format.

Add benchmarks for comparing all IO modules

Read from relative paths in HTML Importer

Optional dependencies workflow

Daru-io has lots of format-specific dependencies that are used in just one importer / exporter. Having them all as optional dependencies is one way to go about it.

#! lib/daru/io/importers/html.rb
begin
  gem gem_name, gem_version
  require gem_name
rescue LoadError
  raise "Please install #{gem_name} gem v#{gem_version} with `gem install #{gem_name}`."
end
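
The snippet above could be wrapped in a small helper so that each importer / exporter declares its optional gem in one line. This is only a sketch: `optional_gem` is a hypothetical method name, not something daru-io is known to define, and the error wording is made up.

```ruby
# Hypothetical helper: activates the given gem (optionally constrained to
# a version requirement) and requires it, turning a missing gem into a
# LoadError that tells the user what to install.
def optional_gem(name, version = nil, requires: name)
  gem name, *version
  require requires
rescue LoadError
  suffix = version ? " -v '#{version}'" : ''
  raise LoadError, "Please install the #{name} gem with `gem install #{name}#{suffix}`."
end
```

An importer file could then start with a call such as `optional_gem 'nokogiri', '~> 1.8'`, where the gem name and version are illustrative only. The `requires:` keyword covers gems whose require path differs from the gem name.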

Optional dependencies aren't supported by the RubyGems gemspec file, so they will NOT feature in the gemspec. So, what if a user wants to install ALL of the optional gems of daru-io in one go? In bundler's Gemfile, can all optional gems be included under a group (say, optional)? That way, the normal user installs with bundle install --without optional and someone who wants all optional gems runs just bundle install.

Please share your thoughts on whether there is a better way to go about optional dependencies. 😃

Ping @zverok @v0dro @lokeshh

Module from_json

Followed from the discussions in this issue tracker.

Add :skiprows to from_csv module

Followed from SciRuby/daru#220

Module from_redis

Modules from_rdata and to_rdata

Idea: Gist export (and probably import)

Just a wild idea, not sure how useful... But seems quite a bit.

Use case: I have some data and want to quickly show it to a colleague in another city. What's the sanest and easiest way to do it? Well, typically, we'll save the file, and then share the file somewhere... But there can be this:

dataframe.first(1000).to_gist(access_token: '123456', format: :csv, name: 'data1')
# => prints URL https://gist.github.com/zverok/44971da8a59b07521a0914b657ff770f

dataframe.first(1000).to_gist(access_token: '123456', format: :markdown, name: 'data2')
# => prints URL https://gist.github.com/zverok/535ed082eaae7c5bf2a42fcda9676b42

(Both URLs I've created just for this demo)

This way, you can send links to your data to friends without ever leaving your IRuby notebook, or IRB session, or folder with data processing scripts.

CSV one is simpler to implement (our CSV exporter + Gist API, which is reasonable and well-documented), but Markdown also seems cool.
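
For the CSV case, the Gist side mostly amounts to one authenticated POST with a JSON body. Below is a rough sketch using net/http; `build_gist_request` and `create_gist` are hypothetical names, the description string is made up, and the CSV text is assumed to come from the existing CSV exporter.

```ruby
require 'json'
require 'net/http'
require 'uri'

GIST_ENDPOINT = URI('https://api.github.com/gists')

# Hypothetical helper: builds (but does not send) the request that would
# create a gist containing a single file with the given content.
def build_gist_request(content, name:, access_token:, is_public: false)
  payload = {
    'description' => 'Data exported from a DataFrame (sketch)',
    'public'      => is_public,
    'files'       => { name => { 'content' => content } }
  }
  request = Net::HTTP::Post.new(GIST_ENDPOINT)
  request['Authorization'] = "token #{access_token}"
  request['Content-Type']  = 'application/json'
  request.body = JSON.generate(payload)
  request
end

# Hypothetical helper: sends the request and returns the gist URL taken
# from the `html_url` field of the API response.
def create_gist(request)
  response = Net::HTTP.start(GIST_ENDPOINT.host, GIST_ENDPOINT.port, use_ssl: true) do |http|
    http.request(request)
  end
  JSON.parse(response.body)['html_url']
end
```

Keeping request building separate from sending makes the first half testable without network access or a real token.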

Module from_smartercsv

smarter_csv could probably be a faster alternative to parsing CSV files into Daru::DataFrames. Redirecting from this issue : SciRuby/daru#337

Resolve rubocop error

With the new version rollouts of rubocop-rspec gem in November, RSpec/ContextWording has been added. This enforces context descriptions to begin with 'when', 'with', or 'without'.

I think this makes sense for us to update the wordings as per this rule, as "with data from X data source" does seem more readable (IMO) than "reads data from X data source".

Modules to_excel and from_xlsx

Followed from this issue tracker.

Release policy

@baarkerlounger did a great job in adding the release policy for daru. I think daru-io (and also daru-view) could inherit the same release policy.

Module from_mongo

Remove placeholder codes / modules

The placeholder modules / tests can be removed once the first module has been written with tests.

Enhancements in XLSX Importer

Parse index (row names)
Handle merged cells (both order and index)
Handle formulas in xlsx file

Calling read-write methods from daru to daru-io

Currently, daru-io supports only from_format and to_format methods, with Daru::DataFrame.from_format(...) redirecting to Daru::IO::Importers::Format.new(...).call. Similar case with exporters.

As per discussion on SciRuby/daru#280, from_format, read_format, to_format and write_format methods are to be supported by all IO modules. Please check whether this would be good enough. Of course, feel free to suggest better calling methods (we may have to re-factor).

Daru::DataFrame.from_format(...) -> Daru::IO::Importers::Format.new(...).from
Daru::DataFrame.read_format(...) -> Daru::IO::Importers::Format.new(...).read

Ping @zverok @v0dro @lokeshh

Faster CSV Importers

Followed from SciRuby/daru#31, SciRuby/daru#170 and SciRuby/daru#337.

Add fixture files for the IO modules

One shared_example for all importer specs

Just like how all exporter specs currently have a shared_context, it'd be better (DRY) to have ONE common shared_example for all importer specs rather than importer-specific (which aren't really 'specific' on the importer) specs that test different attributes.
