doc_ripper's People

Contributors

pzaich, weilandia

doc_ripper's Issues

Docx encode problems

I'm ripping a file, but the output has encoding problems. How can I pass the encoding to the ripper?

My output is something like:

Um cronograma que você terá em mãos para acompanhá-lo

My code is:
text = DocRipper::rip('file_name.docx')
puts text
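
The sample output looks like UTF-8 text whose bytes were decoded as ISO-8859-1 (for example, "ê" showing up as "Ãª"). Assuming that is what happened, a possible workaround is to reinterpret the bytes after the fact. This is only a sketch: repair_mojibake is a hypothetical helper name, not part of doc_ripper, and it does not fix the underlying cause.

```ruby
# Hypothetical helper (not part of doc_ripper): undo the common case where
# UTF-8 bytes were decoded as ISO-8859-1 and then re-encoded as UTF-8.
def repair_mojibake(text)
  # Map each character back to its single original byte, then relabel
  # the resulting byte sequence as UTF-8.
  repaired = text.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
  # If the bytes are not valid UTF-8, the assumption was wrong: keep the input.
  repaired.valid_encoding? ? repaired : text
rescue EncodingError
  # The text contains characters outside ISO-8859-1, so it was not
  # mis-decoded in the way assumed here.
  text
end

# The argument is the garbled form of two Portuguese words, written with escapes.
puts repair_mojibake("voc\u00C3\u00AA ter\u00C3\u00A1")
```

Text that is already correct is returned unchanged, so the helper can be applied to the result of DocRipper::rip without first checking whether it is garbled.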

Fails on Heroku

Version 0.0.7.1 fails when pushing a Rails 4.2.0 app to Heroku.

In my Gemfile:
gem 'doc_ripper', '~> 0.0.7.1'

Full Heroku logs:

git push staging MF-LoginBug-33:master -f
Counting objects: 103, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (102/102), done.
Writing objects: 100% (103/103), 18.49 KiB | 0 bytes/s, done.
Total 103 (delta 77), reused 0 (delta 0)
remote: Compressing source files... done.
remote: Building source:
remote: 
remote: -----> Ruby app detected
remote: -----> Compiling Ruby/Rails
remote: -----> Using Ruby version: ruby-2.2.3
remote: -----> Installing dependencies using bundler 1.11.2
remote:        Running: bundle install --without development:test --path vendor/bundle --binstubs vendor/bundle/bin -j4 --deployment
remote:        Warning: the running version of Bundler is older than the version that created the lockfile. We suggest you upgrade to the latest version of Bundler by running `gem install bundler`.
remote:        Fetching gem metadata from https://rubygems.org/...........
remote:        Fetching version metadata from https://rubygems.org/...
remote:        Fetching dependency metadata from https://rubygems.org/..
remote:        Using i18n 0.7.0
remote:        Using rake 11.3.0
remote:        Using json 1.8.3
remote:        Using minitest 5.9.1
remote:        Using thread_safe 0.3.5
remote:        Using builder 3.2.2
remote:        Using erubis 2.7.0
remote:        Using mini_portile2 2.1.0
remote:        Using pkg-config 1.1.7
remote:        Using rack 1.6.4
remote:        Using mime-types-data 3.2016.0521
remote:        Using sass 3.4.22
remote:        Using thor 0.19.1
remote:        Using coffee-script-source 1.10.0
remote:        Using execjs 2.7.0
remote:        Using formtastic_i18n 0.6.0
remote:        Using arel 6.0.3
remote:        Using concurrent-ruby 1.0.2
remote:        Using tilt 2.0.5
remote:        Using encryptor 3.0.0
remote:        Installing CFPropertyList 2.3.3
remote:        Using bcrypt 3.1.11
remote:        Using cancancan 1.15.0
remote:        Using net-ssh 3.2.0
remote:        Using currencies 0.4.2
remote:        Using daemons 1.2.4
remote:        Using orm_adapter 0.5.0
remote:        Using rotp 2.1.2
remote:        Using eventmachine 1.2.0.1
remote:        Using multi_json 1.12.1
remote:        Using pg 0.19.0
remote:        Using bundler 1.11.2
remote:        Using rails_serve_static_assets 0.0.5
remote:        Using rails_stdout_logging 0.0.5
remote:        Using rolify 5.1.0
remote:        Using tzinfo 1.2.2
remote:        Using nokogiri 1.6.8
remote:        Using mime-types 3.1
remote:        Using rack-test 0.6.3
remote:        Installing sqlite3 1.3.12 with native extensions
remote:        Using warden 1.2.6
remote:        Using bourbon 4.2.7
remote:        Using autoprefixer-rails 6.4.1.1
remote:        Using uglifier 3.0.2
remote:        Using coffee-script 2.4.1
remote:        Using sprockets 3.7.0
remote:        Using attr_encrypted 3.0.3
remote:        Using countries 0.10.0
remote:        Using net-scp 1.2.1
remote:        Using thin 1.7.0
remote:        Using rails_12factor 0.0.3
remote:        Using activesupport 4.2.7.1
remote:        Using loofah 2.0.3
remote:        Using mail 2.6.4
remote:        Using bootstrap-sass 3.3.7
remote:        Using country_select 2.1.0
remote:        Using sshkit 1.11.3
remote:        Using rails-deprecated_sanitizer 1.0.3
remote:        Using globalid 0.3.7
remote:        Using arbre 1.1.1
remote:        Using activemodel 4.2.7.1
remote:        Installing mimemagic 0.3.2
remote:        Using jbuilder 2.6.0
remote:        Using rails-html-sanitizer 1.0.3
remote:        Using capistrano 3.4.1
remote:        Using rails-dom-testing 1.0.7
remote:        Using activejob 4.2.7.1
remote:        Using activerecord 4.2.7.1
remote:        Using capistrano-bundler 1.2.0
remote:        Using actionview 4.2.7.1
remote:        Using polyamorous 1.3.1
remote:        Using capistrano-rails 1.1.8
remote:        Using actionpack 4.2.7.1
remote:        Using actionmailer 4.2.7.1
remote:        Using railties 4.2.7.1
remote:        Using formtastic 3.1.4
remote:        Using has_scope 0.6.0
remote:        Using kaminari 0.17.0
remote:        Using ransack 1.8.2
remote:        Using sprockets-rails 3.2.0
remote:        Using simple_form 3.3.1
remote:        Using coffee-rails 4.2.1
remote:        Using responders 2.3.0
remote:        Using jquery-rails 4.2.1
remote:        Using jquery-ui-rails 5.0.5
remote:        Using sass-rails 5.0.6
remote:        Using rails 4.2.7.1
remote:        Using inherited_resources 1.6.0
remote:        Using devise 4.2.0
remote:        Using activeadmin 1.0.0.pre4 from git://github.com/activeadmin/activeadmin.git (at master@f892683)
remote:        Gem::Ext::BuildError: ERROR: Failed to build gem native extension.
remote:        /tmp/build_afde38e88e94b9d8b8a3985e2508425a/vendor/ruby-2.2.3/bin/ruby -r ./siteconf20161017-198-m3n2yq.rb extconf.rb
remote:        checking for sqlite3.h... no
remote:        sqlite3.h is missing. Try 'brew install sqlite3',
remote:        'yum install sqlite-devel' or 'apt-get install libsqlite3-dev'
remote:        and check your shared library search path (the
remote:        location where your sqlite3 shared library is located).
remote:        *** extconf.rb failed ***
remote:        Could not create Makefile due to some reason, probably lack of necessary
remote:        libraries and/or headers.  Check the mkmf.log file for more details.  You may
remote:        need configuration options.
remote:        Provided configuration options:
remote:        --with-opt-dir
remote:        --without-opt-dir
remote:        --with-opt-include
remote:        --without-opt-include=${opt-dir}/include
remote:        --with-opt-lib
remote:        --without-opt-lib=${opt-dir}/lib
remote:        --with-make-prog
remote:        --without-make-prog
remote:        --srcdir=.
remote:        --curdir
remote:        --ruby=/tmp/build_afde38e88e94b9d8b8a3985e2508425a/vendor/ruby-2.2.3/bin/$(RUBY_BASE_NAME)
remote:        --with-sqlite3-config
remote:        --without-sqlite3-config
remote:        --with-pkg-config
remote:        --without-pkg-config
remote:        --with-sqlite3-dir
remote:        --without-sqlite3-dir
remote:        --with-sqlite3-include
remote:        --without-sqlite3-include=${sqlite3-dir}/include
remote:        --with-sqlite3-lib
remote:        --without-sqlite3-lib=${sqlite3-dir}/lib
remote:        extconf failed, exit code 1
remote:        Gem files will remain installed in /tmp/build_afde38e88e94b9d8b8a3985e2508425a/vendor/bundle/ruby/2.2.0/gems/sqlite3-1.3.12 for inspection.
remote:        Results logged to /tmp/build_afde38e88e94b9d8b8a3985e2508425a/vendor/bundle/ruby/2.2.0/extensions/x86_64-linux/2.2.0-static/sqlite3-1.3.12/gem_make.out
remote:        Using devise-two-factor 3.0.0
remote:        Installing climate_control 0.0.3
remote:        An error occurred while installing sqlite3 (1.3.12), and Bundler cannot
remote:        continue.
remote:        Make sure that `gem install sqlite3 -v '1.3.12'` succeeds before bundling.
remote:        Bundler Output: Warning: the running version of Bundler is older than the version that created the lockfile. We suggest you upgrade to the latest version of Bundler by running `gem install bundler`.
remote:        Fetching gem metadata from https://rubygems.org/...........
remote:        Fetching version metadata from https://rubygems.org/...
remote:        Fetching dependency metadata from https://rubygems.org/..
remote:        Using i18n 0.7.0
remote:        Using rake 11.3.0
remote:        Using json 1.8.3
remote:        Using minitest 5.9.1
remote:        Using thread_safe 0.3.5
remote:        Using builder 3.2.2
remote:        Using erubis 2.7.0
remote:        Using mini_portile2 2.1.0
remote:        Using pkg-config 1.1.7
remote:        Using rack 1.6.4
remote:        Using mime-types-data 3.2016.0521
remote:        Using sass 3.4.22
remote:        Using thor 0.19.1
remote:        Using coffee-script-source 1.10.0
remote:        Using execjs 2.7.0
remote:        Using formtastic_i18n 0.6.0
remote:        Using arel 6.0.3
remote:        Using concurrent-ruby 1.0.2
remote:        Using tilt 2.0.5
remote:        Using encryptor 3.0.0
remote:        Installing CFPropertyList 2.3.3
remote:        Using bcrypt 3.1.11
remote:        Using cancancan 1.15.0
remote:        Using net-ssh 3.2.0
remote:        Using currencies 0.4.2
remote:        Using daemons 1.2.4
remote:        Using orm_adapter 0.5.0
remote:        Using rotp 2.1.2
remote:        Using eventmachine 1.2.0.1
remote:        Using multi_json 1.12.1
remote:        Using pg 0.19.0
remote:        Using bundler 1.11.2
remote:        Using rails_serve_static_assets 0.0.5
remote:        Using rails_stdout_logging 0.0.5
remote:        Using rolify 5.1.0
remote:        Using tzinfo 1.2.2
remote:        Using nokogiri 1.6.8
remote:        Using mime-types 3.1
remote:        Using rack-test 0.6.3
remote:        Installing sqlite3 1.3.12 with native extensions
remote:        Using warden 1.2.6
remote:        Using bourbon 4.2.7
remote:        Using autoprefixer-rails 6.4.1.1
remote:        Using uglifier 3.0.2
remote:        Using coffee-script 2.4.1
remote:        Using sprockets 3.7.0
remote:        Using attr_encrypted 3.0.3
remote:        Using countries 0.10.0
remote:        Using net-scp 1.2.1
remote:        Using thin 1.7.0
remote:        Using rails_12factor 0.0.3
remote:        Using activesupport 4.2.7.1
remote:        Using loofah 2.0.3
remote:        Using mail 2.6.4
remote:        Using bootstrap-sass 3.3.7
remote:        Using country_select 2.1.0
remote:        Using sshkit 1.11.3
remote:        Using rails-deprecated_sanitizer 1.0.3
remote:        Using globalid 0.3.7
remote:        Using arbre 1.1.1
remote:        Using activemodel 4.2.7.1
remote:        Installing mimemagic 0.3.2
remote:        Using jbuilder 2.6.0
remote:        Using rails-html-sanitizer 1.0.3
remote:        Using capistrano 3.4.1
remote:        Using rails-dom-testing 1.0.7
remote:        Using activejob 4.2.7.1
remote:        Using activerecord 4.2.7.1
remote:        Using capistrano-bundler 1.2.0
remote:        Using actionview 4.2.7.1
remote:        Using polyamorous 1.3.1
remote:        Using capistrano-rails 1.1.8
remote:        Using actionpack 4.2.7.1
remote:        Using actionmailer 4.2.7.1
remote:        Using railties 4.2.7.1
remote:        Using formtastic 3.1.4
remote:        Using has_scope 0.6.0
remote:        Using kaminari 0.17.0
remote:        Using ransack 1.8.2
remote:        Using sprockets-rails 3.2.0
remote:        Using simple_form 3.3.1
remote:        Using coffee-rails 4.2.1
remote:        Using responders 2.3.0
remote:        Using jquery-rails 4.2.1
remote:        Using jquery-ui-rails 5.0.5
remote:        Using sass-rails 5.0.6
remote:        Using rails 4.2.7.1
remote:        Using inherited_resources 1.6.0
remote:        Using devise 4.2.0
remote:        Using activeadmin 1.0.0.pre4 from git://github.com/activeadmin/activeadmin.git (at master@f892683)
remote:        
remote:        Gem::Ext::BuildError: ERROR: Failed to build gem native extension.
remote:        
remote:        /tmp/build_afde38e88e94b9d8b8a3985e2508425a/vendor/ruby-2.2.3/bin/ruby -r ./siteconf20161017-198-m3n2yq.rb extconf.rb
remote:        checking for sqlite3.h... no
remote:        sqlite3.h is missing. Try 'brew install sqlite3',
remote:        'yum install sqlite-devel' or 'apt-get install libsqlite3-dev'
remote:        and check your shared library search path (the
remote:        location where your sqlite3 shared library is located).
remote:        *** extconf.rb failed ***
remote:        Could not create Makefile due to some reason, probably lack of necessary
remote:        libraries and/or headers.  Check the mkmf.log file for more details.  You may
remote:        need configuration options.
remote:        
remote:        Provided configuration options:
remote:        --with-opt-dir
remote:        --without-opt-dir
remote:        --with-opt-include
remote:        --without-opt-include=${opt-dir}/include
remote:        --with-opt-lib
remote:        --without-opt-lib=${opt-dir}/lib
remote:        --with-make-prog
remote:        --without-make-prog
remote:        --srcdir=.
remote:        --curdir
remote:        --ruby=/tmp/build_afde38e88e94b9d8b8a3985e2508425a/vendor/ruby-2.2.3/bin/$(RUBY_BASE_NAME)
remote:        --with-sqlite3-config
remote:        --without-sqlite3-config
remote:        --with-pkg-config
remote:        --without-pkg-config
remote:        --with-sqlite3-dir
remote:        --without-sqlite3-dir
remote:        --with-sqlite3-include
remote:        --without-sqlite3-include=${sqlite3-dir}/include
remote:        --with-sqlite3-lib
remote:        --without-sqlite3-lib=${sqlite3-dir}/lib
remote:        
remote:        extconf failed, exit code 1
remote:        
remote:        Gem files will remain installed in /tmp/build_afde38e88e94b9d8b8a3985e2508425a/vendor/bundle/ruby/2.2.0/gems/sqlite3-1.3.12 for inspection.
remote:        Results logged to /tmp/build_afde38e88e94b9d8b8a3985e2508425a/vendor/bundle/ruby/2.2.0/extensions/x86_64-linux/2.2.0-static/sqlite3-1.3.12/gem_make.out
remote:        Using devise-two-factor 3.0.0
remote:        Installing climate_control 0.0.3
remote:        An error occurred while installing sqlite3 (1.3.12), and Bundler cannot
remote:        continue.
remote:        Make sure that `gem install sqlite3 -v '1.3.12'` succeeds before bundling.
remote:  !
remote:  !     Failed to install gems via Bundler.
remote:  !     
remote:  !     Detected sqlite3 gem which is not supported on Heroku.
remote:  !     https://devcenter.heroku.com/articles/sqlite3
remote:  !
remote:  !     Push rejected, failed to compile Ruby app.
remote: 
remote:  !     Push failed
remote: Verifying deploy...
remote: 
remote: !   Push rejected to immense-bastion-38873.
remote: 
To https://git.heroku.com/immense-bastion-38873.git
 ! [remote rejected] MF-LoginBug-33 -> master (pre-receive hook declined)
error: failed to push some refs to 'https://git.heroku.com/immense-bastion-38873.git'

Does this support Chinese Word doc/docx files?

My docx file contains Chinese text such as:
四、我们确认,我们完全同意招标文件制定的投标规则,并承诺按照这些规则履行我们的所有义务,包括一旦投标文件被贵方接受,将履行社会资本合作方的义务

On my Mac, I used doc_ripper and got the result shown below:

➜  ~ irb
irb(main):001:0> require 'doc_ripper'
=> true
irb(main):002:0> DocRipper::rip('/Users/datSource/test/docx1.docx')
=> "ç\u009B® å½\u0095 TOC \\o \"1-4\" \\h \\z \\u ä¸\u0080ã\u0080\u0081æ\u008A\u0095èµ\u0084ç\u0094³è¯·ä¹¦ PAGEREF _Toc448258241 \\h 2äº\u008Cã\u0080\u0081æ\u008E\u0088æ\u009D\u0083å§\u0094æ\u0089\u0098书 PAGEREF _Toc448258242 \\h 5ä¸\u0089ã\u0080\u0081å¼\u0080æ \u0087ä¸\u0080è§\u0088表 PAGEREF _Toc448258243 \\h 6å\u009B\u009Bã\u0080\u0081è¯\u0084å\u0088\u0086ç´

How can I get the correct plain text?

Thanks!
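
The escapes in that output (\u009B, \u0095) are consistent with UTF-8 bytes being read as ISO-8859-1. The sketch below reproduces the first characters of the garbled string from the two characters 目 录 and then reverses the step. That this is what actually happens inside doc_ripper or the shell tools it calls is an assumption, and simulate_misdecoding is a hypothetical name used only for illustration.

```ruby
# Hypothetical helper for illustration: simulate UTF-8 bytes being mistaken
# for ISO-8859-1 and then transcoded to UTF-8.
def simulate_misdecoding(utf8_text)
  utf8_text.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

# "\u76EE \u5F55" is the two characters 目 录, separated by a space.
garbled = simulate_misdecoding("\u76EE \u5F55")
p garbled

# Reversing the step: turn each character back into one byte and relabel
# the bytes as UTF-8.
p garbled.encode(Encoding::ISO_8859_1).force_encoding(Encoding::UTF_8)
```

If the first line printed matches the beginning of the irb result above, the extracted bytes themselves are probably fine and only their encoding label is wrong.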

Docx extraction may be incomplete ("Binary file (standard input) matches")

The code to extract text from docx files includes the following snippet:

unzip -p #{to_shell(file_path)} | grep ...

However, streaming the docx-data to grep can create a problem when the docx-zip-archive contains binary data (e.g. JPEGs):

With the pipe, grep does not process the archive file by file; it sees the whole archive as one continuous stream and reads it in chunks. How exactly the stream is split into chunks is, AFAIK, uncontrollable: it depends on the pipe buffering logic and on how the OS switches between the concurrent processes on the left and right side of the pipe.

It thus can happen that a chunk consists of matching text and binary data. But then, the whole chunk is discarded with the message "Binary file (standard input) matches" (see description of option "--binary-files" in https://www.gnu.org/software/grep/manual/grep.html), and hence the extraction of text is incomplete.

This can be very difficult to detect: I had a case where the extraction worked almost always. Very rarely, the result showed some extracted text and then "Binary file (standard input) matches", truncating some data. When it failed, it turned out that word/document.xml was processed properly, but word/footer2.xml and a JPEG file were read as one chunk. It happened so rarely because the footer file and the JPEG file were almost 2MB apart in the archive. Only under exceptional circumstances did the I/O end up processing footer2.xml and the JPEG in one chunk.
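
Because the failure is intermittent, a caller may want to check the extracted text for grep's notice instead of trusting it silently. The following is a minimal sketch of such a guard; grep_binary_notice? and BINARY_NOTICE are hypothetical names, not part of doc_ripper. As far as I know, newer GNU grep releases reword the message and write it to standard error, so the sketch accepts both streams and both wordings.

```ruby
# Hypothetical guard (not part of doc_ripper). Matches both the older wording
# "Binary file (standard input) matches" and the reworded
# "grep: (standard input): binary file matches".
BINARY_NOTICE = /binary file\b.*\bmatches/i

def grep_binary_notice?(stdout, stderr = '')
  # scrub replaces invalid bytes so the regexp match cannot raise on
  # strings that are not valid in their labelled encoding.
  [stdout, stderr].any? { |stream| stream.to_s.scrub.match?(BINARY_NOTICE) }
end

text = "Hello world\nBinary file (standard input) matches\n"
warn 'docx extraction was probably truncated' if grep_binary_notice?(text)
```

A document that happens to contain this exact sentence would trigger a false positive, so the guard is a heuristic for logging or retrying rather than a fix.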

I can reproduce this fairly regularly (perhaps 20-50% of the time) with docx_with_image.docx on the Unix command line, using a one-liner shell script containing

unzip -p "$1" | grep '<w:t' | sed 's/<[^<]*>//g' | grep -v '^[[:space:]]*$'

Sometimes the output is (correctly) Hello world, this is a test, sometimes it is incorrectly Binary file (standard input) matches.

Replacing spec/fixtures/lorem.docx with the above docx-file and explicitly testing for the text will also reveal (sometimes) the failure:

index 2602b61..a7fb0da 100644
--- a/spec/doc_ripper/doc_ripper_spec.rb
+++ b/spec/doc_ripper/doc_ripper_spec.rb
@@ -48,7 +48,7 @@ module DocRipper

       it 'should respond with text to valid file extensions' do
         expect(DocRipper.rip(doc_path)).not_to eq(nil)
-        expect(DocRipper.rip(docx_path)).not_to eq(nil)
+        expect(DocRipper.rip(docx_path)).to eq("Hello world, this is a test\r\n")
         expect(DocRipper.rip(pdf_path)).not_to eq(nil)

Failures:

  1) provide a clean api to return the text from a document #rip should respond with text to valid file extensions
     Failure/Error: expect(DocRipper.rip(docx_path)).to eq(docx_text)

       expected: "Hello world, this is a test\r\n"
            got: "Binary file (standard input) matches\n"

       (compared using ==)

       Diff:
       @@ -1 +1 @@
       -Hello world, this is a test
       +Binary file (standard input) matches

     # ./spec/doc_ripper/doc_ripper_spec.rb:59:in `block (3 levels) in <module:DocRipper>'

I suppose a valid fix would be to unzip only the XML files (though I am not proficient enough in Office Open XML to know whether all relevant text is found only there):

unzip -p #{to_shell(file_path)} '*.xml' | grep ...
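
A variation on the same idea is to keep unzip but do the text filtering in Ruby instead of grep and sed, so no binary-file heuristic is involved at all. The sketch below is not doc_ripper's implementation: text_from_wordprocessing_xml and rip_docx_xml_only are hypothetical names, the regular expressions are a rough stand-in for a real XML parser (they do not handle nested paragraphs such as text boxes, or numeric character references), and the wrapper assumes the unzip binary is installed.

```ruby
require 'open3'

XML_ENTITIES = { '&amp;' => '&', '&lt;' => '<', '&gt;' => '>',
                 '&quot;' => '"', '&apos;' => "'" }.freeze

# Hypothetical helper: collect the text runs (<w:t> elements) from
# WordprocessingML markup, one output line per <w:p> paragraph.
def text_from_wordprocessing_xml(xml)
  # The opening-tag pattern accepts <w:p> and <w:p attr="..."> but skips
  # other elements such as <w:pPr> and self-closing <w:p/>.
  paragraphs = xml.scan(%r{<w:p(?:\s[^>]*[^/>])?>(.*?)</w:p>}m).flatten
  lines = paragraphs.map do |paragraph|
    runs = paragraph.scan(%r{<w:t(?:\s[^>]*[^/>])?>(.*?)</w:t>}m).flatten
    # Decode the five predefined XML entities in a single pass.
    runs.join.gsub(/&(?:amp|lt|gt|quot|apos);/, XML_ENTITIES)
  end
  lines.reject { |line| line.strip.empty? }.join("\n")
end

# Hypothetical wrapper: ask unzip for the XML members only, so image bytes
# never reach the text filter. Passing the arguments as a list avoids the shell.
def rip_docx_xml_only(path)
  xml, status = Open3.capture2('unzip', '-p', path, '*.xml', binmode: true)
  raise "unzip failed for #{path}" unless status.success?

  text_from_wordprocessing_xml(xml.force_encoding(Encoding::UTF_8).scrub)
end
```

Calling rip_docx_xml_only('docx_with_image.docx') would be the counterpart of the shell one-liner above. Whether every piece of relevant text lives in <w:t> elements remains the same open question.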

Invalid byte sequence on all DOCX files

I am using this library to parse data from some government docs. The source used to have a mixture of DOC and DOCX files, but overnight they must have done a complete refactor, and now everything is DOCX.

Every file now fails with "sed: RE error: illegal byte sequence". Is this a bug, or is it an issue with the DOCX files themselves?
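
For context: as far as I know, this message comes from the BSD sed shipped with macOS when its input contains bytes that are not valid in the current locale's encoding (typically UTF-8), which can happen when binary members of a docx archive are piped through it. A commonly used workaround is to run sed in the C locale so that it treats its input as raw bytes. The sketch below shows that idea from Ruby; strip_tags_with_sed is a hypothetical helper, and whether doc_ripper's own pipeline can be adjusted this way is an assumption.

```ruby
require 'open3'

# Hypothetical helper (not part of doc_ripper): strip tags with sed while
# forcing the C locale, so bytes that are invalid in UTF-8 do not abort sed.
def strip_tags_with_sed(input)
  env = { 'LC_ALL' => 'C' }
  output, status = Open3.capture2(env, 'sed', 's/<[^<]*>//g',
                                  stdin_data: input, binmode: true)
  raise 'sed failed' unless status.success?

  output
end

# The input deliberately contains the byte 0xE9, which is not valid UTF-8 here.
p strip_tags_with_sed("<w:t>caf\xE9</w:t>\n".b)
```

The result is a binary string; the caller still has to decide which encoding to label it with (for example via force_encoding) before printing it.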
