Comments (5)
Related: It would be great if we could rectify and deskew images before OCRing them. It seems that tesseract doesn't do this by default.
E.g., in the following image, only the highlighted area was OCRed.
A quick Google search turned up an implementation in Python:
https://www.pyimagesearch.com/2017/02/20/text-skew-correction-opencv-python/
It seems that ImageMagick might be able to do the job as well:
https://stackoverflow.com/questions/12117644/deskew-and-filter-an-image-for-ocr
from computation_hist.
Ah! I was used to Acrobat OCR, which does this. Yes, I think this would be a super task to do.
@samimak37 and @meesuekim: I think it would be useful for @ifeife123's task of extracting documents from larger PDFs if the ocr function could extract page ranges, i.e. if you could implement the params start_page and end_page such that it would create an OCRed PDF / extract text only from the selected page range.
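A minimal sketch of how the proposed page-range parameters might be validated, assuming 1-based page numbers that are inclusive on both ends. The helper name `select_pages` and its defaults are hypothetical and not part of the existing ocr function.

```python
def select_pages(total_pages, start_page=1, end_page=None):
    """Return the 1-based page numbers to OCR, inclusive on both ends.

    Hypothetical helper: start_page and end_page mirror the parameters
    proposed above; end_page=None means "through the last page".
    """
    if end_page is None:
        end_page = total_pages
    if start_page < 1 or end_page > total_pages:
        raise ValueError(
            f"page range {start_page}-{end_page} is outside 1-{total_pages}"
        )
    if start_page > end_page:
        raise ValueError("start_page must not exceed end_page")
    return list(range(start_page, end_page + 1))
```

The OCR step would then loop over only the returned page numbers, e.g. `select_pages(10, start_page=3, end_page=5)` for pages 3 through 5.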
@samimak37
I think this implements the method that we discussed yesterday (find the angle that maximizes the number of lines that are white or mixed): https://avilpage.com/2016/11/detect-correct-skew-images-python.html
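A rough, dependency-free sketch of the projection-profile idea, under a few assumptions: `points` is a list of (x, y) coordinates of dark pixels from an already thresholded page, and an angle is scored by how sharply those pixels pile into rows (sum of squared row counts). That scoring is one common variant and not necessarily the one used in the linked post. The function names and the default search range are made up for illustration.

```python
import math


def profile_score(points, angle_deg):
    """Score a candidate skew angle.

    Each dark pixel is projected onto the row it would land in if the
    page were rotated back by angle_deg. Well-aligned text concentrates
    pixels into few rows, which makes the sum of squared counts large.
    """
    theta = math.radians(angle_deg)
    sin_t, cos_t = math.sin(theta), math.cos(theta)
    rows = {}
    for x, y in points:
        row = round(y * cos_t - x * sin_t)
        rows[row] = rows.get(row, 0) + 1
    return sum(count * count for count in rows.values())


def estimate_skew(points, max_angle=5.0, step=0.25):
    """Return the candidate angle (degrees) with the best profile score.

    The result is the slope angle of the text lines in image coordinates,
    i.e. lines roughly follow y = y0 + x * tan(angle).
    """
    n = int(round(max_angle / step))
    candidates = [i * step for i in range(-n, n + 1)]
    return max(candidates, key=lambda a: profile_score(points, a))
```

Rotating pixel coordinates instead of the image keeps the sketch free of dependencies; a real pipeline would more likely rotate the image itself with OpenCV or Pillow once the angle is known.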
I've been messing around today with tesseract on the command line, primarily with some of my tobacco documents.
TL;DR: the LSTM mode of tesseract 4 is impressive.
Base image: https://s3.amazonaws.com/comp-hist/docs/1_10_architecture/docs/1/pages/1/1_10_architecture_1_1.png
CLI documentation: https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage
Tesseract 3:
‘ ’ W“. 4'4”?“ ‘
Ir. c I. 1 P3133333 V
. 31333131", P3131331 P1331
_ 11333 21-235 ' ~ 3
1' 033: I1. 1313:3311: _ 1‘
g ' . ‘ 13 33331-33333 3113 on: 31333331333 3113 33:1333 m
"311131313 at 3 333113; 33 133131”, 33:33 81-, '31: 33:33:! to
3:331:13 133 III (mp 3113 3333 33333 13 3311313¢ 30.1313 33333 7‘"
,13 13 33 3333 by 133 3331133 8313333 333331-33 0:333 (180) 333
. 73111 33 3313¢ 133 704 33 3 333—33111 33313 1313 m 333
333133: 311133 13 03311-31 83331-3 31 ‘73 1333333333113 Av3333. 1
. III 333333133 1331 1 3: 2 3373 (3313:3111 1333 33313
'-p:313r I) 13 33116133 20 33 33313336 13 1333 to 3333-30313 133
33333 31 1331: 31311.1 3331.3 1133 you to 3313: 1313 :333331
3333¢ 133 3133: 319333 3133 33133 133 3:3 3331-33111 33331d3r133.
81333:“: {33:3 ,
I. I. Vanna -
1133131331 011-3313:
cc: 9131. C. I. 3133
\/Pro.f. P. 3. 302-33
Tesseract 4 with LSTM and language set to English:
tesseract test.png stdout -l eng --oem 1
Meri 4, esr
Mr, C. M. ¥. Peterson _
Director, Physical Plast
Room 24-205
& Dear wr, Peterson: ; wl es
; In accordance with our alonsitinn with varices EM
_efticials at a meeting on Thursday, March 21, MIT agreed to
© provide the 18M group with some space in Building 20, This space ict, =
is to be used by the Applied Science Research Group (ASR) we
~. will be using the 704 on a one-shift basis This group has
another office in Central Square at 678 Bsssanhusatis Avenue. ;
IBM requested that 1 or 2 bays (naturally they would
_ prefer 2) in Building 20 be assigned to them to accomodate the
needs of their staff, 1 would like you to enter this request
among the other space bids which you are Tana considering.
Sincerely yours ’
F. M, Verzuh
Assistant Director
ce: pfof . C., F, Floe
Prof. P, M. Morse
per
I'll experiment a bit more.
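For repeating the comparison on more documents, the command above could be wrapped in Python roughly as follows. This is a sketch that assumes a tesseract 4 binary is on PATH; the function names `tesseract_command` and `ocr_image` are hypothetical.

```python
import subprocess


def tesseract_command(image_path, lang="eng", oem=1):
    """Build the argument list for the CLI call shown above."""
    return ["tesseract", str(image_path), "stdout", "-l", lang, "--oem", str(oem)]


def ocr_image(image_path, lang="eng", oem=1):
    """Run tesseract on one image and return the recognized text.

    Assumes the tesseract binary is installed and on PATH; raises
    subprocess.CalledProcessError if tesseract exits with an error.
    """
    result = subprocess.run(
        tesseract_command(image_path, lang, oem),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Keeping the command construction separate from the subprocess call makes it easy to check the arguments without having tesseract installed.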