trufanov-nok / minidjvu-mod

This project forked from barak/minidjvu

A multipage DjVu encoder. This is a fork of minidjvu with full-scale shared dictionary (djbz) optimization and a few tricks (multi-threading etc.) to compensate for the resulting performance drop.

License: GNU General Public License v3.0

Makefile 0.48% M4 2.69% C 34.46% C++ 61.98% NSIS 0.23% Python 0.17%
djvu ebooks image-processing scanning

minidjvu-mod's People

Contributors

barak, maple7-7-7, onovy, svpv, trufanov-nok

minidjvu-mod's Issues

Adding FGbz chunk to Table gives unexpected results - Resolved

A patch for DjVuLibre (Linux) for the problem of getting unexpected results when adding FGbz chunks to minidjvu-mod Tables has been created and successfully implemented for Tables of Version 0.9m06.
The issues affect minidjvu-mod Tables if the Tables are subsequently colorized. The two issues were a larger than expected FGbz chunk size and an inaccurate description of the number of colors in a Table page's information window.
Even though the particular example of this issue has been resolved, this issue is being kept open for now should there be any changes associated with more complex Tables. But at this point, things look good.

Use DjVuLibre's miniexp.cpp for settings parsing

As suggested by Leon Bottou:
https://sourceforge.net/p/djvu/discussion/103286/thread/1b0de7aa93/?page=1&limit=25#2ae7

... I just found your 'settings reader' using s-expressions. If I had realized you wanted to do this, I could have saved you a lot of work by pointing out that the files "miniexp.h" and "miniexp.cpp" are actually standalone and can be used without the rest of libdjvu as in the "minilisp" example of https://sourceforge.net/p/djvu/djvulibre-git/ci/master/tree/doc/minilisp/.

The doc is in the h file (https://sourceforge.net/p/djvu/djvulibre-git/ci/master/tree/libdjvu/miniexp.h). To read the settings file, you just have to create an io structure with miniexp_io_init() and miniexp_io_set_input() and then call miniexp_read_r(io). This returns an s-expression data structure that contains the whole settings and that you can navigate with the miniexp.h functions.

You can even define a macrochar to implement the # comments you have in the settings. See https://sourceforge.net/p/djvu/djvulibre-git/ci/master/tree/doc/minilisp/minilisp.cpp#l1111 which defines semicolon comments (instead of # comments).

...

The annotation processing code in djvused is older. I never changed it because I want to preserve the original chunk indentation. So instead of really decoding it, there is a state machine that filters things out and fixes compatibility issues. The real nasty parsing code is in DjVuAnno.cpp. Also it is tightly connected to the Lizardtech XML annotation tools that many people said they wanted. Otherwise I would have removed that long ago. I could not but have no time....
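
To make the idea in the quoted advice concrete (read the whole settings file into one nested s-expression structure, skip # comments, then walk the result), here is a minimal sketch in plain Python. It is not miniexp and does not use DjVuLibre; the function names tokenize, parse and read_settings are hypothetical, the sample settings text is made up for illustration, and quoted strings are deliberately not handled, so file names containing spaces, parentheses or # would break it.

```python
def tokenize(text):
    """Split s-expression text into "(", ")" and atom tokens."""
    tokens = []
    for line in text.splitlines():
        # Assumption: "#" starts a comment that runs to the end of the line.
        line = line.split("#", 1)[0]
        tokens.extend(line.replace("(", " ( ").replace(")", " ) ").split())
    return tokens


def parse(tokens):
    """Turn a flat token list into nested lists, one list per (...) form."""
    stack = [[]]  # stack[0] collects the top-level forms
    for tok in tokens:
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            finished = stack.pop()
            stack[-1].append(finished)
        else:
            stack[-1].append(tok)
    if len(stack) != 1:
        raise ValueError("missing ')'")
    return stack[0]


def read_settings(text):
    """Read settings text and return the list of top-level forms."""
    return parse(tokenize(text))


# Made-up sample input, only to show the shape of the result.
example = """
# two pages sharing one dictionary
(input-files a.pbm b.pbm)
(djbz id 0001 (files a.pbm b.pbm))
"""
forms = read_settings(example)
for form in forms:
    print(form[0], "->", form[1:])
```

Each top-level form becomes a Python list whose first element is the option name, which is roughly the kind of structure one would navigate with the miniexp.h accessors in the real library.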

Version 09.5 Tests - smooth crops right border

Discussed in #11

Originally posted by maple7-7-7 September 8, 2021
Hi, Alex,
Here are the results of testing v09.5 vs v09.3. For all pages of a 30-page set of 4250x5500 pbms converted to 500 dpi DjVus using 09.5, the right border was unexpectedly offset a bit to the left. The 09.3 version was fine. This is shown in the two jpgs, which are converted from the respective first page DjVus. A 30-page dictionary was used for both versions.
Also included is a zip of the 30 pbms and a script of the 30 pbm filenames.
Thanks.
test_30_pbms_script_for_500dpi.txt
test_30_PBMs_(01-30)_4250x5500_emerson_for_500dpi_09.3_and_09.5.zip
test_30_pbms_mdm09 3_gui_500dpi_emerson_30pgdict_pg1
test_30_pbms_mdm09 5_gui_500dpi_emerson_30pgdict_pg1

Comparison with DjVuSolo 3.1: I see a bold character where it shouldn't be.

Hi Alex,

I made a sample PBM with scantailor and compressed it with DjVuSolo 3.1 bitonal 300 dpi and minidjvu-mod --lossy for comparison.

Sample.zip

In the version of minidjvu-mod you can see that a normal character has been switched for a bold one:

image

That switch didn't take place in DjVuSolo 3.1 bitonal 300 dpi, whose output was still 3k smaller.

Some links on the subject of recognizing italic and bold, just from some googling:
https://www.researchgate.net/publication/235412971_Automatic_Text_Clustering_and_Classification_Based_on_Font_Geometrical_Characteristics

https://stackoverflow.com/questions/62947592/does-google-cloud-vision-api-detect-formatting-in-ocred-text-like-bold-italics

https://github.com/tesseract-ocr/tesseract/issues/1371

https://studylib.net/doc/18711914/detection-of-bold-italic-and-underline-fonts-for-hindi-ocr

https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.735.9226&rep=rep1&type=pdf

Can't get minidjvu_mod to work

Hi again, Alexander,

I love the possibility of a reduced DjVu file size using your new fork.

But I keep having problems installing.

I tried to install on two computers and am getting more error messages.

I tried your corrections and still no luck.

I love experimenting with DjVu, including annotations, but I have little Linux programming experience.

I have a number of pbm to DjVu projects that I would love to test for you and me, in terms of changes in file size using the fork for a given set of DjVu parameters.

It would be revealing, for instance, to compare minidjvu with your minidjvu fork and with any2djvu and DjVu Solo 3.1, for a set number of dictionary pages. The program DjVuToy is also useful for comparing possibilities.

I have several self-made DjVu documents derived from pbms and bmps, some hundreds of pages long. It would be interesting to use your new program with these.

Is there a way for you to create a version of the minidjvu fork that will install as easily as minidjvu itself? Then maybe I can give you lots of resulting data regarding the efficacy of your new program at reducing DjVu file size.

I have tinkered a lot with DjVu, including with annotations, but I am no programmer.

I could, if necessary, remove minidjvu and then test the fork after a new fork install.

Thanks again for your great work and for your interesting discoveries about encoding DjVu.

Stephen Jones

Stephen - Trying to get a split-pages 20-page dictionary

Hi Alex,
Thank you for your advice! I now have a DjVu thanks to the proper use of a settings file. I am working again with the set of 30 text pages I mentioned before. The first 10 are pbms, the next 10 are tiffs, and the last 10 are pbms again. I am trying to get minidjvu-mod to make a 20-page dictionary for the pbms and a 10-page dictionary for the tiffs and then bring them together as a 30-page DjVu at 500 dpi. The following setup as a command line worked as far as making three 10-page dictionaries in the DjVu. Could you please tell me what to add to the command line so that I can get the "20a - 10 - 20b" dictionary setup? I basically can't quite figure out where to place the dictionary settings.
Thanks again.

Command line where test1.txt is the name of the settings file
Basic:
minidjvu-mod -v -d 500 -S test1.txt test1.djvu
Detailed: [square brackets and their contents of course not included]
minidjvu-mod -v -d 500 -S [settings file next (as test1.txt) -->]
(input-files 001_500.pbm 002_500.pbm 003_500.pbm 004_500.pbm 005_500.pbm 006_500.pbm 007_500.pbm 008_500.pbm 009_500.pbm 010_500.pbm 011_500_.tif 012_500_.tif 013_500_.tif 014_500_.tif 015_500_.tif 016_500_.tif 017_500_.tif 018_500_.tif 019_500_.tif 020_500_.tif 021_500__.pbm 022_500__.pbm 023_500__.pbm 024_500__.pbm 025_500__.pbm 026_500__.pbm 027_500__.pbm 028_500__.pbm 029_500__.pbm 030_500__.pbm)
(djbz id 0001 (files 001_500.pbm 002_500.pbm 003_500.pbm 004_500.pbm 005_500.pbm 006_500.pbm 007_500.pbm 008_500.pbm 009_500.pbm 010_500.pbm 021_500__.pbm 022_500__.pbm 023_500__.pbm 024_500__.pbm 025_500__.pbm 026_500__.pbm 027_500__.pbm 028_500__.pbm 029_500__.pbm 030_500__.pbm ))
(djbz id 0002 (files 011_500_.tif 012_500_.tif 013_500_.tif 014_500_.tif 015_500_.tif 016_500_.tif 017_500_.tif 018_500_.tif 019_500_.tif 020_500_.tif))
[end of settings file test1.txt] test1.djvu

I get the verbose readout, the dpi = 500, and the settings file processed with the default 10 pages per dictionary in the resulting test1.djvu.
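
A settings file with thirty file names is easy to mistype, so here is a minimal Python sketch that generates the text for test1.txt from the three groups of pages. It only reproduces the layout used in the post above, (input-files ...) followed by one (djbz id ... (files ...)) form per dictionary; it does not answer the question of whether minidjvu-mod then honours the "20a - 10 - 20b" grouping. The function name build_settings and its consistency check are hypothetical additions for illustration.

```python
def build_settings(input_files, dictionaries):
    """Return settings text in the layout used in the post above.

    input_files  -- page files in document order
    dictionaries -- mapping of djbz id -> list of files sharing that dictionary
    """
    # Hypothetical sanity check: catch file names that are listed in a
    # dictionary but missing from the input list (usually a typo).
    known = set(input_files)
    for dict_id, files in dictionaries.items():
        unknown = [f for f in files if f not in known]
        if unknown:
            raise ValueError("djbz %s lists unknown files: %s" % (dict_id, unknown))

    lines = ["(input-files " + " ".join(input_files) + ")"]
    for dict_id, files in dictionaries.items():
        lines.append("(djbz id %s (files %s))" % (dict_id, " ".join(files)))
    return "\n".join(lines) + "\n"


# File names follow the naming pattern from the post.
pbm_a = ["%03d_500.pbm" % n for n in range(1, 11)]    # pages 1-10
tifs = ["%03d_500_.tif" % n for n in range(11, 21)]   # pages 11-20
pbm_b = ["%03d_500__.pbm" % n for n in range(21, 31)]  # pages 21-30

settings_text = build_settings(
    pbm_a + tifs + pbm_b,
    {"0001": pbm_a + pbm_b, "0002": tifs},
)
print(settings_text)
```

The printed text can be saved as test1.txt and passed to the Basic command line shown above.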

Thanks.

Trying to install minidjvu_mod-master

Hi Alexander, (migrated from sourceforge) (new to GitHub)

I again tried to install this fork but get the same error 61 problem.

It also says that pkg-config is already installed.

This attempted installation was on a second computer.

Any further suggestions are appreciated.

I am not a programmer, but like your project and its potential.

Thanks,
Stephen

GUI Adjustments?

Hi Alex,
Here are some thoughts regarding the GUI.
[When I enter a "0" for the number of dictionary pages, it defaults to "10."]
[maple7-7-7 edit: I am now seeing it as a "0" -- could be my mistake. Disregard. Sorry.]
The default compression options I feel are too strong as a starting set, and they can lead to degradation, especially the Lossy erosion feature. IMO, the best default compression starting options should be -match + Use Prototypes. In other words, it might be better to start the compression with those features that do not initially involve any potential degradation of the images. This would leave a better first impression. I hardly paid attention to the options at first, but seeing some degradation in small font characters in the resulting DjVu, I went back and removed the Lossy, the smooth, the clean, and the erosion options, and got very nice results with -match and Use Prototypes.

Thank you for the pop-up explanatory windows in the GUI. Very helpful.
It looks like you added the multi-select feature in the files/folder window. Thank you.
