tex2txt's Issues

Check of nesting depths may fail

The section on implementation issues in README.md mentions checks for the nesting depths of braces and environments. These checks fail if braces or environment frames are inserted via expansion of a declared macro.

One could reject recursive declaration of macros, but this would reduce
flexibility.

LanguageTool 5.1 ignores random paragraphs

When I run the tex2txt Python script via shell.py with LanguageTool 5.1, some paragraphs are ignored, apparently at random, and no longer appear in the HTML output.
I use the following command:
python3 shell.py --html --language de-DE chapter-4.tex > report-4test.html

I have not found any pattern for this, so I cannot provide an example. With LanguageTool 4.7 it works, so I don't know what the problem is. I would like to use 5.1 because it includes updates and more grammar rules.

Thanks and regards
Matthias

Macro may skip too much white space

A problem similar to the one in script function split_sec() occurs in general. Assume that \xyz is a macro without arguments that yields an empty string and is expanded after \textcolor. Then

\textcolor{red}{This\xyz} is a problem.

will create the script output

Thisis a problem.

Looking for a preferably general solution.
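
The effect can be reproduced with two plain regular-expression passes. The following is only a minimal sketch of the mechanism, not the script's actual code; the two patterns are assumptions made for illustration.

```python
import re

text = r"\textcolor{red}{This\xyz} is a problem."

# Pass 1 (assumed pattern): replace \textcolor{...}{...} by its second argument.
step1 = re.sub(r"\\textcolor\{[^{}]*\}\{([^{}]*)\}", r"\1", text)

# Pass 2 (assumed pattern): expand \xyz to the empty string and, as LaTeX
# does after a macro name, skip the white space that follows it.
step2 = re.sub(r"\\xyz(?![a-zA-Z])\s*", "", step1)

print(step1)
print(step2)
```

After the first pass, the closing brace that separated \xyz from the following space is gone, so the second pass removes a space that LaTeX itself would have kept.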

Equations: warnings on $, &, \\ enclosed in {} braces

Inside $...$ inline math, an unescaped $ sign enclosed in {} braces leads to a warning, because it is unclear whether a macro is expanded and which text results from it.

For the same reason, in displayed equations, inclusion of & and \\ in {} braces will lead to warnings.

Function tex2txt() modifies globals

Since the function modifies members of the global objects 'defs' and 'parms', as well as function variables like 'text_add_frame', only one call may be active at any time.
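
The general hazard can be shown with a small stand-alone sketch. The names `Parms`, `convert` and `hook` are hypothetical and only mimic the pattern of a function that writes to a shared object; this is not code from tex2txt.

```python
class Parms:
    """Hypothetical stand-in for a module-level settings object."""
    lang = "de"

parms = Parms()

def convert(text, lang, hook=None):
    """Hypothetical converter that stores its settings in the global object."""
    parms.lang = lang          # mutate shared state
    if hook is not None:
        hook()                 # other code runs while this call is still active
    return f"{parms.lang}:{text}"

# The inner call overwrites the setting that the outer call relies on.
result = convert("Text", "de", hook=lambda: convert("other", "en"))
print(result)
```

The outer call was started with the German setting but finishes with the English one, which is why such a function must not be re-entered while it is active.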

Sentence splitting with maths replacements for English texts

With tex2txt.py --lang en ..., the LaTeX input

We know that
\[
    f(n) = 0 \text{ for all } n.
\]
this completes the proof.

currently results in this plain text version:

We know that
  U for all V. 
this completes the proof.

It seems that the dot at 'V.' is not recognised as sentence splitter by LanguageTool (LT), since it might be the acronym of a first name. Consequently, LT will not complain about the lower-case 'this' starting a new sentence.

According to some experiments, the following settings for maths replacements are more appropriate in English texts.

parms.inline_math = ('B-B-B', 'C-C-C', 'D-D-D', 'E-E-E', 'F-F-F', 'G-G-G')
parms.display_math = ('U-U-U', 'V-V-V', 'W-W-W', 'X-X-X', 'Y-Y-Y', 'Z-Z-Z')

Now, LT's sensitivity seems to be almost as good as for German texts with the current replacement collections ('D1D', 'I1I', ...). Word repetitions due to missing punctuation in equations and missing white space in connection with \text{...} parts are detected as before.

Still, there is at least one difference from the German version. In the following snippet, the missing dot is not detected in the English variant: LT does not complain about the capital 'This'.

We know that
\[
    f(n) = 0 \text{ for all } n
\]
This completes the proof.

But LT also won't generate a message for

This Is a pity.

Macro arguments without {} braces would be practical

Short-hand notation without {} braces is practical for macros just taking the next character as argument.
Although the implementation of text-mode accents supports both this and the style with braces, declaration of such macros is not possible via Simple() or Macro().

Need better math replacements for English texts

The replacement collection in variable parms.display_math works quite well if German is the main language. Requirements for replacements are summarized in the script in function set_language_de().
So far, we have not found replacements that work equally well with the English version of LanguageTool. For example, with the collection provided in function set_language_en(), sensitivity is poor in these cases:

  • missing final dot in an equation, if something like 'Therefore'
    is following;
  • lower-case text continuation after an equation with final dot.

HTML report: highlighting may be too long

The Python script shell.py sometimes marks too much.
An extreme example is (\texorpdfstring skips its second argument)

\texorpdfstring{Thisx}{is a problem} is OK.

Here, only 'Thisx' should be highlighted.
Instead, 'Thisx}{is a problem}' is marked.

Wrong line numbers on swapping of macro arguments

Assume declaration

Macro(name='swap', args='AA', repl=r'\2\1')

Then the input

\swap
{A}
{B}

correctly leads to output 'BA', but the corresponding line number from option --nums is '3+' instead of '2+'.

This only happens on swapping of arguments, and it apparently has been introduced with the simplification of mysub() in release 1.5.6.

Wrong handling of \verb macro and verbatim environment

The handling has to be restructured.

  • on input '\verb?%? \verb?X?', the second \verb is not treated correctly
  • nested \verb macros may be resolved in wrong order
  • these two points similarly hold for verbatim environment
  • input '\verb?\begin{verbatim}X\end{verbatim}?' is resolved in wrong order
  • tracking of character position is poor for \verb argument

Line number tracking in displayed equations

There is no line-number tracking inside of parsed displayed equations.
Instead, all corresponding lines in the output text carry the line number of the opening \begin{...}.
Thus, locating errors may be cumbersome for large equation environments.

Restrictions for verb macro and verbatim environment

Since \verb has to be handled at the very beginning, a macro declaration
like Simple('textbackslash', r'\\verb?\\?') does not work.

Similarly, something like

\comment{\begin{verbatim}}
...
\comment{\end{verbatim}}

will not work.

Script structure

The Python script is only structured by comments.
It wildly mixes definitions of variables and functions with statements that actually perform text replacement operations.
This should be improved, for instance by using classes.

Line number tracking in macro arguments

A problem similar to the one in issue #2 appears with multi-line arguments of macros, for instance in:

\textcolor{red}{
XX
YY
ZZ
}

The whole second argument is attributed to the line number of the opening \textcolor.

Remember also column numbers by character tracking?

Currently, the central substitution function mysub() operates on a text string and an array of line numbers.
Instead, one could use an array of character offsets: for each character in the current text string, the corresponding array entry would indicate the position of that character in the input file.
Instead of storing line numbers with option --nums, one would write a file with character offset numbers.

With an adapted output filter for the proofreader, one then could also display corrected column numbers.
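
The idea could be sketched as follows. This is a minimal illustration under stated assumptions, not the script's mysub(): the helpers `sub_with_offsets` and `line_col` are hypothetical, the replacement is a fixed string, and every replacement character simply inherits the offset of the start of the match.

```python
import re

def sub_with_offsets(pattern, repl, text, offsets):
    """Replace matches of pattern by the fixed string repl.

    offsets[i] is the position of text[i] in the input file; the returned
    offset list is kept in step with the returned text.
    """
    out_text, out_offs = [], []
    last = 0
    for m in re.finditer(pattern, text):
        # Copy the unchanged part together with its offsets.
        out_text.append(text[last:m.start()])
        out_offs.extend(offsets[last:m.start()])
        # Replacement characters inherit the offset of the match start.
        if m.start() < len(offsets):
            anchor = offsets[m.start()]
        else:
            anchor = offsets[-1] + 1 if offsets else 0
        out_text.append(repl)
        out_offs.extend([anchor] * len(repl))
        last = m.end()
    out_text.append(text[last:])
    out_offs.extend(offsets[last:])
    return "".join(out_text), out_offs

def line_col(source, pos):
    """Convert a character offset into 1-based (line, column) numbers."""
    line = source.count("\n", 0, pos) + 1
    col = pos - (source.rfind("\n", 0, pos) + 1) + 1
    return line, col

source = r"A \emph{B} C"
text, offs = sub_with_offsets(r"\\emph\{", "", source, list(range(len(source))))
text, offs = sub_with_offsets(r"\}", "", text, offs)
print(text, offs, line_col(source, offs[2]))
```

Each character of the resulting plain text can then be mapped back to a line and column of the input, which is the information an adapted output filter would need.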
