matze-dd / tex2txt

A LaTeX filter with Python

License: GNU General Public License v3.0

Shell 12.70% Python 86.62% TeX 0.02% HTML 0.66%

filter languagetool latex python-script regular-expression

tex2txt's Issues

Check of nesting depths may fail

The section on implementation issues in README.md mentions checks for nesting depths of braces and environments. These checks fail if braces or environment frames are inserted via expansion of a declared macro.

One could reject recursive declaration of macros, but this would reduce
flexibility.

LanguageTool 5.1 ignores random paragraphs

When I use the Python script of tex2txt in combination with shell.py and LanguageTool version 5.1, some random paragraphs are ignored and no longer appear in the HTML output.
I use the following command:
python3 shell.py --html --language de-DE chapter-4.tex > report-4test.html

I have not found any pattern for this, so I can't provide an example for this issue. With LanguageTool 4.7 it works, so I don't know what the problem is. I would like to use 5.1 because it includes some updates and more grammar rules.

Thanks and regards
Matthias

Macro may skip too much white space

A problem similar to the one in script function split_sec() occurs in general. Assume that \xyz is a macro without arguments that yields an empty string and is expanded after \textcolor. Then

\textcolor{red}{This\xyz} is a problem.

will create the script output

Thisis a problem.

Looking for a preferably general solution.
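
The effect can be reproduced with a minimal sketch. The two helper functions below are hypothetical stand-ins written for this illustration; they are not the code used in tex2txt. The sketch assumes that a macro name swallows the white space that follows it, as TeX does, and that \textcolor is resolved first.

```python
import re

def expand_textcolor(text):
    # Hypothetical helper: replace \textcolor{<color>}{<body>} by <body>.
    return re.sub(r'\\textcolor\{[^{}]*\}\{([^{}]*)\}', r'\1', text)

def expand_xyz(text):
    # Hypothetical helper: \xyz yields an empty string and, like a TeX
    # macro name, also consumes the white space that follows it.
    return re.sub(r'\\xyz(?![a-zA-Z])\s*', '', text)

src = r'\textcolor{red}{This\xyz} is a problem.'

# \textcolor first: the closing brace that protected the space is gone
# by the time \xyz is expanded.
naive = expand_xyz(expand_textcolor(src))

# \xyz first: the closing brace still separates the macro from the space.
reordered = expand_textcolor(expand_xyz(src))

print(naive)      # Thisis a problem.
print(reordered)  # This is a problem.
```

Changing the expansion order only helps in this particular case; it is not the general solution the issue asks for.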

Equations: warnings on $, &, \\ enclosed in {} braces

Inside $...$ inline math, an unescaped $ sign enclosed in {} braces leads to a warning. It is unclear whether a macro is expanded and which text results from it.

For the same reason, in displayed equations, inclusion of & and \\ in {} braces will lead to warnings.

Injection of codes for verbatim text via --repl option

Via the file given with option --repl, one can insert character sequences that are used internally for coding of verbatim text.
Possible fix: reject inclusion of raw % signs (the escaped version \% is resolved afterwards).

Function tex2txt() modifies globals

Since the function modifies members of the global objects 'defs' and 'parms', as well as function variables like 'text_add_frame', it may only be active once at each point in time.

Sentence splitting with maths replacements for English texts

With tex2txt.py --lang en ..., the LaTeX input

We know that
\[
    f(n) = 0 \text{ for all } n.
\]
this completes the proof.

currently results in this plain text version:

We know that
  U for all V. 
this completes the proof.

It seems that the dot at 'V.' is not recognised as sentence splitter by LanguageTool (LT), since it might be the acronym of a first name. Consequently, LT will not complain about the lower-case 'this' starting a new sentence.

According to some experiments, the following settings for maths replacements are more appropriate in English texts.

parms.inline_math = ('B-B-B', 'C-C-C', 'D-D-D', 'E-E-E', 'F-F-F', 'G-G-G')
parms.display_math = ('U-U-U', 'V-V-V', 'W-W-W', 'X-X-X', 'Y-Y-Y', 'Z-Z-Z')

Now, LT's sensitivity seems to be almost as good as for German texts with the current replacement collections ('D1D', 'I1I', ...). Word repetitions due to missing punctuation in equations and missing white space in connection with \text{...} parts are detected as before.

Still, there is at least one difference from the German version. In the following snippet, the missing dot is not detected in the English variant: LT does not complain about the capital 'This'.

We know that
\[
    f(n) = 0 \text{ for all } n
\]
This completes the proof.

But LT also won't generate a message for

This Is a pity.

Pairing of [] brackets ignores {} braces

Pairing of [] brackets is checked without recognition of enclosed
{} braces.
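
A brace-aware search for the closing bracket could look like the following sketch. The function name find_closing_bracket is an assumption made for this illustration and does not exist in tex2txt. The sketch follows the TeX convention for delimited arguments as an assumption: the first ] outside of any {} group ends the bracket, and escaped characters such as \] are skipped.

```python
def find_closing_bracket(text, start):
    """Return the index of the ] that closes the [ at text[start].

    Brackets enclosed in {} braces are ignored. Returns -1 if no
    closing bracket is found or if the braces are unbalanced.
    """
    if text[start] != '[':
        raise ValueError('expected [ at position start')
    depth = 0                     # current nesting depth of {} braces
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == '\\':
            i += 2                # skip an escaped character such as \] or \{
            continue
        if ch == '{':
            depth += 1
        elif ch == '}':
            depth -= 1
            if depth < 0:         # closing brace without opening brace
                return -1
        elif ch == ']' and depth == 0:
            return i
        i += 1
    return -1

print(find_closing_bracket(r'\item[{]}] text', 5))   # 9
```

In the usage line, the ] at index 7 sits inside braces and is skipped, so the bracket opened at index 5 is paired with the ] at index 9.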

Check of nesting depths still may fail

The checks from issue #5 can still fail if something like the following is used in a heavily nested manner:

\begin{\xxx{itemize}}

We do not plan to catch such a case.

Macro arguments without {} braces would be practical

Short-hand notation without {} braces is practical for macros just taking the next character as argument.
Although the implementation of text-mode accents supports both this and the style with braces, declaration of such macros is not possible via Simple() or Macro().

Need better math replacements for English texts

The replacement collection in variable parms.display_math works quite well if German is the main language. Requirements for replacements are summarized in the script in function set_language_de().
So far, we have not found replacements that work equally well with the English version of LanguageTool. For example, sensitivity is poor with the collection provided in function set_language_en() in these cases:

missing final dot in an equation, if something like 'Therefore' is following;
lower-case text continuation after an equation with final dot.

HTML report: highlighting may be too long

The Python script shell.py sometimes marks too much.
An extreme example is (\texorpdfstring skips its second argument)

\texorpdfstring{Thisx}{is a problem} is OK.

Here, only 'Thisx' should be highlighted.
Instead, 'Thisx}{is a problem}' is marked.

Displayed equations: & always leaves white space

The delimiter & in displayed equations will always create white space in the text output.

Wrong line numbers on swapping of macro arguments

Assume the declaration

Macro(name='swap', args='AA', repl=r'\2\1')

Then the input

\swap
{A}
{B}

correctly leads to output 'BA', but the corresponding line number from option --nums is '3+' instead of '2+'.

This only happens on swapping of arguments, and it apparently has been introduced with the simplification of mysub() in release 1.5.6.

Wrong handling of \verb macro and verbatim environment

The handling has to be restructured.

on input '\verb?%? \verb?X?', the second \verb is not treated correctly
nested \verb macros may be resolved in wrong order
these two points similarly hold for verbatim environment
input '\verb?\begin{verbatim}X\end{verbatim}?' is resolved in wrong order
tracking of character position is poor for \verb argument

Line number tracking in displayed equations

There is no line-number tracking inside of parsed displayed equations.
Instead, all corresponding lines in the output text carry the line number of the opening \begin{...}.
Thus, error location may be cumbersome for large equation environments.

No tracking of (La)TeX constructs that span file boundaries

Things like

\begin{align}
\input{equ.tex}     % contains only equation body
\end{align}

are not treated correctly.

Restrictions for verb macro and verbatim environment

Since \verb has to be handled at the very beginning, a macro declaration
like Simple('textbackslash', r'\\verb?\\?') does not work.

Similarly, something like

\comment{\begin{verbatim}}
...
\comment{\end{verbatim}}

will not work.

Final removal of lonely {} braces

Bugs:

The removal can lead to new blank lines.
This is not done with option --extr.

Script structure

The Python script is only structured by comments.
It wildly mixes definitions of variables and functions with statements that actually perform text replacement operations.
This should be improved, for instance by usage of classes.

Line number tracking in macro arguments

A problem similar to the one in issue #2 appears with multi-line arguments of macros, for instance in:

\textcolor{red}{
XX
YY
ZZ
}

All of the second argument is related to the line number of the opening \textcolor.

Remember also column numbers by character tracking?

Currently, the central substitution function mysub() operates on a text string and an array of line numbers.
Instead, one could use an array of character offsets: for each character in the current text string, the corresponding array entry would indicate the position of that character in the input file.
Instead of storing line numbers with option --nums, one would write a file with character offset numbers.

With an adapted output filter for the proofreader, one then could also display corrected column numbers.
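
The idea can be sketched as follows. The names sub_with_offsets and line_col are assumptions made for this illustration; this is not the existing mysub(). As a simplification, every character of a replacement is mapped to the input position where the match started, which is coarser than tracking each copied character separately.

```python
import re

def sub_with_offsets(pattern, repl, text, offsets):
    """Like re.sub(), but also maintain one input offset per character.

    offsets[i] is the position in the input file of text[i].
    Returns the new text and the matching new offset list.
    """
    out_text, out_offs = [], []
    last = 0
    for m in re.finditer(pattern, text):
        # copy the unchanged part in front of the match
        out_text.append(text[last:m.start()])
        out_offs.extend(offsets[last:m.start()])
        # all characters of the replacement point to the start of the match
        r = m.expand(repl)
        if m.start() < len(offsets):
            anchor = offsets[m.start()]
        else:
            anchor = offsets[-1] if offsets else 0
        out_text.append(r)
        out_offs.extend([anchor] * len(r))
        last = m.end()
    out_text.append(text[last:])
    out_offs.extend(offsets[last:])
    return ''.join(out_text), out_offs

def line_col(source, offset):
    """Convert a character offset in source to 1-based (line, column)."""
    line = source.count('\n', 0, offset) + 1
    col = offset - (source.rfind('\n', 0, offset) + 1) + 1
    return line, col

source = 'First line\nsome \\emph{text} here'
text, offs = sub_with_offsets(r'\\emph\{([^{}]*)\}', r'\1',
                              source, list(range(len(source))))
pos = text.index('here')
print(text.splitlines()[1])          # some text here
print(line_col(source, offs[pos]))   # (2, 18)
```

An output filter for the proofreader could use offs in this way to translate a position in the plain text back to line and column in the LaTeX source.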