
docx2tex's Introduction


docx2tex

Converts Microsoft Word's DOCX to LaTeX. Developed by le-tex and based on the transpect framework. The main author of docx2tex and the underlying xml2tex is @mkraetke.

get docx2tex

download the latest release

Download the latest docx2tex release

…or get the source via Git. Please note that you have to add the --recursive option in order to clone docx2tex together with its submodules.

git clone https://github.com/transpect/docx2tex --recursive

requirements

  • Java 7 (1.7) up to Java 15 (more recent versions not yet tested). Java 11 has a bug with file URIs and should be avoided; Java 13 is safe again.
  • works on Windows, Linux and Mac OS X
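
The version rules above can be checked before a run. The following is a minimal sketch, not part of docx2tex: the function names java_major and check_java are hypothetical, and the result strings (ok, avoid, untested, unsupported) are labels chosen here for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reduce a Java version string to its major version.
# Handles the legacy scheme (1.8.0_201 -> 8) and the newer one (11.0.2 -> 11).
java_major() {
  local v="$1"
  v="${v#1.}"            # drop the legacy "1." prefix if present
  v="${v%%[._-]*}"       # keep only the leading number
  printf '%s\n' "$v"
}

# Hypothetical helper: classify a version according to the requirements above.
check_java() {
  local major
  major="$(java_major "$1")"
  case "$major" in
    ''|*[!0-9]*) echo "unknown"; return ;;
  esac
  if [ "$major" -lt 7 ]; then
    echo "unsupported"   # older than Java 1.7
  elif [ "$major" -eq 11 ]; then
    echo "avoid"         # bug with file URIs
  elif [ "$major" -gt 15 ]; then
    echo "untested"      # newer than the tested range
  else
    echo "ok"
  fi
}

# Usage (only runs when a java binary is on the PATH):
if command -v java >/dev/null 2>&1; then
  ver="$(java -version 2>&1 | sed -n '1s/[^"]*"\([^"]*\)".*/\1/p')"
  if [ -n "$ver" ]; then
    echo "Java $ver: $(check_java "$ver")"
  fi
fi
```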

run docx2tex

You can run docx2tex with a Bash script (Linux, Mac OS X, Cygwin) or with the Windows batch script, whose options are more limited than those of the Bash script.

Linux/MacOSX

./d2t [options ...] myfile.docx
Option  Description
-o      path to custom output directory
-c      path to custom docx2tex configuration file
-m      choose MathType source (ole|wmf|ole+wmf)
-f      path to custom fontmaps directory
-p      generate PDF with pdflatex
-t      choose table model (tabularx|tabular|htmltabs)
-e      custom XSLT stylesheet for evolve-hub overrides
-x      custom XSLT stylesheet for postprocessing the evolve-hub results
-d      debug mode
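
To convert several files with the same options, d2t can be called in a loop. This is a minimal sketch under stated assumptions: the function name convert_all and the D2T variable are hypothetical, and only the -o option from the table above is used.

```shell
#!/usr/bin/env bash
# D2T points at the d2t script; override it to try the loop without docx2tex.
D2T="${D2T:-./d2t}"

# Convert every .docx file in directory $1, writing results to directory $2.
convert_all() {
  local src="$1" out="$2" f
  for f in "$src"/*.docx; do
    [ -e "$f" ] || continue                  # the glob matched nothing
    "$D2T" -o "$out" "$f" || echo "failed: $f" >&2
  done
}

# Example call (directory names are placeholders):
#   convert_all ./manuscripts ./tex-out
```

Every variable is quoted so that the loop itself passes file names with spaces through unchanged; how d2t handles such names internally is a separate matter.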

Windows

d2t.bat myfile.docx

via XML Calabash

Linux/Mac OSX

calabash/calabash.sh -o result=myfile.tex -o hub=myfile.xml xpl/docx2tex.xpl docx=myfile.docx conf=conf/conf.xml

Windows

calabash\calabash.bat -o result=myfile.tex -o hub=myfile.xml xpl/docx2tex.xpl docx=myfile.docx conf=conf/conf.xml
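
For the Linux/Mac OS X invocation above, the output names can be derived from the input file name. This is a minimal sketch: the function name run_docx2tex_xpl and the CALABASH variable are hypothetical, and the option names (result, hub, docx, conf) are the ones shown in the commands above. Whether the docx option accepts every kind of path is not verified here.

```shell
#!/usr/bin/env bash
# CALABASH points at the launcher script; override it for a dry run.
CALABASH="${CALABASH:-calabash/calabash.sh}"

# Run xpl/docx2tex.xpl for the file $1, optionally with the configuration $2.
run_docx2tex_xpl() {
  local docx="$1" base="${1%.docx}"
  "$CALABASH" -o "result=${base}.tex" -o "hub=${base}.xml" \
    xpl/docx2tex.xpl "docx=${docx}" "conf=${2:-conf/conf.xml}"
}

# Example call from the docx2tex main directory:
#   run_docx2tex_xpl myfile.docx
```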

configure

The docx2tex pipeline consists of 3 macroscopic steps:

  • docx2hub. This step is hardly configurable. It transforms a docx file to a Hub XML representation.
  • evolve-hub. This is a bag of XSLT modes that, among other things, transform paragraphs with list markers and hanging indentation to proper nested lists, create a nested section hierarchy, group images with their figure titles, etc. Only some of the modes are used by docx2tex, orchestrated by evolve-hub.xpl and configured in detail by evolve-hub-driver.xsl.
  • xml2tex. This step applies the CSV or xml2tex configuration to the Hub XML and generates the LaTeX output.

There are five major hooks for adding your own processing: a CSV configuration; an xml2tex configuration; XSLT that is applied between evolve-hub and xml2tex; XSLT that modifies what happens in evolve-hub; and fontmaps.


You can specify a custom configuration file for docx2tex. There are two different formats to write a configuration.

  • The CSV-based configuration format permits a simple way to map from MS Word styles to LaTeX commands.
  • The xml2tex configuration format is recommended for a deeper level of configuration but requires basic knowledge of XML and XPath.

CSV

For each MS Word style name, create a line with three semicolon separated values.

  • MS Word style name
  • LaTeX start statement
  • LaTeX end statement

Just follow this example:

Heading 1   ; \chapter{     ; }
Heading 2   ; \section{     ; }
Heading 3   ; \subsection{  ; }
Quote       ; \begin{quote} ; \end{quote}

You can edit CSV files either with a simple text editor or with a spreadsheet application.
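
A CSV configuration can also be written and checked from the shell. This is a minimal sketch: the temporary path and the function name check_csv are choices made here, and the check only verifies the three-field layout described above, not whether docx2tex accepts the file.

```shell
#!/usr/bin/env bash
# Write the example configuration to a temporary directory (arbitrary location).
confdir="$(mktemp -d "${TMPDIR:-/tmp}/d2t-conf.XXXXXX")"
conf="$confdir/conf.csv"
cat > "$conf" <<'EOF'
Heading 1   ; \chapter{     ; }
Heading 2   ; \section{     ; }
Heading 3   ; \subsection{  ; }
Quote       ; \begin{quote} ; \end{quote}
EOF

# Hypothetical helper: report lines that do not have exactly three
# semicolon-separated values; returns non-zero if any line is malformed.
check_csv() {
  awk -F';' 'NF != 3 { printf "line %d: expected 3 fields, got %d\n", NR, NF; bad = 1 }
             END { exit bad + 0 }' "$1"
}

if check_csv "$conf"; then
  echo "CSV looks well-formed: $conf"
fi
# Then, for example: ./d2t -c "$conf" myfile.docx
```

Note that a LaTeX statement containing a semicolon would be counted as an extra field by this check.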

xml2tex

docx2tex can also be configured by means of an xml2tex configuration file. docx2tex applies the configuration to the intermediate Hub XML file and generates the LaTeX output.

The configuration in conf/conf.xml is used by default and works with the styles defined in Microsoft Word's normal.dot. If you want to configure docx2tex for other styles, you can edit this file or pass a custom configuration file with the conf option.

Learn how to edit this file here.

XSLT between evolve-hub and xml2tex

You can provide an XSLT that works on the result of evolve-hub (if debugging is enabled, on the file [basename].debug/evolve-hub/70.docx2tex-postprocess.xml). The location of this XSLT file (absolute URI or path relative to the main directory that d2t and d2t.bat reside in) may be provided to d2t via the -x option. d2t.bat does not have all the flags; if you are confined to Windows and don’t have Cygwin, WSL, or MinGW, you may invoke calabash/calabash.bat yourself, see above. The additional XSLT’s URI may be provided by the custom-xsl option. This processing is applied before the xml2tex configuration, so your XSLT should transform Hub (DocBook namespace) to Hub.

During evolve-hub

In case you need to influence what evolve-hub does, you can provide a custom stylesheet for this. Contrary to custom-xsl which is passed as an option, this is passed to the pipeline on the input port custom-evolve-hub-driver, or using the -e option of d2t. There is an example for such an XSLT that retains empty paragraphs that will otherwise be removed by default, in one of the XSLT passes that comprise evolve-hub. This example was created in response to a user request. If you want to create \chapter, \section, etc. headings from arbitrary docx paragraphs, you should add a template that sets the paragraph’s @role attribute to Heading1, Heading2, etc. (For paragraphs that are not removed during evolve-hub, this can also be done in the -x stylesheet.) It is strongly advised to xsl:import the default evolve-hub customization (see example).
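
The heading advice above can be illustrated with a small stylesheet for the -x option. This is a minimal sketch under several assumptions: the style name MyChapterTitle and the output path are hypothetical, the stylesheet is a plain identity transform in the default mode, and whether docx2tex expects particular XSLT modes or imports is not verified here. The DocBook namespace URI is the one Hub uses.

```shell
#!/usr/bin/env bash
# Write a Hub-to-Hub stylesheet to a temporary directory (arbitrary location).
xsldir="$(mktemp -d "${TMPDIR:-/tmp}/d2t-xsl.XXXXXX")"
xsl="$xsldir/postprocess.xsl"
cat > "$xsl" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:dbk="http://docbook.org/ns/docbook">

  <!-- Identity template: copy everything unchanged (Hub in, Hub out). -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Hypothetical style name: give such paragraphs the role Heading1. -->
  <xsl:template match="dbk:para[@role = 'MyChapterTitle']">
    <xsl:copy>
      <xsl:apply-templates select="@* except @role"/>
      <xsl:attribute name="role" select="'Heading1'"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
EOF
echo "stylesheet written to $xsl"
# Then, for example: ./d2t -x "$xsl" myfile.docx
```

A stylesheet passed with -e would additionally xsl:import the default evolve-hub customization, as advised above; that import is left out here because its location is not given in this text.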

fontmaps

The docx conversion supports individual fontmaps for mapping non-Unicode characters to Unicode. Please note that this is only needed for fonts that are not Unicode-compatible. If you want to map characters from Unicode to LaTeX, please use the character map in the xml2tex configuration instead.

Please find further documentation on how to create a fontmap here.

After you have created your fontmap, store it in a directory and pass the path of the directory to docx2tex with the -f option.

If you invoke the docx2tex XProc pipeline (xpl/docx2tex.xpl), you can specify the fontmap directory with the option custom-font-maps-dir.

language tagging

You may have noticed some obscure \foreignlanguage{} or \selectlanguage{} code that doesn't match the actual language used in your TeX document. There are no fancy AI™-based natural language algorithms at work. Instead, docx2tex evaluates the original document language, which typically corresponds to your system settings, and the language setting of the paragraph or character style, which Word uses for auto-correction and hyphenation. docx2tex then filters redundant markup, e.g. by detecting the main language from the character count of each style and its language setting. However, when you copy and paste from the World Wide Web, Microsoft Word usually copies the language of the original website as well. This causes most of the weird language markup you may have noticed. We therefore recommend copying and pasting as plain text, and creating new paragraph and character styles when you intentionally want to change the language of a text fragment.
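
To see which language markup ended up in a generated file, the commands can be counted with standard tools. This is a minimal sketch; the function name count_language_markup is hypothetical, and it only inspects the .tex output, not the docx source.

```shell
#!/usr/bin/env bash
# Count each distinct \foreignlanguage{...} and \selectlanguage{...} command
# in the TeX file $1, most frequent first.
count_language_markup() {
  grep -oE '\\(foreignlanguage|selectlanguage)\{[^}]*\}' "$1" \
    | sort | uniq -c | sort -rn
}

# Example call:
#   count_language_markup myfile.tex
```

A language that appears only once or twice in a long document is a likely candidate for text that was pasted from a website.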

docx2tex's People

Contributors

fr4nze, gimsieke, lwittmar, md-5, mkraetke, polypunkt


docx2tex's Issues

lost image and caption (probably related to east-asia chars)

Please note that in the translation of the file given below, figure 4 and its caption become (line 89 in the tex file):

\等线{46 69 ...

There is something similar at line 330.

I tried to isolate the problematic part only, but I get a different output (below), so I'm providing the whole document, but please do not distribute it

\DengXian{46 69 ...

Problematic file and translation:
https://medialab.sissa.it/owncloud/index.php/s/4APWPgtO5slLkuA

[OT] I'm opening many issues, please feel free to stop me if I'm too pesky :-)

possible error in some of the last commits

[low priority]
If I pull the docx2tex code from the repo now, the d2t script fails and I get the following errors in the .d2t.log
This does not happen if I download the release 1.1.
It is not a problem for me, but I thought I should report it.

ERROR: http://transpect.github.io/../index.html:1:107:Not a pipeline or library: html
ERROR: err:XS0044:Unexpected step name: tr:load-cascaded
ERROR: It is a static error if any element in the XProc namespace or any step has element children other than those specified for it by this specification. In particular, the presence of atomic steps for which there is no visible declaration may raise this error.

ERROR: An empty sequence is not allowed as the third argument of replace()

I am trying to convert a .docx file that is an article with equations, figures, and even references introduced with Endnote.

Using the master (since with the last pre-release (0.3), I was getting the same reported error that was solved recently), and running:

$ docx2tex-master/d2t -o test test.docx

I am getting the following errors:
ERROR: docx2tex-master/xproc-util/load/xpl/load.xpl:0:load-error:Could not load file:/usr/people/jmdamas/docx2tex-master/conf/conf.csv (file:///usr/people/jmdamas/docx2tex-master/xproc-util/load/xpl/load.xpl) dtd-validate=false
ERROR: file:///usr/people/jmdamas/docx2tex-master/mml2tex/xsl/mml2tex.xsl:339:err:XPTY0004:An empty sequence is not allowed as the third argument of replace()
ERROR: An empty sequence is not allowed as the third argument of replace()
ERROR: cause: file:///usr/people/jmdamas/docx2tex-master/mml2tex/xsl/mml2tex.xsl:339:err:XPTY0004:An empty sequence is not allowed as the third argument of replace()
ERROR: An empty sequence is not allowed as the third argument of replace()
ERROR: cause: file:///usr/people/jmdamas/docx2tex-master/mml2tex/xsl/mml2tex.xsl:339:err:XPTY0004:An empty sequence is not allowed as the third argument of replace()
ERROR: Pipeline failed: An empty sequence is not allowed as the third argument of replace()
ERROR: Underlying exception: An empty sequence is not allowed as the third argument of replace()

In the first ERROR, I don't understand why the file can't be loaded, since it is there.
The other errors are all related to a replace function, but I can't understand the origin.

I tried to run with a shorter version of the .docx (just the first 5 or 6 pages, with some equations), and I didn't get any errors. I tried to remove only the Endnote references (I thought they might be a problem) and tested it, and it gave me the errors again. I could go into trial-and-error mode, trying to identify which part of the document is causing the problem, but I don't think that's a solution.

Can you give me some tips on how to solve this?

Oh, I am running this on an Ubuntu 12.04 with JAVA 1.7.0_80.

Meanwhile, I am using an older (?) version of this software in codeplex (https://docx2tex.codeplex.com/releases/view/19618), that is working well on Windows.

xproc-util/load/xpl/load.xpl:0:load-error:Could not load...

hello,

When I run d2t, the following error occurs:

cp: '../modelo-resumo-semana-conhecimento-2019.docx' and '/usr/src/modelo-resumo-semana-conhecimento-2019.docx' are the same file
INFO : xpl/docx2tex.xpl:197:38:No custom-font-maps loaded.
ERROR: xproc-util/load/xpl/load.xpl:0:load-error:Could not load file:/usr/src/docx2tex/conf/conf.csv (file:///usr/src/docx2tex/xproc-util/load/xpl/load.xpl) dtd-validate=false
ERROR: xproc-util/load/xpl/load.xpl:0:load-error:Could not load file:/usr/src/docx2tex/conf/conf.csv (file:///usr/src/docx2tex/xproc-util/load/xpl/load.xpl) dtd-validate=false
Message: Mode: insert-xpath
Message: Mode: docx2hub:preprocess-styles
Message: Mode: docx2hub:resolve-tblBorders
Message: Mode: docx2hub:add-props
Message: Mode: docx2hub:props2atts
Message: Mode: docx2hub:remove-redundant-run-atts
Message: Mode: docx2hub:join-instrText-runs
Message: Mode: docx2hub:field-functions
Message: Mode: wml-to-dbk
Message: Mode: docx2hub:join-runs
Message: Mode: hub:twipsify-lengths
Message: Mode: hub:split-at-tab
Message: Mode: hub:identifiers
Message: Mode: hub:tabs-to-indent
Message: Mode: hub:handle-indent
Message: Mode: hub:prepare-lists
Message: Mode: hub:lists
Message: Mode: hub:postprocess-lists
Message: Mode: docx2tex-preprocess
Message: Mode: docx2tex-postprocess
INFO : cascade/xpl/load-cascaded.xpl:43:59:load-cascaded: using file:/usr/src/docx2tex/xml2tex/xsl/xml2tex.xsl
INFO : cascade/xpl/load-cascaded.xpl:43:59:load-cascaded: using file:/usr/src/docx2tex/xml2tex/xsl/calstable2tabular.xsl
WARN : file:///usr/src/docx2tex/xslt-util/functx/xsl/functx.xsl:35:66:Stylesheet module http://transpect.io/xslt-util/functx/xsl/functx.xsl is included or imported more than once. This is permitted, but may lead to errors or unexpected behavior
INFO : cascade/xpl/load-cascaded.xpl:43:59:load-cascaded: using file:///usr/src/docx2tex/mml-normalize/xsl/mml-normalize.xsl
Message: Mode: mml2tex-grouping
Message: Mode: mml2tex-preprocess
INFO : cascade/xpl/load-cascaded.xpl:43:59:load-cascaded: using file:/usr/src/docx2tex/mml2tex/xsl/invoke-mml2tex.xsl
WARN : err:SXXP0005:The source document is in namespace http://docbook.org/ns/docbook, but none of the template rules match elements in this namespace (Use --suppressXsltNamespaceCheck:on to avoid this warning)
Message: Mode: escape-bad-chars
Message: Stylesheet compilation failed: 2 errors reported
Message: [FATAL ERROR]: XSLT mode 'escape-bad-chars' failed due to conversion errors. 
ERROR: xproc-util/xslt-mode/xpl/xslt-mode.xpl:0:xslt-mode-escape-bad-chars:Stylesheet compilation failed: 2 errors reported
ERROR: xproc-util/xslt-mode/xpl/xslt-mode.xpl:0:xslt-mode-escape-bad-chars:Stylesheet compilation failed: 2 errors reported
ERROR: xproc-util/xslt-mode/xpl/xslt-mode.xpl:0:xslt-mode-escape-bad-chars:Stylesheet compilation failed: 2 errors reported
ERROR: Unknown error

Java version:

java -version
java version "1.8.0_201"
Java(TM) SE Runtime Environment (build 1.8.0_201-b09)
Java HotSpot(TM) 64-Bit Server VM (build 25.201-b09, mixed mode)

Message: docx2hub error on unzipping.

Message: docx2hub error on unzipping.
Zip file seems to be corrupted: /infektionen-bei-haematologischen-und-onkologischen-patienten-uebersicht.docx (No such file or directory)

ERROR: err:XD0001:Only whitespace text nodes can appear at the top level in a document
ERROR: err:XD0001:Only whitespace text nodes can appear at the top level in a document
ERROR: err:XD0001:Only whitespace text nodes can appear at the top level in a document
ERROR: It is a dynamic error if a non-XML resource is produced on a step output or arrives on a step input.

I can provide the sample file by email since Github does not support DOCX uploads.

The issue appears to be specific to MacOSX. Converting the same file on Linux works.

JAVA 1.8 and upper?

─diamon@diamon-ThinkPad-13 ~/projects/docx2tex ‹system› ‹master*›
╰─$ ./d2t test.docx
starting docx2tex
Errors encountered while running docx2tex. Please see /home/diamon/projects/docx2tex/test.d2t.log for details.
╭─diamon@diamon-ThinkPad-13 ~/projects/docx2tex ‹system› ‹master*›
╰─$ cat test.d2t.log 1 ↵
cp: 'test.docx' and '/home/diamon/projects/docx2tex/test.docx' are the same file
Exception in thread "main" java.lang.NoClassDefFoundError: javax/activation/DataSource
at java.base/java.lang.Class.getDeclaredMethods0(Native Method)
at java.base/java.lang.Class.privateGetDeclaredMethods(Class.java:3119)
at java.base/java.lang.Class.getMethodsRecursive(Class.java:3260)
at java.base/java.lang.Class.getMethod0(Class.java:3246)
at java.base/java.lang.Class.getMethod(Class.java:2065)
at com.xmlcalabash.core.XProcRuntime.initializeSteps(XProcRuntime.java:317)
at com.xmlcalabash.core.XProcRuntime.<init>(XProcRuntime.java:272)
at com.xmlcalabash.drivers.Main.run(Main.java:100)
at com.xmlcalabash.drivers.Main.main(Main.java:83)
Caused by: java.lang.ClassNotFoundException: javax.activation.DataSource
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:583)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:190)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:499)
... 9 more

Citavi→BibTeX

Citavi 6 seems to store its references as base64-encoded JSON in field codes. There has been a request to transform them into BibTeX.
We will do this if we receive at least € 960 of external funding for it. The user who requested the feature is currently considering sponsorship (that is, committing to the full amount).

[][] Overfull \hbox (1.19997pt too wide) in alignment at lines 141--141 [][] ! Undefined control sequence. <argument> \Micro _{0}ɛ_{0} l.154 \end{tabularx}

[][] 

Underfull \hbox (badness 10000) in alignment at lines 90--90
[][][] 
[1{/usr/local/texlive/2015/texmf-var/fonts/map/pdftex/updmap/pdftex.map}]
Overfull \hbox (1.19997pt too wide) in alignment at lines 102--102
[][] 

Overfull \hbox (1.19997pt too wide) in alignment at lines 114--114
[][] 

Overfull \hbox (1.19997pt too wide) in alignment at lines 128--128
[][] 

Overfull \hbox (1.19997pt too wide) in alignment at lines 141--141
[][] 
! Undefined control sequence.
<argument> \Micro 
                  _{0}ɛ_{0}
l.154 \end{tabularx}

? 

FATAL: Failed to parse Saxon configuration file.

I want to use docx2tex to test whether it can convert mathtype equation to latex.
I use the most recent docx2tex release.
I got error message as below:

FATAL: Failed to parse Saxon configuration file.
java.nio.file.InvalidPathException: Illegal char <*> at index 67: C:\docx2tex\calabash\extensions\transpect\javascript-extension\lib\*

This is my docx file.
equation.docx

Problems with Endnote references

I have a document where I used Endnote to manage the references. The file is the same of #3.

What happens is that the superscripted numbers in the main text corresponding to the references are all replaced by \href{}{}, which causes the resulting pdf to show nothing instead of the superscripted numbers.

LaTeX Error: There's no line here to end.

I have a docx file that leads to the following pdflatex error (I can provide the file by private email).

Underfull \hbox (badness 10000) in paragraph at lines 36--36
\OT1/cmr/bx/n/10.95 mun-the-ra-pie

Underfull \hbox (badness 10000) in paragraph at lines 36--36
\OT1/cmr/bx/n/10.95 nicht ge-eig-net

Underfull \hbox (badness 10000) in paragraph at lines 36--36
[]|\OT1/cmr/bx/n/10.95 nicht

Underfull \hbox (badness 10000) in paragraph at lines 36--36
\OT1/cmr/bx/n/10.95 quan-ti-fi-

Underfull \hbox (badness 10000) in paragraph at lines 36--36
[]|\OT1/cmr/m/n/10.95 Idelalisib/Rituximab f[]uhrt zu ei-ner

Underfull \hbox (badness 3354) in paragraph at lines 36--36
\OT1/cmr/m/n/10.95 Verl[]angerung der pro-gres-si-ons-frei-en und des

Underfull \hbox (badness 3536) in paragraph at lines 36--36
\OT1/cmr/m/n/10.95 Ge-samt[]uberlebenszeit so-wie zu ei-ner Stei-ge-

Underfull \hbox (badness 10000) in alignment at lines 36--36
[][][]

! LaTeX Error: There's no line here to end.

See the LaTeX manual or LaTeX Companion for explanation.
Type  H <return>  for immediate help.
...

l.40 \newline

Is this the project related to docx2latex.com?

Hi, I am looking for the source code that generated from google doc latex document.

I wonder if this is the script used by docx2latex.com, with this script I am not getting the same result and perhaps I am missing something.

Thanks in advance,

runtime error

I get the following error when I try to translate this file:

INFO : file:..docx2tex/xpl/docx2tex.xpl:188:38:No custom-font-maps loaded.
ERROR: file:...docx2tex/xproc-util/load/xpl/load.xpl:0:load-error:Could not load file:...docx2tex/conf/conf.csv (file:...docx2tex/xproc-util/load/xpl/load.xpl) dtd-validate=false
...

https://medialab.sissa.it/owncloud/index.php/s/BZFHlref5mB3uAS

(I think my installation is ok, because I can translate other documents)

Problem about time improvement

Hi, I have tons of docx files to transform, but this project is very time-consuming. I found that it generates some temporary files in the process; I guess this may be the problem. I am not good at shell scripting. Could you please offer a solution? Many thanks.

hidden text not "tagged"

Desiderata

Text that has font effect "hidden" is translated as normal text, even if it does not display in a pdf generated from the docx and it does not display in the document when the "formatting symbol" button (¶) is not active.

In the following example, the word "bbb" is "hidden".
Would it be possible to have it translated into something like \@gobble{bbb}?

https://medialab.sissa.it/owncloud/index.php/s/VFtUaKfo3chdV82

*.tmp folder outside the -o folder

This is a minor issue, but when I run docx2tex with the -o option, the *.tmp folder, which contains things like the media folder with the images to be inserted, is placed outside the -o folder, so the path for the images is wrong and they are not loaded when the .tex is compiled.

Cheers

d2t does not play well with filenames containing whitespace

./d2t "6 Interferometrische Sensoren/160228 6_Interferometrische Sensoren.docx" 
./d2t: line 87: [: too many arguments
./d2t: line 121: [: too many arguments
./d2t: line 143: $LOG: ambiguous redirect
./d2t: line 146: [: too many arguments
starting docx2tex
./d2t: line 167: $LOG: ambiguous redirect
Errors encountered while running docx2tex. Please see /Users/ajung/src/docx2tex/6 Interferometrische Sensoren/6
Interferometrische
160228
6_Interferometrische
Sensoren.docx
.docx.d2t.log for details.

question re conf.xml

How would I configure the conf.xml to produce an article instead of a book? Is there a basic conf.xml for articles?

Beginner's question on how to use docx2tex

Dear Sirs/Madams,
I need to convert a number of docx files to LaTeX so I have downloaded your tool on my xubuntu 19.04 laptop. Regrettably, when I try to run your code an error message is displayed:

$ ./d2t ~/Documents/Introduction.docx 
starting docx2tex
Errors encountered while running docx2tex. Please see /home/eidon/Documents/Introduction.d2t.log for details.
$ cat Introduction.d2t.log
./d2t: line 203: /home/eidon/packages/docx2tex-master/calabash/calabash.sh: No such file or directory

From this I understood that I needed to install calabash, which I did by running

$ java -jar xmlcalabash-1.1.27-99.jar

Despite this, the error is still there. Would you be so kind as to help me? Thank you very much!

Kind regards,
Eidon

lost spaces and formatting in hyperlink

Please note that, in a fragment similar to the following,
the spaces after "Instrum." and after "88" are lost
and the boldface of "88" is also lost (or applied to the whole hyperlink maybe)

...<w:t xml:space="preserve">Rev. Sci. Instrum. </w:t></w:r>
<w:r w:rsidR="00A51604" w:rsidRPr="002E01AB"><w:rPr><w:rStyle w:val="af1"/><w:b/><w:bCs/><w:color w:val="auto"/><w:lang w:val="en-US"/></w:rPr>
<w:t xml:space="preserve">88 </w:t>...

TeX becomes:

Rev. Sci. Instrum.88(2017) 033504

MWE here:
https://medialab.sissa.it/owncloud/index.php/s/qPdO4qMBWdU28RH

"...FATAL: Failed to parse Saxon configuration file..."

I tried to change word files into latex files.
but failed.
fail message is
"...FATAL: Failed to parse Saxon configuration file.
java.nio.file.InvalidPathException: Illegal char <*> at index 96: C:\Users\alpac\Documents\GitHub\docx2tex\calabash\extensions\transpect\javascript-extension\lib\*..."
help me, please...OTL

ERROR: An empty sequence is not allowed as the result of function tr:theme-font()

Hi guys,

I am having this issue. Does this sound familiar to you?

./d2t -o tmpp ~/workspace/ets/phd/thesis/versions/v1.11.docx
starting docx2tex
Errors encountered while running docx2tex. Please see /Users/david/opt/docx2tex/tmpp/v1.11.d2t.log for details.

Log file:

Message: Mode: insert-xpath
ERROR: file:/Users/david/opt/docx2tex/docx2hub/xsl/insert-xpath.xsl:223:err:XTTE0780:An empty sequence is not allowed as the result of function tr:theme-font()
ERROR: An empty sequence is not allowed as the result of function tr:theme-font()
ERROR:     cause: file:/Users/david/opt/docx2tex/docx2hub/xsl/insert-xpath.xsl:223:err:XTTE0780:An empty sequence is not allowed as the result of function tr:theme-font()
ERROR: An empty sequence is not allowed as the result of function tr:theme-font()
ERROR:     cause: file:/Users/david/opt/docx2tex/docx2hub/xsl/insert-xpath.xsl:223:err:XTTE0780:An empty sequence is not allowed as the result of function tr:theme-font()
ERROR: An empty sequence is not allowed as the result of function tr:theme-font()
ERROR:     cause: file:/Users/david/opt/docx2tex/docx2hub/xsl/insert-xpath.xsl:223:err:XTTE0780:An empty sequence is not allowed as the result of function tr:theme-font()
ERROR: An empty sequence is not allowed as the result of function tr:theme-font()
ERROR:     cause: file:/Users/david/opt/docx2tex/docx2hub/xsl/insert-xpath.xsl:223:err:XTTE0780:An empty sequence is not allowed as the result of function tr:theme-font()
ERROR: Pipeline failed: An empty sequence is not allowed as the result of function tr:theme-font()
ERROR: Underlying exception: An empty sequence is not allowed as the result of function tr:theme-font()

I am using the master branch.

In fact, I don't really care about the system font. I don't know if there is a way to ignore this error and continue?

! Undefined control sequence.

(/opt/local/share/texmf-texlive/tex/latex/latexconfig/epstopdf-sys.cfg))
(/opt/local/share/texmf-texlive/tex/latex/hyperref/nameref.sty
(/opt/local/share/texmf-texlive/tex/generic/oberdiek/gettitlestring.sty))
Chapter 1.
! Undefined control sequence.
l.37 ...den Sie daf"{u}r die Formatvorlage {\glqq
}"{U}berschrift 1{\grqq}.

I can send you the related DOCX file by private email.

Handling of embedded .emf files

We have DOCX files where the authors often embed Powerpoint files.
This case is not handled properly.

! LaTeX Error: Unknown graphics extension: .emf.

See the LaTeX manual or LaTeX Companion for explanation.
Type  H <return>  for immediate help.
 ...                                              

l.429 ...16t125157.docx.tmp/word/media/image1.emf}

? x

Ideally, .emf files would be converted to proper SVGs or PNGs.
If this is not possible, they should be removed and not carried forward into the LaTeX output.
Perhaps a removed image could be replaced with a placeholder or a warning message.

Issues with d2t.bat

The current version of the d2t.bat file doesn't work correctly. It has 2 issues:

  • The path to calabash is incorrect
  • Exiting the script also exits the shell

The attached patch fixes both these issues
windows_fixes.diff.txt

Equation-related issues (\underset, \sum\nolimits, \substack, cases environment)

The reference file is the same as #3.

  1. I use some sum-class symbols in-text and the limits are always on the right of the sigma symbol. But docx2tex is recognizing this and using \underset{<limits>}{\sum} instead of the more common \sum\nolimits. In fact, this is also happening in the equations that are not in-text, but I am guessing that's because they are inside the tabular environment.
  2. Limits of sums with more than one line (see equation 2 of the reference file) are being transformed into an array environment. I think the best solution would be to use \substack (maybe \mathclap could also help here).

Cheers

xslt-util/calstable/xpl and com.xmlcalabash conversion errors

Bug Report:
My OS: Linux Gentoo Base System release 2.24.1.12 64 bit PC desktop
Java: 1.8.0_66
Shell: bash 4.3.42 (x86_64-pc-linux-gnu)
Install: cd /home/el/bin; git clone https://github.com/transpect/docx2tex --recursive
The input docx has a few unicode shenanigans, but nothing too out of band: http://www.filedropper.com/examplefail
Run your code: cd /home/el/bin/docx2tex; ./d2t ExampleFail.docx
Failure .log File: http://www.filedropper.com/examplefaild2t

What I expected: some kind of output file ExampleFail.tex containing LaTeX code.

Quarantining the bug, proving the bug isn't on my side:

  1. Use libreoffice version 5.2.3.3 -writer to create a new empty .docx document containing the ASCII text asdf.

  2. Save the above file as Untitled.docx using format Microsoft Word 2007-2013 XML (.docx) format.

  3. Openoffice -writer produces this Untitled.docx: http://www.filedropper.com/untitled_22

  4. Run the code: cd /home/el/bin/docx2tex; ./d2t Untitled.docx

  5. docx2tex works as expected, the contents of Untitled.tex render by pdflatex to a similar looking pdf:

The problem is in the table layouts.

funny translations: \TimesNewRoman{56 41 43...

The translation of the attached file results in a lot of macros in the following form:
\TimesNewRoman{41 43...

As soon as I open the docx in word and save it, the problem disappears.

Might be related to issues/25

(please do not distribute the attached file)
wj.docx

Installation problem

Hi, I'm trying to install docx2tex on my Mac running El Capitan. Towards the end I get this:

Submodule path 'mml2tex': checked out '03430be79a70b283679cfc1cb1529da5a044f41f'
Cloning into 'schema/hub'...
The authenticity of host 'github.com (192.30.252.130)' can't be established.
RSA key fingerprint is SHA256:nThbg6kXUpJWGl7E1IGOCspRomTxdCARLviKw6E5SY8.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'github.com,192.30.252.130' (RSA) to the list of known hosts.
Permission denied (publickey).
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.
Clone of 'git@github.com:le-tex/Hub.git' into submodule path 'schema/hub' failed

I'm not sure if this is a problem on my end or not, as I'm something of a newbie to Github.

! LaTeX Error: Lonely \item--perhaps a missing list environment.

I am receiving the following error for a given DOCX document (sorry, I can not provide the source
due to non-disclosure reasons).

Package hyperref Warning: Suppressing link with empty target on input line 59.

Package hyperref Warning: Suppressing link with empty target on input line 61.

! LaTeX Error: Lonely \item--perhaps a missing list environment.

See the LaTeX manual or LaTeX Companion for explanation.
Type  H <return>  for immediate help.
...

l.61 \href{}{1.1}\item \href
{}{Besondere Darstellungen im Handbuch }

58
59 \href{}{1. Aufbau des Handbuchs }
60
61 \href{}{1.1}\item \href{}{Besondere Darstellungen im Handbuch }
62
63 \href{}{1.2}\item \href{}{Zielgruppe }
64
65 \href{}{1.3}\item \href{}{Die Themenabschnitte des Handbuchs im "{U}berblick }

Improper LaTeX output

docx2tex running on

http://public.zopyx.com/lungenkarzinom-nicht-kleinzellig-nsclc.docx

generates improper LaTeX....possibly an improper DOCX structure however the converter should
perhaps not generate improper output but add some logging message to the log.


[Loading MPS to PDF converter (version 2006.09.02).]
) (/opt/local/share/texmf-texlive/tex/latex/oberdiek/epstopdf-base.sty
(/opt/local/share/texmf-texlive/tex/latex/oberdiek/grfext.sty)
(/opt/local/share/texmf-texlive/tex/latex/latexconfig/epstopdf-sys.cfg))
(/opt/local/share/texmf-texlive/tex/latex/hyperref/nameref.sty
(/opt/local/share/texmf-texlive/tex/generic/oberdiek/gettitlestring.sty))
[1{/opt/local/var/db/texmf/fonts/map/pdftex/updmap/pdftex.map}] [2]
Chapter 1.

! LaTeX Error: Lonely \item--perhaps a missing list environment.

See the LaTeX manual or LaTeX Companion for explanation.
Type H for immediate help.
...

l.29 2.\item \chapter
{Grundlagen}
? c
Type <return> to proceed, S to scroll future error messages,
R to run without stopping, Q to run quietly,
I to insert something, E to edit your file,
1 or ... or 9 to ignore the next 1 to 9 tokens of input,
H for help, X to quit.

how to obtain a tex file with utf8 chars?

I would like certain characters in the output TeX file to stay in UTF-8 (à) rather than being translated into LaTeX macros (\`a). Is this possible?

I looked in the <charmap> of conf.xml, but the characters I want are not there.
I also looked at fontmaps but, if I understand correctly, they do the opposite of what I want.

thanks
aaa.docx

Exception in thread "main" java.lang.NoClassDefFoundError: javax/activation/

Hello,

After updating the JVM, I can no longer use docx2tex to produce TeX files from DOCX files with the ./d2t command. I run the conversions in a terminal on macOS, and I currently have the following version of Java:
java version "1.8.0_211"
Java (TM) SE Runtime Environment (build 1.8.0_211-b12)
Java HotSpot (TM) 64-Bit Server VM (build 25.211-b12, mixed mode).
I would like to know whether the problem really lies with the Java version on my computer, whether anyone else has already encountered it and, if possible, what I should do to fix it.

Thank you very much in advance.

The generated log follows.

Exception in thread "main" java.lang.NoClassDefFoundError: javax/activation/DataSource
    at java.base/java.lang.Class.getDeclaredMethods0(Native Method)
    at java.base/java.lang.Class.privateGetDeclaredMethods(Class.java:3172)
    at java.base/java.lang.Class.getMethodsRecursive(Class.java:3313)
    at java.base/java.lang.Class.getMethod0(Class.java:3299)
    at java.base/java.lang.Class.getMethod(Class.java:2112)
    at com.xmlcalabash.core.XProcRuntime.initializeSteps(XProcRuntime.java:317)
    at com.xmlcalabash.core.XProcRuntime.<init>(XProcRuntime.java:272)
    at com.xmlcalabash.drivers.Main.run(Main.java:100)
    at com.xmlcalabash.drivers.Main.main(Main.java:83)
Caused by: java.lang.ClassNotFoundException: javax.activation.DataSource
    at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:583)
    at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:521)
    ... 9 more

conf.csv in d2t

Why is there a default value of conf.csv when the pipeline actually expects an XML file (that it then loads with tr:load)?

"! Undefined control sequence" error

Improper TeX output is being generated for the attached DOCX file.

Overfull \hbox (18.6093pt too wide) in paragraph at lines 65--66
\OT1/cmr/m/n/10.95 ovial-sarkome. Die h^^?aufigsten We-ichteil-sarkome des Erwa
ch-se-nen sind in Tabelle 1 aufgef^^?uhrt.
4 [5] [6]
Chapter 3.
! Undefined control sequence.
l.74 ... 3-gradige Klassifikationsschema der {\grq
}French Federation of Canc...
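
One possible cause, offered as an assumption because the generated preamble is not shown: TeX breaks the context line directly after the token it could not resolve, which here is \grq. That macro is not part of plain LaTeX; it becomes available when babel is loaded with a German option, for example. The sketch below assumes a German document, and the sentence in the body is only illustrative.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
% Assumption: the document is German. babel's ngerman option makes
% the quote macros \glq, \grq, \glqq and \grqq available.
\usepackage[ngerman]{babel}
\begin{document}
% \glq opens and \grq closes German single quotation marks.
Das Schema der \glq French Federation\grq{} dient hier nur als Beispiel.
\end{document}
```

Removing the babel line from this sketch should reproduce the "Undefined control sequence" error for \glq and \grq.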

Problems with symbols and accented characters

The reference file is the same as #3.

I am not sure how docx2tex should deal with these issues, but while symbols like the alpha or beta characters are converted to $\alpha$ or $\beta$, other symbols are not recognized. Examples include the minus and times signs, the apostrophe (for example, see «Kramers'» in the file), and tildes (which should be converted to \~{} or \textasciitilde{}).

Moreover, some accented characters, like those in my name, are not recognized. Comparing with the output from CodePlex's docx2tex, I traced this problem to the lack of \usepackage[utf8]{inputenc} in the preamble. Am I correct?

About the lone symbols, is there anything that can be done?
Also, can docx2tex recognize the differences between types of dashes (see here)?

Cheers
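
To make the question concrete, here is a minimal sketch of the kind of output being asked for. It is hand-written LaTeX, not something docx2tex is known to produce, and the sample words are placeholders apart from «Kramers'».

```latex
\documentclass{article}
% Tells pdfLaTeX to read the file as UTF-8. LaTeX releases from 2018
% onward assume UTF-8 by default, so the line is harmless there.
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
\begin{document}

% Accented characters typed directly as UTF-8
é, à, ü, ñ

% Minus and times signs belong in math mode
$a - b$, $a \times b$

% Apostrophe and tilde in running text
Kramers', \textasciitilde{}

% Hyphen, en dash and em dash
well-known, pages 1--10, yes---no

\end{document}
```

On older LaTeX installations, compiling this file without the inputenc line is a quick way to check whether the missing package explains the broken accented characters.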

cp ....are the same file

Fresh installation using:

git clone https://github.com/transpect/docx2tex --recursive

Any conversion with d2t gives me the same error

cp: ‘/tmp/docx-samples/160229_Wolff_Sensor_Technologien/all/1_Einleitung.docx’ and ‘/tmp/docx-samples/160229_Wolff_Sensor_Technologien/all/1_Einleitung.docx’ are the same file
ERROR: xml2tex/xpl/xml2tex.xpl:71:65:err:XS0052:Cannot import: http://transpect.io/mml2tex/xpl/mml2tex.xpl
ERROR:     cause: I/O error reported by XML parser processing http://transpect.io/mml2tex/xpl/mml2tex.xpl: http://transpect.github.io/mml2tex/xpl/mml2tex.xpl
ERROR: It is a static error if the URI of a p:import cannot be retrieved or if, once retrieved, it does not point to a p:library, p:declare-step, or p:pipeline.
ERROR: Underlying exception: I/O error reported by XML parser processing http://transpect.io/mml2tex/xpl/mml2tex.xpl: http://transpect.github.io/mml2tex/xpl/mml2tex.xpl

lost newline

In the following file, a newline is lost between "...Fast ICA Algorithm" and "FastICA disintegrates...":
https://medialab.sissa.it/owncloud/index.php/s/PXm3ktw0LFYXVti

However, please note that the original DOCX is missing a couple of spaces in "3.1Denoisingby" (probably a typo). When I added them (to get "3.1 Denoising by") and saved the file, the missing newline appeared in the TeX translation and I could see no error.

Formatting issues with in-text (sub|super)scripts

The reference file is the same as #3.

I am experiencing some issues with subscripts and superscripts in the reference file.

  1. I think braces are missing for longer subscripted words: kcat is converted to \textit{k}$_cat$/ instead of \textit{k}$_{cat}$/.
  2. The superscript is wrong when the following character is a minus sign, as in $^−1$ (maybe related to issue #5?).
  3. I am getting double equation markup for Cα, like this: C$^$\alpha$$. This breaks the output, and much of the non-equation text following it is pulled into math mode.

Cheers
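
The three cases can be sketched in a small standalone file. This is hand-written LaTeX showing the markup the report seems to expect, not actual converter output, and the unit "s" in the second case is only an illustration.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
\begin{document}

% 1. Without braces, _ applies to the next token only, so $_cat$
%    subscripts just the "c". Braces group the whole word:
\textit{k}$_{cat}$

% 2. The same grouping rule applies to superscripts. An ASCII hyphen
%    in math mode is typeset as a minus sign:
s$^{-1}$

% 3. Math mode must be opened once and closed once. In C$^$\alpha$$
%    the second $ closes math mode right after ^, so \alpha ends up
%    outside it:
C$^{\alpha}$

\end{document}
```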
