
java-mammoth's Introduction

Mammoth .docx to HTML converter for Java/JVM

Mammoth is designed to convert .docx documents, such as those created by Microsoft Word, Google Docs and LibreOffice, to HTML. Mammoth aims to produce simple and clean HTML by using semantic information in the document, and ignoring other details. For instance, Mammoth converts any paragraph with the style Heading 1 to h1 elements, rather than attempting to exactly copy the styling (font, text size, colour, etc.) of the heading.

There's a large mismatch between the structure used by .docx and the structure of HTML, meaning that the conversion is unlikely to be perfect for more complicated documents. Mammoth works best if you only use styles to semantically mark up your document.

The following features are currently supported:

  • Headings.

  • Lists.

  • Customisable mapping from your own docx styles to HTML. For instance, you could convert WarningHeading to h1.warning by providing an appropriate style mapping.

  • Tables. The formatting of the table itself, such as borders, is currently ignored, but the formatting of the text is treated the same as in the rest of the document.

  • Footnotes and endnotes.

  • Images.

  • Bold, italics, underlines, strikethrough, superscript and subscript.

  • Links.

  • Line breaks.

  • Text boxes. The contents of the text box are treated as a separate paragraph that appears after the paragraph containing the text box.

  • Comments.

Installation

Available on Maven Central.

<dependency>
  <groupId>org.zwobble.mammoth</groupId>
  <artifactId>mammoth</artifactId>
  <version>1.7.0</version>
</dependency>

Usage

Library

Basic conversion

To convert an existing .docx file to HTML, create an instance of DocumentConverter and pass an instance of File to convertToHtml. For instance:

import java.io.File;
import java.util.Set;

import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

DocumentConverter converter = new DocumentConverter();
Result<String> result = converter.convertToHtml(new File("document.docx"));
String html = result.getValue(); // The generated HTML
Set<String> warnings = result.getWarnings(); // Any warnings during conversion

You can also extract the raw text of the document by using extractRawText. This will ignore all formatting in the document. Each paragraph is followed by two newlines.

DocumentConverter converter = new DocumentConverter();
Result<String> result = converter.extractRawText(new File("document.docx"));
String text = result.getValue(); // The raw text
Set<String> warnings = result.getWarnings(); // Any warnings during conversion
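Because each paragraph in the raw text is followed by two newlines, the text can be split back into paragraphs with the standard library alone. The following is a minimal sketch: RawTextParagraphs and splitParagraphs are hypothetical names that are not part of Mammoth, and the sketch assumes that no paragraph itself contains two consecutive newlines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mammoth.
final class RawTextParagraphs {
    // Splits raw text in which each paragraph is followed by two newlines.
    static List<String> splitParagraphs(String rawText) {
        List<String> paragraphs = new ArrayList<>();
        for (String paragraph : rawText.split("\n\n")) {
            // Skip the empty pieces that appear for empty input.
            if (!paragraph.isEmpty()) {
                paragraphs.add(paragraph);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // In real use, rawText would be the value returned by extractRawText.
        String rawText = "First paragraph\n\nSecond paragraph\n\n";
        System.out.println(splitParagraphs(rawText));
    }
}
```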

Custom style map

By default, Mammoth maps some common .docx styles to HTML elements. For instance, a paragraph with the style name Heading 1 is converted to a h1 element. You can add custom style maps by calling addStyleMap(String). A description of the syntax for style maps can be found in the section "Writing style maps". For instance, if paragraphs with the style name Section Title should be converted to h1 elements, and paragraphs with the style name Subsection Title should be converted to h2 elements:

DocumentConverter converter = new DocumentConverter()
    .addStyleMap("p[style-name='Section Title'] => h1:fresh")
    .addStyleMap("p[style-name='Subsection Title'] => h2:fresh");

You can also pass in the entire style map as a single string, which can be useful if style maps are stored in text files:

String styleMap =
    "p[style-name='Section Title'] => h1:fresh\n" +
    "p[style-name='Subsection Title'] => h2:fresh";
DocumentConverter converter = new DocumentConverter()
    .addStyleMap(styleMap);
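If the style map lives in a text file, it can be read into a single string first. This is a minimal sketch using only the standard library; StyleMapFiles and readStyleMap are hypothetical names, and the file is assumed to be UTF-8 encoded.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of Mammoth.
final class StyleMapFiles {
    // Reads the whole file as UTF-8 text. The returned string is intended
    // to be passed to addStyleMap(String) as a complete style map.
    static String readStyleMap(Path path) throws IOException {
        return new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
    }
}
```

The returned string could then be passed to addStyleMap in place of the styleMap variable used above.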

The most recently-added styles have the greatest precedence. User-defined style mappings are used in preference to the default style mappings. To stop using the default style mappings altogether, call disableDefaultStyleMap:

DocumentConverter converter = new DocumentConverter()
    .disableDefaultStyleMap();

Custom image handlers

By default, images are converted to <img> elements with the source included inline in the src attribute. This behaviour can be changed by calling imageConverter() with an image converter.

For instance, the following would replicate the default behaviour:

DocumentConverter converter = new DocumentConverter()
    .imageConverter(image -> {
        String base64 = streamToBase64(image::getInputStream);
        String src = "data:" + image.getContentType() + ";base64," + base64;
        Map<String, String> attributes = new HashMap<>();
        attributes.put("src", src);
        return attributes;
    });

where streamToBase64 is a function that reads an input stream and encodes it as a Base64 string.
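One possible shape for such a helper is sketched below. It is an assumption rather than the helper the README has in mind: Base64Streams and StreamOpener are hypothetical names, and the functional interface exists only so that a method reference such as image::getInputStream, which may throw IOException, can be passed in.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of Mammoth.
final class Base64Streams {
    // Hypothetical interface: something that opens a stream and may fail.
    @FunctionalInterface
    interface StreamOpener {
        InputStream open() throws IOException;
    }

    // Opens the stream, reads it to the end, closes it, and returns the
    // bytes encoded as a Base64 string.
    static String streamToBase64(StreamOpener opener) throws IOException {
        try (InputStream stream = opener.open()) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int read;
            while ((read = stream.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
            return Base64.getEncoder().encodeToString(buffer.toByteArray());
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "hello".getBytes(StandardCharsets.UTF_8);
        // prints aGVsbG8=
        System.out.println(streamToBase64(() -> new ByteArrayInputStream(bytes)));
    }
}
```

If the calling context does not allow checked exceptions, the IOException could be wrapped in an UncheckedIOException instead.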

Bold

By default, bold text is wrapped in <strong> tags. This behaviour can be changed by adding a style mapping for b. For instance, to wrap bold text in <em> tags:

DocumentConverter converter = new DocumentConverter()
    .addStyleMap("b => em");

Italic

By default, italic text is wrapped in <em> tags. This behaviour can be changed by adding a style mapping for i. For instance, to wrap italic text in <strong> tags:

DocumentConverter converter = new DocumentConverter()
    .addStyleMap("i => strong");

Underline

By default, the underlining of any text is ignored since underlining can be confused with links in HTML documents. This behaviour can be changed by adding a style mapping for u. For instance, suppose that a source document uses underlining for emphasis. The following will wrap any explicitly underlined source text in <em> tags:

DocumentConverter converter = new DocumentConverter()
    .addStyleMap("u => em");

Strikethrough

By default, strikethrough text is wrapped in <s> tags. This behaviour can be changed by adding a style mapping for strike. For instance, to wrap strikethrough text in <del> tags:

DocumentConverter converter = new DocumentConverter()
    .addStyleMap("strike => del");

Comments

By default, comments are ignored. To include comments in the generated HTML, add a style mapping for comment-reference. For instance:

DocumentConverter converter = new DocumentConverter()
    .addStyleMap("comment-reference => sup");

Comments will be appended to the end of the document, with links to the comments wrapped using the specified style mapping.

API

DocumentConverter

Methods:

  • Result<String> convertToHtml(File file): converts file into an HTML string.

  • Result<String> convertToHtml(InputStream stream): converts stream into an HTML string. Note that using this method instead of convertToHtml(File file) means that relative paths to other files, such as images, cannot be resolved.

  • Result<String> extractRawText(File file): extract the raw text of the document. This will ignore all formatting in the document. Each paragraph is followed by two newlines.

  • Result<String> extractRawText(InputStream stream): extract the raw text of the document. This will ignore all formatting in the document. Each paragraph is followed by two newlines.

  • DocumentConverter addStyleMap(String styleMap): add a style map to specify the mapping of Word styles to HTML. The most recently added style map has the greatest precedence. See the section "Writing style maps" for a description of the syntax.

  • DocumentConverter disableDefaultStyleMap(): by default, any added style maps are combined with the default style map. Call this to stop using the default style map altogether.

  • DocumentConverter disableEmbeddedStyleMap(): by default, if the document contains an embedded style map, then it is combined with the default style map. Call this to ignore any embedded style maps.

  • DocumentConverter preserveEmptyParagraphs(): by default, empty paragraphs are ignored. Call this to preserve empty paragraphs in the output.

  • DocumentConverter idPrefix(String idPrefix): a string to prepend to any generated IDs, such as those used by bookmarks, footnotes and endnotes. Defaults to the empty string.

  • DocumentConverter imageConverter(ImageConverter.ImgElement imageConverter): by default, images are converted to <img> elements with the source included inline in the src attribute. Call this to change how images are converted.

Result<T>

Represents the result of a conversion. Methods:

  • T getValue(): the generated text.

  • Set<String> getWarnings(): any warnings generated during the conversion.

Image converters

An image converter can be created by implementing ImageConverter.ImgElement. This creates an <img> element for each image in the original docx. The interface has a single method, Map<String, String> convert(Image image). The image argument is the image element being converted, and has the following methods:

  • InputStream getInputStream(): open the image file.

  • String getContentType(): the content type of the image, such as image/png.

  • Optional<String> getAltText(): the alt text of the image, if any.

convert() should return a Map of attributes for the <img> element. At a minimum, this should include the src attribute. If any alt text is found for the image, this will be automatically added to the element's attributes.

For instance, the following replicates the default image conversion:

DocumentConverter converter = new DocumentConverter()
    .imageConverter(image -> {
        String base64 = streamToBase64(image::getInputStream);
        String src = "data:" + image.getContentType() + ";base64," + base64;
        Map<String, String> attributes = new HashMap<>();
        attributes.put("src", src);
        return attributes;
    });

where streamToBase64 is a function that reads an input stream and encodes it as a Base64 string.
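An image converter does not have to inline the image data. As a rough sketch of an alternative, the class below copies each image into a directory and returns a src attribute holding the file name. To keep the sketch compilable without the library, it is written against ImageSource, a hypothetical stand-in that mirrors two of the methods listed above; FileImageWriter and the image-N naming scheme are likewise assumptions.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of Mammoth.
final class FileImageWriter {
    // Stand-in for the image argument described above (assumption).
    interface ImageSource {
        InputStream getInputStream() throws IOException;
        String getContentType();
    }

    private final Path directory;
    private final AtomicInteger counter = new AtomicInteger();

    FileImageWriter(Path directory) {
        this.directory = directory;
    }

    // Copies the image into the directory and returns the <img> attributes.
    Map<String, String> convert(ImageSource image) throws IOException {
        // Naive extension guess: image/png becomes png.
        String contentType = image.getContentType();
        String extension = contentType.substring(contentType.indexOf('/') + 1);
        String fileName = "image-" + counter.incrementAndGet() + "." + extension;
        try (InputStream stream = image.getInputStream()) {
            // Fails if the target file already exists.
            Files.copy(stream, directory.resolve(fileName));
        }
        Map<String, String> attributes = new HashMap<>();
        attributes.put("src", fileName);
        return attributes;
    }
}
```

With the library on the classpath, the body of convert could take Mammoth's image argument directly instead of ImageSource.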

Writing style maps

A style map is made up of a number of style mappings separated by new lines. Blank lines and lines starting with # are ignored.

A style mapping has two parts:

  • On the left, before the arrow, is the document element matcher.
  • On the right, after the arrow, is the HTML path.

When converting each paragraph, Mammoth finds the first style mapping where the document element matcher matches the current paragraph. Mammoth then ensures the HTML path is satisfied.
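The line rules above can be illustrated with a small standalone sketch. It is not Mammoth's parser: StyleMapLines is a hypothetical name, and the sketch naively splits each mapping at the first arrow.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the line rules only; this is not Mammoth's parser.
final class StyleMapLines {
    // Returns {matcher, htmlPath} pairs for every line that is neither
    // blank nor a comment starting with #.
    static List<String[]> split(String styleMap) {
        List<String[]> mappings = new ArrayList<>();
        for (String line : styleMap.split("\\r?\\n")) {
            if (line.trim().isEmpty() || line.startsWith("#")) {
                continue; // blank lines and comment lines are ignored
            }
            int arrow = line.indexOf("=>");
            if (arrow < 0) {
                throw new IllegalArgumentException("No arrow in mapping: " + line);
            }
            String matcher = line.substring(0, arrow).trim();
            String htmlPath = line.substring(arrow + 2).trim();
            mappings.add(new String[] {matcher, htmlPath});
        }
        return mappings;
    }
}
```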

Freshness

When writing style mappings, it's helpful to understand Mammoth's notion of freshness. When generating, Mammoth will only close an HTML element when necessary. Otherwise, elements are reused.

For instance, suppose one of the specified style mappings is p[style-name='Heading 1'] => h1. If Mammoth encounters a .docx paragraph with the style name Heading 1, the .docx paragraph is converted to a h1 element with the same text. If the next .docx paragraph also has the style name Heading 1, then the text of that paragraph will be appended to the existing h1 element, rather than creating a new h1 element.

In most cases, you'll probably want to generate a new h1 element instead. You can specify this by using the :fresh modifier:

p[style-name='Heading 1'] => h1:fresh

The two consecutive Heading 1 .docx paragraphs will then be converted to two separate h1 elements.
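The difference can be pictured with a toy model. The sketch below is not how Mammoth generates HTML; FreshnessModel is a hypothetical class that handles only a run of paragraphs that all map to the same element, reusing the open element unless the mapping is marked fresh.

```java
import java.util.Arrays;
import java.util.List;

// Toy model of freshness; it does not reproduce Mammoth's HTML generation.
final class FreshnessModel {
    static String render(String tag, boolean fresh, List<String> paragraphs) {
        StringBuilder html = new StringBuilder();
        boolean open = false;
        for (String text : paragraphs) {
            if (open && fresh) {
                // A fresh mapping cannot reuse the element that is already open.
                html.append("</").append(tag).append(">");
                open = false;
            }
            if (!open) {
                html.append("<").append(tag).append(">");
                open = true;
            }
            html.append(text);
        }
        if (open) {
            html.append("</").append(tag).append(">");
        }
        return html.toString();
    }

    public static void main(String[] args) {
        List<String> headings = Arrays.asList("One", "Two");
        System.out.println(render("h1", false, headings)); // <h1>OneTwo</h1>
        System.out.println(render("h1", true, headings));  // <h1>One</h1><h1>Two</h1>
    }
}
```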

Reusing elements is useful in generating more complicated HTML structures. For instance, suppose your .docx contains asides. Each aside might have a heading and some body text, which should be contained within a single div.aside element. In this case, style mappings similar to p[style-name='Aside Heading'] => div.aside > h2:fresh and p[style-name='Aside Text'] => div.aside > p:fresh might be helpful.

Document element matchers

Paragraphs, runs and tables

Match any paragraph:

p

Match any run:

r

Match any table:

table

To match a paragraph, run or table with a specific style, you can reference the style by name. This is the style name that is displayed in Microsoft Word or LibreOffice. For instance, to match a paragraph with the style name Heading 1:

p[style-name='Heading 1']

You can also match a style name by prefix. For instance, to match a paragraph where the style name starts with Heading:

p[style-name^='Heading']

Styles can also be referenced by style ID. This is the ID used internally in the .docx file. To match a paragraph or run with a specific style ID, append a dot followed by the style ID. For instance, to match a paragraph with the style ID Heading1:

p.Heading1

Bold

Match explicitly bold text:

b

Note that this matches text that has had bold explicitly applied to it. It will not match any text that is bold because of its paragraph or run style.

Italic

Match explicitly italic text:

i

Note that this matches text that has had italic explicitly applied to it. It will not match any text that is italic because of its paragraph or run style.

Underline

Match explicitly underlined text:

u

Note that this matches text that has had underline explicitly applied to it. It will not match any text that is underlined because of its paragraph or run style.

Strikethrough

Match explicitly struckthrough text:

strike

Note that this matches text that has had strikethrough explicitly applied to it. It will not match any text that is struckthrough because of its paragraph or run style.

All caps

Match explicitly all caps text:

all-caps

Note that this matches text that has had all caps explicitly applied to it. It will not match any text that is all caps because of its paragraph or run style.

Small caps

Match explicitly small caps text:

small-caps

Note that this matches text that has had small caps explicitly applied to it. It will not match any text that is small caps because of its paragraph or run style.

Ignoring document elements

Use ! to ignore a document element. For instance, to ignore any paragraph with the style Comment:

p[style-name='Comment'] => !

HTML paths

Single elements

The simplest HTML path is to specify a single element. For instance, to specify an h1 element:

h1

To give an element a CSS class, append a dot followed by the name of the class:

h1.section-title

To add an attribute, use square brackets similarly to a CSS attribute selector:

p[lang='fr']

To require that an element is fresh, use :fresh:

h1:fresh

Modifiers must be used in the correct order:

h1.section-title:fresh
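When HTML paths are assembled in code, a small builder can keep the modifiers in this order. The helper below is a hypothetical sketch, not a Mammoth API; it covers only a tag, an optional class and :fresh.

```java
// Hypothetical helper, not part of Mammoth.
final class HtmlPaths {
    // Builds a single-element HTML path: tag, then optional class, then :fresh.
    static String element(String tag, String cssClass, boolean fresh) {
        StringBuilder path = new StringBuilder(tag);
        if (cssClass != null) {
            path.append('.').append(cssClass);
        }
        if (fresh) {
            path.append(":fresh");
        }
        return path.toString();
    }

    public static void main(String[] args) {
        System.out.println(element("h1", "section-title", true)); // h1.section-title:fresh
    }
}
```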

Separators

To specify a separator to place between the contents of paragraphs that are collapsed together, use :separator('SEPARATOR STRING').

For instance, suppose a document contains a block of code where each line of code is a paragraph with the style Code Block. We can write a style mapping to map such paragraphs to <pre> elements:

p[style-name='Code Block'] => pre

Since pre isn't marked as :fresh, consecutive pre elements will be collapsed together. However, this results in the code all being on one line. We can use :separator to insert a newline between each line of code:

p[style-name='Code Block'] => pre:separator('\n')
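When this mapping is written as a Java string literal, the backslash itself needs escaping. The sketch below assumes that the style-map parser interprets the two-character \n escape, as the mapping above suggests; a real newline character inside the literal would instead end the line, and style mappings are separated by new lines. SeparatorStyleMap is a hypothetical name.

```java
// Hypothetical holder class, not part of Mammoth.
final class SeparatorStyleMap {
    // In Java source, "\\n" is a backslash followed by the letter n, so the
    // string holds the same characters as the mapping shown above.
    static final String CODE_BLOCK =
        "p[style-name='Code Block'] => pre:separator('\\n')";

    public static void main(String[] args) {
        // prints p[style-name='Code Block'] => pre:separator('\n')
        System.out.println(CODE_BLOCK);
    }
}
```

The constant can then be passed to addStyleMap like any other style mapping.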

Nested elements

Use > to specify nested elements. For instance, to specify h2 within div.aside:

div.aside > h2

You can nest elements to any depth.

Missing features

Compared to the JavaScript and Python implementations, the following features are currently missing:

  • CLI
  • Writing embedded style maps
  • Markdown support
  • Document transforms

Donations

If you'd like to say thanks, feel free to make a donation through Ko-fi.

If you use Mammoth as part of your business, please consider supporting the ongoing maintenance of Mammoth by making a weekly donation through Liberapay.

java-mammoth's People

Contributors

mwilliamson

java-mammoth's Issues

mammoth used in an Office Add-in for Word doesn't see images

Tested with Win7x64 and Win10x64.

Steps to reproduce the issue:

SETTING UP

  1. If you do not already have Git installed, please go to https://git-scm.com/download/win and install it
  2. If you do not have NodeJS installed, go to https://nodejs.org/en/, get V6.9.4 LTS and install it.
  3. Create a temporary folder such as C:\TempOfficeAddIn
  4. Open CMD and go to C:\TempOfficeAddIn
  5. run "npm install -g tsd bower gulp"
  6. copy the folder MyAddInSite within the zip file "MyAddInSite.zip" to the C:\TempOfficeAddIn
  7. run "gulp serve-static"
  8. you should now be able to access: https://localhost:8443/app/home/home.html
  9. copy the file "onlineFile.1.docx" I sent you to your OneDrive

TO TEST ONLINE

  1. go to office.com, and sign in with your account
  2. choose Word
  3. Open the file from your online drive in the browser
  4. Click on "Edit Document" and choose "Edit in Browser"
  5. Click on "Insert" menu,
    then on "Office Add-Ins",
    then on "Upload My Add-in",
    browse for C:\TempOfficeAddIn\MyAddInSite\manifest-my-office-project.xml,
    then click "Upload"
  6. you should now see in the right side the add-in loaded.
  7. open the dev tools (F12) - I use Chrome - click on Console and clear the console history
  8. click on "GetHtml" button and wait about 5-6 seconds, you will see the "result" object received from mammoth in the console. The HTML is generated without any image...

Because Word for desktop works only with registered add-ins, I cannot give you details on how to test it on the desktop version (which seems to work fine on my side).

If you need any other details, please let me know.

MyAddInSite.zip
onlineFile(1).docx

The font display in Webview is too small, how can I change to a larger size?

If you're reporting a bug or requesting a feature, please include:

  • a minimal example document
  • the HTML output that you'd expect

If you're reporting a bug, it's also useful to know what platform you're
running on, including:

  • the version of the JVM
  • the version of Java, or other language that you're using
  • the operating system and version

Alignments and colors are not preserving

While converting a .docx file to HTML, I found that the HTML document lacks the alignment and colours of the .docx file.

I tried converting the attached file with this code:

String FILE1="Synopsis.docx";
Synopsis.docx

DocumentConverter documentConverter = new DocumentConverter();
documentConverter.preserveEmptyParagraphs();
Result stringResult= documentConverter.convertToHtml(new FileInputStream(new File(FILE1)));
FileOutputStream fileOutputStream = new FileOutputStream(new File("1.html"));
fileOutputStream.write(stringResult.getValue().getBytes(Charset.forName("UTF-16")));
fileOutputStream.close();

Missing Tab when use docxToHtml to convert docx

Hi, I tried different documents that have w:tab in the docx's document.xml. However, the result does not contain any \t for w:tab.

See the docx below: when I turn on the display of formatting marks, you can clearly see the tab indicator "->", as shown below.

image

Test Document:

Test_Tab.docx

I'm using kotlin as the language

Expected:
this.hideloader();
Love is pain
Testing Tab
Try to figure out what is this
Try to figure out what is this

With Html

<p>this.hideloader();</p>
<p> \t \t \t \t \t Love is Pain</p>
<p> \t \t \t \t \t Testing Tab </p>
<p>Try to figure out \t what is this</p>
<p>Try to figure out \t \t what is this</p>

but get this below in html string

<p>this.hideloader();</p>
<p> Love is Pain</p>
<p> Testing Tab </p>
<p>Try to figure out what is this</p>
<p>Try to figure out what is this</p>

Maintain spacing B/W Two Paragraph.

I am trying to convert a document to HTML with the Mammoth Java library, but it removes the spacing between two paragraphs. I found some suggestions:

var options = {
    styleMap: [
        "p[style-name='center'] => center",
        "p[style-name='Heading 1'] => p:fresh > h1:fresh",
        "p[style-name='Heading 2'] => p:fresh > h2:fresh",
        "p[style-name='Heading 3'] => p:fresh > h3:fresh",
        "p[style-name='Heading 4'] => p:fresh > h4:fresh",
        "p[style-name='Heading 5'] => p:fresh > h5:fresh",
        "p[style-name='Heading 6'] => p:fresh > h6:fresh"
    ],
    transformDocument: transformElement,
    ignoreEmptyParagraphs: false
};

and

DocumentConverter converter = new DocumentConverter().preserveEmptyParagraphs();

to maintain the spacing, but it is not working. Can someone tell me whether I am doing something wrong?

@mwilliamson
tablecheck.docx

IMG-eab356043d4630e86bd757dd1dd28791-V

This is docx

IMG-2ba3a043cd195639bd885156fa497c9e-V

This is html

Image converter add attribute id or class

Hi,
Is it possible to add an id or class attribute to the img tag during the ImageConverter process?
That would make it easy to identify each image after the conversion is done. For instance:

[image id="image-vangogh" src="/img/html/vangogh.jpg" alt="Van Gogh, Self-portrait"]
Right now it seems only alt and src can be generated.

Get list of pages instead of one html string

Hi, I have been trying to get a list of strings instead of one string with the HTML of the whole document. Is there a way to do this, such as converting every page to a separate string in a list? The document could be divided by pages, headings, etc.

So instead of:
final String html = result.getValue();

I would like to have:
final List<String> html = result.getValue();

Cross-references are not converted into anchors

Hi,
Some cross-references are not converted into anchors. I attached a sample:
sample.docx

The result of the conversion should be:

<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title/>
    </head>
    <body>
        <h1>Title</h1>
        <p>Reference to: <a href="#_Ref66182243">Heading</a></p>
        <h1><a id="_Ref66182243"/>Heading</h1>
        <p>Content1</p>
    </body>
</html>

but the "a" element with the "href" attribute is not added.

References of this kind have the "REF" field definition with the "\h" argument: https://c-rex.net/projects/samples/ooxml/e1/Part4/OOXML_P4_DOCX_REFREF_topic_ID0ESRL1.html

This is a possible fix: DunaMariusCosmin@0e23540

Best regards,
Cosmin

how to use it under jdk1.7

The library is very useful, but I have to use it under JDK 1.7. Can you give me some suggestions?
Many thanks!

Two different ordered lists are merged in the conversion result

I have the following Word document: 2Lists.docx:
image
When I convert it to HTML, the lists are merged:
<ol> <li>Li1</li> <li>Li2</li> <li>NewLi1</li> <li>NewLi2</li> </ol>

This is the Word internal structure:
listsStructure
I think the converter should take into account the id of the numbering. Maybe the "org.zwobble.mammoth.internal.styles.ParagraphMatcher.matchesNumbering(Paragraph)" method should also compare these ids.

EMF image support

Hi, converting a formula in a Word document produces <img src="......
but the HTML does not display the image.

Getting compile time error in HTML.java

After downloading the code, when I try to compile it using Maven, there are the following compile-time errors in HTML.java:

The method getChildren() is undefined for the type Object line 144

The method getChildren() is undefined for the type Object line 147

The method isMatch(HtmlElement, HtmlElement) in the type Html is not applicable for the arguments (Object, HtmlElement) line 141

Type mismatch: cannot convert from Optional to Optional line 138

Getting the style of a text(alignment of text)

Hi,
How can I get the alignment style of a text using mammoth?
e.g. if I centre-align text in the .docx file, the text is not centred on the HTML page after converting to HTML.

Add fontsize during conversion from docx to html

Hi,
we want to extract the font size from the docx file and carry it over to the HTML output.
I see that in the JS version of Mammoth
there is an issue, mwilliamson/mammoth.js#109, in which people found out how to add color.
And in your JS version file: https://github.com/mwilliamson/mammoth.js/blob/master/lib/documents.js
the Run function has properties.fontSize. I assume that if I follow issue #109, I can apply the font size to my converted HTML?

How can we do this in the Java version of Mammoth?

Error when multiple numbering levels reference the same paragraph style

Exception in thread "main" java.lang.IllegalStateException: Duplicate key org.zwobble.mammoth.internal.docx.Numbering$AbstractNumLevel@72bef795
at java.util.stream.Collectors.lambda$throwingMerger$0(Collectors.java:133)
at java.util.HashMap.merge(HashMap.java:1254)
at java.util.stream.Collectors.lambda$toMap$58(Collectors.java:1320)
at java.util.stream.ReduceOps$3ReducingSink.accept(ReduceOps.java:169)
at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175)
at java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1625)
at java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:580)
at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:270)
at java.util.HashMap$ValueSpliterator.forEachRemaining(HashMap.java:1625)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at org.zwobble.mammoth.internal.docx.Numbering.<init>(Numbering.java:72)
at org.zwobble.mammoth.internal.docx.NumberingXml.readNumberingXmlElement(NumberingXml.java:16)
at org.zwobble.mammoth.internal.docx.DocumentReader.lambda$readNumbering$7(DocumentReader.java:217)
at java.util.Optional.map(Optional.java:215)
at org.zwobble.mammoth.internal.docx.DocumentReader.readNumbering(DocumentReader.java:217)
at org.zwobble.mammoth.internal.docx.DocumentReader.readDocument(DocumentReader.java:29)
at org.zwobble.mammoth.internal.InternalDocumentConverter.convertToHtml(InternalDocumentConverter.java:55)
at org.zwobble.mammoth.internal.InternalDocumentConverter.lambda$convertToHtml$2(InternalDocumentConverter.java:48)
at org.zwobble.mammoth.internal.InternalDocumentConverter.withDocxFile(InternalDocumentConverter.java:85)
at org.zwobble.mammoth.internal.InternalDocumentConverter.lambda$convertToHtml$3(InternalDocumentConverter.java:47)
at org.zwobble.mammoth.internal.util.PassThroughException.unwrap(PassThroughException.java:16)
at org.zwobble.mammoth.internal.InternalDocumentConverter.convertToHtml(InternalDocumentConverter.java:46)
at org.zwobble.mammoth.DocumentConverter.convertToHtml(DocumentConverter.java:84)
at com.poc.images.ExtractImages.convertDocs(ExtractImages.java:143)
at com.poc.images.ExtractImages.main(ExtractImages.java:50)

Please add Android compatibility

Your great library can't be used on Android.
Could you please make a few changes:

  1. Don't use javax.xml.stream.* (please use an external XML writer or your own XML writer).
    javax.xml.stream is not available in the Android SDK.

  2. Make the code compatible with Java 7 to support most of the devices on the market.

Thank you.
I want to use your library in my reader but can't:
https://github.com/foobnix/LibreraReader

List conversion issue

Hi,

My colleagues and I have observed an issue in which, when converting a .docx file into HTML, bullet point lists are being converted into <p> tags instead of <ul>/<li> tags. This occurs when the .docx file was originally created as a .doc in legacy versions of Word, then is manually converted to .docx either by saving the .doc as a .docx or by exporting it as .docx prior to passing it into Mammoth. This issue doesn't seem to occur when converting a .docx that was created as a .docx (as opposed to having been created as .doc and manually converted to .docx).

See the attached files for examples of broken and working lists. From looking at the underlying XML, it seems that the only difference between them is that the broken list items have the following tag:

<w:numId w:val="5"/>

However, there is no <w:abstractNum w:abstractNumId="5"> entry in numbering.xml. The working example has w:val="3" and a corresponding <w:abstractNum w:abstractNumId="3"> entry in numbering.xml. While the root issue is clearly a bug in Word, since Word is able to correctly interpret these list items as bullet points (presumably by looking at the <w:pStyle w:val="ListBullet"/> tag in the list items), we believe Mammoth should interpret these as bullet points too.

The following test class includes a failing test that replicates the issue and a passing happy path test, using the two attached documents (with both files located at the root of src/test/resources). In order to get this test running using the same libraries as us, we're using version 1.4.0 of Mammoth, version 4.12 of JUnit, and version 27.0.1-jre of Google Guava.

package mypackage;

import com.google.common.io.ByteStreams;
import org.junit.Assert;
import org.junit.Test;
import org.zwobble.mammoth.DocumentConverter;
import org.zwobble.mammoth.Result;

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MammothListTest {

    @Test
    public void testBrokenList() throws Exception {
        String expected =
                "<h1><a id=\"__RefHeading___Toc508194329\"></a>Dummy title</h1>"
                        + "<p>Dummy subtitle</p>"
                        + "<ul>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "</ul>"
                        + "<p>Dummy subtitle</p>"
                        + "<ul>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "</ul>";

        byte[] brokenList = getInputDoc("/broken-list.docx");
        String actual = convertDocxToHtml(brokenList);
        Assert.assertEquals(expected, actual);
    }

    @Test
    public void testWorkingList() throws Exception {
        String expected =
                "<h1>dummy title</h1>"
                        + "<p>dummy subtitle</p>"
                        + "<ul>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "</ul>"
                        + "<p>Dummy subtitle</p>"
                        + "<ul>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "<li>dummy text</li>"
                        + "</ul>";

        byte[] workingList = getInputDoc("/working-list.docx");
        String actual = convertDocxToHtml(workingList);
        Assert.assertEquals(expected, actual);
    }

    private byte[] getInputDoc(String fileName) throws IOException {
        InputStream docxStream = new BufferedInputStream(getClass().getResourceAsStream(fileName));
        return ByteStreams.toByteArray(docxStream);
    }

    private String convertDocxToHtml(byte[] docxData) throws IOException {
        DocumentConverter converter = new DocumentConverter();
        Result<String> result = converter.convertToHtml(new ByteArrayInputStream(docxData));
        if (result.getWarnings().size() > 0) {
            // log any warnings that occur
        }
        return result.getValue();
    }
}

broken-list.docx
working-list.docx

Width and height of images

Hi there.

Just a short question: is there any way to get the width and height settings of images from the Word document? Right now images are imported at their original size, which makes them larger than they are configured in Word.

Android

Hi
Was this tested on android?

java.lang.NoClassDefFoundError: Failed resolution of: Ljava/util/Optional;

I am new to Mammoth and followed the documentation to implement it, but the application terminates with the error "java.lang.NoClassDefFoundError: Failed resolution of: Ljava/util/Optional;".
I used the org.zwobble.mammoth:mammoth:1.4.0 dependency.

My Java code is

    DocumentConverter converter = new DocumentConverter();
    try {
        Result<String> result = converter.convertToHtml(new File(
                Environment.getExternalStoragePublicDirectory(Environment.DIRECTORY_DOWNLOADS),
                "sample_doc.doc"));
        String html = result.getValue();
        Set<String> warning = result.getWarnings();
        webView.setWebViewClient(new WebViewClient());
        webView.loadUrl(html);
    } catch (IOException e) {
        e.printStackTrace();
    }

The thrown exception is:

AndroidRuntime: FATAL EXCEPTION: main
Process: com.sparksupport.ameer.demoapplicaion, PID: 28784
java.lang.NoClassDefFoundError: Failed resolution of: Ljava/util/Optional;
at org.zwobble.mammoth.internal.styles.StyleMapBuilder.(StyleMapBuilder.java:25)
at org.zwobble.mammoth.internal.styles.StyleMap.(StyleMap.java:36)
at org.zwobble.mammoth.internal.conversion.DocumentToHtmlOptions.(DocumentToHtmlOptions.java:12)
at org.zwobble.mammoth.DocumentConverter.(DocumentConverter.java:15)
at com.sparksupport.ameer.demoapplicaion.DocViewActivity.onCreate(DocViewActivity.java:34)
at android.app.Activity.performCreate(Activity.java:5966)
at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1106)
at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2408)
at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2517)
at android.app.ActivityThread.access$800(ActivityThread.java:162)
at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1412)
at android.os.Handler.dispatchMessage(Handler.java:106)
at android.os.Looper.loop(Looper.java:189)
at android.app.ActivityThread.main(ActivityThread.java:5529)
at java.lang.reflect.Method.invoke(Native Method)
at java.lang.reflect.Method.invoke(Method.java:372)
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:950)
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:745)
Caused by: java.lang.ClassNotFoundException: Didn't find class "java.util.Optional" on path: DexPathList[[zip file "/data/app/com.sparksupport.ameer.demoapplicaion-1/base.apk"],nativeLibraryDirectories=[/vendor/lib, /system/lib]]
at dalvik.system.BaseDexClassLoader.findClass(BaseDexClassLoader.java:56)
at java.lang.ClassLoader.loadClass(ClassLoader.java:511)
at java.lang.ClassLoader.loadClass(ClassLoader.java:469)
at org.zwobble.mammoth.internal.styles.StyleMapBuilder.(StyleMapBuilder.java:25) 
at org.zwobble.mammoth.internal.styles.StyleMap.(StyleMap.java:36) 
at org.zwobble.mammoth.internal.conversion.DocumentToHtmlOptions.(DocumentToHtmlOptions.java:12) 
at org.zwobble.mammoth.DocumentConverter.(DocumentConverter.java:15) 
at com.sparksupport.ameer.demoapplicaion.DocViewActivity.onCreate(DocViewActivity.java:34) 
at android.app.Activity.performCreate(Activity.java:5966) 
at android.app.Instrumentation.callActivityOnCreate(Instrumentation.java:1106) 
at android.app.ActivityThread.performLaunchActivity(ActivityThread.java:2408) 
at android.app.ActivityThread.handleLaunchActivity(ActivityThread.java:2517) 
at android.app.ActivityThread.access$800(ActivityThread.java:162) 
at android.app.ActivityThread$H.handleMessage(ActivityThread.java:1412) 
at android.os.Handler.dispatchMessage(Handler.java:106) 
at android.os.Looper.loop(Looper.java:189) 
at android.app.ActivityThread.main(ActivityThread.java:5529) 
at java.lang.reflect.Method.invoke(Native Method) 
at java.lang.reflect.Method.invoke(Method.java:372) 
at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:950) 
at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:745) 
Suppressed: java.lang.ClassNotFoundException: java.util.Optional
at java.lang.Class.classForName(Native Method)
at java.lang.BootClassLoader.findClass(ClassLoader.java:781)
at java.lang.BootClassLoader.loadClass(ClassLoader.java:841)
at java.lang.ClassLoader.loadClass(ClassLoader.java:504)
... 19 more
Caused by: java.lang.NoClassDefFoundError: Class not found using the boot class loader; no stack available

Please suggest a solution.
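
The trace above shows that java.util.Optional cannot be loaded by the device's class loader. That class was added in Java 8 and, on Android, in API level 24, so a runtime older than that will not have it unless the build backports it. As a minimal sketch of a runtime guard (the class name OptionalAvailability and the helper isClassAvailable are hypothetical, not part of Mammoth), an app could check for the class before constructing a DocumentConverter:

```java
public class OptionalAvailability {

    // Returns true if the named class can be loaded by the current runtime.
    // LinkageError covers NoClassDefFoundError raised during resolution.
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        if (isClassAvailable("java.util.Optional")) {
            System.out.println("java.util.Optional is available");
        } else {
            System.out.println("java.util.Optional is missing on this runtime");
        }
    }
}
```

This only detects the problem so the app can fail gracefully. It does not make the missing class available.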

Alignment and list alignment

Hi,
I am testing your API and trying to convert a Word file to HTML. One major issue is that it does not preserve general text alignment (for instance left, center or right). Word lists such as

  • dog
    • cat
    • bird

also have some indentation, but the API does not carry that over either.
Is there a way to make the API do these things?

Charset not set to UTF-8 throughout the project

I am receiving the below RuntimeException when running Maven tests:

embeddedStyleMapCanBeWrittenAndThenRead() Time elapsed: 0.002 sec <<< ERROR!
java.lang.RuntimeException: javax.xml.stream.XMLStreamException: Underlying stream encoding 'Cp1252' and input paramter for writeStartDocument() method 'UTF-8' do not match.
at org.zwobble.mammoth.tests.MammothTests.embeddedStyleMapCanBeWrittenAndThenRead(MammothTests.java:291)
Caused by: javax.xml.stream.XMLStreamException: Underlying stream encoding 'Cp1252' and input paramter for writeStartDocument() method 'UTF-8' do not match.
at org.zwobble.mammoth.tests.MammothTests.embeddedStyleMapCanBeWrittenAndThenRead(MammothTests.java:291)

Some streams are missing charset configuration, which causes the JDK to fall back to the default charset. In my case, the JDK determines the default charset to be 'Cp1252', which differs from the hard-coded UTF-8 encoding used throughout most of the project. Explicitly setting the encoding to UTF-8 everywhere would resolve this mismatch.
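
To illustrate the kind of change meant here, the following is a minimal sketch and not the project's actual code: the class name Utf8XmlExample, the method writeXml and the element name style-map are all assumptions made for the example. It passes the encoding explicitly when creating the XMLStreamWriter, so the stream encoding and the encoding declared in writeStartDocument agree regardless of the platform default.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class Utf8XmlExample {

    // Writes a tiny XML document, naming UTF-8 explicitly for both the
    // underlying stream and the XML declaration.
    static byte[] writeXml(String text) throws XMLStreamException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        XMLStreamWriter xml = XMLOutputFactory.newInstance()
                .createXMLStreamWriter(bytes, "UTF-8");
        xml.writeStartDocument("UTF-8", "1.0");
        xml.writeStartElement("style-map"); // hypothetical element name
        xml.writeCharacters(text);
        xml.writeEndElement();
        xml.writeEndDocument();
        xml.close();
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws XMLStreamException {
        byte[] out = writeXml("caf\u00e9");
        // Decode with the same explicit charset instead of the default.
        System.out.println(new String(out, StandardCharsets.UTF_8));
    }
}
```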

Operating System:

  • Microsoft Windows 10
  • 10.0.19044 N/A Build 19044

JDK:

  • openjdk 17.0.6 2023-01-17 LTS
  • OpenJDK Runtime Environment (build 17.0.6+10-LTS)
  • OpenJDK 64-Bit Server VM (build 17.0.6+10-LTS, mixed mode, sharing)

Missing a way to identify text boxes

As it stands, the API transforms every single line inside a text box into an individual p element.

It would be great if those converted p elements were wrapped in a div, with a way to assign a class to that div, so it can be manipulated after the conversion, even if it won't be readable without user interaction.

As it stands, the text box conversion is very poor and there's no way to manipulate it.

Mammoth is unable to parse page header content in .docx file

Hi,
I was previously using the Tika parser to convert Word to HTML, but Tika could not parse the lists and superscript in the Word document. Mammoth works like a charm; the only issue I'm facing is that it does not parse the page header at all. Even the raw text output does not include the header content. Could you please comment on whether this is possible in Mammoth?

Template might not exist or might not be accessible by any of the configured template resolvers

If you're reporting a bug or requesting a feature, please include:

  • a minimal example document
  • the HTML output that you'd expect

If you're reporting a bug, it's also useful to know what platform you're
running on, including:

  • the version of the JVM
  • the version of Java, or other language that you're using
  • the operating system and version

image

image

I keep getting this exception. Could you please provide a working example or a solution?

Large file OOM problem

Hi. When the doc file is very large and multiple parsing tasks are running at the same time, "GC overhead limit exceeded" or "Java heap space" errors appear. Is there any solution other than increasing memory? Thanks!
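
One general option, offered here only as a sketch and not as something Mammoth provides, is to cap how many conversions run at once so that peak heap use is bounded. The class BoundedConversions and its run method below are hypothetical names; the Callable stands in for whatever code performs a single conversion.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

public class BoundedConversions {

    private final Semaphore permits;

    // maxConcurrent is the number of conversions allowed to run at once.
    public BoundedConversions(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    // Blocks until a permit is free, runs the task, then releases the permit
    // even if the task throws.
    public <T> T run(Callable<T> conversion) throws Exception {
        permits.acquire();
        try {
            return conversion.call();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) throws Exception {
        BoundedConversions limiter = new BoundedConversions(2);
        // Placeholder task; a real caller would invoke its converter here.
        String html = limiter.run(() -> "<p>converted</p>");
        System.out.println(html);
    }
}
```

This does not reduce the memory a single large document needs. It only limits how many documents are held in memory at the same time.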
