beevik / etree

Parse and generate XML easily in Go

License: BSD 2-Clause "Simplified" License

dom etree go path xml xml-parser xpath

etree's Issues

panic: xml: encoding "ISO-8859-1" declared but Decoder.CharsetReader is nil

xml file:

.....

then Error:
panic: xml: encoding "ISO-8859-1" declared but Decoder.CharsetReader is nil

my code is:

	doc := etree.NewDocument()
	if err := doc.ReadFromFile("./test.xml"); err != nil {
		panic(err)
	}
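
The panic text comes from Go's encoding/xml decoder, which only understands UTF-8 unless it is given a CharsetReader. The following is a minimal sketch of such a reader for ISO-8859-1, using only the standard library. The helper names (latin1ToUTF8, charsetReader, rootText) are made up for illustration, and whether your etree version lets you plug a function like this into its read settings is an assumption to check against the package documentation.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// latin1ToUTF8 converts ISO-8859-1 bytes to a UTF-8 string. Every Latin-1
// byte maps to the Unicode code point with the same value.
func latin1ToUTF8(b []byte) string {
	var sb strings.Builder
	for _, c := range b {
		sb.WriteRune(rune(c))
	}
	return sb.String()
}

// charsetReader has the signature expected by xml.Decoder.CharsetReader.
// It reads the remaining input up front, which is fine for a sketch.
func charsetReader(label string, input io.Reader) (io.Reader, error) {
	switch strings.ToLower(label) {
	case "iso-8859-1", "latin1":
		raw, err := io.ReadAll(input)
		if err != nil {
			return nil, err
		}
		return strings.NewReader(latin1ToUTF8(raw)), nil
	}
	return nil, fmt.Errorf("unsupported charset %q", label)
}

// rootText decodes a document whose root element contains only text.
func rootText(data []byte) (string, error) {
	dec := xml.NewDecoder(bytes.NewReader(data))
	dec.CharsetReader = charsetReader
	var v struct {
		Text string `xml:",chardata"`
	}
	if err := dec.Decode(&v); err != nil {
		return "", err
	}
	return v.Text, nil
}

func main() {
	data := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><name>caf\xe9</name>")
	fmt.Println(rootText(data))
}
```

For other encodings, a table-driven converter would be needed instead of the one-to-one byte mapping used here.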

how to find an element with a specific class and a specific text

Normally, I would do an Xpath query like this one:

//*[contains(concat(' ', normalize-space(@class), ' '), ' LookForClass ')  and text()='TheTextInTheClass']/../..

How should I do this?

In the docs I read:

XPath-like path string. Panics if an invalid path string is supplied.
I get : etree: path has invalid filter [brackets].

So after trying and trying.... This is what I found out:

//*[@class='LookForClass'][text()='TheTextInTheClass']/../..

Perhaps this can help others as an example.

Need an AddElement method

I need a way to add an etree Element under another etree Element.

Trying to explain in code:

doc := etree.NewDocument()
doc.ReadFromFile("bookstore.xml")
root := doc.SelectElement("bookstore")

Now the root is an etree Element under which are a bunch of <book> XML Elements.

Suppose now I have

docMore.ReadFromString(xmlMoreBooks)

The question is how can I add docMore as new entries under the root etree Element?

I think such feature would be needed by others as well. Please consider adding it.

Thanks

how to get all content of an element

Is something like this possible:

for _, e := range doc.FindElements("./bookstore/book[1]/*") {
    fmt.Printf("%s: %s\n", e.Tag, e.Content())
}

Which would show the content of the given search

    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <p:price>30.00</p:price>

When I do e.Text() I only get the text value, but I need it all.

Failure to parse '/' in xpath values

When '/' is not used as a path separator in a path, like in the example below, etree's xpath compilation will fail.

//http[@internet='web']//url[@pattern='/web/app' ]

This is due to line 192 in path.go

	for _, s := range strings.Split(path, "/") {

Any suggestions for a fix?
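
One possible direction, sketched here without reference to the actual path.go code, is to split on '/' only when it appears outside a single-quoted string. The function name splitPath is hypothetical, and the sketch handles single quotes only.

```go
package main

import "fmt"

// splitPath splits an XPath-like path on '/' separators, but ignores any '/'
// that appears inside a single-quoted string such as [@pattern='/web/app'].
// Like strings.Split, it yields empty segments for leading or doubled slashes.
func splitPath(path string) []string {
	var segments []string
	inQuote := false
	start := 0
	for i := 0; i < len(path); i++ {
		switch path[i] {
		case '\'':
			inQuote = !inQuote
		case '/':
			if !inQuote {
				segments = append(segments, path[start:i])
				start = i + 1
			}
		}
	}
	return append(segments, path[start:])
}

func main() {
	for _, s := range splitPath("//http[@internet='web']//url[@pattern='/web/app']") {
		fmt.Printf("%q\n", s)
	}
}
```

Because empty segments are preserved, code that treats an empty segment as the "//" descendant marker could keep working unchanged.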

Need getnext and getprevious functions when inserting elements

Cdata is gone

When parsing XML and then writing it back out, the CDATA section is gone.

etree HTML parser changes node order?

Hi,

I'm currently facing an issue where I can't explain the etree behaviour. Following code demonstrates the issue I am facing. I want to parse an HTML string as illustrated below, change the attribute of an element and reprint the HTML when done.

string = "<p><center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center></p>"
test = etree.fromstring(string, parser)
print(etree.tostring(test, pretty_print=True, method="html"))

I get this output:

<html><body>
<p></p>
<center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center>
</body></html>

As you can see (let's ignore the <html> and <body> tags etree adds), the order of the nodes has been changed! The <p> tag that used to wrap the <center> tag, now loses its content, and that content gets added after the </p> tag closes. Eh?

When I omit the <center> tag, all of a sudden the parsing is done right:

string = "<p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p>"
test = etree.fromstring(string, parser)
print(etree.tostring(test, pretty_print=True, method="html"))

With output:

<html><body><p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p></body></html>

Am I doing something wrong here? I have to use the HTML parser because I get a lot of parsing errors when not using it. I also can't change the order of the <p> and <center> tags, as I read them this way.

(X)Path OR search

I checked https://godoc.org/github.com/beevik/etree#Path and it seems the (X)Path OR operator for different nodes searching is not there.

Please consider providing that feature, as it'd be very useful, but not too difficult to do.

To recap, from above stackoverflow Q:

The XPath OR operator of

//bookstore/book/title or //bookstore/city/zipcode/title

is expressed as:

//bookstore/book/title|//bookstore/city/zipcode/title

Thanks!

How do I change line endings when writing files?

Currently, when reading XML files with CRLF line endings, these will be converted to LF when writing the XML back to disk. How could I force a different line ending? The software that uses those XML files needs CRLF line endings (it expects line breaks in text blocks with CRLF and nothing else).
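
A workaround that does not depend on etree at all is to post-process the serialized text before writing it to disk. This is only a sketch under that assumption; toCRLF is a hypothetical helper, and it converts every line break, including those inside text blocks.

```go
package main

import (
	"fmt"
	"strings"
)

// crlf normalizes line endings: existing CRLF pairs are kept and a bare LF
// becomes CRLF. Listing "\r\n" first stops it from turning into "\r\r\n".
var crlf = strings.NewReplacer("\r\n", "\r\n", "\n", "\r\n")

// toCRLF returns s with all line endings converted to CRLF.
func toCRLF(s string) string { return crlf.Replace(s) }

func main() {
	out := toCRLF("<a>\n  <b>x</b>\r\n</a>\n")
	fmt.Printf("%q\n", out)
}
```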

Text filtering on leaf node

Given an XML like

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<library>
  <!-- Great book. -->
  <book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Being a Dog Is a Full-Time Job</title>
    <quote>I'd dog paddle the deepest ocean.</quote>
    <author id="CMS">
      <name>Charles M Schulz</name>
      <born>1922-11-26</born>
      <dead>2000-02-12</dead>
    </author>
    <character id="PP">
      <name>Peppermint Patty</name>
      <born>1966-08-22</born>
      <qualification>bold, brash and tomboyish</qualification>
    </character>
    <character id="Snoopy">
      <name>Snoopy</name>
      <born>1950-10-04</born>
      <qualification>extroverted</qualification>
      <qualification>beagle</qualification>
    </character>
  </book>
</library>

A query like library/book/character[qualification='beagle']/qualification would return all qualification elements of every character that has a qualification='beagle'. It'd be good to allow text() XPath queries so that a query like library/book/character/qualification[text()='beagle'] returns only the qualification nodes whose text is beagle.

Don't deprecate InsertChild() please

Deprecated: InsertChild is deprecated. Use InsertChildAt instead.

Please don't deprecate InsertChild() because InsertChildAt won't work for my case --

The xml file that I'm working on has a rigid format of where things are:

<A attr=... >
  <B attr=... />
  <C attr=... />
  <D attr=... />
</A>

B comes before C which comes before D. I know the order doesn't matter to xml, but I'm tracking the file with version control so, I'd prefer as little change as possible.

Whether I do doc.InsertChildAt(0, c) or doc.InsertChildAt(1, c), C will always be inserted before B; whereas I need it after B but before D (after I've removed C beforehand).

Was I using InsertChildAt incorrectly, or is InsertChild() just not replaceable in my case? Thx.

Why Text() use 'break' but not 'continue'?

func (e *Element) Text() string {
	if len(e.Child) == 0 {
		return ""
	}

	text := ""
	for _, ch := range e.Child {
		if cd, ok := ch.(*CharData); ok {
			if text == "" {
				text = cd.Data
			} else {
				text += cd.Data
			}
		} else {
			break
		}
	}
	return text
}

When I used this function to get the CharData in a tag, a problem occurred: if an element has two children and the first is not CharData, the second is never checked.
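
To make the difference concrete, here is a small self-contained model of the two behaviours. The types token, charData and element are hypothetical stand-ins rather than etree's own types, so treat it as an illustration of break versus continue, not of the library's internals.

```go
package main

import "fmt"

// token, charData and element are hypothetical stand-ins for a parsed tree.
type token interface{}

type charData struct{ data string }

type element struct{ children []token }

// leadingText stops at the first child that is not character data,
// which is the effect of the break in the loop quoted above.
func leadingText(e *element) string {
	text := ""
	for _, ch := range e.children {
		cd, ok := ch.(*charData)
		if !ok {
			break
		}
		text += cd.data
	}
	return text
}

// allText skips non-text children and keeps going, as a continue would.
func allText(e *element) string {
	text := ""
	for _, ch := range e.children {
		if cd, ok := ch.(*charData); ok {
			text += cd.data
		}
	}
	return text
}

func main() {
	// First child is an element, second child is character data.
	e := &element{children: []token{&element{}, &charData{"tail"}}}
	fmt.Printf("%q %q\n", leadingText(e), allText(e))
}
```

The break variant returns only the text that precedes the first child element, which may well be intentional; the continue variant also picks up text that follows child elements.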

Regarding the inspiration for this package

This is not like etree.

Invalid memory address or nil pointer when path doesn't exist

Nice work with this package!
I have a question regarding this scenario:

doc.FindElement("//This/Element/Does/Not/Exists")

Is there a way to check that this path actually exists? Currently I get:

--- FAIL: TestXMLResp (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x6618b4]

Unable to extract Attribute Value w/ Path

Consider the following document partial (source: https://community.cablelabs.com/wiki/plugins/servlet/cablelabs/alfresco/download?id=8f900e8b-d1eb-4834-bd26-f04bd623c3d2 , Appendix I.1)

<?xml version="1.0" ?>
<ADI>
  <Metadata>
    <AMS Provider="InDemand" Product="First-Run" Asset_Name="The_Titanic" Version_Major="1" Version_Minor="0" Description="The Titanic asset package" Creation_Date="2002-01-11" Provider_ID="indemand.com" Asset_ID="UNVA2001081701004000" Asset_Class="package"/>
    <App_Data App="MOD" Name="Provider_Content_Tier" Value="InDemand1" />
    <App_Data App="MOD" Name="Metadata_Spec_Version" Value="CableLabsVod1.1" />
  </Metadata>
</ADI>

While I can use a Path like //AMS[@Asset_Class='package']/../App_Data[@Name='Provider_Content_Tier'] to get to a desired Element, I am not able to perform an XPath-style path search to extract just the data in the Value attribute for the identified elements as a []string. Most other XPath implementations support a path such as //AMS[@Asset_Class='package']/../App_Data[@Name='Provider_Content_Tier']/@Value to extract attribute values directly from the Path.

This would be a really great feature to have to allow us to port a legacy app over to Go, without having to refactor our existing paths that perform the attribute extractions.

I'll take a stab at implementing it in the coming days.

Support descendant specifier in predicates

Given the following xml document

<root>
  <a><b><c>...</c></b></a>
  <a><b>...</b></a>
  <a><b>...</b></a>
  <a><b>...</b></a>
</root>

It seems like there is currently no way to write an XPath expression that selects only those nodes that have a grandchild somewhere below them. I'm looking for support for something like this:
//a[.//c] or alternatively //a[b/c].
Currently, such XPath expressions fail with etree: path has invalid filter [brackets].

Remove element, leave whitespace

In order to adhere to some stupid API, I have to provide some XML with a blank space where an element will be placed.

Is it possible to remove an element and leave its whitespace in the document? I've tried removing an element and then inserting etree.NewElement("") but that inserts </>.

If there's a way I can do it using this library, awesome, if not I'll have to just manipulate it as a text file

can't find class (perhaps a bug)

This is the HTML (a piece of material design light)

  <body>
    <!-- Wide card with share menu button -->
    <div class="demo-card-wide mdl-card mdl-shadow--2dp">
      <div class="mdl-card__title">
        <h2 class="mdl-card__title-text">Welcome</h2>
      </div>
      <div class="mdl-card__supporting-text">
        Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        Mauris sagittis pellentesque lacus eleifend lacinia...
      </div>
      <div class="mdl-card__actions mdl-card--border">
        <a class="mdl-button mdl-button--colored mdl-js-button mdl-js-ripple-effect">
          Get Started
        </a>
      </div>
      <div class="mdl-card__menu">
        <button class="mdl-button mdl-button--icon mdl-js-button mdl-js-ripple-effect">
          <i class="material-icons">share</i>
        </button>
      </div>
    </div>
    
        <!-- Wide card with share menu button -->
    <div class="demo-card-wide mdl-card mdl-shadow--2dp">
      <div class="mdl-card__title">
        <h2 class="mdl-card__title-text">Welcome</h2>
      </div>
      <div class="mdl-card__supporting-text">
        Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        Mauris sagittis pellentesque lacus eleifend lacinia...
      </div>
      <div class="mdl-card__actions mdl-card--border">
        <a class="mdl-button mdl-button--colored mdl-js-button mdl-js-ripple-effect">
          Get Started
        </a>
      </div>
      <div class="mdl-card__menu">
        <button class="mdl-button mdl-button--icon mdl-js-button mdl-js-ripple-effect">
          <i class="material-icons">share</i>
        </button>
      </div>
    </div>
  </body>

and I want to check the number of "cards" on the page.
I can do that with the following xpath

//*[contains(concat(' ', normalize-space(@class), ' '), ' mdl-card ')]

(I've tested this with https://www.freeformatter.com/xpath-tester.html)
It works... and shows the 2 elements, and I can see in the html it has 2.

And now for my code: (memHtml holds the HTML in memory...)

	docLoc := etree.NewDocument()
	if err := docLoc.ReadFromString(memHtml); err != nil {
		panic(err)
	}

	counter := len(docLoc.FindElements(element))
	fmt.Println("counter", counter)

And it prints 0 (zero)

btw:
I also tried the same XPath with "(" and ")" around it, which is also valid XPath:

(//*[contains(concat(' ', normalize-space(@class), ' '), ' mdl-card ')])

But then, etree complains something about brackets.

Consider returning an Error in SelectElement if the element was not found

Currently, SelectElement doesn't return anything indicating an error while trying to select an element in the XML string.
Trying to do further operations on the returned node causes the program to Panic with SEGFAULT.

For example:
Consider the following XML:
data := "

vmx

"

Now, if I try to run the following code:
xmlDoc := etree.NewDocument()
XMLReadError := xmlDoc.ReadFromString(capabilitiesXML)
if XMLReadError != nil {
	log.Printf("Unable to read the Capabilities XML: %s", XMLReadError)
	return
}
root := xmlDoc.SelectElement("host")
cpu := root.SelectElement("cpu")

The above code causes the program to panic with a SEGFAULT.

Returning an error when the requested element is not found seems a better way to avoid these kinds of issues. Alternatively, the documentation could state that nil is returned when the element is not found, so that callers know to check for it.

About the TestAddChild()

About the TestAddChild(),

Lines 316 to 318 are:

	testdoc := `<book lang="en">
  <t:title>Great Expectations</t:title>
  <author>Charles Dickens</author>
`

I.e., the <t:title> and <author> are not enclosed within <book> but parallel to it. Right? (Ref: L333)
If so, FindElements("//book/*") should return nothing, right? I.e., the root.AddChild(e) result seems incorrect to me.

What am I missing? Thx.

Use `Errorf` instead of `Fail`

The test suite currently uses Fail instead of Errorf (or similar). This is problematic as test failures provide no indication on what went wrong.

I.e. currently a test failure looks like this:

$ go test .
--- FAIL: TestDocument (0.00 seconds)
FAIL
FAIL    github.com/felixge/etree    0.017s

When using Errorf, it would look like this:

$ go test .
--- FAIL: TestDocument (0.00 seconds)
    etree_test.go:58: custom error message printed here
FAIL
FAIL    github.com/felixge/etree    0.018s

I'd be happy to submit a patch that replaces all occurrences of Fail() with more sensible error reporting - let me know.

Nodes of no descendant

The reverse side of question #28: is it possible to select those specific nodes that have no child nodes? Thx.

Attribute namespaces broken.

In etree, Attr.NamespaceURI always returns the containing element's namespace. This is wrong for two reasons:

Unprefixed attributes

Unprefixed attributes get no namespace assigned. This is different from elements. See XML Names 6.2:

The namespace name for an unprefixed attribute name always has no value.

Prefixed attributes

For prefixed attributes, the prefix should be resolved into a URI.

Example

Using a shortened etree test case:

<root xmlns="http://root.example.com" 
      xmlns:attrib="http://attrib.example.com" 
      a="foo" 
      attrib:b="bar" />

Let's use xmlstarlet, which is just a neat libxml2 CLI frontend.

$ xmlstarlet sel -N root=http://root.example.com -N attrib=http://attrib.example.com -N notattrib=http://attrib.example.com \
  -t -m '/root:root' \
  -o '@a[namespace-uri()=""]: ' -v '@a[namespace-uri()=""]' -nl \
  -o '@root:a[namespace-uri()="http://root.example.com"]:' -v '@root:a[namespace-uri()="http://root.example.com"]' -nl \
  -o '@b[namespace-uri()=""]:' -v '@b[namespace-uri()=""]' -nl \
  -o '@root:b[namespace-uri()="http://root.example.com"]:' -v '@root:b[namespace-uri()="http://root.example.com"]' -nl \
  -o '@attrib:b[namespace-uri()="http://attrib.example.com"]: ' -v '@attrib:b[namespace-uri()="http://attrib.example.com"]' -nl \
  -o '@notattrib:b[namespace-uri()="http://attrib.example.com"]: ' -v '@notattrib:b[namespace-uri()="http://attrib.example.com"]' -nl \
  test.xml

Which produces the following output:

@a[namespace-uri()=""]: foo
@root:a[namespace-uri()="http://root.example.com"]:
@b[namespace-uri()=""]:
@root:b[namespace-uri()="http://root.example.com"]:
@attrib:b[namespace-uri()="http://attrib.example.com"]: bar
@notattrib:b[namespace-uri()="http://attrib.example.com"]: bar

Note that without the test for namespace-uri() all of the result except the last one would be correct in etree, since etree does not resolve prefixes for attribute and element selection (a different bug).

EDIT: messed up shell quoting.

Prefix handling in xpath queries does not resolve namespaces

Compare

package main

import (
	"fmt"

	"github.com/beevik/etree"
)

const xmlData = `<root xmlns:b='foo'><a /><b:b /><b:b xmlns:b='bar' /></root>`

func main() {
	doc := etree.NewDocument()
	doc.ReadFromString(xmlData)

	fmt.Printf("%+v\n", doc.FindElements("//b"))
	fmt.Printf("%+v\n", doc.FindElements("//b:b"))
}

which produces

$ go run showcase.go 
[0xc0000b4240 0xc0000b42a0]
[0xc0000b4240 0xc0000b42a0]

with

import xml.etree.ElementTree as ET
import io

XML_DATA = "<root xmlns:b='foo'><a /><b:b /><b:b xmlns:b='bar' /></root>"

doc = ET.parse(io.StringIO(XML_DATA))

print(doc.findall('./b'))
# Fails, prefix b not defined
# print(doc.findall('.//b:b'))
print(doc.findall('./b:b', {'b': 'foo'}))
print(doc.findall('./b:b', {'b': 'bar'}))
# different prefix, still finds the same element!
print(doc.findall('./c:b', {'c': 'bar'}))

which results in

$ python showcase.py 
[]
[<Element '{foo}b' at 0x7f0701e43e90>]
[<Element '{bar}b' at 0x7f0701e43ef0>]
[<Element '{bar}b' at 0x7f0701e43ef0>]

Note that in the Go version, both queries return both elements that have b as their local name, and prefixes are only compared as text strings. The Python version is correct with regard to namespaces, since:

- the unnamed namespace does not match any other namespace
- prefixes are resolved to namespace URIs. This implies that prefixes in XPath expressions have to be defined first. After that, the actual prefix does not matter, only the backing namespace URI

It would be nice if your etree package offered similar features. Searching by prefix only is a blocker when receiving XML documents whose prefixes are unknown (like the output of Go's XML encoder, which uses strange, but correct, prefix names and placement).
How would you search for an XML element by namespace at all?
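
For comparison, Go's standard encoding/xml decoder does resolve prefixes: the Name.Space field of each start element holds the namespace URI, not the prefix. The sketch below uses only the standard library and is not an etree feature; countByName is a hypothetical helper.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

const xmlData = `<root xmlns:b='foo'><a /><b:b /><b:b xmlns:b='bar' /></root>`

// countByName counts start elements whose resolved namespace URI and local
// name match, regardless of which prefix the document used.
func countByName(doc, space, local string) (int, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	n := 0
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Space == space && se.Name.Local == local {
			n++
		}
	}
}

func main() {
	for _, uri := range []string{"foo", "bar", ""} {
		n, err := countByName(xmlData, uri, "b")
		fmt.Printf("namespace %q: %d element(s), err=%v\n", uri, n, err)
	}
}
```

Matching on the URI means a document that binds a different prefix to the same namespace is handled the same way.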

Walking mechanism

As I understand, ReadFromFile stores the entire file's content in memory before we can do whatever we have to do with the nodes.
This is not ideal when dealing with very large files.

Is there currently a way to process the nodes as we walk through the file, therefore avoiding the need to store it all in memory at once? Hope that makes sense
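
Outside of etree, the standard library's xml.Decoder can be used in a streaming fashion: read tokens one at a time and decode each record as it is encountered, so only one record is held in memory. This is a sketch with an assumed document shape; the book struct, the element names and the helper functions are all hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// book is a hypothetical record type for the elements being streamed.
type book struct {
	Title string `xml:"title"`
}

// streamBooks calls handle for every <book> element as it is read, without
// building a tree for the whole document.
func streamBooks(r io.Reader, handle func(book)) error {
	dec := xml.NewDecoder(r)
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "book" {
			var b book
			// DecodeElement consumes the input up to the matching end tag.
			if err := dec.DecodeElement(&b, &se); err != nil {
				return err
			}
			handle(b)
		}
	}
}

// bookTitles is a convenience wrapper that collects the titles from a string.
func bookTitles(doc string) ([]string, error) {
	var titles []string
	err := streamBooks(strings.NewReader(doc), func(b book) {
		titles = append(titles, b.Title)
	})
	return titles, err
}

func main() {
	doc := `<bookstore><book><title>A</title></book><book><title>B</title></book></bookstore>`
	fmt.Println(bookTitles(doc))
}
```

In real use the reader would be an open file rather than a string, and handle would process each record instead of collecting it.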

FindElements

Hi,

I have xml like:

<nodes>
    <node>
      <nodeID>2</nodeID>
      <args>
        <arg0>
          <source>1</source>
        </arg0>
        <arg1>
          <source>2</source>
        </arg1>
      </args>
    </node>
</nodes>

I am trying to get "arg*" elements. Not sure how to phrase this. Awesome library by the way. I'm just having to deal with some bad xml at the moment.

    doc := etree.NewDocument()
    if err := doc.ReadFromFile(filename); err != nil {
        panic(err)
    }

    nodes := doc.SelectElement("nodes")
    for _, node := range nodes.SelectElements("node") {
        args := node.SelectElement("args")

        for i, arg := range args.FindElements("arg*") {
            // blah
        }
    }

Thanks!

New line after BOM

doc.WriteTo adds an extra newline after the BOM. I've illustrated it with et_dump.go and et_dump.xml, which you can find under https://github.com/suntong/lang/blob/master/lang/Go/src/xml/.

Here is the result:

$ go run et_dump.go | diff -wU 1 et_dump.xml -
--- et_dump.xml 2016-03-08 16:40:41.667010100 -0500
+++ -   2016-03-08 16:40:57.842603083 -0500
@@ -1,4 +1,4 @@
-ï»¿<?xml version="1.0" encoding="utf-8"?>
+ï»¿
+<?xml version="1.0" encoding="utf-8"?>
 <bookstore xmlns:p="urn:schemas-books-com:prices">
-
   <book category="COOKING">
@@ -9,3 +9,2 @@
   </book>
-
   <book category="CHILDREN">
...
@@ -34,3 +31,2 @@
   </book>
-
 </bookstore>
\ No newline at end of file

I.e., an extra newline is added after the BOM. This seems to be a trivial issue, but it causes Microsoft Visual Studio to fail to recognize the webtest file that such a dump creates. :-(

Please consider removing the added extra new line.

Thanks

Please skip BOM

When reading from file (via ReadFrom() or ReadFromFile()), is it possible to skip the BOM
(https://en.wikipedia.org/wiki/Byte_order_mark) char?

Every file created by MS under Windows has that cursed char, which is very hard to get rid of.
So it would be great if etree could skip it when reading from a file.

The following file will fail:

$ cat et_example.xml | hexdump -C
00000000  ff fe 3c 00 62 00 6f 00  6f 00 6b 00 73 00 74 00  |..<.b.o.o.k.s.t.|
00000010  6f 00 72 00 65 00 3e 00  0d 00 0a 00 20 00 3c 00  |o.r.e.>..... .<.|
...

with the following error

panic: XML syntax error on line 1: invalid UTF-8

Hmm, wait, is it because of BOM or the UTF16 encoding?

thx
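
Judging from the hexdump, the file starts with ff fe, the UTF-16 little-endian byte order mark, and every character is followed by a 00 byte, so the "invalid UTF-8" error is most likely caused by the UTF-16 encoding rather than by the BOM alone. A possible pre-processing step, sketched with the standard library only (toUTF8 and decodeUTF16 are hypothetical helpers), is to convert the bytes to UTF-8 before parsing:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"unicode/utf16"
)

// toUTF8 inspects the leading byte order mark and returns UTF-8 bytes
// without it. Input with no BOM is assumed to be UTF-8 already.
func toUTF8(data []byte) ([]byte, error) {
	switch {
	case bytes.HasPrefix(data, []byte{0xEF, 0xBB, 0xBF}): // UTF-8 BOM
		return data[3:], nil
	case bytes.HasPrefix(data, []byte{0xFF, 0xFE}): // UTF-16 little endian
		return decodeUTF16(data[2:], binary.LittleEndian)
	case bytes.HasPrefix(data, []byte{0xFE, 0xFF}): // UTF-16 big endian
		return decodeUTF16(data[2:], binary.BigEndian)
	}
	return data, nil
}

// decodeUTF16 converts raw UTF-16 bytes in the given byte order to UTF-8.
func decodeUTF16(b []byte, order binary.ByteOrder) ([]byte, error) {
	if len(b)%2 != 0 {
		return nil, errors.New("odd number of bytes in UTF-16 input")
	}
	units := make([]uint16, len(b)/2)
	for i := range units {
		units[i] = order.Uint16(b[2*i:])
	}
	return []byte(string(utf16.Decode(units))), nil
}

func main() {
	// Same layout as the hexdump above: FF FE, then 2-byte little-endian units.
	out, err := toUTF8([]byte{0xFF, 0xFE, '<', 0, 'b', 0, '>', 0})
	fmt.Printf("%s %v\n", out, err)
}
```

If the document also carries an XML declaration with encoding="utf-16", the parser may still ask for a charset reader after this conversion, so the declaration might need adjusting as well.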

Need/How to set prefix

When outputting xml, the indentation is not the only control necessary -- prefix string is necessary as well, for the xml chunks that are not starting at the root level. Ref: https://godoc.org/encoding/xml#MarshalIndent, which has two controls: prefix, indent string.

Any way to find elements ignoring namespaces?

Sometimes I have to deal with non-standard implementations and I don't always know what Namespaces are going to be in there. Is there a possibility of adding support of finding elements with a namespace of *?

Question: regarding adding content to a document

In https://golang.org/pkg/encoding/xml/ there is support for a tag ",innerxml", where the field is written verbatim. Is there support in etree for something similar?

For example:

doc.CreateInnerXML("<users/>")

From https://golang.org/pkg/encoding/xml/#Marshal

The XML element for a struct contains marshaled elements for each of the exported fields of the struct, with these exceptions:

- the XMLName field, described above, is omitted.
- a field with tag "-" is omitted.
- a field with tag "name,attr" becomes an attribute with
  the given name in the XML element.
- a field with tag ",attr" becomes an attribute with the
  field name in the XML element.
- a field with tag ",chardata" is written as character data,
  not as an XML element.
- a field with tag ",cdata" is written as character data
  wrapped in one or more <![CDATA[ ... ]]> tags, not as an XML element.
- a field with tag ",innerxml" is written verbatim, not subject
  to the usual marshaling procedure.
- a field with tag ",comment" is written as an XML comment, not
  subject to the usual marshaling procedure. It must not contain
  the "--" string within it.
- a field with a tag including the "omitempty" option is omitted
  if the field value is empty. The empty values are false, 0, any
  nil pointer or interface value, and any array, slice, map, or
  string of length zero.
- an anonymous struct field is handled as if the fields of its
  value were part of the outer struct.

Cannot find element with ":" in the name.

XML structure:

<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="uuid_id" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:identifier opf:scheme="uuid" id="uuid_id"></dc:identifier>
    <dc:title></dc:title>
    <dc:creator opf:role="aut"></dc:creator>
    <dc:language>eng</dc:language>
  </metadata>
</package>

Go code:

doc.FindElement("/package/metadata/*[1]").SetText("test") //fails
doc.FindElement("/package/metadata/dc:identifier").SetText("test") //fails

Exposing more API functions

I'd think it'd be helpful to expose the following API functions

doc.NewDocumentFromString(s string), based on:

etree/etree_test.go

Lines 14 to 22 in 4ec1305

func newDocumentFromString(t *testing.T, s string) *Document {
	t.Helper()
	doc := NewDocument()
	err := doc.ReadFromString(s)
	if err != nil {
		t.Error("etree: failed to parse document")
	}
	return doc
}

and,

Element.WriteTo(w *bufio.Writer, s *WriteSettings), for debugging purposes, based on:

etree/etree.go

Lines 1036 to 1060 in 4ec1305

func (e *Element) writeTo(w *bufio.Writer, s *WriteSettings) {
	w.WriteByte('<')
	w.WriteString(e.FullTag())
	for _, a := range e.Attr {
		w.WriteByte(' ')
		a.writeTo(w, s)
	}
	if len(e.Child) > 0 {
		w.WriteString(">")
		for _, c := range e.Child {
			c.writeTo(w, s)
		}
		w.Write([]byte{'<', '/'})
		w.WriteString(e.FullTag())
		w.WriteByte('>')
	} else {
		if s.CanonicalEndTags {
			w.Write([]byte{'>', '<', '/'})
			w.WriteString(e.FullTag())
			w.WriteByte('>')
		} else {
			w.Write([]byte{'/', '>'})
		}
	}
}

Please consider. thx.

panic: XML syntax error on line 30: unescaped < inside quoted string [recovered]

this is my code:

<if test="page < 0" >limit #{page}, #{size}</if>

The XML cannot be loaded when '<' appears in an element's attribute value. I think it should be parsed, but instead it
will panic: XML syntax error on line 30: unescaped < inside quoted string [recovered]
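
The error text is produced by Go's encoding/xml decoder, which rejects a raw '<' inside an attribute value; the XML specification requires it to be written as &lt; there. A small standard-library check (checkXML is a hypothetical helper, and the element content is shortened) shows the difference:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// checkXML reads every token and returns the first syntax error, if any.
func checkXML(s string) error {
	dec := xml.NewDecoder(strings.NewReader(s))
	for {
		_, err := dec.Token()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
	}
}

func main() {
	fmt.Println(checkXML(`<if test="page < 0">limit</if>`))    // raw '<' in the attribute
	fmt.Println(checkXML(`<if test="page &lt; 0">limit</if>`)) // escaped form
}
```

Escaping the attribute value in the source file, or generating the file with a tool that escapes it, avoids the error.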

Text escaping not correct

The valid character range for XML is specified as:

Char := #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

When producing XML using WriteTo a set of strings.Replacer objects in etree.go are used to escape strings.

However, these replacers fail to replace some characters that are not valid in XML (e.g. 0xB vertical tab).

This leads to invalid XML being produced.

Go's standard encoding/xml package replaces invalid characters with \uFFFD (the Unicode replacement character), which appears to be common practice across other XML libraries.

I propose using xml.EscapeText instead of the current approach based on strings.NewReplacer(...).
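
As a quick illustration of the proposed behaviour, this sketch runs xml.EscapeText on a string containing a vertical tab (0xB). The wrapper name escape is hypothetical; only the standard library is used.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// escape returns s escaped for use as XML character data. Characters outside
// the XML Char range are replaced with U+FFFD by xml.EscapeText.
func escape(s string) (string, error) {
	var buf bytes.Buffer
	if err := xml.EscapeText(&buf, []byte(s)); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := escape("a\x0bb <c>")
	fmt.Printf("%q %v\n", out, err)
}
```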

RemoveChild() Example

Hi @beevik,

Can you give an example using RemoveChild() please?

Also, I searched the _test files, and it seems that there is no test coverage for RemoveChild() either.

thx

Bug/question nested FindElements

Hi, I have some code where I am using FindElements to get a slice of response elements, then calling FindElement on those to get their children. I am getting the same result for each iteration of the loop, i.e. the href is always "/1" instead of "/1" for the first response, then "/2" for the next, etc. Have I done something wrong or misunderstood the API? Or is this a bug?

package example

import (
	"testing"

	"fmt"
	"github.com/beevik/etree"
)

func TestXML(t *testing.T) {
	xml := `<multistatus xmlns="DAV:">
	<response>
	<href>/1</href>
	<propstat>
	<prop>
	<getetag>A</getetag>
	</prop>
	<status>HTTP/1.1 200 OK</status>
	</propstat>
	</response>
	<response>    
	<href>/2</href>                                    
	<propstat>                                                                                                                          
	<prop>                                                                    
	<getetag>B</getetag>
	</prop>                                                                                                                     
	<status>HTTP/1.1 200 OK</status>                                        
	</propstat>
	</response>
	<response>
	<href>/3</href>
	<propstat>
	<prop>
	<getetag>C</getetag>
	</prop>
	<status>HTTP/1.1 200 OK</status>
	</propstat>
	</response>
	<response>
	<href>/4</href>
	<propstat>
	<prop>
	<getetag>D</getetag>
	</prop>
	<status>HTTP/1.1 200 OK</status>
	</propstat>
	</response>
	<response>
	<href>/5</href>
	<propstat>
	<prop>
	<getetag>E</getetag>
	</prop>
	<status>HTTP/1.1 200 OK</status>
	</propstat>
	</response>
	</multistatus>`

	doc := etree.NewDocument()
	if err := doc.ReadFromString(xml); err != nil {
		t.Fatalf("Failed to parse xml: %s", err.Error())
	}

	responseElements := doc.FindElements("//response")
	if len(responseElements) != 5 {
		t.Fatalf("Expected 5 response elements, got %d", len(responseElements))
	}

	for n, el := range responseElements {
		hrefEl := el.FindElement("//href")
		etagEl := el.FindElement("//getetag")

		if hrefEl != nil && etagEl != nil {

			href := hrefEl.Text()
			expectedHref := fmt.Sprintf("/%d", 1+n)
			if href != expectedHref {
				t.Fatalf("Expected href %s, got %s", expectedHref, href)
			}

			etag := etagEl.Text()
			expectedEtag := fmt.Sprintf("%c", 'A'+n)
			if etag != expectedEtag {
				t.Fatalf("Expected etag %s, got %s", expectedEtag, etag)
			}
		} else {
			t.Fatalf("Missing href and/or etag")
		}
	}
}

panic: xml: encoding "ISO-8859-1" declared but Decoder.CharsetReader is nil same as #48

Hi @beevik
After running some tests, I got the same error as reported in issue #48.
Here is a sample of code + xml data :

var xmlDoc = `<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE broadWorksCDR>
<broadWorksCDR version="19.0">
  <cdrData>
    <headerModule>
      <recordId>
        <eventCounter>0002183384</eventCounter>
        <systemId>CGRateSaabb</systemId>
        <date>20160419210000.104</date>
        <systemTimeZone>1+020000</systemTimeZone>
      </recordId>
      <type>Start</type>
    </headerModule>
  </cdrData>
</broadWorksCDR>
`
doc := etree.NewDocument()
if err := doc.ReadFromBytes([]byte(xmlDoc)); err != nil {
	t.Error(err)
}

First I tried doc.ReadFromString and then doc.ReadFromBytes, but got the error both times.

Thanks,
TeoV

Some XML prefixes have a colon

For instance the CardDAV prefix is urn:ietf:params:xml:ns:carddav. This doesn't play well with spaceDecompose:

etree/helpers.go

Line 144 in 23e6ba8

func spaceDecompose(str string) (space, key string) {

The standard Go library is using a space character, probably because of this. Thoughts?

SelectElements doesn't handle namespace properly

Given this document:

<?xml version="1.0" encoding="utf-8" ?>
<D:propfind xmlns:D="DAV:">
  <D:prop xmlns:R="http://ns.example.com/boxschema/">
    <R:bigbox/>
    <R:author/>
    <R:DingALing/>
    <R:Random/>
  </D:prop>
</D:propfind>

doc.Root().SelectElements("DAV:prop") returns zero elements
doc.Root().SelectElements("D:prop") returns one element

However the D prefix is arbitrary, it could be set to any token really (e.g. xmlns:myawesomeprefix="DAV:"). SelectElements shouldn't care about it, it should resolve the prefix depending on xmlns attributes.

https://play.golang.org/p/I8XUmLjY9pX

Problem parsing CDATA after newline

Thanks a ton for this package - super useful for my work.

I'm parsing some RSS feeds that contain formatted HTML wrapped in <![CDATA[ ... ]]> sections for post descriptions, content, etc. It looks like when the CDATA section is preceded by a newline, the text can't be parsed out:

	workingCDATAString := `
	<rss>
		<channel>
			<item>
		   		<summary><![CDATA[Sup]]></summary>
			</item>
		</channel>
	</rss>
	`

	doc := etree.NewDocument()
	doc.ReadFromString(workingCDATAString)
	spew.Dump(doc.FindElement("rss").FindElement("channel").FindElement("item").FindElement("summary").Text())
	// Output: (string) (len=3) "Sup"

	brokenCDATAString := `
	<rss>
		<channel>
			<item>
		   		<summary>
			 		<![CDATA[Sup]]>
				</summary>
			</item>
		</channel>
	</rss>
	`
	doc = etree.NewDocument()
	doc.ReadFromString(brokenCDATAString)
	spew.Dump(doc.FindElement("rss").FindElement("channel").FindElement("item").FindElement("summary").Text())
	// Output: (string) (len=7) "\n\t\t\t \t\t"

I'm not familiar with XML parsing enough to say that this isn't the intended behavior, but I would expect these two code blocks to output the same thing ("Sup"). Any ideas?
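
One likely explanation is that the whitespace before the CDATA section and the CDATA content arrive as separate character-data tokens, and only the first is returned. A possible workaround is to concatenate every character-data token inside the element and trim the result. The sketch below shows this with the standard encoding/xml tokenizer rather than etree; elementText is a hypothetical helper, and it assumes the tag name is not nested inside itself.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// elementText (hypothetical helper) joins all character data found inside
// elements named tag, including CDATA sections, and trims the whitespace.
func elementText(doc, tag string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var sb strings.Builder
	inside := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == tag {
				inside = true
			}
		case xml.EndElement:
			if t.Name.Local == tag {
				inside = false
			}
		case xml.CharData:
			if inside {
				sb.Write(t)
			}
		}
	}
	return strings.TrimSpace(sb.String()), nil
}

func main() {
	doc := "<rss><channel><item><summary>\n\t<![CDATA[Sup]]>\n</summary></item></channel></rss>"
	fmt.Println(elementText(doc, "summary"))
}
```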

Copy()

Thanks for this library. Would you be interested in a patch that adds a Copy() method to Document / Element?

In the XMPP server I'm working on, one sometimes has to duplicate an incoming XML message and send out slightly modified versions of it.

I realize that I can implement Copy() in userland, but since it would probably be useful to others, I hope you're interested in a patch.
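
A userland version could look roughly like the following recursive deep copy. The node type here is a simplified, hypothetical stand-in for an element tree, not etree's Element, so this only sketches the idea.

```go
package main

import "fmt"

// node is a hypothetical, simplified element used only for this sketch.
type node struct {
	Tag      string
	Attr     map[string]string
	Text     string
	Children []*node
}

// Copy returns a deep copy: attributes and children are duplicated, so
// changes to the copy never reach the original tree.
func (n *node) Copy() *node {
	if n == nil {
		return nil
	}
	c := &node{Tag: n.Tag, Text: n.Text}
	if n.Attr != nil {
		c.Attr = make(map[string]string, len(n.Attr))
		for k, v := range n.Attr {
			c.Attr[k] = v
		}
	}
	for _, child := range n.Children {
		c.Children = append(c.Children, child.Copy())
	}
	return c
}

func main() {
	msg := &node{
		Tag:      "message",
		Attr:     map[string]string{"to": "alice@example.com"},
		Children: []*node{{Tag: "body", Text: "hello"}},
	}
	dup := msg.Copy()
	dup.Attr["to"] = "bob@example.com"
	fmt.Println(msg.Attr["to"], dup.Attr["to"])
}
```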

How to add missing ProcInst to the document?

I'm creating a converter that modifies a bunch of exported XML files to be imported into a different software.

Currently, I'm using etree to read each file, add or change elements, then write it again to a different folder. But the destination software seems to need the <?xml... header which is missing from the source. But CreateProcInst would not prepend but append it.

How could I prepend the ProcInst?
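
One workaround that does not depend on any particular etree version is to add the declaration to the serialized bytes before writing the file. This is a minimal sketch under that assumption; ensureXMLDecl is a hypothetical helper and it relies only on the standard library's xml.Header constant.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// ensureXMLDecl (hypothetical helper) prepends the standard XML declaration
// unless the output already starts with one. It only checks for "<?xml ",
// so other processing instructions are not mistaken for a declaration.
func ensureXMLDecl(out []byte) []byte {
	if bytes.HasPrefix(bytes.TrimSpace(out), []byte("<?xml ")) {
		return out
	}
	return append([]byte(xml.Header), out...)
}

func main() {
	fmt.Printf("%s\n", ensureXMLDecl([]byte("<export><item/></export>")))
}
```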

Extracting elements with text

Hi,

I need to retrieve all text tokens of an element and its descendants, i.e. a path looking something like //something/text().

Since Element.FindElementsPath() and friends return Elements and not Tokens, it would also be fine to get a list of elements with non-nil text.

I'm ready to make a PR to add this functionality, but I would first like to know how you see it.

Should we add a new function to Element to retrieve Tokens from a path, or should we stick to the existing functions such as Element.FindElementsPath()?
If we keep the existing functions, should we keep the //something/text() syntax or use a different one (since we would actually be retrieving the parents of the text nodes)? In that case, what would it be?

Thanks

Is it a bug that Text() only returns the first child?

Is it a bug that Text() only returns the first child?
For the element below, it only returns half of the content.

            <script type="text/javascript">
                polymer.define(&apos;web.csrf&apos;, function (require) {
                    var token = &quot;<t t-esc="request.csrf_token(None)"></t>&quot;;
                    require(&apos;web.core&apos;).csrf_token = token;
                    require(&apos;qweb&apos;).default_dict.csrf_token = token;
                });
            </script>

how to edit ProcInst

I want to change the encoding from gb2312 to utf-8,
but I can only find the doc.CreateProcInst() method. Can this package do it?

<?xml version="1.0" encoding="gb2312"?>
<configuration>
    <section name="CommissionManagerConfig" requirePermission="false"/>
</configuration>
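
Independent of etree, the declared encoding can be rewritten in the instruction text itself. The sketch below operates on the text between "<?xml" and "?>"; setDeclaredEncoding is a hypothetical helper. Note that it only changes the label: the document bytes still have to be transcoded from gb2312 to UTF-8 separately.

```go
package main

import (
	"fmt"
	"regexp"
)

// encodingAttr matches the encoding pseudo-attribute with either quote style.
var encodingAttr = regexp.MustCompile(`encoding\s*=\s*("[^"]*"|'[^']*')`)

// setDeclaredEncoding (hypothetical helper) replaces the declared encoding
// in the text of an XML declaration and leaves everything else unchanged.
func setDeclaredEncoding(inst, enc string) string {
	return encodingAttr.ReplaceAllLiteralString(inst, `encoding="`+enc+`"`)
}

func main() {
	fmt.Println(setDeclaredEncoding(`version="1.0" encoding="gb2312"`, "utf-8"))
}
```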

XPath for attributes?

Is there any way to query an attribute with XPath?
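
As far as I know, etree path filters such as [@attrib='val'] select elements by attribute rather than returning attribute values, so the value still has to be read from the matched element. For comparison, here is a standard-library sketch that collects attribute values directly; attrValues is a hypothetical helper, not an etree function.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// attrValues (hypothetical helper) returns the value of attr on every
// element named element, in document order.
func attrValues(doc, element, attr string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var vals []string
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return vals, nil
		}
		if err != nil {
			return nil, err
		}
		se, ok := tok.(xml.StartElement)
		if !ok || se.Name.Local != element {
			continue
		}
		for _, a := range se.Attr {
			if a.Name.Local == attr {
				vals = append(vals, a.Value)
			}
		}
	}
}

func main() {
	doc := `<configuration><section name="CommissionManagerConfig" requirePermission="false"/></configuration>`
	fmt.Println(attrValues(doc, "section", "name"))
}
```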

Allow python like XML mixed content processing

I was porting some code from Python, and mixed content was difficult or impossible to handle; see PR #60.

	// newDocumentFromString parses s and reports a test error on failure.
	func newDocumentFromString(t *testing.T, s string) *Document {
		t.Helper()
		doc := NewDocument()
		err := doc.ReadFromString(s)
		if err != nil {
			t.Error("etree: failed to parse document")
		}
		return doc
	}

	// writeTo serializes the element, its attributes, and its children.
	func (e *Element) writeTo(w *bufio.Writer, s *WriteSettings) {
		w.WriteByte('<')
		w.WriteString(e.FullTag())
		for _, a := range e.Attr {
			w.WriteByte(' ')
			a.writeTo(w, s)
		}
		if len(e.Child) > 0 {
			w.WriteString(">")
			for _, c := range e.Child {
				c.writeTo(w, s)
			}
			w.Write([]byte{'<', '/'})
			w.WriteString(e.FullTag())
			w.WriteByte('>')
		} else {
			if s.CanonicalEndTags {
				// Emit <tag></tag> instead of a self-closing tag.
				w.Write([]byte{'>', '<', '/'})
				w.WriteString(e.FullTag())
				w.WriteByte('>')
			} else {
				w.Write([]byte{'/', '>'})
			}
		}
	}
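
For readers unfamiliar with the Python model, mixed content there is split into an element's text (character data before its first child) and tail (character data after its end tag). The following is a minimal standard-library sketch of that model; the node type and parseMixed are hypothetical names and do not reflect how PR #60 implements it.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// node (hypothetical) mirrors Python's ElementTree: Text precedes the first
// child, Tail follows the element's own end tag.
type node struct {
	Tag      string
	Text     string
	Tail     string
	Children []*node
}

// parseMixed (hypothetical helper) builds a text/tail tree from a document.
func parseMixed(doc string) (*node, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var root *node
	var stack []*node // currently open elements
	var last *node    // most recently closed element; nil right after a start tag
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			n := &node{Tag: t.Name.Local}
			if len(stack) > 0 {
				parent := stack[len(stack)-1]
				parent.Children = append(parent.Children, n)
			} else if root == nil {
				root = n
			}
			stack = append(stack, n)
			last = nil
		case xml.EndElement:
			last = stack[len(stack)-1]
			stack = stack[:len(stack)-1]
		case xml.CharData:
			if len(stack) == 0 {
				continue // ignore character data outside the root
			}
			if last != nil {
				last.Tail += string(t)
			} else {
				stack[len(stack)-1].Text += string(t)
			}
		}
	}
	if root == nil {
		return nil, fmt.Errorf("no root element")
	}
	return root, nil
}

func main() {
	root, err := parseMixed(`<p>Hello <b>bold</b> world</p>`)
	if err != nil {
		panic(err)
	}
	b := root.Children[0]
	fmt.Printf("%q %q %q\n", root.Text, b.Text, b.Tail)
}
```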
