etree's Issues

how to find an element with a specific class and a specific text

Normally, I would do an Xpath query like this one:

//*[contains(concat(' ', normalize-space(@class), ' '), ' LookForClass ')  and text()='TheTextInTheClass']/../..

How should I do this?

In the docs I read:

XPath-like path string. Panics if an invalid path string is supplied.

With the query above I get: etree: path has invalid filter [brackets].

After a lot of trial and error, this is what worked for me:

//*[@class='LookForClass'][text()='TheTextInTheClass']/../..

Perhaps this could be added to the docs as an example to help others.

Need an AddElement method

I need a way to add an etree Element under another etree Element.

Trying to explain in code:

doc := etree.NewDocument()
doc.ReadFromFile("bookstore.xml")
root := doc.SelectElement("bookstore")

Now the root is an etree Element under which are a bunch of <book> XML Elements.

Suppose now I have

docMore := etree.NewDocument()
docMore.ReadFromString(xmlMoreBooks)

The question is how can I add docMore as new entries under the root etree Element?

I think such a feature would be needed by others as well. Please consider adding it.

Thanks

how to get all content of an element

Is something like this possible:

for _, e := range doc.FindElements("./bookstore/book[1]/*") {
    fmt.Printf("%s: %s\n", e.Tag, e.Content())
}

Which would show the content of the given search

    <title lang="en">Everyday Italian</title>
    <author>Giada De Laurentiis</author>
    <year>2005</year>
    <p:price>30.00</p:price>

When I do e.Text() I only get the text value, but I need it all.

Failure to parse '/' in xpath values

When '/' is not used as a path separator in a path, like in the example below, etree's xpath compilation will fail.

//http[@internet='web']//url[@pattern='/web/app' ]

This is due to line 192 in path.go

	for _, s := range strings.Split(path, "/") {

Any suggestions for a fix?
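One possible direction, sketched under assumptions: replace the plain strings.Split call with a splitter that ignores '/' inside quoted strings. The function name splitPath is hypothetical and is not part of etree; it only illustrates the quote-aware splitting idea.

```go
package main

import "fmt"

// splitPath splits an XPath-like string on '/' but leaves any '/' that
// appears inside a single- or double-quoted string untouched.
// Hypothetical helper for illustration; not an etree function.
func splitPath(path string) []string {
	var segs []string
	var quote rune // 0 while outside quotes, otherwise the opening quote
	start := 0
	for i, r := range path {
		switch {
		case quote != 0:
			if r == quote {
				quote = 0 // closing quote found
			}
		case r == '\'' || r == '"':
			quote = r
		case r == '/':
			segs = append(segs, path[start:i])
			start = i + 1
		}
	}
	return append(segs, path[start:])
}

func main() {
	for _, s := range splitPath("//http[@internet='web']//url[@pattern='/web/app']") {
		fmt.Printf("%q\n", s)
	}
}
```

Outside quotes the result matches what strings.Split(path, "/") would produce, including the empty segments for "//", so the rest of a segment-based compiler could stay unchanged.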

etree HTML parser changes node order?

Hi,

I'm currently facing an issue where I can't explain etree's behaviour. The following code demonstrates it. I want to parse an HTML string as illustrated below, change the attribute of an element, and reprint the HTML when done.

# Assumed setup (not shown in the original post): lxml with its HTML parser.
from lxml import etree
parser = etree.HTMLParser()
string = "<p><center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center></p>"
test = etree.fromstring(string, parser)
print(etree.tostring(test, pretty_print=True, method="html"))

I get this output:

<html><body>
<p></p>
<center><code>git clone https://github.com/AlexeyAB/darknet.git</code></center>
</body></html>

As you can see (let's ignore the <html> and <body> tags etree adds), the order of the nodes has been changed! The <p> tag that used to wrap the <center> tag, now loses its content, and that content gets added after the </p> tag closes. Eh?

When I omit the <center> tag, all of a sudden the parsing is done right:

string = "<p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p>"
test = etree.fromstring(string, parser)
print(etree.tostring(test, pretty_print=True, method="html"))

With output:

<html><body><p><code>git clone https://github.com/AlexeyAB/darknet.git</code></p></body></html>

Am I doing something wrong here? I have to use the HTML parser because I get a lot of parsing errors when not using it. I also can't change the order of the <p> and <center> tags, as I read them this way.

How do I change line endings when writing files?

Currently, when reading XML files with CRLF line endings, these will be converted to LF when writing the XML back to disk. How could I force a different line ending? The software that uses those XML files needs CRLF line endings (it expects line breaks in text blocks with CRLF and nothing else).

Text filtering on leaf node

Given an XML like

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<library>
  <!-- Great book. -->
  <book id="b0836217462" available="true">
    <isbn>0836217462</isbn>
    <title lang="en">Being a Dog Is a Full-Time Job</title>
    <quote>I'd dog paddle the deepest ocean.</quote>
    <author id="CMS">
      <name>Charles M Schulz</name>
      <born>1922-11-26</born>
      <dead>2000-02-12</dead>
    </author>
    <character id="PP">
      <name>Peppermint Patty</name>
      <born>1966-08-22</born>
      <qualification>bold, brash and tomboyish</qualification>
    </character>
    <character id="Snoopy">
      <name>Snoopy</name>
      <born>1950-10-04</born>
      <qualification>extroverted</qualification>
      <qualification>beagle</qualification>
    </character>
  </book>
</library>

A query like library/book/character[qualification='beagle']/qualification would return all qualification elements of character for every character with a qualification='beagle'. It'd be good to allow text() XPath queries so that a query like library/book/character/qualification[text()='beagle'] only returns the nodes of type qualification whose text is beagle.

Don't deprecate InsertChild() please

Deprecated: InsertChild is deprecated. Use InsertChildAt instead.

Please don't deprecate InsertChild(), because InsertChildAt won't work for my case:

The xml file that I'm working on has a rigid format of where things are:

<A attr=... >
  <B attr=... />
  <C attr=... />
  <D attr=... />
</A>

B comes before C, which comes before D. I know the order doesn't matter to XML, but I'm tracking the file with version control, so I'd prefer as little change as possible.

Whether I do doc.InsertChildAt(0, c) or doc.InsertChildAt(1, c), C is always inserted before B, whereas I need it after B but before D (having removed C beforehand).

Was I using InsertChildAt incorrectly, or is InsertChild() simply not replaceable in my case? Thx.

Why Text() use 'break' but not 'continue'?

func (e *Element) Text() string {
	if len(e.Child) == 0 {
		return ""
	}

	text := ""
	for _, ch := range e.Child {
		if cd, ok := ch.(*CharData); ok {
			if text == "" {
				text = cd.Data
			} else {
				text += cd.Data
			}
		} else {
			break
		}
	}
	return text
}

When I used this function to get the character data inside a tag, I ran into a problem: if an element has two children and the first is not CharData, the loop stops and never checks the second.

Invalid memory address or nil when path doesn't exists

Nice work with this package!
I have a question regarding this scenario:

doc.FindElement("//This/Element/Does/Not/Exists")

Is there a way to check that this path actually exists? Currently I get:

--- FAIL: TestXMLResp (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x6618b4]

Unable to extract Attribute Value w/ Path

Consider the following document partial (source: https://community.cablelabs.com/wiki/plugins/servlet/cablelabs/alfresco/download?id=8f900e8b-d1eb-4834-bd26-f04bd623c3d2 , Appendix I.1)

<?xml version="1.0" ?>
<ADI>
  <Metadata>
    <AMS Provider="InDemand" Product="First-Run" Asset_Name="The_Titanic" Version_Major="1" Version_Minor="0" Description="The Titanic asset package" Creation_Date="2002-01-11" Provider_ID="indemand.com" Asset_ID="UNVA2001081701004000" Asset_Class="package"/>
    <App_Data App="MOD" Name="Provider_Content_Tier" Value="InDemand1" />
    <App_Data App="MOD" Name="Metadata_Spec_Version" Value="CableLabsVod1.1" />
  </Metadata>
</ADI>

While I can use a Path like //AMS[@Asset_Class='package']/../App_Data[@Name='Provider_Content_Tier'] to get to a desired Element, I am not able to perform an xpath-style path search to extract just the data in the Value attribute for the identified elements as a []string. Most other XPath implementations support a path such as //AMS[@Asset_Class='package']/../App_Data[@Name='Provider_Content_Tier']/@Value to extract attribute values directly from the Path.

This would be a really great feature to have to allow us to port a legacy app over to Go, without having to refactor our existing paths that perform the attribute extractions.

I'll take a stab at implementing it in the coming days.

Support descendant specifier in predicates

Given the following xml document

<root>
  <a><b><c>...</c></b></a>
  <a><b>...</b></a>
  <a><b>...</b></a>
  <a><b>...</b></a>
</root>

It seems there is currently no way to write an XPath expression that selects only those a nodes that have a c descendant somewhere below them. I'm looking for support for something like //a[.//c] or, alternatively, //a[b/c].
Currently, such XPath expressions fail with etree: path has invalid filter [brackets].

Remove element, leave whitespace

In order to adhere to some stupid API, I have to provide some XML with a blank space where an element will be placed.

Is it possible to remove an element and leave its whitespace in the document? I've tried removing an element and then inserting etree.NewElement("") but that inserts </>.

If there's a way I can do it using this library, awesome, if not I'll have to just manipulate it as a text file

can't find class (perhaps a bug)

This is the HTML (a piece of material design light)

  <body>
    <!-- Wide card with share menu button -->
    <div class="demo-card-wide mdl-card mdl-shadow--2dp">
      <div class="mdl-card__title">
        <h2 class="mdl-card__title-text">Welcome</h2>
      </div>
      <div class="mdl-card__supporting-text">
        Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        Mauris sagittis pellentesque lacus eleifend lacinia...
      </div>
      <div class="mdl-card__actions mdl-card--border">
        <a class="mdl-button mdl-button--colored mdl-js-button mdl-js-ripple-effect">
          Get Started
        </a>
      </div>
      <div class="mdl-card__menu">
        <button class="mdl-button mdl-button--icon mdl-js-button mdl-js-ripple-effect">
          <i class="material-icons">share</i>
        </button>
      </div>
    </div>
    
        <!-- Wide card with share menu button -->
    <div class="demo-card-wide mdl-card mdl-shadow--2dp">
      <div class="mdl-card__title">
        <h2 class="mdl-card__title-text">Welcome</h2>
      </div>
      <div class="mdl-card__supporting-text">
        Lorem ipsum dolor sit amet, consectetur adipiscing elit.
        Mauris sagittis pellentesque lacus eleifend lacinia...
      </div>
      <div class="mdl-card__actions mdl-card--border">
        <a class="mdl-button mdl-button--colored mdl-js-button mdl-js-ripple-effect">
          Get Started
        </a>
      </div>
      <div class="mdl-card__menu">
        <button class="mdl-button mdl-button--icon mdl-js-button mdl-js-ripple-effect">
          <i class="material-icons">share</i>
        </button>
      </div>
    </div>
  </body>

and I want to check the number of "cards" on the page.
I can do that with the following xpath

//*[contains(concat(' ', normalize-space(@class), ' '), ' mdl-card ')]

(I've tested this with https://www.freeformatter.com/xpath-tester.html)
It works... and shows the 2 elements, and I can see in the html it has 2.

And now for my code: (memHtml holds the HTML in memory...)

	docLoc := etree.NewDocument()
	if err := docLoc.ReadFromString(memHtml); err != nil {
		panic(err)
	}

	counter := len(docLoc.FindElements(element))
	fmt.Println("counter", counter)

And it prints 0 (zero)

btw:
I also tried the same XPath wrapped in parentheses, which is also valid XPath:

(//*[contains(concat(' ', normalize-space(@class), ' '), ' mdl-card ')])

But then etree complains about brackets.
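As a workaround sketch that does not depend on XPath functions such as contains(), the class test can be done in Go code: split the class attribute on whitespace and compare tokens. The example below uses only the standard encoding/xml package and assumes the markup is well-formed XML; hasClass and countByClass are hypothetical helper names.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// hasClass reports whether the space-separated class attribute value
// contains token, mirroring the XPath idiom
// contains(concat(' ', normalize-space(@class), ' '), ' token ').
func hasClass(classAttr, token string) bool {
	for _, c := range strings.Fields(classAttr) {
		if c == token {
			return true
		}
	}
	return false
}

// countByClass streams through well-formed markup and counts the elements
// whose class attribute contains token.
func countByClass(doc, token string) (int, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	n := 0
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		start, ok := tok.(xml.StartElement)
		if !ok {
			continue
		}
		for _, a := range start.Attr {
			if a.Name.Local == "class" && hasClass(a.Value, token) {
				n++
			}
		}
	}
}

func main() {
	const page = `<body>
  <div class="demo-card-wide mdl-card mdl-shadow--2dp"><h2 class="mdl-card__title-text">Welcome</h2></div>
  <div class="demo-card-wide mdl-card mdl-shadow--2dp"><h2 class="mdl-card__title-text">Welcome</h2></div>
</body>`
	n, err := countByClass(page, "mdl-card")
	fmt.Println(n, err) // prints "2 <nil>"
}
```

Token comparison matters here: a substring test would also count mdl-card__title-text, which is a different class.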

Consider returning an Error in SelectElement if the element was not found

Currently, SelectElement doesn't return anything indicating an error while trying to select an element in the XML string.
Trying to do further operations on the returned node causes the program to Panic with SEGFAULT.

For example:
Consider the following XML:
data := "


vmx


"

Now, if I try to run the following code:
xmlDoc := etree.NewDocument()
XMLReadError := xmlDoc.ReadFromString(capabilitiesXML)
if XMLReadError != nil {
	log.Printf("Unable to read the Capabilities XML: %s", XMLReadError)
	return
}
root := xmlDoc.SelectElement("host")
cpu := root.SelectElement("cpu")

The above code causes the program to panic with a SEGFAULT.

Returning an error when the element is not found seems a better approach to avoid this kind of issue. Alternatively, the documentation could state that nil is returned when the element is not found and that callers need to check for it.

Use `Errorf` instead of `Fail`

The test suite currently uses Fail instead of Errorf (or similar). This is problematic as test failures provide no indication of what went wrong.

I.e. currently a test failure looks like this:

$ go test .
--- FAIL: TestDocument (0.00 seconds)
FAIL
FAIL    github.com/felixge/etree    0.017s

When using Errorf, it would look like this:

$ go test .
--- FAIL: TestDocument (0.00 seconds)
    etree_test.go:58: custom error message printed here
FAIL
FAIL    github.com/felixge/etree    0.018s

I'd be happy to submit a patch that replaces all occurrences of Fail() with more sensible error reporting - let me know.

Nodes of no descendant

As the reverse of question #28: is it possible to select only those nodes that have no child nodes? Thx.

Attribute namespaces broken.

In etree, Attr.NamespaceURI always returns the containing element's namespace. This is wrong for two reasons:

Unprefixed attributes

Unprefixed attributes get no namespace assigned. This is different from elements. See XML Names 6.2:

The namespace name for an unprefixed attribute name always has no value.

Prefixed attributes

For prefixed attributes, the prefix should be resolved into a URI.

Example

Using a shortened etree test case:

<root xmlns="http://root.example.com" 
      xmlns:attrib="http://attrib.example.com" 
      a="foo" 
      attrib:b="bar" />

Let's use xmlstarlet, which is just a neat libxml2 CLI frontend.

$ xmlstarlet sel -N root=http://root.example.com -N attrib=http://attrib.example.com -N notattrib=http://attrib.example.com \
  -t -m '/root:root' \
  -o '@a[namespace-uri()=""]: ' -v '@a[namespace-uri()=""]' -nl \
  -o '@root:a[namespace-uri()="http://root.example.com"]:' -v '@root:a[namespace-uri()="http://root.example.com"]' -nl \
  -o '@b[namespace-uri()=""]:' -v '@b[namespace-uri()=""]' -nl \
  -o '@root:b[namespace-uri()="http://root.example.com"]:' -v '@root:b[namespace-uri()="http://root.example.com"]' -nl \
  -o '@attrib:b[namespace-uri()="http://attrib.example.com"]: ' -v '@attrib:b[namespace-uri()="http://attrib.example.com"]' -nl \
  -o '@notattrib:b[namespace-uri()="http://attrib.example.com"]: ' -v '@notattrib:b[namespace-uri()="http://attrib.example.com"]' -nl \
  test.xml

Which produces the following output:

@a[namespace-uri()=""]: foo
@root:a[namespace-uri()="http://root.example.com"]:
@b[namespace-uri()=""]:
@root:b[namespace-uri()="http://root.example.com"]:
@attrib:b[namespace-uri()="http://attrib.example.com"]: bar
@notattrib:b[namespace-uri()="http://attrib.example.com"]: bar

Note that without the test for namespace-uri(), all of the results except the last one would be correct in etree, since etree does not resolve prefixes for attribute and element selection (a different bug).

EDIT: messed up shell quoting.
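For comparison, here is a small sketch of how Go's standard encoding/xml decoder reports attribute namespaces for the same test document. It is not etree code; rootAttrNamespaces is a hypothetical helper that prints each root attribute with its resolved namespace.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// rootAttrNamespaces returns "local={namespace}" entries for the attributes
// of the document's root element, skipping the xmlns declarations themselves.
func rootAttrNamespaces(doc string) ([]string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	for {
		tok, err := dec.Token()
		if err != nil {
			return nil, err // includes io.EOF when no element is found
		}
		start, ok := tok.(xml.StartElement)
		if !ok {
			continue
		}
		var out []string
		for _, a := range start.Attr {
			if a.Name.Space == "xmlns" || a.Name.Local == "xmlns" {
				continue // namespace declaration, not a regular attribute
			}
			out = append(out, fmt.Sprintf("%s={%s}", a.Name.Local, a.Name.Space))
		}
		return out, nil
	}
}

func main() {
	const doc = `<root xmlns="http://root.example.com"
      xmlns:attrib="http://attrib.example.com"
      a="foo"
      attrib:b="bar" />`
	attrs, err := rootAttrNamespaces(doc)
	fmt.Println(attrs, err)
}
```

The unprefixed attribute a comes back with an empty namespace even though the element has a default namespace, and attrib:b comes back with the URI bound to its prefix, which is the behaviour the report asks for.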

Prefix handling in xpath queries does not resolve namespaces

Compare

package main

import (
	"fmt"

	"github.com/beevik/etree"
)

const xmlData = `<root xmlns:b='foo'><a /><b:b /><b:b xmlns:b='bar' /></root>`

func main() {
	doc := etree.NewDocument()
	doc.ReadFromString(xmlData)

	fmt.Printf("%+v\n", doc.FindElements("//b"))
	fmt.Printf("%+v\n", doc.FindElements("//b:b"))
}

which produces

$ go run showcase.go 
[0xc0000b4240 0xc0000b42a0]
[0xc0000b4240 0xc0000b42a0]

to

import xml.etree.ElementTree as ET
import io

XML_DATA = "<root xmlns:b='foo'><a /><b:b /><b:b xmlns:b='bar' /></root>"

doc = ET.parse(io.StringIO(XML_DATA))

print(doc.findall('./b'))
# Fails, prefix b not defined
# print(doc.findall('.//b:b'))
print(doc.findall('./b:b', {'b': 'foo'}))
print(doc.findall('./b:b', {'b': 'bar'}))
# different prefix, still finds the same element!
print(doc.findall('./c:b', {'c': 'bar'}))

which results in

$ python showcase.py 
[]
[<Element '{foo}b' at 0x7f0701e43e90>]
[<Element '{bar}b' at 0x7f0701e43ef0>]
[<Element '{bar}b' at 0x7f0701e43ef0>]

Note that in the Go version, both queries return both elements that have b as their local name, and prefixes are only compared as text strings. The Python version is correct with regard to namespaces since:

  1. the unnamed namespace does not match any other namespace
  2. prefixes are resolved to namespace URIs. This implies that prefixes in XPath expressions have to be defined first. After that, the actual prefix does not matter, only the backing namespace URI

It would be nice if your etree package would offer similar features. Searching by prefix only is a blocker when receiving XML documents, where prefixes are unknown (Like the output of Go's XML Encoder that uses strange, but correct, prefix names and placement).
How would you search for an XML element by namespace at all?
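Until path queries resolve prefixes, one option is to match on the resolved namespace URI outside of etree. The sketch below uses the standard encoding/xml decoder, which resolves prefixes itself. The function countElements and its "*" wildcard convention are assumptions made for this example.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// countElements counts start tags whose local name is local and whose
// resolved namespace URI equals space. Passing "*" as space matches any
// namespace (a convention invented for this sketch).
func countElements(doc, space, local string) (int, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	n := 0
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return n, nil
		}
		if err != nil {
			return n, err
		}
		if start, ok := tok.(xml.StartElement); ok {
			if start.Name.Local == local && (space == "*" || start.Name.Space == space) {
				n++
			}
		}
	}
}

func main() {
	const doc = `<root xmlns:b='foo'><a /><b:b /><b:b xmlns:b='bar' /></root>`
	for _, ns := range []string{"", "foo", "bar", "*"} {
		n, err := countElements(doc, ns, "b")
		fmt.Printf("namespace %q: %d match(es), err=%v\n", ns, n, err)
	}
}
```

The prefix used in the document never appears in the comparison, so a document that binds the same URI to a different prefix would match in exactly the same way.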

Walking mechanism

As I understand, ReadFromFile stores the entire file's content in memory before we can do whatever we have to do with the nodes.
This is not ideal when dealing with very large files.

Is there currently a way to process the nodes as we walk through the file, therefore avoiding the need to store it all in memory at once? Hope that makes sense

FindElements

Hi,

I have xml like:

<nodes>
    <node>
      <nodeID>2</nodeID>
      <args>
        <arg0>
          <source>1</source>
        </arg0>
        <arg1>
          <source>2</source>
        </arg1>
      </args>
    </node>
</nodes>

I am trying to get "arg*" elements. Not sure how to phrase this. Awesome library by the way. I'm just having to deal with some bad xml at the moment.

    doc := etree.NewDocument()
    if err := doc.ReadFromFile(filename); err != nil {
        panic(err)
    }

    nodes := doc.SelectElement("nodes")
    for _, node := range nodes.SelectElements("node") {
        args := node.SelectElement("args")

        for _, arg := range args.FindElements("arg*") {
            _ = arg // blah
        }
    }

Thanks!
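If tag-name wildcards such as arg* are not available in path queries, the filtering can be done in Go by comparing tag names with strings.HasPrefix. The sketch below demonstrates the idea with the standard encoding/xml package and its ",any" field tag. The struct layout and the helper name argSources are assumptions based on the sample document above.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// child captures any element below <args> together with its tag name.
type child struct {
	XMLName xml.Name
	Source  string `xml:"source"`
}

// nodes mirrors the nodes/node/args structure of the sample document.
type nodes struct {
	Nodes []struct {
		Args struct {
			Children []child `xml:",any"`
		} `xml:"args"`
	} `xml:"node"`
}

// argSources returns "tag=source" for every child of <args> whose tag name
// starts with "arg".
func argSources(doc string) ([]string, error) {
	var n nodes
	if err := xml.Unmarshal([]byte(doc), &n); err != nil {
		return nil, err
	}
	var out []string
	for _, node := range n.Nodes {
		for _, c := range node.Args.Children {
			if strings.HasPrefix(c.XMLName.Local, "arg") {
				out = append(out, c.XMLName.Local+"="+c.Source)
			}
		}
	}
	return out, nil
}

func main() {
	const doc = `<nodes><node><nodeID>2</nodeID><args>
  <arg0><source>1</source></arg0>
  <arg1><source>2</source></arg1>
</args></node></nodes>`
	fmt.Println(argSources(doc))
}
```

The same prefix test could be applied while looping over the child elements of args in an etree document, assuming the element's tag name is available as a string.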

New line after BOM

doc.WriteTo adds an extra new line after the BOM. I've illustrated it with et_dump.go and et_dump.xml, which you can find under https://github.com/suntong/lang/blob/master/lang/Go/src/xml/.

Here is the result:

$ go run et_dump.go | diff -wU 1 et_dump.xml -
--- et_dump.xml 2016-03-08 16:40:41.667010100 -0500
+++ -   2016-03-08 16:40:57.842603083 -0500
@@ -1,4 +1,4 @@
-<?xml version="1.0" encoding="utf-8"?>
+
+<?xml version="1.0" encoding="utf-8"?>
 <bookstore xmlns:p="urn:schemas-books-com:prices">
-
   <book category="COOKING">
@@ -9,3 +9,2 @@
   </book>
-
   <book category="CHILDREN">
...
@@ -34,3 +31,2 @@
   </book>
-
 </bookstore>
\ No newline at end of file

I.e., an extra new line is added after the BOM. This seems to be a trivial issue, but it causes Microsoft Visual Studio to fail to recognize the webtest file that such a dump creates. :-(

Please consider removing the added extra new line.

Thanks

Please skip BOM

When reading from a file (via ReadFrom() or ReadFromFile()), is it possible to skip the BOM (https://en.wikipedia.org/wiki/Byte_order_mark) character?

Every file created by Microsoft tools under Windows has that character, which is very hard to get rid of. So it would be great if etree could skip it when reading from a file.

The following file will fail:

$ cat et_example.xml | hexdump -C
00000000  ff fe 3c 00 62 00 6f 00  6f 00 6b 00 73 00 74 00  |..<.b.o.o.k.s.t.|
00000010  6f 00 72 00 65 00 3e 00  0d 00 0a 00 20 00 3c 00  |o.r.e.>..... .<.|
...

with the following error

panic: XML syntax error on line 1: invalid UTF-8

Hmm, wait, is it because of BOM or the UTF16 encoding?

thx
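The hexdump starts with FF FE followed by two-byte characters, which suggests UTF-16 little-endian content with its BOM, so both the BOM and the encoding would need handling. A minimal pre-processing sketch using only the standard library is shown below; toUTF8 and decodeUTF16 are hypothetical helpers, and the result would be passed to a byte-based reader afterwards.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"unicode/utf16"
)

// toUTF8 strips a UTF-8 BOM, or decodes UTF-16 input that starts with a
// little- or big-endian BOM. Input without a BOM is returned unchanged.
func toUTF8(b []byte) []byte {
	switch {
	case bytes.HasPrefix(b, []byte{0xEF, 0xBB, 0xBF}):
		return b[3:]
	case bytes.HasPrefix(b, []byte{0xFF, 0xFE}):
		return decodeUTF16(b[2:], binary.LittleEndian)
	case bytes.HasPrefix(b, []byte{0xFE, 0xFF}):
		return decodeUTF16(b[2:], binary.BigEndian)
	}
	return b
}

// decodeUTF16 converts UTF-16 code units in the given byte order to UTF-8.
// A trailing odd byte, if any, is ignored.
func decodeUTF16(b []byte, order binary.ByteOrder) []byte {
	units := make([]uint16, 0, len(b)/2)
	for i := 0; i+1 < len(b); i += 2 {
		units = append(units, order.Uint16(b[i:]))
	}
	return []byte(string(utf16.Decode(units)))
}

func main() {
	// First bytes of the hexdump above: FF FE, then "<bo" in UTF-16LE.
	in := []byte{0xFF, 0xFE, '<', 0, 'b', 0, 'o', 0}
	fmt.Printf("%q\n", toUTF8(in)) // prints "<bo"
}
```

If the document carries an XML declaration naming a UTF-16 encoding, the parser may still reject the converted bytes, so that declaration would need separate handling.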

Any way to find elements ignoring namespaces?

Sometimes I have to deal with non-standard implementations and I don't always know what Namespaces are going to be in there. Is there a possibility of adding support of finding elements with a namespace of *?

Question: regarding adding content to a document

In https://golang.org/pkg/encoding/xml/ there is support for a tag ",innerxml", where the field is written verbatim. Is there support in etree for something similar?

For example:

doc.CreateInnerXML("<users/>")

From https://golang.org/pkg/encoding/xml/#Marshal

The XML element for a struct contains marshaled elements for each of the exported fields of the struct, with these exceptions:

- the XMLName field, described above, is omitted.
- a field with tag "-" is omitted.
- a field with tag "name,attr" becomes an attribute with
  the given name in the XML element.
- a field with tag ",attr" becomes an attribute with the
  field name in the XML element.
- a field with tag ",chardata" is written as character data,
  not as an XML element.
- a field with tag ",cdata" is written as character data
  wrapped in one or more <![CDATA[ ... ]]> tags, not as an XML element.
- a field with tag ",innerxml" is written verbatim, not subject
  to the usual marshaling procedure.
- a field with tag ",comment" is written as an XML comment, not
  subject to the usual marshaling procedure. It must not contain
  the "--" string within it.
- a field with a tag including the "omitempty" option is omitted
  if the field value is empty. The empty values are false, 0, any
  nil pointer or interface value, and any array, slice, map, or
  string of length zero.
- an anonymous struct field is handled as if the fields of its
  value were part of the outer struct.

Cannot find element with ":" in the name.

XML structure:

<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="uuid_id" version="2.0">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
    <dc:identifier opf:scheme="uuid" id="uuid_id"></dc:identifier>
    <dc:title></dc:title>
    <dc:creator opf:role="aut"></dc:creator>
    <dc:language>eng</dc:language>
  </metadata>
</package>

Go code:

doc.FindElement("/package/metadata/*[1]").SetText("test") //fails
doc.FindElement("/package/metadata/dc:identifier").SetText("test") //fails

Exposing more API functions

I'd think it'd be helpful to expose the following API functions

doc.NewDocumentFromString(s string), based on:

etree/etree_test.go

Lines 14 to 22 in 4ec1305

func newDocumentFromString(t *testing.T, s string) *Document {
	t.Helper()
	doc := NewDocument()
	err := doc.ReadFromString(s)
	if err != nil {
		t.Error("etree: failed to parse document")
	}
	return doc
}

and,

Element.WriteTo(w *bufio.Writer, s *WriteSettings), for debugging purposes, based on:

etree/etree.go

Lines 1036 to 1060 in 4ec1305

func (e *Element) writeTo(w *bufio.Writer, s *WriteSettings) {
	w.WriteByte('<')
	w.WriteString(e.FullTag())
	for _, a := range e.Attr {
		w.WriteByte(' ')
		a.writeTo(w, s)
	}
	if len(e.Child) > 0 {
		w.WriteString(">")
		for _, c := range e.Child {
			c.writeTo(w, s)
		}
		w.Write([]byte{'<', '/'})
		w.WriteString(e.FullTag())
		w.WriteByte('>')
	} else {
		if s.CanonicalEndTags {
			w.Write([]byte{'>', '<', '/'})
			w.WriteString(e.FullTag())
			w.WriteByte('>')
		} else {
			w.Write([]byte{'/', '>'})
		}
	}
}

Please consider. thx.

Text escaping not correct

The valid character range for XML is specified as:

Char := #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

When producing XML using WriteTo a set of strings.Replacer objects in etree.go are used to escape strings.

However, these replacers fail to replace some characters that are not valid in XML (e.g. 0xB vertical tab).

This leads to invalid XML being produced.

The go standard package xml replaces invalid characters with \uFFFD (unicode replacement character) which appears to be common practice across other XML libraries.

I propose using xml.EscapeText instead of the current approach based on strings.NewReplacer(...).
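A short sketch of what the proposed xml.EscapeText does with such input, using only the standard library. The wrapper function escape is a hypothetical convenience for this example.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
)

// escape returns s escaped for use as XML character data. Characters that
// are not allowed in XML come out as U+FFFD.
func escape(s string) (string, error) {
	var buf bytes.Buffer
	if err := xml.EscapeText(&buf, []byte(s)); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := escape("vertical tab\vhere, <tag> & more")
	fmt.Printf("%q %v\n", out, err)
}
```

The vertical tab (0xB) is replaced by the Unicode replacement character, while markup characters such as < and & are turned into entity references.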

RemoveChild() Example

Hi @beevik,

Can you give an example using RemoveChild() please?

Also, I searched the _test files, and it seems that there is no test coverage for RemoveChild() either.

thx

Bug/question nested FindElements

Hi, I have some code where I am using FindElements to get a slice of response elements, then calling FindElement on those to get their children. I am getting the same result for each iteration of the loop, i.e. the href is always "/1" instead of "/1" for the first response, then "/2" for the next, etc. Have I done something wrong or misunderstood the API? Or is this a bug?

package example

import (
	"testing"

	"fmt"
	"github.com/beevik/etree"
)

func TestXML(t *testing.T) {
	xml := `<multistatus xmlns="DAV:">
	<response>
	<href>/1</href>
	<propstat>
	<prop>
	<getetag>A</getetag>
	</prop>
	<status>HTTP/1.1 200 OK</status>
	</propstat>
	</response>
	<response>    
	<href>/2</href>                                    
	<propstat>                                                                                                                          
	<prop>                                                                    
	<getetag>B</getetag>
	</prop>                                                                                                                     
	<status>HTTP/1.1 200 OK</status>                                        
	</propstat>
	</response>
	<response>
	<href>/3</href>
	<propstat>
	<prop>
	<getetag>C</getetag>
	</prop>
	<status>HTTP/1.1 200 OK</status>
	</propstat>
	</response>
	<response>
	<href>/4</href>
	<propstat>
	<prop>
	<getetag>D</getetag>
	</prop>
	<status>HTTP/1.1 200 OK</status>
	</propstat>
	</response>
	<response>
	<href>/5</href>
	<propstat>
	<prop>
	<getetag>E</getetag>
	</prop>
	<status>HTTP/1.1 200 OK</status>
	</propstat>
	</response>
	</multistatus>`

	doc := etree.NewDocument()
	if err := doc.ReadFromString(xml); err != nil {
		t.Fatalf("Failed to parse xml: %s", err.Error())
	}

	responseElements := doc.FindElements("//response")
	if len(responseElements) != 5 {
		t.Fatalf("Expected 5 response elements, got %d", len(responseElements))
	}

	for n, el := range responseElements {
		hrefEl := el.FindElement("//href")
		etagEl := el.FindElement("//getetag")

		if hrefEl != nil && etagEl != nil {

			href := hrefEl.Text()
			expectedHref := fmt.Sprintf("/%d", 1+n)
			if href != expectedHref {
				t.Fatalf("Expected href %s, got %s", expectedHref, href)
			}

			etag := etagEl.Text()
			expectedEtag := fmt.Sprintf("%c", 'A'+n)
			if etag != expectedEtag {
				t.Fatalf("Expected etag %s, got %s", expectedEtag, etag)
			}
		} else {
			t.Fatalf("Missing href and/or etag")
		}
	}
}

panic: xml: encoding "ISO-8859-1" declared but Decoder.CharsetReader is nil same as #48

Hi @beevik
After running some tests, I got the same error as reported in issue #48.
Here is a sample of the code and XML data:

var xmlDoc = `<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE broadWorksCDR>
<broadWorksCDR version="19.0">
  <cdrData>
    <headerModule>
      <recordId>
        <eventCounter>0002183384</eventCounter>
        <systemId>CGRateSaabb</systemId>
        <date>20160419210000.104</date>
        <systemTimeZone>1+020000</systemTimeZone>
      </recordId>
      <type>Start</type>
    </headerModule>
  </cdrData>
</broadWorksCDR>
`
doc := etree.NewDocument()
if err := doc.ReadFromBytes([]byte(xmlDoc)); err != nil {
	t.Error(err)
}

I tried both doc.ReadFromString and doc.ReadFromBytes, but got the error both times.

Thanks,
TeoV
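The error text comes from Go's encoding/xml decoder, which needs a CharsetReader for any declared encoding other than UTF-8. The sketch below shows such a reader for ISO-8859-1 with the standard library only. The names latin1CharsetReader and decodeText are hypothetical. Whether and how etree lets callers plug in a charset reader through its read settings is an assumption to check against its documentation.

```go
package main

import (
	"bytes"
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// latin1CharsetReader converts ISO-8859-1 input to UTF-8. Every Latin-1 byte
// maps to the Unicode code point with the same value. For simplicity it
// reads the whole remaining input into memory.
func latin1CharsetReader(charset string, input io.Reader) (io.Reader, error) {
	if !strings.EqualFold(charset, "ISO-8859-1") {
		return nil, fmt.Errorf("unsupported charset %q", charset)
	}
	raw, err := io.ReadAll(input)
	if err != nil {
		return nil, err
	}
	runes := make([]rune, len(raw))
	for i, b := range raw {
		runes[i] = rune(b)
	}
	return strings.NewReader(string(runes)), nil
}

// decodeText parses doc and returns the character data of its root element.
func decodeText(doc []byte) (string, error) {
	dec := xml.NewDecoder(bytes.NewReader(doc))
	dec.CharsetReader = latin1CharsetReader
	var v struct {
		Text string `xml:",chardata"`
	}
	if err := dec.Decode(&v); err != nil {
		return "", err
	}
	return v.Text, nil
}

func main() {
	doc := []byte("<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><type>caf\xe9</type>")
	fmt.Println(decodeText(doc))
}
```

For other encodings, a table-driven converter would be needed in place of the one-to-one byte mapping used here.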

Some XML prefixes have a colon

For instance the CardDAV prefix is urn:ietf:params:xml:ns:carddav. This doesn't play well with spaceDecompose:

func spaceDecompose(str string) (space, key string) {

The standard Go library is using a space character, probably because of this. Thoughts?

SelectElements doesn't handle namespace properly

Given this document:

<?xml version="1.0" encoding="utf-8" ?>
<D:propfind xmlns:D="DAV:">
  <D:prop xmlns:R="http://ns.example.com/boxschema/">
    <R:bigbox/>
    <R:author/>
    <R:DingALing/>
    <R:Random/>
  </D:prop>
</D:propfind>

  • doc.Root().SelectElements("DAV:prop") returns zero elements
  • doc.Root().SelectElements("D:prop") returns one element

However the D prefix is arbitrary, it could be set to any token really (e.g. xmlns:myawesomeprefix="DAV:"). SelectElements shouldn't care about it, it should resolve the prefix depending on xmlns attributes.

https://play.golang.org/p/I8XUmLjY9pX

Problem parsing CDATA after newline

Thanks a ton for this package - super useful for my work.

I'm parsing some RSS feeds that contain formatted HTML inside <![CDATA[ ... ]]> sections for post descriptions, content, etc. It looks like when the CDATA section is preceded by a newline, the text can't be parsed out:

	workingCDATAString := `
	<rss>
		<channel>
			<item>
		   		<summary><![CDATA[Sup]]></summary>
			</item>
		</channel>
	</rss>
	`

	doc := etree.NewDocument()
	doc.ReadFromString(workingCDATAString)
	spew.Dump(doc.FindElement("rss").FindElement("channel").FindElement("item").FindElement("summary").Text())
	// Output: (string) (len=3) "Sup"

	brokenCDATAString := `
	<rss>
		<channel>
			<item>
		   		<summary>
			 		<![CDATA[Sup]]>
				</summary>
			</item>
		</channel>
	</rss>
	`
	doc = etree.NewDocument()
	doc.ReadFromString(brokenCDATAString)
	spew.Dump(doc.FindElement("rss").FindElement("channel").FindElement("item").FindElement("summary").Text())
	// Output: (string) (len=7) "\n\t\t\t \t\t"

I'm not familiar with XML parsing enough to say that this isn't the intended behavior, but I would expect these two code blocks to output the same thing ("Sup"). Any ideas?
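In the second document, the whitespace before the CDATA section is itself character data, so a function that returns only the first run of character data would stop there. One way to get "Sup" in both cases is to concatenate all character data directly inside the element and trim it. The sketch below shows that idea with the standard encoding/xml decoder; directText is a hypothetical helper, not an etree function.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// directText returns the concatenated character data (plain text and CDATA)
// that sits directly inside the first element named tag, with surrounding
// whitespace trimmed.
func directText(doc, tag string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var sb strings.Builder
	depth := 0 // 0 = outside target, 1 = directly inside, >1 = inside a child
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if depth > 0 || t.Name.Local == tag {
				depth++
			}
		case xml.EndElement:
			if depth > 0 {
				depth--
				if depth == 0 {
					return strings.TrimSpace(sb.String()), nil
				}
			}
		case xml.CharData:
			if depth == 1 {
				sb.Write(t)
			}
		}
	}
	return "", fmt.Errorf("element %q not found", tag)
}

func main() {
	inline := "<rss><summary><![CDATA[Sup]]></summary></rss>"
	multiline := "<rss><summary>\n\t<![CDATA[Sup]]>\n</summary></rss>"
	fmt.Println(directText(inline, "summary"))
	fmt.Println(directText(multiline, "summary"))
}
```

Trimming is a deliberate choice for this use case; a caller that needs significant leading or trailing whitespace would skip the strings.TrimSpace call.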

Copy()

Thanks for this library. Would you be interested in a patch that adds a Copy() method to Document / Element?

In the XMPP server I'm working on, one sometimes has to duplicate an incoming XML message and send out slightly modified versions of it.

I realize that I can implement Copy() in userland, but since it would probably be useful to others, I hope you're interested in a patch.

How to add missing ProcInst to the document?

I'm creating a converter that modifies a bunch of exported XML files to be imported into a different software.

Currently, I'm using etree to read each file, add or change elements, then write it again to a different folder. But the destination software seems to need the <?xml... header which is missing from the source. But CreateProcInst would not prepend but append it.

How could I prepend the ProcInst?

Extracting elements with text

Hi,

I need to retrieve all text tokens of an element and its descendants, i.e. a path looking something like //something/text().

Since Element.FindElementsPath() and friends return Elements and not Tokens, it would alternatively be fine to get a list of elements with non-empty text.

I'm ready to make a PR to add this functionality, but I would first like to know how you would see it.

  • Should we add a new function to Element to retrieve Tokens from a path or should we stick to the existing functions Element.FindElementsPath() ?
  • If we keep existing functions, should we keep the //something/text() syntax or a different one (since we are actually retrieving the parent of the text nodes) ? In this case, what would it be ?

Thanks

Is that a bug Text() only return first child?

Is it a bug that Text() only returns the first child? For the snippet below, it only shows half of the content.

            <script type="text/javascript">
                polymer.define(&apos;web.csrf&apos;, function (require) {
                    var token = &quot;<t t-esc="request.csrf_token(None)"></t>&quot;;
                    require(&apos;web.core&apos;).csrf_token = token;
                    require(&apos;qweb&apos;).default_dict.csrf_token = token;
                });
            </script>

how to edit ProcInst

I want to change the encoding from gb2312 to utf-8, but I can only find the doc.CreateProcInst() method. Can this package do it?

<?xml version="1.0" encoding="gb2312"?>
<configuration>
    <section name="CommissionManagerConfig" requirePermission="false"/>
</configuration>
