lomirus / html_editor


Pure and simple HTML parser and editor.

Home Page: https://crates.io/crates/html_editor

License: MIT License

Rust 38.38% HTML 61.62%

html parser rust xml

html_editor's Introduction

Hi 👋, I'm Lomirus

🌱 Some other things interested: Bevy, New Ithkuil
📝 Writing articles on my blog: Lomirus' Site
📫 How to reach me: [email protected] (CN) / [email protected] (Global)

html_editor's People

Contributors

Stargazers

Watchers

Forkers

olehbozhok arduano xeniagda priyansh-bhardwaj-juspay wanderqing

html_editor's Issues

On XML and Node spans

I'm looking to replace the current HTML parser that's being used in a project of mine and I came across this one. It looks like it would be a nice replacement, and by my estimates, would bring the final binary size down by about 16%.

There are two things stopping me from trying it out:

The ability to parse simple XML (from an EPUB), which would require the ability to parse the XML doc type and possibly namespaced attributes.
The tracking of Node spans, that is, the offsets at which a Node starts and stops in the source.

Would these be something you'd consider adding or accepting a pull request for?
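
To make the span request concrete, here is a minimal sketch of what byte-offset tracking could look like. It does not use html_editor: `Span` and `tag_spans` are hypothetical names, and the scanner is deliberately naive (it assumes no `>` appears inside attribute values or comments).

```rust
/// Hypothetical byte-offset span: `start` is inclusive, `end` is exclusive.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// Naively records the span of every `<...>` tag in `source`.
/// Assumption: no `>` appears inside attribute values or comments.
pub fn tag_spans(source: &str) -> Vec<Span> {
    let mut spans = Vec::new();
    let mut cursor = 0;
    while let Some(open) = source[cursor..].find('<') {
        let start = cursor + open;
        match source[start..].find('>') {
            Some(close) => {
                let end = start + close + 1;
                spans.push(Span { start, end });
                cursor = end;
            }
            // Unterminated tag: stop scanning.
            None => break,
        }
    }
    spans
}
```

A real parser would more likely attach such a span to each Node as it is built, so that `&source[span.start..span.end]` recovers the original text of that node.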

Multiple classes on the same html element are parsed as a single class

An html snippet like

<div class="a b">
</div>

gets parsed into an Element like

Element { name: "div", attrs: [("class", "a b")], children: [] }

There is only one class attribute, which contains both classes separated by whitespace.
This causes queries for only one of the two classes to fail.

This test fails.

    #[test]
    fn html_editor_multiple_class_parsing() {
        let test_snippet = r#"<div class="a b"></div>"#;
        let result = parser::parse(test_snippet).unwrap();
        print!("{:?}", result);
        // This selector fails
        let selector = Selector::from(".a");
        // This selector works
        // let selector = Selector::from(".a b");

        result.query(&selector).unwrap();
        assert!(true);
    }

My expected behaviour would be a separate class attribute for each class inside the html.
In this case:

Element { name: "div", attrs: [("class", "a"),("class", "b")], children: [] }
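
An alternative to emitting one attribute per class is to keep the attribute value as written and match against its whitespace-separated tokens at query time, since HTML treats the class attribute as a space-separated list. The following is only a sketch of that idea; `has_class` is a hypothetical helper, not part of html_editor.

```rust
/// Returns true if `class_attr` (the raw value of a `class` attribute)
/// contains `wanted` as one of its whitespace-separated tokens.
/// Hypothetical helper for illustration only.
pub fn has_class(class_attr: &str, wanted: &str) -> bool {
    class_attr
        .split_ascii_whitespace()
        .any(|token| token == wanted)
}
```

With this kind of matching, a query for `.a` or `.b` would find `<div class="a b">`, while the attribute itself still serializes back unchanged.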

query with numerous selectors at once

Hello, I would like to query_all all the headers (regardless of level) in document order, because the order matters. If I search for each header individually (h1, h2, h3, etc.), the results come back grouped by level: h1 first, h2 second, and so on. I've looked through the docs and cannot find a way to do this, so if none exists, I suggest the following syntax.

fn query_all(&self, selector: Vec<&Selector>) -> Vec<Element>
so usage would be something like:
dom.query_all(vec![&Selector::from("h1"), &Selector::from("h2"), ...]);
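
The document-order behaviour requested here can be sketched as a single depth-first walk that tests every element against the whole set of names. The `Node` type, `el`, and `query_any` below are simplified stand-ins invented for this example, not html_editor's own types or API.

```rust
/// Simplified stand-in for a DOM node (not html_editor's `Node`).
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Text(String),
    Element { name: String, children: Vec<Node> },
}

/// Convenience constructor used only in this sketch.
pub fn el(name: &str, children: Vec<Node>) -> Node {
    Node::Element { name: name.to_string(), children }
}

/// Appends to `out`, in document order, every element whose tag name
/// is one of `names`. A parent is visited before its children.
pub fn query_any<'a>(nodes: &'a [Node], names: &[&str], out: &mut Vec<&'a Node>) {
    for node in nodes {
        if let Node::Element { name, children } = node {
            if names.contains(&name.as_str()) {
                out.push(node);
            }
            query_any(children, names, out);
        }
    }
}
```

Because matching happens during one traversal, an h2 that appears before an h1 in the source also appears before it in the result.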

try_parse stuck in infinite loop

The following HTML will cause try_parse to get stuck in an infinite loop

https://gist.github.com/lucasavila00/ae14f1b3284879add91f712663bdb4c7

parse does fail, with:

running 1 test
thread 'test_inf_loop' panicked at packages/rust-html2/src/lib.rs:494:37:
failed to parse: "<notranslate> is not closed"
stack backtrace:
   0: rust_begin_unwind
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
   1: core::panicking::panic_fmt
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
   2: core::result::unwrap_failed
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1652:5
   3: core::result::Result<T,E>::expect
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1034:23
   4: rust_webpack_template::transform_html_unbound
             at ./src/lib.rs:494:19
   5: rust_webpack_template::test_inf_loop
             at ./src/lib.rs:859:35
   6: rust_webpack_template::test_inf_loop::{{closure}}
             at ./src/lib.rs:852:20
   7: core::ops::function::FnOnce::call_once
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/ops/function.rs:250:5
   8: core::ops::function::FnOnce::call_once
             at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/ops/function.rs:250:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
test test_inf_loop ... FAILED

Add parsing of HTML with some errors

It would be great to support parsing HTML that contains some errors.
Like this:

 <!doctype html>
        <html lang="en">
            <head>
                <meta charset="utf-8">
                <title>Html parser</title>
            </head>
            <body>
                <h1 id="a" class="b c">Hello world</h1>
                <!-- comments & dangling elements are ignored -->
                <a class="trait" href="next/next2/">Queryable 

            </body>
        </html>

Browsers handle this.
I'm trying to write a crawler using your parser, but I got an error on parse. =-(
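
One common way to tolerate an unclosed element like the `<a>` above is to keep a stack of open elements and insert the missing closing tags when an outer element closes or the input ends. The sketch below works on a hypothetical, already-tokenized stream (`Token` and `repair` are invented names) and ignores void elements such as `<meta>`; it illustrates the recovery idea, not how html_editor or any browser actually implements it.

```rust
/// Hypothetical tag tokens; text and attributes are omitted for brevity.
#[derive(Debug, Clone, PartialEq)]
pub enum Token {
    Open(String),
    Close(String),
}

/// Returns the token stream with implied closing tags inserted.
/// Assumption: void elements have already been filtered out.
pub fn repair(tokens: &[Token]) -> Vec<Token> {
    let mut stack: Vec<String> = Vec::new();
    let mut out = Vec::new();
    for token in tokens {
        match token {
            Token::Open(name) => {
                stack.push(name.clone());
                out.push(token.clone());
            }
            Token::Close(name) => {
                // A stray closing tag with no matching opener is dropped.
                let Some(pos) = stack.iter().rposition(|open| open == name) else {
                    continue;
                };
                // Close everything opened since the match, innermost first,
                // ending with the matching element itself.
                while stack.len() > pos {
                    let open = stack.pop().unwrap();
                    out.push(Token::Close(open));
                }
            }
        }
    }
    // Close whatever is still open at the end of the input.
    while let Some(open) = stack.pop() {
        out.push(Token::Close(open));
    }
    out
}
```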

Fails to parse if closing tag is separated by a newline

I came across this issue while working with version 0.3.0.

On a website, there was an <a> that was closed by </a\n>. The parser failed with the error message "<a> does not match the </a\n>".

To reproduce, attempt to parse the following segment

<a> example </a
>

Firefox can parse the website regardless of that formatting error.
Quick workaround could be to just delete all newlines in the string?
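
Deleting every newline would also change text content (for example inside `<pre>`), so a narrower fix is to ignore trailing whitespace when reading the name out of a closing tag. A minimal sketch, assuming the parser already has the raw closing tag as a string; `closing_tag_name` is a hypothetical helper, not a function in html_editor.

```rust
/// Extracts the tag name from a raw closing tag such as "</a\n>".
/// Returns None if `raw` is not of the form `</name>` with optional
/// trailing whitespace before the `>`.
pub fn closing_tag_name(raw: &str) -> Option<&str> {
    let inner = raw.strip_prefix("</")?.strip_suffix('>')?;
    let name = inner.trim_end_matches(|c: char| c.is_ascii_whitespace());
    if name.is_empty() || name.contains(|c: char| c.is_ascii_whitespace()) {
        None
    } else {
        Some(name)
    }
}
```

Comparing this normalized name with the name of the open element would let `</a\n>` match `<a>`.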

A function to replace a Node with a different Node or Vec<Node>

I'd like to have a way of replacing a Node with something else. Similar to JavaScript's replaceWith or BeautifulSoup's replace_with.
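
On a list of children, this kind of replacement maps naturally onto `Vec::splice`, which swaps a range for any number of new items. The sketch below uses a simplified stand-in `Node` type and a hypothetical `replace_with` function; it only illustrates the operation and is not html_editor's API.

```rust
/// Simplified stand-in for a DOM node (not html_editor's `Node`).
#[derive(Debug, Clone, PartialEq)]
pub enum Node {
    Text(String),
    Element { name: String, children: Vec<Node> },
}

/// Replaces the child at `index` with zero or more nodes.
/// Returns the removed node, or None if `index` is out of range
/// (in which case `children` is left untouched).
pub fn replace_with(
    children: &mut Vec<Node>,
    index: usize,
    replacement: Vec<Node>,
) -> Option<Node> {
    if index >= children.len() {
        return None;
    }
    // `splice` yields the removed items; the replacement is inserted
    // when the returned iterator is dropped at the end of this statement.
    let removed = children.splice(index..=index, replacement).next();
    removed
}
```

Passing a one-element vector replaces a Node with a single Node, a longer vector covers the Vec<Node> case, and an empty vector simply removes the node.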

Querying by an #id that contains periods does not work

A query by id finds no match when the id contains periods:

let html = r#"<div id="foo.bar">baz</div>"#;
let nodes = parse(&html)?;
let selector = Selector::from("#foo.bar");
let element = nodes.query(&selector);

println!("{:?}", element);
// None
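
In CSS selector syntax, `#foo.bar` means "id `foo` with class `bar`", which is probably why nothing matches here; CSS itself expresses a literal period in an id with a backslash escape, as in `#foo\.bar`. Whether html_editor accepts such escapes is not stated in this issue, so the following is only a sketch of how a selector parser could honour them. `Compound` and `parse_compound` are hypothetical names.

```rust
/// Hypothetical parsed form of a compound selector such as `#id.class`.
#[derive(Debug, Default, PartialEq)]
pub struct Compound {
    pub id: Option<String>,
    pub classes: Vec<String>,
}

/// Splits a selector on `#` and `.`, treating `\x` as a literal `x`.
/// Tag names are ignored in this sketch.
pub fn parse_compound(selector: &str) -> Compound {
    let mut result = Compound::default();
    let mut kind: Option<char> = None; // which part `current` belongs to
    let mut current = String::new();
    let mut chars = selector.chars();
    while let Some(c) = chars.next() {
        match c {
            '\\' => {
                // Escaped character: take the next one literally.
                if let Some(escaped) = chars.next() {
                    current.push(escaped);
                }
            }
            '#' | '.' => {
                flush(&mut result, kind, &mut current);
                kind = Some(c);
            }
            _ => current.push(c),
        }
    }
    flush(&mut result, kind, &mut current);
    result
}

/// Stores the accumulated text as an id or class, then clears it.
fn flush(result: &mut Compound, kind: Option<char>, current: &mut String) {
    let text = std::mem::take(current);
    if text.is_empty() {
        return;
    }
    match kind {
        Some('#') => result.id = Some(text),
        Some('.') => result.classes.push(text),
        _ => {} // tag name or nothing: ignored here
    }
}
```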

Encode and decode inside Node::Text?

Hi!

In the Text variant of a Node, the text is stored as-is from the source code of the HTML file. This means that a source such as a &gt; b would be represented as Node::Text("a &gt; b"), rather than Node::Text("a > b"). While this does make sense for performance reasons, I feel like this might be unintuitive for users. The Node data type is for manipulating HTML after it has been parsed into an abstract syntax tree, but here the Text variant stores the text unprocessed from the file, rather than storing what the text represents.

Additionally, this means that one could easily construct a Node::Text instance by mistake which contains HTML fragments which, when serialized, either give invalid HTML or something which would parse to a different tree structure (for example doing Node::Text("a > b"), or Node::Text("a <img> b")).

From what I can see, a solution to this problem would simply be to add a dependency such as html-escape and call decode_html_entities in the parse function, as well as encode_html_entities in the Htmlifiable::html implementation.

(All of this also applies to attribute values as well)
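
To illustrate the round trip being proposed, here is a dependency-free sketch that handles only the four entities `&amp;`, `&lt;`, `&gt;` and `&quot;`. It is hand-rolled for the example and is not the html-escape crate's API; `encode_text` and `decode_text` here are hypothetical local functions, and a real decoder would need the full entity table.

```rust
/// Escapes the characters that are significant in HTML text and
/// double-quoted attribute values.
pub fn encode_text(raw: &str) -> String {
    let mut out = String::with_capacity(raw.len());
    for c in raw.chars() {
        match c {
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            _ => out.push(c),
        }
    }
    out
}

/// Decodes only the four entities produced by `encode_text`.
pub fn decode_text(escaped: &str) -> String {
    escaped
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        // `&amp;` goes last so that "&amp;lt;" becomes "&lt;", not "<".
        .replace("&amp;", "&")
}
```

Decoding while parsing and encoding while serializing would let Node::Text hold what the text represents, while the emitted HTML stays well-formed.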

wrong parse of <div\n

Hi,

I'd like to report a bug in html_editor.
Running this program:

use html_editor::parse;

fn main() {
    let s = r#"
<div
> </div>
"#;
    let dom = parse(&s);
    println!("wrong: {:?}", dom);

    let s = r#"
<div> </div>
"#;
    let dom = parse(&s);
    println!("good: {:?}", dom);
}

I get:

wrong: Err("<div\n> does not match the </div>")
good: Ok([Text("\n"), Element { name: "div", attrs: [], children: [Text(" ")] }, Text("\n")])

So it seems that html_editor does not treat a newline like a space or tab where it should.
The example uses <div>, but the bug also affects other tags such as <a>.

Regards, and thanks for the package,

Willem
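
The behaviour reported above is consistent with the tag name being cut only at a space character. A minimal sketch of the expected handling, assuming the parser has the text between `<` and `>` available as a string: split the name from the attributes at the first ASCII whitespace of any kind. `split_open_tag` is a hypothetical helper, not a function in html_editor.

```rust
/// Splits the inside of an opening tag (the text between `<` and `>`)
/// into the tag name and the remaining attribute text.
/// Any ASCII whitespace (space, tab, newline, ...) ends the name.
pub fn split_open_tag(inner: &str) -> (&str, &str) {
    match inner.find(|c: char| c.is_ascii_whitespace()) {
        Some(pos) => (&inner[..pos], inner[pos..].trim()),
        None => (inner, ""),
    }
}
```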