Comments (4)
Hi, @yasoob!
This is an interesting idea. I think it is not possible with the current data structure, as you said, but could work if we have a more "complete" tree representation like we do internally today - see Floki.HTMLTree. We already started to discuss a "wrapper" around the results in #457, and I think that wrapper could be this tree, which could include more information like the position of a tag.
However, I'm not sure how much work would be required if we decide to expose that from the floki_mochi_html tokenizer. I will investigate. But I would say this is feasible, yeah.
Just some additional context: the Mochiweb parser is not the most aligned with the specs 😅
So I'm afraid it could contain wrong data about the position of the elements.
That said, I started working on a new parser a long time ago, but it was never finished. I think the correct path - after exposing this in the "wrapper" - would be to finish that parser, which aims to work according to the HTML specs. This could take some time, though.
from floki.
So as a fun experiment I spent some time yesterday looking into it. I wanted to get the line numbers of the:
- Start tags
- End tags
- Attributes
I ended up updating the tokenize function like this:
tokenize(B, S = #decoder{offset = O}) ->
    case B of
        %% ... Truncated ...
        <<_:O/binary, "</", _/binary>> ->
            {Tag, S1} = tokenize_literal(B, ?ADV_COL(S, 2)),
            {S2, _} = find_gt(B, S1),
            {{end_tag, Tag, {line_no, S#decoder.line}}, S2};
        <<_:O/binary, "<", C, _/binary>> when
            ?IS_WHITESPACE(C); not ?IS_LETTER(C)
        ->
            %% This isn't really strict HTML
            {{data, Data, _Whitespace}, S1} = tokenize_data(B, ?INC_COL(S)),
            {{data, <<$<, Data/binary>>, false}, S1};
        <<_:O/binary, "<", _/binary>> ->
            {Tag, S1} = tokenize_literal(B, ?INC_COL(S)),
            {Attrs, S2} = tokenize_attributes(B, S1),
            {S3, HasSlash} = find_gt(B, S2),
            Singleton = HasSlash orelse is_singleton(Tag),
            {{start_tag, Tag, Attrs, Singleton, {line_no, S#decoder.line}}, S3};
        _ ->
            tokenize_data(B, S)
    end.
I did something similar for the attributes and added line numbers there as well. So if I directly use this new tokenize function like this:
:floki_mochi_html.tokens(doc)
It produces output like this:
{:start_tag, "style", [{"type", "text/css", {:line_no, 13}}], false,
 {:line_no, 13}},
{:end_tag, "style", {:line_no, 68}},
I checked the line numbers in the output and they were correct. But as you can imagine, this output can't really be used for any further processing, as all other functions expect a different data structure. This is a long-winded way of saying that it is not only feasible but also works correctly in the scenarios that I tested.
As for the Mochiweb parser not being according to HTML specs, do you mind sharing a concrete example? This would help me see if it breaks the kind of work I am trying to do. I don't really care for the final HTML output to be "correct". As in, I don't want Mochiweb to add a missing tag in the final output to make it compliant. But I do want it to accurately tokenize what is present in the input. I actually want the broken output, where the tags that are missing in the source are also missing in the tokenized output. This would have been much easier to implement if we had a low-level tokenizer in Elixir, but Mochiweb is what we have.
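To make the "tokenize what is present, repair nothing" requirement concrete, here is a minimal sketch in Rust. It is not Mochiweb or Floki code: the `Token` type and the `tokenize_tags` function are hypothetical names made up for illustration, and the scanner deliberately ignores attributes, comments, and quoted `>` characters. It only shows that a tokenizer can record the line on which each tag starts without inserting any missing tags.

```rust
/// A tag token together with the 1-based line on which it starts.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
enum Token {
    StartTag { name: String, line: u32 },
    EndTag { name: String, line: u32 },
}

/// Scans `input` for tags and records the line where each one begins.
/// Nothing is repaired: a tag that is missing from the input is also
/// missing from the output. Simplification: a `>` inside a quoted
/// attribute value is treated as the end of the tag.
fn tokenize_tags(input: &str) -> Vec<Token> {
    let bytes = input.as_bytes();
    let mut tokens = Vec::new();
    let mut line = 1u32;
    let mut i = 0;
    while i < bytes.len() {
        match bytes[i] {
            b'\n' => {
                line += 1;
                i += 1;
            }
            b'<' => {
                let start_line = line;
                let is_end = bytes.get(i + 1) == Some(&b'/');
                let name_start = if is_end { i + 2 } else { i + 1 };
                let mut j = name_start;
                while j < bytes.len() && bytes[j].is_ascii_alphanumeric() {
                    j += 1;
                }
                if j == name_start {
                    // "<" not followed by a tag name: treat it as plain data.
                    i += 1;
                    continue;
                }
                let name = input[name_start..j].to_ascii_lowercase();
                tokens.push(if is_end {
                    Token::EndTag { name, line: start_line }
                } else {
                    Token::StartTag { name, line: start_line }
                });
                // Skip to the closing '>' while still counting newlines,
                // so tags that span several lines do not break the count.
                while j < bytes.len() && bytes[j] != b'>' {
                    if bytes[j] == b'\n' {
                        line += 1;
                    }
                    j += 1;
                }
                i = j + 1;
            }
            _ => i += 1,
        }
    }
    tokens
}
```

For an input such as `"<div>\n<p>hi\n</div>"`, the sketch reports the `div` start tag on line 1, the `p` start tag on line 2, and the `div` end tag on line 3. No end tag for `p` is invented, which is the behaviour described above.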
I had previously tried to add this support in the html5ever NIF, as it also calls an internal method to update the line number during parsing/tokenizing according to this issue. I managed to get as far as getting a line number printed in the terminal, but it wasn't super reliable and my Rust is very "rusty". I doubt I can get anywhere with that solution without learning more Rust. Maybe you or someone else who has more Rust experience can look into it.
If we can get this working in the Rust NIF, that would be an even bigger win, but at this point I am open to whatever solution we can come up with to add this support in Floki itself.
I also wasn't aware of the HTMLTree. I will look into it.
I checked the line numbers in the output and they were correct [...] it is not only feasible but works correctly as well in the scenarios that I tested.
This is awesome! :D
As for the mochiweb parser not being according to HTML specs, do you mind sharing a concrete example?
I can say that most of our bugs are related to a lack of support in our current parser. There is one example that can affect your output: multiple whitespace chars are collapsed into just one. So if you have multiple newlines, I think it is going to count incorrectly (I didn't try with your patch).
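As a rough illustration of why collapsing whitespace would matter for line numbers, the sketch below counts newlines in the raw input and then in a collapsed copy. Both helpers, `line_at` and `collapse_whitespace`, are hypothetical names for this example and do not mirror how Mochiweb actually collapses whitespace. The point is only that line counting has to happen on the original text, before any such normalization.

```rust
/// Returns the 1-based line number of `byte_offset` in `input`,
/// counting every '\n' that precedes it in the text as given.
fn line_at(input: &str, byte_offset: usize) -> usize {
    let end = byte_offset.min(input.len());
    1 + input.as_bytes()[..end]
        .iter()
        .filter(|&&b| b == b'\n')
        .count()
}

/// Collapses every run of whitespace into a single space (and trims
/// the ends). This is a stand-in for a parser that normalizes
/// whitespace before positions are computed.
fn collapse_whitespace(input: &str) -> String {
    input.split_whitespace().collect::<Vec<_>>().join(" ")
}
```

With the raw text `"<p>a</p>\n\n\n<p>b</p>"`, `line_at` places the second `<p>` on line 4. After `collapse_whitespace`, the same tag appears to be on line 1, because the newlines are gone before anything counted them.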
Maybe you or someone else who has more Rust experience can look into it.
If we can get this working in Rust NIF, that would be an even bigger win but at this point I am open to whatever solution we can come up with to add this support in Floki itself.
I will take a look when I can!
I also wasn't aware of the HTMLTree.
Thinking about it now, I guess we would need to change the parsing to build the HTMLTree directly, instead of building the tree as structs like it is today.
I cannot promise to add the feature soon, but I look forward to working on this. Also, if you feel comfortable, don't hesitate to send PRs. They are more than welcome!
So I spent some time on this and was able to get the line number from Html5ever as well with the following changes:
- Add a line_no field to the Node struct and take it as an input when creating a new Node:
pub struct Node {
    id: NodeHandle,
    line_no: u64,
    children: PoolOrVec<NodeHandle>,
    parent: Option<NodeHandle>,
    data: NodeData,
}
impl Node {
    fn new(id: usize, line_no: u64, data: NodeData, pool: &Vec<NodeHandle>) -> Self {
        Node {
            id: NodeHandle(id),
            parent: None,
            children: PoolOrVec::new(pool),
            line_no,
            data,
        }
    }
}
- Add a current_line field to the FlatSink struct and set it to 1 when creating the FlatSink:
pub struct FlatSink {
    pub root: NodeHandle,
    pub nodes: Vec<Node>,
    pub pool: Vec<NodeHandle>,
    pub current_line: u64,
}
impl FlatSink {
    pub fn new() -> FlatSink {
        let mut sink = FlatSink {
            root: NodeHandle(0),
            nodes: Vec::with_capacity(200),
            pool: Vec::with_capacity(2000),
            current_line: 1,
        };
        // Element 0 is always root
        sink.nodes
            .push(Node::new(0, 1, NodeData::Document, &sink.pool));
        sink
    }
    // ... trunc ...
}
- Keep a current_line pointer during parsing by implementing the set_current_line method of the TreeSink trait. This method is called by html5ever whenever it moves to a new line during parsing:
impl TreeSink for FlatSink {
    // ... trunc ...
    fn set_current_line(&mut self, line_number: u64) {
        self.current_line = line_number;
    }
}
- Update the make_node method of the FlatSink and populate the line_no field of the Node when creating a new Node:
impl FlatSink {
    // ... trunc ...
    pub fn make_node(&mut self, data: NodeData) -> NodeHandle {
        let node = Node::new(self.nodes.len(), self.current_line, data, &self.pool);
        let id = node.id;
        self.nodes.push(node);
        id
    }
}
- Encode the line_no field as well for each node in the encode_node function:
// Do this for all Node types:
NodeData::Document => map
    .map_put(atoms::type_().encode(env), atoms::document().encode(env))
    .map_err(to_custom_error)?
    .map_put(atoms::line_no().encode(env), node.line_no.encode(env))
    .map_err(to_custom_error),
Now if I call Html5ever.flat_parse(html), the output will contain the line_no:
%{
  0 => %{id: 0, line_no: 1, parent: nil, type: :document},
  1 => %{
    attrs: [],
    children: [2, 27, 28],
    id: 1,
    line_no: 1,
    name: "html",
    parent: 0,
    type: :element
  },
  2 => %{
    attrs: [],
    children: [3, 4, 5, 7, 8, 9, 26],
    id: 2,
    line_no: 2,
    name: "head",
    parent: 1,
    type: :element
  },
  # ...
}
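The pattern behind these steps can be tried in isolation. The following is a simplified, self-contained model and not the actual html5ever or NIF code: `LineTrackingSink` and its simplified `Node` are hypothetical stand-ins, and the calls to `set_current_line` would in reality come from html5ever rather than from the caller. It shows only that each node picks up whichever line was reported most recently before it was created.

```rust
/// Simplified stand-in for the NIF's Node: just an id, a name,
/// and the line that was current when the node was created.
#[derive(Debug)]
struct Node {
    id: usize,
    line_no: u64,
    name: String,
}

/// Hypothetical model of a sink that remembers the current line.
struct LineTrackingSink {
    nodes: Vec<Node>,
    current_line: u64,
}

impl LineTrackingSink {
    fn new() -> Self {
        // Element 0 is always the document root, created on line 1.
        let root = Node { id: 0, line_no: 1, name: "document".to_string() };
        LineTrackingSink { nodes: vec![root], current_line: 1 }
    }

    /// In the real sink this is the callback the parser invokes
    /// whenever it advances to a new line.
    fn set_current_line(&mut self, line_number: u64) {
        self.current_line = line_number;
    }

    /// Creates a node stamped with the most recently reported line
    /// and returns its id.
    fn make_node(&mut self, name: &str) -> usize {
        let id = self.nodes.len();
        self.nodes.push(Node {
            id,
            line_no: self.current_line,
            name: name.to_string(),
        });
        id
    }
}
```

Creating an `html` node, then calling `set_current_line(2)` and creating a `head` node, yields line numbers 1 and 2 respectively, mirroring the first entries of the output above.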
I did not create a PR for the Html5ever repo because this change will break quite a lot of other things, and I don't have enough knowledge/experience to work on fixing it all. But I wanted to give you a head start if/when you decide to implement this. Note that html5ever does not expose column details; it only exposes line numbers.
I hope this helps! This was fun as I had to learn some Rust and was able to create a separate NIF for a CSS inliner as well. All in all, a good thing to have worked on :D