Deserialization + serialization of a stylesheet should result in exactly the original

The deion <a href="http://www.w3.org/TR/CSS21/conform.html#ignore" rel="nofollow

Preserve exact content when round tripping CSS stylesheets,about anglesharp/anglesharp

Comments (42)

FlorianRappl commented on July 30, 2024

Actually no - round-tripping should not preserve the original stylesheet. See W3C specification. The serialization is defined exactly, whereas the deserialization allows some freedom.

I thought about some ways as you mention, such as preserving the original. However, this means keeping a lot of strings, which might be too memory intensive than desired.

I would love to come up with a solution. But have a look at browsers. Here the serialized rules / sheets / whatever look different than the original in general.

from anglesharp.

corliss commented on July 30, 2024

Understood. Well then apart from whitespace, are all contents of the original being serialized? (We already know that unknown properties are being discarded currently, but is there anything else being dropped?)

from anglesharp.

corliss commented on July 30, 2024

BTW, Where in the w3c specification does it state round tripping should not occur? I'm not arguing, just want to read the relevant part of the specification.

from anglesharp.

FlorianRappl commented on July 30, 2024

For instance see http://dev.w3.org/csswg/cssom/ (especially the various serialization algorithms). They are very specific.

In HTML the W3C specification emphasizes this even more (http://www.w3.org/TR/html51/syntax.html), e.g. "It is possible that the output of this [the serialization] algorithm, [...], will not return the original tree structure.".

The answer to the other question is: Yes, if it is legal CSS.

from anglesharp.

corliss commented on July 30, 2024

Html and CSS are completely different; it isn't appropriate to extrapolate from one specification to the other. I've been reading these specs for a couple of years which is why I was surprised that properties and rules are being dropped.

I went back and reread the cssom specification you linked to; but still don't see anything relevant to this discussion - could you please point to exactly where it says that serialization can only happen one way? The very fact that there is a css grammar implies that any serialization that can be parsed by the grammar is in fact valid. In other words, CSS does not require stripping out formatting that parses correctly. Do you agree?

from anglesharp.

FlorianRappl commented on July 30, 2024

Have you read the CSSOM specification? Just make a quick check. Open the (HTML5 / CSS3 compliant) browser of your choice. Insert the following test case:

<!doctype html>
<html>
<style>
.foo {
    what: 'foo';
    hi: there;
}
</style>

Now run the following command inside the JS console:

document.styleSheets[0].cssRules[0].cssText

The following text will be displayed: ".foo { }". How? The declarations what and hi have been dropped. This is not only a serialization issue, this is for real. You can also see that on the style attribute, which reflects the state of the CSSStyleDeclaration.

I agree that CSS does not require to strip out formatting that parses correctly. But how should CSS know that some unknown declaration parses correctly? Each declaration has its own value syntax, i.e. CSS can't know if an unknown declaration is in fact correct. Therefore dropping it makes sense.

The same example can be applied to custom rules (e.g. a custom at-rule) as well.

The CSS 2.1 specification defines "ignore" and gives examples. See: http://www.w3.org/TR/CSS21/conform.html#ignore.

Additionally in the CSS 3 syntax specification (http://www.w3.org/TR/css3-syntax/#partial) we find that "[...] authors can exploit the forward-compatible parsing rules to assign fallback values, CSS renderers must treat as invalid (and ignore as appropriate) any at-rules, properties, property values, keywords, and other syntactic constructs for which they have no usable level of support. In particular, user agents must not selectively ignore unsupported component values and honor supported values in a single multi-value property declaration: if any value is considered invalid (as unsupported values must be), CSS requires that the entire declaration be ignored."

This is quite a long answer, but I hope a convincing one. For me alone the conformance with real browsers (actual implementations) is convincing alone.

Nevertheless, coming back to the original topic. The behavior when calling official APIs such as CssText is one thing, but does not mean that we have to give up on carrying these "unknown" or "invalid" properties. I guess I could mark them as being invalid, thus exclude them from being serialized when the original APIs are called, however, still including them in an API that is available in .NET. The API in .NET will / should be more powerful anyway.

from anglesharp.

corliss commented on July 30, 2024

Thanks for the reply. I have read it, but looks like I was misunderstood or did not explain myself well enough. There are a couple of separate problems:

What to do with unknown properties: For example every browser vendor has their own properties. That's what the earlier issue (#37) I opened has to do with.
What to do with original formatting and whitespace, assume the case of perfectly recognized css and properties, i.e. this is not related to (1). The use case is CSS editing tools, i.e. source editors. These obviously exist and preserve full formatting.
Is AngleSharp CSS only intended to be used inside HTML? I was thinking of using it to build CSS editing tools where there is no HTML. That's why I don't understand why there is this tight coupling to HTML. HTML is certainly the most popular use of CSS, but CSS is an independent specification and can be used to style non-HTML (and non-XML) content, and this is in fact how I plan to use it.

from anglesharp.

corliss commented on July 30, 2024

The description http://www.w3.org/TR/CSS21/conform.html#ignore and the sections that it links to are about what User Agents (i.e. browsers) should do when encountering errors.

However, authoring tools such as text editors are not, in general, user agents. They help the user author stylesheets but must be much more lenient about errors. They can parse and highlight errors but obviously only the user should be correcting or removing content from the stylesheet.

The spec does not specify what non-UA tools should do, which makes sense, as the specification is only about parsing and computing css, and not talk about authoring.

As I mentioned at the start, I think that AngleSharp can work well in an authoring environment as well, it just needs to be more careful about preserving all formatting and also preserving erroneous content sensibly.

from anglesharp.

FlorianRappl commented on July 30, 2024

Alright so I'll try to answer every question (but keep asking if I did forget something or explained it insufficiently)

I guess the answer for #37 is good enough here. I will include invalid declarations. In fact that should be easier to handle then than it is now. The "hard" part (isn't hard at all, maybe important would have been a better word) is to distinguish between valid and invalid in e.g. the official APIs. But this will be done.
The thing is: An editing tool just gets information and uses the information. If you create a WYSIWYG style of tool, it would generate rules and declarations according to some algorithm, maybe using these official W3C APIs, maybe using their own. Anyway, the key thing is - for user typed input nothing is converted and serialized back. The conversion is done - but only to check if syntax is valid and to get the CSSOM, in whatever form it might be presented. Of course it is possible to conserve whitespaces and newlines in such a parser (by saving the original tokens), but the CSS parser does not do it in AngleSharp. I was already thinking about including more information, such as the exact text position. But the question is then where do you start and where do you end? Consider HTML: If every tag would have one text position, how do we know if the closing tag has been skipped or not? If we also include this information, then how do we know the positions of the attributes? If attributes are then specified as well, there have to be more positions - name, equals sign and value. What could help for such a scenario is to get access to the stream of tokens, or a representation of these tokens. This is sufficient to do syntax highlighting, get exact positions and more. Each token has a position, a meaning and optional data depending on the token type. It would certainly be better than using the OM representation directly, which has no information regarding the relationship between original source and OM.
The CSS part was mainly build with HTML in mind. The first reason is the ability to parse CSS selectors. The second one is inline styles and stylesheets. But then it is only logical to also include pure stylesheets. Therefore the CSS part is independent, however, the intention of AngleSharp is to utilize CSS mostly for getting styling information that can be applied to HTML documents.

You are right about authoring tools (which may be completely different), but I consider AngleSharp a UA. Nevertheless, as I wrote: It would be nice to provide abilities that enhance authoring tools. The only condition that is required, is that original (W3C) defined APIs need to behave the same as they are intended / do now.

from anglesharp.

corliss commented on July 30, 2024

I'll just focus on the second point.
Yes, keeping the underlying tokens (or text positions) during parsing is key to preserving formatting. That's why I mentioned Roslyn in the original post as an example of a (unrelated) parser that does this.
Just FYI, I already have a css parser coded in ANTLR that does exactly this, so it can be done.
However, I am considering replacing it with AngleSharp, which has more CSS3 support and seems way faster :)

BTW please let's not consider HTML in this discussion. CSS and HTML are very different from a parser perspective. The CSS parsing model is much saner and results in a top-down parse tree that maps contiguous positions within the text. In other words, serialize the tree and you get back the original text (provided you did not drop anything, including formatting.) I don't think HTML is as straightforward, as you point out. So let's consider CSS only, please.

from anglesharp.

FlorianRappl commented on July 30, 2024

You are right about HTML / CSS parsing differences. I was just using HTML as an example, because in CSS the described process would work in a similar fashion.

Yes, it can be done (I believe you!) but I am searching for a way that includes the three criteria I've mentioned in #37. Right now I tend to create more methods in the parsers.

One method to return a plain token stream. This basically only uses the tokenizer, but also wraps the tokens in a publicly representable way. This is only syntactical analysis and no semantics.
One method to return an annotated tree (we could call it semantic tree). It combines the tokenizer with the usual parser and returns elements that represent combinations of both. That way one has the full DOM and also the full token information. This tree could be serialized to the original string and should support round-tripping.

Only the parser (and maybe some helpers) will expose these methods. The common usage will not be effected in any way. I guess this could be a nice way around it. If you need other / more information you do not take the shortpath (e.g. DocumentBuilder), but use, e.g., the CssParser directly.

from anglesharp.

corliss commented on July 30, 2024

Hi. from the activity above, it looks like this is underway? Are you using the approach you planned above? Cheers.

from anglesharp.

FlorianRappl commented on July 30, 2024

Yes somehow - my current plan is to provide a kind of "token visitor" which is something like a tree visitor, however, just for the tokens. This will be straight forward for HTML - for CSS a lot of preprocessing will happen, since the CSS tokens are rather context specific (some are obsolete or have to be merged depending on the context). The token visitor will take care of these context dependencies and provide the subscriber with the complete document information.

from anglesharp.

corliss commented on July 30, 2024

I see. Will it be able to handle this scenario: Load a css file, add a rule set somewhere, and save.
Goal: apart from the new rule set which can be serialized in any way, the rest of the document should look exactly the same, including whitespace.

Edit: Is this what you mean:

A simple parse tree of objects representing the document. Only leaf nodes (representing tokens) know their position in the document. Non-leaf objects serialize themselves by calling their subtrees. This tree is not exactly the same as the css DOM tree, but is related - there are additional nodes for whitespace and tokens. You get this tree starting from a different property on the root Stylesheet object.

Thinking about this further, it seems like this parse tree is a perfect superset of the CSS DOM. To get the DOM, the whitespace and token nodes are excluded from enumeration. To get the fully serializable tree, all nodes are enumerated.

Thus, it just requires a couple of additional node types - Whitespace and Token - and a way to exclude these from enumeration depending on the mode of traversal - DOM or ParseTree.

Since leaf tokens store their text, there isn't any need to keep track of document positions after deserialization.

from anglesharp.

FlorianRappl commented on July 30, 2024

This will be a read-only solution. You will use it to identify the parts and select where to insert what. The insertion and the new rule has to be provided by you. But you could use both, the CSSOM and the given token visitor to combine syntax (incl. positions) information and semantics (the CSSOM) to achieve this. I will possibly write a helper for that scenario as I regard it as very useful, too. But I don't want to constrain / overload the basic idea. Which is why the token visitor cannot handle it out of the box.

from anglesharp.

corliss commented on July 30, 2024

Hmm - not to sound critical, but that does not seem either clean or simple. The edit I made above proposes something very straightforward that does meet this scenario. What do you think?

from anglesharp.

FlorianRappl commented on July 30, 2024

It is a possibility, but it also has some drawbacks and unanswered questions. My approach would be more modular and could also provide a better basis for the CSS parser in general, which now has to handle these context dependencies in the corresponding region. Therefore I liked that approach, because it is already internally usable.

In general I like re-usability and modularity. If I provide small, but usable classes and methods, then these can be combined, re-combined and used in different ways. I therefore can't see how the approach I described excludes the one you mention. But only making the one you describe could make scenarios such as only getting basic syntax information unnecessary hard (you have to iterate through the tree yourself). It could also result in overhead (you don't want the whole tree, if you are just interested in the positions of some tokens).

But I don't want to start a discussion about performance. There are pros and cons for every solution. If you go ahead and implement the tree yourself I would also be happy and help you in succeeding. Then the API is as you want it.

from anglesharp.

corliss commented on July 30, 2024

I've taken a look at the code and here's a detailed proposal:

CssToken and all CssObjects derive from a base class CssObject
CssObject has these methods and properties:
- _wasParsed: true if the object was created during parsing, false if it was created programmatically.
- IEnumerable<CssObject> Children {get; } // the child nodes
- ToCss(StringBuilder sb) which has this implementation:

      public override void ToCss(StringBuilder sb)
      {
            if (_wasParsed && Children != null && Children.Any()) // non-leaf parsed nodes.
            {
                 foreach (var child in Children)
                     child.ToCss(sb);
            }
            else // this code path is for a) tokens and b) nodes 
                    // created after parsing, which do not have tokens 
                    // and so must compute their string representation.
                sb.Append(ToCss()); // we don't have source text, so compute the text.
      }

The existing ToCss() which computes the css using the DOM.

Now, a key insight: Css DOM objects should use Children to retrieve properties, for example

class CssProperty
{
    string Name { get { return Children.OfType<CssToken>()
            .Where(x => x.Type == CssTokenType.Ident>().FirstOrDefault().ToSafeString(); } }

    ICssValue Value { get {return Children
           .OfType<CssValue>().FirstOrDefault(); } } 
}

By retrieving properties this way, even when nodes are added and removed from Children, the DOM properties always report the current objects. Thus, no special code is needed to keep the two trees in sync, since the DOM always refers to the parse tree.

The parser implementation also changes. The basic flow is the same, but Instead of setting properties on the DOM objects, it only sets Children, with child tokens and DOM objects. Important: no tokens are ever thrown away during parsing. This is what permits perfect round-tripping of source css.

To recap,

The parse tree (the Children property) consists of tokens and Css DOM objects. It represents every character in the source text, which can be reproduced by calling ToCss(StringBuilder).
Tokens that were part of the original source preserve their original text. Newly added tokens serialize using computed ToCss().
The DOM always stays current with the parse tree because all DOM properties simply refer to the parse tree (and never copy values into properties.) [Note: if performance is an issue, then the values may be cached together with an invalidate mechanism, but that's a pure optimization. However it is unlikely that performance will be much of a problem since there are only a few nodes under each object. The Linq queries shown could be replaced by fast methods.]

I believe this satisfies the scenarios of a modifiable parse tree.that preserves all content and a readonly DOM (that stays current with the parse tree.) Do you agree? Are there any scenarios missing?

Cheers :)

from anglesharp.

FlorianRappl commented on July 30, 2024

I partially agree. First, these tokens should be thrown away. Tokens are just intermediate objects that provide information for building the tree. Second, they also should be as decoupled as possible, which is not the case in your approach.

Your way is a mixture where a part of the parser is built into every OM node. But of course a real parser has still to exist, which creates the nodes and gives them their tokens.

But every medal has two sides and I like the live updating.

from anglesharp.

corliss commented on July 30, 2024

Cool, I'll let you think about this for a couple of days :)

Edit: just one comment though:

Your way is a mixture where a part of the parser is built into every OM node.

The OM nodes don't do any parsing. They only retrieve values from the parse tree as needed. The parser is the only object that reads and parses the document and builds the parse tree, which is then accessed by the OM. Parsing is done just once. After that nodes can be inserted programmatically into the parse tree by the client app, and the OM will automatically reflect those changes.

Just wanted to be clear about the term parsing: reading the document and generating a parse tree. Let's not use this term for anything else :)

from anglesharp.

corliss commented on July 30, 2024

Hi - have you had a chance to think about this?
BTW it looks like a lot of changes are currently underway - is the code stable now?
Cheers,

from anglesharp.

corliss commented on July 30, 2024

You mentioned in #131 that this will be fixed in v0.9.
Do you have an expected date or timeframe for this? I am looking forward to this quite eagerly.
Thanks!

from anglesharp.

FlorianRappl commented on July 30, 2024

Hopefully v0.9 will come this Wednesday according to the milestones. Maybe we have a potential delay of another week if I can't get all features listed in the issues into v0.9 until Wednesday, but right now I'd like to think optimistic about it.

from anglesharp.

corliss commented on July 30, 2024

Is this code already checked into a branch? I am building AngleSharp myself rather than using nuget, so would love to try it now if it is already available. Please let me know.

from anglesharp.

FlorianRappl commented on July 30, 2024

There is nothing available at the moment. I'll reference this issue from respective commits.

from anglesharp.

FlorianRappl commented on July 30, 2024

The roundtrip is implemented. I will write something up it when I do the documentation for v0.9. It is not a perfect roundtrip though. There are some cases, such as escape sequences and the choice of quotes (single or double), which are not covered yet.

That being said those cases shouldn't matter too much at the moment. What matter probably more is that the round trip scenario is only possible when you specify the right options (IsStoringTrivia) and choose to ParseStyleSheet. Only a ICssStyleSheet has a ParseTree property right now, which represents the root CssNode.

A CssNode is a node in (an artificial) CSS parse tree. It is loosely coupled to the real CSSOM. That way I could keep all the existing stuff, but insert a second tree, which covers the rest. Keeping the original tokens might be too much overhead for the original use-case, which is why an option was needed for turning it on. If not turned on, the secondary tree won't be created at all.

from anglesharp.

corliss commented on July 30, 2024

This is great news. I will try it out soon - in the next couple of days.
I want to thank you for all the great work you've done with AngleSharp - the vision, the quality and the testing. Great stuff!

from anglesharp.

FlorianRappl commented on July 30, 2024

Thanks a lot! Please tell me how you like the API and what kind of improvements you want. At this stage it will be mostly API polishing, performance improvements and providing more extension points.

Also the NuGet issue #85 is still open. I'd like to upgrade the NuGet package to NuGet v3 (http://blog.nuget.org/20150729/Introducing-nuget-uwp.html) soon (before v1.0), which may open AngleSharp to more platforms and reduce bugs such as the one discussed in the issue.

from anglesharp.

corliss commented on July 30, 2024

I couldn't wait so tried out the API :)
Perhaps I'm not using it correctly, but the parse tree seems a lot more difficult to use than the OM.

The parse tree contains CssNodes all of the same type. To know what the CssNode is, we have to look at CssNode.Entity.
The children of a particular CssNode are also just CssNodes. There are no strongly-typed properties based on the Entity. E.g. If a CssNode.Entity is ICssStyleSheet, there should be a strongly typed property that exposes the child nodes that are of type ICssStyleRule.
The design decision to base the Parse tree on the OM rather than the other way around means that the parse tree is unaffected by changes made after the document is loaded (I think.) This means that there isn't a way to get back the original exact text + the changes made after the stylesheet is loaded. Is this correct? The proposal I had made earlier (on Feb 19) would allow for this. This is, I believe a quite serious limitation, as live updates are not synchronized between the trees. (Again, please correct me if I'm wrong about this.)

I believe that all three problems can be solved by having the OM be based on the parse tree rather than the other way around.

To recap my original proposal, every DOM object is a CssNode, and its strongly typed properties are simply Linq expressions on its Children collection. This means that there are effectively two trees, but only one set of objects. This means Intellisense works great for both the parse tree and for the OM, and live updates affect both trees. This is true because both trees are "almost" the same - the OM is just a filtered, strongly typed version of the parse tree. Please let me know if this is not clear, and I'll create a diagram of what I mean.

Finally, a separate issue, but one that I had hoped would be solved by this, but I don't think is:

There is no way to find out (using either the parse tree or OM) many selector properties. Instead, AngleSharp directly converts these into functions (lambdas) for its own internal use only. For example:
a) In nth-child of the format an+b, AngleSharp does not expose a and b.
b) Other examples: the AttrAvailable and AttrBegins forms of simple selectors do not expose their parameters.
I've added some code to AngleSharp to expose all these properties. If you like I can publish to github so that you can take a look at my changes and decide how best to expose these properties.

Edit: about your comment that

Keeping the original tokens might be too much overhead for the original use-case, which is why an option was needed for turning it on. If not turned on, the secondary tree won't be created at all.

It is possible to avoid keeping the low-level tokens even in my proposal, as long as the original source and trivia is kept.

[As a side point, keeping the tokens might not be a big deal. As a case in point, I refer you to the Roslyn implementation of how Visual Studio parses the files in a solution. It has to deal with very large files, and large numbers of them: https://joshvarty.wordpress.com/2014/07/11/learn-roslyn-now-part-3-syntax-nodes-and-syntax-tokens/]

from anglesharp.

corliss commented on July 30, 2024

I'm writing this as a separate comment to keep it simple. An easy improvement that addresses points 1 and 2 above is to make CssNode the base class of all the entity nodes. Then we don't need to store a separate parse tree at all. And if IsStoringTrivia is false, just don't set the Children property. (Might want to rename IsStoringTrivia to something more meaningful like IsSettingChildren.)

This seems a very easy change to make that eliminates the separate tree of CssNodes and the complexity of CssNode.Entity lookups.

Do you agree? Thanks!

from anglesharp.

FlorianRappl commented on July 30, 2024

The decision to have a separate tree was on purpose. If you follow the commits you will see that my first attempt was to inherit everything from CssNode. That would have worked, but then I had a tremendous overhead of 40 bytes per node, which can be easily some more megs without wanting them. Of course you get much more overhead if you keep the tokens - most overhead actually from instantiating and expanding the lists.

So why the current design? It was the least change for the CSSOM. And let's do not forget: The CSSOM is still the main purpose of AngleSharp. If the CSS parser can be used for more (and I try to expose those things), then fine.

Actually for lookups and such you can just use the helper methods. Did you use them? They can be found in AngleSharp.Extensions. There you have some tree helpers. They will give you, e.g., all descendants or the node to CSSOM object mapping.

I think your proposal is clear and I thought about it when I implemented the current solution. But in the end it just was not the right solution. One of the main issues was that even strings should then be some kind of nodes. Otherwise the exact mapping where a string (e.g. for a name) was created is not clear. In a programming language that is easy. You have a token for an identifier, which will result in an identifier node, which will then be able to be validated (is it already defined? what is it bound to? context? type?). In CSS the topic is, unfortunately, a little bit more complicated. Strings may come from a variety of sources (tokens). They could come from a single token or multiple ones.

To summarize: The basic idea, that there are two trees and that they are loosely connected, will most probably remain (I am sorry - but I thought about it for a long time and I tried various solutions - including the one noted below as "Alternative"). What could be boosted is the API around it. That is not perfect - for sure - but specific suggestions are highly welcome. For instance the naming of IsStoringTrivia is certainly something that might be up for a change. I don't know if IsSettingChildren is better (its not that children aren't set, its that tokens are not stored and therefore no CssNode instances to keep them are created).

Alternative: The option I liked the most is create a completely different CSSOM, which does everything you need and much more, and will finally be used / converted to the CSSOM we now have. That way we get the best and most flexible parsing experience, but (and here's the but!) we have to do another transformation for the CSSOM as exposed by the DOM, which is the main purpose of AngleSharp.

This transformation would slow down the whole thing by ~30% (a vague estimate - certainly depends on N/V, the number of nodes against the volume measured in chars). This is the reason I did not go for it. Right now it is hard enough to keep the overhead of the CSS parsing low. HTML should be parsed fast. CSS should even be parsed faster. If AngleSharp takes too long, it is practically useless.

I hope you understand my concerns. And yes, I did run benchmarks. I have a CSS benchmark that you can also run. Having the simple CssNode inheritance overhead (some bytes + list (null, not null?) management) was already a ~15% penalty.

from anglesharp.

FlorianRappl commented on July 30, 2024

A separate comment to move away from the big picture to little details:

and 2) see above
you are right - there is no synchronization. But why would you want them? If you interact with the CSSOM directly there won't be any tokens. Nothing. Also at which position should the tokens be inserted? Also as I understood your use-case: Its always string -> stylesheet. You only use the OM to retrieve some information. Therefore there is no synchronization required. The current model is only to retrieve static information.

Regarding exposing information:
4) Yes - I think such information should be exposed in one way or the other. I don't want to show the "zoo" of classes to users, but probably that is required or can be solved more elegantly. Anyway - you are always free to push such changes and I will merge them happily. I think getting more information about the selectors would be beneficial - so please: go for it!

from anglesharp.

corliss commented on July 30, 2024

Thanks for the detailed explanation. It makes things a lot clearer.

The decision to have a separate tree was on purpose. If you follow the commits you will see that my first attempt was to inherit everything from CssNode. That would have worked, but then I had a tremendous overhead of 40 bytes per node, which can be easily some more megs without wanting them. Of course you get much more overhead if you keep the tokens - most overhead actually from instantiating and expanding the lists.

Yes, this overhead of List bites a lot. In this situation, it seems like an implicit chained list of CssNodes might help significantly. That is, using _child and _next pointers to construct the children collection and expose the collection via a yield return enumerator. This eliminates the separate allocation for the List and shrinks the memory overhead to two pointers, and completely eliminates the expanding List overhead.

Do you think it might be possible to run benchmarks on this? I've resorted to this optimization many times myself, so am quite optimistic. I've also noticed that the .net framework uses this in a few places, such as the Child collection of XmlNode.

from anglesharp.

FlorianRappl commented on July 30, 2024

Yes I also thought about this. My idea was to make a single chain of tokens (i.e. a single list) and just pass on positions (start and end index in the chain, i.e. 1 pointer and 2 integers; 16 bytes) to the various nodes. I think this is also where you are going at.

I guess it might be possible to benchmark this / try around to see if its possible to come with something. I will definitely give it a shot soon (it must be figured out before v1.0).

from anglesharp.

corliss commented on July 30, 2024

Personally I don't need the tokens at all as long as the text of each node is available. I suppose you are keeping the tokens instead of the string text just because tokenization has already occurred?

I don't understand the exact relationship between tokens and CssNodes - in my simplistic view the entire parse tree would have only CssNodes.

I took a look at String GetSource(this CssNode node) and think I see what's going on - the tokens are anything that's not part of a CssNode, correct? The reordering seems a bit complex.

BTW this method does a couple of array allocations and then String.Join. If there were only CssNodes in the right order then a StringBuilder could be used with Append, which means just one allocation. This is probably not the hot path in your use cases, but does speak to the usefulness of having the parse tree just be constructed of CssNodes.

So perhaps a special type of CssNode that represents all trivia (i.e. non-OM entities?) Then the parse tree could indeed just be constructed via _next and _child pointers, and every OM node could inherit from CssNode (assuming the benchmarks turn out OK.)

To put it another way, all tokens could also derive from CssNode, so that they fit into the parse tree as needed. This is what most people think of as a parse tree - the leaf nodes are the source text, and non-leaf nodes are the parsed structure. Both leaf and non-leaf nodes are ultimately the same type of node.

Cheers.

from anglesharp.

FlorianRappl commented on July 30, 2024

I suppose you are keeping the tokens instead of the string text just because tokenization has already occurred?

Yes - this is the reason why I keep the tokens.

in my simplistic view the entire parse tree would have only CssNodes.

This is way too complex. I give a simple example: Consider the following snippet of CSS (format on purpose):

div /* element */ > .foo /* class */ {   
   color  : rgb( 255,   0,255)  !important  ;
font-weight:bold
}

Alright. First we need to consider the comments. They are within the selector. Hence the selector would need further spaces. But a linear model that does not require reordering for stringification seems hopeless, unless we store the original tokens (or maybe text) 1-1.

This is currently done. The reordering has to be in there because of things like the upper declaration. Obviously the spaces inside the brackets ( { } ) belong to the declaration. We could now assign the spaces between the declaration name (color) and colon to the declaration. The rest after the colon is assigned to the value. This goes until (including) a potential semicolon. Any potential tokens afterwards (in a ; } scenario) would again belong to the declarations. So now we have CssNode instances between some other trivia (like whitespace or comments etc.).

Tokens do not know any hierarchy. The are flat. They come from a 1D representation. They will be used to form the tree. I cheat a little here and there (e.g. function tokens contain other tokens) for convenience, but this is the main reason why a CssToken is not a CssNode and therefore cannot be ordered directly. Yes, there is overhead in the ordering, but it is already optimized a little and should be fine as its not the hot path. The join method with a string array work fast, of course using the StringBuilder from the Pool would be faster, but here we need the order in the beginning.

But it is possible to package (a single or multiple) CssToken as a CssNode (e.g. done by the comments). Nevertheless, the only purpose I see here would be to speed up the GetSource method, but the downside is losing the difference between unimportant trivia node and actual OM carrier.

What do you think?

from anglesharp.

corliss commented on July 30, 2024

This is exactly the kind of scenario that Roslyn does deal with, so I think this scenario works out quite well as follows:

The parse tree preserves the original document, i.e. an inorder traversal of leaf nodes (tokens) produces the source text. In this example the start of the tree looks something like this (I have omitted the whitespace tokens, but they would be in the tree as as well.

CssStyleSheet
+ CssStyleRule
  + Selector
    + SimpleSelector
      + Token (identifier): div
    + Token (comment): /* element*/
    + Combinator:
       + Token: >
    + SimpleSelector:
       + Token :foo

We just need two different methods to retrieve the source, one that skips comments, and one that preserves them. No reordering is then necessary, and the parse tree has a simple structure with minimum memory overhead - just the _child and _next pointers.

So yes, it is true that tokens are flat - concatenating them in an inorder traversal of the tree produces the source document. But they are also contained within the parsed tree structure.

Note that this discussion pertains only to the parsed tree, i.e. the _child and _next pointers. The OM tree is unaffected.

BTW in this scenario Tokens only need one piece of data: the text. The TextPosition can be computed as needed (unless this is in the hot path of some usage; I don't need it personally.) In addition CssTokenType can be eliminated altogether by just having subclasses of CssToken. This further reduces the memory footprint.

To summarize, here is the set of proposed changes:

CssNode is the ultimate base class of all OM entities and tokens.
Each token type is a separate class, e.g. UrlToken, ColorToken (this is basically a memory optimization.)
The parse tree is composed of _child and _next pointers of CssNode.
Tokens appear at the correct position in the parse tree.
An inorder traversal of the parse tree (or any node) produces the stylesheet (or the text of the node.) Comments may be skipped during traversal if desired.

from anglesharp.

corliss commented on July 30, 2024

Hi - any response to the above is appreciated. It would be great to have a unified, fast API. Thanks.

from anglesharp.

FlorianRappl commented on July 30, 2024

It's on the list. But maybe we should open another feature issue (would be required anyway), such that I / potential contributors have already a branch name.

I would suggest opening the following two feature requests:

ISelector enhancements. I am thinking about using a visitor pattern here. It would open ISelector objects without requiring any casts / opening the whole internals here. Still all the information would be accessible.
CssNode model.

There could be more, but in my opinion specifying only one features (or a combination where one cannot live without the other) makes sense.

What's your opinion here?

from anglesharp.

corliss commented on July 30, 2024

Could we reopen this issue for the CssNode request? It has all the detailed history. There is just a single set of related changes here, and at this point very contained.

The ISelector enhancements should be in a new issue, I agree.

from anglesharp.

FlorianRappl commented on July 30, 2024

Well I don't want to continue development in this issue as the title refers to roundtrip (which is essentially solved with the current solution) and the CssNode proposal is a general CSSOM rework.

I would rather bring the essential information from this issue (regarding the CssNode) in a new issue (thus removing potential noise) and refer in the new one to this issue (therefore creating a link to view the discussion for historic reasons).

from anglesharp.

corliss commented on July 30, 2024

Done, cheers.

from anglesharp.

Preserve exact content when round tripping CSS stylesheets about anglesharp HOT 42 CLOSED

Comments (42)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent