I think I'm going to need a few types of parser. a normal one, one that uses python bu

parser about domonic HOT 12 OPEN

byteface commented on May 25, 2024

parser

from domonic.

Comments (12)

byteface commented on May 25, 2024

Hmmm, I've been modding expatbuilder and it seems to have worked. A decent parseString could be coming quite soon. Can you feel the excitement?

byteface commented on May 25, 2024

some links...
https://www.tutorialspoint.com/python3/python_xml_processing.htm
https://www.computerhope.com/unix/pylibml.htm

byteface commented on May 25, 2024

this looks exciting...
https://github.com/byteface/html5-parser/blob/master/src/html5_parser/dom.py

Given what I just did with expat, I may be able to mod that to generate domonic from huge sites.

byteface commented on May 25, 2024

I managed to mod the file. Easier than I thought...

byteface/html5-parser@fa83bf1

So that appears to work, even with lots of websites. It seems to build trees with domonic.

import requests
from html5_parser import parse

sites = []  # add webpages here (hostnames only, "https://" is prepended below)
for site in sites:
    try:
        r = requests.get("https://" + site)
        some_html = r.content.decode("utf-8")
        # with the patched html5-parser, treebuilder='dom' builds a domonic tree
        root = parse(some_html, treebuilder='dom')  # , return_root=False)
        print(root)
        # print(type(root))  # a domonic Document
        # print([str(el) for el in root.getElementsByTagName("a")])
    except Exception as e:
        print('Failed to dl page', e)

byteface commented on May 25, 2024

So the options are to patch that file after each install, or

pip install git+https://path to my patched version

I need to figure out that path and test again. But it's very promising. It's so fast.

byteface commented on May 25, 2024

https://html5-parser.readthedocs.io/en/latest/

ipfans commented on May 25, 2024

It is a cool toolkit, but is there a quick way to transcribe an html page to python code?

byteface commented on May 25, 2024

Hi @ipfans, thanks for the feedback.

There is not yet a perfect way, as I originally only set out to generate html. But it IS on the roadmap.

Some more complete parsers for html/python will hopefully be ready by v1, which I'd love to get done within 12 months.

We can already get about 75% or more of the way there (but it is dangerous, as it uses eval).

See codemirror.py in this folder, i.e.
https://github.com/byteface/domonic/tree/master/examples/parsing

or via the command line util...
python3 -m domonic -d http://eventual.technology

Also, all tags recently had a secret __pyml__() function added, but it may not recurse and is not fully tested, so it is not documented.

So if you do:

    mydom.__pyml__()

it might work, if you have an existing dom. A preliminary option was also added to the renderer:

render(root, 'test.pyml', 'pyml')

However, for this to work we need a dom that is already parsed.

As people who use minidom know (some may be coming here), it can only parse very strict XML, not html. So it seems to work sometimes but very easily doesn't. That is why the domonic parsers fail, as they leverage the same code. They usually fail because of the content, not the node structure. For example, the default parsers often work fine for html strings without content.
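
To illustrate that strictness, here is a minimal sketch using only the standard library. The helper name `try_parse` and both sample strings are made up for this example; they are not part of domonic.

```python
from xml.dom.minidom import parseString
from xml.parsers.expat import ExpatError


def try_parse(markup):
    """Return the root tag name if minidom accepts the markup, else the error text."""
    try:
        return parseString(markup).documentElement.tagName
    except ExpatError as err:
        return f"failed: {err}"


# Well-formed XML: every tag is closed and there are no HTML-only entities.
strict = "<html><body><p>hello</p><br/></body></html>"

# Ordinary html: an unclosed <br> and an &nbsp; entity inside the content.
loose = "<html><body><p>fish &nbsp; chips<br></p></body></html>"

print(try_parse(strict))
print(try_parse(loose))
```

The node structure of both strings is almost the same; it is the content (the entity) and the unclosed void tag that make the second one fail.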

I then tried to get around this with a simple parser of my own, but found I wanted to keep expanding on it, and that is what sits at the heart of domonic: an unfinished, regex-based, in-place html to python converter.

However, it still has errors, and the main issue is that python wants keyword args last. Therefore you have to not only parse but also swap the nodes around, to put 'content' before _classes for example (the only real crux of learning domonic).
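
The ordering problem can be sketched as follows. This is not domonic's own code: `is_keyword_arg`, `keywords_last` and `compiles` are hypothetical helpers, and the `_class` name assumes the underscore-prefixed attribute style used elsewhere in this thread.

```python
import ast


def is_keyword_arg(arg):
    """True if the argument text is a keyword argument (name=value or **mapping)."""
    call = ast.parse(f"_({arg})", mode="eval").body
    return bool(call.keywords)


def keywords_last(args):
    """Stable reorder: positional (content) arguments first, keyword arguments after."""
    positional = [a for a in args if not is_keyword_arg(a)]
    keywords = [a for a in args if is_keyword_arg(a)]
    return positional + keywords


def compiles(source):
    """Check syntax only; compile() does not run the expression."""
    try:
        compile(source, "<pyml>", "eval")
        return True
    except SyntaxError:
        return False


# Attribute first, then content: the order the pieces appear in the html.
html_order = ['_class="intro"', '"hello"']

naive = f"p({', '.join(html_order)})"
fixed = f"p({', '.join(keywords_last(html_order))})"

print(naive, compiles(naive))
print(fixed, compiles(fixed))
```

A straight transcription of the html order is rejected by python, while the reordered call compiles.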

Anyway, during investigation I found several ways to parse. python has a builtin html parser too, but you have to use it like a lexer and I've not gotten round to it yet. There are also PEG parsers and some off-the-shelf ones. I also found the html5 c++ one referenced above. So my long term goal would be to have a good default one out of the box, with the option of picking some others.
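
"Like a lexer" here means the builtin parser only reports tokens through callbacks; it does not build a tree for you. A minimal sketch with the standard library's `html.parser.HTMLParser` (the `TokenCollector` class name and the sample markup are made up for this example):

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Records the token stream; building a tree from it is left to the caller."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.tokens.append(("end", tag))

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if data.strip():
            self.tokens.append(("data", data.strip()))


collector = TokenCollector()
collector.feed('<div class="box"><p>hello<br>world</p></div>')
collector.close()

for token in collector.tokens:
    print(token)
```

Note that the unclosed `<br>` is reported as a start tag without error, so a tree builder on top of this would need its own rules for void elements.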

For now, if you are brave, the domonic __init__ class has a host of methods that are working towards this aim.
After the initial regex parse, which handles syntax only, it passes through a series of self-iterating failures to try to fix syntax issues and swap the parameters into the order python expects. This currently uses eval to check that the line is valid, so it is dangerous. Hence it is not documented.
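
Why eval is the dangerous part can be shown with a small sketch. This is not how domonic does it; it only contrasts eval with `ast.parse`, which checks syntax without running anything. The names `side_effect`, `is_valid` and `executed` are hypothetical.

```python
import ast

executed = []


def side_effect():
    """Stand-in for anything harmful hidden inside generated source."""
    executed.append("ran")
    return "x"


def is_valid(source):
    """Syntax check only: ast.parse builds a tree and never runs the code."""
    try:
        ast.parse(source, mode="eval")
    except SyntaxError:
        return False
    return True


line = "str(side_effect())"

# Parsing the line leaves the side effect untouched.
valid = is_valid(line)
after_parse = list(executed)

# eval compiles *and runs* the line, so the side effect happens.
eval(line, {"side_effect": side_effect, "str": str})
after_eval = list(executed)

print(valid, after_parse, after_eval)
```

A syntax-only check cannot catch runtime problems such as undefined names, so it is a partial substitute at best.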

Using these tools you can get 75% of the way there for some huge files, then manually edit them until they work: render them, then fix the syntax issues pointed out when you try to compile. (There's a guide to common errors in the readme that can help speed this up.)

My biggest success was using the hacked html5 c++ parser mentioned above and then calling __pyml__() on the dom it produces. However, there are still issues compared to my existing parser (which isn't too bad in some cases).

For example, the c++ one does not yet convert data-attributes to the keyword argument syntax.

It doesn't do this...
i.e. **{'_data-tag':'somevalue'}

automatically for you.
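
A minimal sketch of such a data-attribute fixer, written as an illustration only. `attrs_to_kwargs` is a hypothetical helper, not a domonic function, and it assumes the underscore-prefixed attribute names seen in the example above.

```python
def attrs_to_kwargs(attrs):
    """Render html attributes as source text for the arguments of a tag call.

    Names that are legal identifiers once prefixed with "_" become plain
    keyword arguments. Anything else (such as data-* names, which contain
    a hyphen) is collected into one **{...} mapping.
    """
    plain, unpacked = [], {}
    for name, value in attrs.items():
        key = "_" + name
        if key.isidentifier():
            plain.append(f"{key}={value!r}")
        else:
            unpacked[key] = value
    if unpacked:
        plain.append(f"**{unpacked!r}")
    return ", ".join(plain)


print(attrs_to_kwargs({"class": "btn", "data-tag": "somevalue"}))
```

The hyphen is what forces the `**{...}` form: `_data-tag=...` would be read by python as a subtraction, not a keyword argument.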

So I haven't released any further documentation, and won't until I come back to investigate parsing, or get help.

Anyway, I hope these tips help while I'm still figuring it all out, and you might like the codemirror.py example.

Once done, you may also enjoy this plugin, which will format it for you.

useful plugin for formatting flat .pyml in vscode

https://marketplace.visualstudio.com/items?itemName=mgesbert.indent-nested-dictionary

Also, as a final note: if you don't want it ALL in domonic because templating some parts is laborious, you can mix in your own f-strings. See the DocumentFragment example here...

https://github.com/byteface/htmxtest/blob/master/app.py

byteface commented on May 25, 2024

To explain a little deeper, and to cover future progress, as the parser stuff is undocumented.

domonic originally had a simple regex parser, for tags only, no content.

It grew, and domonic currently uses that (you then need to eval the result if you want to auto-fix it up):
domonic.parse

But it can also use a copy of the builtin minidom parseString. This auto-fails with single char replacement, so it could take forever to generate a working doc if the XML is not perfect :/ . I achieved that by hacking the builtin expatparser to use domonic rather than minidom. However, that needs replacing with a html5 parser.

So I knocked up the c++ one to prove the concept and check compatibility, but it is not ideal, as it is not pure python and needs extra steps to set up on windows. So it will be a later 'option'.

I need to write a pure python one, using the builtin if possible.

There's a new window class that will eventually let you do

window.location = x

for which, on my own fork, I swapped out the parseString method to get the c++ one working. So if you need a quick fix you can do something like that. To help with this I've been moving some of the parse methods I discovered to a new utility parse package. So if you want to play, you can try to hook the data-attribute fixer up to the hacked c++ parser, and bingo.

However, I'm probably at least several months away from the full solution, as I need to either start a whole new parser or find a compatible lib that can build with my dom as an option, rather than hacking it like I did with expat. Only then can I get back to my regex curiosity.

Also, for compatibility, 'html' must not BE the document. So a slight re-architecture of the dom is needed, without breaking current usage. I'm also in the process of considering that, and it should help with other dom builders. To understand what I'm talking about, diff the native expat parser against my 'borrowed' one and you will see.

ipfans commented on May 25, 2024

Thanks for your replies. I made a "just works" version of the transcription :) But official support is good news.

byteface commented on May 25, 2024

html5lib now has an integration point.

An example exists in /examples/parsers/html5libtest...

and there are notes on the release: https://github.com/byteface/domonic/releases/tag/0.6.5

byteface commented on May 25, 2024

I've included html5lib, and an integration point for the c++ one.

import html5_parser
from domonic.ext.html5_parser_ import parse
root = parse(some_html_string, treebuilder='domonic')

Though that one is still experimental and still needs testing.
