Comments (12)
Hmm, I've been modding expatbuilder and it seems to have worked. A decent parseString could be coming quite soon. Can you feel the excitement?
from domonic.
some links...
https://www.tutorialspoint.com/python3/python_xml_processing.htm
https://www.computerhope.com/unix/pylibml.htm
from domonic.
this looks exciting...
https://github.com/byteface/html5-parser/blob/master/src/html5_parser/dom.py
Given what I just did with expat, I may be able to mod that to generate domonic from huge sites?
from domonic.
I managed to mod the file. Easier than I thought...
So that appears to work, even with lots of websites. It seems to build trees with domonic.
import requests
import html5_parser  # the patched version, so that treebuilder='dom' builds domonic nodes

sites = []  # add hostnames here (without the https:// prefix)
for site in sites:
    try:
        r = requests.get("https://" + site)
        some_html = r.content.decode("utf-8")
        root = html5_parser.parse(some_html, treebuilder='dom')  # , return_root=False)
        print(root)
        # print(type(root))  # a domonic Document
        # print([str(el) for el in root.getElementsByTagName("a")])
    except Exception as e:
        print('Failed to dl page', e)
from domonic.
So the options are to patch that file after each install, or:
pip install git+https://path to my patched version
I need to figure out that path and test again. But very promising. It's so fast.
from domonic.
https://html5-parser.readthedocs.io/en/latest/
from domonic.
It is a cool toolkit, but is there a way to quickly convert an HTML page to Python code?
from domonic.
Hi @ipfans , thanks for feedback.
There is not yet a perfect way, as I originally only set out to generate html. But it IS on the roadmap.
Some more complete parsers for html/python will hopefully be ready by v1, which I'd love to get done within 12 months.
We can already get about 75% or more of the way (but it is dangerous and uses eval).
See codemirror.py in this folder:
https://github.com/byteface/domonic/tree/master/examples/parsing
or via the command line util...
python3 -m domonic -d http://eventual.technology
Also, all tags recently had a secret __pyml__() function added, but it may not recurse and is not fully tested, so it is not documented.
So if you do:
mydom.__pyml__()
it might work, if you have an existing dom. A preliminary option was also added to the renderer:
render(root, 'test.pyml', 'pyml')
However for this to work we need a dom already parsed.
As people who use minidom know (some may be coming here), it can only parse very strict XML, not html. So it seems to work sometimes but very easily doesn't. That is why the domonic parsers fail, as they leverage the same thing. They usually fail because of content, not node structure. The default parsers often work fine for html strings without content, for example.
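To make the strictness concrete, here is a small sketch using only the standard library. The helper name `try_parse` and the sample strings are made up for illustration; it just shows minidom accepting well-formed XML and rejecting markup that a browser would happily render, including a failure caused purely by content (an HTML entity).

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def try_parse(markup):
    """Return the root tag name if minidom accepts the markup, else None."""
    try:
        return minidom.parseString(markup).documentElement.tagName
    except ExpatError:
        return None


well_formed = "<div><p>hello</p><br/></div>"
typical_html = "<div><p>hello<br></div>"  # unclosed <p> and <br>: fine in HTML, not in XML
with_entity = "<p>fish &nbsp; chips</p>"  # structure is fine, but &nbsp; is not an XML entity

print(try_parse(well_formed))   # accepted
print(try_parse(typical_html))  # rejected because of structure
print(try_parse(with_entity))   # rejected because of content
```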
I then tried to get around this with a simple parser myself, but found I wanted to keep expanding on it, and that is at the heart of domonic: an unfinished, regex-based, in-place html to python converter.
However it still has errors, and the main issue is that python wants keyword args last. Therefore you have to not only parse but also swap the nodes around to put 'content' before _classes, for example. (the only real crux of learning domonic)
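The reordering problem can be sketched in a few lines. This is not domonic's converter: `to_pyml` is a hypothetical helper, the node shape `(tag, attrs, children)` is made up for the example, and the underscore prefix on attribute names is assumed from the convention mentioned in this thread. HTML writes attributes before content, so the emitter has to output children first and keyword arguments last.

```python
def to_pyml(node):
    """Render a (tag, attrs, children) tuple as a Python call string.

    Children become positional arguments and attributes become keyword
    arguments, so the attributes must be emitted last.
    """
    if isinstance(node, str):
        return repr(node)  # text content
    tag, attrs, children = node
    positional = [to_pyml(child) for child in children]
    keywords = [f"_{name}={value!r}" for name, value in attrs.items()]
    return f"{tag}({', '.join(positional + keywords)})"


# <div class="box" id="main">hello <b>world</b></div>
node = ("div", {"class": "box", "id": "main"}, ["hello ", ("b", {}, ["world"])])
print(to_pyml(node))  # div('hello ', b('world'), _class='box', _id='main')
```

Attribute names that are not valid identifiers (data-attributes, for example) would break this naive version and need separate handling.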
Anyway, during investigation I found several ways to parse. Python has a builtin html parser too, but you have to use it like a lexer and I've not gotten round to it yet. There are also PEG parsers and some off-the-shelf ones. I also found an html5 c++ one, referenced above. So my long term goal would be to have a good default one out of the box, with the option of picking some others.
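For anyone curious what "use it like a lexer" means in practice, here is a rough sketch with the standard library's `html.parser.HTMLParser`. It only fires events (start tag, end tag, data), so the tree has to be assembled by hand. The `TreeBuilder` class, the nested-list node shape and the `VOID_TAGS` set are illustrative assumptions, not anything from domonic, and the error recovery is far simpler than a real html5 parser's.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for the demo).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Collect HTMLParser events into nested [tag, attrs, children] lists."""

    def __init__(self):
        super().__init__()
        self.root = ["#document", {}, []]
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        node = [tag, dict(attrs), []]
        self.stack[-1][2].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1][2].append(data)


builder = TreeBuilder()
builder.feed("<div class='box'><p>hello<br>world</div>")  # note the unclosed <p>
builder.close()
print(builder.root)
```

Swapping the nested lists for real dom nodes would be the next step, but that part depends on the dom being targeted.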
For now, if you are brave, domonic's __init__ class has a host of methods that are trying to work towards this aim.
After the initial regex parse, which does syntax only, it then passes through a series of self-iterating failures to try to fix syntax issues and swap the parameters into the order python expects. This currently uses eval to check the line is valid, so it is dangerous. Hence not documented.
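As a side note, a syntax check does not have to execute anything. The sketch below is an alternative to the eval approach described above, not what domonic currently does: it uses the standard library's `ast.parse`, which only parses the line. `is_valid_expression` is a made-up helper name.

```python
import ast


def is_valid_expression(line):
    """Return True if the line parses as a single Python expression.

    ast.parse only builds a syntax tree; unlike eval, nothing is executed.
    """
    try:
        ast.parse(line, mode="eval")
        return True
    except SyntaxError:
        return False


print(is_valid_expression("div('hello', _class='box')"))  # keyword args last
print(is_valid_expression("div(_class='box', 'hello')"))  # positional after keyword
print(is_valid_expression("div('hello'"))                 # unbalanced bracket
```

This only catches syntax problems. Undefined names or wrong argument types would still show up only when the generated code is actually run.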
By using these tools you can get 75% of the way there for some huge files, then manually edit them to work, by rendering them and fixing the syntax issues pointed out when trying to compile. (there's a guide on the readme for common errors that can help speed this up).
My biggest success was using the hacked html5 c++ parser mentioned above and then calling pyml() on the dom it produces. However there are still issues compared to my existing parser (which isn't too bad in some cases).
i.e. the c++ one does not yet convert data-attributes to the keyword argument syntax format. It doesn't do this automatically for you:
**{'_data-tag':'somevalue'}
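A fixer for this could look roughly like the sketch below. `format_attrs` is a hypothetical helper, not part of domonic or the c++ parser: it keeps ordinary keyword syntax for attribute names that are valid Python identifiers and moves everything else (data-attributes, for example) into a `**{...}` dict, using the underscore prefix shown above.

```python
def format_attrs(attrs):
    """Format an attribute dict as Python keyword-argument source text."""
    plain = []     # names usable directly as keyword arguments
    unpacked = {}  # names that need the **{...} form, e.g. 'data-tag'
    for name, value in attrs.items():
        key = "_" + name
        if key.isidentifier():
            plain.append(f"{key}={value!r}")
        else:
            unpacked[key] = value
    if unpacked:
        plain.append(f"**{unpacked!r}")
    return ", ".join(plain)


print(format_attrs({"class": "box", "data-tag": "somevalue"}))
# _class='box', **{'_data-tag': 'somevalue'}
```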
So I haven't released any further documentation, and won't until I come back to investigate parsing, or get help.
Anyway, I hope these tips assist you while I'm still figuring it all out, and maybe you might like the codemirror.py example.
Once done, you may also enjoy this plugin, which will format it for you.
A useful plugin for formatting flat .pyml in vscode:
https://marketplace.visualstudio.com/items?itemName=mgesbert.indent-nested-dictionary
Also, as a final note: if you don't want it ALL in domonic because templating some parts is laborious, you can mix in your own f-strings. See the DocumentFragment example here...
https://github.com/byteface/htmxtest/blob/master/app.py
from domonic.
To explain a little deeper, and cover future progress, as the parser stuff is undocumented:
domonic originally had a simple regex parser, for tags only, no content.
That grew, and domonic currently uses it... (which you then need to eval if you want to auto fix it up)
domonic.parse
But it can also use a copy of the builtin minidom parseString. This auto-fails with single char replacement, so it could take forever to generate a working doc if the XML is not perfect. :/ I achieved that by hacking the builtin expat parser to use domonic rather than minidom. However, that needs replacing with an html5 parser.
So the c++ one I knocked up to prove the concept and check compatibility, but it is not ideal as it is not pure python and needs extra steps to set up on windows. So it will be a later 'option'.
I need to write a pure python one, using the builtin if possible.
There's a new window class that will eventually let you do
window.location = x
On my own fork I swapped out its parseString method to get the c++ one working. So if you need a quick fix you can do something like that. To help with this I've been moving some of the parse methods discovered into a new utility parse package. So if you want to play, you can try to hook the data-attribute fixer up to the hacked c++ parser and bingo.
However, I'm probably at least several months away from the full solution, as I need to start a whole new one or find a compatible lib that can build with my dom as an option, rather than hacking it like I did with expat. Only then can I get back to my regex curiosity.
Also, for compatibility, 'html' must not BE the document. So a slight re-architecture of the dom is needed without breaking current usage, which I'm also in the process of considering and which should help with other dom builders. To understand what I'm talking about, diff the native expat parser against my 'borrowed' one and you will see.
from domonic.
Thanks for your replies. I made a just-works version of the converter :) But official support is good news.
from domonic.
html5lib now has an integration point.
An example exists in the /examples/parsers/html5libtest...
and notes on the release. https://github.com/byteface/domonic/releases/tag/0.6.5
from domonic.
I've included html5lib, and an integration point for the c++ one.
import html5_parser
from domonic.ext.html5_parser_ import parse
root = parse(some_html_string, treebuilder='domonic')
Though that one is still experimental and needs testing.
from domonic.