tim-gromeyer / html2md
Transform your HTML into clean, easy-to-read markdown with html2md.
Home Page: https://tim-gromeyer.github.io/html2md/
License: MIT License
Describe the bug
I have a simple HTML document with <br> tags at the end of the lines, and they are not converted properly.
To Reproduce
Run html2md.exe breaks.html -p with the following HTML document:
<!DOCTYPE html>
<html lang="en">
<head><meta charset="utf-8"></head>
<body>
line 1<br>
line 2<br>
</body>
</html>
You will get:
line 1
<br>
line 2
<br>
Expected behavior
Each <br> should be converted to a line break instead of being passed through as a literal tag.
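The expected result can be pinned down with a small standard-library sketch. This is not html2md's implementation. It assumes that a Markdown hard line break (two trailing spaces followed by a newline) is the desired output for <br>. The names BreakToMarkdown and br_to_markdown are hypothetical, and every tag other than <br> is ignored.

```python
import re
from html.parser import HTMLParser


class BreakToMarkdown(HTMLParser):
    """Toy converter (hypothetical): keeps text, turns <br> into a hard break."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            # Drop spaces before the break, then emit a Markdown hard break.
            if self.parts:
                self.parts[-1] = self.parts[-1].rstrip(" ")
            self.parts.append("  \n")

    def handle_data(self, data):
        # Collapse source whitespace the way HTML rendering would.
        text = re.sub(r"\s+", " ", data)
        # At the start of an output line, leading whitespace is not significant.
        if not self.parts or self.parts[-1].endswith("\n"):
            text = text.lstrip(" ")
        if text:
            self.parts.append(text)


def br_to_markdown(html: str) -> str:
    parser = BreakToMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


sample = "<body>\nline 1<br>\nline 2<br>\n</body>"
print(repr(br_to_markdown(sample)))
```

The point of the sketch is only that no literal <br> survives in the output and each one ends its line.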
Hey Tim!
Thanks for this library. I'm planning to use it in my block editor (https://github.com/nuttyartist/notes/tree/block-editor): when a user pastes HTML content into the editor, I want to convert it to Markdown.
But I'm encountering a problem, the same one I encountered with QTextDocument::toMarkdown (after calling setHtml). For some reason, both insert line breaks (\n) unnecessarily. For example, I took the following random text from the internet (https://news.ycombinator.com/item?id=38108048).
m_clipboard->mimeData(QClipboard::Clipboard)->html() returns:
<meta charset='utf-8'>
<span style=\"color: rgb(0, 0, 0); font-family: Verdana, Geneva, sans-serif; font-size: 12px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; white-space: normal; background-color: rgb(246, 246, 239); text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial; display: inline !important; float: none;\">My partner is an Astrophysicist who relies on Gnu Emacs as her daily driver. Her work involves managing a treasure trove of legacy code written in a variety of languages like Fortran, Matlab, IDL, and IRAF. This code is essential for her data reduction pipelines, supporting instruments across observatories such as Keck 1 & 2, the AAT, Gemini, and more.</span>
<p style=\"margin-top: 8px; margin-bottom: 0px; color: rgb(0, 0, 0); font-family: Verdana, Geneva, sans-serif; font-size: 12px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; white-space: normal; background-color: rgb(246, 246, 239); text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;\">Each time she acquires a new Mac, she embarks on a week-long odyssey to set up her computing environment from scratch. It's not because she enjoys it; rather, it's a necessity because the built-in migration assistant just doesn't cut it for her specialised needs.</p>
<p style=\"margin-top: 8px; margin-bottom: 0px; color: rgb(0, 0, 0); font-family: Verdana, Geneva, sans-serif; font-size: 12px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; white-space: normal; background-color: rgb(246, 246, 239); text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;\">While she currently wields the power of an M1 Max MacBook Pro and runs on the Monterey operating system, she tends to stick with the pre-installed OS for the lifespan of her hardware, which often spans several years. In her case, this could be another 2-3 years or even more before she retires the machine or hands it over to a postdoc or student.</p>
<p style=\"margin-top: 8px; margin-bottom: 0px; color: rgb(0, 0, 0); font-family: Verdana, Geneva, sans-serif; font-size: 12px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; white-space: normal; background-color: rgb(246, 246, 239); text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;\">But why does she avoid the annual OS upgrades? It's simple. About a decade ago, every OS update would wreak havoc on her meticulously set-up environment. Paths would break, software would malfunction, and libraries that used to reside in one place mysteriously migrated to another. The headache and disruptions were just not worth it.</p>
<p style=\"margin-top: 8px; margin-bottom: 0px; color: rgb(0, 0, 0); font-family: Verdana, Geneva, sans-serif; font-size: 12px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: start; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; white-space: normal; background-color: rgb(246, 246, 239); text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;\">She decided to call it quits on annual OS upgrades roughly 7-8 years ago. While I've suggested Docker as a potential solution, it still requires her to take on the role of administrator and caretaker, which, in her busy world of astrophysical research, can be quite the distraction.</p>"
Using html2md:
My partner is an Astrophysicist who relies on Gnu Emacs as her daily driver. Her\nwork involves managing a treasure trove of legacy code written in a variety of languages\nlike Fortran, Matlab, IDL, and IRAF. This code is essential for her data reduction\npipelines, supporting instruments across observatories such as Keck 1 & 2, the\nAAT, Gemini, and more.\nEach time she acquires a new Mac, she embarks on a week-long odyssey to set up her\ncomputing environment from scratch. It's not because she enjoys it; rather, it's\na necessity because the built-in migration assistant just doesn't cut it for her\nspecialised needs.\n\nWhile she currently wields the power of an M1 Max MacBook Pro and runs on the Monterey\noperating system, she tends to stick with the pre-installed OS for the lifespan of\nher hardware, which often spans several years. In her case, this could be another\n2-3 years or even more before she retires the machine or hands it over to a postdoc\nor student.\n\nBut why does she avoid the annual OS upgrades? It's simple. About a decade ago, every\nOS update would wreak havoc on her meticulously set-up environment. Paths would break,\nsoftware would malfunction, and libraries that used to reside in one place mysteriously\nmigrated to another. The headache and disruptions were just not worth it.\n\nShe decided to call it quits on annual OS upgrades roughly 7-8 years ago. While I've\nsuggested Docker as a potential solution, it still requires her to take on the role\nof administrator and caretaker, which, in her busy world of astrophysical research,\ncan be quite the distraction.\n
While it should return:
My partner is an Astrophysicist who relies on Gnu Emacs as her daily driver. Her work involves managing a treasure trove of legacy code written in a variety of languages like Fortran, Matlab, IDL, and IRAF. This code is essential for her data reduction pipelines, supporting instruments across observatories such as Keck 1 & 2, the AAT, Gemini, and more.\nEach time she acquires a new Mac, she embarks on a week-long odyssey to set up her computing environment from scratch. It's not because she enjoys it; rather, it's a necessity because the built-in migration assistant just doesn't cut it for her specialised needs.\n\nWhile she currently wields the power of an M1 Max MacBook Pro and runs on the Monterey operating system, she tends to stick with the pre-installed OS for the lifespan of her hardware, which often spans several years. In her case, this could be another 2-3 years or even more before she retires the machine or hands it over to a postdoc or student.\n\nBut why does she avoid the annual OS upgrades? It's simple. About a decade ago, every OS update would wreak havoc on her meticulously set-up environment. Paths would break, software would malfunction, and libraries that used to reside in one place mysteriously migrated to another. The headache and disruptions were just not worth it.\n\nShe decided to call it quits on annual OS upgrades roughly 7-8 years ago. While I've suggested Docker as a potential solution, it still requires her to take on the role of administrator and caretaker, which, in her busy world of astrophysical research, can be quite the distraction.
What can be done about this? (QTextMarkdown shares the same problem).
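Until the converter itself can be told not to wrap, one possible workaround is to post-process the Markdown. The sketch below is an assumption-laden illustration, not part of html2md: unwrap_soft_breaks is a hypothetical helper that treats blank lines as paragraph separators and joins every other line break with a space. It would also flatten lists, code blocks, and intentional hard breaks, and it cannot preserve a single \n between two blocks (as in the expected output above), so it only suits plain paragraphs like the ones in this report.

```python
import re


def unwrap_soft_breaks(markdown: str) -> str:
    """Join wrapped lines inside each paragraph (hypothetical helper).

    Paragraphs are assumed to be separated by one or more blank lines.
    """
    paragraphs = re.split(r"\n{2,}", markdown.strip("\n"))
    unwrapped = [
        " ".join(line.strip() for line in paragraph.split("\n"))
        for paragraph in paragraphs
    ]
    return "\n\n".join(unwrapped)


wrapped = "She decided to call it quits\non annual OS upgrades.\n\nNext paragraph\ncontinues here.\n"
print(unwrap_soft_breaks(wrapped))
```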
Hey again Tim!
I tried to paste some HTML from HackerNews into my note app and it crashed in the convert() function. This is the HTML that was copied from the clipboard:
<meta charset='utf-8'><table border=\"0\" style=\"font-family: Verdana, Geneva, sans-serif; letter-spacing: normal; orphans: 2; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(246, 246, 239); text-decoration-thickness: initial; text-decoration-style: initial; text-decoration-color: initial;\"><tbody><tr><td class=\"ind\" indent=\"0\" style=\"font-family: Verdana, Geneva, sans-serif; font-size: 10pt; color: rgb(130, 130, 130);\"><img src=\"https://news.ycombinator.com/s.gif\" height=\"1\" width=\"0\"></td><td valign=\"top\" class=\"votelinks\" style=\"font-family: Verdana, Geneva, sans-serif; font-size: 10pt; color: rgb(130, 130, 130);\"><center><a id=\"up_21885445\" class=\"clicky\" href=\"https://news.ycombinator.com/vote?id=21885445&how=up&auth=4ab46530c8158343f958f2cda580e250bcc8e667&goto=item%3Fid%3D21884828#21885445\" style=\"color: rgb(0, 0, 0); text-decoration: none;\"><div class=\"votearrow\" title=\"upvote\" style=\"width: 10px; height: 10px; border: 0px; margin: 3px 2px 6px; background: url("triangle.svg") 0% 0% / 10px, linear-gradient(transparent, transparent) no-repeat;\"></div></a></center></td><td class=\"default\" style=\"font-family: Verdana, Geneva, sans-serif; font-size: 10pt; color: rgb(130, 130, 130);\"><div style=\"margin-top: 2px; margin-bottom: -10px;\"><span class=\"comhead\" style=\"font-family: Verdana, Geneva, sans-serif; font-size: 8pt; color: rgb(130, 130, 130);\"><a href=\"https://news.ycombinator.com/user?id=brudgers\" class=\"hnuser\" style=\"color: rgb(130, 130, 130); text-decoration: none;\">brudgers</a><span> </span><span class=\"age\" title=\"2019-12-26T17:52:06\"><a href=\"https://news.ycombinator.com/item?id=21885445\" style=\"color: rgb(130, 130, 130); text-decoration: none;\">on Dec 27, 2019</a></span><span> </span><span id=\"unv_21885445\"></span><span class=\"navs\">|<span> </span><a href=\"https://news.ycombinator.com/item?id=21884828#21894436\" class=\"clicky\" 
aria-hidden=\"true\" style=\"color: rgb(130, 130, 130); text-decoration: none;\">next</a><span> </span><a class=\"togg clicky\" id=\"21885445\" n=\"28\" href=\"javascript:void(0)\" style=\"color: rgb(130, 130, 130); text-decoration: none;\">[–]</a><span class=\"onstory\"></span></span></span></div><br><div class=\"comment\" style=\"font-family: Verdana, Geneva, sans-serif; font-size: 9pt; max-width: 970px; overflow-wrap: anywhere; overflow: hidden;\"><span class=\"commtext c00\" style=\"color: rgb(0, 0, 0);\">Excel alternatives might be uncountable. Implementing spreadsheet basics is an advanced beginner exercise. But even Google’s billions only get it a distant second best because Microsoft is still working hard despite the lead. Sure Google and Apple can meet most needs most of the time. They’re good enough mainly because they are free beer. Not because they are open source. Obviously.</span></div></td></tr></tbody></table>
Can you verify if you also experience this?
Is your feature request related to a problem? Please describe.
I'm trying to use pyhtml2md as a more performant replacement for html2text. One current difference is that html2text will do a hard wrap at 78 (default) characters.
Describe the solution you'd like
It would be nice to be able to configure the line length limit and use it as a "hard stop" instead of having it somewhere between 80 and 100 characters.
Describe alternatives you've considered
I'm currently using Python's textwrap module, but it's slow and a bit hacky.
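For reference, the textwrap-based alternative might look roughly like the sketch below. It is an illustration under assumptions, not the reporter's actual code: hard_wrap is a hypothetical helper, paragraphs are assumed to be separated by blank lines, and the default width of 78 mirrors the html2text default mentioned above. Long words are split so that no line exceeds the limit, which also means long URLs get broken.

```python
import textwrap


def hard_wrap(markdown: str, width: int = 78) -> str:
    """Re-wrap each paragraph so no line is longer than `width` (hypothetical helper)."""
    wrapped = []
    for paragraph in markdown.split("\n\n"):
        # Undo any existing wrapping before applying the new limit.
        text = " ".join(paragraph.split())
        # break_long_words=True enforces the limit even for very long tokens.
        wrapped.append(textwrap.fill(text, width=width, break_long_words=True))
    return "\n\n".join(wrapped)


text = "word " * 30 + "\n\n" + "x" * 50
print(hard_wrap(text, width=20))
```

Running this over the whole converter output is a second pass over the text, which is part of why a built-in limit would be preferable.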
Describe the bug
When parsing text with an anchor tag that isn't closed, the Markdown link is never closed either, and line wrapping is affected as well:
To Reproduce
import pyhtml2md
html = """
<p>Some text<a href="http://example.com"/>the anchor should end but doesn't. A lot more text to demonstrate that the wrapping is also affected. Here it comes. Ready or not. </p>
"""
print(pyhtml2md.convert(html))
And the output is:
Some text[the anchor should end but doesn't. A lot more text to demonstrate that the wrapping is also affected. Here it comes. Ready or not.
Expected behavior
Firefox renders the HTML with the entire rest of the document inside the link. I think it makes more sense to stop at the end of the paragraph. So the output should look like:
Some text[the anchor should end but doesn't. A lot more text to
demonstrate that the wrapping is also affected. Here it comes. Ready or not.](http://example.com)
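The proposed rule (close any open link when its paragraph ends) can be sketched with Python's standard html.parser. This is only an illustration of the expected behavior, not how html2md works internally. AnchorClosingConverter and convert_links are hypothetical names, only <p> and <a> are handled, and no line wrapping is applied.

```python
from html.parser import HTMLParser


class AnchorClosingConverter(HTMLParser):
    """Toy converter (hypothetical): closes an open link at the end of its <p>."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.open_href = None  # href of the currently open <a>, if any

    def handle_startendtag(self, tag, attrs):
        # Browsers ignore the trailing slash on non-void elements such as <a>,
        # so <a href="..."/> is treated as an ordinary opening tag.
        self.handle_starttag(tag, attrs)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._close_link()  # links cannot nest
            self.open_href = dict(attrs).get("href") or ""
            self.out.append("[")

    def handle_endtag(self, tag):
        if tag == "a":
            self._close_link()
        elif tag == "p":
            self._close_link()  # the rule proposed above
            self.out.append("\n\n")

    def handle_data(self, data):
        self.out.append(data)

    def _close_link(self):
        if self.open_href is not None:
            self.out.append(f"]({self.open_href})")
            self.open_href = None


def convert_links(html: str) -> str:
    parser = AnchorClosingConverter()
    parser.feed(html)
    parser.close()
    return "".join(parser.out).strip()


print(convert_links('<p>Some text<a href="http://example.com"/>link text</p><p>next</p>'))
```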
Desktop (please complete the following information):
Ubuntu 23.10