A Python utility that extracts content from URLs and writes it in flexible output formats such as Markdown and TXT. It simplifies the retrieval and formatting of web content.
- Extract content from a URL
- Output content in Markdown or TXT format
- Extract content from a list of URLs
- Output content in HTML format
- Output content in PDF format
- Output content in DOCX format
- Output content in JSON format
- Python 3.8+ (tested with 3.12)
pip install -r requirements.txt
Either import the module into your own Python project or use the command line interface.
python ./url_content_extractor/main.py --url https://www.google.com --output markdown.md
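To process several URLs with the command shown above, one option is a small wrapper script that calls the CLI once per URL. The following is only a sketch: it uses the script path and the `--url` and `--output` flags from the command above, while the helper names `build_command` and `extract_all` and the output file names are hypothetical and not part of this project.

```python
import subprocess
import sys

# Script path and flags are taken from the usage command in this README.
SCRIPT = "./url_content_extractor/main.py"


def build_command(url, output_path):
    """Return the argument list for one CLI invocation (hypothetical helper)."""
    return [sys.executable, SCRIPT, "--url", url, "--output", output_path]


def extract_all(jobs, run=subprocess.run):
    """Run the CLI once per (url, output_path) pair and return the exit codes.

    `run` defaults to subprocess.run; it can be swapped out for testing.
    """
    exit_codes = []
    for url, output_path in jobs:
        result = run(build_command(url, output_path), check=False)
        exit_codes.append(result.returncode)
    return exit_codes
```

Calling `extract_all([("https://www.google.com", "google.md")])` from the repository root would then run the CLI for each pair. A non-zero exit code in the returned list suggests that the corresponding run failed.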
Run the tests from the command line with pytest:
pytest
You can also run tests with DEBUG logging enabled:
pytest --log-cli-level=DEBUG