mdsplit's Introduction

mdsplit

mdsplit is a Python command-line tool that splits markdown files into chapters at a given heading level.

Each chapter (or subchapter) is written to its own file, which is named after the heading title. These files are written to subdirectories representing the document's structure. Optionally a table of contents (toc.md) can be created for each input file.

Note:

  • Code blocks (```) are detected, and headers inside them are ignored
  • The output is guaranteed to be identical to the input (except for the separation into multiple files, of course)
    • This means: no touching of whitespace and no changing of - to * in your lists, as some visual markdown editors tend to do
  • Text before the first heading is written to a file with the same name as the markdown file
  • Chapters with the same heading name are written to the same file.
  • Reading from stdin is supported
  • Can easily handle large files, e.g. a 1 GB file is split into 30k files in 35 seconds on my 2015 Thinkpad (with an SSD)
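
The behaviour described in the notes above can be illustrated with a minimal sketch. This is not mdsplit's actual implementation: the function name `split_chapters`, the regular expressions, and the flat (non-nested) output are assumptions made for illustration. It shows fence detection, the preamble file, and the merging of chapters that share a heading name, while leaving every line untouched.

```python
import re
from collections import defaultdict

# Assumed patterns for this sketch: ATX headings and ``` / ~~~ fences.
HEADING = re.compile(r"^(#{1,6})\s+(.*?)\s*$")
FENCE = re.compile(r"^\s{0,3}(```|~~~)")


def split_chapters(lines, max_level=1, preamble_name="preamble"):
    """Group lines by the most recent heading whose level is <= max_level.

    Lines are stored unmodified, so no whitespace or list markers change.
    """
    chapters = defaultdict(list)  # title -> lines, in order of first appearance
    current = preamble_name       # text before the first heading
    in_fence = False
    for line in lines:
        if FENCE.match(line):
            in_fence = not in_fence  # entering or leaving a code block
        match = None if in_fence else HEADING.match(line)
        if match and len(match.group(1)) <= max_level:
            current = match.group(2)  # same title -> same chapter
        chapters[current].append(line)
    return dict(chapters)
```

Unlike the real tool, this sketch keeps everything in memory and does not create subdirectories for nested heading levels.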

Limitations:

Similar projects:

You may also be interested in https://github.com/alandefreitas/mdsplit (C++-based).

Installation

Either use pip:

pip install mdsplit
mdsplit

Or simply download mdsplit.py and run it (it has no dependencies besides Python itself):

python3 mdsplit.py

Usage

Show documentation and supported arguments:

mdsplit --help
usage: mdsplit.py [-h] [-e ENCODING] [-l {1,2,3,4,5,6}] [-t] [-o OUTPUT] [-f] [-v] [input]

positional arguments:
  input                 path to input file/folder (omit or set to '-' to read from stdin)

options:
  -h, --help            show this help message and exit
  -e ENCODING, --encoding ENCODING
                        force a specific encoding, default: python's default platform encoding
  -l {1,2,3,4,5,6}, --max-level {1,2,3,4,5,6}
                        maximum heading level to split, default: 1
  -t, --table-of-contents
                        Generate a table of contents (one 'toc.md' per input file)
  -o OUTPUT, --output OUTPUT
                        path to output folder (must not exist)
  -f, --force           write into output folder even if it already exists
  -v, --verbose

Split a file at level 1 headings, e.g. # This Heading, and write results to an output folder based on the input name:

mdsplit in.md
%%{init: {'themeVariables': { 'fontFamily': 'Monospace', 'text-align': 'left'}}}%%
flowchart LR
    subgraph in.md
        SRC[# Heading 1<br>lorem ipsum<br><br># HeadingTwo<br>dolor sit amet<br><br>## Heading 2.1<br>consetetur sadipscing elitr]
    end
    SRC --> MDSPLIT(mdsplit in.md)
    MDSPLIT --> SPLIT_A
    MDSPLIT --> SPLIT_B
    subgraph in/HeadingTwo.md
        SPLIT_B[# HeadingTwo<br>dolor sit amet<br><br>## Heading 2.1<br>consetetur sadipscing elitr]
    end
    subgraph in/Heading 1.md
        SPLIT_A[# Heading 1<br>lorem ipsum<br><br>]
    end
    style SRC text-align:left
    style SPLIT_A text-align:left
    style SPLIT_B text-align:left
    style MDSPLIT fill:#000,color:#0F0

Split a file at level 1 and level 2 headings, e.g. # This Heading and ## That Heading, and write to a specific output directory:

mdsplit in.md --max-level 2 --output out
%%{init: {'themeVariables': { 'fontFamily': 'Monospace', 'text-align': 'left'}}}%%
flowchart LR
    subgraph in.md
        SRC[# Heading 1<br>lorem ipsum<br><br># HeadingTwo<br>dolor sit amet<br><br>## Heading 2.1<br>consetetur sadipscing elitr]
    end
    SRC --> MDSPLIT(mdsplit in.md -l 2 -o out)
    subgraph out/HeadingTwo/Heading 2.1.md
        SPLIT_C[## Heading 2.1<br>consetetur sadipscing elitr]
    end
    subgraph out/HeadingTwo.md
        SPLIT_B[# HeadingTwo<br>dolor sit amet<br><br>]
    end
    subgraph out/Heading 1.md
        SPLIT_A[# Heading 1<br>lorem ipsum<br><br>]
    end
    MDSPLIT --> SPLIT_A
    MDSPLIT --> SPLIT_B
    MDSPLIT --> SPLIT_C
    style SRC text-align:left
    style SPLIT_A text-align:left
    style SPLIT_B text-align:left
    style MDSPLIT fill:#000,color:#0F0

Split markdown from stdin:

cat in.md | mdsplit --output out

Development (Ubuntu 22.04)

Add the deadsnakes PPA and install additional python versions for testing

sudo add-apt-repository ppa:deadsnakes/ppa
sudo apt install python3.7 python3.7-distutils
...

Install poetry

Prepare virtual environment and download dependencies

poetry install

Run tests (for the default python version)

poetry run pytest

Run tests for all supported python versions

poetry run tox

Release new version

poetry build
poetry publish

mdsplit's People

Contributors

markusstraub, raprism

mdsplit's Issues

Gitlab Wiki Usage

  • File names should not contain spaces. This is fixed in 81d5925.
  • TODO: In the Wiki, a link written as [link](<basename.md>) does not get the proper MIME type assigned by the server; the .md extension needs to be removed. Together with the change above, links of the form [link](basename) are served correctly.
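
The link rewrite described in the TODO could look roughly like the following sketch. It is not part of mdsplit; the function name `strip_md_extension` and the regular expression are hypothetical, and links that contain a URL scheme are left alone on the assumption that only local wiki pages should be changed.

```python
import re

# Matches [label](target.md) and [label](<target.md>); assumed link shapes only.
MD_LINK = re.compile(r"\[([^\]]*)\]\(<?([^)>\s]+)\.md>?\)")


def strip_md_extension(text):
    """Rewrite local links so that they point to 'basename' instead of 'basename.md'."""
    def replace(match):
        label, target = match.group(1), match.group(2)
        if "://" in target:
            return match.group(0)  # external URL: keep unchanged
        return f"[{label}]({target})"
    return MD_LINK.sub(replace, text)
```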

Assignees: @prismv

Turn mdsplit into a Python library

The CLI interface of mdsplit is really neat, but for some use cases I'd really need a python library version of it:

  • rewrite relative image links in the split out markdown files so that they work in the new directory structure
  • clean up markdown text (background: I am working with many markdown files that have been converted from .docx via pandoc)
  • decide whether a split markdown file should be written out or merged with the next document (think of an h1-level headline immediately followed by an h2-level headline; currently the h1 headline would be put into a document of its own)

Thus the library version needs to represent the split markdown documents as objects and offer methods to access their contents as well as some merge logic.
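
To make the request more concrete, here is one possible shape for such objects. Nothing in this sketch exists in mdsplit: the `Chapter` class, its fields, and `merge_heading_only` are hypothetical names, and the merge rule (fold a chapter that contains only a heading into a directly following deeper chapter) is just one reading of the third bullet.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chapter:
    """Hypothetical in-memory representation of one split-out chapter."""
    level: int
    title: str
    lines: List[str] = field(default_factory=list)

    @property
    def text(self) -> str:
        return "".join(self.lines)

    def is_heading_only(self) -> bool:
        # True if nothing but blank lines follows the heading line.
        return all(not line.strip() for line in self.lines[1:])


def merge_heading_only(chapters):
    """Merge a heading-only chapter into the deeper chapter that follows it."""
    merged = []
    pending = None  # heading-only chapter waiting for its successor
    for chapter in chapters:
        if pending is not None:
            if chapter.level > pending.level:
                # Keep the parent's level and title, append the child's lines.
                chapter = Chapter(pending.level, pending.title,
                                  pending.lines + chapter.lines)
            else:
                merged.append(pending)
            pending = None
        if chapter.is_heading_only():
            pending = chapter
        else:
            merged.append(chapter)
    if pending is not None:
        merged.append(pending)
    return merged
```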

Assume utf-8 as default encoding

Hi,

first of all, thanks for this really useful script! Unfortunately, I got encoding errors when I ran it for the first time, as Python 3.8 under Windows assumes the cp1252 encoding on my computer.

  File "C:\dev\tmsspec-doc-py\mdsplit.py", line 341, in <module>
    main()
  File "C:\dev\tmsspec-doc-py\mdsplit.py", line 333, in main
    splitter.process()
  File "C:\dev\tmsspec-doc-py\mdsplit.py", line 147, in process
    self.process_file(self.in_path, self.out_path)
  File "C:\dev\tmsspec-doc-py\mdsplit.py", line 166, in process_file
    self.process_stream(stream, in_file_path.name, out_path)
  File "C:\dev\tmsspec-doc-py\mdsplit.py", line 88, in process_stream
    file.write(line)
  File "C:\Program Files\Python38\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ue60f' in position 4: character maps to <undefined>

Wouldn't it make sense for the script to assume utf-8 encoding by default, maybe overridable via a CLI parameter? I had to monkey-patch the open() calls in your script with encoding="utf-8" to make it work.

Thanks,
Franz
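
The failure and the workaround from this report can be reproduced with a small sketch. The file name and sample text are made up; the character '\ue60f' is taken from the error message above. The help output earlier in this document also lists an -e/--encoding option, which addresses the same problem from the command line.

```python
import tempfile
from pathlib import Path

text = "icon: \ue60f\n"  # private-use character from the error message

# cp1252 has no mapping for this character, which is what the traceback shows.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Passing the encoding explicitly makes the result independent of the platform default.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "chapter.md"  # hypothetical output file
    with open(path, "w", encoding="utf-8", newline="") as file:
        file.write(text)
    round_trip = path.read_text(encoding="utf-8")
```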

Correct internal cross-references when splitting the markdown file

It would be great if mdsplit could also fix markdown cross-references like [text](#target) (where target is a link to an internal heading) if link and target are no longer in the same file.

Same goes for relative image links in the input markdown that are no longer valid in the split output markdown.
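
A rough sketch of the requested link fix follows. It is not implemented in mdsplit; `rewrite_anchor_links` and the `anchor_files` mapping (anchor name to the output file that ends up containing the heading) are hypothetical, and building that mapping is left out. Paths are assumed to be relative to the output folder and to use forward slashes.

```python
import posixpath
import re

# Matches internal links of the form [text](#target).
ANCHOR_LINK = re.compile(r"\[([^\]]*)\]\(#([^)\s]+)\)")


def rewrite_anchor_links(text, current_file, anchor_files):
    """Point [text](#target) at the file that now contains the target heading."""
    start = posixpath.dirname(current_file) or "."

    def replace(match):
        label, anchor = match.group(1), match.group(2)
        target_file = anchor_files.get(anchor)
        if target_file is None or target_file == current_file:
            return match.group(0)  # unknown anchor or same file: keep as is
        relative = posixpath.relpath(target_file, start)
        return f"[{label}]({relative}#{anchor})"

    return ANCHOR_LINK.sub(replace, text)
```

File names that contain spaces would additionally need escaping or the angle-bracket link form, which this sketch does not handle.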

Truncate file path length

If a heading is very long, the software fails with an OSError on Linux.

Traceback (most recent call last):
  File "mdsplit", line 8, in
    sys.exit(main())
             ^^^^^^
  File "mdsplit.py", line 342, in main
    splitter.process()
  File "mdsplit.py", line 148, in process
    self.process_file(self.in_path, self.out_path)
  File "mdsplit.py", line 167, in process_file
    self.process_stream(stream, in_file_path.name, out_path)
  File "mdsplit.py", line 75, in process_stream
    if not chapter_path.exists():
           ^^^^^^^^^^^^^^^^^^^^^
  File "pathlib.py", line 1235, in exists
    self.stat()
  File "pathlib.py", line 1013, in stat
    return os.stat(self, follow_symlinks=follow_symlinks)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: [Errno 36] File name too long:
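
One way to avoid this error is to shorten the file name before using it. The sketch below is a suggestion only and not mdsplit code: `truncate_file_name` is a hypothetical helper, and the limit of 255 bytes per file name is an assumption that holds for common Linux filesystems such as ext4 but not necessarily everywhere.

```python
def truncate_file_name(title, suffix=".md", max_bytes=255):
    """Shorten a heading so that title + suffix fits into max_bytes (UTF-8)."""
    budget = max_bytes - len(suffix.encode("utf-8"))
    encoded = title.encode("utf-8")[:budget]
    # errors="ignore" drops a multi-byte character that was cut in half.
    return encoded.decode("utf-8", errors="ignore") + suffix
```

Truncation can map two different long headings to the same file name, so a real fix would also have to decide how to handle such collisions.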

More than 6 levels to split into

Is there a particular limitation on the number of levels to descend into? I have a file with 8 levels, and I would really like to split it for my Obsidian vault. Thanks.
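
As background, the CommonMark specification defines ATX headings with one to six # characters, so a line starting with seven or more is treated as ordinary text by conforming parsers. Whether this is the reason for the limit in mdsplit is not stated here. The sketch below uses a hypothetical helper, `heading_level`, to show how a heading detector could make that limit configurable.

```python
import re


def heading_level(line, max_hashes=6):
    """Return the heading level of a line, or None if it is not a heading."""
    match = re.match(r"^ {0,3}(#+)(?:[ \t]|$)", line.rstrip("\r\n"))
    if match is None:
        return None
    level = len(match.group(1))
    # More hashes than allowed: not a heading under the chosen limit.
    return level if level <= max_hashes else None
```

With max_hashes=8, a line such as "######## eight" would be accepted as a level 8 heading, although other markdown tools would still render it as a paragraph.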
