It's just a little code: xq in Python, plus parallel to keep plenty of cores busy without oversubscribing memory.
Also included is every page in June 2021 Wikipedia, in JSON, one page per line, each line an array of [page id, page title, parsed body text, gzip-compressed size].
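For anyone who wants to consume those lines, here is a minimal Python sketch of reading them. It is not the code that produced the data. It assumes each line is a four-element JSON array in the order listed above, and that the gzip-compressed size is the byte length of the gzip-compressed UTF-8 body text; neither the field types nor the exact compression settings are confirmed here. The names `Page`, `make_line`, and `read_pages` are hypothetical, chosen only for illustration.

```python
import gzip
import json
from typing import Iterable, Iterator, NamedTuple


class Page(NamedTuple):
    """One line of the dataset (field names and types are assumptions)."""
    page_id: int
    title: str
    text: str
    gzip_size: int


def make_line(page_id: int, title: str, text: str) -> str:
    """Build one line in the assumed layout: [id, title, text, gzip size].

    Assumption: the size is measured on the UTF-8 encoded body text.
    """
    gzip_size = len(gzip.compress(text.encode("utf-8")))
    return json.dumps([page_id, title, text, gzip_size], ensure_ascii=False)


def read_pages(lines: Iterable[str]) -> Iterator[Page]:
    """Parse lines lazily so the whole file never has to fit in memory."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        page_id, title, text, gzip_size = json.loads(line)
        yield Page(page_id, title, text, gzip_size)


if __name__ == "__main__":
    # Tiny made-up record, only to show the round trip.
    sample = make_line(1, "Example", "Some parsed body text.")
    for page in read_pages([sample]):
        print(page.page_id, page.title, len(page.text), page.gzip_size)
```

Because `read_pages` accepts any iterable of strings, an open file handle can be passed in directly and the pages stream through one at a time.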