warc-extractor
by Junqi Ma ([email protected]) and Tim Henderson ([email protected])
Passing "-n 30000" generates about 700 files, each larger than 300 KB.
Example:
./WarcExtractor -n 30000 --file crawl-file.warc.gz -o result-dir
TODO: add a command-line option to specify the size of the HTML file
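
The size filtering referred to above (keeping only HTML files larger than some threshold, such as 300 KB) could look roughly like the following. This is a minimal Python sketch, not WarcExtractor's actual implementation: the names iter_warc_records, large_html_payloads and min_bytes are hypothetical, and it assumes that each WARC "response" record holds an HTTP header block, a blank line, and then the HTML payload. The content-type check is deliberately crude.

```python
import gzip
import io


def iter_warc_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # blank separator lines between records
        if not line.startswith(b"WARC/"):
            raise ValueError("expected a WARC version line, got %r" % line)
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the WARC header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # The record body is exactly Content-Length bytes long.
        length = int(headers.get("content-length", "0"))
        yield headers, stream.read(length)


def large_html_payloads(stream, min_bytes):
    """Yield (uri, payload) for HTML responses of at least min_bytes bytes."""
    for headers, body in iter_warc_records(stream):
        if headers.get("warc-type") != "response":
            continue
        # Split the HTTP header block from the payload.
        http_head, sep, payload = body.partition(b"\r\n\r\n")
        if not sep:
            continue
        # Crude check: look for text/html anywhere in the HTTP headers.
        if b"text/html" not in http_head.lower():
            continue
        if len(payload) >= min_bytes:
            yield headers.get("warc-target-uri", ""), payload
```

To run it on a compressed crawl file, open the file with gzip.open("crawl-file.warc.gz", "rb") and pass the resulting stream to large_html_payloads together with a threshold such as 300 * 1024. Python's gzip module reads multi-member gzip files, which is how .warc.gz files are usually written.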