
Ratarmount Logo

Random Access Tar Mount (Ratarmount)


Ratarmount collects all file positions inside a TAR so that it can easily jump to and read from any file without extracting it. It then mounts the TAR using fusepy for read access, just like archivemount. In contrast to libarchive, on which archivemount is based, random access and true seeking are supported. And in contrast to tarindexer, which also collects file positions for random access, ratarmount offers easy access via FUSE and supports compressed TARs.
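The core trick can be sketched with Python's standard tarfile module: record each member's data offset in a single pass, then serve reads with a plain seek. This is a toy illustration of the idea, not ratarmount's actual implementation:

```python
import io
import tarfile

# Build a small TAR in memory.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    for name, data in [("a.txt", b"first file"), ("b.txt", b"second file")]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# One sequential pass: record where each member's data starts.
buffer.seek(0)
index = {}
with tarfile.open(fileobj=buffer, mode="r") as tar:
    for member in tar:
        index[member.name] = (member.offset_data, member.size)

# Random access: seek straight to the data, no extraction, no re-scan.
offset, size = index["b.txt"]
buffer.seek(offset)
print(buffer.read(size))  # b'second file'
```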

Capabilities:

  • Highly Parallelized: By default, all cores are used for parallelized algorithms like for the gzip, bzip2, and xz decoders. This can yield huge speedups on most modern processors but requires more main memory. It can be controlled or completely turned off using the -P <cores> option.
  • Recursive Mounting: Ratarmount will also mount TARs inside TARs inside TARs, ... recursively into folders of the same name, which is useful for the 1.31TB ImageNet data set.
  • Mount Compressed Files: You may also mount files with one of the supported compression schemes. Even if these files do not contain a TAR, you can leverage ratarmount's true seeking capabilities when opening the mounted uncompressed view of such a file.
  • Read-Only Bind Mounting: Folders may be mounted read-only to other folders for use cases like merging a backup TAR with newer versions of those files residing in a normal folder.
  • Union Mounting: Multiple TARs, compressed files, and bind mounted folders can be mounted under the same mountpoint.
  • Write Overlay: A folder can be specified as write overlay. All changes below the mountpoint will be redirected to this folder and deletions are tracked so that all changes can be applied back to the archive.

TAR compressions supported for random access:

  • BZip2
  • GZip
  • XZ
  • Zstandard

Other supported archive formats:

  • Rar as provided by rarfile by Marko Kreen. See also the RAR 5.0 archive format.
  • Zip as provided by zipfile, which is distributed with Python itself. See also the ZIP File Format Specification.
  • Many Others as provided by libarchive via python-libarchive-c.
    • Formats with tests: 7z, ar, cab, compress, cpio, iso, lrzip, lzma, lz4, lzip, lzo, warc, xar.
    • Untested formats that might work or not: deb, grzip, rpm, uuencoding.
    • Beware that libarchive has no performant random access to files or to file contents. In general, seeking inside or opening a file requires parsing the archive from the beginning. If you have a performance-critical use case for a format only supported via libarchive, then please open a feature request for a faster, customized archive format implementation. The hope would be to add suitable stream compressors, such as "short"-distance LZ-based compressions, to rapidgzip.

Table of Contents

  1. Installation
    1. Installation via AppImage
    2. Installation via Package Manager
      1. Arch Linux
    3. System Dependencies for PIP Installation (Rarely Necessary)
    4. PIP Package Installation
  2. Benchmarks
  3. The Problem
  4. The Solution
  5. Usage
    1. Metadata Index Cache
    2. Bind Mounting
    3. Union Mounting
    4. File versions
    5. Compressed non-TAR files
    6. Xz and Zst Files
    7. As a Library

Installation

You can install ratarmount either by simply downloading the AppImage or via pip. The latter might require installing additional dependencies.

pip install ratarmount

Installation via AppImage

The AppImage files are attached under "Assets" on the releases page. They require no installation and can be simply executed like a portable executable. If you want to install it, you can simply copy it into any of the folders listed in your PATH.

appImageName=ratarmount-0.15.0-x86_64.AppImage
wget "https://github.com/mxmlnkn/ratarmount/releases/download/v0.15.0/$appImageName"
chmod u+x -- "$appImageName"
./"$appImageName" --help  # Simple test run
sudo cp -- "$appImageName" /usr/local/bin/ratarmount  # Example installation

Installation via Package Manager

Packaging status

Arch Linux

Arch Linux's AUR offers ratarmount as stable and development packages. Use an AUR helper, like yay or paru, to install one of them:

# stable version
paru -Syu ratarmount
# development version
paru -Syu ratarmount-git

Conda

conda install -c conda-forge ratarmount

System Dependencies for PIP Installation (Rarely Necessary)

Python 3.6+, preferably pip 19.0+, FUSE, and sqlite3 are required. These should be preinstalled on most systems.

On Debian-like systems like Ubuntu, you can install/update all dependencies using:

sudo apt install python3 python3-pip fuse sqlite3 unar libarchive13 lzop

On macOS, you have to install macFUSE with:

brew install macfuse

If you are installing on a system for which there exists no manylinux wheel, then you'll have to install further dependencies that are required to build some of the Python packages that ratarmount depends on from source:

sudo apt install \
    python3 python3-pip fuse \
    build-essential software-properties-common \
    zlib1g-dev libzstd-dev liblzma-dev cffi libarchive-dev

PIP Package Installation

Then, you can simply install ratarmount from PyPI:

pip install ratarmount

Or, if you want to test the latest version:

python3 -m pip install --user --force-reinstall \
    'git+https://github.com/mxmlnkn/ratarmount.git@develop#egg=ratarmountcore&subdirectory=core' \
    'git+https://github.com/mxmlnkn/ratarmount.git@develop#egg=ratarmount'

If there are troubles with the compression backend dependencies, you can try pip's --no-deps argument. Ratarmount will work without the compression backends. The hard requirements are fusepy and, for Python versions older than 3.7.0, dataclasses.

Benchmarks

Benchmark comparison between ratarmount, archivemount, and fuse-archive

  • Not shown in the benchmarks, but ratarmount can mount files with preexisting index sidecar files in under a second, making it vastly more efficient than archivemount for every subsequent mount. Archivemount also has no progress indicator, making it very unlikely that a user will wait hours for the mounting to finish. Fuse-archive, an iteration on archivemount, has the --asyncprogress option to give a progress indicator using the timestamp of a dummy file. Note that fuse-archive daemonizes instantly, but without --asyncprogress the mount point will not be usable for a long time and everything trying to use it will hang until mounting has finished!
  • Getting file contents from a mounted archive is generally vastly faster than with archivemount and fuse-archive, and the access time does not increase with the archive size or file count, resulting in the largest observed speedups being around 5 orders of magnitude!
  • Memory consumption of ratarmount is mostly less than archivemount's and mostly does not grow with the archive size. Not shown in the plots, but the memory usage will be much smaller when not specifying -P 0, i.e., when not parallelizing. The gzip backend grows linearly with the archive size because the data required for seeking is thousands of times larger than the simple two 64-bit offsets required for bzip2. The memory usage of the zstd backend only seems humongous because it uses mmap to open the file. The memory used by mmap is not even counted as used memory when showing the memory usage with free or htop.
  • For empty files, mounting with ratarmount and archivemount does not seem to be bounded by decompression or I/O bandwidths but instead by the algorithm for creating the internal file index. This algorithm scales linearly for ratarmount and fuse-archive but seems to scale worse than quadratically for archives containing more than 1M files when using archivemount. Ratarmount 0.10.0 improves upon earlier versions by batching SQLite insertions.
  • Mounting bzip2 and xz archives has actually become faster than with archivemount and fuse-archive when using ratarmount -P 0 on most modern processors, because ratarmount uses more than one core for decoding those compressions. indexed_bzip2 supports block-parallel decoding since version 1.2.0.
  • Gzip-compressed TAR files mount about two times slower than with archivemount during first-time mounting. It is not totally clear to me why, because streaming the file contents after the archive has been mounted is comparably fast; see the next benchmarks below. In order to have superior speeds for both of these, I am experimenting with a parallelized gzip decompressor like the prototype pugz offers for non-binary files.
  • For the other cases, mounting times are roughly the same as with archivemount for archives with 2M files in an approximately 100GB archive.
  • Getting a lot of metadata for archive contents, as demonstrated by calling find on the mount point, is an order of magnitude slower compared to archivemount. Because the C-based fuse-archive is even slower than ratarmount, the difference is very likely that archivemount uses the low-level FUSE interface while ratarmount and fuse-archive use the high-level FUSE interface.

Reading bandwidth benchmark comparison between ratarmount, archivemount, and fuse-archive

  • Reading files from the archive with archivemount scales quadratically instead of linearly. This is because archivemount starts reading from the beginning of the archive for each requested I/O block. The block size depends on the program or operating system and should be on the order of 4 kiB. This means the scaling is O( (sizeOfFileToBeCopiedFromArchive / readChunkSize)^2 ). Both ratarmount and fuse-archive avoid this behavior. Because of this quadratic scaling, the average bandwidth with archivemount appears to decrease with the file size.
  • Reading bz2 and xz files is an order of magnitude faster thanks to parallelization, as tested on my 12-core/24-thread Ryzen 3900X.
  • Memory is bounded in these tests for all programs, but ratarmount is a lot more lax with memory because it uses a Python stack and because it needs to hold caches for a constant number of blocks for parallel decoding of bzip2 and xz files. The zstd backend in ratarmount looks unbounded because it uses mmap, whose memory will be freed automatically when the memory limit is reached.
  • The peak for the xz decoder reading speeds happens because some blocks will be cached when loading the index, which is not included in the benchmark for technical reasons. The value for the 1 GiB file size is more realistic.
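The quadratic read cost described in the first bullet can be made concrete with a small back-of-the-envelope model (hypothetical sizes, stdlib only):

```python
def bytes_decompressed(file_size, chunk_size, restart_from_start):
    """Total bytes the decompressor must process to serve every read chunk."""
    total = 0
    for offset in range(0, file_size, chunk_size):
        chunk = min(chunk_size, file_size - offset)
        # archivemount re-decompresses from the archive start for every chunk;
        # ratarmount and fuse-archive continue from the previous position.
        total += offset + chunk if restart_from_start else chunk
    return total

MiB = 1024 * 1024
size, chunk = 100 * MiB, 4 * 1024  # a 100 MiB file read in 4 kiB chunks
ratio = bytes_decompressed(size, chunk, True) / bytes_decompressed(size, chunk, False)
print(ratio)  # → 12800.5, i.e., (numberOfChunks + 1) / 2 times more work
```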

Further benchmarks can be viewed here.

The Problem

You downloaded a large TAR file from the internet, for example the 1.31TB large ImageNet, and you now want to use it but lack the space, time, or a file system fast enough to extract all the 14.2 million image files.

Existing Partial Solutions

Archivemount

Archivemount, in version 0.8.7, seems to have large performance issues with archives containing many files and with large archives, for both mounting and file access. A more in-depth comparison benchmark can be found here.

  • Mounting the 6.5GB ImageNet Large-Scale Visual Recognition Challenge 2012 validation data set, and then testing the speed with: time cat mounted/ILSVRC2012_val_00049975.JPEG | wc -c takes 250ms for archivemount and 2ms for ratarmount.
  • Trying to mount the 150GB ILSVRC object localization data set containing 2 million images was given up upon after 2 hours. Ratarmount takes ~15min to create a ~150MB index and <1ms for opening an already created index (SQLite database) and mounting the TAR. In contrast, archivemount will take the same amount of time even for subsequent mounts.
  • Does not support recursive mounting, although you could write a script to stack archivemount on top of archivemount for all contained TAR files.

Tarindexer

Tarindexer is a command line tool written in Python which can create index files and then use them to quickly extract single files from the TAR. However, it also has some caveats which ratarmount tries to solve:

  • It only works with single files, meaning it would be necessary to loop over the extract call. But this would require loading the possibly quite large TAR index file into memory each time. For ImageNet, for example, the resulting index file is hundreds of MB large. Also, extracting directories will be a hassle.
  • It's difficult to integrate tarindexer into other production environments. Ratarmount instead uses FUSE to mount the TAR as a folder readable by any other programs requiring access to the contained data.
  • Can't handle TARs recursively. In order to extract files inside a TAR which itself is inside a TAR, the packed TAR first needs to be extracted.

TAR Browser

I didn't find out about TAR Browser before I finished the ratarmount script. That's also one of its cons:

  • Hard to find. I don't seem to be the only one who has trouble finding it, as it has one star on GitHub after 7 years, compared to 45 stars for tarindexer after roughly the same amount of time.
  • Hassle to set up. Needs compilation, and I gave up when I was instructed to set up a MySQL database for it to use. Confusingly, the setup instructions are not on its GitHub page but here.
  • Doesn't seem to support recursive TAR mounting. I didn't test it because of the MySQL dependency, but the code does not seem to have logic for recursive mounting.
  • Xz compression is also only block or frame based, i.e., it only works faster with files created by pixz or pxz.

Pros:

  • supports bz2- and xz-compressed TAR archives

The Solution

Ratarmount creates an index file with file names, ownership, permission flags, and offset information. This sidecar is stored at the TAR file's location or in ~/.ratarmount/. Ratarmount can load that index file in under a second if it exists and then offers FUSE mount integration for easy access to the files inside the archive.
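A minimal sketch of that sidecar idea, using a SQLite table as the index (the real schema and column set differ; this is illustrative only):

```python
import io
import sqlite3
import tarfile

# Build a tiny TAR in memory.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode="w") as tar:
    info = tarfile.TarInfo("hello.txt")
    data = b"Hello World!"
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

# Persist name, mode, and offset information in SQLite, roughly like the sidecar.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE files (path TEXT PRIMARY KEY, offset INTEGER, size INTEGER, mode INTEGER)")
buffer.seek(0)
with tarfile.open(fileobj=buffer, mode="r") as tar:
    for m in tar:
        db.execute("INSERT INTO files VALUES (?, ?, ?, ?)", (m.name, m.offset_data, m.size, m.mode))

# A file lookup is now a single indexed query plus one seek.
offset, size = db.execute("SELECT offset, size FROM files WHERE path = ?", ("hello.txt",)).fetchone()
buffer.seek(offset)
print(buffer.read(size))  # b'Hello World!'
```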

Here is a more recent test for version 0.2.0 with the new default SQLite backend:

  • TAR size: 124GB
  • Contains TARs: yes
  • Files in TAR: 1000
  • Files in TAR (including recursively in contained TARs): 1.26 million
  • Index creation (first mounting): 15m 39s
  • Index size: 146MB
  • Index loading (subsequent mounting): 0.000s
  • Reading a 64kB file: ~4ms
  • Running 'find mountPoint -type f | wc -l' (1.26M stat calls): 1m 50s

The reading time for a small file simply verifies that random access via file seeking is working. The difference between the first read and subsequent reads is not because of ratarmount but because of operating system and file system caches.

Older test with 1.31 TB Imagenet (Fall 2011 release)

The test with the first version of ratarmount (50e8dbb), which used the now-removed pickle backend for serializing the metadata index, for the ImageNet data set:

  • TAR size: 1.31TB
  • Contains TARs: yes
  • Files in TAR: ~26 000
  • Files in TAR (including recursively in contained TARs): 14.2 million
  • Index creation (first mounting): 4 hours
  • Index size: 1GB
  • Index loading (subsequent mounting): 80s
  • Reading a 40kB file: 100ms (first time) and 4ms (subsequent times)

Index loading took a relatively slow 80s because of the pickle backend, which has since been replaced with SQLite and should now take less than a second.

Usage

Command Line Options

See ratarmount --help or here.

Metadata Index Cache

In order to reduce the mounting time, the created index for random access to files inside the TAR will be saved to one of these locations. These locations are checked in order, and the first that works sufficiently will be used. This is the default location order:

  1. <path to TAR>.index.sqlite
  2. ~/.ratarmount/<path to TAR: '/' -> '_'>.index.sqlite, e.g., ~/.ratarmount/_media_cdrom_programm.tar.index.sqlite

This list of fallback folders can be overwritten using the --index-folders option. Furthermore, an explicitly named index file may be specified using the --index-file option. If --index-file is used, then the fallback folders, including the default ones, will be ignored!
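The second fallback's name mapping can be sketched as follows (a hypothetical helper, not ratarmount's exact code):

```python
import os

def fallback_index_path(tar_path, index_folder="~/.ratarmount"):
    """Mirror the documented mapping: '/' in the TAR's absolute path becomes '_'."""
    flattened = os.path.abspath(tar_path).replace("/", "_")
    return os.path.join(os.path.expanduser(index_folder), flattened + ".index.sqlite")

print(fallback_index_path("/media/cdrom/programm.tar"))
# e.g. /home/user/.ratarmount/_media_cdrom_programm.tar.index.sqlite
```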

Bind Mounting

The mount sources can be TARs and/or folders. Because of that, ratarmount can also be used to bind mount folders read-only to another path similar to bindfs and mount --bind. So, for:

ratarmount folder mountpoint

all files in folder will now be visible in mountpoint.

Union Mounting

If multiple mount sources are specified, the sources on the right side will be added to or update existing files from a mount source left of it. For example:

ratarmount folder1 folder2 mountpoint

will make the files from both folder1 and folder2 visible in mountpoint. If a file exists in multiple mount sources, then the file from the rightmost mount source will be used, which in the above example would be folder2.
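The rightmost-wins lookup can be sketched like this (mount sources modeled as plain dicts for illustration; not ratarmount's actual code):

```python
def resolve(path, mount_sources):
    """Return the entry for path, letting the rightmost mount source win."""
    for source in reversed(mount_sources):
        if path in source:
            return source[path]
    raise FileNotFoundError(path)

# Hypothetical sources modeled as dicts of path -> content.
folder1 = {"a.txt": "from folder1", "b.txt": "also folder1"}
folder2 = {"b.txt": "from folder2"}

print(resolve("a.txt", [folder1, folder2]))  # from folder1 (only source containing it)
print(resolve("b.txt", [folder1, folder2]))  # from folder2 (rightmost wins)
```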

If you want to update / overwrite a folder with the contents of a given TAR, you can specify the folder both as a mount source and as the mount point:

ratarmount folder file.tar folder

The FUSE option -o nonempty will be automatically added if such a usage is detected. If you instead want to update a TAR with a folder, you only have to swap the two mount sources:

ratarmount file.tar folder folder

File versions

If a file exists multiple times in a TAR or in multiple mount sources, then the hidden versions can be accessed through special .versions folders. For example, consider:

ratarmount folder updated.tar mountpoint

and the file foo exists both in the folder and as two different versions in updated.tar. Then, you can list all three versions using:

ls -la mountpoint/foo.versions/
    dr-xr-xr-x 2 user group     0 Apr 25 21:41 .
    dr-x------ 2 user group 10240 Apr 26 15:59 ..
    -r-x------ 2 user group   123 Apr 25 21:41 1
    -r-x------ 2 user group   256 Apr 25 21:53 2
    -r-x------ 2 user group  1024 Apr 25 22:13 3

In this example, the oldest version has only 123 bytes while the newest version, which is shown by default, has 1024 bytes. So, in order to look at the oldest version, you can simply do:

cat mountpoint/foo.versions/1

Note that these version numbers are the same as when used with tar's --occurrence=N option.
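The resolution rule can be sketched as follows (a hypothetical model, not ratarmount's actual code; versions are 1-based and ordered oldest first, matching --occurrence=N):

```python
def lookup(path, entries):
    """Resolve 'foo' to its newest entry and 'foo.versions/N' to the N-th version."""
    if ".versions/" in path:
        name, _, n = path.partition(".versions/")
        return entries[name][int(n) - 1]
    return entries[path][-1]

# Hypothetical history: 'foo' exists three times across the mount sources.
entries = {"foo": ["123-byte version", "256-byte version", "1024-byte version"]}
print(lookup("foo", entries))             # the newest version is shown by default
print(lookup("foo.versions/1", entries))  # the oldest version
```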

Prefix Removal

Use ratarmount -o modules=subdir,subdir=<prefix> to remove path prefixes using the FUSE subdir module. Because it is a standard FUSE feature, the -o ... argument should also work for other FUSE applications.

When mounting an archive created with absolute paths, e.g., tar -cPf logs.tar /var/log/apt/history.log, you would see the whole var/log/apt hierarchy under the mount point. To avoid that, specified prefixes can be stripped from the paths so that the mount point directly contains history.log. Use ratarmount -o modules=subdir,subdir=/var/log/apt/ to do so. The specified path to the folder inside the TAR will be mounted to the root, i.e., the mount point.

Compressed non-TAR files

If you want a compressed file not containing a TAR, e.g., foo.bz2, then you can also use ratarmount for that. The uncompressed view will then be mounted to <mountpoint>/foo and you will be able to leverage ratarmount's seeking capabilities when opening that file.

Xz and Zst Files

In contrast to bzip2 and gzip compressed files, true seeking in xz and zst files is only possible at block or frame boundaries. This wouldn't be noteworthy if the standard compressors for xz and zstd did not create unsuited files by default. Even though both file formats support multiple frames, and xz even contains a frame table at the end for easy seeking, both compressors write only a single frame and/or block, making this feature unusable. In order to generate truly seekable compressed files, you'll have to use pixz for xz files. For zstd-compressed files, you can try t2sz. The standard zstd tool does not support setting smaller block sizes yet, although an issue for it exists. Alternatively, you can simply split the original file into parts, compress those parts, and then concatenate them to get a suitable multiframe zst file. Here is a bash function which can be used for that:

Bash script: createMultiFrameZstd
createMultiFrameZstd()
(
    # Detect being piped into
    if [ -t 0 ]; then
        file=$1
        frameSize=$2
        if [[ ! -f "$file" ]]; then echo "Could not find file '$file'." 1>&2; return 1; fi
        fileSize=$( stat -c %s -- "$file" )
    else
        if [ -t 1 ]; then echo 'You should pipe the output to somewhere!' 1>&2; return 1; fi
        echo 'Will compress from stdin...' 1>&2
        frameSize=$1
    fi
    if [[ ! $frameSize =~ ^[0-9]+$ ]]; then
        echo "Frame size '$frameSize' is not a valid number." 1>&2
        return 1
    fi

    # Create a temporary file. I avoid simply piping to zstd
    # because it wouldn't store the uncompressed size.
    if [[ -d /dev/shm ]]; then frameFile=$( mktemp --tmpdir=/dev/shm ); fi
    if [[ -z $frameFile ]]; then frameFile=$( mktemp ); fi
    if [[ -z $frameFile ]]; then
        echo "Could not create a temporary file for the frames." 1>&2
        return 1
    fi

    if [ -t 0 ]; then
        true > "$file.zst"
        for (( offset = 0; offset < fileSize; offset += frameSize )); do
            dd if="$file" of="$frameFile" bs=$(( 1024*1024 )) \
               iflag=skip_bytes,count_bytes skip="$offset" count="$frameSize" 2>/dev/null
            zstd -c -q -- "$frameFile" >> "$file.zst"
        done
    else
        while true; do
            dd of="$frameFile" bs=$(( 1024*1024 )) \
               iflag=count_bytes count="$frameSize" 2>/dev/null
            # pipe is finished when reading it yields no further data
            if [[ ! -s "$frameFile" ]]; then break; fi
            zstd -c -q -- "$frameFile"
        done
    fi

    'rm' -f -- "$frameFile"
)

In order to compress a file named foo into a multiframe zst file called foo.zst, which contains frames of 4 MiB of uncompressed data, you would call it like this:

createMultiFrameZstd foo  $(( 4*1024*1024 ))

It also works when being piped into, which can be useful for recompressing files to avoid having to decompress them to disk first.

lbzip2 -cd well-compressed-file.bz2 | createMultiFrameZstd $(( 4*1024*1024 )) > recompressed.zst
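The split-compress-concatenate approach works because the container format treats a concatenation of independently compressed frames as one valid stream. The same property can be demonstrated with gzip from the Python standard library (illustrative only; zstd frames behave analogously, and this does not replace the bash function above):

```python
import gzip

data = bytes(range(256)) * 1024  # 256 KiB of sample data
frame_size = 64 * 1024           # compress in 64 KiB frames

# Compress fixed-size chunks independently and concatenate the members.
frames = [
    gzip.compress(data[offset:offset + frame_size])
    for offset in range(0, len(data), frame_size)
]
compressed = b"".join(frames)

# The multi-member stream still decompresses to the original data, and a
# reader that indexed the frame boundaries can start decompressing mid-stream:
assert gzip.decompress(compressed) == data
assert gzip.decompress(frames[2]) == data[2 * frame_size:3 * frame_size]
```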

Writable Mounting

The --write-overlay <folder> option can be used to create a writable mount point. The original archive will not be modified.

  • File creations will create these files in the specified overlay folder.
  • File deletions and renames will be registered in a database that also resides in the overlay folder.
  • File modifications will copy the file from the archive into the overlay folder before applying the modification.

This overlay folder can be stored alongside the archive or it can be deleted after unmounting the archive. This is useful when building the executable from a source tarball without extracting. After installation, the intermediary build files residing in the overlay folder can be safely removed.

If you want to apply the modifications to the original archive, then the --commit-overlay option can be added to the original ratarmount call.

Here is an example for applying modifications to a writable mount and then committing those modifications back to the archive:

  1. Mount it with a write overlay and add new files. The original archive is not modified.

    ratarmount --write-overlay example-overlay example.tar example-mount-point
    echo "Hello World" > example-mount-point/new-file.txt
  2. Unmount. Changes persist solely in the overlay folder.

    fusermount -u example-mount-point
  3. Commit changes to the original archive.

    ratarmount --commit-overlay --write-overlay example-overlay example.tar example-mount-point

    Output:

    To commit the overlay folder to the archive, these commands have to be executed:
    
        tar --delete --null --verbatim-files-from --files-from='/tmp/tmp_ajfo8wf/deletions.lst' \
            --file 'example.tar' 2>&1 |
           sed '/^tar: Exiting with failure/d; /^tar.*Not found in archive/d'
        tar --append -C 'example-overlay' --null --verbatim-files-from --files-from='/tmp/tmp_ajfo8wf/append.lst' --file 'example.tar'
    
    Committing is an experimental feature!
    Please confirm by entering "commit". Any other input will cancel.
    > 
    Committed successfully. You can now remove the overlay folder at example-overlay.
  4. Verify the modifications to the original archive.

    tar -tvlf example.tar

    Output:

    -rw-rw-r-- user/user 652817 2022-08-08 10:44 example.txt
    -rw-rw-r-- user/user     12 2023-02-16 09:49 new-file.txt
    
  5. Remove the obsolete write overlay folder.

    rm -r example-overlay

As a Library

Ratarmount can also be used as a library. Using ratarmountcore, files inside archives can be accessed directly from Python code without requiring FUSE. For a more detailed description, see the ratarmountcore readme here.

ratarmount's People

Contributors

cphyc, epicfaace, martinellimarco, mxmlnkn, peteruhrig, rizzel, rmmh, rubenkelevra, shawwn


ratarmount's Issues

dataclasses missing in the deps?

Just FYI to @mxmlnkn

I was noticing this error message after upgrading to the latest version (0.8.0) on Ubuntu 18.04 (using pip3)

Jul 11 16:44:02 wrangler2 automount[25310]: >>   File "/usr/local/bin/ratarmount", line 7, in <module>
Jul 11 16:44:02 wrangler2 automount[25310]: >>     from ratarmount import cli
Jul 11 16:44:02 wrangler2 automount[25310]: >>   File "/usr/local/lib/python3.6/dist-packages/ratarmount.py", line 21, in <module>
Jul 11 16:44:02 wrangler2 automount[25310]: >>     from dataclasses import dataclass
Jul 11 16:44:02 wrangler2 automount[25310]: >> ModuleNotFoundError: No module named 'dataclasses'

I just manually ran pip3 install dataclasses, which fixed the issue.

Auto-mount like AVFS (or better) for transparent navigation

Basically, I expected something AVFS-esque but couldn't wrap my head around how to achieve this with ratarmount.
What AVFS currently does:

  • mount whole "/" to ~/.avfs as FUSE
  • when accessing archives under ~/.avfs/ (with # appended to path) it auto-mounts them to these paths e.g. ~/.avfs/tmp/bash-5.0.tar.gz#/

Improvement for the above workflow would be to auto-unmount these bind mounts too, e.g., when lsof shows that no program is using files under the mountpoint (though a better detection method would be desirable).

Primary use case: integration with file managers, to jump into a VFS folder when trying to open/preview an archive.
E.g. ranger/ranger#456 (comment)

Creating index with many recursive TARs inside an xz compressed TAR is 100x slower than bz2!

The lzmaffi module provides seeking support in multi-block xz files as created with pixz; see also #42. And in small unit-like tests, it really does provide true seeking capabilities. However, I noticed that the test for tests/2k-recursive-tars.tar.xz is roughly 100x slower compared to tests/2k-recursive-tars.tar.bz2! This is the only test where this difference is so glaring because it contains recursive TARs, and after each recursive TAR, a backwards seek has to be applied in order to resume reading the outer TAR. For some reason, lzmaffi.seek seems to have performance problems. Even if it implements true seeking to a block under the hood, there might be some constant overhead cost. The file also has two problems:

  1. It is highly compressed, with 20MiB compressed to 16kiB for a compression ratio of roughly 1000!
  2. The xz file only has 3 blocks while the bz2 file has 24 blocks. That might slow down "true" seeking to an arbitrary point by a factor of 8 compared to bz2. But there still is a factor of 12 missing for the observed slowdown! Also, simple decoding was found to be twice as fast as for bz2, which means there effectively is even a factor of 24 that can't be explained.

As an alternative to fixing the problem in lzmaffi, I could try to reduce seeks during recursive indexing in ratarmount, e.g., by:

  1. Jumping to the next TAR block after analyzing the recursive TAR, effectively resulting in zero backward seeks. Tarfile might not have an API allowing me to do this, but I could make use of StenciledFile again to force it to support this.
  2. First analyzing the outer TAR and only then mounting the recursive TARs in order. This would effectively reduce the backward seeks to the maximum recursion level. A nice side effect would be that this solution could avoid recursion in ratarmount itself.

Here are some notes and benchmarks I made to try and find the problem:

Seemingly affected tests:

  • tests/gnu-sparse-files.tar
  • tests/2k-recursive-tars.tar.bz2

Reproduce problem:

bzip2 -kd tests/2k-recursive-tars.tar.bz2
xz -fk tests/2k-recursive-tars.tar
pixz -k tests/2k-recursive-tars.{tar,tpxz}

indexed_bzip2/tools/blockfinder tests/2k-recursive-tars.tar.bz2
    Block offsets  :
    4 B 0 b -> magic bytes: 0x314159265359
    590 B 0 b -> magic bytes: 0x314159265359
    1205 B 0 b -> magic bytes: 0x314159265359
    1796 B 0 b -> magic bytes: 0x314159265359
    2360 B 0 b -> magic bytes: 0x314159265359
    2897 B 0 b -> magic bytes: 0x314159265359
    3441 B 0 b -> magic bytes: 0x314159265359
    3997 B 0 b -> magic bytes: 0x314159265359
    4545 B 0 b -> magic bytes: 0x314159265359
    5169 B 0 b -> magic bytes: 0x314159265359
    5757 B 0 b -> magic bytes: 0x314159265359
    6313 B 0 b -> magic bytes: 0x314159265359
    6863 B 0 b -> magic bytes: 0x314159265359
    7441 B 0 b -> magic bytes: 0x314159265359
    8034 B 0 b -> magic bytes: 0x314159265359
    8584 B 0 b -> magic bytes: 0x314159265359
    9127 B 0 b -> magic bytes: 0x314159265359
    9688 B 0 b -> magic bytes: 0x314159265359
    10299 B 0 b -> magic bytes: 0x314159265359
    10834 B 0 b -> magic bytes: 0x314159265359
    11395 B 0 b -> magic bytes: 0x314159265359
    11963 B 0 b -> magic bytes: 0x314159265359
    12624 B 0 b -> magic bytes: 0x314159265359
    13174 B 0 b -> magic bytes: 0x314159265359
    Found 24 blocks

xz -l tests/2k-recursive-tars.*xz
    Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
        1       1     15.8 KiB     20.5 MiB  0.001  CRC64   tests/2k-recursive-tars.tar.xz
        1       3     20.2 KiB     20.6 MiB  0.001  CRC32   tests/2k-recursive-tars.tpxz

./ratarmount.py -cr tests/2k-recursive-tars.tar.bz2 bibi
    Creating offset dictionary for ratarmount/tests/2k-recursive-tars.tar.bz2 ...
    Creating new SQLite index database at ratarmount/tests/2k-recursive-tars.tar.bz2.index.sqlite
    Creating offset dictionary for mimi/00001.tar ...
    Creating offset dictionary for mimi/00001.tar took 0.00s
    [...]
    Creating offset dictionary for mimi/02000.tar ...
    Creating offset dictionary for mimi/02000.tar took 0.00s
    Creating offset dictionary for ratarmount/tests/2k-recursive-tars.tar.bz2 took 0.53s
    Writing out TAR index to ratarmount/tests/2k-recursive-tars.tar.bz2.index.sqlite took 0s and is sized 589824 B

./ratarmount.py -cr tests/2k-recursive-tars.tar.xz mimi
    [Warning] The specified file 'ratarmount/tests/2k-recursive-tars.tar.xz'
    [Warning] is compressed using xz but only contains one xz block. This makes it
    [Warning] impossible to use true seeking! Please (re)compress your TAR using pixz
    [Warning] (see https://github.com/vasi/pixz) in order for ratarmount to be able
    [Warning] to do fast seeking to requested files.
    [Warning] As it is, each file access will decompress the whole TAR from the beginning!

    Creating offset dictionary for ratarmount/tests/2k-recursive-tars.tar.xz ...
    Creating new SQLite index database at ratarmount/tests/2k-recursive-tars.tar.xz.index.sqlite
    Creating offset dictionary for mimi/00001.tar ...
    Creating offset dictionary for mimi/00001.tar took 0.00s
    Creating offset dictionary for mimi/00002.tar ...
    Creating offset dictionary for mimi/00002.tar took 0.00s
    [...]
    Creating offset dictionary for mimi/01999.tar ...
    Creating offset dictionary for mimi/01999.tar took 0.00s
    Creating offset dictionary for mimi/02000.tar ...
    Creating offset dictionary for mimi/02000.tar took 0.00s
    Creating offset dictionary for ratarmount/tests/2k-recursive-tars.tar.xz took 104.80s
    Writing out TAR index to ratarmount/tests/2k-recursive-tars.tar.xz.index.sqlite took 0s and is sized 589824 B

./ratarmount.py -cr tests/2k-recursive-tars.tpxz pipi
    Creating offset dictionary for ratarmount/tests/2k-recursive-tars.tpxz ...
    Creating new SQLite index database at ratarmount/tests/2k-recursive-tars.tpxz.index.sqlite
    Creating offset dictionary for mimi/00001.tar ...
    Creating offset dictionary for mimi/00001.tar took 0.00s
    Creating offset dictionary for mimi/00002.tar ...
    Creating offset dictionary for mimi/00002.tar took 0.00s
    [...]
    Creating offset dictionary for mimi/02000.tar ...
    Creating offset dictionary for mimi/02000.tar took 0.00s
    Creating offset dictionary for ratarmount/tests/2k-recursive-tars.tpxz took 58.66s
    Writing out TAR index to ratarmount/tests/2k-recursive-tars.tpxz.index.sqlite took 0s and is sized 589824 B


time python3 -c 'import lzmaffi, sys; print( len( lzmaffi.open( sys.argv[1] ).read() ) );' tests/2k-recursive-tars.tar.xz
    21514240

    real	0m0.129s
    user	0m0.087s
    sys	0m0.038s

time python3 -c 'import lzmaffi, sys; print( len( lzmaffi.open( sys.argv[1] ).read() ) );' tests/2k-recursive-tars.tpxz
    21560288

    real	0m0.109s
    user	0m0.086s
    sys	0m0.020s

time python3 -c 'import indexed_bzip2, sys; print( len( indexed_bzip2.IndexedBzip2File( sys.argv[1] ).read() ) );' tests/2k-recursive-tars.tar.bz2
    21514240

    real	0m0.119s
    user	0m0.090s
    sys	0m0.028s

python3 -m timeit -s 'import lzmaffi' 'lzmaffi.open( "tests/2k-recursive-tars.tar.xz" ).read()'
    5 loops, best of 5: 41.5 msec per loop
python3 -m timeit -s 'import lzmaffi' 'lzmaffi.open( "tests/2k-recursive-tars.tpxz" ).read()'
    10 loops, best of 5: 32.4 msec per loop
python3 -m timeit -s 'import indexed_bzip2' 'indexed_bzip2.IndexedBzip2File( "tests/2k-recursive-tars.tar.bz2" ).read()'
    5 loops, best of 5: 98 msec per loop
  -> The xz decoder is actually 2-3x faster than the bz2 decoder!

time cat bibi/mimi/01333.tar/foo
    1333

    real	0m0.003s
    user	0m0.002s
    sys	0m0.000s

time cat mimi/mimi/01333.tar/foo
    1333

    real	0m0.042s
    user	0m0.002s
    sys	0m0.000s

time cat pipi/mimi/01333.tar/foo
    1333

    real	0m0.029s
    user	0m0.001s
    sys	0m0.000s

time cat pipi/mimi/01500.tar/foo
    1500

    real	0m0.012s
    user	0m0.001s
    sys	0m0.000s

python3 -m timeit -s 'import io, lzmaffi; f = lzmaffi.open( "tests/2k-recursive-tars.tar.xz" );' 'f.seek( -1, io.SEEK_END ); f.seek( 10*1024*1024 ); f.read( 1 )'
    10 loops, best of 5: 34.1 msec per loop
python3 -m timeit -s 'import io, lzmaffi; f = lzmaffi.open( "tests/2k-recursive-tars.tpxz" );' 'f.seek( -1, io.SEEK_END ); f.seek( 10*1024*1024 ); f.read( 1 )'
    20 loops, best of 5: 13.7 msec per loop
python3 -m timeit -s 'import indexed_bzip2, io; f = indexed_bzip2.IndexedBzip2File( "tests/2k-recursive-tars.tar.bz2" )' 'f.seek( -1, io.SEEK_END ); f.seek( 10*1024*1024 ); f.read( 1 )'
    20 loops, best of 5: 12.7 msec per loop
  • You can actually see the seeking and block boundaries by accessing the files and timing the access

  • Also, reading a file located later in the TAR than the last accessed one is many times faster (~2 ms, i.e., ~10-20x)
    than reading that same file a second time, because the second read requires a backward seek!

  • Index Creation: BZ2 (24 Blocks): 0.52s, XZ (1 Block): 105s, XZ (3 Blocks): 58.7s

    • Several factors seem to combine to make the backend ~100x slower for mounting:
      • The recursive mounting requires one backward seek per recursive TAR
      • The xz files have 8x and 24x fewer blocks, making seeking less efficient
      • Decoding is actually roughly twice as fast as bz2!
      • The pixz file is generally ~25% faster for some reason, perhaps because of a different default compression level.
        => Decoding isn't the problem, and seeking by itself also does not seem to be the problem. At this point, I'm not sure why it isn't as fast as bz2.

Try to find the critical code location with cProfile

diff --git a/ratarmount.py b/ratarmount.py
index b71005d..7a6b5bd 100755
--- a/ratarmount.py
+++ b/ratarmount.py
@@ -1346,6 +1346,9 @@ class SQLiteIndexedTar:
         assert False, ( "Could not load or store block offsets for {} probably because adding support was forgotten!"
                         .format( self.compression ) )

+import cProfile
+import pstats
+
 class TarMount( fuse.Operations ):
     """
     This class implements the fusepy interface in order to create a mounted file system view
@@ -1384,6 +1387,15 @@ class TarMount( fuse.Operations ):
             except:
                 pass

+        tarFile = pathToMount[0]
+        pfname =  'ratarmount-profile'
+        cProfile.runctx( 'SQLiteIndexedTar( tarFile, writeIndex = True, encoding = self.encoding, **sqliteIndexedTarOptions )',
+                         globals(), locals(), pfname )
+        p = pstats.Stats( pfname )
+        p.sort_stats( pstats.SortKey.CUMULATIVE )
+        p.print_stats()
+        sys.exit( 0 )
+
         self.mountSources: List[Any] = [
             SQLiteIndexedTar( tarFile,
                               writeIndex = True,
./ratarmount.py -cr tests/2k-recursive-tars.tar.bz2 bibi
    Sun Dec 13 14:18:28 2020    ratarmount-profile

             686148 function calls (684134 primitive calls) in 0.671 seconds

       Ordered by: cumulative time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000    0.671    0.671 {built-in method builtins.exec}
            1    0.001    0.001    0.671    0.671 <string>:1(<module>)
            1    0.000    0.000    0.670    0.670 ./ratarmount.py:287(__init__)
       2001/1    0.038    0.000    0.651    0.651 ./ratarmount.py:600(createIndex)
         8005    0.013    0.000    0.358    0.000 /usr/lib/python3.8/tarfile.py:2292(next)
         6004    0.007    0.000    0.234    0.000 /usr/lib/python3.8/tarfile.py:1097(fromtarfile)
         6003    0.004    0.000    0.203    0.000 /usr/lib/python3.8/tarfile.py:2407(__iter__)
        14005    0.007    0.000    0.195    0.000 /usr/lib/python3.8/tarfile.py:516(read)
        14005    0.005    0.000    0.187    0.000 /usr/lib/python3.8/tarfile.py:523(_read)
        14005    0.017    0.000    0.182    0.000 /usr/lib/python3.8/tarfile.py:550(__read)
         2002    0.004    0.000    0.172    0.000 /usr/lib/python3.8/tarfile.py:1552(open)
         2002    0.006    0.000    0.166    0.000 /usr/lib/python3.8/tarfile.py:1441(__init__)
         4105    0.164    0.000    0.164    0.000 {method 'read' of '_io.BufferedReader' objects}
         6004    0.022    0.000    0.116    0.000 /usr/lib/python3.8/tarfile.py:1034(frombuf)
         6003    0.106    0.000    0.106    0.000 {method 'seek' of '_io.BufferedReader' objects}
         4001    0.004    0.000    0.101    0.000 /usr/lib/python3.8/tarfile.py:503(seek)
         2000    0.004    0.000    0.077    0.000 ./ratarmount.py:214(read)
         4002    0.006    0.000    0.059    0.000 ./ratarmount.py:957(_setFileInfo)
        32024    0.017    0.000    0.039    0.000 /usr/lib/python3.8/tarfile.py:172(nti)
         4003    0.010    0.000    0.036    0.000 /usr/lib/python3.8/tarfile.py:221(calc_chksums)
         6009    0.032    0.000    0.032    0.000 {method 'execute' of 'sqlite3.Connection' objects}
        52039    0.019    0.000    0.032    0.000 /usr/lib/python3.8/tarfile.py:164(nts)
         4002    0.008    0.000    0.025    0.000 ./ratarmount.py:931(_tryAddParentFolders)
         4004    0.018    0.000    0.018    0.000 {built-in method builtins.print}
         8006    0.015    0.000    0.015    0.000 {built-in method builtins.sum}
         4003    0.002    0.000    0.014    0.000 /usr/lib/python3.8/tarfile.py:1118(_proc_member)
         4003    0.005    0.000    0.012    0.000 /usr/lib/python3.8/tarfile.py:1131(_proc_builtin)
         8006    0.011    0.000    0.011    0.000 {built-in method _struct.unpack_from}
            1    0.000    0.000    0.011    0.011 ./ratarmount.py:1199(_openCompressedFile)
         4004    0.007    0.000    0.011    0.000 /usr/lib/python3.8/posixpath.py:334(normpath)
         2000    0.003    0.000    0.009    0.000 ./ratarmount.py:148(__init__)
            7    0.009    0.001    0.009    0.001 {method 'executescript' of 'sqlite3.Connection' objects}
         2004    0.002    0.000    0.008    0.000 ./ratarmount.py:1018(indexIsLoaded)
         4002    0.004    0.000    0.008    0.000 ./ratarmount.py:937(<listcomp>)
         4002    0.004    0.000    0.008    0.000 ./ratarmount.py:584(_updateProgressBar)
        52039    0.007    0.000    0.007    0.000 {method 'find' of 'bytes' objects}
         2251    0.007    0.000    0.007    0.000 {method 'executemany' of 'sqlite3.Connection' objects}
            1    0.000    0.000    0.006    0.006 ./ratarmount.py:529(_pathIsWritable)
        52041    0.006    0.000    0.006    0.000 {method 'decode' of 'bytes' objects}
            1    0.006    0.006    0.006    0.006 {method 'write' of '_io.BufferedWriter' objects}
            1    0.000    0.000    0.005    0.005 ./ratarmount.py:1183(_detectTar)
            1    0.000    0.000    0.005    0.005 ./ratarmount.py:1153(_detectCompression)
            1    0.000    0.000    0.005    0.005 /usr/lib/python3.8/tarfile.py:1643(taropen)
         2000    0.003    0.000    0.005    0.000 ./ratarmount.py:242(seek)
    [...]

./ratarmount.py -cr tests/2k-recursive-tars.tpxz bibi
    Sun Dec 13 14:20:01 2020    ratarmount-profile

             4455897 function calls (4453893 primitive calls) in 52.952 seconds

       Ordered by: cumulative time

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
            1    0.000    0.000   52.952   52.952 {built-in method builtins.exec}
            1    0.000    0.000   52.952   52.952 <string>:1(<module>)
            1    0.000    0.000   52.952   52.952 ./ratarmount.py:287(__init__)
       2001/1    0.126    0.000   52.913   52.913 ./ratarmount.py:600(createIndex)
 !!! ->  6001    0.025    0.000   51.823    0.009 ~/.local/lib/python3.8/site-packages/lzmaffi/__init__.py:482(seek)
        10104    1.746    0.000   49.294    0.005 ~/.local/lib/python3.8/site-packages/lzmaffi/__init__.py:399(_read_block)
         9091    0.024    0.000   47.518    0.005 ~/.local/lib/python3.8/site-packages/lzmaffi/__init__.py:453(_fill_buffer)
         4991    0.027    0.000   47.430    0.010 ~/.local/lib/python3.8/site-packages/lzmaffi/_lzmamodule2.py:711(decompress)
         4991   10.804    0.002   47.403    0.009 ~/.local/lib/python3.8/site-packages/lzmaffi/_lzmamodule2.py:727(_decompress)
       309720    0.121    0.000   31.368    0.000 ~/.local/lib/python3.8/site-packages/lzmaffi/_lzmamodule2.py:346(catch_lzma_error)
       297999   31.215    0.000   31.215    0.000 {built-in method _compiled_module.lzma_code}
       293008    4.770    0.000    4.770    0.000 {built-in method _compiled_module.realloc}
         3905    0.010    0.000    2.593    0.001 ~/.local/lib/python3.8/site-packages/lzmaffi/__init__.py:356(_move_to_block)
         3905    2.294    0.001    2.490    0.001 ~/.local/lib/python3.8/site-packages/lzmaffi/__init__.py:343(_init_decompressor)
       595998    0.263    0.000    0.461    0.000 ~/.local/lib/python3.8/site-packages/cffi/api.py:293(cast)
         8005    0.032    0.000    0.452    0.000 /usr/lib/python3.8/tarfile.py:2292(next)
         6004    0.013    0.000    0.338    0.000 /usr/lib/python3.8/tarfile.py:1097(fromtarfile)
         2002    0.014    0.000    0.257    0.000 /usr/lib/python3.8/tarfile.py:1552(open)
         6003    0.008    0.000    0.246    0.000 /usr/lib/python3.8/tarfile.py:2407(__iter__)
         2002    0.017    0.000    0.233    0.000 /usr/lib/python3.8/tarfile.py:1441(__init__)
         4002    0.023    0.000    0.222    0.000 ./ratarmount.py:957(_setFileInfo)
         6004    0.041    0.000    0.191    0.000 /usr/lib/python3.8/tarfile.py:1034(frombuf)
         6006    0.167    0.000    0.167    0.000 {method 'execute' of 'sqlite3.Connection' objects}
        14005    0.010    0.000    0.148    0.000 /usr/lib/python3.8/tarfile.py:516(read)
         3905    0.050    0.000    0.140    0.000 ~/.local/lib/python3.8/site-packages/lzmaffi/_lzmamodule2.py:656(__init__)
        14005    0.009    0.000    0.137    0.000 /usr/lib/python3.8/tarfile.py:523(_read)
        14005    0.026    0.000    0.127    0.000 /usr/lib/python3.8/tarfile.py:550(__read)
         4103    0.007    0.000    0.110    0.000 ~/.local/lib/python3.8/site-packages/lzmaffi/__init__.py:367(read)
       620531    0.096    0.000    0.096    0.000 ~/.local/lib/python3.8/site-packages/cffi/api.py:180(_typeof)
        12814    0.093    0.000    0.093    0.000 {method 'read' of '_io.BufferedReader' objects}
         3905    0.023    0.000    0.093    0.000 ~/.local/lib/python3.8/site-packages/lzmaffi/_lzmamodule2.py:549(find)
       595998    0.082    0.000    0.082    0.000 {built-in method _cffi_backend.cast}
         4001    0.008    0.000    0.068    0.000 /usr/lib/python3.8/tarfile.py:503(seek)
         4002    0.025    0.000    0.063    0.000 ./ratarmount.py:931(_tryAddParentFolders)
        32024    0.028    0.000    0.061    0.000 /usr/lib/python3.8/tarfile.py:172(nti)
        24533    0.021    0.000    0.060    0.000 ~/.local/lib/python3.8/site-packages/cffi/api.py:242(new)
         4007    0.060    0.000    0.060    0.000 {built-in method builtins.print}
         4003    0.016    0.000    0.052    0.000 /usr/lib/python3.8/tarfile.py:221(calc_chksums)
         2004    0.005    0.000    0.049    0.000 ./ratarmount.py:1018(indexIsLoaded)
         2000    0.014    0.000    0.046    0.000 ./ratarmount.py:148(__init__)
       639627    0.045    0.000    0.045    0.000 {built-in method builtins.isinstance}
         4002    0.017    0.000    0.044    0.000 ./ratarmount.py:584(_updateProgressBar)
        52039    0.022    0.000    0.043    0.000 /usr/lib/python3.8/tarfile.py:164(nts)
         2000    0.007    0.000    0.041    0.000 ./ratarmount.py:214(read)
         7816    0.018    0.000    0.040    0.000 ~/.local/lib/python3.8/site-packages/lzmaffi/_lzmamodule2.py:575(__init__)
         4003    0.006    0.000    0.039    0.000 /usr/lib/python3.8/tarfile.py:1118(_proc_member)
         3915    0.007    0.000    0.038    0.000 ~/.local/lib/python3.8/site-packages/lzmaffi/__init__.py:287(_peek)
         4003    0.010    0.000    0.033    0.000 /usr/lib/python3.8/tarfile.py:1131(_proc_builtin)
            1    0.000    0.000    0.030    0.030 ./ratarmount.py:1199(_openCompressedFile)
         2000    0.008    0.000    0.028    0.000 ./ratarmount.py:242(seek)
         3905    0.008    0.000    0.025    0.000 ~/.local/lib/python3.8/site-packages/lzmaffi/_lzmamodule2.py:296(_new_lzma_stream)
        24533    0.024    0.000    0.024    0.000 {built-in method _cffi_backend.newp}
         9091    0.008    0.000    0.023    0.000 ~/.local/lib/python3.8/site-packages/lzmaffi/__init__.py:41(memoryview_tobytes)
         4004    0.014    0.000    0.021    0.000 /usr/lib/python3.8/posixpath.py:334(normpath)
         8006    0.020    0.000    0.020    0.000 {built-in method _struct.unpack_from}
         3905    0.020    0.000    0.020    0.000 {built-in method _compiled_module.lzma_block_decoder}
         4002    0.011    0.000    0.017    0.000 ./ratarmount.py:937(<listcomp>)
         8006    0.017    0.000    0.017    0.000 {built-in method builtins.sum}
         4003    0.013    0.000    0.017    0.000 /usr/lib/python3.8/tarfile.py:1335(_apply_pax_info)
         2250    0.016    0.000    0.016    0.000 {method 'executemany' of 'sqlite3.Connection' objects}
         7838    0.016    0.000    0.016    0.000 {method 'seek' of '_io.BufferedReader' objects}
            1    0.000    0.000    0.015    0.015 ./ratarmount.py:1153(_detectCompression)
            1    0.000    0.000    0.015    0.015 ./ratarmount.py:1183(_detectTar)
         4003    0.014    0.000    0.014    0.000 /usr/lib/python3.8/tarfile.py:747(__init__)
        14009    0.014    0.000    0.014    0.000 {method 'join' of 'str' objects}
            1    0.000    0.000    0.014    0.014 /usr/lib/python3.8/tarfile.py:1643(taropen)
    [...]

=> Looks like the lzmaffi seek function is indeed problematic!
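The cProfile/pstats pattern from the patch above can be reproduced standalone; slow_sum here is just a toy workload standing in for the index creation:

```python
import cProfile
import pstats

def slow_sum(n):
    """Deliberately slow toy workload to profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total

# Run the call under the profiler and dump the stats to a file,
# exactly as cProfile.runctx is used in the patch.
profileName = "ratarmount-toy-profile"
cProfile.runctx("slow_sum(100_000)", globals(), locals(), profileName)

stats = pstats.Stats(profileName)
stats.sort_stats(pstats.SortKey.CUMULATIVE)
stats.print_stats(5)  # show only the five most expensive entries
```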

RecursionError: maximum recursion depth exceeded while calling a Python object

When trying to recursively mount ILSVRC2012_train.tar, I get this error at roughly 97% of the file, which is very frustrating. Python's default recursion limit is 1000, which, funnily enough, is identical to the 1000 synset classes in ILSVRC2012. It looks like reading the internal TAR does not stop at its end for some reason, maybe because ignore_zeros is set to True. The best option seems to be to use StenciledFile for the internal TAR so that an end of file is triggered when tarfile reaches the end. This might lead to some seeking, but normally that shouldn't be a problem. Alternatively, another file layer could be introduced that works on a streaming file object and basically only allows reads until a predefined size is reached.
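A sketch of the "read only until a predefined size" idea; this is a hypothetical stand-in, not ratarmount's actual StenciledFile:

```python
import io

class SizeLimitedFile(io.RawIOBase):
    """Present an (offset, size) slice of an underlying file object so
    that tarfile sees an end of file at the slice boundary."""

    def __init__(self, fileObject, offset, size):
        self.fileObject = fileObject
        self.offset = offset
        self.size = size
        self.position = 0
        self.fileObject.seek(offset)

    def readable(self):
        return True

    def read(self, amount=-1):
        # Clamp the read so it never crosses the slice boundary.
        remaining = self.size - self.position
        if amount < 0 or amount > remaining:
            amount = remaining
        data = self.fileObject.read(amount)
        self.position += len(data)
        return data

# A 4-byte window into a 10-byte buffer behaves like a 4-byte file:
window = SizeLimitedFile(io.BytesIO(b"0123456789"), offset=2, size=4)
assert window.read() == b"2345"  # reads stop after 4 bytes
assert window.read() == b""      # EOF at the slice boundary
```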

Currently at position 144614683136 of 147897477120 (97.78%). Estimated time remaining with current rate: 6 min 50 s, with average rate: 2 min 14 s.
Creating offset dictionary for n09835506.tar ...
Traceback (most recent call last):
  File "/home/user/.local/bin/ratarmount", line 8, in <module>
    sys.exit(cli())
  File "/home/user/.local/lib/python3.8/site-packages/ratarmount.py", line 1557, in cli
    fuseOperationsObject = TarMount(
  File "/home/user/.local/lib/python3.8/site-packages/ratarmount.py", line 995, in __init__
    self.mountSources = [ self._openTar( tarFile, clearIndexCache, recursive, gzipSeekPointSpacing )
  File "/home/user/.local/lib/python3.8/site-packages/ratarmount.py", line 995, in <listcomp>
    self.mountSources = [ self._openTar( tarFile, clearIndexCache, recursive, gzipSeekPointSpacing )
  File "/home/user/.local/lib/python3.8/site-packages/ratarmount.py", line 1039, in _openTar
    return SQLiteIndexedTar( tarFilePath,
  File "/home/user/.local/lib/python3.8/site-packages/ratarmount.py", line 294, in __init__
    self.createIndex( self.tarFileObject )
  File "/home/user/.local/lib/python3.8/site-packages/ratarmount.py", line 447, in createIndex
    self.createIndex( fileObject, progressBar, fullPath, globalOffset if streamed else 0 )
  File "/home/user/.local/lib/python3.8/site-packages/ratarmount.py", line 447, in createIndex
    self.createIndex( fileObject, progressBar, fullPath, globalOffset if streamed else 0 )
  File "/home/user/.local/lib/python3.8/site-packages/ratarmount.py", line 447, in createIndex
    self.createIndex( fileObject, progressBar, fullPath, globalOffset if streamed else 0 )
  [Previous line repeated 979 more times]
  File "/home/user/.local/lib/python3.8/site-packages/ratarmount.py", line 398, in createIndex
    loadedTarFile = tarfile.open( fileobj = fileObject, mode = 'r|' if streamed else 'r:', ignore_zeros = True )
  File "/usr/lib/python3.8/tarfile.py", line 1617, in open
    return func(name, filemode, fileobj, **kwargs)
  File "/usr/lib/python3.8/tarfile.py", line 1647, in taropen
    return cls(name, mode, fileobj, **kwargs)
  File "/usr/lib/python3.8/tarfile.py", line 1510, in __init__
    self.firstmember = self.next()
  File "/usr/lib/python3.8/tarfile.py", line 2311, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "/usr/lib/python3.8/tarfile.py", line 1103, in fromtarfile
    obj = cls.frombuf(buf, tarfile.encoding, tarfile.errors)
  File "/usr/lib/python3.8/tarfile.py", line 1045, in frombuf
    chksum = nti(buf[148:156])
  File "/usr/lib/python3.8/tarfile.py", line 186, in nti
    s = nts(s, "ascii", "strict")
  File "/usr/lib/python3.8/tarfile.py", line 167, in nts
    p = s.find(b"\0")
RecursionError: maximum recursion depth exceeded while calling a Python object
Exception ignored in: <function TarMount.__del__ at 0x7f87deeb75e0>
Traceback (most recent call last):
  File "/home/user/.local/lib/python3.8/site-packages/ratarmount.py", line 1036, in __del__
    os.close( self.mountPointFd )
AttributeError: mountPointFd

I think I had other problems with ignore_zeros and have already made it a command-line option in my local develop branch.
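The ignore_zeros behavior can be demonstrated with two concatenated in-memory TARs: with ignore_zeros=True, tarfile skips the end-of-archive zero blocks of the first TAR and keeps reading.

```python
import io
import tarfile

def make_tar(name, content):
    """Build an uncompressed single-member TAR in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        info = tarfile.TarInfo(name)
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))
    return buf.getvalue()

# Two TARs back to back; the first one ends with zero-filled
# end-of-archive blocks before the second one begins.
blob = make_tar("first.txt", b"a") + make_tar("second.txt", b"b")

# Default: reading stops at the first end-of-archive marker.
names_default = tarfile.open(fileobj=io.BytesIO(blob)).getnames()
# ignore_zeros=True: zero blocks are skipped, the second archive is found too.
names_ignore = tarfile.open(fileobj=io.BytesIO(blob), ignore_zeros=True).getnames()
```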

Python API

It would be very useful to have a Python API for ratarmount. Specifically, in my use case, I'd like to mount .tar.gz files directly from S3 / Azure Blob Storage -- in order to do that, I would need to be able to pass in file objects to mount (as opposed to actual files) to a ratarmount Python API.

Invalid file format

This is what I am getting from ratarmount:

python ./ratarmount.py /home/jose/.cache/yay/vfs495-daemon/sp84530.tar
Loading offset dictionary from /home/jose/.cache/yay/vfs495-daemon/sp84530.tar.index.custom ...
Traceback (most recent call last):
  File "./ratarmount.py", line 701, in <module>
    serializationBackend = args.serialization_backend  ),
  File "./ratarmount.py", line 555, in __init__
    serializationBackend = serializationBackend )
  File "./ratarmount.py", line 99, in __init__
    self.loadIndex( indexPathWitExt )
  File "./ratarmount.py", line 507, in loadIndex
    self.fileIndex = IndexedTar.load( indexFile )
  File "./ratarmount.py", line 206, in load
    value = IndexedTar.load( file )
  File "./ratarmount.py", line 211, in load
    str( int.from_bytes( valueType, byteorder = 'little' ) ) + ')' )
Exception: Custom TAR index loader: invalid file format (expected msgpack or dict but got0)

Tar file looks to be pretty normal:

$ tar tvf /home/jose/.cache/yay/vfs495-daemon/sp84530.tar               
drwxrwxrwx 0/0               0 2018-01-05 09:26 SP84530/
-rwxrwxrwx 0/0         2962461 2017-12-29 14:25 SP84530/Validity-Sensor-Setup-4.5-136.0.x86_64.rpm
$ file /home/jose/.cache/yay/vfs495-daemon/sp84530.tar
/home/jose/.cache/yay/vfs495-daemon/sp84530.tar: POSIX tar archive

No files mounted when tar uses relative paths

All the files appear to have been mounted within an unreachable '.' folder, with the index file listing them as being at locations like /./.config.

The tar.gz was generated on osx with a relative path given as well as a list of folders to exclude.
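A sketch of the normalization that would avoid the hidden '.' folder; posixpath.normpath collapses the './' components (the member-name handling here is an assumption, not ratarmount's actual code):

```python
import posixpath

def normalize_member_name(name):
    """Map TAR member names such as './.config' to absolute mount
    paths such as '/.config'."""
    # Anchor at the mount root, then collapse '.' components.
    return posixpath.normpath("/" + name)

assert normalize_member_name("./.config") == "/.config"
assert normalize_member_name(".") == "/"
```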

Indexing fails on surrogates

Trying to open a file, but it errors at one point because surrogates cannot be UTF-8 encoded. I made a new tar containing surrogates in file names to try to reproduce but was unable to. Unfortunately, I cannot open the tar using other programs to see any culprit files, as it is ~700GB.
Traceback:

Traceback (most recent call last):
  File "/home/collin/.local/bin/ratarmount", line 8, in <module>
    sys.exit(cli())
  File "/home/collin/.local/lib/python3.8/site-packages/ratarmount.py", line 1557, in cli
    fuseOperationsObject = TarMount(
  File "/home/collin/.local/lib/python3.8/site-packages/ratarmount.py", line 995, in __init__
    self.mountSources = [ self._openTar( tarFile, clearIndexCache, recursive, gzipSeekPointSpacing )
  File "/home/collin/.local/lib/python3.8/site-packages/ratarmount.py", line 995, in <listcomp>
    self.mountSources = [ self._openTar( tarFile, clearIndexCache, recursive, gzipSeekPointSpacing )
  File "/home/collin/.local/lib/python3.8/site-packages/ratarmount.py", line 1039, in _openTar
    return SQLiteIndexedTar( tarFilePath,
  File "/home/collin/.local/lib/python3.8/site-packages/ratarmount.py", line 294, in __init__
    self.createIndex( self.tarFileObject )
  File "/home/collin/.local/lib/python3.8/site-packages/ratarmount.py", line 479, in createIndex
    self._setFileInfo( fileInfo )
  File "/home/collin/.local/lib/python3.8/site-packages/ratarmount.py", line 657, in _setFileInfo
    self.sqlConnection.execute( 'INSERT OR REPLACE INTO "files" VALUES (' +
UnicodeEncodeError: 'utf-8' codec can't encode character '\udc83' in position 2: surrogates not allowed
Exception ignored in: <function TarMount.__del__ at 0x7fe07c99d550>
Traceback (most recent call last):
  File "/home/collin/.local/lib/python3.8/site-packages/ratarmount.py", line 1036, in __del__
    os.close( self.mountPointFd )
AttributeError: mountPointFd
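The failure can be reproduced in isolation: an invalid UTF-8 byte in a member name decodes (via the surrogateescape error handler) to a lone surrogate, which strict UTF-8 encoding, e.g. inside the SQLite INSERT, then rejects.

```python
rawName = b"ab\x83cd"  # 0x83 is not valid UTF-8 in this position
name = rawName.decode("utf-8", "surrogateescape")
assert "\udc83" in name  # the bad byte became a lone surrogate

try:
    name.encode("utf-8")  # strict encoding fails, as in the traceback above
    raise AssertionError("expected UnicodeEncodeError")
except UnicodeEncodeError:
    pass

# Keeping the surrogateescape handler on encode round-trips losslessly:
assert name.encode("utf-8", "surrogateescape") == rawName
```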

SQLiteIndexedTar requires a file object with the .fileno() method implemented

SQLiteIndexedTar requires a file object with the .fileno() method implemented. This means that I can't pass in something like BytesIO() to the fileObject initialization parameter, or it will give an error.

For now, this workaround works.

from io import BytesIO

fileobj = BytesIO()
fileobj.fileno = lambda: 0  # fake fileno so the check passes
t = SQLiteIndexedTar(fileObject=fileobj)

Request for union mount.

One of the nice things about tar is that it can be used to update an already existing folder. Currently, I mount the tar file in a separate directory, move the directory that I want to update with the tar file to another directory, and then use unionfs-fuse to merge the just-moved directory and the mounted tar.

I wish there was an easier way to do this.

Adding a setup.py and modifying a little for easy installing

Hi, by adding a setup.py and modifying the script a little to use entry points, installing on a server can be made much easier. Thanks a lot for the script btw, very useful.

Possible setup.py file

#!/usr/bin/env python3

from setuptools import setup

setup(
    name="ratarmount",
    version="1.0",
    description="Random Access Read-Only Tar Mount (Ratarmount)",
    author="Maximilian K.",
    author_email="https://github.com/mxmlnkn",
    py_modules=['ratarmount'],
    install_requires=[
        "fusepy",
        "lz4",
        "msgpack",
        "simplejson",
        "pyyaml",
        "ujson",
        "cbor",
        "python-rapidjson",
    ],
    entry_points={"console_scripts": ["ratarmount=ratarmount:cli"]},
)

Modifications to the script to use a function to launch the CLI; the change is at the end

#!/usr/bin/env python3

import argparse
import io
import itertools
import os
import re
import stat
import tarfile
import time
import traceback
from collections import namedtuple
from timeit import default_timer as timer

import fuse

printDebug = 1


def overrides(parentClass):
    def overrider(method):
        assert method.__name__ in dir(parentClass)
        return method

    return overrider


FileInfo = namedtuple("FileInfo", "offset size mtime mode type linkname uid gid istar")


class ProgressBar:
    def __init__(self, maxValue):
        self.maxValue = maxValue
        self.lastUpdateTime = time.time()
        self.lastUpdateValue = 0
        self.updateInterval = 2  # seconds
        self.creationTime = time.time()

    def update(self, value):
        if (
            self.lastUpdateTime is not None
            and (time.time() - self.lastUpdateTime) < self.updateInterval
        ):
            return

        # Use whole interval since start to estimate time
        eta1 = int((time.time() - self.creationTime) / value * (self.maxValue - value))
        # Use only a shorter window interval to estimate time.
        # Accounts better for higher speeds in beginning, e.g., caused by caching effects.
        # However, this estimate might vary a lot while the other one stabilizes after some time!
        eta2 = int(
            (time.time() - self.lastUpdateTime)
            / (value - self.lastUpdateValue)
            * (self.maxValue - value)
        )
        print(
            "Currently at position {} of {} ({:.2f}%). "
            "Estimated time remaining with current rate: {} min {} s, with average rate: {} min {} s.".format(
                value,
                self.maxValue,
                value / self.maxValue * 100.0,
                eta2 // 60,
                eta2 % 60,
                eta1 // 60,
                eta1 % 60,
            ),
            flush=True,
        )

        self.lastUpdateTime = time.time()
        self.lastUpdateValue = value


class IndexedTar:
    """
    This class reads once through the whole TAR archive and stores TAR file offsets
    for all contained files in an index to support fast seeking to a given file.
    """

    __slots__ = (
        "tarFileName",
        "fileIndex",
        "mountRecursively",
        "cacheFolder",
        "possibleIndexFilePaths",
        "indexFileName",
        "progressBar",
    )

    # these allowed backends also double as extensions for the index file to look for
    availableSerializationBackends = [
        "none",
        "pickle",
        "pickle2",
        "pickle3",
        "custom",
        "cbor",
        "msgpack",
        "rapidjson",
        "ujson",
        "simplejson",
    ]
    availableCompressions = ["", "lz4", "gz"]  # no compression

    def __init__(
        self,
        pathToTar=None,
        fileObject=None,
        writeIndex=False,
        clearIndexCache=False,
        recursive=False,
        serializationBackend=None,
        progressBar=None,
    ):
        self.progressBar = progressBar
        self.tarFileName = os.path.normpath(pathToTar)

        # Stores the file hierarchy in a dictionary with keys being either
        #  - the file and containing file metainformation
        #  - or keys being a folder name and containing a recursively defined dictionary.
        self.fileIndex = {}
        self.mountRecursively = recursive

        # will be used for storing indexes if current path is read-only
        self.cacheFolder = os.path.expanduser("~/.ratarmount")
        self.possibleIndexFilePaths = [
            self.tarFileName + ".index",
            self.cacheFolder + "/" + self.tarFileName.replace("/", "_") + ".index",
        ]

        if not serializationBackend:
            serializationBackend = "custom"

        if serializationBackend not in self.supportedIndexExtensions():
            print(
                "[Warning] Serialization backend '"
                + str(serializationBackend)
                + "' not supported.",
                "Defaulting to 'custom'!",
            )
            print(
                "List of supported extensions / backends:",
                self.supportedIndexExtensions(),
            )

            serializationBackend = "custom"

        # this is the actual index file, which will be used in the end, and by default
        self.indexFileName = self.possibleIndexFilePaths[0] + "." + serializationBackend

        if clearIndexCache:
            for indexPath in self.possibleIndexFilePaths:
                for extension in self.supportedIndexExtensions():
                    indexPathWitExt = indexPath + "." + extension
                    if os.path.isfile(indexPathWitExt):
                        os.remove(indexPathWitExt)

        if fileObject is not None:
            if writeIndex:
                print(
                    "Can't write out index for file object input. Ignoring this option."
                )
            self.createIndex(fileObject)
        else:
            # first try loading the index for the given serialization backend
            if serializationBackend is not None:
                for indexPath in self.possibleIndexFilePaths:
                    if self.tryLoadIndex(indexPath + "." + serializationBackend):
                        break

            # try loading the index from one of the pre-configured paths
            for indexPath in self.possibleIndexFilePaths:
                for extension in self.supportedIndexExtensions():
                    if self.tryLoadIndex(indexPath + "." + extension):
                        break

            if not self.indexIsLoaded():
                with open(self.tarFileName, "rb") as file:
                    self.createIndex(file)

                if writeIndex:
                    for indexPath in self.possibleIndexFilePaths:
                        indexPath += "." + serializationBackend

                        try:
                            folder = os.path.dirname(indexPath)
                            if not os.path.exists(folder):
                                os.mkdir(folder)

                            f = open(indexPath, "wb")
                            f.close()
                            os.remove(indexPath)
                            self.indexFileName = indexPath

                            break
                        except IOError:
                            if printDebug >= 2:
                                print("Could not create file:", indexPath)

                    try:
                        self.writeIndex(self.indexFileName)
                    except IOError:
                        print(
                            "[Info] Could not write TAR index to file. ",
                            "Subsequent mounts might be slow!",
                        )

    @staticmethod
    def supportedIndexExtensions():
        return [
            ".".join(combination).strip(".")
            for combination in itertools.product(
                IndexedTar.availableSerializationBackends,
                IndexedTar.availableCompressions,
            )
        ]

    @staticmethod
    def dump(toDump, file):
        import msgpack

        if isinstance(toDump, dict):
            file.write(b"\x01")  # magic code meaning "start dictionary object"

            for key, value in toDump.items():
                file.write(b"\x03")  # magic code meaning "serialized key value pair"
                IndexedTar.dump(key, file)
                IndexedTar.dump(value, file)

            file.write(b"\x02")  # magic code meaning "close dictionary object"

        elif isinstance(toDump, FileInfo):
            serialized = msgpack.dumps(toDump)
            file.write(b"\x05")  # magic code meaning "msgpack object"
            file.write(len(serialized).to_bytes(4, byteorder="little"))
            file.write(serialized)

        elif isinstance(toDump, str):
            serialized = toDump.encode()
            file.write(b"\x04")  # magic code meaning "string object"
            file.write(len(serialized).to_bytes(4, byteorder="little"))
            file.write(serialized)

        else:
            print("Ignoring unsupported type to write:", toDump)

    @staticmethod
    def load(file):
        import msgpack

        elementType = file.read(1)

        if elementType != b"\x01":  # start of dictionary
            raise Exception("Custom TAR index loader: invalid file format")

        result = {}

        dictElementType = file.read(1)
        while dictElementType:
            if dictElementType == b"\x02":
                break

            elif dictElementType == b"\x03":
                keyType = file.read(1)
                if keyType != b"\x04":  # key must be string object
                    raise Exception("Custom TAR index loader: invalid file format")
                size = int.from_bytes(file.read(4), byteorder="little")
                key = file.read(size).decode()

                valueType = file.read(1)
                if valueType == b"\x05":  # msgpack object
                    size = int.from_bytes(file.read(4), byteorder="little")
                    serialized = file.read(size)
                    value = FileInfo(*msgpack.loads(serialized))

                elif valueType == b"\x01":  # dict object
                    file.seek(-1, io.SEEK_CUR)
                    value = IndexedTar.load(file)

                else:
                    raise Exception(
                        "Custom TAR index loader: invalid file format "
                        + "(expected msgpack or dict but got "
                        + str(int.from_bytes(valueType, byteorder="little"))
                        + ")"
                    )

                result[key] = value

            else:
                raise Exception(
                    "Custom TAR index loader: invalid file format "
                    + "(expected end-of-dict or key-value pair but got "
                    + str(int.from_bytes(dictElementType, byteorder="little"))
                    + ")"
                )

            dictElementType = file.read(1)

        return result

    def getFileInfo(self, path, listDir=False):
        # go down file hierarchy tree along the given path
        p = self.fileIndex
        for name in os.path.normpath(path).split(os.sep):
            if not name:
                continue
            if name not in p:
                return None
            p = p[name]

        def repackDeserializedNamedTuple(p):
            if isinstance(p, list) and len(p) == len(FileInfo._fields):
                return FileInfo(*p)

            if (
                isinstance(p, dict)
                and len(p) == len(FileInfo._fields)
                and "uid" in p
                and isinstance(p["uid"], int)
            ):
                # a normal directory dict must only have dict or FileInfo values,
                # so if the value to the 'uid' key is an actual int,
                # then it is sure it is a deserialized FileInfo object and not a file named 'uid'
                return FileInfo(**p)

            return p

        p = repackDeserializedNamedTuple(p)

        # if the directory contents are not to be printed and it is a directory,
        # return the "file" info of ".", which holds the directory metainformation
        if not listDir and isinstance(p, dict):
            if "." in p:
                p = p["."]
            else:
                return FileInfo(
                    offset=0,  # not necessary for directory anyways
                    size=1,  # might be misleading / non-conform
                    mtime=0,
                    mode=0o555 | stat.S_IFDIR,
                    type=tarfile.DIRTYPE,
                    linkname="",
                    uid=0,
                    gid=0,
                    istar=False,
                )

        return repackDeserializedNamedTuple(p)

    def isDir(self, path):
        return isinstance(self.getFileInfo(path, listDir=True), dict)

    def exists(self, path):
        path = os.path.normpath(path)
        return self.isDir(path) or isinstance(self.getFileInfo(path), FileInfo)

    def setFileInfo(self, path, fileInfo):
        """
        path: the full path to the file with leading slash (/) for which to set the file info
        """
        assert isinstance(fileInfo, FileInfo)

        pathHierarchy = os.path.normpath(path).split(os.sep)
        if not pathHierarchy:
            return

        # go down file hierarchy tree along the given path
        p = self.fileIndex
        for name in pathHierarchy[:-1]:
            if not name:
                continue
            assert isinstance(p, dict)
            p = p.setdefault(name, {})

        # create a new key in the dictionary of the parent folder
        p.update({pathHierarchy[-1]: fileInfo})

    def setDirInfo(self, path, dirInfo, dirContents=None):
        """
        path: the full path to the file with leading slash (/) for which to set the folder info
        """
        if dirContents is None:
            dirContents = {}  # avoid sharing one mutable default dict across calls

        assert isinstance(dirInfo, FileInfo)
        assert isinstance(dirContents, dict)

        pathHierarchy = os.path.normpath(path).strip(os.sep).split(os.sep)
        if not pathHierarchy:
            return

        # go down file hierarchy tree along the given path
        p = self.fileIndex
        for name in pathHierarchy[:-1]:
            if not name:
                continue
            assert isinstance(p, dict)
            p = p.setdefault(name, {})

        # create a new key in the dictionary of the parent folder
        p.update({pathHierarchy[-1]: dirContents})
        p[pathHierarchy[-1]].update({".": dirInfo})

    def createIndex(self, fileObject):
        if printDebug >= 1:
            print(
                "Creating offset dictionary for",
                "<file object>" if self.tarFileName is None else self.tarFileName,
                "...",
            )
        t0 = timer()

        self.fileIndex = {}
        try:
            loadedTarFile = tarfile.open(fileobj=fileObject, mode="r:")
        except tarfile.ReadError as exception:
            print(
                "Archive can't be opened! This might happen for compressed TAR archives, "
                "which are currently not supported."
            )
            raise exception

        if self.progressBar is None and os.path.isfile(self.tarFileName):
            self.progressBar = ProgressBar(os.stat(self.tarFileName).st_size)

        for tarInfo in loadedTarFile:
            if self.progressBar is not None:
                self.progressBar.update(tarInfo.offset_data)

            mode = tarInfo.mode
            if tarInfo.isdir():
                mode |= stat.S_IFDIR
            if tarInfo.isfile():
                mode |= stat.S_IFREG
            if tarInfo.issym():
                mode |= stat.S_IFLNK
            if tarInfo.ischr():
                mode |= stat.S_IFCHR
            if tarInfo.isfifo():
                mode |= stat.S_IFIFO
            fileInfo = FileInfo(
                offset=tarInfo.offset_data,
                size=tarInfo.size,
                mtime=tarInfo.mtime,
                mode=mode,
                type=tarInfo.type,
                linkname=tarInfo.linkname,
                uid=tarInfo.uid,
                gid=tarInfo.gid,
                istar=False,
            )

            # open contained tars for recursive mounting
            indexedTar = None
            if (
                self.mountRecursively
                and tarInfo.isfile()
                and tarInfo.name.endswith(".tar")
            ):
                oldPos = fileObject.tell()
                if oldPos != tarInfo.offset_data:
                    fileObject.seek(tarInfo.offset_data)
                indexedTar = IndexedTar(
                    tarInfo.name,
                    fileObject=fileObject,
                    writeIndex=False,
                    progressBar=self.progressBar,
                )
                # restore the original position; especially necessary if the .tar is not actually a tar!
                fileObject.seek(oldPos)

            # Add a leading '/' as a convention where '/' represents the TAR root folder
            # Partly, done because fusepy specifies paths in a mounted directory like this
            path = os.path.normpath("/" + tarInfo.name)

            # test whether the TAR file could be loaded and if so "mount" it recursively
            if indexedTar is not None and indexedTar.indexIsLoaded():
                # actually apply the recursive tar mounting
                extractedName = re.sub(r"\.tar$", "", path)
                if not self.exists(extractedName):
                    path = extractedName

                mountMode = (fileInfo.mode & 0o777) | stat.S_IFDIR
                if mountMode & stat.S_IRUSR != 0:
                    mountMode |= stat.S_IXUSR
                if mountMode & stat.S_IRGRP != 0:
                    mountMode |= stat.S_IXGRP
                if mountMode & stat.S_IROTH != 0:
                    mountMode |= stat.S_IXOTH
                fileInfo = fileInfo._replace(mode=mountMode, istar=True)

                if self.exists(path):
                    print(
                        "[Warning]",
                        path,
                        "already exists in database and will be overwritten!",
                    )

                # merge fileIndex from recursively loaded TAR into our Indexes
                self.setDirInfo(path, fileInfo, indexedTar.fileIndex)

            elif path != "/":
                # just a warning and check for the path already existing
                if self.exists(path):
                    fileInfo = self.getFileInfo(path, listDir=False)
                    if fileInfo.istar:
                        # move recursively mounted TAR directory to original .tar name if there is a name-clash,
                        # e.g., when foo/ also exists in the TAR but foo.tar would be mounted to foo/.
                        # In this case, move that mount to foo.tar/
                        self.setDirInfo(
                            path + ".tar",
                            fileInfo,
                            self.getFileInfo(path, listDir=True),
                        )
                    else:
                        print(
                            "[Warning]",
                            path,
                            "already exists in database and will be overwritten!",
                        )

                # simply store the file or directory information from current TAR item
                if tarInfo.isdir():
                    self.setDirInfo(path, fileInfo, {})
                else:
                    self.setFileInfo(path, fileInfo)

        t1 = timer()
        if printDebug >= 1:
            print(
                "Creating offset dictionary for",
                "<file object>" if self.tarFileName is None else self.tarFileName,
                "took {:.2f}s".format(t1 - t0),
            )

    def serializationBackendFromFileName(self, fileName):
        splitName = fileName.split(".")

        if (
            len(splitName) > 2
            and ".".join(splitName[-2:]) in self.supportedIndexExtensions()
        ):
            return ".".join(splitName[-2:])

        if splitName[-1] in self.supportedIndexExtensions():
            return splitName[-1]

        return None

    def indexIsLoaded(self):
        return bool(self.fileIndex)

    def writeIndex(self, outFileName):
        """
        outFileName: Full file name with backend extension.
                     Depending on the extension the serialization is chosen.
        """

        serializationBackend = self.serializationBackendFromFileName(outFileName)

        if printDebug >= 1:
            print(
                "Writing out TAR index using",
                serializationBackend,
                "to",
                outFileName,
                "...",
            )
        t0 = timer()

        fileMode = "wt" if "json" in serializationBackend else "wb"

        if serializationBackend.endswith(".lz4"):
            import lz4.frame

            wrapperOpen = lambda x: lz4.frame.open(x, fileMode)
        elif serializationBackend.endswith(".gz"):
            import gzip

            wrapperOpen = lambda x: gzip.open(x, fileMode)
        else:
            wrapperOpen = lambda x: open(x, fileMode)
        serializationBackend = serializationBackend.split(".")[0]

        # libraries tested but not working:
        #  - marshal: can't serialize namedtuples
        #  - hickle: for some reason, creates files almost 64x larger and slower than pickle!?
        #  - yaml: almost a 10 times slower and more memory usage and deserializes everything including ints to string

        if serializationBackend == "none":
            print(
                "Won't write out index file because backend 'none' was chosen. "
                "Subsequent mounts might be slow!"
            )
            return

        with wrapperOpen(outFileName) as outFile:
            if serializationBackend == "pickle2":
                import pickle

                pickle.dump(self.fileIndex, outFile, protocol=2)

            # default serialization because it has the fewest dependencies and because it was legacy default
            elif (
                serializationBackend == "pickle3"
                or serializationBackend == "pickle"
                or serializationBackend is None
            ):
                import pickle

                # protocol 3 was the default up to Python 3.7
                pickle.dump(self.fileIndex, outFile, protocol=3)

            elif serializationBackend == "simplejson":
                import simplejson

                simplejson.dump(self.fileIndex, outFile, namedtuple_as_object=True)

            elif serializationBackend == "custom":
                IndexedTar.dump(self.fileIndex, outFile)

            elif serializationBackend in ["msgpack", "cbor", "rapidjson", "ujson"]:
                import importlib

                module = importlib.import_module(serializationBackend)
                getattr(module, "dump")(self.fileIndex, outFile)

            else:
                print(
                    "Tried to save index with unsupported extension backend:",
                    serializationBackend,
                    "!",
                )

        t1 = timer()
        if printDebug >= 1:
            print(
                "Writing out TAR index to",
                outFileName,
                "took {:.2f}s".format(t1 - t0),
                "and is sized",
                os.stat(outFileName).st_size,
                "B",
            )

    def loadIndex(self, indexFileName):
        if printDebug >= 1:
            print("Loading offset dictionary from", indexFileName, "...")
        t0 = timer()

        serializationBackend = self.serializationBackendFromFileName(indexFileName)

        fileMode = "rt" if "json" in serializationBackend else "rb"

        if serializationBackend.endswith(".lz4"):
            import lz4.frame

            wrapperOpen = lambda x: lz4.frame.open(x, fileMode)
        elif serializationBackend.endswith(".gz"):
            import gzip

            wrapperOpen = lambda x: gzip.open(x, fileMode)
        else:
            wrapperOpen = lambda x: open(x, fileMode)
        serializationBackend = serializationBackend.split(".")[0]

        with wrapperOpen(indexFileName) as indexFile:
            if serializationBackend in ("pickle2", "pickle3", "pickle"):
                import pickle

                self.fileIndex = pickle.load(indexFile)

            elif serializationBackend == "custom":
                self.fileIndex = IndexedTar.load(indexFile)

            elif serializationBackend == "msgpack":
                import msgpack

                self.fileIndex = msgpack.load(indexFile, raw=False)

            elif serializationBackend == "simplejson":
                import simplejson

                self.fileIndex = simplejson.load(indexFile, namedtuple_as_object=True)

            elif serializationBackend in ["cbor", "rapidjson", "ujson"]:
                import importlib

                module = importlib.import_module(serializationBackend)
                self.fileIndex = getattr(module, "load")(indexFile)

            else:
                print(
                    "Tried to load index path with unsupported serializationBackend:",
                    serializationBackend,
                    "!",
                )
                return

        if printDebug >= 2:

            def countDictEntries(d):
                n = 0
                for value in d.values():
                    n += countDictEntries(value) if isinstance(value, dict) else 1
                return n

            print("Files:", countDictEntries(self.fileIndex))

        t1 = timer()
        if printDebug >= 1:
            print(
                "Loading offset dictionary from",
                indexFileName,
                "took {:.2f}s".format(t1 - t0),
            )

    def tryLoadIndex(self, indexFileName):
        """calls loadIndex if index is not loaded already and provides extensive error handling"""

        if self.indexIsLoaded():
            return True

        if not os.path.isfile(indexFileName):
            return False

        if os.path.getsize(indexFileName) == 0:
            try:
                os.remove(indexFileName)
            except OSError:
                print(
                    "[Warning] Failed to remove empty old cached index file:",
                    indexFileName,
                )

            return False

        try:
            self.loadIndex(indexFileName)
        except Exception:
            self.fileIndex = None

            traceback.print_exc()
            print("[Warning] Could not load file '" + indexFileName + "'")

            print(
                "[Info] Some likely reasons for not being able to load the index file:"
            )
            print("[Info]   - Some dependencies are missing. Please install them with:")
            print("[Info]       pip3 install --user -r requirements.txt")
            print("[Info]   - The file has incorrect read permissions")
            print("[Info]   - The file got corrupted because of:")
            print(
                "[Info]     - The program exited while it was still writing the index because of:"
            )
            print("[Info]       - the user sent SIGINT to force the program to quit")
            print("[Info]       - an internal error occurred while writing the index")
            print("[Info]       - the disk filled up while writing the index")
            print("[Info]     - Rare low-level corruptions caused by hardware failure")

            print(
                "[Info] This might force a time-costly index recreation, so if it happens often and "
                "mounting is slow, try to find out why loading fails repeatedly, "
                "e.g., by opening an issue on the public GitHub page."
            )

            try:
                os.remove(indexFileName)
            except OSError:
                print(
                    "[Warning] Failed to remove corrupted old cached index file:",
                    indexFileName,
                )

        return self.indexIsLoaded()


class TarMount(fuse.Operations):
    """
    This class implements the fusepy interface in order to create a mounted file system view
    to a TAR archive.
    This class can and is relatively thin as it only has to create and manage an IndexedTar
    object and query it for directory or file contents.
    It also adds a layer over the file permissions: all files are reported as read-only,
    even if the TAR metadata marks them writable, because no TAR write support is planned.
    """

    def __init__(
        self,
        pathToMount,
        clearIndexCache=False,
        recursive=False,
        serializationBackend=None,
        prefix="",
    ):
        self.tarFileName = pathToMount
        self.tarFile = open(self.tarFileName, "rb")
        self.indexedTar = IndexedTar(
            self.tarFileName,
            writeIndex=True,
            clearIndexCache=clearIndexCache,
            recursive=recursive,
            serializationBackend=serializationBackend,
        )

        if prefix and not self.indexedTar.isDir(prefix):
            prefix = ""
        if prefix and not prefix.endswith("/"):
            prefix += "/"
        self.prefix = prefix

        # make the mount point read only and executable if readable, i.e., allow directory listing
        # @todo In some cases, I even saw 2(!) '.' directories listed with ls -la!
        #       But without this, the mount directory is owned by root
        tarStats = os.stat(self.tarFileName)
        # clear higher bits like S_IFREG and set the directory bit instead
        mountMode = (tarStats.st_mode & 0o777) | stat.S_IFDIR
        if mountMode & stat.S_IRUSR != 0:
            mountMode |= stat.S_IXUSR
        if mountMode & stat.S_IRGRP != 0:
            mountMode |= stat.S_IXGRP
        if mountMode & stat.S_IROTH != 0:
            mountMode |= stat.S_IXOTH
        self.indexedTar.fileIndex[self.prefix + "."] = FileInfo(
            offset=0,
            size=tarStats.st_size,
            mtime=tarStats.st_mtime,
            mode=mountMode,
            type=tarfile.DIRTYPE,
            linkname="",
            uid=tarStats.st_uid,
            gid=tarStats.st_gid,
            istar=True,
        )

        if printDebug >= 3:
            print("Loaded File Index:", self.indexedTar.fileIndex)

    @overrides(fuse.Operations)
    def getattr(self, path, fh=None):
        if printDebug >= 2:
            print("[getattr( path =", path, ", fh =", fh, ")] Enter")

        fileInfo = self.indexedTar.getFileInfo(self.prefix + path, listDir=False)
        if not isinstance(fileInfo, FileInfo):
            if printDebug >= 2:
                print("Could not find path:", path)
            raise fuse.FuseOSError(fuse.errno.ENOENT)

        # dictionary keys: https://pubs.opengroup.org/onlinepubs/007904875/basedefs/sys/stat.h.html
        statDict = dict(
            ("st_" + key, getattr(fileInfo, key))
            for key in ("size", "mtime", "mode", "uid", "gid")
        )
        # signal that everything was mounted read-only
        statDict["st_mode"] &= ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH)
        statDict["st_mtime"] = int(statDict["st_mtime"])
        statDict["st_nlink"] = 2

        if printDebug >= 2:
            print("[getattr( path =", path, ", fh =", fh, ")] return:", statDict)

        return statDict

    @overrides(fuse.Operations)
    def readdir(self, path, fh):
        if printDebug >= 2:
            print(
                "[readdir( path =",
                path,
                ", fh =",
                fh,
                ")] return:",
                self.indexedTar.getFileInfo(self.prefix + path, listDir=True).keys(),
            )

        # we only need to return these special directories. FUSE automatically expands these and will not ask
        # for paths like /../foo/./../bar, so we don't need to worry about cleaning such paths
        yield "."
        yield ".."

        for key in self.indexedTar.getFileInfo(self.prefix + path, listDir=True).keys():
            yield key

    @overrides(fuse.Operations)
    def readlink(self, path):
        if printDebug >= 2:
            print("[readlink( path =", path, ")]")

        fileInfo = self.indexedTar.getFileInfo(self.prefix + path)
        if not isinstance(fileInfo, FileInfo):
            raise fuse.FuseOSError(fuse.errno.ENOENT)

        pathname = fileInfo.linkname
        if pathname.startswith("/"):
            return os.path.relpath(
                pathname, "/"
            )  # @todo Not exactly sure what to return here

        return pathname

    @overrides(fuse.Operations)
    def read(self, path, length, offset, fh):
        if printDebug >= 2:
            print(
                "[read( path =",
                path,
                ", length =",
                length,
                ", offset =",
                offset,
                ", fh =",
                fh,
                ")] path:",
                path,
            )

        fileInfo = self.indexedTar.getFileInfo(self.prefix + path)
        if not isinstance(fileInfo, FileInfo):
            raise fuse.FuseOSError(fuse.errno.ENOENT)

        self.tarFile.seek(fileInfo.offset + offset, os.SEEK_SET)
        return self.tarFile.read(length)


def cli():

    global printDebug

    parser = argparse.ArgumentParser(
        formatter_class=argparse.ArgumentDefaultsHelpFormatter,
        description="""\
        If no mount path is specified, then the tar will be mounted to a folder of the same name but without a file extension.
        TAR files contained inside the tar and even TARs in TARs in TARs will be mounted recursively at folders of the same name minus the file extension '.tar'.

        In order to reduce the mounting time, the created index for random access to files inside the tar will be saved to <path to tar>.index.<backend>[.<compression>]. If it can't be saved there, it will be saved in ~/.ratarmount/<path to tar: '/' -> '_'>.index.<backend>[.<compression>].
        """,
    )

    parser.add_argument(
        "-f",
        "--foreground",
        action="store_true",
        default=False,
        help="keeps the python program in foreground so it can print debug "
        "output when the mounted path is accessed.",
    )

    parser.add_argument(
        "-d",
        "--debug",
        type=int,
        default=1,
        help="sets the debugging level. Higher means more output. Currently 3 is the highest",
    )

    parser.add_argument(
        "-c",
        "--recreate-index",
        action="store_true",
        default=False,
        help="if specified, pre-existing .index files will be deleted and recreated",
    )

    parser.add_argument(
        "-r",
        "--recursive",
        action="store_true",
        default=False,
        help="mount TAR archives inside the mounted TAR recursively. Note that this only has an effect when creating an index. If an index already exists, then this option will be effectively ignored. Recreate the index if you want to change the recursive mounting policy anyway.",
    )

    parser.add_argument(
        "-s",
        "--serialization-backend",
        type=str,
        default="custom",
        help="specify which library to use for writing out the TAR index. Supported keywords: ("
        + ",".join(IndexedTar.availableSerializationBackends)
        + ")[.("
        + ",".join(IndexedTar.availableCompressions).strip(",")
        + ")]",
    )

    parser.add_argument(
        "-p",
        "--prefix",
        type=str,
        default="",
        help="The specified path to the folder inside the TAR will be mounted to root. "
        "This can be useful when the archive was created with absolute paths. "
        "E.g., for an archive created with `tar -P -cf log.tar /var/log/apt/history.log`, "
        "-p /var/log/apt/ can be specified so that the mount target directory "
        ">directly< contains history.log.",
    )

    parser.add_argument(
        "tarfilepath",
        metavar="tar-file-path",
        type=argparse.FileType("r"),
        nargs=1,
        help="the path to the TAR archive to be mounted",
    )
    parser.add_argument(
        "mountpath",
        metavar="mount-path",
        nargs="?",
        help="the path to a folder to mount the TAR contents into",
    )

    args = parser.parse_args()

    tarToMount = os.path.abspath(args.tarfilepath[0].name)
    try:
        tarfile.open(tarToMount, mode="r:")
    except tarfile.ReadError:
        print(
            "Archive",
            tarToMount,
            "can't be opened!",
            "This might happen for compressed TAR archives, which are currently not supported.",
        )
        exit(1)

    mountPath = args.mountpath
    if mountPath is None:
        mountPath = os.path.splitext(tarToMount)[0]

    mountPathWasCreated = False
    if not os.path.exists(mountPath):
        os.mkdir(mountPath)
        mountPathWasCreated = True

    printDebug = args.debug

    fuseOperationsObject = TarMount(
        pathToMount=tarToMount,
        clearIndexCache=args.recreate_index,
        recursive=args.recursive,
        serializationBackend=args.serialization_backend,
        prefix=args.prefix,
    )

    fuse.FUSE(
        operations=fuseOperationsObject,
        mountpoint=mountPath,
        foreground=args.foreground,
    )

    if mountPathWasCreated and args.foreground:
        os.rmdir(mountPath)


if __name__ == "__main__":
    cli()

Support for access to incomplete .tar.gz files

Try this:

timeout 30 kaggle competitions download -c imagenet-object-localization-challenge
7z x imagenet-object-localization-challenge.zip
gdb --args python3 $( which ratarmount ) imagenet_object_localization_patched2019.tar.gz
    Currently at position 31223576 of 190691051 (16.37%). Estimated time remaining with current rate: 0 min 46 s, with average rate: 2 min 45 s.
    Currently at position 148666663 of 190691051 (77.96%). Estimated time remaining with current rate: 0 min 0 s, with average rate: 0 min 9 s.

Presumably any other truncated .tar.gz will do, though. Backtrace:

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6a285c8 in zran_read () from /home/user/.local/lib/python3.8/site-packages/indexed_gzip/indexed_gzip.cpython-38-x86_64-linux-gnu.so
(gdb) bt
#0  0x00007ffff6a285c8 in zran_read () from /home/user/.local/lib/python3.8/site-packages/indexed_gzip/indexed_gzip.cpython-38-x86_64-linux-gnu.so
#1  0x00007ffff6a20923 in ?? () from /home/user/.local/lib/python3.8/site-packages/indexed_gzip/indexed_gzip.cpython-38-x86_64-linux-gnu.so
#2  0x00007ffff6a093df in ?? () from /home/user/.local/lib/python3.8/site-packages/indexed_gzip/indexed_gzip.cpython-38-x86_64-linux-gnu.so
#3  0x00000000005f2246 in _PyObject_MakeTpCall ()
#4  0x00000000005084f3 in ?? ()
#5  0x00000000005ef1f1 in ?? ()
#6  0x00000000005efd2c in PyObject_CallMethodObjArgs ()
#7  0x0000000000648ced in ?? ()
#8  0x0000000000648dc2 in ?? ()
#9  0x0000000000649dbd in ?? ()
#10 0x0000000000502323 in ?? ()
#11 0x0000000000567325 in _PyEval_EvalFrameDefault ()
#12 0x00000000005f19cb in _PyFunction_Vectorcall ()
#13 0x0000000000567325 in _PyEval_EvalFrameDefault ()
#14 0x00000000005f19cb in _PyFunction_Vectorcall ()
#15 0x0000000000567325 in _PyEval_EvalFrameDefault ()
#16 0x00000000005f19cb in _PyFunction_Vectorcall ()
#17 0x0000000000567325 in _PyEval_EvalFrameDefault ()
#18 0x00000000005654d2 in _PyEval_EvalCodeWithName ()
#19 0x00000000005f1bc5 in _PyFunction_Vectorcall ()
#20 0x0000000000567325 in _PyEval_EvalFrameDefault ()
#21 0x00000000005f19cb in _PyFunction_Vectorcall ()
#22 0x0000000000567325 in _PyEval_EvalFrameDefault ()
#23 0x00000000004fe210 in ?? ()
#24 0x00000000005675bd in _PyEval_EvalFrameDefault ()
#25 0x00000000005654d2 in _PyEval_EvalCodeWithName ()
#26 0x00000000005f1bc5 in _PyFunction_Vectorcall ()
#27 0x0000000000567325 in _PyEval_EvalFrameDefault ()
#28 0x00000000005654d2 in _PyEval_EvalCodeWithName ()
#29 0x00000000005f1bc5 in _PyFunction_Vectorcall ()
#30 0x0000000000597ff8 in ?? ()
#31 0x00000000005f21a7 in _PyObject_MakeTpCall ()
#32 0x000000000056cbdd in _PyEval_EvalFrameDefault ()
#33 0x00000000005f19cb in _PyFunction_Vectorcall ()
--Type <RET> for more, q to quit, c to continue without paging--
#34 0x0000000000567325 in _PyEval_EvalFrameDefault ()
#35 0x00000000005654d2 in _PyEval_EvalCodeWithName ()
#36 0x00000000005f1bc5 in _PyFunction_Vectorcall ()
#37 0x00000000005671fd in _PyEval_EvalFrameDefault ()
#38 0x00000000005654d2 in _PyEval_EvalCodeWithName ()
#39 0x00000000005f1bc5 in _PyFunction_Vectorcall ()
#40 0x0000000000597ff8 in ?? ()
#41 0x00000000005f21a7 in _PyObject_MakeTpCall ()
#42 0x000000000056cbdd in _PyEval_EvalFrameDefault ()
#43 0x00000000006ab288 in ?? ()
#44 0x00000000005671fd in _PyEval_EvalFrameDefault ()
#45 0x00000000005654d2 in _PyEval_EvalCodeWithName ()
#46 0x0000000000686d53 in PyEval_EvalCode ()
#47 0x0000000000676101 in ?? ()
#48 0x000000000067617f in ?? ()
#49 0x0000000000676237 in PyRun_FileExFlags ()
#50 0x00000000006782ba in PyRun_SimpleFileExFlags ()
#51 0x00000000006af5ce in Py_RunMain ()
#52 0x00000000006af959 in Py_BytesMain ()
#53 0x00007ffff7dd90b3 in __libc_start_main (main=0x4ec640 <main>, argc=3, argv=0x7fffffffdca8, init=<optimized out>, fini=<optimized out>, 
    rtld_fini=<optimized out>, stack_end=0x7fffffffdc98) at ../csu/libc-start.c:308
#54 0x00000000005f69be in _start ()

Because it is a segfault, it doesn't seem like I can catch any exception. Also the error comes from indexed_gzip, so I guess I'd have to ask there.

Edit:

This seems to be another issue. I can't reproduce it with e.g. the first 1MB of firefox-2.tar.gz.

Even the system gzip prints a warning:

gzip -d -k imagenet_object_localization_patched2019.tar.gz
gzip: imagenet_object_localization_patched2019.tar.gz: invalid compressed data--format violated

It might be that kaggle is downloading in parallel and therefore the end of the gz might have some patches containing only 0 bytes or something like that.
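As an aside, such truncation can at least be detected up front: Python's own gzip module raises EOFError when the end-of-stream marker is missing. A minimal diagnostic sketch (the helper name is made up; it reads the whole stream, so this would not be cheap on a 190 GB file):

```python
import gzip
import io
import zlib

def is_gzip_complete(data):
    """Return True if the gzip byte stream decompresses fully without error.

    Truncated streams raise EOFError ("Compressed file ended before the
    end-of-stream marker was reached"); corrupted deflate data raises
    zlib.error or OSError/BadGzipFile.
    """
    try:
        with gzip.GzipFile(fileobj=io.BytesIO(data)) as f:
            while f.read(1024 * 1024):  # decompress in 1 MiB chunks
                pass
        return True
    except (EOFError, zlib.error, OSError):
        return False
```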

--recreate-index should be run automatically when archive checksum changes

Had some really annoying behavior with ratarmount where the "view" of the filesystem was completely out of sync with the real tar data (which was replaced with other data at the same path after the initial run).

The workaround was to run with --recreate-index.

I would suggest keeping a shasum on the tar archive and automatically recreating the index anytime that shasum changes.
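A sketch of what that could look like, assuming the checksum is stored in a sidecar file next to the index (all names here are hypothetical; note that a full SHA-256 over a huge TAR is itself slow, which is why sampling hashes are discussed in another issue):

```python
import hashlib
import os

def file_sha256(path, chunk_size=1024 * 1024):
    """Hash the archive in chunks so huge TARs do not need to fit in memory."""
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            h.update(chunk)
    return h.hexdigest()

def index_is_stale(tar_path, checksum_path):
    """True if the stored checksum is missing or differs from the archive's."""
    if not os.path.exists(checksum_path):
        return True
    with open(checksum_path) as f:
        return f.read().strip() != file_sha256(tar_path)
```

On mount, `index_is_stale` would decide whether to transparently trigger the equivalent of --recreate-index.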

Nondeterministic "zran_read returned error (-3)"

I seem to be nondeterministically receiving "zran_read returned error (-3)" errors.

Full traceback:

Traceback (most recent call last):
  File "/Users/epicfaace/codalab/codalab-worksheets/tests/unit/server/upload_download_test.py", line 93, in test_bundle_folder
    bundle, [("item.txt", b"hello world"), ("src/item2.txt", b"hello world")]
  File "/Users/epicfaace/codalab/codalab-worksheets/tests/unit/server/upload_download_test.py", line 217, in upload_folder
    use_azure_blob_beta=True,
  File "/Users/epicfaace/codalab/codalab-worksheets/codalab/lib/upload_manager.py", line 162, in upload_to_bundle_store
    indexFileName=tmp_index_file.name,
  File "/Users/epicfaace/codalab/codalab-worksheets/codalab/lib/beam/ratarmount.py", line 450, in __init__
    self._createIndex(self.tarFileObject)
  File "/Users/epicfaace/codalab/codalab-worksheets/codalab/lib/beam/ratarmount.py", line 741, in _createIndex
    encoding     = self.encoding,
  File "/Users/epicfaace/.pyenv/versions/3.6.10/lib/python3.6/tarfile.py", line 1601, in open
    t = cls(name, filemode, stream, **kwargs)
  File "/Users/epicfaace/.pyenv/versions/3.6.10/lib/python3.6/tarfile.py", line 1482, in __init__
    self.firstmember = self.next()
  File "/Users/epicfaace/.pyenv/versions/3.6.10/lib/python3.6/tarfile.py", line 2297, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "/Users/epicfaace/.pyenv/versions/3.6.10/lib/python3.6/tarfile.py", line 1092, in fromtarfile
    buf = tarfile.fileobj.read(BLOCKSIZE)
  File "/Users/epicfaace/.pyenv/versions/3.6.10/lib/python3.6/tarfile.py", line 539, in read
    buf = self._read(size)
  File "/Users/epicfaace/.pyenv/versions/3.6.10/lib/python3.6/tarfile.py", line 547, in _read
    return self.__read(size)
  File "/Users/epicfaace/.pyenv/versions/3.6.10/lib/python3.6/tarfile.py", line 572, in __read
    buf = self.fileobj.read(self.bufsize)
  File "indexed_gzip/indexed_gzip.pyx", line 720, in indexed_gzip.indexed_gzip._IndexedGzipFile.readinto
indexed_gzip.indexed_gzip.ZranError: zran_read returned error (-3)

I'll post more details as I get them.

Use mmap or GDBM for indexes to save memory

Having the index in memory makes access fast. But the whole idea of ratarmount is to be able to access only a few files, so it is a waste of memory to keep the index in memory for all files.

Python has mmap support, so it ought to be possible to store the index as a mmap file. That way the system will only read the relevant parts of the index from disk, and virtual memory management will work for you.

Another alternative to mmap is GDBM: it may make { key => value } mappings easier than mmap, and it will make the index work cross-platform (whereas the mmap would only work on the platform on which it was built).

One way to do this would be to serialize every FileInfo object and save those as values for each key. When a key is needed, fetch the value and deserialize it. It may make the index file bigger, but memory requirements and startup time will be lower, as you will only have a single FileInfo object in memory at a time.
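A rough sketch of that idea using the standard library's dbm module plus pickle (FileInfo here is a made-up stand-in for ratarmount's real structure):

```python
import dbm
import pickle
from collections import namedtuple

# Hypothetical stand-in for ratarmount's per-file metadata.
FileInfo = namedtuple('FileInfo', ['offset', 'size', 'mtime'])

def save_index(path, entries):
    """Store one pickled FileInfo per archive path in an on-disk key/value DB."""
    with dbm.open(path, 'n') as db:
        for name, info in entries.items():
            db[name.encode()] = pickle.dumps(info)

def lookup(path, name):
    """Fetch and deserialize a single entry without loading the whole index."""
    with dbm.open(path, 'r') as db:
        return pickle.loads(db[name.encode()])
```

Only the looked-up entry is ever deserialized, so memory stays flat no matter how many files the TAR contains.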

I found a tar.gz that broke ratarmount after 16 hours ;-;

Currently at position 1664643022848 of 1664666965947 (100.00%). Estimated time remaining with current rate: 0 min 1 s, with average rate: 0 min 0 s.
Creating offset dictionary for /home/download/thefile.tar.gz took 57789.62s

Traceback (most recent call last):
  File "/usr/local/bin/ratarmount", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.8/dist-packages/ratarmount.py", line 2604, in cli
    fuseOperationsObject = TarMount(
  File "/usr/local/lib/python3.8/dist-packages/ratarmount.py", line 1937, in __init__
    self.mountSources: List[Union[SQLiteIndexedTar, FolderMountSource]] = [
  File "/usr/local/lib/python3.8/dist-packages/ratarmount.py", line 1938, in <listcomp>
    SQLiteIndexedTar(tarFile, writeIndex=True, **sqliteIndexedTarOptions)
  File "/usr/local/lib/python3.8/dist-packages/ratarmount.py", line 504, in __init__
    self._loadOrStoreCompressionOffsets()  # store
  File "/usr/local/lib/python3.8/dist-packages/ratarmount.py", line 1622, in _loadOrStoreCompressionOffsets
    db.execute('INSERT INTO gzipindex VALUES (?)', (file.read(),))
OverflowError: BLOB longer than INT_MAX bytes
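One way around this limit would be to split the gzip seek-point index over multiple rows instead of storing it as one giant BLOB. A sketch using a hypothetical gzipindexes table (not ratarmount's actual schema):

```python
import sqlite3

CHUNK_SIZE = 256 * 1024 * 1024  # stay well below SQLite's INT_MAX BLOB limit

def store_gzip_index(db, data, chunk_size=CHUNK_SIZE):
    """Split the gzip seek-point index into multiple rows of bounded size."""
    db.execute('CREATE TABLE IF NOT EXISTS gzipindexes (data BLOB)')
    for start in range(0, len(data), chunk_size):
        db.execute('INSERT INTO gzipindexes VALUES (?)',
                   (data[start:start + chunk_size],))

def load_gzip_index(db):
    """Reassemble the index by concatenating the chunks in insertion order."""
    return b''.join(row[0] for row in
                    db.execute('SELECT data FROM gzipindexes ORDER BY rowid'))
```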

typo bzip2->gzip

ratarmount/ratarmount.py

Lines 1139 to 1143 in a74cd54

elif magicBytes[0:2] == b"\x1f\x8b":
    if not hasBzip2Support:
        raise Exception( "You are trying open a gzip compressed TAR file but no gzip support was detected!" )
    type = 'GZ'
    self.rawFile = self.tarFile # save so that garbage collector won't close it!

Line 1140 should read hasGzipSupport

AttributeError: 'dict' object has no attribute 'mtime' when mounting

I get the following error whenever I try to mount
python ratarmount.py /taskonomy_data/taskonomy_small_ordered.tar /taskonomy_data_mnt

Creating offset dictionary for /taskonomy_data/taskonomy_small_ordered.tar ...
Traceback (most recent call last):
  File "ratarmount.py", line 447, in <module>
    recursive = args.recursive ),
  File "ratarmount.py", line 312, in __init__
    clearIndexCache = clearIndexCache, recursive = recursive )
  File "ratarmount.py", line 67, in __init__
    self.createIndex( file )
  File "ratarmount.py", line 242, in createIndex
    self.setFileInfo( path, {} )
  File "ratarmount.py", line 138, in setFileInfo
    mtime = fileInfo.mtime ,
AttributeError: 'dict' object has no attribute 'mtime'

Publish a ratarmount-core package without fusepy as a dependency

@mxmlnkn, thanks so much for continuing to accept my PRs to ratarmount. However, it turns out that I still can't use the upstream ratarmount package as a dependency for the package I'm publishing, because my package needs to be installed in environments where fuse is not available. Because fusepy is a dependency of ratarmount, and fusepy installation fails when fuse is not available, the installation of my package fails in those environments as well. For now I have to create a fork that just removes fusepy as a dependency and use that.

I think the best way around this would be to publish a ratarmount-core package without fusepy as a dependency, which only includes the Python API and doesn't actually include the CLI / fuse-related code. Is that something you'd be willing to do @mxmlnkn ? If so, I'm happy to help make PRs that would add the scaffolding for such a change.

Memory leak when index+mount

I tested ratarmount on a tar file with 10M files. The process took around 10 GB RAM after index+mount was complete.

But if I unmounted and then mounted using the generated index, then it only took 5 GB RAM.

This suggests there is a memory leak. Maybe you are copying the data structure and never destroying a copy that you do not use when indexing?
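One way to narrow this down would be tracemalloc from the standard library, which can compare the peak allocations of the index-creation path against the index-loading path. A small helper sketch:

```python
import tracemalloc

def measure_peak(func, *args):
    """Run func and return (result, peak_bytes).

    Running this once around index creation and once around loading an
    existing index would show whether the indexing path really keeps an
    extra copy of the metadata alive.
    """
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak
```

`tracemalloc.take_snapshot().statistics('lineno')` would additionally point at the allocation sites responsible for the difference.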

Error: No such table: gzipindex

I have a ~500GB .tar.gz file. Trying to run ratarmount, I get the following after the creation of the index

Writing out TAR index to /mnt/data/final.tar.gz.index.sqlite took 0s and is sized 7143424 B
Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/ratarmount.py", line 1202, in __init__
    file.write( db.execute( 'SELECT data FROM gzipindex' ).fetchone()[0] )
sqlite3.OperationalError: no such table: gzipindex

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/ratarmount", line 10, in <module>
    sys.exit(cli())
  File "/usr/lib/python3.8/site-packages/ratarmount.py", line 1445, in cli
    fuseOperationsObject = TarMount(
  File "/usr/lib/python3.8/site-packages/ratarmount.py", line 1212, in __init__
    self.tarFile.export_index( filename = gzindex )
  File "indexed_gzip/indexed_gzip.pyx", line 721, in indexed_gzip.indexed_gzip._IndexedGzipFile.export_index
indexed_gzip.indexed_gzip.ZranError: ('export_index returned error: {}', -1)

Improve detection of index changes

  • Use imohash. A full hash takes too much time to calculate but imohash does not use the full file but only a constant amount of samples from it. The problem with modification time verification is that often the modification time and also the creation time can change, e.g., because the file was downloaded or copied between devices, among other cases.
  • Creating the index is expensive, so I'd want to avoid deleting it unnecessarily. Maybe I should even back it up first instead of deleting it and/or ask for confirmation before recreating the index automatically (of course with an option to turn that off for non-interactive mode).
  • Store which relevant arguments were used to create the index in order to determine when arguments change and hint the user to it and automatically recreate it after a confirmation. Arguments influencing the built index: --recursive, --ignore-zeros, --strip-recursive-tar-extension, --encoding.
  • Detect when a tar simply was appended to and then only index the newly added files.
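The imohash idea from the first bullet can be sketched with a few samples plus the file size (a simplified stand-in, not imohash's actual algorithm or hash function):

```python
import hashlib
import os

def sampled_fingerprint(path, sample_size=16 * 1024):
    """Fingerprint a file from its size plus samples at the start, middle,
    and end, so even terabyte archives hash in constant time.

    Unlike mtime checks, this survives downloads and copies between devices;
    unlike a full hash, it can miss changes outside the sampled regions.
    """
    size = os.path.getsize(path)
    h = hashlib.sha256(str(size).encode())
    for offset in (0, max(0, size // 2 - sample_size // 2),
                   max(0, size - sample_size)):
        with open(path, 'rb') as f:
            f.seek(offset)
            h.update(f.read(sample_size))
    return h.hexdigest()
```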

Reduce memory footprint

There is only so much I can do about Python memory usage, but it is particularly obnoxious when the index creation is successful and then the program gets killed because the serialization library increases the memory usage by roughly three quarters. Maybe I can find some alternative serialization routines aside from pickle, or write a custom (de)serializer, or even use more low-footprint data types instead of a nested dictionary tree of named tuples.

As a test, I created a tar file with roughly 11k files with each a file name of 96B length amounting to roughly 1MiB metadata for the file names alone. To increase the memory required for the TAR index, I packed as many of these "1MiB" metadata TARs as needed into another TAR and then used the recursive mounting option. I then watched the size (first) and resident (second) entries in /proc/<pid>/statm over the runtime of ratarmount for different serialization backends.

Here are the results for a 64 MiB TAR:

Index creation and saving

resident-memory-over-time-saving-64-MiB-metadata

Index loading

resident-memory-over-time-loading-64-MiB-metadata

Performance Metrics Comparison

performance-comparison

Remarks

  • The first linear increase in memory usage is the creation of the index itself
  • As can be seen, all but simplejson, rapidjson, and my custom loader require significant additional memory just for the serialization
  • I also tested hickle and yaml which are magnitudes worse than the ones tested and therefore excluded
  • Unfortunately, there is no clear winner and there are always tradeoffs between peak memory usage for (de)serialization, index file size, and (de)serialization time.
  • The output of simplejson is larger than the other JSON serializers, because it includes a space between , and : for written dictionaries. Except for deserialization time, rapidjson is the clear JSON winner.
  • For all backends except the custom one and simplejson, the lz4 compression scheme actually leads to shorter serialization times, possibly by reducing the disk I/O bottleneck. I'm not sure why the custom one lags so much; maybe because it makes many small write calls instead of buffering the output.
  • The custom backend, which consists of a custom dict serializer and uses msgpack to serialize named tuples, is the only one that has no significant peak memory overhead for both saving and loading! However, the time required is worse by a factor of 10 compared to msgpack.lz4. That is still only roughly 10% of the index creation time, but it will matter more for index loading, which is roughly 5x slower.
  • lz4 compression is sufficiently good with almost no overhead for most backends. But these results are not very representative because of the artificially constructed file names, which contain almost 95 zeros each, compress very well, and are not a good real-life example.
  • The pickled indexes are almost twice the size of any other serialization scheme, but the index file size shouldn't matter all that much anyway, as for most TARs it will be only around 1% of the original TAR file size. So, if you can store the TAR, the TAR index size should not matter.
  • The peak memory usage for index loading differs because some frameworks can't deserialize namedtuples out of the box, therefore returning lists or even dictionaries, which both have much more overhead than a simple namedtuple.

Performance Metrics Comparison (256 MiB file name metadata)

performance-comparison-256-MiB-metadata

Conclusion

When low on memory, use the uncompressed custom serializer; otherwise use lz4-compressed msgpack for a <10% boost when storing the index and 3-5x faster index loading (for 256 MiB of file name metadata in 256 recursive TARs each containing 11k files, this translates to 14s vs. 4s, so still quite fast for 2.8 million file objects).
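The flat memory profile of the custom backend comes from never materializing the whole index at once. That streaming idea can be sketched with JSON lines (purely illustrative; the actual custom backend uses msgpack for the named tuples):

```python
import json

def stream_save(path, entries):
    """Write one record per line so only a single entry exists in memory
    at a time while saving -- the property behind the flat memory curve."""
    with open(path, 'w') as f:
        for name, info in entries:
            f.write(json.dumps([name, info]) + '\n')

def stream_load(path):
    """Yield entries lazily instead of deserializing the whole index at once."""
    with open(path) as f:
        for line in f:
            name, info = json.loads(line)
            yield name, info
```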

'SQLiteIndexedTar' object has no attribute 'tarFileName'

My code:

t = SQLiteIndexedTar(fileObject=f, tarFileName="file")

Error:

Traceback (most recent call last):
  File "test.py", line 9, in <module>
    t = SQLiteIndexedTar(fileObject=f, tarFileName="file")
  File "/Users/epicfaace/codalab/codalab-worksheets/codalab/lib/beam/ratarmount.py", line 407, in __init__
    possibleIndexFilePaths = [self.tarFileName + ".index.sqlite"]
AttributeError: 'SQLiteIndexedTar' object has no attribute 'tarFileName'

Adding support for zstd

Hi,
is it possible to add support for .tar.zst file made with zstd ?

I have a few big archives (hundreds of GB each) and the performance with archivemount is not good.

I tried converting a couple of archives to the more common .tar.bz2, but the resulting files are much bigger than the originals and I can't do that for all the archives.

It is my understanding that for this the first step is to write a new "indexed_zst" module. I tried looking at the code of indexed_bzip2 and indexed_gzip but I'm not a python programmer and I can't contribute much.

Is there anything I can do to support this format?

Regards

Use SQLiteIndexedTar in a context manager

It would be nice to use SQLiteIndexedTar in a context manager, like this:

with SQLiteIndexedTar(fileObject=open(...)) as t:
   # ...

In order to do that, we would need to implement the __enter__ and __exit__ methods.
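A minimal sketch of what those hooks could look like (a toy stand-in, not the real SQLiteIndexedTar):

```python
import io

class SQLiteIndexedTarSketch:
    """Toy stand-in showing the context-manager protocol: __exit__ closes
    the underlying file object so `with ... as t:` cleans up deterministically."""

    def __init__(self, fileObject):
        self.fileObject = fileObject
        self.closed = False

    def close(self):
        if not self.closed:
            self.fileObject.close()
            self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not suppress exceptions raised inside the block
```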

Add support for hardlink-ed files in .tar?

I was troubleshooting an issue and discovered that some .tar files contain hard links. I didn't know tar could support hard links, but we have a lot of these that we need to handle.

Here is the example tar file.

http://dev1.soichi.us/tmp/5b525b1258153f004fc0883f.tar

This contains files like ./output/surf/rh.white.K, which is a hardlink to ./output/surf/rh.white.preaparc.K

The following files are also hard links:

./output/surf/lh.white.K
./output/surf/lh.white.preaparc.H
./output/surf/rh.white.preaparc.K
./output/surf/rh.white.preaparc.H

When I try to mount this .tar with ratarmount, these files show up with an Input/output error and I cannot access them.

hayashis@haswell:~/test/ratarmount/5b525b1258153f004fc0883f/output/surf $ ls -lrt
ls: cannot access 'lh.white.K': Input/output error
ls: cannot access 'lh.white.preaparc.H': Input/output error
ls: cannot access 'rh.white.preaparc.H': Input/output error
ls: cannot access 'rh.white.preaparc.K': Input/output error
total 0
?????????? ? ?        ?              ?            ? rh.white.preaparc.K
?????????? ? ?        ?              ?            ? rh.white.preaparc.H
?????????? ? ?        ?              ?            ? lh.white.preaparc.H
?????????? ? ?        ?              ?            ? lh.white.K
-r--r--r-- 2 hayashis hayashis 5165534 Jul 20  2018 lh.orig.nofix
-r--r--r-- 2 hayashis hayashis 5028590 Jul 20  2018 rh.orig.nofix
-r--r--r-- 2 hayashis hayashis 5028984 Jul 20  2018 rh.smoothwm.nofix
-r--r--r-- 2 hayashis hayashis 5165928 Jul 20  2018 lh.smoothwm.nofix
-r--r--r-- 2 hayashis hayashis 5029382 Jul 20  2018 rh.inflated.nofix
-r--r--r-- 2 hayashis hayashis 5166326 Jul 20  2018 lh.inflated.nofix
-r--r--r-- 2 hayashis hayashis 5029777 Jul 20  2018 rh.qsphere.nofix
-r--r--r-- 2 hayashis hayashis 5166721 Jul 20  2018 lh.qsphere.nofix
-r--r--r-- 2 hayashis hayashis  558255 Jul 20  2018 rh.defect_labels
...

Is it possible to add support for hardlinks? Or is there a workaround? Recreating all the .tar files with the --hard-dereference flag is extremely difficult to do, due to the number of .tar files we have.
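For what it's worth, Python's tarfile module already records hard links: such members report islnk() as true and carry the target path in linkname, so resolution could be sketched like this (a seekable tarfile can even follow the link in extractfile itself):

```python
import io
import tarfile

def read_member(tar, name):
    """Read a member's bytes, manually resolving one level of hard links.

    Hard-link members store no data of their own; the content lives in the
    member named by `linkname`.
    """
    member = tar.getmember(name)
    if member.islnk():
        member = tar.getmember(member.linkname)
    return tar.extractfile(member).read()
```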

ModuleNotFoundError: No module named '_sqlite3'

running ratarmount I'm getting this:

Traceback (most recent call last):
  File "/venv/bin/ratarmount", line 6, in <module>
    from ratarmount import cli
  File "/venv/lib/python3.8/site-packages/ratarmount.py", line 11, in <module>
    import sqlite3
  File "/usr/local/lib/python3.8/sqlite3/__init__.py", line 23, in <module>
    from sqlite3.dbapi2 import *
  File "/usr/local/lib/python3.8/sqlite3/dbapi2.py", line 27, in <module>
    from _sqlite3 import *
ModuleNotFoundError: No module named '_sqlite3'

the list of packages:

$ ./venv/bin/pip list
Package       Version
------------- -------
click         7.1.2
fusepy        3.0.1
indexed-bzip2 1.1.0
indexed-gzip  1.0.0
pip           20.1
pip-tools     5.1.2
pysqlite3     0.4.2
ratarmount    0.4.1
setuptools    41.2.0
six           1.14.0
wheel         0.34.2

the list of system packages:

$ dpkg -l | grep sqlite3
ii  libsqlite3-0:amd64                   3.22.0-1ubuntu0.3                      amd64        SQLite 3 shared library
ii  libsqlite3-dev:amd64                 3.22.0-1ubuntu0.3                      amd64        SQLite 3 development files
ii  sqlite3                              3.22.0-1ubuntu0.3                      amd64        Command line interface for SQLite 3

reinstalling python3.8 or deleting the venv and starting over doesn't help

Support for split archives

I could use mounting split tar archives created like this: tar cf - ./folder | split --bytes=500MB - archive.tar.
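Support for this would essentially need a read-only file object that presents the parts as one seekable stream, which could then be handed to the indexer unchanged. A sketch (simplified, with no error handling):

```python
import io

class JoinedFile(io.RawIOBase):
    """Present split parts (archive.tar.aa, archive.tar.ab, ...) as one
    seekable read-only file."""

    def __init__(self, paths):
        self.files = [open(p, 'rb') for p in paths]
        self.sizes = []
        for f in self.files:
            f.seek(0, io.SEEK_END)
            self.sizes.append(f.tell())
        self.offset = 0  # global position across all parts

    def readable(self):
        return True

    def seekable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.offset = offset
        elif whence == io.SEEK_CUR:
            self.offset += offset
        else:  # io.SEEK_END
            self.offset = sum(self.sizes) + offset
        return self.offset

    def read(self, size=-1):
        if size < 0:
            size = sum(self.sizes) - self.offset
        result = b''
        pos = self.offset  # position relative to the current part
        for f, file_size in zip(self.files, self.sizes):
            if pos < file_size and size > 0:
                f.seek(pos)
                chunk = f.read(min(size, file_size - pos))
                result += chunk
                size -= len(chunk)
                pos = 0  # subsequent parts are read from their start
            else:
                pos = max(0, pos - file_size)
        self.offset += len(result)
        return result
```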

GNU sparse tar file not properly handled

Hi
GNU sparse tar file (https://www.gnu.org/software/tar/manual/html_node/sparse.html) are not properly handled.

Here is a sample tar file following the PAX format version 1.0 (https://www.gnu.org/software/tar/manual/html_node/Sparse-Formats.html), generated using GNU tar 1.30 with the command: tar --format=pax -cSzf ../sample1.tar.gz *

sample1.tar.gz

The file contains 3 identical files, two of them being sparse files. The file "01.sparse1.bin" was created
as a 10MB file filled with zeros and with the word "toto" written at a random offset in the file. The file "02.normal1.bin" is a copy of this file. The file "01.sparse1.bin" is transformed into a sparse file using fallocate -d 01.sparse1.bin and then copied into "03.sparse1.bin".

$ ls -alnh *
-rw-r--r-- 1 1000 1000 11M févr. 17 15:54 01.sparse1.bin
-rw-r--r-- 1 1000 1000 11M févr. 17 15:54 02.normal1.bin
-rw-r--r-- 1 1000 1000 11M févr. 17 15:53 03.sparse1.bin
$ du -sh *
8,0K	01.sparse1.bin
11M	02.normal1.bin
8,0K	03.sparse1.bin
$ md5sum *
832c78afcb9832e1a21c18212fc6c38b  01.sparse1.bin
832c78afcb9832e1a21c18212fc6c38b  02.normal1.bin
832c78afcb9832e1a21c18212fc6c38b  03.sparse1.bin

If I mount this tar file using ratarmount, the sparse files are corrupted:

$ ratarmount -c sample1.tar.gz mount
Creating offset dictionary for /tmp/ratarmount/sample1.tar.gz ...
Creating new SQLite index database at /tmp/ratarmount/sample1.tar.gz.index.sqlite
Creating offset dictionary for /tmp/ratarmount/sample1.tar.gz took 0.10s
$ cd mount/
$ md5sum *
01fcca0bffe4125261d2e2a85fa8223b  01.sparse1.bin
832c78afcb9832e1a21c18212fc6c38b  02.normal1.bin
390d8adb1cf6f6c1ef76345fc9758dcc  03.sparse1.bin

Even if ratarmount does not handle sparse files, the mounted sparse files are still corrupted compared to a tar implementation that does not support sparse files. 01.sparse1.bin and 03.sparse1.bin should be identical because they are identical inside the tar file.
Cf https://www.gnu.org/software/tar/manual/html_node/PAX-1.html#SEC193
"The format is designed in such a way that non-posix aware tars and tars not supporting GNU.sparse.* keywords will extract each sparse file in its condensed form with the file map prepended and will place it into a separate directory. Then, using a simple program it would be possible to expand the file to its original form even without GNU tar."

I use a lot of big sparse tars and I stumbled on ratarmount, which seems a lot faster than archivemount. I would love to be able to use ratarmount with these files.
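For reference, the condensed PAX-1.0 form described in the quote above is mechanical to undo: the member data starts with newline-terminated decimal numbers (the entry count, then offset/size pairs), padded with nulls to the next 512-byte block boundary, followed by the condensed data. A sketch of the expansion (simplified):

```python
def parse_pax_sparse_map(condensed):
    """Split a condensed PAX-1.0 sparse member into (segments, data).

    segments is a list of (offset, size) pairs; data is the condensed
    payload that follows the block-padded map.
    """
    pos = 0

    def next_number():
        nonlocal pos
        end = condensed.index(b'\n', pos)
        value = int(condensed[pos:end])
        pos = end + 1
        return value

    count = next_number()
    segments = [(next_number(), next_number()) for _ in range(count)]
    data_start = (pos + 511) // 512 * 512  # map is padded to a full block
    return segments, condensed[data_start:]

def expand_sparse(segments, data, total_size):
    """Re-expand the condensed data into its original sparse layout."""
    out = bytearray(total_size)  # holes stay zero
    pos = 0
    for offset, size in segments:
        out[offset:offset + size] = data[pos:pos + size]
        pos += size
    return bytes(out)
```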

Write support

See this comment.

At least for uncompressed TAR archives, I could easily add write support for now. Although because of the mentioned performance pitfalls, it should be off by default and only be enabled with another command line option. I think archivemount also has write support, so whether to add this to ratarmount is somewhat questionable, and if done, it should definitely work with very large tars, for which archivemount has problems.

Support individual files?

It'd be really nice to be able to mount single files in a directory so that file is exposed decompressed without having to be a tar.

Adding support for xz

This was already discussed to some extent in #7. I took a closer look at the file format. It is very similar to BZ2 in that it has streams and blocks, and blocks can be seeked to easily. The problem is that the standard xz encoder only creates one stream with one block, and seeking inside a block seems completely infeasible. pixz creates multiple blocks in one stream and pxz creates multiple streams with one block each. This can be checked by using xz -l, e.g., with a 600MB uncompressed tar file:

tar -Ipixz -cf multiblock-compressed.tpxz indexed_bzip2
xz -l multiblock-compressed.tpxz
    Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
        1      43    303.0 MiB    659.3 MiB  0.460  CRC32   multiblock-compressed.tpxz

xz -cdk multiblock-compressed.tpxz > uncompressed.tar
xz -czk6 uncompressed.tar > compressed-6.tar.xz
xz -l compressed-6.tar.xz
    Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
        1       1    303.2 MiB    659.3 MiB  0.460  CRC64   compressed-6.tar.xz

pxz -k uncompressed.tar
mv uncompressed.tar.xz multistream-compressed.tar.xz
xz -l multistream-compressed.tar.xz
    Strms  Blocks   Compressed Uncompressed  Ratio  Check   Filename
       28      28    302.9 MiB    659.3 MiB  0.459  CRC64   multistream-compressed.tar.xz

As for the previously mentioned tar-as-a-filesystem, it also does not support seeking inside a block, only to blocks. Their conclusion about the problem with the single-block xz files was:

However, it is suggested in the official XZ documentation that any
program utilizing XZ random access reads such as TAR Browser should recommend the usage
of multi-block archives; this is addressed in the usage guide in Appendix A.

Pixz supports seeking, so my first thought was making a Python module out of it. However, I stumbled upon lzmaffi, which is a fork of the Python3 lzma module backport, which actually supports block seeking. Unfortunately, the Python3 internal module does not support seeking. Here is a simple snippet to test that:

import io
import lzma
import lzmaffi
import time

for module in [ lzmaffi, lzma ]:
    for filePath in [ "multistream-compressed.tar.xz", "multiblock-compressed.tpxz", "compressed-6.tar.xz" ]:
        with module.open( filePath, 'rb' ) as file:
            t0 = time.time()
            file.seek( -1, io.SEEK_END )
            lastByte = file.read( 1 )
            t1 = time.time()
            print( f"Seeking to end and reading the last byte {lastByte.hex()} in {filePath} "
                   f"using {module.__name__} took {t1-t0:.3f}s" )

I get:

Seeking to end and reading the last byte 00 in multistream-compressed.tar.xz using lzmaffi took 0.238s
Seeking to end and reading the last byte 00 in multiblock-compressed.tpxz using lzmaffi took 0.004s
Seeking to end and reading the last byte 00 in compressed-6.tar.xz using lzmaffi took 6.589s

Seeking to end and reading the last byte 00 in multistream-compressed.tar.xz using lzma took 12.337s
Seeking to end and reading the last byte 00 in multiblock-compressed.tpxz using lzma took 12.226s
Seeking to end and reading the last byte 00 in compressed-6.tar.xz using lzma took 12.164s

As can be seen, reading the very last byte of the file takes 12s with the standard lzma module no matter the xz compression scheme. However, when using the lzmaffi module, seeking in the multiblock file created with pixz takes only 4ms, seeking in the multistream file created with pxz takes 238ms, and as a sanity check, reading the last byte in the single stream single block file created with the xz 5.2.4 from XZ Utils takes 6.5s.

The standard lzma module is really dumb. When seeking to the end of the file and not before the last byte, it also takes ~6s like lzmaffi for the single block file! It seems like seeking before the last byte is done in two steps: first decode the whole file to get the file size, then seek to the offset at size-1 by completely re-decoding the file from the beginning.

Now that I found a module, which can seek, it should be a low-hanging fruit to add this to ratarmount:

  • Add lzmaffi as an optional dependency and compression backend to ratarmount
  • Detect when the archive only has a single block and warn the user about it and suggest using pixz. I might have to use the xz command line tool for this because no Python binding exposes this kind of information for some reason. Alternatively, I could try seeking to the end like done above and check for a timeout of 2s or so.
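Alternatively, the block count can be read from the xz stream footer and index directly, without shelling out to the xz command line tool. A sketch for single-stream files (field layout per the .xz file format specification; not production-hardened):

```python
import lzma
import struct

def xz_block_count(path):
    """Count the blocks of a single-stream .xz file from its footer + index.

    The 12-byte footer ends in the magic b'YZ' and stores the index size as
    (stored + 1) * 4; the index begins with a 0x00 indicator byte followed
    by the record count as an xz multibyte integer (7 bits per byte, high
    bit set on continuation bytes).
    """
    def read_varint(data, pos):
        value = shift = 0
        while True:
            byte = data[pos]
            pos += 1
            value |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                return value, pos

    with open(path, 'rb') as f:
        f.seek(-12, 2)
        footer = f.read(12)
        assert footer[10:12] == b'YZ', 'not an xz stream footer'
        backward_size = (struct.unpack('<I', footer[4:8])[0] + 1) * 4
        f.seek(-(12 + backward_size), 2)
        index = f.read(backward_size)
    assert index[0] == 0, 'expected index indicator byte'
    count, _ = read_varint(index, 1)
    return count
```

A count of 1 for a large file would then trigger the "please recompress with pixz" warning.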

Segmentation fault during installation with pip

Ubuntu 18.04.

bastian@laptwo ~> pip3 install --user ratarmount
Collecting ratarmount
Collecting fusepy (from ratarmount)
Collecting python-rapidjson (from ratarmount)
  Using cached https://files.pythonhosted.org/packages/61/2c/ec819d4603da706a80c4b20583b2ed7df89b6a0fc8467f090b45908f905c/python_rapidjson-0.9.1-cp36-cp36m-manylinux1_x86_64.whl
Collecting lz4 (from ratarmount)
  Using cached https://files.pythonhosted.org/packages/5d/5e/cedd32c203ce0303188b0c7ff8388bba3c33e4bf6da21ae789962c4fb2e7/lz4-2.2.1-cp36-cp36m-manylinux1_x86_64.whl
Collecting simplejson (from ratarmount)
Collecting indexed-gzip (from ratarmount)
  Using cached https://files.pythonhosted.org/packages/5b/e8/1472b03a6c3db08d46383b81bf1cc955752540370f4e3879acfe102bfac5/indexed_gzip-0.8.10-cp36-cp36m-manylinux1_x86_64.whl
Collecting cbor (from ratarmount)
Collecting pyyaml (from ratarmount)
Collecting msgpack (from ratarmount)
  Using cached https://files.pythonhosted.org/packages/3d/a8/e01fea81691749044a7bfd44536483a296d9c0a7ed4ec8810a229435547c/msgpack-0.6.2-cp36-cp36m-manylinux1_x86_64.whl
Collecting ujson (from ratarmount)
Installing collected packages: fusepy, python-rapidjson, lz4, simplejson, indexed-gzip, cbor, pyyaml, msgpack, ujson, ratarmount
Successfully installed cbor-1.0.0 fusepy-3.0.1 indexed-gzip-0.8.10 lz4-2.2.1 msgpack-0.6.2 python-rapidjson-0.9.1 pyyaml-5.1.2 ratarmount-0.3.1 simplejson-3.17.0 ujson-1.35
fish: “pip3 install --user ratarmount” terminated by signal SIGSEGV (Address boundary error)

ZIP file support

Hi!
Amazing work with tars ;D
It would be amazing to see support for ZIP files with all these features, like recursive mounting.
ZIP is, AFAIK, the most widely used archive format.

It shouldn't be very hard, since ZIP is already supported by Python itself.
There is something like this already, but without recursive mounts; mixing archive types in recursive mounting would be an amazing feature.
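A minimal sketch of why this is feasible with the standard library alone: the ZIP central directory records each member's offset, and since Python 3.7 the file object returned by `ZipFile.open` is seekable, so members can be read at arbitrary offsets without extracting anything (the member name here is made up):

```python
import io
import zipfile

# Build a small in-memory ZIP with one member.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as archive:
    archive.writestr("inner/data.txt", "x" * 1000)

# Reopen and seek within a member without extracting: the central
# directory tells zipfile where the member's data starts.
with zipfile.ZipFile(buf) as archive:
    with archive.open("inner/data.txt") as member:
        member.seek(900)   # ZipExtFile is seekable since Python 3.7
        tail = member.read()

print(len(tail))  # 100
```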

If you think it's out of scope, just close this with a short note ;)

High load average

I am accessing a bunch of .tar files with ratarmount that are stored on a remote filesystem accessed through sshfs.

As each file has to be accessed via sshfs, I am seeing high iowait. I think that's normal, as it has to wait for the data to be transferred from the remote server. However, I am also seeing a very high CPU load.

Screenshot from 2020-10-20 11-58-17

I don't quite understand how CPU load is calculated, but my feeling is that high iowait shouldn't automatically lead to high CPU load: each ratarmount process should just sit and wait for each block to be transferred from the remote server before it has more work to do on the CPU. I was just curious whether anyone else is seeing high CPU load while using ratarmount and whether this is normal.

bad index on tars created with file lists

I can create archives by specifying a list of files to include in the tar file. ratarmount can't work on such tars.

Steps to reproduce the problem are below. An archive with the problematic files, tars, and pickles is attached.

reproduce-bug-tar-list.tar.gz

Creating tar with and without list of files to include

xx/root$ echo -n "dir_1/dir_1_1/file_1_1_0.txt" > ../file-list.txt

xx/root$ tar cvf ../tar-list.tar -T ../file-list.txt
dir_1/dir_1_1/file_1_1_0.txt

xx/root$ tar cvf ../tar-no-list.tar *
dir_1/
dir_1/dir_1_1/
dir_1/dir_1_1/file_1_1_0.txt

ratarmounting

xx/root$ cd ..

xx$./ratarmount.py tar-list.tar
Creating offset dictionary for tar-list.tar ...
Creating offset dictionary for tar-list.tar took 0.00s

xx$ ./ratarmount.py tar-no-list.tar
Creating offset dictionary for tar-no-list.tar ...
Creating offset dictionary for tar-no-list.tar took 0.00s

This works:

xx$ cd tar-no-list/
xx/tar-no-list$ ls -l
total 4
dr-xr-xr-x 2 zatv zatv 10240 May 12 00:03 ./
drwxrwxr-x 5 zatv zatv 4096 May 12 00:05 ../
dr-xr-xr-x 2 zatv zatv 0 May 11 23:56 dir_1/

This is the bug:

xx$ cd ..
xx$ cd tar-list/
xx/tar-list$ ls -l
ls: cannot access 'dir_1': Read-only file system
total 4
dr-xr-xr-x 2 zatv zatv 10240 May 12 00:04 ./
drwxrwxr-x 5 zatv zatv 4096 May 12 00:05 ../
?????????? ? ? ? ? ? dir_1

xx/tar-list$ cd dir_1
bash: cd: dir_1: Read-only file system
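The likely cause can be reproduced with Python's tarfile alone (a sketch, not ratarmount's actual code): an archive created from a file list contains no entries for the parent directories, so an index built from the archive has to synthesize them:

```python
import io
import tarfile

# Mimic `tar cf tar-list.tar -T file-list.txt`: only the listed file is
# stored, with no entries for dir_1/ or dir_1/dir_1_1/.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    data = b"content\n"
    info = tarfile.TarInfo("dir_1/dir_1_1/file_1_1_0.txt")
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()

# Only the file itself is listed; the parent directories are implicit.
print(names)  # ['dir_1/dir_1_1/file_1_1_0.txt']
```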

duplicate file entry in .tar causes ratarmount to fail?

I came across a .tar file that contains duplicate file entries like this:

tar --list -f files.tar
mask.nii.gz
.brainlife.json
.brainlife.json

I can untar this, and I believe the last entry overwrites the second .brainlife.json.

$ tar -xf files.tar
$ ls -a
.  ..  .brainlife.json  files.tar  mask.nii.gz

However, when I mount this .tar file with ratarmount, something goes wrong with the duplicated file.

$ ls -la
ls: cannot access '.brainlife.json': Invalid argument
total 0
dr-x------  2 brlife brlife 204800 Jul 21  2019 .
drwxr-xr-x 10 root   root        0 Jan  5 21:26 ..
??????????  ? ?      ?           ?            ? .brainlife.json
-r--r--r--  2 833400 817579 178716 Jul 21  2019 mask.nii.gz
$ cat .brainlife.json
cat: .brainlife.json: Invalid argument

I know that a tar file shouldn't have duplicate files to begin with, but would it be possible for ratarmount to somehow handle this (maybe ignore duplicate files, or make the last/first occurrence accessible)?
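For reference, Python's own tarfile module already implements last-wins semantics: `getmember` returns the last occurrence of a duplicated name, which matches what `tar -x` produces. A self-contained sketch (the file name mirrors the report, the contents are made up):

```python
import io
import tarfile
from collections import Counter

# Build a TAR containing the same path twice.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for content in (b"first version", b"second version"):
        info = tarfile.TarInfo(".brainlife.json")
        info.size = len(content)
        tar.addfile(info, io.BytesIO(content))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    duplicates = [name for name, count in Counter(tar.getnames()).items()
                  if count > 1]
    # getmember() returns the *last* occurrence, matching `tar -x`.
    data = tar.extractfile(tar.getmember(".brainlife.json")).read()

print(duplicates, data)  # ['.brainlife.json'] b'second version'
```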

File read errors and index creation weirdness

Working with https://mcfp.felk.cvut.cz/publicDatasets/CTU-13-Dataset/CTU-13-Dataset.tar.bz2, there are some weird progress messages while ratarmount creates the index database, and there are also differences between files read through ratarmount and files extracted directly from the archive.

$ ratarmount -c --fuse uid=$(id -u),gid=$(id -g) CTU-13-Dataset.tar.bz2 
Creating offset dictionary for /home/rick/Data/CTU-13-Dataset.tar.bz2 ...
Creating new SQLite index database at /home/rick/Data/CTU-13-Dataset.tar.bz2.index.sqlite
Currently at position 222381568 of 1997547391 (11.13%). Estimated time remaining with current rate: 0 min 39 s, with average rate: 0 min 39 s.
Currently at position 376172032 of 1997547391 (18.83%). Estimated time remaining with current rate: 0 min 41 s, with average rate: 0 min 37 s.
Currently at position 763630080 of 1997547391 (38.23%). Estimated time remaining with current rate: 0 min 31 s, with average rate: 0 min 30 s.
Currently at position 821897216 of 1997547391 (41.15%). Estimated time remaining with current rate: 1 min 0 s, with average rate: 0 min 30 s.
Currently at position 1116801024 of 1997547391 (55.91%). Estimated time remaining with current rate: 1 min 0 s, with average rate: 0 min 33 s.
Currently at position 1185362944 of 1997547391 (59.34%). Estimated time remaining with current rate: 0 min 41 s, with average rate: 0 min 31 s.
Currently at position 1604954624 of 1997547391 (80.35%). Estimated time remaining with current rate: 0 min 9 s, with average rate: 0 min 13 s.
Currently at position 1657442304 of 1997547391 (82.97%). Estimated time remaining with current rate: 0 min 18 s, with average rate: 0 min 12 s.
Currently at position 1854896640 of 1997547391 (92.86%). Estimated time remaining with current rate: 0 min 3 s, with average rate: 0 min 4 s.
Currently at position 72543459840 of 1997547391 (3631.63%). Estimated time remaining with current rate: -9 min 49 s, with average rate: -10 min 59 s.
Currently at position 72625526272 of 1997547391 (3635.73%). Estimated time remaining with current rate: -38 min 32 s, with average rate: -10 min 57 s.
Currently at position 73752841728 of 1997547391 (3692.17%). Estimated time remaining with current rate: -86 min 46 s, with average rate: -11 min 38 s.
Currently at position 74038690304 of 1997547391 (3706.48%). Estimated time remaining with current rate: -31 min 57 s, with average rate: -11 min 31 s.
Currently at position 74678334976 of 1997547391 (3738.50%). Estimated time remaining with current rate: -29 min 30 s, with average rate: -11 min 17 s.
Currently at position 74806910976 of 1997547391 (3744.94%). Estimated time remaining with current rate: -75 min 13 s, with average rate: -11 min 9 s.
Currently at position 75054960128 of 1997547391 (3757.36%). Estimated time remaining with current rate: -31 min 22 s, with average rate: -11 min 3 s.
Currently at position 75091222528 of 1997547391 (3759.17%). Estimated time remaining with current rate: -72 min 56 s, with average rate: -11 min 1 s.
Currently at position 75205798400 of 1997547391 (3764.91%). Estimated time remaining with current rate: -66 min 11 s, with average rate: -12 min 55 s.
Currently at position 75469068800 of 1997547391 (3778.09%). Estimated time remaining with current rate: -31 min 35 s, with average rate: -12 min 48 s.
Currently at position 79732708352 of 1997547391 (3991.53%). Estimated time remaining with current rate: -7 min 2 s, with average rate: -12 min 25 s.
Creating offset dictionary for /home/rick/Data/CTU-13-Dataset.tar.bz2 took 713.90s
Writing out TAR index to /home/rick/Data/CTU-13-Dataset.tar.bz2.index.sqlite took 0s and is sized 32768 B

$ cmp CTU-13-Dataset/CTU-13-Dataset/11/botnet-capture-20110818-bot-2.pcap tmp/CTU-13-Dataset/11/botnet-capture-20110818-bot-2.pcap 
CTU-13-Dataset/CTU-13-Dataset/11/botnet-capture-20110818-bot-2.pcap tmp/CTU-13-Dataset/11/botnet-capture-20110818-bot-2.pcap differ: byte 2345910491, line 8785903

Unable to `pip3 install ratarmount` on macOS 11.4 using a MacBook Air M1

The problem seems to be related to indexed-zstd.

I git cloned ratarmount, then ran `pip3 install -e .`. Here's the output:

:~/ml/ratarmount-clean$ pip3 install -e .
Obtaining file:///Users/shawn/ml/ratarmount-clean
Collecting fusepy
  Using cached fusepy-3.0.1-py3-none-any.whl
Collecting indexed_bzip2>=1.3.0
  Using cached indexed_bzip2-1.3.0-cp39-cp39-macosx_11_0_arm64.whl
Collecting indexed_gzip>=1.6.3
  Using cached indexed_gzip-1.6.3-cp39-cp39-macosx_11_0_arm64.whl (584 kB)
Requirement already satisfied: python-xz>=0.1.2 in /opt/homebrew/lib/python3.9/site-packages (from ratarmount==0.9.0) (0.1.2)
Requirement already satisfied: rarfile>=4.0 in /opt/homebrew/lib/python3.9/site-packages (from ratarmount==0.9.0) (4.0)
Collecting indexed_zstd>=1.2.2
  Using cached indexed_zstd-1.3.1.tar.gz (60 kB)
Building wheels for collected packages: indexed-zstd
  Building wheel for indexed-zstd (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /opt/homebrew/opt/[email protected]/bin/python3.9 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/cz/4fmt40y176988qbnggwg9rc00000gn/T/pip-install-7wfd6b92/indexed-zstd_a2f537afce334a818672cb3d4034280a/setup.py'"'"'; __file__='"'"'/private/var/folders/cz/4fmt40y176988qbnggwg9rc00000gn/T/pip-install-7wfd6b92/indexed-zstd_a2f537afce334a818672cb3d4034280a/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /private/var/folders/cz/4fmt40y176988qbnggwg9rc00000gn/T/pip-wheel-jdfu418w
       cwd: /private/var/folders/cz/4fmt40y176988qbnggwg9rc00000gn/T/pip-install-7wfd6b92/indexed-zstd_a2f537afce334a818672cb3d4034280a/
  Complete output (18 lines):
  running bdist_wheel
  running build
  running build_py
  file indexed_zstd.py (for module indexed_zstd) not found
  file indexed_zstd.py (for module indexed_zstd) not found
  running build_clib
  building 'zstd_zeek' library
  creating build
  creating build/temp.macosx-11-arm64-3.9
  creating build/temp.macosx-11-arm64-3.9/indexed_zstd
  creating build/temp.macosx-11-arm64-3.9/indexed_zstd/libzstd-seek
  clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -c indexed_zstd/libzstd-seek/zstd-seek.c -o build/temp.macosx-11-arm64-3.9/indexed_zstd/libzstd-seek/zstd-seek.o
  In file included from indexed_zstd/libzstd-seek/zstd-seek.c:19:
  indexed_zstd/libzstd-seek/zstd-seek.h:20:10: fatal error: 'zstd.h' file not found
  #include <zstd.h>
           ^~~~~~~~
  1 error generated.
  error: command '/usr/bin/clang' failed with exit code 1
  ----------------------------------------
  ERROR: Failed building wheel for indexed-zstd
  Running setup.py clean for indexed-zstd
Failed to build indexed-zstd
Installing collected packages: indexed-zstd, indexed-gzip, indexed-bzip2, fusepy, ratarmount
    Running setup.py install for indexed-zstd ... error
    ERROR: Command errored out with exit status 1:
     command: /opt/homebrew/opt/[email protected]/bin/python3.9 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/cz/4fmt40y176988qbnggwg9rc00000gn/T/pip-install-7wfd6b92/indexed-zstd_a2f537afce334a818672cb3d4034280a/setup.py'"'"'; __file__='"'"'/private/var/folders/cz/4fmt40y176988qbnggwg9rc00000gn/T/pip-install-7wfd6b92/indexed-zstd_a2f537afce334a818672cb3d4034280a/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/cz/4fmt40y176988qbnggwg9rc00000gn/T/pip-record-qyq_93tp/install-record.txt --single-version-externally-managed --compile --install-headers /opt/homebrew/include/python3.9/indexed-zstd
         cwd: /private/var/folders/cz/4fmt40y176988qbnggwg9rc00000gn/T/pip-install-7wfd6b92/indexed-zstd_a2f537afce334a818672cb3d4034280a/
    Complete output (18 lines):
    running install
    running build
    running build_py
    file indexed_zstd.py (for module indexed_zstd) not found
    file indexed_zstd.py (for module indexed_zstd) not found
    running build_clib
    building 'zstd_zeek' library
    creating build
    creating build/temp.macosx-11-arm64-3.9
    creating build/temp.macosx-11-arm64-3.9/indexed_zstd
    creating build/temp.macosx-11-arm64-3.9/indexed_zstd/libzstd-seek
    clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -c indexed_zstd/libzstd-seek/zstd-seek.c -o build/temp.macosx-11-arm64-3.9/indexed_zstd/libzstd-seek/zstd-seek.o
    In file included from indexed_zstd/libzstd-seek/zstd-seek.c:19:
    indexed_zstd/libzstd-seek/zstd-seek.h:20:10: fatal error: 'zstd.h' file not found
    #include <zstd.h>
             ^~~~~~~~
    1 error generated.
    error: command '/usr/bin/clang' failed with exit code 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /opt/homebrew/opt/[email protected]/bin/python3.9 -u -c 'import io, os, sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/cz/4fmt40y176988qbnggwg9rc00000gn/T/pip-install-7wfd6b92/indexed-zstd_a2f537afce334a818672cb3d4034280a/setup.py'"'"'; __file__='"'"'/private/var/folders/cz/4fmt40y176988qbnggwg9rc00000gn/T/pip-install-7wfd6b92/indexed-zstd_a2f537afce334a818672cb3d4034280a/setup.py'"'"';f = getattr(tokenize, '"'"'open'"'"', open)(__file__) if os.path.exists(__file__) else io.StringIO('"'"'from setuptools import setup; setup()'"'"');code = f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/cz/4fmt40y176988qbnggwg9rc00000gn/T/pip-record-qyq_93tp/install-record.txt --single-version-externally-managed --compile --install-headers /opt/homebrew/include/python3.9/indexed-zstd Check the logs for full command output.
WARNING: You are using pip version 21.1.1; however, version 21.2.4 is available.
You should consider upgrading via the '/opt/homebrew/opt/[email protected]/bin/python3.9 -m pip install --upgrade pip' command.

Then I ran git clone https://github.com/martinellimarco/indexed_zstd and tried python3 setup.py develop:

$ python3 setup.py develop
running develop
running egg_info
writing indexed_zstd.egg-info/PKG-INFO
writing dependency_links to indexed_zstd.egg-info/dependency_links.txt
writing top-level names to indexed_zstd.egg-info/top_level.txt
file indexed_zstd.py (for module indexed_zstd) not found
adding license file 'LICENSE' (matched pattern 'LICEN[CS]E*')
reading manifest file 'indexed_zstd.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
warning: no files found matching '*.py' under directory 'indexed_zstd'
writing manifest file 'indexed_zstd.egg-info/SOURCES.txt'
running build_ext
building 'indexed_zstd' extension
clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk -I. -I/opt/homebrew/include -I/opt/homebrew/opt/[email protected]/include -I/opt/homebrew/opt/sqlite/include -I/opt/homebrew/opt/[email protected]/Frameworks/Python.framework/Versions/3.9/include/python3.9 -c indexed_zstd/indexed_zstd.cpp -o build/temp.macosx-11-arm64-3.9/indexed_zstd/indexed_zstd.o -std=c++11 -O3 -DNDEBUG
In file included from indexed_zstd/indexed_zstd.cpp:673:
indexed_zstd/ZSTDReader.hpp:175:5: warning: 'seek' overrides a member function but is not marked 'override' [-Winconsistent-missing-override]
    seek( long long int offset,
    ^
indexed_zstd/FileReader.hpp:33:5: note: overridden virtual function is here
    seek( long long int offset,
    ^
1 warning generated.
clang++ -bundle -undefined dynamic_lookup -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX.sdk build/temp.macosx-11-arm64-3.9/indexed_zstd/indexed_zstd.o -L/opt/homebrew/lib -L/opt/homebrew/opt/[email protected]/lib -L/opt/homebrew/opt/sqlite/lib -Lbuild/temp.macosx-11-arm64-3.9 -lm -lzstd -lzstd_zeek -o build/lib.macosx-11-arm64-3.9/indexed_zstd.cpython-39-darwin.so
ld: library not found for -lzstd_zeek
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: command '/usr/bin/clang++' failed with exit code

Then I tried:

cd indexed_zstd/libzstd-seek
mkdir build
cd build
cmake ..
make

and got:

$ make
[ 16%] Building C object CMakeFiles/zstd-seek.dir/zstd-seek.c.o
In file included from /Users/shawn/ml/indexed_zstd/indexed_zstd/libzstd-seek/zstd-seek.c:19:
/Users/shawn/ml/indexed_zstd/indexed_zstd/libzstd-seek/zstd-seek.h:20:10: fatal error: 'zstd.h' file not found
#include <zstd.h>
         ^~~~~~~~
1 error generated.
make[2]: *** [CMakeFiles/zstd-seek.dir/zstd-seek.c.o] Error 1
make[1]: *** [CMakeFiles/zstd-seek.dir/all] Error 2
make: *** [all] Error 2

So then I decided to open this issue and go to sleep. :)

Allow mounting tar subdirectory

Sometimes the tar archive itself has a top-level directory under which all files reside.
For example: TopDir/blabla

So I want to be able to mount not the root directory but TopDir/ or any other subdirectory onto the target mount point.

absolute links transformed to relative links in ratarmount mount

problem

With an archive (an export of a Docker filesystem), I have an absolute link:

<path to symlink> -> /usr/bin/python

If I untar the archive, I see that the path is absolute.

However, when I mount with ratarmount, this path is transformed into a relative path:

<path to symlink> -> usr/bin/python

I would expect it to stay an absolute path.
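To confirm that the absolute target is really stored in the archive, i.e. that the rewrite happens at mount time rather than at archive creation, the linkname can be inspected with Python's tarfile; the member path here is made up for the example:

```python
import io
import tarfile

# Store a symlink with an absolute target, as a Docker export would.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    link = tarfile.TarInfo("usr/local/bin/python")
    link.type = tarfile.SYMTYPE
    link.linkname = "/usr/bin/python"
    tar.addfile(link)

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    target = tar.getmember("usr/local/bin/python").linkname

# The archive keeps the leading slash; any relative rewrite happens later.
print(target)  # /usr/bin/python
```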

more

I'm assuming this is related to the FUSE rellinks vs. norellinks behavior:

subdir=DIR

  Directory to prepend to all paths.  This option is mandatory.

rellinks

  Transform absolute symlinks into relative

norellinks

  Do not transform absolute symlinks into relative.  This is the default.

However, I'm confused: 1) the default is supposedly norellinks, so why are the links rewritten? And 2) how do I pass this option using ratarmount's --fuse?

workaround

Instead of mounting with ratarmount and rsyncing to a target mount, I ended up untarring directly into the target:

tar --overwrite -xvf $ARCHIVE_FILENAME -C $TARGET_MOUNT

Doesn't work with libfuse 3 through 3.10.1 due to use of the "nonempty" option

See CI builds for #58 -- we get the following error:

ratarmount.stderr.log-  File "/home/runner/work/ratarmount/ratarmount/ratarmount.py", line 2411, in <module>
ratarmount.stderr.log-    cli(sys.argv[1:])
ratarmount.stderr.log-  File "/home/runner/work/ratarmount/ratarmount/ratarmount.py", line 2399, in cli
ratarmount.stderr.log-    fuse.FUSE(
ratarmount.stderr.log-  File "/opt/hostedtoolcache/Python/3.9.2/x64/lib/python3.9/site-packages/fuse.py", line 711, in __init__
ratarmount.stderr.log:    raise RuntimeError(err)
ratarmount.stderr.log:RuntimeError: 1
runAndCheckRatarmount:145 <- checkUnionMount main
Found warnings while executing: python3 -u /home/runner/work/ratarmount/ratarmount/ratarmount.py -c hardlink hardlink
TEST FAILED!
ratarmount.stdout.log:
ratarmount.stderr.log:
fusermount: unknown option 'nonempty'

According to rclone/rclone#3562 (comment), we might just be able to omit the "nonempty" option, and it should use that behavior by default.

Additionally, according to rclone/rclone#3562 (comment), the "nonempty" option is accepted and ignored in libfuse 3.10.2+, so this issue really only exists for libfuse versions 3 through 3.10.1.
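One possible fix, sketched under assumptions (not necessarily what ratarmount ended up doing): only pass "nonempty" when the installed FUSE is version 2, detected by parsing `fusermount -V`, which prints e.g. "fusermount version: 2.9.9". The helper names are made up:

```python
import re
import subprocess

def major_fuse_version(version_string):
    """Parse the major version out of `fusermount -V` output, e.g.
    'fusermount version: 2.9.9' or 'fusermount3 version: 3.10.1'."""
    match = re.search(r"version:\s*(\d+)", version_string)
    return int(match.group(1)) if match else None

def should_pass_nonempty():
    """Only libfuse 2 needs (and accepts) the 'nonempty' mount option."""
    try:
        output = subprocess.run(["fusermount", "-V"],
                                capture_output=True, text=True).stdout
    except FileNotFoundError:
        return False
    return major_fuse_version(output) == 2
```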

Awesome tool!

This is an amazing tool!

I have an ever-growing number of .tar files (thousands) in our storage, and I'd like to use ratarmount to access the content inside them when someone makes a request (only a few tars are accessed at any given time). Since I have way too many .tar files to ratarmount them all simultaneously, I'd like to mount them on demand, whenever there is a request for a .tar file.

I think I can do this using autofs, but I am having trouble configuring it to work with ratarmount. Is it even possible? Basically, I'd like any request made to this directory pattern ...

/mnt/access/

to automatically mount a tar file stored in

/mnt/archive/.tar

Has anyone else tried it?

I've tried something like this in /etc/auto.ratar, but it doesn't work at all:

ratar -fstype=fuse :ratarmount#/mnt/wrangler/tmp/$1.tar#$1
