Module refinery.units.formats.archive.xttar

from __future__ import annotations

import datetime
import tarfile

from refinery.lib.structures import MemoryFile
from refinery.units.formats.archive import ArchiveUnit


class xttar(ArchiveUnit, docs='{0}{p}{PathExtractorUnit}'):
    """
    Extract files from a Tar archive.
    """
    def unpack(self, data: bytearray):
        with MemoryFile(data) as stream:
            try:
                archive = tarfile.open(fileobj=stream)
            except Exception:
                ustar = data.find(B'ustar')
                if ustar < 257:
                    raise
                stream.seek(ustar - 257)
                archive = tarfile.open(fileobj=stream)
            for info in archive.getmembers():
                if not info.isfile():
                    continue
                extractor = archive.extractfile(info)
                if extractor is None:
                    continue
                date = datetime.datetime.fromtimestamp(info.mtime)
                yield self._pack(info.name, date, lambda e=extractor: e.read())

    @classmethod
    def handles(cls, data) -> bool:
        return data[257:262] == B'ustar'
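
The source above relies on the `ustar` magic that sits at offset 257 of a 512-byte tar header: `handles` checks that position directly, and `unpack` falls back to searching for the magic when the archive does not start at the beginning of the input. The following is a minimal standard-library sketch of that idea, not the unit's implementation. The names `make_tar`, `looks_like_tar` and `open_embedded_tar` are hypothetical helpers introduced for illustration, and `io.BytesIO` stands in for refinery's `MemoryFile`.

```python
from __future__ import annotations

import io
import tarfile

USTAR_OFFSET = 257  # position of the magic field inside a tar header block


def make_tar(files: dict[str, bytes]) -> bytes:
    """Build a small ustar archive in memory (helper for the demonstration)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode='w', format=tarfile.USTAR_FORMAT) as archive:
        for name, content in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(content)
            archive.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def looks_like_tar(data: bytes) -> bool:
    """Same test as the handles method: magic at the fixed header offset."""
    return data[USTAR_OFFSET:USTAR_OFFSET + 5] == b'ustar'


def open_embedded_tar(data: bytes) -> tarfile.TarFile:
    """Open a tar archive, retrying from the first 'ustar' magic on failure."""
    try:
        return tarfile.open(fileobj=io.BytesIO(data))
    except tarfile.TarError:
        magic = data.find(b'ustar')
        if magic < USTAR_OFFSET:
            # No magic found, or too close to the start to belong to a header.
            raise
        stream = io.BytesIO(data)
        # tarfile starts reading at the current position of the file object.
        stream.seek(magic - USTAR_OFFSET)
        return tarfile.open(fileobj=stream)


payload = make_tar({'hello.txt': b'hi'})
blob = b'JUNK' * 25 + payload  # 100 bytes of non-tar data in front
archive = open_embedded_tar(blob)
print(archive.getnames())
```

Note that the prefix here is deliberately non-zero: a leading block of NUL bytes is read by `tarfile` as an end-of-archive marker rather than as an error, so the fallback would not be triggered in that case.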

Classes

class xttar (*paths, list=False, join_path=False, drop_path=False, fuzzy=0, exact=False, regex=False, path=b'path', exclude=None, date=b'date', pwd=b'')

Extract files from a Tar archive.

This unit extracts items with an associated virtual path from a container; each extracted item is emitted as a separate chunk with a corresponding meta variable named "path".

Positional arguments to xttar are patterns that filter the extracted items. Use the -x flag to add an exclusion pattern. To extract all files with a foo or bar extension, except those that have the word "temp" in their path:

xttar .foo .bar -x temp
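
The effect of such include and exclude patterns can be sketched in plain Python. This is only an illustration of the filtering idea and not refinery's matching logic: it assumes that every pattern is treated as a case-sensitive wildcard match anywhere in the path, and the function `select` is a hypothetical name.

```python
from fnmatch import fnmatchcase


def select(paths, patterns, exclude=()):
    """Keep paths that match any pattern and none of the exclusion patterns."""
    def hit(path, pattern):
        # Assumption: a pattern may match anywhere within the path.
        return fnmatchcase(path, f'*{pattern}*')

    selected = []
    for path in paths:
        if any(hit(path, pattern) for pattern in exclude):
            continue
        if not patterns or any(hit(path, pattern) for pattern in patterns):
            selected.append(path)
    return selected


paths = ['a/data.foo', 'b/temp/x.bar', 'c/readme.txt', 'd/y.bar']
print(select(paths, ['.foo', '.bar'], exclude=['temp']))
```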

To view only the paths of all chunks, use the listing switch:

emit data | ... | xttar -l

Otherwise, extracted items are written to the standard output port and usually require a frame to be processed properly. To dump all extracted data to disk, the following pipeline can be used:

emit data | ... | xttar [| dump extracted/{path} ]

The value {path} is a placeholder which is substituted by the virtual path of the extracted item. When using xttar to unpack a file on disk, the following pattern can be useful:

ef pack.bin [| xttar -j | d2p ]

The unit ef is also a path extractor. By specifying -j (or --join), the paths of extracted items are combined. Here, d2p is a shortcut for dump {path}. It deconflicts the joined paths with the local file system: if pack.bin contains the items one.txt and two.txt, the result would be the following local file tree:

pack.bin
pack/one.txt
pack/two.txt
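
The file tree above can be reproduced with a small path computation. The sketch below only illustrates the resulting layout. It assumes that the conflict between the container file and a directory of the same name is avoided by dropping the container's extension. How refinery actually resolves this is not stated here, and `joined_path` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def joined_path(container: str, item: str) -> str:
    """Combine a container path with the virtual path of an extracted item."""
    # Assumption: the container's suffix is removed so that the directory
    # name does not collide with the container file itself.
    return str(PurePosixPath(container).with_suffix('') / item)


for item in ('one.txt', 'two.txt'):
    print(joined_path('pack.bin', item))
```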

Finally, the -d (or --drop) switch prevents the unit from creating or altering the path metadata at all, which is useful when path metadata from a previous unit should be preserved.

Ancestors

Subclasses

Class variables

var reverse

Set to None, which indicates that this unit does not define a reverse operation.

Methods

def unpack(self, data)
def unpack(self, data: bytearray):
    with MemoryFile(data) as stream:
        try:
            archive = tarfile.open(fileobj=stream)
        except Exception:
            ustar = data.find(B'ustar')
            if ustar < 257:
                raise
            stream.seek(ustar - 257)
            archive = tarfile.open(fileobj=stream)
        for info in archive.getmembers():
            if not info.isfile():
                continue
            extractor = archive.extractfile(info)
            if extractor is None:
                continue
            date = datetime.datetime.fromtimestamp(info.mtime)
            yield self._pack(info.name, date, lambda e=extractor: e.read())
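
The member loop in `unpack` skips everything that is not a regular file, converts the modification time to a `datetime`, and defers reading by binding the current reader as a default argument of the lambda, so each callback keeps its own reader instead of the last one from the loop. The following standard-library sketch shows the same pattern outside of refinery. The generator `iter_files` is a hypothetical stand-in that yields plain tuples where the unit calls `self._pack`.

```python
import datetime
import io
import tarfile


def iter_files(archive: tarfile.TarFile):
    """Yield (name, date, loader) for every regular file in the archive."""
    for info in archive.getmembers():
        if not info.isfile():
            continue  # directories, links and device entries carry no data
        reader = archive.extractfile(info)
        if reader is None:
            continue
        date = datetime.datetime.fromtimestamp(info.mtime)
        # The default argument binds this iteration's reader to the lambda.
        yield info.name, date, (lambda r=reader: r.read())


# Build a small archive with one directory and two files.
buffer = io.BytesIO()
with tarfile.open(fileobj=buffer, mode='w') as writer:
    folder = tarfile.TarInfo('docs')
    folder.type = tarfile.DIRTYPE
    writer.addfile(folder)
    for name, content in [('docs/a.txt', b'A'), ('docs/b.txt', b'B')]:
        info = tarfile.TarInfo(name)
        info.size = len(content)
        info.mtime = 1_000_000_000
        writer.addfile(info, io.BytesIO(content))

buffer.seek(0)
with tarfile.open(fileobj=buffer) as archive:
    entries = list(iter_files(archive))
    # The loaders are only invoked here, after the loop has finished.
    contents = {name: load() for name, _, load in entries}

print(contents)
```

The loaders must be called while the archive is still open, since the readers returned by `extractfile` read from the underlying stream on demand.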

Inherited members