Module refinery.units.formats.archive.xtchm

from __future__ import annotations

from refinery.lib.chm import CHM, ChmHeader
from refinery.units.formats import PathExtractorUnit, UnpackResult


class xtchm(PathExtractorUnit, docs='{0}{p}{PathExtractorUnit}'):
    """
    Extract files from CHM (Windows Help) files. Compiled HTML Help archives contain HTML, images,
    and scripts used in Microsoft Help.
    """
    def unpack(self, data):
        chm = CHM.Parse(memoryview(data))

        self.log_info(F'language: {chm.header.language_name}')
        self.log_info(F'codepage: {chm.header.codepage}')

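        for path, record in chm.filesystem.items():
            # Bind chm and record as defaults so each deferred reader keeps its own record.
            def extract(chm=chm, record=record):
                return chm.read(record)
            # Skip empty entries.
            if record.length <= 0:
                continue
            # Skip entries under the ::DataSpace prefix.
            if path.startswith('::DataSpace'):
                continue
            yield UnpackResult(path, extract)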

    @classmethod
    def handles(cls, data):
        return data[:4] == ChmHeader.Magic
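
The handles check compares the first four bytes of the input against ChmHeader.Magic. The following is a minimal stand-alone sketch of that idea. It assumes the magic is the ASCII signature ITSF that CHM files conventionally start with; the actual value of ChmHeader.Magic is not shown in this listing, and the names ASSUMED_CHM_MAGIC and looks_like_chm are hypothetical.

```python
# Assumption: CHM files begin with the 4-byte ASCII signature "ITSF".
# The real constant lives in ChmHeader.Magic and is not shown here.
ASSUMED_CHM_MAGIC = b'ITSF'


def looks_like_chm(data) -> bool:
    """Return True if the buffer starts with the assumed CHM signature."""
    # bytes(...) lets the comparison work for bytes, bytearray and memoryview.
    return bytes(data[:4]) == ASSUMED_CHM_MAGIC
```

A buffer shorter than four bytes simply fails the comparison, so no explicit length check is needed.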

Classes

class xtchm (*paths, exclude=None, list=False, join_path=False, drop_path=False, fuzzy=0, exact=False, regex=False, path=b'path')

Extract files from CHM (Windows Help) files. Compiled HTML Help archives contain HTML, images, and scripts used in Microsoft Help.

This unit extracts items with an associated virtual path from a container; each extracted item is emitted as a separate chunk with a corresponding meta variable named "path".

Positional arguments to xtchm are patterns that filter the extracted items. Use the -x flag to add an exclusion pattern. To extract all files with a foo or bar extension, except those with the word "temp" in their path:

xtchm .foo .bar -x temp
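
The effect of combining inclusion and exclusion patterns can be pictured with a small Python sketch. It assumes plain case-sensitive substring matching, which is a simplification: the unit's real matching is governed by options such as fuzzy, exact and regex and may behave differently. The helper select_paths and the sample paths are hypothetical.

```python
def select_paths(paths, include=(), exclude=()):
    """Keep paths that match any include pattern and no exclude pattern."""
    selected = []
    for path in paths:
        # With no include patterns, every path is a candidate.
        if include and not any(p in path for p in include):
            continue
        # Any matching exclusion pattern removes the path.
        if any(p in path for p in exclude):
            continue
        selected.append(path)
    return selected


paths = ['a.foo', 'b.bar', 'temp/c.foo', 'd.txt']
print(select_paths(paths, include=['.foo', '.bar'], exclude=['temp']))
```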

To view only the paths of all chunks, use the listing switch:

emit data | ... | xtchm -l

Otherwise, extracted items are written to the standard output port and usually need to be processed inside a frame. To dump all extracted data to disk, the following pipeline can be used:

emit data | ... | xtchm [| dump extracted/{path} ]

The value {path} is a placeholder which is substituted by the virtual path of the extracted item. When using xtchm to unpack a file on disk, the following pattern can be useful:

ef pack.bin [| xtchm -j | d2p ]

The unit ef is also a path extractor. By specifying -j (or --join), the paths of extracted items are combined. Here, d2p is a shortcut for dump {path}. It deconflicts the joined paths with the local file system: if pack.bin contains the items one.txt and two.txt, the resulting local file tree would be:

pack.bin
pack/one.txt
pack/two.txt

Finally, the -d (or --drop) switch prevents the unit from creating or altering the path metadata, which is useful when path metadata from a previous unit should be preserved.
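
The {path} substitution used in the dump pipeline can be pictured with ordinary Python string formatting. This is only an illustration, assuming each chunk carries its virtual path as a meta variable; the representation of chunks as (path, data) tuples and the helper target_names are hypothetical and not refinery's implementation.

```python
def target_names(pattern, chunks):
    """Yield (file name, data) pairs by filling {path} into the pattern."""
    for path, data in chunks:
        yield pattern.format(path=path), data


chunks = [('index.html', b'<html></html>'), ('img/logo.png', b'\x89PNG')]
for name, data in target_names('extracted/{path}', chunks):
    print(name, len(data))
```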

Class variables

var reverse

Methods

def unpack(self, data)
def unpack(self, data):
    chm = CHM.Parse(memoryview(data))

    self.log_info(F'language: {chm.header.language_name}')
    self.log_info(F'codepage: {chm.header.codepage}')

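    for path, record in chm.filesystem.items():
        # Bind chm and record as defaults so each deferred reader keeps its own record.
        def extract(chm=chm, record=record):
            return chm.read(record)
        # Skip empty entries.
        if record.length <= 0:
            continue
        # Skip entries under the ::DataSpace prefix.
        if path.startswith('::DataSpace'):
            continue
        yield UnpackResult(path, extract)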
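
The default arguments in extract(chm=chm, record=record) bind the current loop values when the function is defined. Without them, a deferred callable invoked after the loop has advanced would see a later record. The following stand-alone sketch shows this general Python behaviour; the function names are hypothetical and unrelated to refinery.

```python
def make_late_bound(items):
    # Each lambda looks up `item` only when called, after the loop has ended.
    return [lambda: item for item in items]


def make_early_bound(items):
    # The default argument captures the value of `item` at definition time.
    return [lambda item=item: item for item in items]


late = [f() for f in make_late_bound(['a', 'b', 'c'])]
early = [f() for f in make_early_bound(['a', 'b', 'c'])]
print(late)   # ['c', 'c', 'c']
print(early)  # ['a', 'b', 'c']
```

Because unpack is a generator that hands out extract before it is called, early binding keeps each result tied to its own record regardless of when the consumer reads it.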
