Module refinery.units.formats.archive.xtchm
from __future__ import annotations

from refinery.lib.chm import CHM, ChmHeader
from refinery.units.formats import PathExtractorUnit, UnpackResult


class xtchm(PathExtractorUnit, docs='{0}{p}{PathExtractorUnit}'):
    """
    Extract files from CHM (Windows Help) files. Compiled HTML Help archives contain HTML, images,
    and scripts used in Microsoft Help.
    """
    def unpack(self, data):
        chm = CHM.Parse(memoryview(data))
        self.log_info(F'language: {chm.header.language_name}')
        self.log_info(F'codepage: {chm.header.codepage}')
        for path, record in chm.filesystem.items():
            def extract(chm=chm, record=record):
                return chm.read(record)
            if record.length <= 0:
                continue
            if path.startswith('::DataSpace'):
                continue
            yield UnpackResult(path, extract)

    @classmethod
    def handles(cls, data):
        return data[:4] == ChmHeader.Magic
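The unpack method above hands each UnpackResult a callable instead of the extracted bytes, and it binds chm and record as default arguments of that callable. The following is a minimal, self-contained sketch of that pattern and is not refinery code: Record, iter_lazy, and the sample blob are hypothetical stand-ins for the CHM directory entries and the CHM.read call. It also mirrors the two filters from unpack, which skip empty entries and entries under ::DataSpace.

```python
from typing import Callable, Dict, Iterator, NamedTuple, Tuple


class Record(NamedTuple):
    # Hypothetical stand-in for a CHM directory entry.
    offset: int
    length: int


def iter_lazy(
    blob: bytes,
    filesystem: Dict[str, Record],
) -> Iterator[Tuple[str, Callable[[], bytes]]]:
    """Yield (path, extractor) pairs; no data is sliced until an extractor is called."""
    for path, record in filesystem.items():
        if record.length <= 0:
            continue  # skip empty entries such as directories
        if path.startswith('::DataSpace'):
            continue  # skip internal bookkeeping entries

        # The default argument captures the current record. A plain closure
        # would look up the loop variable when called and see a later value.
        def extract(record: Record = record) -> bytes:
            return blob[record.offset:record.offset + record.length]

        yield path, extract


blob = b'<html>hello'
filesystem = {
    '/': Record(0, 0),
    '::DataSpace/NameList': Record(0, 4),
    '/index.html': Record(0, 6),
    '/hello.txt': Record(6, 5),
}

# Exhaust the generator first, then extract: each callable still reads its own record.
items = list(iter_lazy(blob, filesystem))
results = {path: extract() for path, extract in items}
```

Because extraction is deferred, a consumer that only lists paths never pays the cost of reading any entry.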
Classes
class xtchm(*paths, exclude=None, list=False, join_path=False, drop_path=False, fuzzy=0, exact=False, regex=False, path=b'path')
Extract files from CHM (Windows Help) files. Compiled HTML Help archives contain HTML, images, and scripts used in Microsoft Help.
This unit extracts items with an associated virtual path from a container; each extracted item is emitted as a separate chunk with a corresponding meta variable named "path".
Positional arguments to xtchm are patterns to filter the extracted items. Use the -x flag to add an exclusion pattern. To extract all files with a foo or bar extension, but none that has the word "temp" in its path:
    xtchm .foo .bar -x temp
To view only the paths of all chunks, use the listing switch:
    emit data | ... | xtchm -l
Otherwise, extracted items are written to the standard output port and usually require a frame to properly process. In order to dump all extracted data to disk, the following pipeline can be used:
    emit data | ... | xtchm [| dump extracted/{path} ]
The value {path} is a placeholder which is substituted by the virtual path of the extracted item. When using xtchm to unpack a file on disk, the following pattern can be useful:
    ef pack.bin [| xtchm -j | d2p ]
The unit ef is also a path extractor. By specifying -j (or --join), the paths of extracted items are combined. Here, d2p is a shortcut for dump {path}. It deconflicts the joined paths with the local file system: If pack.bin contains items one.txt and two.txt, the following local file tree would be the result:
    pack.bin
    pack/one.txt
    pack/two.txt
Finally, the -d (or --drop) switch can be used to not create (or alter) the path metadata at all, which is useful in cases where path metadata from a previous unit should be preserved.
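To make the {path} substitution concrete, here is a rough Python sketch of what a pipeline ending in dump extracted/{path} amounts to. It is not how the dump unit is implemented: dump_items, its parameters, and the sample items are hypothetical, and the sketch does not guard against ".." segments in virtual paths.

```python
import os
import tempfile
from typing import Dict, List


def dump_items(items: Dict[str, bytes], template: str, root: str) -> List[str]:
    """Write each item below root, substituting its virtual path into template."""
    written = []
    for path, data in items.items():
        # Strip leading separators so the virtual path stays relative to root.
        relative = template.format(path=path.lstrip('/'))
        target = os.path.join(root, *relative.split('/'))
        # Create intermediate directories for nested virtual paths.
        os.makedirs(os.path.dirname(target), exist_ok=True)
        with open(target, 'wb') as stream:
            stream.write(data)
        written.append(target)
    return written


# Hypothetical extracted items, keyed by virtual path.
sample = {'/index.html': b'<html>', '/img/logo.png': b'\x89PNG'}

with tempfile.TemporaryDirectory() as root:
    written = dump_items(sample, 'extracted/{path}', root)
    names = sorted(os.path.relpath(name, root).replace(os.sep, '/') for name in written)
```

Each item ends up at a location derived from its own path metadata, which is why the template needs nothing beyond the {path} placeholder.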
Ancestors
Subclasses
Class variables
var reverse
The type of the None singleton.
Methods
def unpack(self, data)
Inherited members
PathExtractorUnit: CustomJoinBehaviour, CustomPathSeparator, FilterEverything, Requires, act, assemble, codec, console, filter, finish, handles, is_quiet, is_reversible, isatty, labelled, leniency, log_always, log_debug, log_detach, log_fail, log_info, log_level, log_warn, logger, name, nozzle, optional_dependencies, read, read1, required_dependencies, reset, run, source, superinit
UnitBase: