doFolder.hashing.calculator module

Advanced file hash calculators with caching and multithreading capabilities.

This module provides FileHashCalculator and ThreadedFileHashCalculator classes that offer intelligent caching, configurable recalculation policies, and parallel processing for efficient batch file hashing operations.

Added in version 2.3.0.

class doFolder.hashing.calculator.FileHashCalculator(useCache: bool = True, reCalcHashMode: ReCalcHashMode = ReCalcHashMode.TIMETAG, chunkSize: int = 16384, fileIOMinSize: int = 65536, algorithm: str = 'sha256', cacheManager: FileHashCacheManagerBase = <doFolder.hashing.cache.NullFileHashManager object>)

Bases: object

Advanced file hash calculator with intelligent caching and optimization.

This class provides a sophisticated interface for calculating file hashes with built-in caching capabilities, configurable recalculation policies, and performance optimizations. It’s designed for scenarios where multiple files need to be hashed and where avoiding redundant calculations is important.

The caching system uses file modification times and configurable policies to determine when cached results are still valid, significantly improving performance when processing file sets repeatedly. Cache operations are delegated to a pluggable cache manager system.

useCache

DEPRECATED - Use cacheManager parameter instead. Enable/disable result caching. When enabled, calculated hashes are stored and reused based on the recalculation mode. This attribute is kept for backward compatibility and affects the default cache manager selection when cacheManager is not provided.

Deprecated since version 2.2.3: Use cacheManager parameter instead. Pass MemoryFileHashManager() for caching or NullFileHashManager() to disable caching.

Type:

bool

reCalcHashMode

Policy for when to recalculate cached hashes:
  • TIMETAG: Recalculate if the file was modified after the last hash calculation

  • ALWAYS: Always recalculate, ignoring the cache (the cache still stores results)

  • NEVER: Always use cached results if available

Type:

ReCalcHashMode

chunkSize

Size of chunks for reading files. Affects memory usage and I/O performance for large files.

Type:

int

fileIOMinSize

Size threshold for switching between memory-based and streaming file access.

Type:

int
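To illustrate how chunkSize and fileIOMinSize might interact, here is a minimal sketch using only the standard library's hashlib. The function name and the exact switching logic are hypothetical, not doFolder's internals: the idea is simply that files below the threshold are read in one call, while larger files are streamed in chunkSize blocks.

```python
import hashlib
import os

def hash_file_sketch(path: str, algorithm: str = "sha256",
                     chunk_size: int = 16384, file_io_min_size: int = 65536) -> str:
    """Illustrative sketch: read small files in one call, stream large ones."""
    h = hashlib.new(algorithm)
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        if size < file_io_min_size:
            # Small file: a single read keeps syscall overhead low.
            h.update(f.read())
        else:
            # Large file: stream in chunk_size blocks to bound memory usage.
            while chunk := f.read(chunk_size):
                h.update(chunk)
    return h.hexdigest()
```

Both paths produce the same digest; the threshold only trades memory for I/O calls.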

algorithm

Default hash algorithm for all calculations.

Type:

str

cacheManager

Cache manager instance that handles all cache operations. If not provided, defaults to MemoryFileHashManager when useCache is True, or NullFileHashManager when useCache is False.

Type:

cache.FileHashCacheManagerBase

Example

Basic usage with default caching:

calculator = FileHashCalculator(
    algorithm="sha256",
    useCache=True,
    reCalcHashMode=ReCalcHashMode.TIMETAG
)

# First calculation - computed and cached
result1 = calculator.get(file1)

# Second calculation - uses cache if file unchanged
result2 = calculator.get(file1)

Using custom cache manager:

calculator = FileHashCalculator(
    cacheManager=LfuMemoryFileHashManager(maxSize=1000),
    reCalcHashMode=ReCalcHashMode.TIMETAG
)

Performance tuning for large files:

calculator = FileHashCalculator(
    chunkSize=1024 * 64,  # 64KB chunks
    fileIOMinSize=1024 * 1024,  # 1MB threshold
    algorithm="blake2b"  # Fast algorithm
)

Note

The default cache manager stores results in memory and is not persistent across program runs. For long-running applications processing many files, consider using LfuMemoryFileHashManager or implementing a custom persistent cache manager.

algorithm: str = 'sha256'
cacheManager: FileHashCacheManagerBase = <doFolder.hashing.cache.NullFileHashManager object>
calc(file: File, algorithm: str | None = None, progress: ProgressController | None = None) → FileHashResult

Calculate the hash of a file and optionally cache the result.

This method performs the actual hash calculation using the calculator’s configured parameters (algorithm, chunk size, etc.) and stores the result in the cache using the cacheManager.

Parameters:
  • file (File) – The file to calculate the hash for.

  • algorithm (str, optional) – Hash algorithm to use. If None, uses the calculator’s default algorithm.

  • progress (ProgressController, optional) –

    Progress controller for tracking calculation progress. Can be used to monitor progress or cancel the operation.

    Added in version 2.3.0.

Returns:

Complete hash result with metadata.

Return type:

FileHashResult

Note

The result is automatically stored using the cacheManager.

chunkSize: int = 16384
fileIOMinSize: int = 65536
findCache(file: File, algorithm: str | None = None) → FileHashResult | None

Locate and validate a cached hash result for the given file.

This method searches the cache for a hash result and validates its currency according to the current recalculation mode. It combines cache lookup with validation in a single operation.

Parameters:
  • file (File) – The file to look up in the cache.

  • algorithm (str, optional) – Hash algorithm to look up. If None, uses the calculator’s default algorithm.

Returns:

The cached result if valid, None if no valid cache entry exists (either missing or invalidated by recalc mode).

Return type:

FileHashResult or None

Note

Uses the configured cacheManager to retrieve cached results.

get(file: File, algorithm: str | None = None) → FileHashResult

Get the hash of a file, using cache when possible.

This is the main method for retrieving file hashes. It first checks the cache for a valid result according to the current recalculation mode, and only performs a new calculation if necessary.

Parameters:
  • file (File) – The file to hash.

  • algorithm (str, optional) – Hash algorithm to use. If None, uses the calculator’s default algorithm.

Returns:

Complete hash result with metadata.

Return type:

FileHashResult

Note

Cache behavior depends on the reCalcHashMode setting:
  • TIMETAG: Uses the cache if the file hasn’t been modified

  • ALWAYS: Ignores the cache, always calculates

  • NEVER: Always uses the cache if available

multipleCalc(file: File, algorithms: str | Iterable[str], progress: ProgressController | None = None) → dict[str, FileHashResult]

Calculate multiple hashes for a file and cache all results.

Performs hash calculations for multiple algorithms and stores each result in the cache for future use.

Parameters:
  • file (File) – The file to calculate hashes for.

  • algorithms (Union[str, Iterable[str]]) – Algorithm(s) to compute hashes for.

  • progress (ProgressController, optional) –

    Progress controller for tracking calculation progress. Can be used to monitor progress or cancel the operation.

    Added in version 2.3.0.

Returns:

Mapping of algorithm names to hash results.

Return type:

Dict[str, FileHashResult]

multipleGet(file: File, algorithms: str | Iterable[str]) → dict[str, FileHashResult]

Get multiple hash results for a file, using cache when possible.

Efficiently retrieves hash results for multiple algorithms, leveraging cache for available results and calculating only missing ones.

Parameters:
  • file (File) – The file to hash.

  • algorithms (Union[str, Iterable[str]]) – Algorithm(s) to compute hashes for.

Returns:

Mapping of algorithm names to hash results.

Return type:

Dict[str, FileHashResult]

reCalcHashMode: ReCalcHashMode = 'TIME_TAG'
useCache: bool = True
validateCache(file: File, res: FileHashResult | None) → bool

Validate whether a cached hash result is still current and usable.

This method implements the core cache validation logic based on the configured recalculation mode. It determines whether a cached result should be trusted or if a new calculation is needed.

Parameters:
  • file (File) – The file whose cache entry is being validated.

  • res (FileHashResult or None) – The cached result to validate, or None if no cached result exists.

Returns:

True if the cached result is valid and should be used,

False if a new calculation is needed.

Return type:

bool

Validation Rules:
  • TIMETAG mode: Valid if cached mtime >= current file mtime

  • ALWAYS mode: Never valid (always recalculate)

  • NEVER mode: Always valid if result exists

  • None result: Always invalid

Raises:

ValueError – If reCalcHashMode is set to an invalid/unknown value.

class doFolder.hashing.calculator.ThreadedFileHashCalculator(useCache: bool = True, reCalcHashMode: ReCalcHashMode = ReCalcHashMode.TIMETAG, chunkSize: int = 16384, fileIOMinSize: int = 65536, algorithm: str = 'sha256', cacheManager: FileHashCacheManagerBase = <doFolder.hashing.cache.NullFileHashManager object>, threadNum: int = 4)

Bases: FileHashCalculator

Multithreaded file hash calculator with automatic resource management.

This class extends FileHashCalculator with multithreading capabilities, allowing multiple files to be hashed concurrently. It’s particularly beneficial when processing large numbers of files or when I/O latency is high, as it can overlap computation and I/O operations.

The class implements the context manager protocol (__enter__/__exit__) for automatic thread pool lifecycle management, ensuring proper resource cleanup when used with the ‘with’ statement.

The threading model uses a ThreadPoolExecutor to manage worker threads, with intelligent cache checking to avoid unnecessary thread overhead for cache hits.

threadNum

Number of worker threads in the thread pool. More threads can improve performance for I/O-bound workloads but may cause resource contention. Defaults to 4.

Type:

int

threadPool

Internal thread pool for parallel execution. Automatically initialized after dataclass construction.

Type:

ThreadPoolExecutor

Threading Behavior:
  • Cache hits are resolved immediately without using threads

  • Cache misses are submitted to the thread pool for parallel processing

  • Each thread performs independent file I/O and hash calculation

  • Results are returned as Future objects for asynchronous handling

Context Manager Usage:

Recommended usage with automatic cleanup:

with ThreadedFileHashCalculator(threadNum=8) as calculator:
    # Submit all files for processing
    futures = [calculator.threadedGet(file) for file in file_list]

    # Collect results within the with block
    results = [future.result() for future in futures]

    # Process results
    for result in results:
        print(f"{result.path}: {result.hash}")
# Thread pool automatically shut down here

Manual management (if needed):

calculator = ThreadedFileHashCalculator()
try:
    futures = [calculator.threadedGet(file) for file in file_list]
    results = [future.result() for future in futures]
finally:
    calculator.threadPool.shutdown(wait=False)

Example

Processing multiple files concurrently:

with ThreadedFileHashCalculator(
    threadNum=8,  # Use 8 worker threads
    algorithm="blake2b",
    useCache=True
) as calculator:
    # Submit all files for processing
    futures = [calculator.threadedGet(file) for file in file_list]

    # Collect results as they complete
    for future in futures:
        result = future.result()
        print(f"{result.path}: {result.hash}")

Processing with error handling:

with ThreadedFileHashCalculator() as calculator:
    futures = []
    for file in file_list:
        futures.append(calculator.threadedGet(file))

    for future in futures:
        try:
            result = future.result()
            print(f"Success: {result.path} -> {result.hash}")
        except Exception as e:
            print(f"Error processing file: {e}")

Performance Notes:
  • Optimal thread count depends on system characteristics and file sizes

  • For CPU-bound workloads (fast storage), fewer threads may be better

  • For I/O-bound workloads (network storage), more threads can help

  • Very small files may not benefit from threading due to overhead

  • Use context manager (with statement) for automatic resource cleanup

Resource Management:

The context manager automatically shuts down the thread pool with wait=False when exiting the ‘with’ block. This means:
  • No new tasks will be accepted after exit

  • Currently running tasks may continue briefly

  • The program continues immediately without waiting

If you need to ensure all tasks complete, collect all Future results within the ‘with’ block or manually call shutdown(wait=True).

threadNum: int = 4
threadPool: ThreadPoolExecutorWithProgress
threadedGet(file: File, algorithm: str | None = None) → FutureWithProgress[FileHashResult]

Get the hash of a file using background thread processing.

This method provides asynchronous hash calculation by checking the cache first and only submitting uncached work to the thread pool. Cache hits are resolved immediately with a completed Future to maintain consistent return types while avoiding unnecessary thread overhead.

Parameters:
  • file (File) – The file to hash.

  • algorithm (str, optional) – Hash algorithm to use. If None, uses the calculator’s default algorithm.

Returns:

A Future object that will contain the hash result.
  • For cache hits: A completed Future with the cached result

  • For cache misses: A Future representing the ongoing calculation

Return type:

Future[FileHashResult]

Usage:

The returned Future can be used with standard concurrent.futures patterns:

# Get Future immediately
future = calculator.threadedGet(file)

# Block until result is available
result = future.result()

# Check if calculation is complete
if future.done():
    result = future.result()

# Add callback for when complete
future.add_done_callback(lambda f: print(f.result().hash))

Note

Cache validation follows the same rules as the synchronous get() method, but the actual calculation (if needed) happens in a background thread.

threadedMultipleGet(file: File, algorithms: str | Iterable[str]) → FutureWithProgress[dict[str, FileHashResult]]

Get multiple hash results for a file using background thread processing.

Provides asynchronous calculation of multiple hash algorithms by checking the cache first and only submitting uncached work to the thread pool.

Parameters:
  • file (File) – The file to hash.

  • algorithms (Union[str, Iterable[str]]) – Algorithm(s) to compute hashes for.

Returns:

A Future containing the mapping of algorithm names to hash results.
  • For complete cache hits: A completed Future with cached results

  • For cache misses: A Future representing the ongoing calculation

Return type:

Future[Dict[str, FileHashResult]]

Note

Cache validation follows the same rules as the synchronous multipleGet() method, but the actual calculation (if needed) happens in a background thread.