doFolder.hashing.calculator module¶
Advanced file hash calculators with caching and multithreading capabilities.
This module provides FileHashCalculator and ThreadedFileHashCalculator classes that offer intelligent caching, configurable recalculation policies, and parallel processing for efficient batch file hashing operations.
Added in version 2.3.0.
- class doFolder.hashing.calculator.FileHashCalculator(useCache: bool = True, reCalcHashMode: ReCalcHashMode = ReCalcHashMode.TIMETAG, chunkSize: int = 16384, fileIOMinSize: int = 65536, algorithm: str = 'sha256', cacheManager: FileHashCacheManagerBase = <doFolder.hashing.cache.NullFileHashManager object>)¶
Bases: object
Advanced file hash calculator with intelligent caching and optimization.
This class provides a sophisticated interface for calculating file hashes with built-in caching capabilities, configurable recalculation policies, and performance optimizations. It’s designed for scenarios where multiple files need to be hashed and where avoiding redundant calculations is important.
The caching system uses file modification times and configurable policies to determine when cached results are still valid, significantly improving performance when processing file sets repeatedly. Cache operations are delegated to a pluggable cache manager system.
- useCache¶
DEPRECATED - Use cacheManager parameter instead. Enable/disable result caching. When enabled, calculated hashes are stored and reused based on the recalculation mode. This attribute is kept for backward compatibility and affects the default cache manager selection when cacheManager is not provided.
Deprecated since version 2.2.3: Use cacheManager parameter instead. Pass MemoryFileHashManager() for caching or NullFileHashManager() to disable caching.
- Type:
bool
- reCalcHashMode¶
Policy for when to recalculate cached hashes:
- TIMETAG: Recalculate if the file was modified after the last hash calculation
- ALWAYS: Always recalculate, ignoring the cache (results are still stored)
- NEVER: Always use cached results if available
- Type:
ReCalcHashMode
- chunkSize¶
Size of chunks for reading files. Affects memory usage and I/O performance for large files.
- Type:
int
- fileIOMinSize¶
Size threshold for switching between memory-based and streaming file access.
- Type:
int
- algorithm¶
Default hash algorithm for all calculations.
- Type:
str
- cacheManager¶
Cache manager instance that handles all cache operations. If not provided, defaults to MemoryFileHashManager when useCache is True, or NullFileHashManager when useCache is False.
Example
Basic usage with default caching:
calculator = FileHashCalculator(
    algorithm="sha256",
    useCache=True,
    reCalcHashMode=ReCalcHashMode.TIMETAG
)

# First calculation - computed and cached
result1 = calculator.get(file1)

# Second calculation - uses cache if file unchanged
result2 = calculator.get(file1)
Using custom cache manager:
calculator = FileHashCalculator(
    cacheManager=LfuMemoryFileHashManager(maxSize=1000),
    reCalcHashMode=ReCalcHashMode.TIMETAG
)
Performance tuning for large files:
calculator = FileHashCalculator(
    chunkSize=1024 * 64,         # 64KB chunks
    fileIOMinSize=1024 * 1024,   # 1MB threshold
    algorithm="blake2b"          # Fast algorithm
)
Note
The default cache manager stores results in memory and is not persistent across program runs. For long-running applications processing many files, consider using LfuMemoryFileHashManager or implementing a custom persistent cache manager.
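The chunkSize and fileIOMinSize parameters both govern how file data is read during hashing. The general chunked-reading pattern they tune can be illustrated with a minimal, standard-library-only sketch (this is not doFolder's actual implementation, just the underlying technique):

```python
import hashlib

def chunked_hash(path: str, algorithm: str = "sha256", chunk_size: int = 16384) -> str:
    """Hash a file by reading fixed-size chunks, keeping memory usage bounded
    even for files much larger than available RAM."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read chunk_size bytes at a time until EOF (signalled by b"").
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

Larger chunks mean fewer system calls at the cost of more memory per read; the defaults above mirror this class's defaults.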
- algorithm: str = 'sha256'¶
- cacheManager: FileHashCacheManagerBase = <doFolder.hashing.cache.NullFileHashManager object>¶
- calc(file: File, algorithm: str | None = None, progress: ProgressController | None = None) FileHashResult¶
Calculate the hash of a file and optionally cache the result.
This method performs the actual hash calculation using the calculator’s configured parameters (algorithm, chunk size, etc.) and stores the result in the cache using the cacheManager.
- Parameters:
file (File) – The file to calculate the hash for.
algorithm (str, optional) – Hash algorithm to use. If None, uses the calculator’s default algorithm.
progress (ProgressController, optional) –
Progress controller for tracking calculation progress. Can be used to monitor progress or cancel the operation.
Added in version 2.3.0.
- Returns:
Complete hash result with metadata.
- Return type:
FileHashResult
Note
The result is automatically stored using the cacheManager.
- chunkSize: int = 16384¶
- fileIOMinSize: int = 65536¶
- findCache(file: File, algorithm: str | None = None) FileHashResult | None¶
Locate and validate a cached hash result for the given file.
This method searches the cache for a hash result and validates its currency according to the current recalculation mode. It combines cache lookup with validation in a single operation.
- Parameters:
file (File) – The file to look up in the cache.
- Returns:
The cached result if valid, None if no valid cache entry exists (either missing or invalidated by recalc mode).
- Return type:
FileHashResult or None
Note
Uses the configured cacheManager to retrieve cached results.
- get(file: File, algorithm: str | None = None) FileHashResult¶
Get the hash of a file, using cache when possible.
This is the main method for retrieving file hashes. It first checks the cache for a valid result according to the current recalculation mode, and only performs a new calculation if necessary.
- Parameters:
file (File) – The file to hash.
- Returns:
Complete hash result with metadata.
- Return type:
FileHashResult
Note
Cache behavior depends on the reCalcHashMode setting:
- TIMETAG: Uses cache if the file hasn’t been modified
- ALWAYS: Ignores cache, always calculates
- NEVER: Always uses cache if available
- multipleCalc(file: File, algorithms: str | Iterable[str], progress: ProgressController | None = None) dict[str, FileHashResult]¶
Calculate multiple hashes for a file and cache all results.
Performs hash calculations for multiple algorithms and stores each result in the cache for future use.
- Parameters:
file (File) – The file to calculate hashes for.
algorithms (Union[str, Iterable[str]]) – Algorithm(s) to compute hashes for.
progress (ProgressController, optional) –
Progress controller for tracking calculation progress. Can be used to monitor progress or cancel the operation.
Added in version 2.3.0.
- Returns:
Mapping of algorithm names to hash results.
- Return type:
Dict[str, FileHashResult]
- multipleGet(file: File, algorithms: str | Iterable[str]) dict[str, FileHashResult]¶
Get multiple hash results for a file, using cache when possible.
Efficiently retrieves hash results for multiple algorithms, leveraging cache for available results and calculating only missing ones.
- Parameters:
file (File) – The file to hash.
algorithms (Union[str, Iterable[str]]) – Algorithm(s) to compute hashes for.
- Returns:
Mapping of algorithm names to hash results.
- Return type:
Dict[str, FileHashResult]
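Requesting several algorithms together is cheaper than hashing the file once per algorithm because the data only needs to be read once. The single-pass technique can be sketched with the standard library (a simplified illustration, not doFolder's implementation):

```python
import hashlib

def multi_hash(path: str, algorithms: list[str], chunk_size: int = 16384) -> dict[str, str]:
    """Compute several digests in one pass over the file: each chunk read
    from disk is fed to every hasher before the next chunk is read."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for h in hashers.values():
                h.update(chunk)
    return {name: h.hexdigest() for name, h in hashers.items()}
```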
- reCalcHashMode: ReCalcHashMode = 'TIME_TAG'¶
- useCache: bool = True¶
- validateCache(file: File, res: FileHashResult | None) bool¶
Validate whether a cached hash result is still current and usable.
This method implements the core cache validation logic based on the configured recalculation mode. It determines whether a cached result should be trusted or if a new calculation is needed.
- Parameters:
file (File) – The file whose cache entry is being validated.
res (FileHashResult or None) – The cached result to validate, or None if no cached result exists.
- Returns:
True if the cached result is valid and should be used, False if a new calculation is needed.
- Return type:
bool
- Validation Rules:
TIMETAG mode: Valid if cached mtime >= current file mtime
ALWAYS mode: Never valid (always recalculate)
NEVER mode: Always valid if result exists
None result: Always invalid
- Raises:
ValueError – If reCalcHashMode is set to an invalid/unknown value.
- class doFolder.hashing.calculator.ThreadedFileHashCalculator(useCache: bool = True, reCalcHashMode: ReCalcHashMode = ReCalcHashMode.TIMETAG, chunkSize: int = 16384, fileIOMinSize: int = 65536, algorithm: str = 'sha256', cacheManager: FileHashCacheManagerBase = <doFolder.hashing.cache.NullFileHashManager object>, threadNum: int = 4)¶
Bases: FileHashCalculator
Multithreaded file hash calculator with automatic resource management.
This class extends FileHashCalculator with multithreading capabilities, allowing multiple files to be hashed concurrently. It’s particularly beneficial when processing large numbers of files or when I/O latency is high, as it can overlap computation and I/O operations.
The class implements the context manager protocol (__enter__/__exit__) for automatic thread pool lifecycle management, ensuring proper resource cleanup when used with the ‘with’ statement.
The threading model uses a ThreadPoolExecutor to manage worker threads, with intelligent cache checking to avoid unnecessary thread overhead for cache hits.
- threadNum¶
Number of worker threads in the thread pool. More threads can improve performance for I/O-bound workloads but may cause resource contention. Defaults to 4.
- Type:
int
- threadPool¶
Internal thread pool for parallel execution. Automatically initialized after dataclass construction.
- Type:
ThreadPoolExecutor
- Threading Behavior:
Cache hits are resolved immediately without using threads
Cache misses are submitted to the thread pool for parallel processing
Each thread performs independent file I/O and hash calculation
Results are returned as Future objects for asynchronous handling
- Context Manager Usage:
Recommended usage with automatic cleanup:
with ThreadedFileHashCalculator(threadNum=8) as calculator:
    # Submit all files for processing
    futures = [calculator.threadedGet(file) for file in file_list]

    # Collect results within the with block
    results = [future.result() for future in futures]

    # Process results
    for result in results:
        print(f"{result.path}: {result.hash}")
# Thread pool automatically shut down here
Manual management (if needed):
calculator = ThreadedFileHashCalculator()
try:
    futures = [calculator.threadedGet(file) for file in file_list]
    results = [future.result() for future in futures]
finally:
    calculator.threadPool.shutdown(wait=False)
Example
Processing multiple files concurrently:
with ThreadedFileHashCalculator(
    threadNum=8,           # Use 8 worker threads
    algorithm="blake2b",
    useCache=True
) as calculator:
    # Submit all files for processing
    futures = [calculator.threadedGet(file) for file in file_list]

    # Collect results as they complete
    for future in futures:
        result = future.result()
        print(f"{result.path}: {result.hash}")
Processing with error handling:
with ThreadedFileHashCalculator() as calculator:
    futures = []
    for file in file_list:
        futures.append(calculator.threadedGet(file))

    for future in futures:
        try:
            result = future.result()
            print(f"Success: {result.path} -> {result.hash}")
        except Exception as e:
            print(f"Error processing file: {e}")
- Performance Notes:
Optimal thread count depends on system characteristics and file sizes
For CPU-bound workloads (fast storage), fewer threads may be better
For I/O-bound workloads (network storage), more threads can help
Very small files may not benefit from threading due to overhead
Use context manager (with statement) for automatic resource cleanup
- Resource Management:
The context manager automatically shuts down the thread pool with wait=False when exiting the ‘with’ block. This means:
- No new tasks will be accepted after exit
- Currently running tasks may continue briefly
- The program continues immediately without waiting
If you need to ensure all tasks complete, collect all Future results within the ‘with’ block or manually call shutdown(wait=True).
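The wait=True pattern is the standard concurrent.futures idiom; a stdlib-only sketch of deterministic completion (not doFolder-specific, using a plain ThreadPoolExecutor and a trivial hashing task):

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def hash_bytes(data: bytes) -> str:
    """Trivial worker: SHA-256 of an in-memory blob."""
    return hashlib.sha256(data).hexdigest()

pool = ThreadPoolExecutor(max_workers=4)
try:
    futures = [pool.submit(hash_bytes, blob) for blob in (b"a", b"b", b"c")]
finally:
    # wait=True blocks until every submitted task has finished,
    # so every future below is guaranteed to be done.
    pool.shutdown(wait=True)

results = [f.result() for f in futures]
```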
- threadNum: int = 4¶
- threadPool: ThreadPoolExecutorWithProgress¶
- threadedGet(file: File, algorithm: str | None = None) FutureWithProgress[FileHashResult]¶
Get the hash of a file using background thread processing.
This method provides asynchronous hash calculation by checking the cache first and only submitting uncached work to the thread pool. Cache hits are resolved immediately with a completed Future to maintain consistent return types while avoiding unnecessary thread overhead.
- Parameters:
file (File) – The file to hash.
- Returns:
A Future object that will contain the hash result:
- For cache hits: a completed Future with the cached result
- For cache misses: a Future representing the ongoing calculation
- Return type:
Future[FileHashResult]
- Usage:
The returned Future can be used with standard concurrent.futures patterns:
# Get Future immediately
future = calculator.threadedGet(file)

# Block until result is available
result = future.result()

# Check if calculation is complete
if future.done():
    result = future.result()

# Add callback for when complete
future.add_done_callback(lambda f: print(f.result().hash))
Note
Cache validation follows the same rules as the synchronous get() method, but the actual calculation (if needed) happens in a background thread.
- threadedMultipleGet(file: File, algorithms: str | Iterable[str]) FutureWithProgress[dict[str, FileHashResult]]¶
Get multiple hash results for a file using background thread processing.
Provides asynchronous calculation of multiple hash algorithms by checking the cache first and only submitting uncached work to the thread pool.
- Parameters:
file (File) – The file to hash.
algorithms (Union[str, Iterable[str]]) – Algorithm(s) to compute hashes for.
- Returns:
A Future containing the mapping of algorithm names to hash results:
- For complete cache hits: a completed Future with cached results
- For cache misses: a Future representing the ongoing calculation
- Return type:
Future[Dict[str, FileHashResult]]
Note
Cache validation follows the same rules as the synchronous multipleGet() method, but the actual calculation (if needed) happens in a background thread.
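The "completed Future for cache hits" behavior described for both threaded methods can be sketched with the standard library alone (all names below are hypothetical illustrations, not doFolder's internals):

```python
import hashlib
from concurrent.futures import Future, ThreadPoolExecutor

_cache: dict[tuple[bytes, str], str] = {}  # (data, algorithm) -> digest
_pool = ThreadPoolExecutor(max_workers=2)

def threaded_multi_hash(data: bytes, algorithms: list[str]) -> Future:
    """Return a Future resolving to {algorithm: digest}. When every digest
    is already cached, return an already-completed Future so no thread is
    used, yet the return type stays consistent."""
    cached = {a: _cache[(data, a)] for a in algorithms if (data, a) in _cache}
    if len(cached) == len(algorithms):
        done: Future = Future()
        done.set_result(cached)   # complete cache hit: resolve immediately
        return done

    def work() -> dict[str, str]:
        out = {}
        for a in algorithms:
            digest = hashlib.new(a, data).hexdigest()
            _cache[(data, a)] = digest  # populate cache for future calls
            out[a] = digest
        return out

    return _pool.submit(work)     # cache miss: calculate in the background
```

Callers always receive a Future and can use the same result()/done()/callback patterns regardless of whether a thread was actually involved.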