The Winnowing Algorithm
The Winnowing algorithm converts source code into fingerprints through four key steps:
- Normalisation
- Gram Fingerprinting
- Window Selection
- Output Formatting
Normalisation
The normalisation process removes every non-alphanumeric character (spaces, punctuation, and special characters) from the input source code and converts the remaining characters to lowercase.
Example
After Normalisation
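The normalisation step described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the SCANOSS implementation: the function name `normalise` is hypothetical, and restricting the kept characters to ASCII letters and digits is an assumption.

```python
def normalise(source: str) -> str:
    """Keep only ASCII letters and digits, converted to lowercase.

    Hypothetical helper for illustration; treating "alphanumeric" as
    ASCII-only is an assumption, not something the text above states.
    """
    return "".join(ch.lower() for ch in source if ch.isascii() and ch.isalnum())


sample = "if (Count > 10) { Total += Count; }"
print(normalise(sample))  # prints: ifcount10totalcount
```

Whitespace, brackets, and operators all disappear, so two snippets that differ only in formatting normalise to the same character sequence.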
Gram Fingerprinting
From the normalised code, overlapping data samples (called grams) are taken and fingerprinted. SCANOSS uses:
- GRAM size: 30 bytes
- Hash algorithm: CRC32C checksum (hardware-accelerated on most modern Intel CPUs, which helps performance)
Example
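As a rough illustration of gram fingerprinting, the sketch below slides a 30-byte gram over normalised input and computes a CRC32C checksum for each position. The pure-Python `crc32c` function (Castagnoli polynomial, reflected form) and the `gram_fingerprints` helper are hypothetical names written for this example; a production implementation would more likely rely on a hardware-accelerated routine.

```python
from typing import List

GRAM = 30  # gram size in bytes, as stated above


def _make_table() -> List[int]:
    """Build the lookup table for CRC32C (reflected polynomial 0x82F63B78)."""
    table = []
    for i in range(256):
        crc = i
        for _ in range(8):
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _make_table()


def crc32c(data: bytes) -> int:
    """Table-driven CRC32C over a byte string."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def gram_fingerprints(normalised: bytes, gram: int = GRAM) -> List[int]:
    """Return one CRC32C value per overlapping gram (empty if input is shorter than a gram)."""
    return [crc32c(normalised[i:i + gram]) for i in range(len(normalised) - gram + 1)]


data = b"ifcount10totalcountreturntotalplusone"
print(len(data), len(gram_fingerprints(data)))  # prints: 37 8
```

Because the grams overlap by all but one byte, an input of n bytes yields n - 29 gram fingerprints, which is why a further selection step is needed to keep the footprint small.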
Window Selection
From the series of gram fingerprints, a sliding window is applied to select representative hashes:
- WINDOW size: 64 gram fingerprints
- Selection method: Choose the minimum hash from each window
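The window selection step can be sketched as follows. The function name `winnow` is hypothetical, and skipping a minimum that repeats the previously selected value is an assumption borrowed from the usual winnowing approach; the text above only states that the minimum of each window is chosen.

```python
from typing import List, Optional

WINDOW = 64  # window size in gram fingerprints, as stated above


def winnow(hashes: List[int], window: int = WINDOW) -> List[int]:
    """Pick the minimum hash of every sliding window.

    Assumption: a minimum equal to the last selected value is not
    emitted again, so a long-lived minimum appears only once.
    """
    selected: List[int] = []
    last: Optional[int] = None
    for start in range(len(hashes) - window + 1):
        smallest = min(hashes[start:start + window])
        if smallest != last:
            selected.append(smallest)
            last = smallest
    return selected


# A tiny window makes the behaviour easy to follow by hand.
print(winnow([5, 3, 8, 1, 9, 2], window=3))  # prints: [3, 1]
```

In the small example, the windows are [5, 3, 8], [3, 8, 1], [8, 1, 9] and [1, 9, 2]; their minima are 3, 1, 1 and 1, so only two values are kept.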
Why These Values?
The values gram=30 and window=64 were chosen after extensive testing across multiple programming languages (C, Java, JavaScript, Ruby) to provide the optimal balance between:
- Footprint: Number of fingerprints generated (affects storage and performance)
- Uniformity: Even distribution of hash values (prevents database index imbalance)
- Match accuracy: Ability to find matches even in modified code
Output Formatting
The fingerprints are formatted as a .wfp (Winnowing FingerPrint) file with:
- File metadata (MD5 hash, filename, size)
- Line numbers where each fingerprint was found
- The actual hash values
WFP File Format
A WFP (Winnowing FingerPrint) file is a simple, machine-readable yet human-readable format that contains fingerprints for source code files.
Structure
The .wfp file contains:
- File declarations with metadata
- Fingerprints organised by line number
Example WFP File
Format Components
File Declaration:
- MD5_HASH: MD5 checksum of the entire file (for exact file matching)
- FILE_SIZE: File size in bytes
- FILE_PATH: Relative path to the file
Fingerprint Lines:
- LINE_NUMBER: The line number where these fingerprints were found
- HASH: CRC32C checksum values representing code at that line
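To make the components above concrete, here is a minimal sketch that assembles them into WFP-style text. The exact syntax is an assumption: the sketch writes the declaration as `file=MD5_HASH,FILE_SIZE,FILE_PATH` and each fingerprint line as `LINE_NUMBER=HASH,HASH,...` with 8-digit lowercase hexadecimal hashes. The function name `format_wfp` and the sample values are hypothetical.

```python
import hashlib
from typing import Dict, List


def format_wfp(path: str, content: bytes, fingerprints_by_line: Dict[int, List[int]]) -> str:
    """Render one file's fingerprints as WFP-style text.

    Assumed layout (not confirmed by the text above):
      file=<md5>,<size>,<path>
      <line>=<hash>,<hash>,...
    """
    md5 = hashlib.md5(content).hexdigest()
    lines = [f"file={md5},{len(content)},{path}"]
    for line_no in sorted(fingerprints_by_line):
        hashes = ",".join(f"{h:08x}" for h in fingerprints_by_line[line_no])
        lines.append(f"{line_no}={hashes}")
    return "\n".join(lines) + "\n"


# Illustrative values only; real hashes come from the winnowing steps.
print(format_wfp("src/a.c", b"hello", {3: [0x1A2B3C4D], 1: [1, 255]}), end="")
```

Keeping one declaration per file followed by its line entries means several files can be concatenated into a single .wfp document.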
Fingerprinting with SCANOSS-PY
Basic Fingerprinting
Generate fingerprints for a file or directory:
Fingerprint a Specific File
Output to File
Save fingerprints to a specific file:
What Programming Languages are Supported?
Fingerprinting works with any text-based programming language because it operates on normalised character sequences, not language syntax. It has been tested extensively on:
- C/C++
- Java
- JavaScript/TypeScript
- Python
- Ruby
- Go
- Rust
- PHP
- And many others
Files Skipped During Fingerprinting
By default, SCANOSS skips fingerprinting for certain file types that are not suitable for code matching.
Binary and Archive Files
.exe, .zip, .tar, .tgz, .gz, .7z, .rar
.jar, .war, .ear, .whl, .bin, .app, .out
.class, .pyc, .o, .a, .so, .obj, .dll, .lib
.doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf
.odt, .ods, .odp, .pages, .key, .numbers
.json, .xml, .html, .htm, .dat, .lst, .xsd, .pom, .mf, .sum
.md, .txt, .min.js, .woff, .woff2
To fingerprint these file types as well, use the --all-extensions flag.
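The skip rule can be pictured as a simple suffix check. The sketch below is illustrative rather than the SCANOSS-PY implementation: `is_skipped` and its `all_extensions` parameter are hypothetical names mirroring the flag, and case-insensitive matching is an assumption.

```python
# Suffixes taken from the list above.
SKIPPED_SUFFIXES = (
    ".exe", ".zip", ".tar", ".tgz", ".gz", ".7z", ".rar",
    ".jar", ".war", ".ear", ".whl", ".bin", ".app", ".out",
    ".class", ".pyc", ".o", ".a", ".so", ".obj", ".dll", ".lib",
    ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".pdf",
    ".odt", ".ods", ".odp", ".pages", ".key", ".numbers",
    ".json", ".xml", ".html", ".htm", ".dat", ".lst", ".xsd", ".pom", ".mf", ".sum",
    ".md", ".txt", ".min.js", ".woff", ".woff2",
)


def is_skipped(filename: str, all_extensions: bool = False) -> bool:
    """Return True if the file would be skipped under the default rules.

    Assumption: matching is case-insensitive and based on the filename suffix,
    which also covers compound suffixes such as ".min.js".
    """
    if all_extensions:
        return False
    return filename.lower().endswith(SKIPPED_SUFFIXES)


print(is_skipped("src/main.c"))        # prints: False
print(is_skipped("dist/app.min.js"))   # prints: True
print(is_skipped("README.md", all_extensions=True))  # prints: False
```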
Fingerprinting vs Scanning
Fingerprinting creates the WFP file but doesn’t compare it against the SCANOSS knowledge base. It’s useful when you want to:
- Generate fingerprints for later analysis
- Create a WFP file to share or archive
- Understand what data will be sent during a scan