SCANOSS uses fingerprinting to identify and match code snippets against its knowledge base, enabling accurate open source component detection and license compliance analysis. The Winnowing algorithm has been used for many years in academia to detect plagiarism by comparing fingerprints of a document against those of known texts and source code. SCANOSS adopted the algorithm because of its wide acceptance and proven effectiveness in code comparison. The SCANOSS implementation generates a WFP (Winnowing FingerPrint) for each file, containing file metadata and a series of hash values that represent the code’s distinctive characteristics.

The Winnowing Algorithm

The Winnowing algorithm converts source code into fingerprints through four key steps:
  1. Normalisation
  2. Gram Fingerprinting
  3. Window Selection
  4. Output Formatting

Normalisation

The normalisation process removes every non-alphanumeric character from the input source code (spaces, punctuation, operators, and other special characters) and converts the remaining characters to lowercase.

Example

for (uint32_t i = 0; i < src_len; i++)
{
    if (src[i] == '\n') line++;
    uint8_t byte = normalize(src[i]);
    if (!byte) continue;
    gram[gram_ptr++] = byte;
}

After Normalisation

foruint32ti0isrcleniifsrcinlineuint8tbytenormalizesrciifbytecontinuegramgramptrbyteifgramptrgramwindowwindowptrcalccrc32c...
All spaces, operators, brackets, and punctuation are removed, leaving only lowercase alphanumeric characters. The trailing portion of the output comes from code that follows the excerpt and is not shown here.
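The normalisation rule described above can be sketched in a few lines of Python. This is an illustration only, not the SCANOSS source: the function name `normalise` is hypothetical, and restricting the filter to ASCII letters and digits is an assumption.

```python
def normalise(source: str) -> str:
    """Keep only ASCII letters and digits, converted to lowercase."""
    kept = []
    for ch in source:
        # Assumption: only ASCII alphanumerics survive. str.isalnum() on its
        # own would also accept non-ASCII letters and digits.
        if ch.isascii() and ch.isalnum():
            kept.append(ch.lower())
    return "".join(kept)


print(normalise("for (uint32_t i = 0; i < src_len; i++)"))
# prints: foruint32ti0isrcleni
```

The printed string matches the start of the normalised output shown in the example.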

Gram Fingerprinting

From the normalised code, overlapping data samples (called grams) are taken and fingerprinted. SCANOSS uses:
  • GRAM size: 30 bytes
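  • Hash algorithm: CRC32C checksum (hardware-accelerated on most modern Intel processors through the SSE4.2 crc32 instruction)
Each gram is a 30-byte sequence taken from the normalised code. Consecutive grams overlap, each starting one byte after the previous one, and every gram is hashed with CRC32C.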

Example

foruint32t = 1adf644b
oruint32ti = 6f72669d
ruint32ti0 = 88ad5ece
uint32ti0i = d368b44c
int32ti0is = 2123892a
nt32ti0isr = 336cdfdd
t32ti0isrc = 1c8e832d
For readability, this example uses 10-byte grams, each of which is hashed to its own CRC32C value. SCANOSS uses 30-byte grams in production.
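The gram step can be sketched as follows. Python's standard library has no CRC32C function (`zlib.crc32` uses a different polynomial), so this sketch includes a small table-driven CRC32C. The names `crc32c` and `gram_fingerprints` are hypothetical, and the code is an illustration rather than the SCANOSS implementation.

```python
def _build_table() -> list:
    """Precompute the 256-entry lookup table for CRC32C (Castagnoli)."""
    table = []
    for n in range(256):
        crc = n
        for _ in range(8):
            # 0x82F63B78 is the reflected form of the CRC32C polynomial.
            crc = (crc >> 1) ^ 0x82F63B78 if crc & 1 else crc >> 1
        table.append(crc)
    return table


_TABLE = _build_table()


def crc32c(data: bytes) -> int:
    """Return the CRC32C checksum of data as an unsigned 32-bit integer."""
    crc = 0xFFFFFFFF
    for byte in data:
        crc = _TABLE[(crc ^ byte) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


def gram_fingerprints(normalised: str, gram: int = 30) -> list:
    """Hash every overlapping gram-sized slice, advancing one byte at a time."""
    data = normalised.encode("ascii")
    return [crc32c(data[i:i + gram]) for i in range(len(data) - gram + 1)]


# Standard CRC32C check value for the ASCII string "123456789".
assert crc32c(b"123456789") == 0xE3069283
```

An input shorter than one gram produces no fingerprints in this sketch.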

Window Selection

From the series of gram fingerprints, a sliding window is applied to select representative hashes:
  • WINDOW size: 64 gram fingerprints
  • Selection method: Choose the minimum hash from each window
Because the minimum is always selected, the chosen hashes skew toward low values. To compensate and keep the distribution uniform in database indexes, SCANOSS hashes each selected value again, storing a checksum of the checksum (double hashing).
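A minimal sketch of the window step, under stated assumptions: the function name `select_fingerprints` is hypothetical, ties are broken by taking the leftmost minimum, and a minimum is emitted only when it sits at a different position from the previously selected one. The second hashing pass is supplied as the `rehash` callable, which defaults to the identity so the example stays self-contained.

```python
def select_fingerprints(hashes, window=64, rehash=lambda h: h):
    """Slide a window over the gram hashes and keep each window's minimum."""
    selected = []
    last_index = -1  # position of the most recently selected hash
    for start in range(len(hashes) - window + 1):
        chunk = hashes[start:start + window]
        smallest = min(chunk)
        # Assumption: on ties, use the leftmost occurrence of the minimum.
        index = start + chunk.index(smallest)
        if index != last_index:
            # A new position became the minimum, so record it once.
            selected.append(rehash(smallest))
            last_index = index
    return selected


print(select_fingerprints([5, 3, 8, 1, 9, 2], window=3))
# prints: [3, 1]
```

This version rescans every window and is written for clarity; a production implementation would typically track the minimum incrementally.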

Why These Values?

The values gram=30 and window=64 were chosen after extensive testing across multiple programming languages (C, Java, JavaScript, Ruby) to provide the optimal balance between:
  • Footprint: Number of fingerprints generated (affects storage and performance)
  • Uniformity: Even distribution of hash values (prevents database index imbalance)
  • Match accuracy: Ability to find matches even in modified code
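To make the trade-off concrete, the published analysis of the general Winnowing algorithm gives two useful figures. Any shared run of at least window + gram - 1 normalised characters is certain to contribute a selected fingerprint, and with well-distributed hashes roughly 2 / (window + 1) of all gram hashes are selected. The sketch below just evaluates those formulas for gram=30 and window=64. It describes the generic algorithm, not measured SCANOSS behaviour, and the function names are hypothetical.

```python
def guarantee_threshold(gram: int, window: int) -> int:
    """Shortest shared run (in normalised characters) that is always detected."""
    return window + gram - 1


def expected_density(window: int) -> float:
    """Expected fraction of gram hashes selected, assuming random hash values."""
    return 2 / (window + 1)


print(guarantee_threshold(30, 64))     # prints 93
print(round(expected_density(64), 4))  # prints 0.0308
```

A larger window shrinks the footprint but raises the length a match must reach before detection is certain.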

Output Formatting

The fingerprints are formatted as a .wfp (Winnowing FingerPrint) file with:
  • File metadata (MD5 hash, file size, file path)
  • Line numbers where each fingerprint was found
  • The actual hash values
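As an illustration, the three pieces listed above can be assembled into WFP text like this. The sketch assumes the layout described under Format Components (a `file=` declaration followed by `line=hash,hash,...` entries) and that hashes are written as eight lowercase hex digits, as in the example file. The function name `format_wfp` is hypothetical.

```python
import hashlib


def format_wfp(path: str, content: bytes, fingerprints: dict) -> str:
    """Build WFP text from a path, file bytes, and {line_number: [hashes]}."""
    md5 = hashlib.md5(content).hexdigest()
    lines = [f"file={md5},{len(content)},{path}"]
    for line_no in sorted(fingerprints):
        # Assumption: each hash is printed as 8 zero-padded lowercase hex digits.
        hashes = ",".join(f"{h:08x}" for h in fingerprints[line_no])
        lines.append(f"{line_no}={hashes}")
    return "\n".join(lines) + "\n"


print(format_wfp("test.c", b"int main(void) { return 0; }\n", {1: [0x688C09FE]}), end="")
```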

WFP File Format

A WFP (Winnowing FingerPrint) file is a simple plain-text format, both machine-readable and human-readable, that contains the fingerprints of source code files.

Structure

The .wfp file contains:
  1. File declarations with metadata
  2. Fingerprints organised by line number

Example WFP File

file=34cff02ed13a3d26e716e473d4e8900d,948,test.c
3=688c09fe,fc6d701d,61b2b37c
5=5f7b1b19,99181ce1,79923cb2,64691599
6=f218cd1c
8=7cf9f396,17c3dd99
10=3a693f60,fb9493ca,54fc128c
12=6f8dfa99,d3f3a3ca,04a0062b
13=bccec1a8,1657ceac
15=4dde1f15,a4c8bf7a
16=b657086d,39b9f206,bec983db,2978bdfa
18=1fb6cdda
20=c18636e3,47091215,7f040b14

Format Components

File Declaration:
file=<MD5_HASH>,<FILE_SIZE>,<FILE_PATH>
  • MD5_HASH: MD5 checksum of the entire file (for exact file matching)
  • FILE_SIZE: File size in bytes
  • FILE_PATH: Relative path to the file
Fingerprint Lines:
<LINE_NUMBER>=<HASH1>,<HASH2>,<HASH3>,...
  • LINE_NUMBER: The line number where these fingerprints were found
  • HASH: CRC32C checksum values representing code at that line
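The two line types described above are simple enough to parse by hand. The following sketch reads WFP text into a list of dictionaries. The function name `parse_wfp` and the dictionary keys are hypothetical, and lines that match neither pattern are silently ignored, which is an assumption rather than documented behaviour.

```python
def parse_wfp(text: str) -> list:
    """Parse WFP text into a list of per-file dictionaries."""
    files = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        if key == "file":
            # Split at most twice so a path containing commas stays intact.
            md5, size, path = value.split(",", 2)
            files.append(
                {"md5": md5, "size": int(size), "path": path, "fingerprints": {}}
            )
        elif key.isdigit() and files:
            # Fingerprint line: attach the hashes to the most recent file.
            files[-1]["fingerprints"][int(key)] = value.split(",")
    return files


sample = "file=34cff02ed13a3d26e716e473d4e8900d,948,test.c\n6=f218cd1c\n"
print(parse_wfp(sample)[0]["fingerprints"])
# prints: {6: ['f218cd1c']}
```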

Fingerprinting with SCANOSS-PY

Basic Fingerprinting

Generate fingerprints for a file or directory:
scanoss-py fingerprint /path/to/code

Fingerprint a Specific File

scanoss-py fingerprint /path/to/file.py

Output to File

Save fingerprints to a specific file:
scanoss-py fingerprint /path/to/code -o fingerprints.wfp

What Programming Languages are Supported?

Fingerprinting works with any text-based programming language because it operates on normalised character sequences rather than on language syntax. It has been tested extensively on:
  • C/C++
  • Java
  • JavaScript/TypeScript
  • Python
  • Ruby
  • Go
  • Rust
  • PHP
  • And many others

Files Skipped During Fingerprinting

By default, SCANOSS skips fingerprinting for certain file types that are not suitable for code matching.
Binary and Archive Files
  • .exe, .zip, .tar, .tgz, .gz, .7z, .rar
  • .jar, .war, .ear, .whl, .bin, .app, .out
Compiled and Object Files
  • .class, .pyc, .o, .a, .so, .obj, .dll, .lib
Document and Office Files
  • .doc, .docx, .xls, .xlsx, .ppt, .pptx, .pdf
  • .odt, .ods, .odp, .pages, .key, .numbers
Data and Configuration Files
  • .json, .xml, .html, .htm, .dat, .lst, .xsd, .pom, .mf, .sum
Other Text and Web Assets
  • .md, .txt, .min.js, .woff, .woff2
You can override this behavior using the --all-extensions flag.
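A suffix check of this kind can be sketched as below, using the extensions listed above. It is an illustration, not the scanoss-py filter: the names `SKIPPED_SUFFIXES` and `is_skipped` are hypothetical, and matching case-insensitively on the end of the path is an assumption.

```python
SKIPPED_SUFFIXES = (
    # Binary and archive files
    ".exe", ".zip", ".tar", ".tgz", ".gz", ".7z", ".rar",
    ".jar", ".war", ".ear", ".whl", ".bin", ".app", ".out",
    # Compiled and object files
    ".class", ".pyc", ".o", ".a", ".so", ".obj", ".dll", ".lib",
    # Document and office files
    ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".pdf",
    ".odt", ".ods", ".odp", ".pages", ".key", ".numbers",
    # Data and configuration files
    ".json", ".xml", ".html", ".htm", ".dat", ".lst", ".xsd", ".pom", ".mf", ".sum",
    # Other text and web assets
    ".md", ".txt", ".min.js", ".woff", ".woff2",
)


def is_skipped(path: str) -> bool:
    """Return True if the path ends with one of the skipped suffixes."""
    # Assumption: comparison is case-insensitive.
    return path.lower().endswith(SKIPPED_SUFFIXES)


print(is_skipped("dist/app.min.js"), is_skipped("src/app.js"))
# prints: True False
```

Matching on the full suffix rather than the last extension is what lets `.min.js` be skipped while ordinary `.js` files are still fingerprinted.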

Fingerprinting vs Scanning

Fingerprinting creates the WFP file but doesn’t compare it against the SCANOSS knowledge base. It’s useful when you want to:
  • Generate fingerprints for later analysis
  • Create a WFP file to share or archive
  • Understand what data will be sent during a scan
Scanning performs fingerprinting AND compares the results against the SCANOSS knowledge base to identify components, licenses, and vulnerabilities.
To scan using a pre-generated WFP file:
scanoss-py scan --wfp fingerprints.wfp