Visualizing billions of molecules
How do you explore 9.6 billion molecules on a laptop? You compress them, cluster them, and build navigable maps. This tutorial walks through the entire pipeline, from SMILES strings to interactive visualizations, using nothing but a 64 GB workstation.
The problem
Chemical libraries are growing faster than our ability to explore them.
The Enamine REAL database alone has 9.6 billion synthesizable molecules. Loading their fingerprints into memory would take terabytes. Traditional visualization tools tap out at millions of points. We needed something different.
The solution
Compress, cluster, then visualize in layers.
Product Quantization shrinks fingerprints by 28x. PQk-means clusters the compressed codes directly. Nested TMAPs let you zoom from a global overview down to individual molecules. The whole thing runs on commodity hardware.
Scale: 9.6B molecules
Clusters: ~100k centroids
Memory footprint: 57.9 GB
from chelombus import DataStreamer, FingerprintCalculator, PQKMeans
from chelombus.encoder.encoder import PQEncoder
# Stream molecules in manageable chunks
ds = DataStreamer()
smiles_chunk = next(ds.parse_input("molecules.smi", chunksize=100_000))
# Calculate fingerprints
calc = FingerprintCalculator()
fingerprints = calc.FingerprintFromSmiles(smiles_chunk, "mqn")
# Compress with Product Quantization
encoder = PQEncoder(k=256, m=6, iterations=20)
encoder.fit(fingerprints)
pq_codes = encoder.transform(fingerprints)
# Cluster at scale
clusterer = PQKMeans(encoder=encoder, k=100_000, iterations=20)
labels = clusterer.fit_predict(pq_codes)

Five steps from SMILES to maps
Each section below covers one stage of the pipeline. The code snippets are copy-paste ready: run them in sequence and you'll have your own nested TMAP by the end.
1. Fingerprints
Turn molecules into numbers
Every molecule needs a numerical representation before we can do anything useful with it. Chelombus supports two fingerprint types: Morgan fingerprints (hashed bit vectors) and MQN descriptors (42 interpretable molecular counts). For clustering, we must use MQN: Euclidean distances are meaningful between MQN vectors, but not between binary fingerprints. Applying PCA to binary fingerprints to force them into a continuous space does not fix this. The problem is structural: you end up compressing the zeros (what molecules lack) and then trying to infer similarity from that absence. It is akin to the black raven paradox: just as you cannot learn what a black raven is by observing red apples, you cannot recover meaningful molecular similarity by embedding binary absence into a continuous space.
- Morgan fingerprints: configurable size and radius, good for similarity searching
- MQN fingerprints: fixed 42 dimensions covering atoms, bonds, polarity, and topology
- Both work seamlessly with the rest of the pipeline
Morgan default: 1024 bits, radius 2
MQN dimensions: 42
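To see concretely why Euclidean distance suits counts but not bits, here is a small NumPy illustration with toy vectors (not real fingerprints): on binary vectors, squared Euclidean distance collapses to a Hamming count of differing bits, while on MQN-style count vectors it reflects graded differences in interpretable quantities.

```python
import numpy as np

# Two toy 8-bit binary fingerprints (stand-ins for Morgan bits)
a_bits = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)
b_bits = np.array([1, 1, 0, 1, 0, 0, 0, 0], dtype=float)

# Squared Euclidean distance on binary vectors is just the number
# of differing bits (a Hamming count): every set bit weighs equally
sq_euclid = np.sum((a_bits - b_bits) ** 2)
hamming = np.sum(a_bits != b_bits)
print(sq_euclid, hamming)  # 3.0 3

# Toy MQN-style count vectors: distances reflect *how much* two
# molecules differ in each interpretable quantity
a_counts = np.array([6, 1, 0, 2], dtype=float)  # e.g. C, O, N, rings
b_counts = np.array([9, 2, 1, 3], dtype=float)
print(np.linalg.norm(a_counts - b_counts))  # graded, not just 0/1
```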
from chelombus import FingerprintCalculator
calc = FingerprintCalculator()
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"]
fingerprints = calc.FingerprintFromSmiles(smiles, "mqn")
print(f"Shape: {fingerprints.shape}") # (4, 42)2. Product Quantization
Compress vectors without losing what matters
Here's the problem: 9.6 billion 42-dimensional vectors would eat hundreds of gigabytes. Product Quantization fixes this by splitting each vector into subspaces, learning a codebook for each, and storing tiny indices instead of full floats. The result? 28x compression while preserving distance relationships almost perfectly.
- Splits vectors into m subspaces (6 works well for MQN)
- Learns k centroids per subspace (256 is the sweet spot)
- Euclidean distances stay accurate (Pearson r ≈ 0.99)
Compression: 28x
Distance fidelity: r ≈ 0.99
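Before reaching for the library call, it helps to see the encoding step itself. The sketch below is a minimal NumPy illustration of PQ encoding with random stand-in codebooks (real codebooks come from k-means on training data); the 28x figure falls straight out of the storage math, since 42 float32 dimensions become 6 uint8 codes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, k = 1000, 42, 6, 256          # molecules, dims, subspaces, centroids
X = rng.random((n, d)).astype(np.float32)
sub = d // m                            # 7 dimensions per subspace

# Stand-in codebooks; in practice these are learned by running
# k-means on each subspace of a training sample
codebooks = rng.random((m, k, sub)).astype(np.float32)

# Encode: per subspace, store only the index of the nearest centroid
codes = np.empty((n, m), dtype=np.uint8)
for j in range(m):
    chunk = X[:, j * sub:(j + 1) * sub]
    dists = ((chunk[:, None, :] - codebooks[j][None, :, :]) ** 2).sum(-1)
    codes[:, j] = dists.argmin(axis=1)

# 42 float32 dims (168 bytes) -> 6 uint8 codes (6 bytes) per molecule
print(X.nbytes // codes.nbytes)  # 28
```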
from chelombus.encoder.encoder import PQEncoder
# 6 subspaces x 256 centroids = 6-byte codes per molecule
encoder = PQEncoder(k=256, m=6, iterations=20)
encoder.fit(training_fingerprints)
# Transform to compact codes
pq_codes = encoder.transform(fingerprints)
print(f"Original: {fingerprints.nbytes:,} bytes")
print(f"Compressed: {pq_codes.nbytes:,} bytes")3. PQk-Means Clustering
Group billions of molecules into meaningful clusters
Standard k-means chokes on billions of points. PQk-means operates directly on the compressed codes, computing approximate distances without ever decompressing. The math works out surprisingly well, and clusters end up tight and chemically coherent. Note that this step is the main bottleneck of the pipeline: the more molecules or clusters you pass, the longer it takes. Clustering 1 billion molecules into 10k clusters can be done in under a day on a desktop computer, but pushing to 100k clusters takes more than a couple of days.
- Works directly on PQ codes, no decompression needed
- Scales to billions with reasonable memory
- Clusters show low dispersion in molecular properties
Clusters: ~100,000
Avg cluster size: 87k molecules
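The trick that lets PQk-means avoid decompression is precomputed distance tables: per subspace, a k x k table of squared centroid-to-centroid distances, so the distance between any two compressed molecules reduces to m table lookups. A minimal sketch of the idea with random stand-in codebooks (not chelombus internals):

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, sub = 6, 256, 7
codebooks = rng.random((m, k, sub)).astype(np.float32)

# Precompute, per subspace, a k x k table of squared distances
# between every pair of centroids -- done once, reused for all points
tables = np.stack([
    ((cb[:, None, :] - cb[None, :, :]) ** 2).sum(-1) for cb in codebooks
])

def approx_sq_dist(code_a, code_b):
    # Distance between two compressed molecules: six table lookups,
    # never touching the original 42-dimensional vectors
    return sum(tables[j, code_a[j], code_b[j]] for j in range(m))

a = rng.integers(0, k, m)  # two random 6-byte PQ codes
b = rng.integers(0, k, m)
print(approx_sq_dist(a, b))
```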
from chelombus import PQKMeans
# Cluster into 100k groups
clusterer = PQKMeans(encoder=encoder, k=100_000, iterations=20)
labels = clusterer.fit_predict(pq_codes)
print(f"Assigned {len(labels):,} molecules to {clusterer.k:,} clusters")4. Cluster I/O
Query and export your clusters
Once clustering finishes, you'll want to poke around. Chelombus stores results in Parquet files and provides helpers to query specific clusters, grab random samples, or export slices for downstream analysis. Everything runs through DuckDB, so queries are fast even on large datasets. For the first drafts of the project I stored everything in a ClickHouse database, which worked well but made the process more error-prone and time-consuming. DuckDB achieves much the same without leaving your notebook.
- Get cluster statistics with a single call
- Query or sample from any cluster by ID
- Export individual clusters to Parquet
Storage format: Parquet
Query engine: DuckDB
from chelombus import get_cluster_stats, query_cluster, sample_from_cluster
# See what we've got
stats = get_cluster_stats("results/")
print(stats.head())
# Grab molecules from a specific cluster
cluster_42 = query_cluster("results/", cluster_id=42)
# Random sample for visualization
sample = sample_from_cluster("results/", cluster_id=42, n=1000)

5. TMAP Visualization
See the chemical space
TMAPs turn high-dimensional data into explorable 2D maps. You get a primary map showing cluster representatives, then drill down into individual clusters. The result is a two-level navigation system: start with the 100k-node overview, click a region, and explore the molecules inside.
- create_tmap: build maps from raw SMILES
- representative_tmap: visualize cluster centroids with labels
- Output is standalone HTML, no backend required
Output: Interactive HTML
Properties: 8 molecular descriptors
from chelombus.utils.visualization import create_tmap, representative_tmap
# Map from raw SMILES
create_tmap(
smiles=smiles_list,
fingerprint="mqn",
properties=["hac", "mw", "clogp", "num_rings"],
tmap_name="cluster_42",
)
# Map of cluster representatives
representative_tmap(
smiles=representative_smiles,
cluster_ids=cluster_ids,
tmap_name="primary_map",
)

Roadmap
Chelombus works, but there's room to make it faster and more flexible. Here's what I'm thinking about for future versions.
Questions?
I'm happy to help.
If you're trying to run the pipeline on your own data or hitting weird errors, reach out. I'll do my best to point you in the right direction.