Visualizing billions of molecules
How do you explore 9.6 billion molecules on a laptop? You compress them, cluster them, and build navigable maps. This tutorial walks through the entire pipeline, from SMILES strings to interactive visualizations, using nothing but a 64 GB workstation.
The problem
Chemical libraries are growing faster than our ability to explore them.
The Enamine REAL database alone has 9.6 billion synthesizable molecules. Loading their fingerprints into memory would take terabytes. Traditional visualization tools tap out at millions of points. We needed something different.
The solution
Compress, cluster, then visualize in layers.
Product Quantization shrinks fingerprints by 28x. PQk-means clusters the compressed codes directly. Nested TMAPs let you zoom from a global overview down to individual molecules. The whole thing runs on commodity hardware.
Scale: 9.6B molecules
Clusters: ~100k centroids
Memory footprint: 57.9 GB
from chelombus import DataStreamer, FingerprintCalculator, PQKMeans
from chelombus.encoder.encoder import PQEncoder
# Stream molecules in manageable chunks
ds = DataStreamer()
smiles_chunk = next(ds.parse_input("molecules.smi", chunksize=100_000))
# Calculate fingerprints
calc = FingerprintCalculator()
fingerprints = calc.FingerprintFromSmiles(smiles_chunk, "mqn")
# Compress with Product Quantization
encoder = PQEncoder(k=256, m=6, iterations=20)
encoder.fit(fingerprints)
pq_codes = encoder.transform(fingerprints)
# Cluster at scale
clusterer = PQKMeans(encoder=encoder, k=100_000, iterations=20)
labels = clusterer.fit_predict(pq_codes)

Five steps from SMILES to maps
Each section below covers one stage of the pipeline. The code snippets are copy-paste ready: run them in sequence and you'll have your own nested TMAP by the end.
1. Fingerprints
Turn molecules into numbers
Every molecule needs a numerical representation before we can do anything useful with it. Chelombus supports two fingerprint types: Morgan fingerprints (hashed bit vectors) and MQN descriptors (42 interpretable molecular counts). For clustering, we must use MQN: Euclidean distances are meaningful between MQN vectors, but not between binary fingerprints. Applying PCA to binary fingerprints to force them into a continuous space does not fix this. The problem is structural: you end up compressing the zeros (what molecules lack) and then trying to infer similarity from that absence. It is akin to the black raven paradox: just as you cannot learn what a black raven is by observing red apples, you cannot recover meaningful molecular similarity by embedding binary absence into a continuous space.
- Morgan fingerprints: configurable size and radius, good for similarity searching
- MQN fingerprints: fixed 42 dimensions covering atoms, bonds, polarity, and topology
- Both work seamlessly with the rest of the pipeline
Morgan default: 1024 bits, radius 2
MQN dimensions: 42
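To see concretely why Euclidean distance suits counts but not bits, here is a small NumPy illustration with toy vectors (not real fingerprints): on binary vectors, squared Euclidean distance collapses to a Hamming count of differing bits, while on MQN-style count vectors it reflects graded differences in interpretable quantities.

```python
import numpy as np

# Two toy 8-bit binary fingerprints (stand-ins for Morgan bits)
a_bits = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)
b_bits = np.array([1, 1, 0, 1, 0, 0, 0, 0], dtype=float)

# Squared Euclidean distance on binary vectors is just the number
# of differing bits (a Hamming count): every set bit weighs equally
sq_euclid = np.sum((a_bits - b_bits) ** 2)
hamming = np.sum(a_bits != b_bits)
print(sq_euclid, hamming)  # 3.0 3

# Toy MQN-style count vectors: distances reflect *how much* two
# molecules differ in each interpretable quantity
a_counts = np.array([6, 1, 0, 2], dtype=float)  # e.g. C, O, N, rings
b_counts = np.array([9, 2, 1, 3], dtype=float)
print(np.linalg.norm(a_counts - b_counts))  # graded, not just 0/1
```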
from chelombus import FingerprintCalculator
calc = FingerprintCalculator()
smiles = ["CCO", "c1ccccc1", "CC(=O)O", "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"]
fingerprints = calc.FingerprintFromSmiles(smiles, "mqn")
print(f"Shape: {fingerprints.shape}") # (4, 42)2. Product Quantization
Compress vectors without losing what matters
Here's the problem: 9.6 billion 42-dimensional vectors would eat hundreds of gigabytes. Product Quantization fixes this by splitting each vector into subspaces, learning a codebook for each, and storing tiny indices instead of full floats. The result? 28x compression while preserving distance relationships almost perfectly.
- Splits vectors into m subspaces (6 works well for MQN)
- Learns k centroids per subspace (256 is the sweet spot)
- Euclidean distances stay accurate (Pearson r ≈ 0.99)
Compression: 28x
Distance fidelity: r ≈ 0.99
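Before reaching for the library call, it helps to see the encoding step itself. The sketch below is a minimal NumPy illustration of PQ encoding with random stand-in codebooks (real codebooks come from k-means on training data); the 28x figure falls straight out of the storage math, since 42 float32 dimensions become 6 uint8 codes.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, k = 1000, 42, 6, 256          # molecules, dims, subspaces, centroids
X = rng.random((n, d)).astype(np.float32)
sub = d // m                            # 7 dimensions per subspace

# Stand-in codebooks; in practice these are learned by running
# k-means on each subspace of a training sample
codebooks = rng.random((m, k, sub)).astype(np.float32)

# Encode: per subspace, store only the index of the nearest centroid
codes = np.empty((n, m), dtype=np.uint8)
for j in range(m):
    chunk = X[:, j * sub:(j + 1) * sub]
    dists = ((chunk[:, None, :] - codebooks[j][None, :, :]) ** 2).sum(-1)
    codes[:, j] = dists.argmin(axis=1)

# 42 float32 dims (168 bytes) -> 6 uint8 codes (6 bytes) per molecule
print(X.nbytes // codes.nbytes)  # 28
```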
from chelombus.encoder.encoder import PQEncoder
# 6 subspaces x 256 centroids = 6-byte codes per molecule
encoder = PQEncoder(k=256, m=6, iterations=20)
encoder.fit(training_fingerprints)
# Transform to compact codes
pq_codes = encoder.transform(fingerprints)
print(f"Original: {fingerprints.nbytes:,} bytes")
print(f"Compressed: {pq_codes.nbytes:,} bytes")3. PQk-Means Clustering
Group billions of molecules into meaningful clusters
Standard k-means chokes on billions of points. PQk-means operates directly on the compressed codes, computing approximate distances without ever decompressing. The math works out surprisingly well, and clusters end up tight and chemically coherent. Note that this step is the main bottleneck of the pipeline: the more molecules or clusters you pass, the longer it takes. Clustering 1 billion molecules into 10k clusters can be done in under a day on a desktop computer, but pushing to 100k clusters takes more than a couple of days.
- Works directly on PQ codes, no decompression needed
- Scales to billions with reasonable memory
- Clusters show low dispersion in molecular properties
Clusters: ~100,000
Avg cluster size: 87k molecules
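The trick that lets PQk-means avoid decompression is precomputed distance tables: per subspace, a k x k table of squared centroid-to-centroid distances, so the distance between any two compressed molecules reduces to m table lookups. A minimal sketch of the idea with random stand-in codebooks (not chelombus internals):

```python
import numpy as np

rng = np.random.default_rng(1)
m, k, sub = 6, 256, 7
codebooks = rng.random((m, k, sub)).astype(np.float32)

# Precompute, per subspace, a k x k table of squared distances
# between every pair of centroids -- done once, reused for all points
tables = np.stack([
    ((cb[:, None, :] - cb[None, :, :]) ** 2).sum(-1) for cb in codebooks
])

def approx_sq_dist(code_a, code_b):
    # Distance between two compressed molecules: six table lookups,
    # never touching the original 42-dimensional vectors
    return sum(tables[j, code_a[j], code_b[j]] for j in range(m))

a = rng.integers(0, k, m)  # two random 6-byte PQ codes
b = rng.integers(0, k, m)
print(approx_sq_dist(a, b))
```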
from chelombus import PQKMeans
# Cluster into 100k groups
clusterer = PQKMeans(encoder=encoder, k=100_000, iterations=20)
labels = clusterer.fit_predict(pq_codes)
print(f"Assigned {len(labels):,} molecules to {clusterer.k:,} clusters")4. Cluster I/O
Query and export your clusters
Once clustering finishes, you'll want to poke around. Chelombus stores results in Parquet files and provides helpers to query specific clusters, grab random samples, or export slices for downstream analysis. Everything runs through DuckDB, so queries are fast even on large datasets. For the first drafts of the project I stored everything in a ClickHouse database, which worked well but made the process more error-prone and time-consuming. DuckDB achieves much the same without leaving your notebook.
- Get cluster statistics with a single call
- Query or sample from any cluster by ID
- Export individual clusters to Parquet
Storage format: Parquet
Query engine: DuckDB
from chelombus import get_cluster_stats, query_cluster, sample_from_cluster
# See what we've got
stats = get_cluster_stats("results/")
print(stats.head())
# Grab molecules from a specific cluster
cluster_42 = query_cluster("results/", cluster_id=42)
# Random sample for visualization
sample = sample_from_cluster("results/", cluster_id=42, n=1000)

5. TMAP Visualization
See the chemical space
TMAPs turn high-dimensional data into explorable 2D maps. You get a primary map showing cluster representatives, then drill down into individual clusters. The result is a two-level navigation system: start with the 100k-node overview, click a region, and explore the molecules inside.
- create_tmap: build maps from raw SMILES
- representative_tmap: visualize cluster centroids with labels
- Output is standalone HTML, no backend required
Output: Interactive HTML
Properties: 8 molecular descriptors
from chelombus.utils.visualization import create_tmap, representative_tmap
# Map from raw SMILES
create_tmap(
smiles=smiles_list,
fingerprint="mqn",
properties=["hac", "mw", "clogp", "num_rings"],
tmap_name="cluster_42",
)
# Map of cluster representatives
representative_tmap(
smiles=representative_smiles,
cluster_ids=cluster_ids,
tmap_name="primary_map",
)

Roadmap
Chelombus works, but there's room to make it faster and more flexible. Here's what I'm thinking about for future versions.
Questions?
I'm happy to help.
If you're trying to run the pipeline on your own data or hitting weird errors, reach out. I'll do my best to point you in the right direction.