Nested Tree-Map tutorial
The platform implements the Nested Tree-Map framework introduced in “Clustering and Visualization of the 9.6 Billion Enamine REAL Database with Nested Tree-Maps.” It tackles the challenge of visualizing multi-billion-molecule libraries on a single workstation by combining Product Quantization, PQk-means, and TMAP.
The challenge
Visualizing 9.6 billion molecules without a supercomputer means compressing fingerprints, clustering at scale, and keeping UI latency low enough for chemists to stay in flow.
The pipeline needs to stream fingerprints, respect Euclidean distances, and render multi-layer maps without GPU servers.
The solution
Product Quantization and PQk-means compress and cluster the library in memory, while Nested TMAPs surface the results as tiled WebGL maps.
Everything runs on a 64 GB workstation, and the final maps are served as static assets through Nginx.
Scale: 9.6B Enamine REAL molecules
Clusters: ≈120K PQk-means centroids
Working set: 57.9 GB of PQ codes
# Nested Tree-Map pipeline
mqn = compute_mqn(smiles_batch)
pq_codes = product_quantize(mqn, m=14, k=256)
clusters = pqkmeans(pq_codes, k=120_000)
maps = build_nested_tmaps(clusters)
serve_tiles(maps, cdn="cdn.chelombus.ai")
From MQN vectors to nested Tree-Maps
Each step below pairs narrative, visuals, and a runnable snippet so you can re-create the study or adapt it to your own libraries.
1. MQN Fingerprints
Interpretable 42-dimensional descriptors
Each molecule is converted into a 42-dimensional MQN vector that counts atoms, bonds, polar sites, and ring systems. MQN stays faithful to chemical intuition, so chemists can reason about the clusters that emerge later in the pipeline.
- Integer descriptors keep Euclidean distance meaningful and reproducible.
- Computation is linear in molecule count, which keeps ingestion streaming-friendly.
- Descriptor semantics stay stable across chemical classes, so comparisons remain apples-to-apples.
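The streaming-friendly ingestion mentioned above can be sketched with a small batching helper. This is a minimal illustration using only the standard library; `iter_batches` and the toy SMILES stream are hypothetical names, not part of the project's API, and a real run would pass each batch to the fingerprint calculator.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def iter_batches(items: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Toy input; in practice this could be a file object yielding one SMILES per line.
smiles_stream = iter(["CCO", "C1CCCCC1", "O=C=O", "CC(=O)O", "c1ccccc1"])
batches = list(iter_batches(smiles_stream, batch_size=2))
```

Because only one batch is held at a time, memory use depends on the batch size rather than on the size of the library.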
Descriptor count: 42 integers
Why MQN? Interpretable + Euclidean
# Define a list of SMILES strings
smiles_list = ["CCO", "C1CCCCC1", "O=C=O", "O=C=O"]
# Create an instance of FingerprintCalculator
calculator = FingerprintCalculator()
# Compute fingerprints for the list of SMILES strings
fingerprints = calculator.FingerprintFromSmiles(smiles_list, 'mqn')
2. Product Quantization
Compress billions of vectors without losing distance fidelity
Full MQN matrices for billions of molecules would explode past hundreds of gigabytes. Product Quantization (PQ) slices each vector into subspaces, learns codebooks, and stores compact PQ-codes that are hundreds of times smaller while maintaining relative distances.
- Compression makes 9.6 B molecules fit entirely in memory on a 64 GB workstation.
- Distance preservation is validated with Pearson r ≈ 0.99 between distances computed on the original vectors and on their PQ reconstructions.
- PQ codes are GPU- and CPU-friendly, so clustering can run anywhere.
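To make the "slice, learn codebooks, store codes" idea concrete, here is a toy sketch of PQ encoding and decoding in pure Python. It assumes the codebooks are already trained (in practice by k-means per subspace); the function names and the tiny 4-dimensional codebooks are illustrative assumptions, not the project's encoder.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def split_subvectors(vector: Vector, m: int) -> List[Vector]:
    """Cut a vector into m contiguous subvectors of equal width."""
    if len(vector) % m != 0:
        raise ValueError("vector length must be divisible by m")
    width = len(vector) // m
    return [vector[i * width:(i + 1) * width] for i in range(m)]


def pq_encode(vector: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Return, for each subspace, the index of the nearest codeword."""
    code = []
    for sub, codebook in zip(split_subvectors(vector, len(codebooks)), codebooks):
        nearest = min(range(len(codebook)), key=lambda j: math.dist(sub, codebook[j]))
        code.append(nearest)
    return code


def pq_decode(code: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Approximate the original vector by concatenating the chosen codewords."""
    reconstructed: List[float] = []
    for index, codebook in zip(code, codebooks):
        reconstructed.extend(codebook[index])
    return reconstructed


# Toy setup: 4-dimensional vectors, m=2 subspaces, k=2 codewords per subspace.
toy_codebooks = [
    [(0.0, 0.0), (10.0, 10.0)],  # codebook for dimensions 0-1
    [(1.0, 1.0), (5.0, 5.0)],    # codebook for dimensions 2-3
]
toy_code = pq_encode([9.0, 11.0, 2.0, 1.0], toy_codebooks)
toy_reconstruction = pq_decode(toy_code, toy_codebooks)
```

With k=256 codewords per subspace, each index fits in a single byte, so one encoded molecule occupies m bytes regardless of the width of the original vector.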
Working set: 57.9 GB
Accuracy: Pearson r ≈ 0.99
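One way to run a Pearson check like the one quoted above is to correlate pairwise distances before and after quantization. The sketch below does this on made-up 4-dimensional vectors; the data, the helper names, and the choice of all-pairs sampling are assumptions for illustration and do not reproduce the study's measurement.

```python
import math
from itertools import combinations
from typing import List, Sequence


def pairwise_distances(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Euclidean distance for every unordered pair of vectors."""
    return [math.dist(a, b) for a, b in combinations(vectors, 2)]


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up originals and their (hypothetical) PQ reconstructions.
original = [[0, 0, 1, 1], [9, 11, 2, 1], [1, 0, 5, 4], [10, 9, 6, 5]]
reconstructed = [[0, 0, 1, 1], [10, 10, 1, 1], [0, 0, 5, 5], [10, 10, 5, 5]]

r = pearson_r(pairwise_distances(original), pairwise_distances(reconstructed))
```

On a real library the pairs would be sampled rather than enumerated, since the number of pairs grows quadratically with the number of molecules.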
from spiq.encoder.encoder import PQEncoder
# m must split the 42 MQN dimensions evenly; m=14 matches the pipeline overview
pq_encoder = PQEncoder(k=256, m=14, iterations=10)
pq_encoder.fit(fingerprints)
3. PQk-Means Clustering
Clustering billions of molecules efficiently
PQk-means consumes the compressed codes directly. 9.6 billion molecules collapse into roughly 120 000 clusters with an average size of 87 000 molecules, all while respecting Euclidean and symmetric PQ distances.
- Distance metrics (Euclidean + symmetric PQ) confirm tight intra-cluster similarity.
- Dispersion analysis over molecular weight, ring count, HBD/HBA, and TPSA shows homogeneous clusters.
- Streaming updates keep centroids stable even as new batches arrive.
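The symmetric PQ distance mentioned above compares two codes without decoding them: per-subspace distances between codewords are precomputed once, and a distance becomes a sum of table lookups. The following is a minimal sketch of that idea with toy codebooks; the function names are hypothetical and the assignment step is simplified relative to a full PQk-means implementation.

```python
import math
from typing import List, Sequence

Codebook = Sequence[Sequence[float]]


def build_distance_tables(codebooks: Sequence[Codebook]) -> List[List[List[float]]]:
    """tables[s][i][j] = squared distance between codewords i and j of subspace s."""
    return [
        [[sum((a - b) ** 2 for a, b in zip(ci, cj)) for cj in codebook] for ci in codebook]
        for codebook in codebooks
    ]


def symmetric_distance(code_a: Sequence[int], code_b: Sequence[int], tables) -> float:
    """Distance between two PQ codes using table lookups only."""
    return math.sqrt(sum(tables[s][i][j] for s, (i, j) in enumerate(zip(code_a, code_b))))


def assign(code: Sequence[int], centroid_codes: Sequence[Sequence[int]], tables) -> int:
    """Index of the centroid code closest to the given code."""
    return min(
        range(len(centroid_codes)),
        key=lambda c: symmetric_distance(code, centroid_codes[c], tables),
    )


# Toy codebooks: m=2 subspaces, k=2 codewords each.
toy_codebooks = [
    [(0.0, 0.0), (3.0, 4.0)],
    [(1.0, 1.0), (1.0, 3.0)],
]
tables = build_distance_tables(toy_codebooks)
toy_distance = symmetric_distance([0, 0], [1, 1], tables)
toy_assignment = assign([1, 0], [[0, 0], [1, 1]], tables)
```

Each table holds k × k entries per subspace, so the lookup structure stays small even when the number of encoded molecules is very large.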
Clusters: ≈120 000
Avg size: 87 K molecules
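# Encode a batch of MQN fingerprints (X_test, same 42-column layout used for fitting) into PQ codes for clustering
PQ_codes = pq_encoder.transform(X_test)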
4. Representative Molecules
Centroid-nearest molecules preserve distributions
Each cluster is represented by the molecule closest to its centroid. The representative molecules mirror the statistical distribution of the full set, proving that clustering didn’t distort feature space.
- Centroid-nearest selection respects both topology and chemistry.
- Representatives make QA, labeling, and storytelling manageable.
- Downstream analytics can sample millions of compounds via a few thousand exemplars.
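Centroid-nearest selection can be illustrated on a toy cluster. In this sketch the centroid is taken as the arithmetic mean of the members, which is an assumption standing in for the PQk-means centroid; the function names and the data are hypothetical.

```python
import math
from typing import List, Sequence


def centroid_of(members: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of the member vectors."""
    n = len(members)
    return [sum(column) / n for column in zip(*members)]


def closest_to_centroid(members: Sequence[Sequence[float]]) -> int:
    """Index of the member with the smallest Euclidean distance to the centroid."""
    centroid = centroid_of(members)
    return min(range(len(members)), key=lambda i: math.dist(members[i], centroid))


# Toy cluster of 3-dimensional descriptor vectors, including one outlier.
toy_cluster = [[1, 0, 2], [2, 1, 2], [9, 9, 9], [2, 0, 3]]
representative_index = closest_to_centroid(toy_cluster)
```

The outlier pulls the mean toward itself but is never selected, because the representative must be an actual member lying nearest to the centroid.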
Coverage: All 9.6 B molecules
Bias: Matches original distributions
for cluster in clusters:
    centroid = cluster.centroid()
    representative = cluster.closest_to(centroid)
    assert representative.profile.matches(cluster.statistics)
5. Nested TMAPs
Linked cluster and molecule-level maps
Primary TMAP nodes correspond to clusters; edges encode similarity between representatives. Each cluster also receives a secondary TMAP with the molecules inside it. Everything is pre-rendered into static tiles, so exploration is instantaneous without backend queries.
- Primary map: navigate the cluster landscape at a glance.
- Secondary maps: deep dive into any cluster with the same interaction model.
- Static tiles + CDN delivery keep latency below 100 ms.
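TMAP, as published, lays out a minimum spanning tree built over a nearest-neighbor graph. The sketch below shows only the tree-building idea on a handful of toy points, using brute-force distances and Kruskal's algorithm in place of approximate neighbor search; it is not the TMAP library, and every name in it is illustrative.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def minimum_spanning_tree(points: Sequence[Sequence[float]]) -> List[Tuple[int, int]]:
    """Return MST edges (i, j) over the complete Euclidean graph (Kruskal)."""
    parent = list(range(len(points)))

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    tree = []
    for _, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # the edge joins two separate components
            parent[root_i] = root_j
            tree.append((i, j))
    return tree


# Toy stand-ins for cluster representatives.
toy_representatives = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.5), (5.0, 5.0)]
tree_edges = minimum_spanning_tree(toy_representatives)
```

The same routine could be applied once to the representatives for a primary tree and once per cluster for the secondary trees, which mirrors the nesting described above.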
Primary nodes: Cluster representatives
Secondary nodes: Per-cluster molecules
from chelombus.maps import NestedTMAP
primary = NestedTMAP.build(representatives)
secondary = {
    c.id: NestedTMAP.build(c.members)
    for c in clusters
}
NestedTMAP.write_tiles(primary, out_dir="tiles/primary")
From clustering scripts to WebGL maps
Python handles clustering and preprocessing, while a TypeScript + WebGL front-end streams tiled TMAP layers. Docker + Nginx host everything, and the full stack is open-source.
- Language: Python + TypeScript/WebGL
- Hosting: Docker + Nginx (static tiles)
- Repository: github.com/afloresep/chelombus-package
from chelombus.pipeline import ingest, quantize, cluster
with ingest("enamine-real.csv") as batch:
    codes = quantize(batch.mqn, m=14, k=256)
    cluster.update(codes)
cluster.export("s3://chelombus/centroids")
What’s next for Nested Tree-Maps
These are the roadmap items currently being explored. Scroll to skim them all, then pick the ones you want to prioritize.
Need a deeper dive?
Pair this tutorial with a live build or notebook session.
We can walk through the clustering code, the TMAP viewer, or help ingest your own datasets.