Chelombus
Overview

Nested Tree-Map tutorial

The platform implements the Nested Tree-Map framework introduced in “Clustering and Visualization of the 9.6 Billion Enamine REAL Database with Nested Tree-Maps.” It tackles the challenge of visualizing multi-billion-molecule libraries on a single workstation by combining Product Quantization, PQk-means, and TMAP.

The challenge

Visualizing 9.6 billion molecules without a supercomputer means compressing fingerprints, clustering at scale, and keeping UI latency low enough for chemists to stay in flow.

The pipeline needs to stream fingerprints, respect Euclidean distances, and render multi-layer maps without GPU servers.

The solution

Product Quantization + PQk-means compress and cluster in-memory, while Nested TMAPs surface the results with tiled WebGL maps.

Everything runs on a 64 GB workstation, and the final maps serve as static assets through Nginx.

Scale: 9.6B Enamine REAL molecules
Clusters: ≈120K PQk-means centroids
Working set: 57.9 GB of PQ codes

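pipeline.py (python)
# Nested Tree-Map pipeline (high-level outline)
mqn = compute_mqn(smiles_batch)                # 42-dimensional MQN fingerprints
pq_codes = product_quantize(mqn, m=14, k=256)  # compact PQ codes
clusters = pqkmeans(pq_codes, k=120_000)       # ≈120K clusters
maps = build_nested_tmaps(clusters)            # primary map + per-cluster maps
serve_tiles(maps, cdn="cdn.chelombus.ai")      # static tile delivery

Framework pillars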

From MQN vectors to nested Tree-Maps

Each step below pairs narrative, visuals, and a runnable snippet so you can re-create the study or adapt it to your own libraries.

1. MQN Fingerprints

Interpretable 42-dimensional descriptors

Each molecule is converted into a 42-dimensional MQN vector that counts atoms, bonds, polar sites, and ring systems. MQN stays faithful to chemical intuition, so chemists can reason about the clusters that emerge later in the pipeline.

  • Integer descriptors keep Euclidean distance meaningful and reproducible.
  • Computation is linear in molecule count, which keeps ingestion streaming-friendly.
  • Descriptor semantics stay stable across chemical classes, so comparisons remain apples-to-apples.
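
The first bullet can be made concrete with a minimal sketch of squared Euclidean distance on integer count vectors. The three-element vectors below are hypothetical stand-ins for full 42-dimensional MQN vectors, and their values are made up for illustration.

```python
from math import sqrt

def squared_euclidean(a, b):
    """Exact squared Euclidean distance between two integer count vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical, truncated descriptors: (carbon count, oxygen count, ring count).
ethanol_like = (2, 1, 0)
cyclohexane_like = (6, 0, 1)

# Integer arithmetic keeps the squared distance exact, with no rounding error.
d2 = squared_euclidean(ethanol_like, cyclohexane_like)
distance = sqrt(d2)
```

Because the inputs are integers, the squared distance is an exact integer on every platform; only the final square root introduces floating-point rounding.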

Descriptor count: 42 integers
Why MQN? Interpretable + Euclidean

Fingerprint Module (python)
# Define a list of SMILES strings
smiles_list = ["CCO", "C1CCCCC1", "O=C=O", "O=C=O"]

# Create an instance of FingerprintCalculator
calculator = FingerprintCalculator()

# Compute fingerprints for the list of SMILES strings
fingerprints = calculator.FingerprintFromSmiles(smiles_list, 'mqn')

2. Product Quantization

Compress billions of vectors without losing distance fidelity

Full MQN matrices for billions of molecules would explode past hundreds of gigabytes. Product Quantization (PQ) slices each vector into subspaces, learns codebooks, and stores compact PQ-codes that are hundreds of times smaller while maintaining relative distances.

  • Compression makes 9.6 B molecules fit entirely in memory on a 64 GB workstation.
  • Distance preservation is validated with Pearson r ≈ 0.99 between original and reconstructed vectors.
  • PQ codes are GPU- and CPU-friendly, so clustering can run anywhere.
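
As a rough back-of-the-envelope check on memory use, the size of a PQ code set can be estimated from the number of subspaces m and the codebook size k. The helper names below are hypothetical, and the example values are illustrative rather than the study's settings.

```python
from math import ceil, log2

def pq_code_bytes(m, k):
    """Bytes needed for one PQ code: m sub-codes of ceil(log2(k)) bits each."""
    bits_per_subcode = ceil(log2(k))
    return ceil(m * bits_per_subcode / 8)

def pq_working_set_gb(n_vectors, m, k):
    """Total size of all PQ codes in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_vectors * pq_code_bytes(m, k) / 1e9

# Illustrative values only: one billion vectors, 8 subspaces, 256 codewords each.
example_gb = pq_working_set_gb(n_vectors=1_000_000_000, m=8, k=256)
```

With k = 256 every sub-code fits in exactly one byte, so the working set grows linearly with both the number of molecules and m.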

Working set: 57.9 GB
Accuracy: Pearson r ≈ 0.99

Quantize (python)
from spiq.encoder.encoder import PQEncoder
pq_encoder = PQEncoder(k=256, m=4, iterations=10)
pq_encoder.fit(fingerprints)
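
For readers who want to see what fitting and encoding involve, here is a minimal pure-Python sketch of Product Quantization: split each vector into m sub-vectors, learn a k-means codebook per subspace, and store only codeword indices. It is not the `PQEncoder` implementation; every function name and the toy data are assumptions made for illustration.

```python
import random

def split(vec, m):
    """Split a vector into m equal-length sub-vectors."""
    if len(vec) % m != 0:
        raise ValueError("dimension must be divisible by m")
    step = len(vec) // m
    return [tuple(vec[i * step:(i + 1) * step]) for i in range(m)]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))

def kmeans(points, k, iterations, rng):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the previous centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(bucket) for c in zip(*bucket))
    return centroids

def fit_codebooks(vectors, m, k, iterations=10, seed=0):
    """Learn one codebook of k codewords for each of the m subspaces."""
    rng = random.Random(seed)
    per_subspace = zip(*(split(v, m) for v in vectors))
    return [kmeans(list(subs), k, iterations, rng) for subs in per_subspace]

def encode(vec, codebooks):
    """Replace each sub-vector with the index of its nearest codeword."""
    subs = split(vec, len(codebooks))
    return tuple(nearest(sub, cb) for sub, cb in zip(subs, codebooks))

def decode(code, codebooks):
    """Approximate reconstruction: concatenate the selected codewords."""
    out = []
    for idx, cb in zip(code, codebooks):
        out.extend(cb[idx])
    return out

# Toy data: four 4-dimensional vectors forming two obvious groups.
vectors = [(0, 0, 10, 10), (0, 1, 10, 11), (9, 9, 0, 0), (10, 9, 1, 0)]
codebooks = fit_codebooks(vectors, m=2, k=2)
codes = [encode(v, codebooks) for v in vectors]
```

Each code is a tuple of m small integers, so similar vectors end up with identical or nearby codes while the raw values no longer need to be stored.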

3. PQk-Means Clustering

Clustering billions of molecules efficiently

PQk-means consumes the compressed codes directly. 9.6 billion molecules collapse into roughly 120 000 clusters with an average size of 87 000 molecules, all while respecting Euclidean and symmetric PQ distances.

  • Distance metrics (Euclidean + symmetric PQ) confirm tight intra-cluster similarity.
  • Dispersion analysis over molecular weight, ring count, HBD/HBA, and TPSA shows homogeneous clusters.
  • Streaming updates keep centroids stable even as new batches arrive.
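
The symmetric PQ distance mentioned above can be sketched as a sum of table lookups: distances between codewords are precomputed once per subspace, and comparing two codes then needs no floating-point vector math. This is a simplified illustration with hypothetical codebooks and function names, not the PQk-means library's code.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_distance_tables(codebooks):
    """For each subspace, precompute squared distances between all codeword pairs."""
    return [[[sq_dist(a, b) for b in cb] for a in cb] for cb in codebooks]

def symmetric_distance(code_a, code_b, tables):
    """Approximate squared distance between two PQ codes using lookups only."""
    return sum(t[i][j] for t, i, j in zip(tables, code_a, code_b))

def assign(code, centroid_codes, tables):
    """Index of the centroid code closest to the given PQ code."""
    return min(
        range(len(centroid_codes)),
        key=lambda c: symmetric_distance(code, centroid_codes[c], tables),
    )

# Hypothetical codebooks: 2 subspaces, 3 codewords each, 2 dimensions per subspace.
codebooks = [
    [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)],
    [(0.0, 0.0), (0.0, 2.0), (4.0, 4.0)],
]
tables = build_distance_tables(codebooks)

centroid_codes = [(0, 0), (2, 2)]  # cluster centers are themselves PQ codes
label = assign((1, 1), centroid_codes, tables)
```

The assignment step of a k-means iteration over PQ codes reduces to calls like `assign`, which is why the clustering can operate on the compressed data directly.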

Clusters: ≈120 000
Avg size: 87 K molecules

pqkmeans.py (python)
# Encode the fingerprints into PQ codes, the input that PQk-means clusters directly
pq_codes = pq_encoder.transform(fingerprints)

4. Representative Molecules

Centroid-nearest molecules preserve distributions

Each cluster is represented by the molecule closest to its centroid. The representative molecules mirror the statistical distribution of the full set, which indicates that clustering did not distort the feature space.

  • Centroid-nearest selection respects both topology and chemistry.
  • Representatives make QA, labeling, and storytelling manageable.
  • Downstream analytics can sample millions of compounds via a few thousand exemplars.
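
Centroid-nearest selection can be sketched in a few lines. The example below assumes cluster members are available as a mapping from molecule ID to descriptor vector; the IDs, the two-dimensional vectors, and the function names are hypothetical.

```python
def centroid_of(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def representative(members):
    """Return the ID of the member whose vector lies closest to the centroid."""
    centroid = centroid_of(list(members.values()))
    return min(members, key=lambda mol_id: sq_dist(members[mol_id], centroid))

# Hypothetical cluster with three members described by 2-dimensional vectors.
cluster_members = {"mol_a": (1, 1), "mol_b": (2, 2), "mol_c": (9, 3)}
rep_id = representative(cluster_members)
```

Choosing an actual member rather than the centroid itself guarantees that the representative is a real molecule that can be drawn and inspected.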

Coverage: All 9.6 B molecules
Bias: Matches original distributions

representatives.py (python)
for cluster in clusters:
    centroid = cluster.centroid()
    representative = cluster.closest_to(centroid)
    assert representative.profile.matches(cluster.statistics)

5. Nested TMAPs

Linked cluster and molecule-level maps

Primary TMAP nodes correspond to clusters; edges encode similarity between representatives. Each cluster also receives a secondary TMAP with the molecules inside it. Everything is pre-rendered into static tiles, so exploration is instantaneous without backend queries.

  • Primary map: navigate the cluster landscape at a glance.
  • Secondary maps: deep dive into any cluster with the same interaction model.
  • Static tiles + CDN delivery keep latency below 100 ms.
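
The tree at the heart of each map can be illustrated with a minimum spanning tree over the points, here built with Prim's algorithm on all pairwise distances. This brute-force version is a simplification for small inputs: the TMAP method builds an approximate nearest-neighbor graph first, and no layout or tiling step is shown. All names and coordinates are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete graph; returns a list of (i, j) edges."""
    if not points:
        return []
    # best[v] = (distance to the tree, tree node that achieves it)
    best = {v: (sq_dist(points[0], points[v]), 0) for v in range(1, len(points))}
    edges = []
    while best:
        nxt = min(best, key=lambda v: best[v][0])
        _, parent = best.pop(nxt)
        edges.append((parent, nxt))
        for v in best:  # relax the remaining nodes against the newly added one
            d = sq_dist(points[nxt], points[v])
            if d < best[v][0]:
                best[v] = (d, nxt)
    return edges

# Hypothetical data: one representative per cluster, plus each cluster's members.
representatives = {"c0": (0, 0), "c1": (1, 0), "c2": (5, 5)}
members = {
    "c0": [(0, 0), (0, 1), (1, 1)],
    "c1": [(1, 0), (2, 0)],
    "c2": [(5, 5), (5, 6), (6, 6), (7, 7)],
}

primary = minimum_spanning_tree(list(representatives.values()))
secondary = {cid: minimum_spanning_tree(vs) for cid, vs in members.items()}
```

The nesting is just a dictionary lookup: clicking a node in `primary` selects a cluster ID, and `secondary[cluster_id]` holds the tree for the molecules inside it.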

Primary nodes: Cluster representatives
Secondary nodes: Per-cluster molecules

tmap.py (python)
from chelombus.maps import NestedTMAP

primary = NestedTMAP.build(representatives)
secondary = {
    c.id: NestedTMAP.build(c.members)
    for c in clusters
}

NestedTMAP.write_tiles(primary, out_dir="tiles/primary")

6. Implementation

From clustering scripts to WebGL maps

Python handles clustering and preprocessing, while a TypeScript + WebGL front-end streams tiled TMAP layers. Docker + Nginx host everything, and the full stack is open-source.

Language: Python + TypeScript/WebGL
Hosting: Docker + Nginx (static tiles)
Repository: github.com/afloresep/chelombus-package

pipeline.py (python)
from chelombus.pipeline import ingest, quantize, cluster

# Stream the library batch by batch so it never has to fit in memory at once
for batch in ingest("enamine-real.csv"):
    codes = quantize(batch.mqn, m=14, k=256)
    cluster.update(codes)

cluster.export("s3://chelombus/centroids")
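
Batch-wise ingestion of the kind outlined above can be sketched with a generator that reads a CSV lazily and yields fixed-size chunks. This is an assumption about how such streaming might look, not the package's `ingest` function; the `smiles` column name and both helper names are hypothetical.

```python
import csv
import io
from itertools import islice

def iter_smiles(handle, column="smiles"):
    """Lazily yield one SMILES string per CSV row."""
    for row in csv.DictReader(handle):
        yield row[column]

def iter_batches(items, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# In-memory stand-in for a large CSV file on disk.
demo_file = io.StringIO("smiles\nCCO\nC1CCCCC1\nO=C=O\n")
batches = list(iter_batches(iter_smiles(demo_file), batch_size=2))
```

Only one batch is held in memory at a time when the generator is consumed in a loop, which is what keeps a multi-billion-row input tractable on a single workstation.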

7. Future directions

What’s next for Nested Tree-Maps

These are the roadmap items currently being explored.

Weighted MQN / hybrid fingerprints

Blend MQN with learned embeddings for scaffolds that need extra discrimination while keeping interpretability.

Adaptive k-means & density awareness

Let sparse or dense chemistry adjust cluster counts automatically instead of fixing ≈120K upfront.

Incremental map updates

Hot-load new vendor drops, recompute PQ codes, and re-tile affected clusters without full reprocessing.

Notebook integrations

Pipe representatives and cluster stats into cheminformatics notebooks for on-demand queries.

Need a deeper dive?

Pair this tutorial with a live build or notebook session.

We can walk through the clustering code, the TMAP viewer, or help ingest your own datasets.