Concept Clustering

Browser-based · No backend

Embedding Model
Clustering Method
Pairwise Parameters
Links concept pairs whose cosine similarity is ≥ the threshold, then treats each connected component as a cluster.
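The pairwise method can be sketched as follows. This is an illustrative sketch, not the app's actual code; the function name and union-find approach are assumptions.

```python
import numpy as np

def pairwise_clusters(embeddings, threshold=0.8):
    """Link pairs with cosine similarity >= threshold, then return
    connected components. Illustrative sketch, not the app's code."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalise rows
    sim = X @ X.T                                     # cosine similarity matrix
    n = len(X)
    parent = list(range(n))                           # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]             # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] >= threshold:
                parent[find(i)] = find(j)             # union the linked pair

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Each returned group is one connected component, i.e. one cluster.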
UMAP Hyperparameters
Applied before clustering to reduce noise. Matches CLUSTERING_PARAMS in config.py.
Controls how many neighbours UMAP considers. Larger = more global structure.
Minimum distance between embedded points. Smaller = tighter packing.
Dimensionality of UMAP space used as input to HDBSCAN (0 = skip UMAP pre-step, use raw cosine distance).
HDBSCAN Hyperparameters
Optional distance threshold for cluster selection: clusters closer than this are merged (0 = disabled, i.e. pure EOM selection). Corresponds to HDBSCAN's cluster_selection_epsilon.
Minimum concepts to form a cluster.
Minimum neighbours for a core point. Higher = more conservative, more noise.
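Taken together, the UMAP and HDBSCAN settings above could live in a single parameter dict. This is a hypothetical sketch of what CLUSTERING_PARAMS in config.py might look like; the key names and defaults are assumptions, not the real file:

```python
# Hypothetical shape of CLUSTERING_PARAMS in config.py; the actual
# keys and defaults in the app may differ.
CLUSTERING_PARAMS = {
    # UMAP pre-step (applied before clustering to reduce noise)
    "umap_n_neighbors": 15,   # larger = more global structure
    "umap_min_dist": 0.1,     # smaller = tighter packing
    "umap_n_components": 5,   # 0 = skip UMAP, use raw cosine distance
    # HDBSCAN
    "min_cluster_size": 5,    # minimum concepts to form a cluster
    "min_samples": 5,         # minimum neighbours for a core point
    "cluster_selection_epsilon": 0.0,  # 0 = disabled (pure EOM)
}
```

Keeping the two stages in one dict makes it easy for the UI controls above to mirror the backend defaults.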
ELSST Hierarchy Lookup
Search the ELSST thesaurus hierarchy. Each concept’s root→leaf path is embedded using a selectable strategy. Returns the most similar leaf concepts. Duplicate leaves are collapsed to the best-scoring path.
Embedding model for both pre-computed paths and browser query.
How each concept’s path is converted to text before embedding.
Number of similar leaf concepts to return.
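A minimal sketch of the lookup: each root→leaf path is flattened to text by a selectable strategy, embedded, and ranked against the query by cosine similarity, collapsing duplicate leaves to their best-scoring path. The strategy names and plain-vector inputs are assumptions for illustration; the app would use real model embeddings.

```python
import numpy as np

# Hypothetical path-to-text strategies; the app's actual names may differ.
STRATEGIES = {
    "leaf_only": lambda path: path[-1],
    "full_path": lambda path: " > ".join(path),
}

def top_k_leaves(query_vec, paths, path_vecs, k=3):
    """Rank root->leaf paths by cosine similarity to the query,
    collapsing duplicate leaves to their best-scoring path."""
    q = query_vec / np.linalg.norm(query_vec)
    V = path_vecs / np.linalg.norm(path_vecs, axis=1, keepdims=True)
    scores = V @ q
    best = {}  # leaf -> (score, path)
    for path, s in zip(paths, scores):
        leaf = path[-1]
        if leaf not in best or s > best[leaf][0]:
            best[leaf] = (float(s), path)
    ranked = sorted(best.values(), key=lambda t: -t[0])
    return ranked[:k]
```

Collapsing before truncating to k ensures a leaf reachable by several paths occupies only one result slot.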
Stage 1: the bi-encoder retrieves the top-N candidates. Stage 2: the cross-encoder (ms-marco-MiniLM-L-6-v2) scores each (query, concept) pair and re-orders the list. Slower on first use (the model must be downloaded), but significantly more accurate for nuanced free-text queries.
How many candidates the bi-encoder retrieves before the cross-encoder re-ranks them. Must be ≥ Top-K.
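The two-stage flow can be sketched with stub scorers standing in for the real models: here the bi-encoder stage is plain cosine similarity and the cross-encoder is a hypothetical callable, whereas the app would use sentence-transformer models for both.

```python
import numpy as np

def retrieve_then_rerank(query_vec, cand_vecs, cross_score,
                         n_candidates=10, top_k=3):
    """Stage 1: bi-encoder (cosine here) keeps the top-N candidates.
    Stage 2: a cross-encoder scores each survivor and re-orders them.
    Requires n_candidates >= top_k, as the UI enforces."""
    assert n_candidates >= top_k, "candidate pool must be >= Top-K"
    q = query_vec / np.linalg.norm(query_vec)
    V = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    coarse = V @ q                              # stage 1: cheap scores
    pool = np.argsort(-coarse)[:n_candidates]   # top-N candidate ids
    fine = [(i, cross_score(i)) for i in pool]  # stage 2: deep scoring
    fine.sort(key=lambda t: -t[1])
    return [int(i) for i, _ in fine[:top_k]]
```

The design point is that the expensive scorer only sees N items, not the whole collection, which is why N must be at least Top-K.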
Enter a concept or phrase (case-insensitive). Press Enter to search, Shift+Enter for a newline.
Concepts: 0
Ready