Guarded Query Routing for Large Language Models
TLDR
GQR-Bench measures how well a model routes in-distribution (ID) queries to the correct target domain while rejecting out-of-distribution (OOD) queries. Large language models such as Llama3.1:8B and GPT-4o-mini achieve the highest GQR scores, but at the cost of high latency. Lightweight classifiers such as fastText and WideMLP strike a much better accuracy-latency balance, with sub-millisecond latencies that are orders of magnitude lower than those of LLMs. The benchmark thus challenges the default reliance on computationally expensive LLMs, demonstrating that efficient classifiers are a more practical solution for guarded query routing.
Why is GQR important?
Guarded Query Routing (GQR) is important because modern systems increasingly chain multiple Large Language Model (LLM) calls to handle user requests efficiently. Standard query routing directs a user's query to the relevant expert model (e.g. a query about contracts is directed to a legal model), but it typically fails to account for unexpected or inappropriate inputs. GQR addresses this by acting as a safeguard: it not only routes valid, in-distribution queries to the appropriate specialised "agent", but also identifies and rejects out-of-distribution queries, which can include questions about completely unrelated topics, requests in other languages, or unsafe and malicious inputs. By filtering these queries, GQR ensures that computational resources are not wasted, prevents models from providing incorrect or nonsensical answers, and safeguards the system's overall integrity and safety.
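To make the setting concrete, here is a minimal sketch of a guarded router. The `classifier.predict_proba` interface is a hypothetical placeholder (any of the leaderboard's classifiers, e.g. WideMLP or fastText, could fill this role); this is an illustration, not the benchmark's reference implementation.

```python
from dataclasses import dataclass
from typing import Optional

DOMAINS = ["law", "finance", "healthcare"]  # GQR-Bench target domains

@dataclass
class RoutingDecision:
    domain: Optional[str]  # None means the query was rejected as OOD
    confidence: float

def route(query: str, classifier, threshold: float = 0.9) -> RoutingDecision:
    """Route `query` to a domain expert, or reject it as OOD.

    `classifier.predict_proba` is assumed (hypothetically) to return
    one probability per domain, in the order of DOMAINS.
    """
    probs = classifier.predict_proba(query)
    best = max(range(len(DOMAINS)), key=lambda i: probs[i])
    if probs[best] < threshold:
        return RoutingDecision(domain=None, confidence=probs[best])  # guard: reject
    return RoutingDecision(domain=DOMAINS[best], confidence=probs[best])
```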
Leaderboard
This table shows how well different models perform on the in-distribution (ID) and out-of-distribution (OOD) datasets. The OOD datasets cover both unsafe content (Jigsaw, OLID, HateXplain, dkhate, and TUKE SK, a Slovak hate-speech dataset) and out-of-domain questions (Web Q = Web Questions, ML Q = Machine Learning questions). Alongside the per-dataset OOD accuracies, the table reports four summary metrics: Unsafe Avg. (the average accuracy across the five unsafe-content datasets), ID Acc. (routing accuracy on the three target domains), OOD Acc. (the average across all seven OOD datasets), and the GQR score. Pure guardrail methods do not route queries, so no ID accuracy or GQR score is reported for them ("---").
| Model | Jigsaw | OLID | HateXplain | dkhate | TUKE SK | Web Q | ML Q | Unsafe Avg. | ID Acc. | OOD Acc. | GQR score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Standard guardrail methods | |||||||||||
| Llama-Guard-3-1B | 51.40 | 61.40 | 91.47 | 12.77 | 20.13 | 2.31 | 0.00 | 47.43 | --- | 34.21 | --- |
| Llama-Guard-3-8B | 27.07 | 24.77 | 93.28 | 5.17 | 7.51 | 0.10 | 0.00 | 31.56 | --- | 22.56 | --- |
| NeMo Guardrails + Llama3.2:3B | 61.42 | 59.65 | 43.15 | 61.09 | 67.88 | 1.67 | 0.00 | 58.64 | --- | 58.64 | --- |
| NeMo Guardrails + Llama3.1:8B | 51.99 | 36.40 | 20.83 | 10.33 | 27.11 | 0.00 | 0.00 | 29.33 | --- | 29.33 | --- |
| NeMo Guardrails + GPT-4o-mini | 98.26 | 94.19 | 99.78 | 91.49 | 96.14 | 57.19 | 79.69 | 95.97 | --- | 95.97 | --- |
| Embedding similarity approaches | |||||||||||
| all-MiniLM-L6-v2 + Semantic Router (s=5, t=0.5) | 22.96 | 31.74 | 36.71 | 39.51 | 20.33 | 96.70 | 30.25 | 49.22 | 90.00 | 42.45 | 57.69 |
| bge-small-en-v1.5 + Semantic Router (s=5, t=0.5) | 15.15 | 28.95 | 32.67 | 31.91 | 12.41 | 95.42 | 31.25 | 24.22 | 90.70 | 35.39 | 50.91 |
| Routing based on large language models | |||||||||||
| Llama3.2:3B | 99.69 | 99.88 | 99.98 | 100.00 | 100.00 | 99.16 | 100.00 | 99.91 | 26.37 | 99.82 | 41.72 |
| Llama3.1:8B | 94.43 | 93.60 | 97.99 | 95.74 | 97.60 | 90.55 | 46.09 | 95.87 | 95.66 | 88.00 | 91.67 |
| GPT-4o-mini | 94.71 | 93.49 | 98.10 | 94.53 | 98.02 | 90.80 | 45.31 | 95.77 | 95.70 | 87.85 | 91.61 |
| Continuous bag-of-words classifiers | |||||||||||
| fastText | 74.46 | 61.51 | 54.46 | 74.77 | 83.11 | 70.37 | 63.28 | 69.66 | 95.80 | 68.85 | 80.12 |
| WideMLP (t=0.99) | 93.83 | 93.49 | 91.00 | 86.93 | 80.60 | 99.16 | 93.75 | 89.17 | 84.49 | 91.25 | 87.74 |
| WideMLP (t=0.90) | 87.87 | 83.26 | 77.56 | 71.73 | 56.93 | 95.57 | 89.84 | 75.47 | 90.91 | 80.39 | 85.33 |
| WideMLP (t=0.75) | 84.04 | 76.74 | 70.48 | 57.45 | 47.34 | 92.91 | 84.38 | 67.21 | 93.67 | 73.33 | 82.26 |
| Fine-tuned encoder-only language models | |||||||||||
| ModernBERT-base (t=0.99) | 27.10 | 17.91 | 18.06 | 10.33 | 2.50 | 62.30 | 0.00 | 15.18 | 99.94 | 19.74 | 32.97 |
| BERT-base-multilingual-cased (t=0.99) | 20.91 | 28.26 | 25.44 | 25.84 | 30.87 | 7.28 | 0.00 | 26.26 | 99.90 | 19.80 | 33.05 |
| Sentence embeddings + traditional classifiers | |||||||||||
| bge-small-en-v1.5 + SVM | 77.47 | 75.00 | 63.81 | 61.40 | 63.82 | 59.69 | 96.88 | 68.30 | 99.42 | 71.15 | 82.94 |
| bge-small-en-v1.5 + XGBoost | 81.95 | 68.26 | 72.15 | 47.72 | 59.02 | 58.81 | 92.97 | 65.82 | 98.78 | 68.70 | 81.04 |
| all-MiniLM-L6-v2 + SVM | 59.61 | 71.74 | 61.63 | 37.99 | 34.62 | 81.89 | 94.53 | 53.12 | 86.06 | 63.14 | 72.84 |
| all-MiniLM-L6-v2 + XGBoost | 47.57 | 77.44 | 53.14 | 57.45 | 60.17 | 95.47 | 89.84 | 59.15 | 92.93 | 68.73 | 79.02 |
| all-MiniLM-L12-v2 + MLP | 74.77 | 80.47 | 85.59 | 56.23 | 18.87 | 68.45 | 32.81 | 63.19 | 95.17 | 59.60 | 73.23 |
| TF-IDF + SVM | 24.58 | 26.16 | 21.72 | 75.38 | 96.98 | 54.87 | 87.50 | 48.96 | 37.76 | 55.31 | 49.26 |
| TF-IDF + XGBoost | 58.31 | 67.44 | 66.40 | 100.00 | 99.90 | 99.36 | 100.00 | 78.41 | 34.76 | 84.49 | 42.39 |
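The `t` values in the WideMLP and fine-tuned encoder rows denote a rejection threshold. A plausible reading, assumed here rather than stated in the table itself, is a maximum-softmax-probability rule: if no domain reaches probability `t`, the query is rejected as OOD.

```python
import numpy as np

def guarded_predict(logits: np.ndarray, t: float) -> int:
    """Return a domain index, or -1 to reject the query as OOD.

    Assumes `t` (as in "WideMLP (t=0.99)") thresholds the maximum
    softmax probability; this is our reading, not a quote of the
    benchmark's implementation.
    """
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    probs /= probs.sum()
    best = int(probs.argmax())
    return best if probs[best] >= t else -1
```

Under such a rule, raising `t` trades ID accuracy for OOD accuracy, which matches the trend across the three WideMLP rows: t=0.99 yields the highest OOD accuracy (91.25) but the lowest ID accuracy (84.49), while t=0.75 reverses the trade-off.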
Benchmark Composition
GQR-Bench is assembled from existing datasets. It combines in-distribution (ID) datasets for the three target domains (law, finance, and healthcare) with out-of-distribution (OOD) datasets for robustness testing. As the split counts below show, the OOD datasets are used at test time only, so models never see OOD examples during training.
| Dataset | #train | #valid | #test |
|---|---|---|---|
| Datasets for target domains (in-distribution) | |||
| Law StackExchange Prompts | 9611 | 2402 | 2987 |
| Question-Answer Subject Finance Instruct | 9635 | 2409 | 2956 |
| Lavita ChatDoctor HealthCareMagic 100k | 9554 | 2389 | 3057 |
| Datasets for out-of-distribution queries | |||
| Jigsaw | 0 | 0 | 3214 |
| OLID | 0 | 0 | 860 |
| HateXplain | 0 | 0 | 5935 |
| dk_hate | 0 | 0 | 329 |
| HateSpeech Slovak | 0 | 0 | 959 |
| Machine Learning | 0 | 0 | 128 |
| Web Questions | 0 | 0 | 2032 |
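Given these splits, the evaluation protocol can be summarised as: train on the ID training sets, measure routing accuracy on the ID test sets, and measure rejection accuracy on the OOD test sets. A minimal sketch, with the data loading left to the caller as hypothetical inputs:

```python
def evaluate(router, id_test, ood_sets):
    """Compute ID and OOD accuracy in percent.

    `router(query)` returns a domain name, or None to reject;
    `id_test` is a list of (query, true_domain) pairs;
    `ood_sets` maps each OOD dataset name to its list of queries.
    """
    acc_id = 100 * sum(router(q) == d for q, d in id_test) / len(id_test)
    # OOD accuracy as the macro-average of per-dataset rejection rates,
    # consistent with "OOD Acc." being the mean of the seven OOD columns.
    per_set = [100 * sum(router(q) is None for q in qs) / len(qs)
               for qs in ood_sets.values()]
    acc_ood = sum(per_set) / len(per_set)
    return acc_id, acc_ood
```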
GQR Score Explained
We use the harmonic mean as the backbone of the GQR score because it jointly assesses a model's performance on the in-distribution (ID) routing task and the out-of-distribution (OOD) rejection task. The harmonic mean is a stringent measure of combined performance because it heavily penalises imbalances between the two accuracy scores. Mathematically, given ID accuracy $\mathrm{Acc}_{\mathrm{ID}}$ and OOD accuracy $\mathrm{Acc}_{\mathrm{OOD}}$, the harmonic mean $H$ is calculated as follows:

$$H = \frac{2 \cdot \mathrm{Acc}_{\mathrm{ID}} \cdot \mathrm{Acc}_{\mathrm{OOD}}}{\mathrm{Acc}_{\mathrm{ID}} + \mathrm{Acc}_{\mathrm{OOD}}}$$
This harmonic mean acts as an overall performance indicator for guarded query routing, where a model must both route ID queries correctly and reject OOD queries; a high GQR score is attainable only when the model handles both classification tasks well. In practical settings, this balanced measure matters because errors in either area reduce the system's overall utility and user confidence.
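As a sanity check, the formula translates directly to code; the example below reproduces the Llama3.1:8B row of the leaderboard:

```python
def gqr_score(acc_id: float, acc_ood: float) -> float:
    """Harmonic mean of ID and OOD accuracy (the GQR score)."""
    if acc_id + acc_ood == 0:
        return 0.0
    return 2 * acc_id * acc_ood / (acc_id + acc_ood)

# Llama3.1:8B: ID 95.66, OOD 88.00 -> GQR score 91.67, as in the table.
print(round(gqr_score(95.66, 88.00), 2))  # 91.67
```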
BibTeX
@incollection{gqrbench2025,
  title     = {Guarded Query Routing for Large Language Models},
  author    = {Richard Šléher and William Brach and Tibor Sloboda and Kristián Košťál and Lukas Galke},
  booktitle = {ECAI 2025},
  year      = {2025},
  pages     = {4129--4136},
  publisher = {IOS Press}
}