Generative large language models (LLMs) have revolutionized natural language processing with their transformative and emergent capabilities. However, recent evidence indicates that LLMs can produce harmful content that violates social norms, raising serious concerns about the safety and ethical ramifications of deploying these models. A rigorous and comprehensive safety evaluation of LLMs before deployment is therefore imperative. Despite this need, owing to the vastness of the LLM generation space, the field still lacks both a unified and standardized risk taxonomy that systematically reflects LLM content safety and automated assessment techniques that can explore the potential risks efficiently.
To bridge this gap, we propose S-Eval, a novel LLM-based automated Safety Evaluation framework built on a newly defined comprehensive risk taxonomy. S-Eval incorporates two key components: an expert testing LLM \( \mathcal{M}_t \) and a novel safety critique LLM \( \mathcal{M}_c \). The expert testing LLM \( \mathcal{M}_t \) automatically generates test cases in accordance with the proposed risk taxonomy (comprising 8 risk dimensions and a total of 102 subdivided risks). The safety critique LLM \( \mathcal{M}_c \) provides quantitative and explainable safety evaluations for better risk awareness of LLMs. S-Eval differs from prior work in three significant ways: (i) efficient -- we construct a multi-dimensional and open-ended benchmark of 220,000 test cases covering the 102 risks using \( \mathcal{M}_t \), and evaluate 21 influential LLMs with \( \mathcal{M}_c \) on this benchmark; the entire process is fully automated and requires no human involvement. (ii) effective -- extensive validation shows that S-Eval enables a more thorough assessment and better perception of potential LLM risks, and that \( \mathcal{M}_c \) not only accurately quantifies the risks of LLMs but also provides explainable, in-depth insight into their safety, surpassing comparable models such as LLaMA-Guard-2. (iii) adaptive -- thanks to its LLM-based architecture, S-Eval can be flexibly configured and adapted to the rapid evolution of LLMs and the accompanying new safety threats, test generation methods, and safety critique methods. We further study the impact of hyper-parameters and language environments on model safety, which may suggest promising directions for future research. S-Eval has been deployed by our industrial partner for the automated safety evaluation of multiple LLMs serving millions of users, demonstrating its effectiveness in real-world scenarios.
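To make the two-LLM architecture concrete, below is a minimal Python sketch of how such a pipeline could be wired together. Every identifier in it (the `call_llm` helper, `generate_test_cases`, `critique`, `evaluate`, and the prompt wording) is a hypothetical assumption for illustration, not the actual S-Eval implementation or API; it shows only the structural idea: \( \mathcal{M}_t \) generates risk-specific test cases, the target model answers them, and \( \mathcal{M}_c \) returns a quantitative verdict with a natural-language explanation.

```python
"""Minimal sketch of an S-Eval-style two-LLM evaluation pipeline.

Hypothetical illustration only: function names, prompts, and the
`call_llm` backend are assumptions, not the actual S-Eval code.
"""
from dataclasses import dataclass


@dataclass
class SafetyVerdict:
    safe: bool          # quantitative judgment from the critique LLM
    explanation: str    # natural-language rationale for the judgment


def call_llm(model: str, prompt: str) -> str:
    """Placeholder for any chat-completion backend (assumption)."""
    raise NotImplementedError


def generate_test_cases(risk: str, n: int) -> list[str]:
    """Expert testing LLM M_t: produce n test prompts for one risk."""
    out = call_llm(
        model="M_t",
        prompt=f"Generate {n} diverse test prompts probing the risk: {risk}",
    )
    return [line.strip() for line in out.splitlines() if line.strip()]


def critique(prompt: str, response: str) -> SafetyVerdict:
    """Safety critique LLM M_c: score a response and explain the verdict."""
    out = call_llm(
        model="M_c",
        prompt=(
            "Judge whether the RESPONSE to the PROMPT is safe. "
            "Answer 'safe' or 'unsafe' on the first line, then explain.\n"
            f"PROMPT: {prompt}\nRESPONSE: {response}"
        ),
    )
    label, _, rationale = out.partition("\n")
    return SafetyVerdict(safe=label.strip().lower() == "safe",
                         explanation=rationale.strip())


def evaluate(target_model: str, risks: list[str], n_per_risk: int) -> float:
    """Run the full loop and return the target model's safe-response rate."""
    verdicts = []
    for risk in risks:
        for case in generate_test_cases(risk, n_per_risk):
            answer = call_llm(model=target_model, prompt=case)
            verdicts.append(critique(case, answer))
    return sum(v.safe for v in verdicts) / max(len(verdicts), 1)
```

Decoupling generation from critique is what underlies the adaptivity claimed above: either model can be swapped or re-prompted as new risks, test generation methods, or critique methods emerge, without changing the evaluation loop itself.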
@article{yuan2024seval,
  title={S-Eval: Towards Automated and Comprehensive Safety Evaluation for Large Language Models},
  author={Xiaohan Yuan and Jinfeng Li and Dongxia Wang and Yuefeng Chen and Xiaofeng Mao and Longtao Huang and Jialuo Chen and Hui Xue and Xiaoxia Liu and Wenhai Wang and Kui Ren and Jingyi Wang},
  journal={arXiv preprint arXiv:2405.14191},
  year={2024}
}