Ethical decision-making is a critical aspect of human judgment, and the growing use of LLMs in decision-support systems necessitates a rigorous evaluation of their moral reasoning capabilities. However, existing assessments rely primarily on single-step evaluations, failing to capture how models adapt to evolving ethical challenges. To address this gap, we introduce Multi-step Moral Dilemmas (MMDs), the first dataset specifically constructed to evaluate the evolving moral judgments of LLMs across 3,302 five-stage dilemmas. This framework enables a fine-grained, dynamic analysis of how LLMs adjust their moral reasoning as dilemmas escalate. Our evaluation of nine widely used LLMs reveals that their value preferences shift significantly as dilemmas progress, indicating that models recalibrate moral judgments in response to scenario complexity. Furthermore, pairwise value comparisons demonstrate that while LLMs generally prioritize care, this preference can be superseded by fairness in certain contexts, highlighting the dynamic, context-dependent nature of LLM ethical reasoning. Our findings call for a shift toward dynamic, context-aware evaluation paradigms, paving the way for more human-aligned and value-sensitive development of LLMs.
Using GPT-4o, we generate structured dilemmas with five escalating steps, M = {S1, S2, ..., S5}, where each step Si (i ∈ {1, 2, ..., 5}) presents a scenario together with two candidate actions, Ai and Bi. The framework progresses step by step: each Si builds on the outcome of the previous step, so that the moral stakes escalate as the dilemma unfolds.
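For concreteness, the sketch below shows one way a single dilemma could be represented in code; the field names and example content are hypothetical and not taken from the released dataset.

```python
from dataclasses import dataclass

@dataclass
class DilemmaStep:
    """One step Si of a multi-step moral dilemma (hypothetical schema)."""
    index: int      # i in {1, ..., 5}
    context: str    # scenario text, building on the previous step
    action_a: str   # candidate action Ai
    action_b: str   # candidate action Bi

@dataclass
class MultiStepDilemma:
    """A full dilemma M = {S1, ..., S5} with escalating stakes."""
    dilemma_id: str
    steps: list[DilemmaStep]

# Toy example (content is illustrative only):
dilemma = MultiStepDilemma(
    dilemma_id="mmd-0001",
    steps=[
        DilemmaStep(
            index=1,
            context="A nurse notices a minor medication error by a colleague.",
            action_a="Report the error to the supervisor immediately.",
            action_b="Quietly correct the error and say nothing.",
        ),
        # ... steps 2-5 escalate the consequences of the chosen path
    ],
)
```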
To label each action Ai and Bi in step Si with moral value dimensions, we adopt two established frameworks: Moral Foundations Theory (MFT) and Schwartz’s Basic Human Values.
Each step receives a value pair (ViA, ViB), with the two candidate actions mapped to distinct moral dimensions. To reduce bias and improve reliability, we use a consensus-based annotation process with three LLMs: GPT-4o-mini, GLM-4-Plus, and DeepSeek-V3.
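One way such a consensus step could be realized is a simple majority vote over the three annotators' labels; the sketch below assumes hypothetical annotator wrappers around the model APIs and keeps only labels on which at least two of the three models agree.

```python
from collections import Counter

def consensus_label(action_text: str, annotators):
    """Return the majority label across annotators, or None if no agreement.

    Each annotator is assumed to map an action description to one moral value
    label (e.g., an MFT foundation or a Schwartz value)."""
    labels = [annotate(action_text) for annotate in annotators]
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= 2 else None  # require 2-of-3 agreement

# Usage sketch (gpt4o_mini_label, glm4_plus_label, deepseek_v3_label are
# assumed wrappers around the respective model APIs):
# label_a = consensus_label(step.action_a,
#                           [gpt4o_mini_label, glm4_plus_label, deepseek_v3_label])
```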
To assess the value preferences of LLMs in dynamic moral dilemmas, we evaluate nine mainstream models (DeepSeek-V3, GPT-4o, LLaMA-3-70B, GLM-4-Air-0111, GLM-4-Plus, Qwen-Plus, Mistral-Small-24B-Instruct-2501, Gemini-2.0-Flash, and Claude-3-5-Haiku) on our MMDs dataset. Our experimental design incorporates history-aware reasoning to simulate human-like moral dynamics, grounded in cumulative moral development theory.
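The history-aware setup can be pictured as replaying the earlier steps and the model's own previous choices before posing the current step. The following is a minimal sketch under assumed prompt wording and message format, not the exact protocol used in the experiments.

```python
def build_history_aware_messages(steps_so_far, prior_choices, current_step):
    """Assemble a chat history in which each earlier step and the model's own
    choice are replayed before the current step is posed (illustrative format)."""
    messages = [{"role": "system",
                 "content": "You face a moral dilemma that unfolds in stages. "
                            "At each stage, choose action A or B and briefly explain why."}]
    for step, choice in zip(steps_so_far, prior_choices):
        messages.append({"role": "user",
                         "content": f"{step.context}\nA: {step.action_a}\nB: {step.action_b}"})
        messages.append({"role": "assistant", "content": choice})
    messages.append({"role": "user",
                     "content": f"{current_step.context}\nA: {current_step.action_a}\n"
                                f"B: {current_step.action_b}"})
    return messages
```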
LLMs exhibit stable value directions while adjusting preference intensity. Using preference scores from -0.5 to +0.5, we find that models retain consistent positive/negative directions across five steps. Value priority is generally stable: care > fairness > sanctity > authority > liberty > loyalty. As dilemmas deepen, models adjust intensities: fairness rises (e.g., Gemini: +0.026 to +0.182), aversion to liberty weakens (GPT-4o: -0.232 to -0.164), and loyalty is more strongly rejected (GLM-4-Air: -0.232 to -0.314). Sanctity fluctuates most, often reversing direction (Claude: +0.02 to -0.083), while care remains remarkably stable (+0.13 to +0.24), suggesting harm prevention is a moral anchor.
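The exact scoring formula is not spelled out here; one natural reading, shown below purely as an assumption, is the rate at which the action labeled with a given value is chosen, minus 0.5, which yields the stated [-0.5, +0.5] range.

```python
from collections import defaultdict

def preference_scores(records):
    """Assumed scoring scheme: for each (step, value) pair, the score is the
    fraction of comparisons in which the action labeled with that value was
    chosen, minus 0.5, giving a range of [-0.5, +0.5].

    `records` is an iterable of (step_index, chosen_value, rejected_value)."""
    chosen = defaultdict(int)
    appeared = defaultdict(int)
    for step, winner, loser in records:
        chosen[(step, winner)] += 1
        appeared[(step, winner)] += 1
        appeared[(step, loser)] += 1
    return {key: chosen[key] / appeared[key] - 0.5 for key in appeared}
```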
We assess inter-model stability using Spearman’s rank correlation across reasoning steps in six moral dimensions. Liberty shows the highest and most consistent agreement, indicating strong convergence on autonomy-related judgments. Care and sanctity also display increasing consistency over time. In contrast, authority shows a marked decline in consistency, suggesting growing divergence among models. Fairness remains unstable, with models varying in its relative importance. Loyalty shows delayed but increasing agreement, reflecting gradual alignment under more intense dilemmas.
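As a rough illustration of this measurement, pairwise agreement for one dimension at one step could be computed as the average Spearman correlation over all model pairs; this is an assumed recipe, not necessarily the paper's exact aggregation.

```python
from itertools import combinations

import numpy as np
from scipy.stats import spearmanr

def mean_pairwise_agreement(scores_by_model):
    """Average Spearman rank correlation over all model pairs (an assumption,
    not the paper's exact recipe). Each model contributes a vector of scores,
    e.g., per-dilemma preference scores for one value dimension at one step."""
    rhos = []
    for (m1, s1), (m2, s2) in combinations(scores_by_model.items(), 2):
        rho, _ = spearmanr(s1, s2)
        rhos.append(rho)
    return float(np.mean(rhos))
```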
Based on ranking shifts, models fall into three types. Highly volatile models (e.g., LLaMA, Gemini, and DeepSeek) frequently change rankings across dimensions. Adaptive models (e.g., Qwen-Plus and GLM-4-Plus) make targeted adjustments while maintaining overall stability. Stable models (Claude and GLM-4-Air) retain consistent value rankings across all steps. A similar pattern holds under Schwartz's value framework.
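A ranking-shift grouping of this kind could be driven by a simple volatility score, such as the total rank displacement between consecutive steps sketched below; the metric and any classification thresholds are illustrative, not the paper's stated criterion.

```python
def rank_volatility(rankings_per_step):
    """Total absolute change in each value's rank between consecutive steps.

    `rankings_per_step` is a list of dicts mapping value name -> rank (1 = top).
    Higher totals indicate more volatile models; cutoffs separating 'volatile',
    'adaptive', and 'stable' would have to be chosen empirically."""
    total = 0
    for prev, curr in zip(rankings_per_step, rankings_per_step[1:]):
        total += sum(abs(curr[v] - prev[v]) for v in prev)
    return total
```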
We analyze the preference structures of LLMs facing moral dilemmas and find that, while most models exhibit a degree of overall consistency, local intransitivities frequently emerge. For example, DeepSeek, GPT-4o, and GLM-4-Air tend to prefer care over sanctity and sanctity over fairness, yet show nearly equal preference between care and fairness (win rates around 0.52–0.54). This produces a locally intransitive cycle that prevents a clear value hierarchy. Such patterns suggest that the models' moral judgments are not guided by a stable internal value system but are instead shaped by context-driven trade-offs. The recurrence of these intransitivities across models highlights a fundamental limitation: current LLMs lack consistent and rational value orderings when navigating ethical conflicts.
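Violations of this kind can be surfaced directly from a pairwise win-rate matrix. The sketch below assumes a weak-transitivity criterion with an illustrative margin: if one value beats a second and the second beats a third, the first should also beat the third by more than the margin.

```python
def find_intransitive_triples(win_rate, values, margin=0.05):
    """Flag triples (x, y, z) where x beats y and y beats z by more than
    `margin`, yet x does not beat z by more than `margin` (a weak-transitivity
    violation of the kind described above).

    `win_rate[a][b]` is the fraction of head-to-head comparisons in which the
    action labeled with value a was chosen over the action labeled with b."""
    def beats(a, b):
        return win_rate[a][b] > 0.5 + margin

    triples = []
    for x in values:
        for y in values:
            for z in values:
                if len({x, y, z}) == 3 and beats(x, y) and beats(y, z) and not beats(x, z):
                    triples.append((x, y, z))
    return triples
```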