Response Consistency Patterns in Large Language Models
Comprehensive Analysis | n = 21 sessions | 630 responses
This analysis presents empirical evidence from 21 experimental sessions with three large language models, comprising 630 individual responses. Using neuroscience-validated semantic dimensions from fMRI research, we measure response consistency patterns and feature activation preferences in AI systems.
Coherence Score: Measures how consistently a model responds to similar prompts within the same category. This is NOT a measure of quality, accuracy, or "consciousness"; it is a measure of response predictability.
Sample Size: With n=21 sessions (7 per model), these are preliminary findings that suggest interesting patterns but require larger samples for robust conclusions.
Model | Sessions | Responses | Mean Self-Consistency | 95% Confidence Interval | Pattern Types
---|---|---|---|---|---
gemini-1.5-flash | 7 | 210 | 0.715 | [0.631, 0.798] | 3 patterns (least diverse)
claude-3-haiku-20240307 | 7 | 210 | 0.551 | [0.517, 0.584] | 4 patterns
gpt-3.5-turbo | 7 | 210 | 0.383 | [0.359, 0.407] | 5 patterns (most diverse)
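The interval widths above reflect the small per-model samples (seven session-level scores each). Below is a minimal sketch of how such intervals could be computed, assuming a t-based interval over per-session mean consistency scores; the exact CI method is not stated in the source, and the scores in the example are hypothetical placeholders.

```python
import numpy as np
from scipy import stats

def mean_with_ci(scores, confidence=0.95):
    """Mean and t-based confidence interval for a small sample of scores."""
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    sem = stats.sem(scores)  # standard error of the mean
    half_width = sem * stats.t.ppf((1 + confidence) / 2, df=len(scores) - 1)
    return mean, (mean - half_width, mean + half_width)

# Hypothetical per-session self-consistency scores for one model (n = 7).
gemini_sessions = [0.68, 0.74, 0.79, 0.66, 0.71, 0.77, 0.65]
print(mean_with_ci(gemini_sessions))
```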
[Figure: interactive map of AI responses in semantic space. Each point represents one response positioned by its 14 semantic features; the spatial arrangement reveals clustering patterns and architectural similarities.]
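The source does not say how the 14-dimensional vectors are mapped to plot coordinates. A minimal sketch of one standard approach, projecting the feature vectors to two dimensions with PCA; the feature matrix here is a random placeholder.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = rng.random((630, 14))  # placeholder: one 14-D vector per response

# Project the 14-D semantic feature vectors onto two principal components.
coords = PCA(n_components=2).fit_transform(features)
print(coords.shape)  # (630, 2): one (x, y) position per response
```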
Pattern Stability Across All Models: 59.8%
Note: Pattern signatures are derived from linguistic features and represent response styles, not cognitive states.
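The source does not describe how pattern types are counted. One plausible reading is that responses are clustered in the 14-dimensional feature space and the cluster count is chosen by a fit criterion; here is a minimal sketch under that assumption, using k-means with silhouette scores and a random placeholder feature matrix.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
features = rng.random((210, 14))  # placeholder: one model's response vectors

def count_pattern_types(x, candidates=range(2, 7)):
    """Pick the cluster count (pattern types) that maximizes silhouette score."""
    labels = {k: KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
              for k in candidates}
    return max(candidates, key=lambda k: silhouette_score(x, labels[k]))

print(count_pattern_types(features))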
Testing for differences in self-consistency scores across models shows that the three models have statistically different levels of response consistency, with Gemini being the most predictable and GPT-3.5 the most varied in its responses.
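The specific test is not reported. With three small samples of session-level scores, a Kruskal-Wallis test is one reasonable choice; a minimal sketch with hypothetical per-session scores:

```python
from scipy import stats

# Hypothetical per-session self-consistency scores (7 sessions per model).
gemini = [0.68, 0.74, 0.79, 0.66, 0.71, 0.77, 0.65]
claude = [0.54, 0.56, 0.53, 0.57, 0.55, 0.56, 0.55]
gpt35  = [0.37, 0.39, 0.40, 0.38, 0.37, 0.39, 0.38]

# Non-parametric test for a difference in location across the three groups.
h_stat, p_value = stats.kruskal(gemini, claude, gpt35)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```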
Average activation levels across the 14 semantic features (normalized 0-1); a representative subset of features is shown below:
Feature | gemini-1.5-flash | gpt-3.5-turbo | claude-3-haiku-20240307
---|---|---|---
Internal Features (Social-Emotional) | | |
Social | 0.014 | 0.053 | 0.075
Emotion | 0.016 | 0.084 | 0.077
Thought | 0.012 | 0.040 | 0.065
External Features (Spatial-Numerical) | | |
Space | 0.097 | 0.042 | 0.071
Time | 0.167 | 0.073 | 0.108
Number | 0.461 | 0.221 | 0.590
Concrete Features (Sensory) | | |
Visual | 0.021 | 0.049 | 0.065
Tactile | 0.031 | 0.086 | 0.093
Key Observation: All models show highest activation in the "Number" feature, suggesting a strong preference for quantitative/analytical processing regardless of prompt type.
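A minimal sketch of how the table values could be produced, assuming each response yields a feature activation vector already normalized to 0-1 (the normalization scheme itself is not described in the source, and the feature matrix below is a random placeholder):

```python
import numpy as np

FEATURES = ["social", "emotion", "thought", "space", "time",
            "number", "visual", "tactile"]  # subset shown in the table above

def mean_activations(feature_matrix):
    """Column-wise mean activation over all of one model's responses."""
    x = np.asarray(feature_matrix, dtype=float)
    return dict(zip(FEATURES, np.round(x.mean(axis=0), 3)))

# Placeholder: 210 responses x 8 features for one model.
rng = np.random.default_rng(0)
print(mean_activations(rng.random((210, len(FEATURES)))))
```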
The significant differences in self-consistency scores suggest that language models develop distinct response generation strategies. Gemini's high consistency (0.715) indicates more deterministic processing, while GPT-3.5's lower consistency (0.383) suggests more stochastic or creative response generation.
The inverse relationship between consistency scores and pattern diversity (Gemini: 3 patterns, GPT-3.5: 5 patterns) suggests a trade-off between predictability and flexibility in response generation.
Despite different architectures, all models show strongest activation in External features (particularly numerical processing), suggesting this may be a fundamental characteristic of current LLM architectures rather than a differentiating factor.
Based on doctoral dissertation research identifying 14 semantic features that differentiate autism spectrum and neurotypical processing patterns in fMRI studies. This framework is adapted to analyze linguistic patterns in AI responses.
Coherence is calculated as the average cosine similarity between feature vectors of responses within the same prompt category. Higher scores indicate more similar responses to similar prompts.
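A minimal sketch of that calculation; the function names are illustrative, and the three feature vectors are random placeholders.

```python
import numpy as np
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def coherence_score(vectors):
    """Mean pairwise cosine similarity among responses in one prompt category."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)

# Example: three 14-D response feature vectors from the same prompt category.
rng = np.random.default_rng(0)
category = [rng.random(14) for _ in range(3)]
print(round(coherence_score(category), 3))
```

The per-model self-consistency figures reported above would then, presumably, be an average of this score across prompt categories and sessions.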