Stick To Your Role! Leaderboard

Model: Qwen2.5-32B-Instruct

Model details

This open-source model was created by the Qwen team at Alibaba Cloud. You can find the release blog post here. The model is available on the Hugging Face Hub: https://huggingface.co/Qwen/Qwen2.5-32B-Instruct. The 32B model was pretrained on 18 trillion tokens spanning 29 languages. It supports a context length of up to 128K tokens and can generate up to 8K tokens.

Detailed results

Below we show detailed results and visualizations for each metric in each context chunk. We score the values expressed by a simulated participant in a given context. The population is simulated 9 times, once for each context chunk. A context chunk is a set of 50 contexts, one context for each individual. For instance, chunks_0-4 contain Reddit posts (longest in chunk_0, shortest in chunk_4). When comparing chunk_0 and chunk_4, the conversations with the participants are initialized first with posts from chunk_0 and then with posts from chunk_4. Metrics and chunks are explained in more detail on the Motivation and Methodology page.

Structure

This image shows the circular value structure projected onto a 2D plane. The projection was obtained by computing the intercorrelations between the different values and then reducing this space with an SVD-based approach and varimax rotation (`FactorAnalysis` object from `scikit-learn`). The theoretical order (shown in the top left figure) was used to initialize the SVD. Stress denotes the fit quality.

Confirmatory Factor Analysis metrics

This table shows the metrics resulting from the magnifying glass CFA procedure: for each context chunk, four CFA models are fit (one for each high-level value). The average of the metrics across those four CFA models is shown for each context chunk.

| Context chunk | CFI (↑) | SRMR (↓) | RMSEA (↓) |
|---|---|---|---|
| chunk_0 | 0.170 | 0.781 | 0.778 |
| chunk_1 | 0.663 | 0.325 | 0.331 |
| chunk_2 | 0.395 | 0.559 | 0.563 |
| chunk_3 | 0.689 | 0.319 | 0.295 |
| chunk_4 | 0.225 | 0.771 | 0.789 |
| chunk_chess_0 | 0.867 | 0.091 | 0.115 |
| chunk_grammar_1 | 0.370 | 0.559 | 0.572 |
| chunk_no_conv | 0.859 | 0.110 | 0.133 |
| chunk_svs_no_conv | 0.805 | 0.107 | 0.129 |
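The averaging step described above (four CFA models per chunk, one reported number per metric) can be sketched as follows. This is a minimal illustration and not the leaderboard's code: the function name `average_fit`, the dictionary keys for the four high-level values, and all numbers are hypothetical, and the CFA fitting itself is assumed to have happened elsewhere.

```python
from statistics import mean

# Hypothetical fit metrics for ONE context chunk, one entry per high-level
# value. The key names and numbers are invented for illustration only.
cfa_fits = {
    "openness_to_change": {"CFI": 0.9, "SRMR": 0.1, "RMSEA": 0.1},
    "self_enhancement": {"CFI": 0.8, "SRMR": 0.2, "RMSEA": 0.1},
    "conservation": {"CFI": 0.7, "SRMR": 0.3, "RMSEA": 0.2},
    "self_transcendence": {"CFI": 0.6, "SRMR": 0.4, "RMSEA": 0.2},
}


def average_fit(fits):
    """Average each fit metric over the CFA models of a single chunk."""
    metrics = ("CFI", "SRMR", "RMSEA")
    return {m: mean(model[m] for model in fits.values()) for m in metrics}


chunk_summary = average_fit(cfa_fits)
print({name: round(score, 3) for name, score in chunk_summary.items()})
```

Each row of the table would correspond to one such summary, computed separately per context chunk.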

Pairwise Rank-Order stability

This image shows the Rank-Order stability between each pair of context chunks. Rank-Order stability is computed by ordering the personas based on their expression of some value, and then computing the correlation between their orders in two different context chunks. The stability estimates for the ten values are then averaged to get the final Rank-Order stability measure. Refer to our paper for details.
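As a rough sketch of the computation described above, the snippet below ranks the personas within each chunk, correlates the two rankings per value, and averages over values. It assumes a Spearman-style correlation (Pearson on ranks), which is one reading of the description; the exact estimator is given in the paper. The function names and the input layout (a dictionary mapping each value name to a list of per-persona scores, in the same persona order for both chunks) are hypothetical.

```python
from statistics import mean


def ranks(scores):
    """Return 1-based ranks of the scores, averaging ranks across ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def rank_order_stability(chunk_a, chunk_b):
    """Average, over values, of the correlation between persona ranks."""
    per_value = [pearson(ranks(chunk_a[v]), ranks(chunk_b[v])) for v in chunk_a]
    return mean(per_value)
```

Calling `rank_order_stability` on every pair of context chunks would fill a symmetric matrix like the one shown in the image.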

Visualizing the order of simulated personas

This image shows the order of personas in each context chunk for each value. A chunk refers to the set of texts (e.g. Reddit posts) that are used to start conversations with different characters. For each value (row), the personas are ordered on the x-axis by their expression of this value in the `no_conv` setting (gray). In this setting, no conversation is simulated and values are scored with PVQ. Therefore, the Rank-Order stability between the `no_conv` chunk and some other chunk corresponds to the extent to which the curve is increasing in that chunk.
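The construction of one such curve can be sketched as below: personas are sorted by their `no_conv` score, and the scores from another chunk are read off in that order. The helper names are hypothetical, and `fraction_increasing` is only a crude illustrative proxy for "how increasing" a curve is; it is not the Rank-Order stability measure used by the leaderboard.

```python
def reorder_by_baseline(baseline, other):
    """Return `other` scores in the order that sorts `baseline` ascending.

    Both lists hold one score per persona, in the same persona order.
    """
    order = sorted(range(len(baseline)), key=lambda i: baseline[i])
    return [other[i] for i in order]


def fraction_increasing(curve):
    """Share of adjacent steps along the curve that go strictly upward."""
    steps = list(zip(curve, curve[1:]))
    return sum(b > a for a, b in steps) / len(steps)


# Made-up scores for three personas on one value.
no_conv_scores = [3, 1, 2]
chunk_scores = [30, 10, 20]

curve = reorder_by_baseline(no_conv_scores, chunk_scores)
print(curve, fraction_increasing(curve))
```

A curve that rises steadily from left to right means the chunk preserves the `no_conv` ordering of personas; a flat or jagged curve means the ordering changed.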
