arxiv:2512.13330

FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

Published on Dec 15, 2025 · Submitted by Joona Kytöniemi on Dec 16, 2025

Abstract

FIN-bench-v2 is a unified benchmark suite for evaluating Finnish large language models, incorporating diverse datasets and evaluation criteria.

We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.
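The four selection criteria named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas or thresholds, so everything below is an assumption: monotonicity is taken as the Spearman correlation between training step and score, signal-to-noise as mean over standard deviation of scores across runs, non-random performance as the margin of the best score over a random baseline, and ordering consistency as the fraction of checkpoints that agree with the final ranking of two models. All function names and threshold values are hypothetical.

```python
from statistics import mean, stdev


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def monotonicity(steps, scores):
    """Assumed definition: Spearman correlation of step vs. score."""
    return pearson(ranks(steps), ranks(scores))


def signal_to_noise(run_scores):
    """Assumed definition: mean / std of scores across comparable runs."""
    sd = stdev(run_scores)
    return float("inf") if sd == 0 else mean(run_scores) / sd


def above_random(scores, random_baseline):
    """Assumed definition: best score minus the random-guess baseline."""
    return max(scores) - random_baseline


def ordering_consistency(curve_a, curve_b):
    """Assumed definition: share of checkpoints matching the final ranking."""
    def sign(a, b):
        return (a > b) - (a < b)
    final = sign(curve_a[-1], curve_b[-1])
    return sum(sign(a, b) == final for a, b in zip(curve_a, curve_b)) / len(curve_a)


def keep_task(steps, curve_a, curve_b, run_scores, random_baseline,
              min_mono=0.5, min_snr=3.0, min_margin=0.05, min_consistency=0.8):
    """Retain a task only if every criterion passes (thresholds are hypothetical)."""
    return (
        monotonicity(steps, curve_a) >= min_mono
        and signal_to_noise(run_scores) >= min_snr
        and above_random(curve_a, random_baseline) >= min_margin
        and ordering_consistency(curve_a, curve_b) >= min_consistency
    )


# Made-up learning curves for two models on one 4-way multiple-choice task.
steps = [1, 2, 3, 4, 5]
model_a = [0.26, 0.31, 0.35, 0.41, 0.44]
model_b = [0.25, 0.28, 0.33, 0.36, 0.40]
noisy = [0.26, 0.24, 0.27, 0.25, 0.26]

print(keep_task(steps, model_a, model_b, [0.44, 0.42, 0.46], 0.25))
print(keep_task(steps, noisy, model_b, [0.26, 0.25, 0.27], 0.25))
```

The steadily improving curve passes every check, while the flat, noisy curve already fails the monotonicity check. This mirrors the "retain only tasks that satisfy all criteria" rule, although the actual metrics and cut-offs used for FIN-bench-v2 may differ.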

Community


Our paper introduces FIN-bench-v2, a unified and robust benchmark suite for evaluating large language models in Finnish, addressing the scarcity of high-quality evaluation resources for low-resource languages. The suite modernizes the original FIN-bench by migrating it to the LM Evaluation Harness and converting all retained and new datasets into the consistent HuggingFace Datasets format for long-term maintainability. A key feature is the inclusion of both Cloze Formulation (CF) and Multiple-Choice Formulation (MCF) prompts. Following the practice established in NorEval (https://aclanthology.org/2025.findings-acl.181/) and HPLT 3.0 (https://arxiv.org/abs/2511.01066), we create five prompt variants per task to account for prompt sensitivity. We use the FineTasks selection process (https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks) to ensure that only robust, high-signal tasks are included.
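To make the CF/MCF distinction concrete, here is a minimal sketch of how one multiple-choice item might be rendered in each formulation. The template wording ("Kysymys", "Vastaus"), the function names, and the example item are hypothetical and are not taken from the FIN-bench-v2 task configurations. In the cloze form the model scores each answer text as a continuation; in the multiple-choice form the options are listed in the prompt and the model scores the option letters.

```python
from string import ascii_uppercase


def cloze_prompts(question, choices):
    """CF: one (context, continuation) pair per answer choice."""
    context = f"Kysymys: {question}\nVastaus:"
    return [(context, f" {choice}") for choice in choices]


def multiple_choice_prompt(question, choices):
    """MCF: a single prompt listing lettered options, plus the letter targets."""
    labels = list(ascii_uppercase[:len(choices)])
    lines = [f"Kysymys: {question}"]
    lines += [f"{label}. {choice}" for label, choice in zip(labels, choices)]
    lines.append("Vastaus:")
    return "\n".join(lines), [f" {label}" for label in labels]


# Hypothetical example item.
question = "Mikä on Suomen pääkaupunki?"
choices = ["Helsinki", "Tukholma", "Oslo"]

for context, continuation in cloze_prompts(question, choices):
    print(repr(context + continuation))

prompt, targets = multiple_choice_prompt(question, choices)
print(prompt)
print(targets)
```

Prompt variants for the same task would, under this sketch, amount to swapping in alternative template strings while keeping the item data unchanged.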

📝 Our task configurations can be found at https://github.com/LumiOpen/lm-evaluation-harness/tree/main/lm_eval/tasks/finbench_v2.


