arxiv:2512.13330

FIN-bench-v2: A Unified and Robust Benchmark Suite for Evaluating Finnish Large Language Models

Published on Dec 15, 2025

Submitted by Joona Kytöniemi on Dec 16, 2025

TurkuNLP Research Group


Authors:

Joona Kytöniemi, Jousia Piha, Akseli Reunamo

Abstract

AI-generated summary (generated by Qwen/Qwen2.5-Coder-32B-Instruct): FIN-bench-v2 is a unified benchmark suite for evaluating Finnish large language models, incorporating diverse datasets and evaluation criteria.

We introduce FIN-bench-v2, a unified benchmark suite for evaluating large language models in Finnish. FIN-bench-v2 consolidates Finnish versions of widely used benchmarks together with an updated and expanded version of the original FIN-bench into a single, consistently formatted collection, covering multiple-choice and generative tasks across reading comprehension, commonsense reasoning, sentiment analysis, world knowledge, and alignment. All datasets are converted to HuggingFace Datasets, which include both cloze and multiple-choice prompt formulations with five variants per task, and we incorporate human annotation or review for machine-translated resources such as GoldenSwag and XED. To select robust tasks, we pretrain a set of 2.15B-parameter decoder-only models and use their learning curves to compute monotonicity, signal-to-noise, non-random performance, and model ordering consistency, retaining only tasks that satisfy all criteria. We further evaluate a set of larger instruction-tuned models to characterize performance across tasks and prompt formulations. All datasets, prompts, and evaluation configurations are publicly available via our fork of the Language Model Evaluation Harness at https://github.com/LumiOpen/lm-evaluation-harness. Supplementary resources are released in a separate repository at https://github.com/TurkuNLP/FIN-bench-v2.
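The four selection criteria named in the abstract can be illustrated with a minimal sketch. The formulas below are assumptions chosen for illustration (Spearman correlation for monotonicity, mean over standard deviation for signal-to-noise, margin over chance for non-random performance, and pairwise agreement with the final ranking for ordering consistency); the paper's exact definitions and thresholds may differ, and all function names are hypothetical.

```python
from itertools import combinations
from statistics import mean, pstdev


def _ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def _pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def monotonicity(steps, scores):
    """Assumed definition: Spearman correlation of training step vs. score."""
    return _pearson(_ranks(steps), _ranks(scores))


def signal_to_noise(final_scores):
    """Assumed definition: mean over spread of late-checkpoint scores."""
    spread = pstdev(final_scores)
    return float("inf") if spread == 0 else mean(final_scores) / spread


def above_random(final_scores, random_baseline):
    """Assumed definition: late-training mean score minus chance level."""
    return mean(final_scores) - random_baseline


def ordering_consistency(curves):
    """Assumed definition: share of (model pair, checkpoint) comparisons
    whose ordering matches the ordering at the final checkpoint."""
    agree = total = 0
    for a, b in combinations(curves, 2):
        final_sign = (a[-1] > b[-1]) - (a[-1] < b[-1])
        for sa, sb in zip(a[:-1], b[:-1]):
            total += 1
            agree += ((sa > sb) - (sa < sb)) == final_sign
    return agree / total if total else 1.0
```

Under this sketch, a task would be retained only if every metric clears a chosen threshold, mirroring the abstract's rule of keeping tasks that satisfy all criteria; the thresholds themselves are not given on this page.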


Community

kjoona (paper author, paper submitter), Dec 16, 2025

Our paper introduces FIN-bench-v2, a unified and robust benchmark suite for evaluating large language models in Finnish, addressing the scarcity of high-quality evaluation resources for low-resource languages. The suite modernizes the original FIN-bench by migrating it to the LM Evaluation Harness and converting all retained and new datasets into the consistent HuggingFace Datasets format for long-term maintainability. A key feature is the inclusion of both Cloze Formulation (CF) and Multiple-Choice Formulation (MCF) prompts. Following the practice established in NorEval (https://aclanthology.org/2025.findings-acl.181/) and HPLT 3.0 (https://arxiv.org/abs/2511.01066), we create five prompt variants per task to account for prompt sensitivity. We use the FineTasks selection process (https://huggingface.co/spaces/HuggingFaceFW/blogpost-fine-tasks) to ensure that only robust, high-signal tasks are included.
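To make the CF/MCF distinction concrete, here is a minimal sketch of how the two formulations could be built from the same item. The templates ("Kysymys:", "Vastaus:"), the function names, and the example item are hypothetical and are not the prompts shipped with FIN-bench-v2. In the cloze formulation the model scores each answer text as a continuation; in the multiple-choice formulation the options are listed in the prompt and the model scores only the option letter.

```python
from string import ascii_uppercase


def build_cf(question, choices):
    """Cloze formulation: candidate continuations are the answer texts."""
    context = f"Kysymys: {question}\nVastaus:"
    continuations = [f" {choice}" for choice in choices]
    return context, continuations


def build_mcf(question, choices):
    """Multiple-choice formulation: options appear in the prompt and the
    candidate continuations are the option letters."""
    labels = ascii_uppercase[: len(choices)]
    options = "\n".join(f"{label}. {choice}" for label, choice in zip(labels, choices))
    context = f"Kysymys: {question}\n{options}\nVastaus:"
    continuations = [f" {label}" for label in labels]
    return context, continuations


# Hypothetical example item
question = "Mikä on Suomen pääkaupunki?"
choices = ["Helsinki", "Tukholma", "Oslo"]

cf_context, cf_targets = build_cf(question, choices)
mcf_context, mcf_targets = build_mcf(question, choices)
```

An evaluation harness would then compare the model's log-likelihood of each continuation given the context and pick the highest-scoring one. Prompt variants, as described above, would correspond to several alternative templates per task.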

📝 Our task configurations can be found at https://github.com/LumiOpen/lm-evaluation-harness/tree/main/lm_eval/tasks/finbench_v2.