
Continued Pretraining: A Practical Playbook for Language-Specific LLM Adaptation#

June 18, 2025 by Elaine Zosa, Jouni Luoma, Kai Hakala, Antti Virtanen, Mika Koistinen, Jonathan Burdge.
17 min read. | 4142 total words.
AI/ML, LLM, Fine-Tuning

What if you could make a state-of-the-art LLM fluent in a new language—without training from scratch? In this guide, we show how we did just that with Finnish.

State-of-the-art language models are transformative, but their development is resource-intensive and overwhelmingly English-centric. This presents a significant challenge for organizations and researchers seeking high-performance models in other languages or specialized domains. Training a new model from scratch is often prohibitively expensive, creating a barrier to entry and innovation.

Continued pretraining (CPT) offers a powerful and compute-efficient path forward. By adapting a strong, existing model, it is possible to achieve state-of-the-art performance in a new language for a fraction of the cost of training a model from scratch, often without compromising the model’s original capabilities.

However, the path to successful CPT involves a series of critical decisions, from model selection and data mixing to the specifics of the training and alignment process. This document serves as a comprehensive technical playbook, sharing the concrete methodologies and lessons learned from our work creating the Poro 2 model family: Finnish-capable versions of Llama 3.1 8B and 70B.

This guide is for practitioners who need to move beyond theory and implement a real-world language adaptation pipeline. We provide an end-to-end walkthrough covering:

  • Strategic model and data selection.

  • The technical framework for training and evaluation.

  • Results from our experiments that informed our final configuration.

  • Our process for generating synthetic data and conducting post-training alignment.

By sharing our process, we aim to provide a clear, replicable roadmap for developing high-quality, multilingual LLMs.

Collaborations and resources#

This work was done in close collaboration with TurkuNLP from the University of Turku. TurkuNLP’s contributions came as part of the High Performance Language Technologies project. All work was performed using AMD Instinct MI250X processors on the LUMI supercomputer.

Continued Pretraining#

This section introduces the core techniques used in this work, Continued Pretraining (CPT) and presents the high-level results to demonstrate its effectiveness before we detail the methodology.

Continued Pretraining is an attractive option for adding new capabilities–like support for a new language–to existing models, for a fraction of the compute compared to training a model from scratch. Done correctly, strong English language capabilities can contribute cross-lingually to the performance of another target language, using a relatively small amount of data in the target language, and without disrupting the model’s existing capabilities.

Poro 2 was created by using CPT on the Llama 3.1 8B and 70B base models, and we show that we substantially improved performance in Finnish compared to the Llama Instruct models, while maintaining solid English proficiency. The figure below illustrates this, showing a strong head-to-head win rate for Poro 2 models on the Finnish MT-Bench.

MT-Bench Finnish performance, pairwise comparison

Our models show excellent performance in Finnish. Though we use Llama 3.3 70B Instruct as the basis for our post-training synthetic data pipeline, Poro 2 70B substantially outperforms it in Finnish in a head-to-head comparison using MT-Bench. Our smaller Poro 2 8B outperforms Llama 3.1 8B, and even manages to tie the much larger Llama 3.3 70B model in Finnish.

This strong performance is consistent across a broader set of post-training evaluations as well. As the following chart of averaged scores shows, our CPT process resulted in significant Finnish-language gains for both model sizes.

Finnish post training evals

Model Selection#

The first step in any CPT project is selecting a base model. This decision has significant implications for the final model’s capabilities, cost and usability. This section outlines the key criteria we evaluated: raw capabilities, licensing terms, model size, and tokenizer efficiency.

Capabilities#

To determine which model to use as the basis for continued pretraining, we test both multilingually-oriented models, and models that are targeted primarily toward English. We find that the strongest predictor of the model’s eventual capability in the target language is the model’s existing capability in English.

Our early experiments, illustrated in the figure below, demonstrate this: stronger English models like Llama 3.1 8B consistently adapted better to Finnish during CPT than multipurpose multilingual models that were less capable in English.

Eval performance, before and after CPT

License#

Licensing is a major consideration when selecting a model to work with. Fully open source licenses like Apache 2.0 or MIT have the fewest restrictions on what you can do with the model after you train it. Other open weights models like Gemma or Llama models come with licensing constraints that may restrict what you can do with the models or their outputs.

Size#

Though they are more expensive to train and run, generally a larger model is best to maximize capabilities. Larger models tend to be more capable, more sample efficient, and are better at incorporating new training without damaging existing performance in other domains.

In our early experiments, we did not include math data in the pretraining data mix (50B tokens: 70% Finnish, 25% English, 4% code, 1% paired texts). Typically, when a domain is excluded during continued pretraining, model capabilities in that domain will decline, and that certainly occurred with the 8B model: math performance declined substantially. But as shown in the figure below, despite using the same data mix, math performance of the 70B model declined significantly less.

Math performance decline without math data

Other Considerations#

Tokenizer efficiency in the target language is an important consideration when creating or selecting language models. Some models will have better tokenization for your target language than others. Tokenizer efficiency is usually measured with a metric called fertility–a measure of how many tokens it takes to represent a word, on average.

Having an efficient tokenizer is not mandatory to achieve good results from continued pretraining, but inefficient tokenization means that the model needs to use more compute to generate the same amount of text, and less content can fit within the model’s context window. An inefficient tokenizer may also slightly reduce achievable model performance.

Sample scripts for measuring tokenizer fertility can be found in our evals repo. The following chart compares tokenizer fertility for several models, showing that while the base Llama 3.1 tokenizer is less efficient for Finnish than a custom one (like the original Poro), it is still a viable starting point.

Tokenizer fertility, Finnish
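
As a rough illustration, fertility can be estimated in a few lines of Python. The sketch below is written for this guide rather than taken from the evals repo; the model path and sample file are placeholders, and a simple whitespace split stands in for proper word segmentation:

from transformers import AutoTokenizer

# Placeholder model path and sample corpus; substitute your own.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
with open("finnish_sample.txt", encoding="utf-8") as f:
    text = f.read()

n_words = len(text.split())                    # rough word count via whitespace split
n_tokens = len(tokenizer(text)["input_ids"])   # token count for the same text
print(f"fertility: {n_tokens / n_words:.2f} tokens per word")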

Model Selection Summary#

Our analysis led us to select the Llama 3.1 8B and 70B models. The primary drivers for this decision were their excellent English performance, which we found to be a strong predictor of success in the target language, and the availability of both a large and small model within the same family.

Machine Translation#

A significant challenge in developing multilingual models is the scarcity of high-quality training and evaluation data in the target language. This section explores our strategic use of machine translation (MT) to bridge this resource gap for both evaluation and training datasets.

The vast majority of available resources for LLM evaluation and training are in English. MT is a cost-effective method of making these resources available for other languages. There are many MT systems out there with varying cost, quality, and license restrictions.

Our approach to machine translation varies by the purpose of the data:

  • Evaluation Data: We used high-quality translations from the commercial DeepL model to ensure our benchmarks were as accurate as possible.

  • Pretraining Data: We used no machine-translated data in this phase, relying entirely on native-language corpora.

  • Post-Training Data: We used our own open models to translate instruction prompts at scale, which allowed us to control for cost and licensing.

Translation for Evaluations#

Translating existing English language evaluations into your target language is a practical option to quickly assess model performance in that language, but there are drawbacks:

  • English-based evaluations may contain embedded cultural assumptions that are irrelevant or misleading in your target language.

  • Translation errors can reduce the accuracy of the evaluation.

However, we find that translated evals are still very useful, providing clear and useful signals to differentiate relative model performance. If you do not have the time or resources to create new evaluations from scratch, or pay for expensive human translations and annotations, machine translated evals are a perfectly acceptable way to begin.

Previously published work [SB de Vroe et al.] supports the idea that machine-translated evals, while imperfect, can still give useful information about model capabilities in the target language. It is important to remember, though, that scores for the original and translated versions of an evaluation are unlikely to be directly comparable, due to the limitations mentioned above.

Translations for Post-Training Data#

While translation is a straightforward method of obtaining post-training datasets in another language, previous research has shown that translated data can introduce hallucinations into the model [Pipatanakul et al.] or is generally of lower quality than data natively generated in the target language [Aryabumi et al., Dang et al.]. Generating post-training data from scratch in Finnish, however, means we would not be able to take advantage of the high-quality curated datasets available in English [Lambert et al.].

To balance the need to minimize the use of translated data while ensuring that our model learns the same set of skills in English and Finnish, we translated only the prompts for the supervised fine-tuning (SFT) data and natively generated responses in Finnish, as described below in the Synthetic Data section. The Direct Preference Optimization (DPO) data, however, needed to be translated with prompts and their corresponding chosen and rejected responses, though we did not ultimately use the translated versions.

To translate post-training data, we use Poro, an earlier Finnish-English model we trained from scratch. In previous work, we benchmarked Poro and other MT models such as Opus-MT [Tiedemann et al.] and found that Poro offered better quality for Finnish translations among the open models [Luukkonen et al.]. As shown in the FLORES-101 benchmark results below, our Poro-34B model is competitive with other leading open translation systems.

FLORES-101 translation performance

Translations with Few-Shot Prompting#

To efficiently translate a large number of prompts and responses, we developed a framework for large-scale translation jobs with a vLLM inference backend. We use a few-shot prompt to obtain translations from Poro-34B. The few-shot prompt is composed of sentence pairs taken from the FLORES-200 dev set.

LumiOpen/translation_dispatcher
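
To make the approach concrete, here is a minimal sketch of few-shot translation with vLLM. The prompt template and stop sequence are illustrative rather than the exact format used in translation_dispatcher; real runs drew the example pairs from the FLORES-200 dev set:

from vllm import LLM, SamplingParams

# Illustrative few-shot template built from English-Finnish sentence pairs.
FEW_SHOT = (
    "English: The weather is beautiful today.\n"
    "Finnish: Sää on tänään kaunis.\n"
    "English: Where is the nearest train station?\n"
    "Finnish: Missä on lähin rautatieasema?\n"
)

def make_prompt(sentence: str) -> str:
    return f"{FEW_SHOT}English: {sentence}\nFinnish:"

llm = LLM(model="LumiOpen/Poro-34B")
params = SamplingParams(temperature=0.0, max_tokens=256, stop=["\nEnglish:"])

prompts = [make_prompt("Continued pretraining is a compute-efficient approach.")]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())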

Machine Translation Summary#

Our approach to translation was pragmatic and use-case driven. We invested in high-quality commercial translations for our evaluation benchmarks to ensure reliable measurement, and used our own open models for post-training data to balance cost and licensing. This hybrid strategy was essential for building a robust data pipeline, providing us with the translated benchmarks needed to properly evaluate our model’s new capabilities.

Pretraining Evaluations#

With a strategy for sourcing both native and translated data, the next step is to establish a robust evaluation suite to measure our progress. Effective evaluation is critical for verifying the capabilities of the base models and ensuring that continued pretraining has the desired effect. This section details the frameworks and specific benchmarks, including the machine-translated tests discussed previously, that we used to assess performance in English, Finnish, code, and math throughout the CPT process.

Evaluation Frameworks#

We use lm-evaluation-harness for most of our evaluations, and maintain our own fork with a number of translated evals for the languages we work with. (Another popular choice for evaluations is Lighteval.)

We also maintain an automation framework to automate some of our eval workflow on Slurm-based HPC systems like LUMI, which reduces the manual effort in running a large number of evaluations.
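
For reference, a typical harness run through its Python API might look like the sketch below. Task names come from our fork, and the exact API surface depends on the installed lm-evaluation-harness version:

import lm_eval

# Minimal sketch: evaluate an English task and its Finnish translation side by side.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B,dtype=bfloat16",
    tasks=["arc_challenge", "arc_challenge_mt_fi"],
    batch_size=8,
)
print(results["results"])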

Pretraining Evaluations Selected#

We include a minimal set of popular, basic evaluations and their translated equivalents for our continued pretraining experiments. All of these are available in our lm-evaluation-harness fork. For our Finnish evaluations, we use existing translations of ARC, MMLU, HellaSwag, GSM8K, and TruthfulQA that were translated with DeepL [Thellmann et al].

English

  • arc_challenge

  • mmlu

  • truthfulqa

  • hellaswag

Finnish (translated)

  • arc_challenge_mt_fi

  • mmlu_mt_fi

  • truthfulqa_mt_fi

  • hellaswag_mt_fi

Code

  • humaneval

Math

  • gsm8k

  • gsm8k_mt_fi

Translation

  • flores200 (en to fi, fi to en)

Pretraining Evaluations Summary#

We established a multi-domain evaluation suite using the lm-evaluation-harness framework and a custom automation framework. The final set of benchmarks provided comprehensive coverage of the key capabilities we aimed to improve and preserve, forming the basis for the analysis in our experimental runs.

Continued Pretraining Data Mixes#

The composition of the training data is arguably the most critical factor in continued pretraining, directly influencing the balance between acquiring new skills and retaining existing ones. This section covers the theory behind data mixing to prevent catastrophic forgetting and surveys the datasets we considered for Finnish, English, code, and math.

Theory#

Continued pretraining offers possibilities for compute- and data-efficient LLM training, utilizing knowledge transfer across domains. However, strong distribution shifts–such as training data in a new language–tend to result in catastrophic forgetting and capability loss on the original domains. Previous studies suggest that such forgetting can be alleviated by mixing the original training data with the new domain data [Ibrahim et al.]. For strong distribution shifts, the amount of replayed original data required may be as much as 50% of the whole data mix, though there is a tradeoff: mixing in a lot of data from the original model domains may somewhat limit peak performance in the new domain, so finding the right balance between data in existing domains and data in new domains is key.

For example, the figure below shows the impact of excluding math data during our continued pretraining experiments. In the first “no math” run, we used a 50B token data mix of 70% Finnish, 25% English, 4% code and 1% paired text, which caused math performance to decline in English, and remain low in Finnish. Swapping out the 4% code for 4% math substantially maintained English math performance during continued pretraining, while also boosting math performance in Finnish, despite the fact that math dataset is English oriented.

English and Finnish math performance

When applying continued pretraining on existing models, we rarely have access to the original training data and may even lack general information about what it contained, which increases the challenge of finding an optimal data mix. Because we are interested in maintaining performance in English, math and code while improving performance in Finnish, we combine data from all of these domains, and evaluate the impact of different data mixes.

Available Datasets#

Although data acquisition is a critical step in producing capable LLMs, it is also a challenging task, involving various data extraction, language identification, filtering, and refinement steps. In this playbook we thus focus on the applicability of existing multilingual datasets. In particular we evaluate the Finnish subsets of FineWeb-2, HPLT v2 and CulturaX. To prevent catastrophic forgetting of English and code capabilities, our data mixes also included English, math, and code. To this end we utilize the FineWeb-Edu, StarCoder and FineMath datasets.

Broadly multilingual datasets#

FineWeb2#

FineWeb2 is a strictly non-English dataset with 1,893 language-script pairs, although most of these have a very small amount of data. The dataset contains almost 3T words. [FineWeb2]

HPLT v2#

The second release of HPLT contains 8T tokens in 193 languages extracted from Internet Archive and Common Crawl crawls. Almost 60% of this data is non-English. An additional parallel corpora with 380M sentence pairs is also included. [Burchell et al.]

CulturaX#

CulturaX covers 167 languages with 6.3T tokens in total. CulturaX combines two previous multilingual datasets: mC4 and OSCAR, with more than half of the data being non-English. [Nguyen et al.]

As slightly different token calculations are used for each of these datasets, we report the size of the Finnish subsets in terms of tokens after preprocessing with the Llama 3.1 tokenizer below.

Dataset     Tokens (Finnish)
FineWeb2    49,649,704,835
HPLT v2     61,104,545,039
CulturaX    43,987,780,310

English#

We elected to use a smaller, high quality English dataset rather than a general web text dataset, in the belief that this might more closely mimic the pretraining dataset the models used. We also hypothesize that the higher quality data would be of more value in maintaining the peak capabilities of the model.

We use FineWeb-Edu, a 1.3T token subset of the English FineWeb data filtered for educational content [Penedo et al.]. It has been demonstrated to result in similar LLM capabilities as the full FineWeb data with 10x fewer tokens.

Code#

We use StarCoder, a cleaned subset of The Stack covering 86 programming languages. In addition to plain source code, the dataset contains GitHub issues and commits as well as Jupyter notebooks. [Li et al.]

Math#

We use FineMath, a mathematical educational dataset extracted from Common Crawl. We use the highest quality subsets of the data, namely FineMath4+ and Infi-WebMath4+. [Liu et al.]

Parallel data#

Tatoeba Challenge is a machine translation benchmark providing training and evaluation data for thousands of language pairs [Tiedemann]. The training data is sourced from the OPUS parallel corpora and test data from Tatoeba. The English-Finnish parallel data contains over 140M sentence pairs. We evaluated the impact of the paired sentences, but did not ultimately include this data in the final training run.

Continued Pretraining Data Mix Summary#

Our data strategy involved selecting high-quality, domain-specific datasets to cover all key capabilities. For Finnish, we evaluated several large web-based corpora. To preserve existing performance, we chose specialized datasets for English (FineWeb-Edu), code (StarCoder), and math (FineMath). The final proportions of these datasets in our training mix were determined through the experiments described in a later section.

Continued Pretraining with Megatron-LM#

This section moves from strategy to implementation, outlining the technical workflow for training the models using the Megatron-LM framework on the LUMI supercomputer. We cover the three primary stages: preparing the data, converting the base model into the Megatron checkpoint format, and executing the training run.

To train with Megatron-LM, we first pretokenize the data into the format that Megatron-LM requires. The base models we use as starting points for CPT are in the Hugging Face transformers format, which needs to be converted to the Megatron checkpoint format before we can proceed with CPT. After training, the checkpoint is converted back to the Hugging Face transformers format for easier evaluation and distribution.

We performed the continued pretraining with an older fork of Megatron-LM framework, but more recently an updated ROCm fork of Megatron-LM has been made available, and we would recommend using that version instead, because it has a number of new features and improved training performance. We provide examples of all the steps described here in the scripts repository using the updated version.

Pre-Tokenizing Training Data#

Pre-tokenizing converts training text from jsonl-formatted input files into numerical, tokenized representations. The data at this stage often consists of hundreds or thousands of files from different datasets. Preprocessing of the files is done in parallel, and as a result, for each input file we get two files following the Megatron mmap format (a binary data file and an index file). An example of converting a single file is shown below:

#!/bin/bash

# Paths (update these or pass the paths as a parameter)
INPUT_FILE=$1
OUTPUT_FOLDER=$2
OUTPUT_FILE=$OUTPUT_FOLDER/$(basename $INPUT_FILE)
WORKERS=...
mkdir -p $OUTPUT_FOLDER

python ./Megatron-LM/tools/preprocess_data.py \
    --input $INPUT_FILE \
    --output $OUTPUT_FILE \
    --json-keys text \
    --tokenizer-type HuggingFaceTokenizer \
    --tokenizer-model <path_to_your_hf_model> \
    --append-eod \
    --log-interval 50000 \
    --workers $WORKERS

The converted files are then merged into larger files, one per original dataset. These files follow the same mmap format (.bin, .idx). During training Megatron-LM creates batches of data at the desired data mix by specifying sampling priorities for each one of the merged dataset files.

#!/bin/bash

# Paths (update these or pass them as a parameter)
# INPUT_FOLDER contains multiple .bin and .idx files which are combined
# OUTPUT_FILE is the name of the merged file without .bin or .idx suffix

INPUT_FOLDER=$1
OUTPUT_FOLDER=$2
OUTPUT_FILE=$3

mkdir -p $OUTPUT_FOLDER

python ./Megatron-LM/tools/merge_datasets.py \
    --input $INPUT_FOLDER \
    --output-prefix $OUTPUT_FOLDER/$OUTPUT_FILE

Reference Slurm scripts for tokenization and merging in Lumi environment are available in our scripts repository.
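
During training, the mix itself is then expressed as sampling weights over the merged files: Megatron-LM's pretraining entry point accepts the blend as interleaved weight/prefix pairs in its --data-path argument. The small sketch below builds such an argument from a mix definition (the paths are placeholders):

# Sketch: turn a data mix into Megatron-LM's --data-path argument, which
# interleaves sampling weights with merged dataset prefixes (no .bin/.idx suffix).
mix = {
    "/scratch/data/merged/finnish_fineweb2": 0.30,
    "/scratch/data/merged/english_fineweb_edu": 0.30,
    "/scratch/data/merged/starcoder": 0.30,
    "/scratch/data/merged/finemath": 0.10,
}

data_path = " ".join(f"{weight} {prefix}" for prefix, weight in mix.items())
print(f"--data-path {data_path}")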

Converting model to Megatron format#

For training with Megatron-LM, we convert the base model from Hugging Face format into a Megatron checkpoint. The parallelism configuration (tensor parallel, pipeline parallel) is decided at this point. We used TP=2, PP=1 for the 8B model and TP=8, PP=8 for the 70B model on the LUMI AMD Instinct MI250X-based cluster. An example script for converting the 8B model to Megatron format is shown below:

#!/bin/bash

HF_FORMAT_DIR=<path to your hf model>
TOKENIZER_MODEL=$HF_FORMAT_DIR
TARGET_PP=1
TARGET_TP=2
MEGATRON_FORMAT_DIR=megatron-checkpoints/llama3.1-8B-TP-$TARGET_TP-PP-$TARGET_PP

python3 Megatron-LM/tools/checkpoint/convert.py \
    --model-type GPT \
    --loader llama_mistral \
    --model-size llama3-8B \
    --checkpoint-type 'hf' \
    --saver mcore \
    --target-tensor-parallel-size ${TARGET_TP} \
    --target-pipeline-parallel-size ${TARGET_PP} \
    --load-dir ${HF_FORMAT_DIR} \
    --save-dir ${MEGATRON_FORMAT_DIR} \
    --tokenizer-model ${TOKENIZER_MODEL}

Limitation: Converting the checkpoint to use virtual pipeline parallelism (VPP) was not working correctly at the time of writing, but using VPP would give a performance benefit with the 70B model.

Reference scripts for converting 8B and 70B Llama models to Megatron format in a Slurm-based environment like LUMI are available in our scripts repository.

Training the model#

Once we’ve converted the model to a Megatron-LM-compatible checkpoint, we are ready to begin training. Refer to our example Megatron-LM pretraining Slurm script in our scripts repository for the specifics of our Megatron configuration.

Based on our testing and the reference configuration for the Llama 3.1 family of models, these are the hyperparameters we chose for training:

Hyperparameter      8B       70B
Epochs              1        1
Global batch size   512      512
Micro batch size    1        1
Learning rate       3e-4     1.5e-4
LR scheduler        cosine   cosine
Min LR              1e-8     1e-8
Warmup ratio        0.05     0.05
Max seq length      8192     8192

Converting Megatron checkpoint to Hugging Face transformers format#

After the model is trained, the final Megatron checkpoint is converted back to a Hugging Face transformers checkpoint for easier use on other platforms and downstream tasks. Here is an example script to convert the latest checkpoint in LOAD_DIR to Hugging Face transformers format.

#!/bin/bash

# paths (update these or pass as a parameter)
LOAD_DIR=$1
OUT_DIR=$2
TOKENIZER_DIR=...  # <----- Change this to your needs

python3 Megatron-LM/tools/checkpoint/convert.py \
    --model-type GPT \
    --loader mcore \
    --saver llama_mistral \
    --load-dir $LOAD_DIR \
    --save-dir $OUT_DIR \
    --tokenizer-dir $TOKENIZER_DIR

We have an example script for converting any checkpoint (not just the latest) in the load directory at LumiOpen/poro2-scripts-dev.

Continued Pretraining with Megatron-LM Summary#

The core technical workflow involved preparing data into the Megatron mmap format, converting Hugging Face checkpoints to the Megatron format with appropriate parallelism settings (TP/PP), and executing the training. The example scripts and hyperparameters provided in this section serve as a template for replicating this process on a similar HPC system.

Experimental Runs#

Before committing to a full-scale training run, we conducted a series of smaller experiments to determine the optimal configuration. This section details our approach to testing key variables—including learning rates, data sources, data quantity, and data mixing ratios—to arrive at a final configuration that balanced Finnish language acquisition with the preservation of existing capabilities.

We start our experiments with shorter training runs on smaller model variants. The goal of these experimental runs is to evaluate the effect of configuration options like hyperparameters or data mix on downstream model performance, so that we can arrive at our final training run configuration.

The experiments focus on alternative base models, learning rates, learning rate schedules, alternative data sources, data mixes, data repetition, and weight decay. For simplicity we evaluate continued pretraining independently from post-training instead of executing the whole training pipeline for every experiment.

We start with a baseline data mix of 70% Finnish CulturaX, 25% English FineWeb-Edu, 4% StarCoder and 1% English-Finnish parallel data. Initial models are pre-trained for 50B tokens. This setup already improves the Finnish capabilities of Llama 3.1 8B noticeably, with ~6pp average improvement on our Finnish benchmarks. However, we also observe a ~10pp average decrease in English capabilities, most pronounced in math evaluations.

To assess the impact of learning rate and schedules we run a grid of experiments with varying learning rates with both cosine and trapezoidal schedules. In our experiments the highest LR value, 3E-04, results in the best evaluation scores; although we observe some loss spikes during the training run, these do not seem to measurably affect performance. We do not observe direct benefits from the trapezoidal/WSD LR schedule, though it does have some indirect benefits by enabling the option of annealing/midtraining on higher quality data mixes. For this release, however, we stay with the cosine schedule for the released model. A learning rate of 3E-04 is used for all subsequent experiments with the Llama 3.1 8B model.

To evaluate the impact of data quantity and compute amount, we extend the experiments from 50B total tokens to a full epoch of the Finnish CulturaX data (44B tokens Finnish, 63B in total) and also assess the impact of repeating the Finnish data 2 or 3 times, while maintaining the same overall data ratio. In pretraining settings, previous research indicates that repeating training data for data-constrained domains can help model performance [Muennighoff et al.], but in our experiments we found that repeating our Finnish dataset did not reliably improve Finnish capabilities, and we observed a steady decline in English evaluations with longer training runs. As shown in the chart below, training for more than one epoch of Finnish data offered no clear benefit, so we chose to use a single epoch for our final run.

Influence of varying training data epochs

In addition to the CulturaX dataset we consider the FineWeb2 and HPLT v2 datasets. In these experiments we train the model for one full epoch of the Finnish data from each dataset. As these datasets vary slightly in size, they are not compute equivalent, but instead shed light on the differences between openly available datasets. In our experiments FineWeb2 showed the best performance for Finnish, whereas HPLT v2, despite being the largest of the datasets, resulted in weakest performance. The results of this comparison are shown below; while all datasets performed well, FineWeb2 delivered the best Finnish results, making it our choice for the final data mix.

Influence of the selected CPT dataset against baseline

To enhance the cross-lingual transfer, we also conduct experiments varying the amount of Finnish-English parallel data from the Tatoeba Challenge corpus in the overall mix. We do not observe overall improvements in the model performance with this data, although more of it does improve the model’s translation capabilities slightly. For simplicity we exclude this data source in our final data mix.

As we observe some degradation in English, code and math evaluations with our initial data mix, after experimentation we adjust the final data mix to 30% Finnish, 30% English, 30% code and 10% math data, aiming to provide a balance of data in all critical domains, and find that this mix meets our performance goals. We train the released models for one epoch of Finnish FineWeb2 data, resulting in 165B tokens overall with this data mix (see table below). This data mix not only maintains the original model capabilities we’re concerned with, but substantially improves Finnish performance.

Domain    Data source   Proportion (%)   Tokens
Finnish   FineWeb2      30               50B
English   FineWeb-Edu   30               50B
Code      StarCoder     30               50B
Math      FineMath      10               16B
Total                   100              165B

As a final experiment we measure the impact of weight decay. Some published results use the Megatron-LM default value of 0.1, but smaller values are often used as well. In our test run a lower value of 0.01 did not measurably affect downstream eval performance, so we stay with the default 0.1 weight decay for our final training run.

Experimental Runs Summary#

Through systematic experimentation, we determined our final, optimized configuration. The key findings were: 1) a higher learning rate (3E-04 for 8B) was effective; 2) repeating data did not improve performance; 3) FineWeb2 was the best-performing Finnish dataset; and 4) a balanced data mix of 30% Finnish, 30% English, 30% code, and 10% math was crucial for preventing catastrophic forgetting. This configuration, detailed in the table above, formed the basis for our final pretraining run.

Final Pretraining Results and Analysis#

Following the methodology and optimized configuration determined in the previous sections, we proceeded with the full continued pretraining run. This section presents the final evaluation results and provides an analysis of the outcomes for both the 8B and 70B models.

Our Continued Pretraining significantly improved Finnish performance across our evaluation set, while largely maintaining (and in some cases improving) English performance. The improvement is much more pronounced in the smaller 8B model compared to the larger 70B model, but the smaller model was improving from a much lower point of initial capability.

Before CPT, the base models likely saw similar data distributions, including only a very small amount of Finnish data. Despite this, the 70B model exhibited significantly better initial Finnish performance, likely due to its greater capacity and higher sample efficiency, allowing it to generalize better from limited exposure. This size advantage is also reflected in the higher Finnish performance of the 70B model after CPT.

The table below presents a detailed breakdown of these results across our full suite of pretraining evaluations.

Benchmark              Llama-3.1-8B   Poro 2 8B base   Llama-3.1-70B   Poro 2 70B base
arc_challenge          57.94          60.75            69.45           69.97
hellaswag              80.05          80.55            87.81           87.85
mmlu                   65.08          63.48            78.59           78.20
truthfulqa_mc          54.02          48.06            49.78           51.43
gsm8k                  78.01          54.81            81.05           81.35
eng no math avg        64.27          63.21            71.41           71.86
eng total avg          67.02          61.53            73.34           73.76
arc_challenge_mt_fi    38.82          48.90            54.52           61.01
hellaswag_mt_fi        30.97          50.49            52.10           58.07
mmlu_mt_fi             49.64          56.25            71.29           73.76
truthfulqa_mc_mt_fi    45.54          49.78            53.64           55.53
gsm8k_mt_fi            30.93          44.43            69.90           72.78
fin no math            41.24          51.35            57.89           62.09
fin total avg          39.18          49.97            60.29           64.23
flores200 en_fi bleu   23.92          36.48            35.02           40.03
flores200 fi_en bleu   37.42          40.71            41.67           43.04
flores200 en_fi chrf   50.36          60.14            59.16           62.50
flores200 fi_en chrf   60.44          62.90            63.03           64.16
translation avg        43.03          50.06            49.72           52.43
humaneval_pass@1       35.97          31.09            57.3            48.78
humaneval_pass@10      53.66          48.17            73.17           64.63

Poro 2 8B#

For the 8B model our approach leads to a consistent improvement in all Finnish benchmarks, with ~10pp average improvement. We are also able to maintain most of the original English capabilities with a 1pp average decrease in our evaluations.

Although we observe almost 14pp improvement on the Finnish GSM8K math evaluation, we are not able to maintain the original Llama 3.1 performance on the English variant of GSM8K. If higher math performance is critical, adding more math data to the mix, or annealing with a higher mix of high quality math data would help improve performance.

Poro 2 70B#

Llama 3.1 70B has considerably stronger initial Finnish capabilities than the smaller variant, but our continued pretraining approach still provides improvements in all of our Finnish benchmarks. The larger model is also able to maintain, and in most cases even improve, on its English capabilities. Overall Finnish scores improve by an average of ~4pp, while English capabilities remain largely the same with a ~0.4pp average improvement.

Performance Trade-offs: Code and Math#

Our primary goal for this project was to significantly enhance Finnish language capabilities. The results demonstrate clear success on that front. However, this focus on language acquisition revealed the challenges of maintaining performance in specialized domains when the original training data is unknown.

On the humaneval benchmark, for instance, our Poro 2 models do not match the performance of the original Llama 3.1 models. This is a common challenge in continued pretraining. Without access to the original pretraining dataset, it is difficult to replicate the exact data quality and domain balance that gave the base model its initial capabilities. While our data mix included a substantial portion of code (30% from StarCoder), the performance gap suggests that the composition or quality of the original data was different from our own.

This highlights a key principle of CPT: performance is highly sensitive to the data mix, and adapting a model often involves navigating trade-offs between new and existing capabilities, especially with incomplete knowledge of the original training recipe. For projects where coding or math proficiency is the highest priority, one might need to experiment with different data allocations or seek out more specialized, high-quality datasets for those domains.

Final Pretraining Results Summary#

The final results confirm that our CPT strategy was highly effective at its primary goal: we achieved significant gains in Finnish across all benchmarks. The analysis also highlights the critical nature of data mixing and model size in managing performance trade-offs. While the larger 70B model showed more resilience, the performance drops in specialized domains like coding and math underscore the challenges of CPT when the original training data is unknown. This demonstrates that CPT is a powerful method for language acquisition, provided that data mixes are carefully balanced to align with a project’s specific priorities.

Post-training#

After continued pretraining, the base models are proficient in Finnish but are not yet helpful assistants. The post-training phase aligns the models to follow instructions and engage in conversation. This section details our full post-training pipeline, including the new evaluation benchmarks we used for this phase, our supervised fine-tuning (SFT) and direct preference optimization (DPO) processes, and the datasets we selected and generated.

Post-training evaluation#

We evaluate our post-training checkpoints on general instruction following and open-ended conversational following in English and Finnish.

MTBench#

MTBench is a multi-turn open-ended conversation benchmark that uses LLM-as-judge to score the model’s responses [Zheng et al.]. For Finnish, we machine-translated and manually corrected the MTBench questions into Finnish. We integrated a language identifier so that responses that are not in Finnish are given the lowest possible score of 1. We use GlotLID as our language identifier.

Code base: LumiOpen/FastChat

Translation: https://huggingface.co/datasets/LumiOpen/mtbench_multi
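
As an illustration of the language gate, the sketch below uses GlotLID (a fastText-based identifier distributed on Hugging Face) to check whether a response is Finnish before accepting the judge's score. Repo and label names follow GlotLID's public distribution, and the gating function is our own simplified stand-in:

import fasttext
from huggingface_hub import hf_hub_download

# Download and load the GlotLID fastText model.
model_path = hf_hub_download("cis-lmu/glotlid", "model.bin")
lid = fasttext.load_model(model_path)

def is_finnish(response: str) -> bool:
    # fastText predicts on a single line of text.
    labels, _ = lid.predict(response.replace("\n", " "))
    return labels[0] == "__label__fin_Latn"

def gated_score(response: str, judge_score: float) -> float:
    # Responses that are not in Finnish get the minimum MT-Bench score of 1.
    return judge_score if is_finnish(response) else 1.0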

AlpacaEval 2#

AlpacaEval is a single-turn chat benchmark that uses an LLM to pick a preference between the model’s responses and a set of baseline responses from a SOTA model such as GPT-4. We use the length-controlled AlpacaEval 2 that has been shown to have a higher correlation with the Chat Arena than MTBench [Dubois et al.]. We machine-translated and manually corrected the questions in AlpacaEval into Finnish. We also integrated a language identifier into the evaluation so that the baseline always will be preferred in cases where the model answer is not in Finnish.

Code base: LumiOpen/alpaca_eval

Translation: https://huggingface.co/datasets/LumiOpen/alpaca_eval_multi

For AlpacaEval and MTBench, we use GPT-4o as the judge, because it is cheaper than other popular judge models such as GPT-4 Turbo, and it is a more up-to-date model.

IFEval#

IFEval is an instruction following benchmark where the correctness of the model’s responses can be verified programmatically [Zhou et al.]. We machine-translated and manually corrected the IFEval instructions into Finnish.

We integrated the Finnish IFEval into our fork of LM Eval Harness: LumiOpen/lm-evaluation-harness

Translation: https://huggingface.co/datasets/LumiOpen/ifeval_mt

Post-training process#

Post-training is a process that trains a pretrained model to act as an assistant that can respond to user prompts and follow instructions. Our post-training pipeline involves one round of supervised fine-tuning (SFT) followed by preference tuning, specifically direct preference optimization (DPO). We use the Transformers Reinforcement Learning library (TRL) as our post-training framework.

Our post-training codebase is a fork of the Alignment Handbook [Tunstall et al.]. We make our codebase and recipes available at: LumiOpen/alignment-handbook. Please refer to the alignment handbook repository for details on how to run the code.

SFT#

We perform full-parameter supervised fine-tuning with a curated English and Finnish instruction dataset. For the 8B model, we used a global batch size of 64. We packed samples and used a maximum sequence length of 4,096 tokens. For 70B, we used a global batch size of 128. For 8B we used 4 nodes, while for 70B we used 8 nodes. A summary of our SFT hyperparameters is shown below.

Hyperparameter       8B       70B
Epochs               2        2
Global batch size    64       128
Micro batch size     2        1
Gradient acc steps   1        2
Learning rate        5e-6     5e-6
LR scheduler         linear   linear
Warmup ratio         0.03     0.03
Max seq length       4,096    4,096
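
For orientation, a bare-bones TRL run with these settings might look like the sketch below. It is a simplification of our alignment-handbook recipes, not the exact code: the dataset file and model path are placeholders, and some argument names differ across TRL versions:

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Placeholder dataset in chat format (a "messages" column of role/content turns).
dataset = load_dataset("json", data_files="sft_mix.jsonl", split="train")

config = SFTConfig(
    output_dir="poro2-8b-sft",
    num_train_epochs=2,
    per_device_train_batch_size=2,   # micro batch size for the 8B run
    learning_rate=5e-6,
    lr_scheduler_type="linear",
    warmup_ratio=0.03,
    max_seq_length=4096,
    packing=True,                    # pack samples up to the max sequence length
    bf16=True,
)

trainer = SFTTrainer(
    model="path/to/cpt-checkpoint",  # placeholder: the CPT base model
    args=config,
    train_dataset=dataset,
)
trainer.train()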

DPO#

We perform one round of DPO after SFT to further improve the quality of the model responses. We use the HelpSteer3 dataset for preference tuning because it has an open license and achieved improvements relative to the SFT checkpoint that exceeded other preference datasets in our experiments (see DPO dataset section).

We used 4 nodes of AMD Instinct MI250X GPUs for the 8B model and 8 nodes for the 70B model. Our DPO parameters are shown below.

Hyperparameter       8B       70B
Epochs               3        3
Global batch size    64       64
Micro batch size     2        1
Gradient acc steps   1        1
Beta                 0.01     0.01
Learning rate        5e-7     5e-7
LR scheduler         cosine   cosine
Warmup ratio         0.1      0.1
Max length           4,096    4,096
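
A correspondingly minimal DPO sketch is shown below; again, this is a simplification of the alignment-handbook recipes rather than our exact code. The preference data must first be converted into prompt/chosen/rejected records, and argument names vary across TRL versions:

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Placeholders: the SFT checkpoint and a HelpSteer3-derived pairwise dataset
# with "prompt", "chosen", and "rejected" columns.
model = AutoModelForCausalLM.from_pretrained("path/to/sft-checkpoint")
tokenizer = AutoTokenizer.from_pretrained("path/to/sft-checkpoint")
dataset = load_dataset("json", data_files="helpsteer3_pairs.jsonl", split="train")

config = DPOConfig(
    output_dir="poro2-8b-dpo",
    num_train_epochs=3,
    per_device_train_batch_size=2,   # micro batch size for the 8B run
    learning_rate=5e-7,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    beta=0.01,                       # strength of the KL-style regularization
    max_length=4096,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer,      # older TRL versions use tokenizer= instead
)
trainer.train()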

Datasets available#

There are a large number of post-training datasets currently available with varying levels of quality and diversity, most of them in English. Our SFT data mixture is built around the prompts from the Tulu 3 SFT mix, which are targeted towards improving a model’s abilities on a large number of tasks. We combine prompts from this dataset with several others to create a diverse, bilingual instruction set.

SFT dataset#

We generated most of our English SFT dataset by taking the prompts from the Tulu 3 SFT mix and generating responses from Llama 3.3 70B Instruct, though we retained the original Tulu 3 conversations for the selected prompts that contained multi-turn conversations, to help support multi-turn conversational performance.

We also generated multi-turn samples by self-synthesizing prompts and their completions over multiple turns based on the techniques in the Magpie paper [Xu et al.].

For Finnish, we machine translated the prompts into Finnish and used Llama 3.3 70B Instruct to generate responses to the translated prompts. We filtered out samples where the responses are not in Finnish.

We incorporated the top-rated English and Finnish conversations from the Open Assistant 2 dataset to improve the models’ conversational skill. Open Assistant is a crowd-sourced dataset of conversations generated and annotated by volunteers [Köpf et al.]. We also added Finnish conversations from Avoin Avustaja, a crowd-sourced dataset inspired by the OpenAssistant project. Lastly, we added English-Finnish (and vice-versa) translation samples from EuroParl [Koehn].

Our final SFT data mixture contains 1.4M samples from the following datasets:

  1. English Tulu 3 prompts with Llama-3.3-70B-Instruct responses (700K)

  2. Finnish Tulu 3 prompts with Llama-3.3-70B-Instruct responses (650K)

  3. De-duplicated English Magpie multi-turn conversations (14K)

  4. Top-rated English and Finnish OASST2 data (5K)

  5. EuroParl English-Finnish (vice-versa) translation data (1K)

  6. Avoin Avustaja Finnish multi-turn conversations (100)

We generated all or part of the first three datasets ourselves, while the latter three are preexisting. We make the full combined dataset available here.

DPO dataset#

With an 8B SFT checkpoint, we experimented with preference datasets of varying sizes, among them HelpSteer2 and HelpSteer3.

The relatively small sizes of HelpSteer2 and HelpSteer3 made it feasible for us to machine-translate these datasets into Finnish and experiment with the combined English-Finnish datasets. HelpSteer3 contains multilingual samples, but we excluded the non-English data from training and translation in order to focus our efforts on English and Finnish. We found that HelpSteer3 (without Finnish translations) is on par with the bilingual version and outperformed the other DPO datasets, so we use HelpSteer3 (English only) in our subsequent DPO runs.

Post-training Summary#

Our post-training pipeline was a multi-stage process designed to transform the CPT base models into helpful assistants. We established a new suite of chat and instruction-following evaluations for this phase. The process involved full-parameter SFT on a large, custom bilingual dataset, followed by DPO with the open-license HelpSteer3 dataset to further refine the models' helpfulness and safety.

Synthetic Data Generation#

A core component of our post-training strategy was the creation of high-quality, bilingual instruction data. As off-the-shelf Finnish instruction datasets are scarce, we developed a pipeline for generating our own synthetic data. This section describes our process, from the inference framework and reward models used for selecting high-quality responses to our approach for creating multi-turn conversations.

We first translated the selected Tulu 3 prompts into Finnish, then generated multiple responses to each prompt, and finally selected the highest-quality response from the available generations.

SFT prompt selection#

We did not use the full set of Tulu 3 prompts for our SFT phase. We filtered out the subset of prompts that came from non-commercially usable datasets, removed non-English prompts, and then deduplicated the remaining prompts. These selected prompts were used as the basis for our synthetic data work.
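The deduplication method is not specified above; one straightforward possibility is exact matching on normalized prompt text, sketched below. The hashing approach is an assumption for illustration only.

```python
import hashlib

def dedup_prompts(prompts: list[str]) -> list[str]:
    """Drop exact duplicates after whitespace/case normalization.

    Exact-match hashing is an illustrative assumption; a real pipeline
    could equally use fuzzy (e.g. MinHash-based) deduplication.
    """
    seen: set[str] = set()
    kept: list[str] = []
    for prompt in prompts:
        key = hashlib.sha1(" ".join(prompt.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(prompt)
    return kept
```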

Inference Pipeline#

To run large-scale inference on Slurm with batch jobs, we utilized our internally developed dispatcher framework. Dispatcher makes it easy to horizontally scale inference workloads, handles resuming work if a job ends before the task is complete, and avoids the need to pre-partition the data for workers.
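The dispatcher itself is internal, but the general pattern it implements is a shared work queue: workers pull items until the queue is empty, so data never has to be pre-partitioned and a resubmitted job simply resumes with whatever remains. A generic sketch of that pattern, using Redis purely as an example backing store, is below; `run_inference` is a hypothetical helper.

```python
import redis

def run_inference(prompt_id: str) -> str:
    """Placeholder for the actual generation call on one prompt."""
    raise NotImplementedError

# Generic work-queue pattern (not our internal dispatcher): each worker
# pops prompt ids from a shared queue and records results; a restarted
# job resumes from whatever is still queued.
queue = redis.Redis()
while (item := queue.lpop("prompt-queue")) is not None:
    prompt_id = item.decode("utf-8")
    queue.hset("results", prompt_id, run_inference(prompt_id))
```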

When generating potential model responses, we generated up to 64 parallel responses to each prompt, then selected a high-quality response from among the possible options using a reward or judge model.
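A best-of-n loop of this kind can be sketched with vLLM; the engine choice and sampling settings below are illustrative assumptions (the post specifies only "up to 64" candidates), and `reward_score` stands in for the reward and judge models described next.

```python
from vllm import LLM, SamplingParams

def reward_score(response: str) -> float:
    """Placeholder: score one candidate with a reward or judge model."""
    raise NotImplementedError

llm = LLM(model="meta-llama/Llama-3.3-70B-Instruct")
params = SamplingParams(n=64, temperature=0.8, max_tokens=2048)  # assumed settings

prompt = "Selitä lyhyesti, miten fotosynteesi toimii."  # hypothetical Finnish prompt
candidates = [out.text for out in llm.generate([prompt], params)[0].outputs]
best_response = max(candidates, key=reward_score)
```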

Reward Models#

ArmoRM#

For English, we used ArmoRM within the dispatcher framework to select the best response for each prompt.

LLM-as-a-Judge#

There was no readily available reward model for Finnish instruction following. As a test, we tried providing Finnish data to the English-oriented ArmoRM model and confirmed that it did not improve model performance on post-training evals beyond that achieved with random selection.

Ultimately, we adapted prompts from [Yuan et al.] to rate Finnish responses separately for quality and fluency on a 5-point scale, then combined the scores to select a response that does well in each dimension.
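How the two ratings are combined is not spelled out above; the simplest combination consistent with the description is a sum (equivalently, an average) over the two dimensions, as in this sketch.

```python
def select_best_finnish(candidates: list[dict]) -> dict:
    """Pick the response that does well on both judge dimensions.

    Each candidate carries 1-5 LLM-judge ratings for quality and fluency;
    summing them is an illustrative assumption.
    """
    return max(candidates, key=lambda c: c["quality"] + c["fluency"])
```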

We had hoped that fluency ratings might be a way to spot and remove obvious dysfluencies in the Finnish synthetic post-training data, but our experiments with fluency rating yielded limited results: when human raters spot-checked fluency scores, they were unable to reliably differentiate between high- and low-rated generations. We intend to return to this problem in future work.

Multi-Turn Instruction Following#

Using only single-turn instruction data during SFT results in a model that cannot follow multi-turn conversations. To supplement the multi-turn samples from Tulu 3, we generated more multi-turn conversations in English using the Magpie method [Xu et al.]. The key insight of Magpie is that when an instruction-tuned model is given the chat template for the user side of the conversation, but without any user query, the model will generate its own prompt because of its autoregressive nature; the model can then respond to its self-generated prompt. A subsequent conversation turn can be generated by prepending the previous turn to the user template. We found, however, that the Llama models did not reliably generate well-formed conversations in this setting, and we needed a slightly more complex multi-step prompting approach to clean up the generated data. A sketch of the basic pattern is shown below.
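This minimal sketch of the single-turn seeding step is written against the Llama 3 chat template; `generate_text` is a hypothetical wrapper around whichever inference engine is in use.

```python
def generate_text(prefix: str, stop: str) -> str:
    """Hypothetical wrapper: continue `prefix` until `stop` is produced."""
    raise NotImplementedError

# Magpie-style seeding: present only the user-side header of the Llama 3
# chat template with no query; the autoregressive instruct model then
# "completes" the user turn, i.e. invents a prompt.
USER_HEADER = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
user_query = generate_text(USER_HEADER, stop="<|eot_id|>")

# The model then answers its own prompt via the assistant header.
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>\n\n"
transcript = USER_HEADER + user_query + "<|eot_id|>" + ASSISTANT_HEADER
assistant_reply = generate_text(transcript, stop="<|eot_id|>")

# For the next turn, prepend this exchange and append a fresh user header,
# letting the model invent a follow-up question, and so on.
```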

Synthetic Data Generation Summary#

Our synthetic data pipeline was a critical enabler for post-training. By using a dispatcher for large-scale inference and leveraging reward models (ArmoRM for English, LLM-as-a-Judge for Finnish), we generated the necessary SFT data. While our fluency rating experiments for Finnish yielded limited results, the overall pipeline successfully produced the data needed to align the Poro 2 models.

Final Post-Training Results and Analysis#

The culmination of our work is the performance of the final, post-trained models. This section presents the results on our chat and instruction-following benchmarks, comparing the Poro 2 models against their Llama 3.1 and 3.3 counterparts to quantify the improvements in both Finnish and English.

For post-training, we evaluate the models on IFEval, MTBench, and AlpacaEval in both English and Finnish. The charts below show the averaged performance across these benchmarks (MTBench's 0-10 scores are scaled by 10 before averaging so that all three benchmarks share a 0-100 range); a complete breakdown of the individual scores for both the 8B and 70B models is available in the Appendix.

Our final 8B models show a clear advantage in Finnish. As illustrated in the chart below, the Poro 2 8B SFT and DPO checkpoints significantly outperform Llama 3.1 8B Instruct in Finnish while effectively maintaining English capability. The averaged results show that our SFT checkpoint outperforms Llama by around 16% in Finnish, and the DPO checkpoint widens this gap even more, outperforming Llama 8B by around 24%.

[Figure: averaged post-training results for the 8B models in Finnish and English]

In a head-to-head pairwise comparison on MTBench, the final Poro 2 8B DPO checkpoint achieves an adjusted win rate of 85% against Llama 3.1 8B Instruct in Finnish and 49% in English. Moreover, Poro 2 8B DPO has an adjusted win rate of 51% over the much larger Llama 3.3 70B Instruct in Finnish.
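We do not define "adjusted" win rate explicitly in this post; a common convention in pairwise comparisons, and one reasonable reading here, is to count ties as half a win, as in this sketch.

```python
def adjusted_win_rate(wins: int, ties: int, losses: int) -> float:
    """Pairwise win rate with ties counted as half a win.

    The tie-splitting convention is an assumption -- the post does not
    define 'adjusted win rate' explicitly.
    """
    total = wins + ties + losses
    return (wins + 0.5 * ties) / total
```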

The 70B models tell a similar story, outperforming even the newer Llama 3.3 70B Instruct model in Finnish. The Poro 2 70B SFT checkpoint shows a notable improvement over Llama 3.1 70B Instruct and is on par with Llama 3.3 70B Instruct. Our final DPO checkpoint improves even further, outperforming Llama 3.3 in Finnish by over 6% and Llama 3.1 by over 11%. In English, our model is on par with Llama 3.3 70B Instruct and outperforms Llama 3.1 70B Instruct.

[Figure: averaged post-training results for the 70B models in Finnish and English]

In a pairwise comparison on MTBench, Poro 2 70B DPO has an adjusted win rate of 66% in Finnish and 57% in English over Llama 3.3 70B Instruct.

Overall, these results show that the Poro 2 models have substantially improved Finnish performance over the Llama 3 models while maintaining their English proficiency.

Summary#

In this playbook, we set out to provide a clear, replicable roadmap for adapting a powerful, English-centric LLM to a new language. We demonstrated this process by creating the Poro 2 family, a set of Llama 3.1 models with excellent Finnish capabilities, trained on AMD Instinct MI250X GPUs.

For practitioners seeking to replicate this process, this guide has delivered both the methodology and the key learnings from our journey. The main takeaways are:

  • Start with Strength: The single best predictor of success in a new language is the base model’s capability in a high-resource language like English.

  • Refine the Data Mix: A balanced data mix—in our case, 30% Finnish, 30% English, 30% code, and 10% math—was crucial for preventing catastrophic forgetting.

  • A Pragmatic Pipeline is Key: Success requires a full-stack approach, including a robust evaluation suite, a pragmatic machine translation strategy, and a multi-stage post-training process (SFT and DPO).

The release of the Poro 2 models is an important milestone, but the work doesn’t stop here. Our results open up several exciting avenues for future improvement. We aim to enhance our synthetic data pipeline, especially for multi-turn conversations and verifiable instruction-following. We are also exploring techniques to extend the tokenizer for better Finnish efficiency, experiment with annealing and mid-training data mixes, and adapt these methods for long-context models. By following the steps outlined here and building upon these future directions, developers and researchers can significantly lower the barrier to creating high-performance, multilingual AI, fostering innovation beyond the English-speaking world.

Appendix#

Complete 8B post-training results#

|                | Llama-3.1-8B-Instruct | Poro2-8B-SFT | Poro2-8B-DPO |
|----------------|-----------------------|--------------|--------------|
| IFEval_fi      | 47.31                 | 64.69        | 66.54        |
| MTBench_fi     | 4.1                   | 5.92         | 6.75         |
| AlpacaEval2_fi | 2.05                  | 16.8         | 28.89        |
| Finnish avg    | 30.12                 | 46.9         | 54.31        |
| IFEval         | 79.48                 | 79.66        | 79.29        |
| MTBench        | 7.7                   | 7.07         | 7.33         |
| AlpacaEval2    | 32.7                  | 29.67        | 35.3         |
| English avg    | 63.06                 | 60.01        | 62.63        |

Complete 70B post-training results#

|                | Llama-3.1-70B-Instruct | Llama-3.3-70B-Instruct | Poro2-70B-SFT | Poro2-70B-DPO |
|----------------|------------------------|------------------------|---------------|---------------|
| IFEval_fi      | 63.95                  | 71.71                  | 70.05         | 70.79         |
| MTBench_fi     | 7.06                   | 7.4                    | 7.2           | 7.77          |
| AlpacaEval2_fi | 21.06                  | 25.73                  | 30.74         | 41.96         |
| Finnish avg    | 51.87                  | 57.15                  | 57.6          | 63.48         |
| IFEval         | 86.69                  | 90.38                  | 89.46         | 85.95         |
| MTBench        | 8.33                   | 8.35                   | 8.03          | 8.41          |
| AlpacaEval2    | 43.87                  | 45.12                  | 43.18         | 49.77         |
| English avg    | 71.29                  | 73                     | 70.98         | 73.27         |