Skywork-Reward-V2, a newly open-sourced series of reward models, blazes a trail with an innovative human–machine approach to building preference data for the open-source community.
In a groundbreaking development, a new collaborative data pipeline has significantly improved the quality and performance of open-source reward models. This pipeline, described as "human–machine collaboration, two-stage iteration," was used to create Skywork-SynPref-40M, the largest hybrid preference dataset to date, containing 40 million preference pairs.
The pipeline begins with Stage 1: Human-Guided Small-Scale High-Quality Preference Construction. This stage starts from an unverified initial preference pool, with Large Language Models (LLMs) generating auxiliary preference-related attributes such as task type, objectivity, and controversy. Human annotators then apply a strict verification protocol, assisted by external tools and advanced LLMs, to review a subset of the data. The result is a small but high-quality "gold standard" dataset that serves as a reliable foundation for subsequent data generation and model evaluation.
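As a rough illustration, Stage 1 can be thought of as attaching LLM-generated attributes to each preference pair and routing a subset through human verification. The sketch below is only a schematic of that flow under stated assumptions: the field names and the `annotate_with_llm`, `llm.classify`, and `review_queue.verify` helpers are hypothetical, not part of the released pipeline.

```python
from dataclasses import dataclass, field

@dataclass
class PreferencePair:
    """One unverified preference sample: a prompt plus a chosen/rejected response pair."""
    prompt: str
    chosen: str
    rejected: str
    # Auxiliary attributes filled in by an LLM (e.g., task type, objectivity, controversy).
    attributes: dict = field(default_factory=dict)
    human_verified: bool = False

def annotate_with_llm(pair: PreferencePair, llm) -> PreferencePair:
    """Ask an LLM to tag the pair with preference-related attributes (hypothetical interface)."""
    pair.attributes = llm.classify(
        prompt=pair.prompt,
        responses=(pair.chosen, pair.rejected),
        labels=["task_type", "objectivity", "controversy"],
    )
    return pair

def build_gold_standard(pool, llm, review_queue):
    """Stage 1 sketch: LLM-assisted attribution, then strict human verification of a subset."""
    gold = []
    for pair in pool:
        pair = annotate_with_llm(pair, llm)
        # Only pairs that pass the human verification protocol enter the gold-standard set.
        if review_queue.verify(pair):  # human annotators, aided by tools and stronger LLMs
            pair.human_verified = True
            gold.append(pair)
    return gold
```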
Stage 2: Large-Scale Data Expansion and Iterative Optimization follows. Guided by the gold-standard preference labels, LLMs generate a large volume of high-quality "silver standard" data, substantially expanding the scale of the dataset. Multiple rounds of iteration follow: in each round, a reward model is trained and its weaknesses are identified from its performance on the gold-standard data; similar samples are then retrieved and automatically annotated via multi-model consensus, further expanding and enriching the silver-standard dataset. This iterative human–machine closed loop continuously enhances the reward model's ability to understand and discriminate nuanced human preferences.
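A minimal sketch of the Stage 2 loop is shown below, assuming hypothetical helpers for training (`train_rm`), error analysis (`rm.ranks_correctly`), retrieval (`retriever.similar`), and multi-model voting (`m.prefers_chosen`); it mirrors only the structure described above, not the released implementation.

```python
def consensus_label(pair, judge_models, threshold=0.8):
    """Keep a candidate pair only if enough judge models agree on which response is preferred."""
    votes = [m.prefers_chosen(pair) for m in judge_models]
    agreement = sum(votes) / len(votes)
    if agreement >= threshold or agreement <= 1 - threshold:
        pair.label = "chosen" if agreement >= threshold else "rejected"
        return pair
    return None  # no consensus: discard rather than add noisy silver data

def expand_silver(gold, unlabeled_pool, judge_models, retriever, train_rm, n_rounds=3):
    """Stage 2 sketch: iteratively train the reward model and grow the silver-standard set."""
    silver = []
    for _ in range(n_rounds):
        rm = train_rm(gold + silver)
        # Find gold-standard samples the current reward model still ranks incorrectly.
        hard_cases = [p for p in gold if not rm.ranks_correctly(p)]
        # Retrieve similar unlabeled samples and label them by multi-model consensus.
        for case in hard_cases:
            for candidate in retriever.similar(case, unlabeled_pool):
                labeled = consensus_label(candidate, judge_models)
                if labeled is not None:
                    silver.append(labeled)
    return silver
```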
The impact of this pipeline is profound. It addresses the fragility of reward models caused by limitations in existing datasets, such as limited coverage, mechanical labeling, and insufficient quality control. By combining strict human verification with scalable machine augmentation and iterative refinement, the pipeline ensures large-scale data that is both diverse and of high quality. Consequently, reward models trained on Skywork-SynPref-40M show improved robustness and effectiveness in modeling complex, nuanced human preferences, especially in open-ended and subjective tasks.
The Skywork-Reward-V2 series, consisting of eight reward models built on the Qwen3 and LLaMA3 series with parameter counts ranging from 600 million to 8 billion, has benefited greatly from this pipeline. The smallest model, Skywork-Reward-V2-Qwen3-0.6B, nearly matches the average overall performance of the previous generation's strongest model. Remarkably, an 8B-scale model trained on only 1.8% of the high-quality data in Skywork-SynPref-40M already exceeds the performance of current 70B-level SOTA reward models.
The Skywork-Reward-V2 series has achieved top rankings across seven major mainstream reward model evaluation benchmarks. The technical report for Skywork-Reward-V2 is available at arxiv.org/abs/2507.01352.
Download links for Skywork-Reward-V2 are available on HuggingFace and GitHub.
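For readers who want to try the models, the snippet below shows one plausible way to score a chat response with a Skywork-Reward-V2 checkpoint through Hugging Face transformers. The repository id, chat-template usage, and sequence-classification head are assumptions rather than confirmed details; consult the official model card on HuggingFace for the supported usage.

```python
# Hedged example: scoring a response with a sequence-classification reward head.
# The repo id and chat-template usage are assumptions; follow the official model card.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Skywork/Skywork-Reward-V2-Qwen3-0.6B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torch_dtype=torch.bfloat16)

conversation = [
    {"role": "user", "content": "Explain why the sky is blue."},
    {"role": "assistant", "content": "Sunlight scatters off air molecules; shorter blue wavelengths scatter the most."},
]
# Assumes the tokenizer ships a chat template, as is common for chat-tuned reward models.
inputs = tokenizer.apply_chat_template(conversation, tokenize=True, return_tensors="pt")
with torch.no_grad():
    score = model(inputs).logits[0].item()  # higher score = more preferred response
print(f"reward score: {score:.3f}")
```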
Cloud-based data infrastructure was used to store and process the large-scale preference data generated by the pipeline, keeping the two-stage process efficient at the scale of tens of millions of samples. It is this combination of AI-driven annotation and human verification that ultimately yields reward models robust and effective enough to capture complex human preferences.