With the number of new subnets being added, it can be hard to keep information up to date across all subnets, so data may be slightly out of date from time to time.

Subnet 38

Distributed Training

Emissions: Value
Recycled: Value
Recycled (24h): Value
Registration Cost: Value
Active Validators: Value
Active Miners: Value
Active Dual Miners/Validators: Value

ABOUT

What exactly does it do?

This subnet uses a distributed approach to train Large Language Models on web-based datasets. The team's proposed solution is a subnet that incentivizes compute, bandwidth, and latency: compute resources drive the training of each miner's local model, while bandwidth and latency facilitate the averaging of local model weights using a process called butterfly all-reduce. Once this process completes, every miner receives a unified global averaged gradient to update its model weights.

PURPOSE

What exactly is the 'product/build'?

Training Process:
Miners train the collective model on specific dataset segments. The training is iterative, with both local and global tracking of epochs and steps. Miners perform local training on their assigned data and participate in gradient averaging using the butterfly all-reduce method.
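A minimal, single-process sketch of the butterfly all-reduce idea described above. The peer count, gradient shapes, and chunking scheme here are illustrative assumptions; in the subnet this exchange happens over the network between miners:

```python
# Toy, single-process simulation of butterfly all-reduce gradient averaging.
# Illustrative only: the real subnet exchanges chunks between remote miners.
import numpy as np

def butterfly_all_reduce(local_grads: list[np.ndarray]) -> list[np.ndarray]:
    """Each peer contributes its local gradient and receives the global average."""
    n_peers = len(local_grads)
    # Reduce-scatter: peer i averages the i-th chunk collected from every peer.
    chunks = [np.array_split(g, n_peers) for g in local_grads]
    averaged = [np.mean([chunks[p][i] for p in range(n_peers)], axis=0)
                for i in range(n_peers)]
    # All-gather: every peer reassembles the full averaged gradient.
    global_avg = np.concatenate(averaged)
    return [global_avg.copy() for _ in range(n_peers)]

# Example: 4 miners, each holding a local gradient of 8 parameters.
grads = [np.random.randn(8) for _ in range(4)]
result = butterfly_all_reduce(grads)
assert np.allclose(result[0], np.mean(grads, axis=0))
```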

Dataset:

The subnet utilizes the “HuggingFaceFW/fineweb” dataset with the “sample-350BT” configuration.
Data is streamed in real-time from Hugging Face servers for efficient large-scale data handling.
Text is tokenized with the GPT-2 tokenizer (“distilgpt2”).
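A hedged sketch of that dataset pipeline using the Hugging Face datasets and transformers libraries; the text field, sequence length, and number of streamed records are assumptions rather than the subnet's exact miner settings:

```python
# Stream the fineweb dataset and tokenize with the GPT-2 ("distilgpt2") tokenizer.
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("HuggingFaceFW/fineweb", name="sample-350BT",
                       split="train", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

for example in dataset.take(2):  # stream a couple of records from the HF servers
    tokens = tokenizer(example["text"], truncation=True, max_length=1024)
    print(len(tokens["input_ids"]))
```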

Model Submission:

After each gradient averaging step, miners push the updated model to the Hugging Face Hub.
The model is tagged with the current epoch number.
In case of upload failure, the system retries within a set limit.
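A hedged sketch of the submission-and-retry flow; the repository id, tag format, and retry limit below are illustrative assumptions, not the subnet's actual values:

```python
# Push the updated model to the Hugging Face Hub, tag it with the current
# epoch, and retry a limited number of times on failure.
# Requires a Hugging Face write token configured in the environment.
from transformers import AutoModelForCausalLM
from huggingface_hub import HfApi

def submit_model(model, repo_id: str, epoch: int, max_retries: int = 3) -> bool:
    api = HfApi()
    for attempt in range(1, max_retries + 1):
        try:
            model.push_to_hub(repo_id, commit_message=f"epoch {epoch}")
            api.create_tag(repo_id, tag=f"epoch-{epoch}")  # tag with epoch number
            return True
        except Exception as err:
            print(f"upload attempt {attempt}/{max_retries} failed: {err}")
    return False

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
submit_model(model, "my-org/my-miner-model", epoch=7)  # hypothetical repo id
```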

Validation:

Validators perform two main queries: “Train” and “AllReduce.”
For “Train” queries, validators check miners’ loss, gradients, and dataset indices.
For “AllReduce” queries, they initiate gradient averaging and verify miner participation.
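An illustrative sketch of what a "Train" query check could look like: the validator re-runs the batch referenced by the miner's reported dataset indices and compares the resulting loss and gradient norm against the reported values. Function names, tolerances, and fields are hypothetical, not the subnet's actual API:

```python
# Hypothetical validator-side check for a "Train" query response.
import torch

def verify_train_response(model, batch, reported_loss, reported_grad_norm,
                          loss_tol=0.05, grad_tol=0.05):
    # `batch` is rebuilt from the miner's reported dataset indices.
    model.zero_grad()
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    grad_norm = torch.norm(torch.stack(
        [p.grad.norm() for p in model.parameters() if p.grad is not None]))
    loss_ok = abs(outputs.loss.item() - reported_loss) <= loss_tol * reported_loss
    grad_ok = abs(grad_norm.item() - reported_grad_norm) <= grad_tol * reported_grad_norm
    return loss_ok and grad_ok
```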

Incentive Mechanism:

Bandwidth Score: Measures miners’ efficiency in sharing model states.
Gradient Score: Compares miner-reported gradients to validator-calculated gradients.
Steps Score: Rewards miners based on the volume of data trained in each step.
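A hedged sketch of how the three scores might combine into a single miner reward; the weights and normalisation below are illustrative assumptions, and the actual weighting is defined in the subnet's validator code:

```python
# Hypothetical weighted combination of the bandwidth, gradient, and steps scores.
def miner_reward(bandwidth_score: float, gradient_score: float,
                 steps_score: float,
                 w_bandwidth: float = 0.3, w_gradient: float = 0.4,
                 w_steps: float = 0.3) -> float:
    """Weighted sum of the three scores, each assumed to lie in [0, 1]."""
    return (w_bandwidth * bandwidth_score
            + w_gradient * gradient_score
            + w_steps * steps_score)

print(miner_reward(0.9, 0.8, 0.7))  # 0.8
```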

WHO

Team Info

Karim Foda

Mikkel Loose

FUTURE

Roadmap

  • Test stability of the network when training [distributed/gpt2-250m](https://huggingface.co/distributed/gpt2-250m)
  • Reproduce the [LAMB](https://arxiv.org/pdf/1904.00962v5) paper to prove that convergence can be achieved using batch sizes of 32k/64k when training on the [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) dataset
  • Replace the WandB Progress Tracker with a chain-based Progress Tracker
  • Enhance the all-reduce validation mechanism to reward miners based on how much they’ve contributed to the all-reduce process
  • Repeat step 1 for gpt2-500m
  • Repeat step 1 for gpt2-1b
  • Implement 1.5-bit quantisation techniques to pave the way for training 7B+ models
  • Investigate whether Optimiser Parameter Offloading & Delayed Parameter Updates can enable training of larger models

NEWS

Announcements

MORE INFO

Useful Links