With the number of new subnets being added, it can be hard to keep information up to date across all of them, so some data may be slightly out of date from time to time.
This subnet uses a distributed approach to train Large Language Models on web-based datasets. Their proposed solution is a subnet that incentivizes compute, bandwidth, and latency. Compute resources drive the training of each miner’s local model, while bandwidth and latency facilitate the averaging of local model weights using a process called butterfly all-reduce. Once this process is completed, every miner receives a unified global averaged gradient to update their model weights.
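As a rough sketch of that averaging step (not the subnet's actual implementation; this uses torch.distributed, whose backends typically run ring or tree all-reduce rather than butterfly all-reduce, but the averaging semantics are the same):

```python
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average each parameter's gradient across all participating workers."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is None:
            continue
        # Sum every worker's gradient, then divide to obtain the global mean.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
        param.grad /= world_size

# Typical use inside a training step (process group already initialised):
#   loss.backward()            # compute local gradients
#   average_gradients(model)   # every worker now holds the averaged gradient
#   optimizer.step()           # apply it to the local copy of the model
```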
Training Process:
Miners train the collective model on specific dataset segments. The training is iterative, with both local and global tracking of epochs and steps. Miners perform local training on their assigned data and participate in gradient averaging using the butterfly all-reduce method.
Dataset:
The subnet utilizes the “HuggingFaceFW/fineweb” dataset with the “sample-350BT” configuration.
Data is streamed in real-time from Hugging Face servers for efficient large-scale data handling.
Text is tokenized with the GPT-2 tokenizer (“distilgpt2”).
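A minimal sketch of that pipeline with the datasets and transformers libraries (the sequence length and the use of .take() are illustrative choices, not the subnet's exact code):

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Stream the fineweb sample split instead of downloading it in full.
dataset = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-350BT",
    split="train",
    streaming=True,
)

# GPT-2 byte-pair tokenizer, as shipped with distilgpt2.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

for example in dataset.take(4):
    tokens = tokenizer(example["text"], truncation=True, max_length=1024)
    print(len(tokens["input_ids"]))
```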
Model Submission:
After each gradient averaging step, miners push the updated model to the Hugging Face Hub.
The model is tagged with the current epoch number.
In case of upload failure, the system retries within a set limit.
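A minimal sketch of an upload-with-retry step using transformers' push_to_hub (the repo id, the use of the commit message to record the epoch, the retry limit, and the backoff are all assumptions for illustration):

```python
import time
from transformers import PreTrainedModel

MAX_RETRIES = 3  # assumed retry limit

def push_checkpoint(model: PreTrainedModel, repo_id: str, epoch: int) -> bool:
    """Push the updated model to the Hugging Face Hub, recording the epoch in the commit."""
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            model.push_to_hub(repo_id, commit_message=f"epoch-{epoch}")
            return True
        except Exception as err:  # network or Hub errors
            print(f"Upload attempt {attempt} failed: {err}")
            time.sleep(2 ** attempt)  # simple exponential backoff before retrying
    return False
```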
Validation:
Validators perform two main queries: “Train” and “AllReduce.”
For “Train” queries, validators check miners’ loss, gradients, and dataset indices.
For “AllReduce” queries, they initiate gradient averaging and verify miner participation.
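For illustration, a minimal sketch of what the "Train" check could look like (batch handling, field names, and the tolerance are assumptions; the real validator logic is more involved, and the gradient and dataset-index checks would follow the same recompute-and-compare pattern):

```python
import torch

def verify_train_response(model, batch, reported_loss: float, tol: float = 1e-2) -> bool:
    """Recompute the loss on the miner's reported dataset indices and compare it
    with the loss the miner claimed."""
    model.eval()
    with torch.no_grad():
        outputs = model(**batch, labels=batch["input_ids"])
    recomputed = outputs.loss.item()
    # Accept the miner's claim only if it matches the validator's own computation.
    return abs(recomputed - reported_loss) <= tol
```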
Incentive Mechanism:
Bandwidth Score: Measures miners’ efficiency in sharing model states.
Gradient Score: Compares miner-reported gradients to validator-calculated gradients.
Steps Score: Rewards miners based on the volume of data trained in each step.
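The exact weighting of these scores is not spelled out above, but conceptually the final reward combines them, roughly along these lines (the weights and the assumption that each sub-score is normalized to [0, 1] are purely illustrative):

```python
def miner_score(bandwidth: float, gradient: float, steps: float,
                w_bw: float = 0.3, w_grad: float = 0.4, w_steps: float = 0.3) -> float:
    """Combine the three sub-scores (each assumed normalized to [0, 1]) into one reward weight."""
    return w_bw * bandwidth + w_grad * gradient + w_steps * steps
```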
Starting the year incredibly grateful to our Open Source contributors.
Over the holidays, while working on a PR to migrate our Mechanism 0 DataLoader to R2, @jorritvangils spotted a critical bug in our miner code. He quickly merged a fix (PR #87: "Fix dataloader and blocklist block mismatch" in dstrbtd/Distrib...). As the PR describes, self.current_block was being updated continuously, causing a mismatch between the value passed to DatasetLoader.next_pag... and the blocklist block.
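As a purely hypothetical illustration of the pattern such a fix follows (none of these names or signatures are the actual repo code): read the block height once per step and reuse that snapshot everywhere, rather than re-reading a value that keeps moving in the background.

```python
import itertools

# Stand-ins for illustration only; none of these are the real objects.
live_block = itertools.count(1000)        # the "continuously updated" chain height
blocklist = {1001, 1003}

def next_pages(block: int) -> str:
    """Hypothetical stand-in for the dataloader's page-fetching call."""
    return f"pages@block{block}"

def training_step() -> str | None:
    block = next(live_block)              # snapshot the block ONCE per step
    if block in blocklist:                # the blocklist check uses this snapshot...
        return None
    return next_pages(block)              # ...and so does the page fetch: no mismatch

print(training_step())
```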
Last Friday, we launched Mechanism 1 on Subnet 38's main-net! 🚀
Mechanism 1 is a winner-takes-all mechanism that aims to incentivise miners to develop SOTA distributed training strategies (see the "Aggregation" row in the heat-map in the attached post).
These optimised

Decentralized pre-training has accelerated rapidly over the past year, with multiple teams running public experiments each taking a different approach to the same problem.
Here is a high-level comparison across sharding strategy, permissions, model scale, aggregation, and
It's worth noting that there are also excellent teams like Nous Research, Grail and Gensyn working on decentralized post-training.
This thread focuses specifically on decentralized pre-training, where the size and type of information being shared are quite different. Both
If we’ve missed out any other public decentralized pre-training efforts, we’d love for people to share them with us.
Especially interested in protocols exploring novel aggregation techniques, compression algorithms or incentive mechanisms.
A question we often get from members of our community is: "in layman's terms what is DSTRBTD's long term vision?"
Put simply, it's building community-owned artificial intelligence.
Right now, the world’s most powerful AI is owned and controlled by a small number of large
DSTRBTD’s Run 4 is our most stable attempt to date at training a 4B parameter model in a fully permission-less, trust-less and decentralised setting: https://dash.dstrbtd.ai/performance.
Over the past week, we’ve seen an average of 10 participants per AllReduce (the process of sharing
DSTRBTD's Mechanism 1 is now producing reproducible benchmarks for distributed training optimizers.
Each optimizer is evaluated in a sandbox environment that trains NanoGPT variants for 10k steps. We record:
• Final Loss
• Communication Volume
• Throughput
These metrics are
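As a sketch of how such per-run records could be represented and ranked (the field names and the ranking rule are assumptions for illustration, not the actual benchmark schema):

```python
from dataclasses import dataclass

@dataclass
class OptimizerBenchmark:
    """Metrics recorded for one sandboxed optimizer run on a NanoGPT variant."""
    optimizer_name: str
    steps: int                        # fixed training budget, e.g. 10,000 steps
    final_loss: float                 # loss at the end of the run
    communication_volume_bytes: int   # total bytes exchanged between workers
    throughput_tokens_per_sec: float  # how fast the run processed data

def rank_by_loss(runs: list[OptimizerBenchmark]) -> list[OptimizerBenchmark]:
    """Lower final loss first; a winner-takes-all mechanism would reward runs[0]."""
    return sorted(runs, key=lambda r: r.final_loss)
```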