Foresight Ventures: Taking a Rational View of Decentralized Computing Power Networks

Original author: Ian Xu

TL;DR

  • Currently, AI and Crypto intersect mainly in two big directions: distributed computing power and ZKML. For more on ZKML, please refer to one of my previous articles. This article focuses on decentralized distributed computing power networks.

  • With the development of large AI models, computing power will be the battlefield of the coming decade and the most important resource of future human society. It will not be limited to commercial competition but will also become a strategic resource in great-power games. Investment in high-performance computing infrastructure and computing power reserves will grow exponentially.

  • The demand for decentralized distributed computing power networks is greatest in large-model training, but this is also where the challenges and technical bottlenecks are greatest, including complex data synchronization and network optimization problems. In addition, data privacy and security are important limiting factors. Although some existing techniques offer preliminary solutions, their enormous computational and communication costs make them impractical for large-scale distributed training tasks.

  • Decentralized distributed computing power networks have a better chance of landing in model inference, and the projected incremental market is large enough. However, inference also faces challenges such as communication latency, data privacy, and model security. Compared with training, inference has lower computational complexity and less data interaction, making it better suited to distributed environments.

  • Through the cases of two start-ups, Together and Gensyn.ai, this article explains the overall research directions and concrete approaches of decentralized distributed computing power networks from the perspectives of technical optimization and incentive-layer design.

I. Distributed Computing Power—Large Model Training

When we discuss applying distributed computing power to training, we generally focus on training large language models. The main reason is that training small models does not require much computing power; it is not cost-effective to go distributed and deal with data privacy plus a pile of engineering problems when the job can simply be done centrally. Large language models, by contrast, have enormous computing power requirements and are now in the initial stage of an explosion. From 2012 to 2018, the compute required for AI training doubled roughly every 3-4 months. Large models are now the focal point of computing power demand, and it is predictable that they will still represent huge incremental demand over the next 5-8 years.

While the opportunities are huge, the problems must also be seen clearly. Everyone knows the market is big, but what are the specific challenges? The core criterion for judging excellent projects in this space is whether they target these problems rather than blindly rushing into the market.

(NVIDIA NeMo Megatron Framework)

1. Overall Training Process

Take training a large model with 175 billion parameters as an example. Because the model is so large, it must be trained in parallel on many GPU devices. Assume a centralized machine room with 100 GPUs, each with 32 GB of memory.

  • Data Preparation : First, a huge dataset is required, which contains various data such as Internet information, news, books, etc. Before training, these data need to be preprocessed, including text cleaning, tokenization, vocabulary construction, etc.

  • Data Segmentation : The processed data will be divided into multiple batches for parallel processing on multiple GPUs. Assuming that the batch size selected is 512, that is, each batch contains 512 text sequences. Then, we divide the entire dataset into multiple batches to form a batch queue.

  • Inter-device Data Transmission : At the beginning of each training step, the CPU takes out a batch of data from the batch queue and then sends the data of this batch to the GPU through the PCIe bus. Assuming that the average length of each text sequence is 1024 tokens, the data size of each batch is about 512 * 1024 * 4B = 2MB (assuming each token is represented by a 4-byte single-precision floating-point number). This data transmission process usually takes only a few milliseconds.

  • Parallel Training : After each GPU device receives the data, it starts to perform forward propagation and backward propagation calculations, calculating the gradient of each parameter. Because the size of the model is very large, the memory of a single GPU cannot store all parameters, so we use model parallelism to distribute model parameters to multiple GPUs.

  • Gradient Aggregation and Parameter Update : After the backward propagation calculation is completed, each GPU has obtained the gradient of a part of the parameters. Then, these gradients need to be aggregated between all GPU devices to calculate the global gradient. This requires data transmission through the network. Assuming a network speed of 25Gbps, it takes about 224 seconds to transmit 700GB of data (assuming each parameter uses single-precision floating-point numbers, 175 billion parameters are about 700GB). Then, each GPU updates its stored parameters based on the global gradient.

  • Synchronization : After the parameter update, all GPU devices need to be synchronized to ensure that they use consistent model parameters for the next training step. This also requires data transmission through the network.

  • Repeat Training Steps: Repeat the above steps until all batches of training are completed or the predetermined number of training rounds (epoch) is reached.

This process involves a large amount of data transmission and synchronization, which can become a bottleneck for training efficiency. Therefore, optimizing network bandwidth and latency, as well as using efficient parallel and synchronization strategies, are essential for large-scale model training.
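
To make the communication pattern concrete, here is a minimal sketch of one data-parallel training step. It is illustrative only: it assumes a PyTorch environment with torch.distributed already initialized, and `model`, `optimizer`, and `get_batch` are placeholders rather than anything from a specific framework.

```python
# Minimal sketch of one data-parallel training step (illustrative only).
# Assumes torch.distributed is initialized; `model`, `optimizer`, and
# `get_batch` are placeholders, not a specific framework's API.
import torch
import torch.distributed as dist

def train_step(model, optimizer, get_batch, world_size):
    inputs, labels = get_batch()                  # one ~2 MB batch from the batch queue
    inputs, labels = inputs.cuda(), labels.cuda() # CPU -> GPU over PCIe, a few ms

    loss = model(inputs, labels)                  # forward propagation
    loss.backward()                               # backward propagation: local gradients

    # Gradient aggregation: sum every gradient across all workers. For a
    # 175B-parameter model this step moves hundreds of GB per iteration
    # and dominates the communication cost discussed above.
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    optimizer.step()                              # update with the global gradient
    optimizer.zero_grad()
    return loss.item()
```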

2. Bottlenecks in Communication Overhead

It should be noted that communication overhead is also a reason why current distributed computing networks cannot handle large language model training.

The various nodes need to exchange information frequently to work together, which creates communication overhead. For large language models, this problem is particularly severe due to the huge number of model parameters. Communication overhead is divided into several aspects:

  • Data transmission : During training, nodes need to exchange model parameters and gradient information frequently. This requires a large amount of data to be transmitted in the network, consuming a lot of network bandwidth. If the network conditions are poor or the distance between computing nodes is large, the latency of data transmission will be high, further increasing communication overhead.

  • Synchronization issues : During training, nodes need to work together to ensure correct training. This requires frequent synchronization operations between nodes, such as updating model parameters and calculating global gradients. These synchronization operations require a large amount of data to be transmitted in the network and require all nodes to complete the operation, which leads to a lot of communication overhead and waiting time.

  • Gradient accumulation and update : During training, each node needs to calculate its own gradients and send them to other nodes for accumulation and update. This requires a large amount of gradient data to be transmitted over the network and requires all nodes to finish computing and transmitting their gradients, which again creates a large amount of communication overhead.

  • Data consistency : It is necessary to ensure that the model parameters of each node remain consistent. This requires frequent data verification and synchronization operations between nodes, which leads to a lot of communication overhead.

Although there are methods to reduce communication overhead, such as compressing parameters and gradients and using efficient parallel strategies, these methods may introduce additional computational burden or negatively affect the model's training quality. Moreover, they cannot completely solve the communication overhead problem, especially when network conditions are poor or computing nodes are far apart.
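
As a concrete example of the compression idea mentioned above, here is a sketch of top-k gradient sparsification, one common technique: only the largest-magnitude gradient entries are transmitted. This is a generic illustration, not the method of any particular system, and it omits the error-feedback bookkeeping real implementations need.

```python
# Top-k gradient sparsification sketch (generic illustration only).
import torch

def topk_compress(grad: torch.Tensor, ratio: float = 0.01):
    """Keep only the k largest-magnitude entries (~1% of the data to send)."""
    flat = grad.flatten()
    k = max(1, int(flat.numel() * ratio))
    _, indices = torch.topk(flat.abs(), k)
    return flat[indices], indices

def topk_decompress(values, indices, shape, dtype=torch.float32):
    """Rebuild a dense gradient with zeros everywhere except the sent entries."""
    flat = torch.zeros(shape, dtype=dtype).flatten()
    flat[indices] = values
    return flat.reshape(shape)

# Example: a 10M-entry gradient shrinks to ~1% of its entries before transmission.
g = torch.randn(1000, 10_000)
vals, idx = topk_compress(g)
g_hat = topk_decompress(vals, idx, g.shape)
```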

Example:

Decentralized Distributed Computing Network

The GPT-3 model has 175 billion parameters. If we use single-precision floating-point numbers (4 bytes per parameter) to represent these parameters, storing them would require about 700 GB of memory. In distributed training, these parameters need to be frequently transmitted and updated between various computing nodes.

Suppose there are 100 computing nodes, and each node needs to update all parameters in each step. Then each step needs to transmit about 70 TB of data (700 GB × 100). If we assume that each step takes 1 second (a very optimistic assumption), then 70 TB of data needs to be transmitted every second. This bandwidth requirement already exceeds most networks and becomes a question of feasibility.

In reality, due to communication delays and network congestion, data transmission time may far exceed 1 second. This means that computing nodes may spend most of their time waiting for data transmission rather than actually computing. This drastically reduces training efficiency, and it is not the kind of slowdown that can be fixed by simply waiting longer; it is the difference between feasible and infeasible, and it renders the entire training process impractical.

Centralized Data Centers

Even in centralized data center environments, training of large models still requires heavy communication optimization.

In a centralized data center environment, high-performance computing devices form clusters and share computing tasks through high-speed networks. However, even in this high-speed network environment, communication overhead is still a bottleneck when training models with a large number of parameters, because model parameters and gradients need to be frequently transmitted and updated between computing devices.

As mentioned earlier, suppose there are 100 computing nodes and each server has a network bandwidth of 25 Gbps. If each server needs to update all parameters in each training step, each training step requires transmitting about 700 GB of data, which takes about 224 seconds. With the advantages of a centralized data center, developers can optimize the network topology within the data center and use techniques such as model parallelism to significantly reduce this time.

By comparison, if the same training is performed in a distributed environment, assuming there are still 100 computing nodes distributed around the world, the average network bandwidth per node is only 1 Gbps. In this case, transmitting the same 700 GB of data requires about 5600 seconds, much longer than in a centralized data center. Moreover, due to network latency and congestion, the actual time required may be even longer.
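
The transfer-time figures above follow from simple arithmetic; here is a quick back-of-the-envelope check, assuming full parameter synchronization each step as in the example:

```python
# Back-of-the-envelope reproduction of the numbers above.
PARAMS = 175e9            # GPT-3-scale parameter count
BYTES_PER_PARAM = 4       # single-precision float
payload_gb = PARAMS * BYTES_PER_PARAM / 1e9        # ~700 GB per full synchronization

def transfer_seconds(payload_gb: float, bandwidth_gbps: float) -> float:
    return payload_gb * 8 / bandwidth_gbps          # GB -> gigabits, then divide by Gbps

print(transfer_seconds(payload_gb, 25))   # centralized cluster at 25 Gbps  -> ~224 s
print(transfer_seconds(payload_gb, 1))    # distributed nodes at 1 Gbps     -> ~5600 s
```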

However, compared to the situation in a distributed computing network, optimizing communication overhead in a centralized data center environment is relatively easy. This is because in a centralized data center environment, computing devices are typically connected to the same high-speed network, which has relatively good bandwidth and latency. In contrast, in a distributed computing network, computing nodes may be distributed around the world, and network conditions may be relatively poor, making communication overhead a more serious problem.

Training at GPT-3 scale relies on model-parallelism frameworks such as Megatron to tackle the communication overhead problem. Megatron partitions the model's parameters across multiple GPUs, with each device responsible for storing and updating only a portion of them, reducing the volume of parameters each device has to handle and thus the communication overhead. At the same time, such training uses high-speed interconnect networks, and the network topology is optimized to shorten communication paths.
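
To illustrate the core idea behind this kind of model parallelism, here is a minimal NumPy sketch of a column-parallel linear layer: the weight matrix is split across devices, each device computes a slice of the output, and the slices are gathered. This is only a toy illustration of the principle, not Megatron's actual implementation.

```python
# Toy sketch of tensor (model) parallelism: split a weight matrix column-wise
# across "devices" and gather the partial outputs. Not Megatron's real code.
import numpy as np

hidden, ffn, n_devices = 1024, 4096, 4
x = np.random.randn(8, hidden).astype(np.float32)      # one micro-batch of activations
W = np.random.randn(hidden, ffn).astype(np.float32)    # full weight matrix

shards = np.split(W, n_devices, axis=1)                 # each device stores ffn/4 columns
partials = [x @ shard for shard in shards]              # computed in parallel on each device
y = np.concatenate(partials, axis=1)                    # the all-gather communication step

assert np.allclose(y, x @ W, atol=1e-3)                 # matches the unsharded layer
```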

(Data used to train LLMs)

3. Why can’t distributed computing networks do this optimization?

It is possible, but the effects of these optimizations are limited compared to centralized data centers.

  1. Network topology optimization: In a centralized data center, the network hardware and layout can be directly controlled, so the network topology can be designed and optimized as needed. In a distributed environment, however, computing nodes sit in different geographical locations (one may be in China and another in the United States), and the network connections between them cannot be directly controlled. Although data transmission paths can be optimized in software, this is less effective than optimizing the physical network. At the same time, differences in geographic location mean that network latency and bandwidth vary greatly, further limiting the effectiveness of topology optimization.

  2. Model parallelism: Model parallelism is a technique that partitions a model’s parameters across multiple computing nodes and improves training speed through parallel processing. However, this method often requires frequent data transfer between nodes, so it has high requirements for network bandwidth and latency. In a centralized data center, model parallelism can be very effective due to high network bandwidth and low latency. However, in a distributed environment, model parallelism is greatly limited due to poor network conditions.

4. Challenges of Data Security and Privacy

Almost every stage that involves data processing and transmission can affect data security and privacy, including:

  1. Data allocation: Training data needs to be allocated to various nodes participating in the computation. During this process, data may be maliciously used/leaked on distributed nodes.

  2. Model training: During training, each node uses its allocated data for computation and outputs updated model parameters or gradients. During this process, if the node’s computation process is stolen or results are maliciously parsed, data may also be leaked.

  3. Parameter and gradient aggregation: The output of each node needs to be aggregated to update the global model, and communication during the aggregation process may also leak information about the training data.

What are the solutions to data privacy issues?

  • Secure multi-party computation (SMC): SMC has been successfully applied in some specific and smaller-scale computing tasks. However, due to its large computational and communication overhead, it is not widely used in large-scale distributed training tasks.

  • Differential privacy (DP): DP is applied in some data collection and analysis tasks, such as Chrome's user statistics. However, in large-scale deep learning tasks DP affects model accuracy, and designing appropriate noise generation and addition mechanisms is itself a challenge (a minimal sketch of the idea follows this list).

  • Federated learning (FL): It is applied in some model training tasks of edge devices, such as vocabulary prediction of Android keyboard. However, FL faces problems such as large communication overhead and complex coordination in larger-scale distributed training tasks.

  • Fully homomorphic encryption (FHE): It has been successfully applied in some tasks with small computational complexity. However, due to its large computational overhead in large-scale distributed training tasks, it is not widely used at present.
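
To make the differential-privacy option above more concrete, here is a minimal sketch of the DP-SGD idea: clip each per-example gradient and add Gaussian noise before the update. It is purely illustrative; real systems (for example the Opacus library) handle per-sample gradient computation, privacy accounting, and hyperparameter choices far more carefully.

```python
# DP-SGD idea in miniature (illustrative only): clip per-example gradients,
# then add calibrated Gaussian noise before averaging.
import torch

def dp_noisy_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.1):
    clipped = []
    for g in per_example_grads:                          # one gradient tensor per example
        scale = torch.clamp(clip_norm / (g.norm() + 1e-12), max=1.0)
        clipped.append(g * scale)                        # bound each example's influence
    summed = torch.stack(clipped).sum(dim=0)
    noise = torch.randn_like(summed) * noise_multiplier * clip_norm
    return (summed + noise) / len(per_example_grads)     # noisy average gradient
```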

Summary

Each method mentioned above has its own applicable scenarios and limitations, and no method can completely solve data privacy issues in large-scale model training on distributed computing networks.

Can ZK, which is highly anticipated, solve the data privacy issues in large-scale model training?

In theory, ZKP can be used to ensure data privacy in distributed computing, allowing a node to prove that it has performed calculations according to the rules, but without revealing the actual input and output data.

However, in practice, using ZKP in the scenario of training large models with a large-scale distributed computing network faces the following bottlenecks:

  • Computational and communication overhead : Constructing and verifying zero-knowledge proofs requires a lot of computing resources. In addition, the communication overhead of ZKP is also significant, as it requires the transmission of the proof itself. In the case of training large models, these overheads may become particularly significant. For example, if a proof needs to be generated for every small batch of calculations, this will significantly increase the overall time and cost of training.

  • Complexity of ZK protocols: Designing and implementing a ZKP protocol suitable for large model training is very complex. The protocol must handle large amounts of data and complex calculations, and must be able to deal with possible errors.

  • Hardware and software compatibility : Using ZKP requires specific hardware and software support, which may not be available on all distributed computing devices.

In summary

Using ZKP to train large models on a large-scale distributed computing power network will still require years of research and development, and more academic resources need to be devoted to this direction.

II. Distributed Computing—Model Inference

Another major scenario of distributed computing is model inference. According to our judgment of the development path of large models, the demand for model training will gradually slow down after passing a high point, while the demand for model inference will correspondingly increase exponentially with the maturity of large models and AIGC.

Inference tasks usually have lower computational complexity and less data interaction compared to training tasks, making them more suitable for distributed environments.

(Power LLM inference with NVIDIA Triton)

1. Challenges

Communication delay :

Communication between nodes is essential in distributed environments. In a decentralized distributed computing network, nodes may be spread across the globe, making network latency an issue, especially for real-time inference tasks.

Model Deployment and Updating:

Models need to be deployed to each node. If a model is updated, each node needs to update its model, which requires a lot of network bandwidth and time.

Data Privacy:

Although inference tasks usually only require input data and a model and do not need to return large amounts of intermediate data and parameters, input data may still contain sensitive information, such as users’ personal information.

Model Security:

In a decentralized network, models must be deployed to untrusted nodes, which can lead to model leakage, intellectual property problems, and abuse. It can also create security and privacy issues: if a model is used to process sensitive data, nodes may infer sensitive information by analyzing the model's behavior.

Quality Control:

Each node in a decentralized distributed computing network may have different computing capabilities and resources, which can make it difficult to ensure the performance and quality of inference tasks.

2. Feasibility

Computational Complexity:

During the training phase, the model needs to iterate repeatedly, computing forward and backward propagation for each layer, including activation functions, loss functions, gradients, and weight updates. Therefore, the computational complexity of model training is high.

During the inference phase, only one forward propagation is needed to compute the predicted result. For example, in GPT-3, the input text needs to be converted into a vector, and then forward propagation is performed through each layer of the model (usually Transformer layers), and finally, the output probability distribution is obtained, and the next word is generated based on this distribution. In GANs, the model needs to generate an image based on the input noise vector. These operations only involve forward propagation of the model and do not require the calculation of gradients or the updating of parameters, so the computational complexity is low.
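
A minimal sketch of what this looks like for a language model: each new token requires only a forward pass over the current context, with no gradients, no parameter updates, and no optimizer state to exchange. `model` and `tokenizer` are placeholders for any Hugging-Face-style causal LM, not GPT-3 specifically.

```python
# Greedy autoregressive decoding sketch: forward passes only, no backward pass.
import torch

@torch.no_grad()
def greedy_generate(model, tokenizer, prompt, max_new_tokens=50):
    ids = tokenizer.encode(prompt, return_tensors="pt")
    for _ in range(max_new_tokens):
        logits = model(ids).logits                        # one forward pass per new token
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)           # append and continue
    return tokenizer.decode(ids[0])
```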

Data Interactivity:

During the inference phase, the model usually processes a single input, rather than a large batch of data as in training. The result of each inference also depends only on the current input, not on other inputs or outputs, so there is no need for a lot of data interaction, and the communication pressure is smaller.

Taking the generative image model as an example, assuming we use GANs to generate images, we only need to input a noise vector to the model, and then the model will generate a corresponding image. In this process, each input will only generate one output, and there is no dependency between the outputs, so there is no need for data interaction.

Taking GPT-3 as an example, generating the next word only requires the current text input and the model’s state, without the need for interaction with other inputs or outputs, so the requirement for data interaction is also weak.

In summary

Whether for large language models or generative image models, inference tasks have relatively low computational complexity and data interaction, making them better suited to decentralized distributed computing power networks. This is also the direction most projects are currently focusing on.

III. Projects

The technical threshold and breadth of decentralized distributed computing power networks are very high, and they also require hardware resources to support them, so we have not seen many attempts so far. Take Together and Gensyn.ai as examples:

1. Together

(RedPajama from Together)

Together is a company focused on open-source large models and committed to decentralized AI computing solutions, hoping that anyone, anywhere can access and use AI. Together has just completed a $20 million seed round led by Lux Capital.

Together was co-founded by Chris, Percy, and Ce. Their starting point was that large-scale model training requires large clusters of high-end GPUs and enormous spending, and that these resources and training capabilities are concentrated in a handful of large companies.

From my perspective, a more reasonable entrepreneurial plan for distributed computing power is:

Step 1. Open source model

To run model inference on a decentralized distributed computing power network, a prerequisite is that nodes can obtain the model at low cost, which means the model used on the network needs to be open source (if the model can only be used under a restrictive license, the complexity and cost of implementation increase). For example, ChatGPT, as a closed-source model, is not suitable for execution on a decentralized computing network.

It can therefore be inferred that an invisible barrier for a company providing a decentralized computing power network is strong capability in developing and maintaining large models. Developing and open-sourcing a powerful base model frees it, to some extent, from dependence on third-party open-source models, solves the most basic problem of decentralized computing networks, and makes it easier to prove that the computing network can effectively train and run inference on large models.

Together does exactly this. The recently released RedPajama, based on LLaMA, is a joint project by Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, and Hazy Research, with the goal of developing a series of fully open-source large language models.

Step 2. Distributed computing power is implemented in model inference

As mentioned in the previous two sections, compared to model training, the computational complexity and data interaction of model inference are lower and more suitable for decentralized distributed environments.

Building on the open-source model, Together's development team has made a series of updates to the RedPajama-INCITE-3B model, such as using LoRA for low-cost fine-tuning and making the model run more smoothly on CPUs (especially MacBook Pros with M2 Pro processors). At the same time, although this model is small, its capability exceeds other models of the same size, and it has already been used in legal, social, and other scenarios.
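
For readers unfamiliar with LoRA, here is a minimal sketch of the idea: freeze the pretrained weight and learn only a low-rank update, so fine-tuning touches a tiny fraction of the parameters. This is a generic illustration, not Together's implementation; production code (e.g. the peft library) adds scaling, dropout, and weight-merging utilities.

```python
# Minimal LoRA-style layer (generic illustration, not Together's code).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        self.base.weight.requires_grad_(False)                 # frozen pretrained weight
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank)) # starts as a no-op

    def forward(self, x):
        return self.base(x) + x @ self.A.T @ self.B.T          # low-rank correction term
```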

Step 3. Distributed computing power is implemented in model training

(Schematic diagram of the computing network of “Overcoming Communication Bottlenecks for Decentralized Training”)

In the medium to long term, although the challenges and technical bottlenecks are great, meeting the computing power needs of large-scale AI model training is surely the most attractive goal. Together began laying out work on overcoming communication bottlenecks in decentralized training from the moment it was founded, and published a related paper at NeurIPS 2022: Overcoming Communication Bottlenecks for Decentralized Training. The main directions can be summarized as follows:

Scheduling optimization

When training in a decentralized environment, it is important to assign communication-heavy tasks to devices with faster connections, because the links between nodes differ in latency and bandwidth. Together builds a model that describes the cost of a given scheduling strategy and uses it to optimize scheduling, minimizing communication cost and maximizing training throughput. The Together team also found that even when the network is 100 times slower, end-to-end training throughput slows down only by a factor of 1.7 to 2.3. Closing the gap between distributed networks and centralized clusters through scheduling optimization therefore looks promising.
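
The underlying idea of cost-model-driven scheduling can be illustrated with a toy sketch: score candidate placements of pipeline stages onto nodes by how long their inter-stage traffic takes over the available links, and keep the cheapest. This is not Together's actual scheduler (their formulation and search are far more sophisticated); it only shows what "modeling the cost of a scheduling strategy" means.

```python
# Toy cost-model scheduler: brute-force the placement of pipeline stages onto
# nodes and pick the one with the lowest communication time. Illustrative only.
import itertools

def comm_cost(order, stage_traffic, bandwidth):
    """order[i] = node hosting stage i; stage_traffic[i] = GB exchanged between
       stages i and i+1; bandwidth[a][b] = Gbps between nodes a and b."""
    return sum(stage_traffic[i] * 8 / bandwidth[order[i]][order[i + 1]]
               for i in range(len(stage_traffic)))

def best_placement(n_stages, stage_traffic, bandwidth):
    nodes = range(len(bandwidth))
    return min(itertools.permutations(nodes, n_stages),
               key=lambda order: comm_cost(order, stage_traffic, bandwidth))

# Example: the chatty stage boundary (50 GB) ends up on the fast 10 Gbps link,
# while the light one (2 GB) is pushed onto the slow 1 Gbps link.
bandwidth = [[0, 10, 1], [10, 0, 1], [1, 1, 0]]   # Gbps between node pairs
print(best_placement(3, stage_traffic=[50, 2], bandwidth=bandwidth))
```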

Communication compression optimization

Together proposed communication compression for forward activations and backward gradients and introduced the AQ-SGD algorithm, which provides strict guarantees on the convergence of stochastic gradient descent. AQ-SGD can fine-tune large base models over slow networks (e.g. 500 Mbps) and is only 31% slower than uncompressed end-to-end training on centralized computing networks (e.g. 10 Gbps). In addition, AQ-SGD can be combined with state-of-the-art gradient compression techniques (such as QuantizedAdam) to achieve a 10% end-to-end speedup.
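
The general idea behind this kind of activation compression can be sketched as quantizing activations to int8 before sending them over a slow link and dequantizing on the receiving side. To be clear, this is not the AQ-SGD algorithm itself (which quantizes the change in activations across iterations and carries convergence guarantees); it only shows why compression shrinks the payload roughly 4x versus float32.

```python
# Int8 activation quantization sketch (illustrates the compression idea only;
# this is not the actual AQ-SGD algorithm).
import torch

def quantize_int8(x: torch.Tensor):
    scale = x.abs().max() / 127.0 + 1e-12
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale                          # ~4x smaller than float32 on the wire

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale       # receiver reconstructs approximate values
```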

Project Summary

The Together team has a very comprehensive lineup: its members have strong academic backgrounds and include industry experts in large-model development, cloud computing, and hardware optimization. Together has also shown long-term patience in its path planning, from developing open-source large models, to testing idle computing power (such as Macs) for model inference on a distributed network, to laying the groundwork for distributed computing power in large-model training. It feels like a case of deep accumulation preparing a breakthrough 🙂

However, so far I have not seen much of Together's research on the incentive layer, which I think is just as important as the technical research and a key factor in ensuring the development of decentralized computing power networks.

2. Gensyn.ai

(Gensyn.ai)

From the technical path of Together, we can roughly understand the landing process of decentralized computing power networks in model training and inference, as well as the corresponding research and development focus.

Another important focus that cannot be ignored is the design of the incentive layer/consensus algorithm of the computing power network. For example, an excellent network needs to:

  1. Ensure that the income is attractive enough;
  2. Ensure that each miner receives appropriate income, including preventing cheating and rewarding more work with more pay;
  3. Ensure that tasks are reasonably scheduled and allocated among different nodes, without a large number of idle nodes or some nodes being overly crowded;
  4. Ensure that the incentive algorithm is concise and efficient and does not impose too much system burden or delay.

See how Gensyn.ai does it:

  • Becoming a Node

First, a solver in the computing power network competes through bidding for the right to process a user-submitted task, and, depending on the size of the task and the risk of cheating, must pledge a certain amount of funds.

  • Verification

While updating parameters, the solver generates multiple checkpoints (to ensure the transparency and traceability of its work) and periodically produces cryptographic proofs of reasoning (proofs of work progress) for the task;

When the solver finishes the work and produces computational results, the protocol selects a verifier. The verifier also pledges a certain amount of funds (to ensure honest verification) and decides which parts of the computational results need to be verified based on the aforementioned proofs.

  • If there is a disagreement between the solver and the verifier

The exact location of the disputed computational result is found through a Merkle-tree-based data structure. The entire verification process is recorded on the blockchain, and cheaters have their pledged funds deducted.
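
A small sketch of why a Merkle-style commitment makes dispute resolution cheap: both parties commit to hashes of their per-step checkpoints, a root mismatch reveals that they disagree, and a logarithmic number of subtree comparisons pinpoints the first divergent step. This is a generic illustration of the mechanism, not Gensyn.ai's exact protocol (a production protocol would compare internal tree nodes rather than recompute subtree roots).

```python
# Generic Merkle-commitment dispute localization sketch (not Gensyn's protocol).
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(leaves):
    level = [h(x) for x in leaves]
    while len(level) > 1:
        if len(level) % 2:                       # duplicate the last node on odd levels
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def first_divergence(a_leaves, b_leaves):
    """Binary search over checkpoint commitments for the first differing step."""
    lo, hi = 0, len(a_leaves)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if merkle_root(a_leaves[lo:mid]) == merkle_root(b_leaves[lo:mid]):
            lo = mid                             # the shared prefix is clean; look later
        else:
            hi = mid                             # mismatch already in the first half
    return lo

solver   = [b"step-%d-ok" % i for i in range(8)]
verifier = list(solver)
verifier[5] = b"step-5-cheated"
print(first_divergence(solver, verifier))        # -> 5, without replaying every step
```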

Project Summary

The design of the incentive and verification algorithms means that Gensyn.ai does not need to replay the entire computation task during verification; it only needs to replicate and check a portion of the results based on the provided proofs, which greatly improves verification efficiency. At the same time, nodes only need to store part of the computed results, which reduces storage and computational resource consumption. In addition, a potential cheating node cannot predict which parts will be selected for verification, which further reduces the risk of cheating.

This way of verifying disagreements and catching cheaters can also quickly locate where an error occurred in the computation without comparing the entire result (by starting from the root of the Merkle tree and traversing downward step by step), which is very effective for large-scale computation tasks.

In summary, the goal of Gensyn.ai’s incentive/verification layer design is: simple and efficient. However, at present, it is only at the theoretical level, and specific implementation may still face the following challenges:

  • In terms of the economic model, how to set appropriate parameters that can effectively prevent fraud and not pose too high a barrier to participants.

  • In terms of technical implementation, how to develop an effective periodic cryptographic reasoning proof is also a complex problem that requires advanced knowledge of cryptography.

  • In terms of task allocation, reasonable scheduling algorithms are needed to decide how the computing network selects and assigns tasks to different solvers. Allocating tasks purely through a bidding mechanism is questionable in terms of efficiency and feasibility: nodes with strong computing power can handle larger tasks but may not participate in bidding (which raises the incentive problem of node availability), while nodes with low computing power may offer the highest bid yet be unsuitable for complex, large-scale computation tasks.

IV. Some Thoughts on the Future

The question of who actually needs a decentralized computing power network has never really been verified. Using idle computing power for large-model training, which demands enormous computing resources, obviously seems the most sensible use and the one with the greatest imaginative space. But in reality, bottlenecks such as communication and privacy force us to rethink:

Is there really hope for decentralized training of large models?

If we step outside this consensus, the "most reasonable landing scenario," isn't applying decentralized computing power to training small AI models also a sizable market? From a technical point of view, the current limiting factors are resolved there by the smaller scale and simpler architecture of the models. Looking at the market, we have always felt that large-model training will be huge from now into the future, but is the market for small AI models really unattractive?

I don't think so. Compared with large models, small AI models are easier to deploy and manage, and they are more efficient in processing speed and memory usage. In many application scenarios, users or companies do not need the general reasoning ability of large language models; they only care about a very specific prediction target. Therefore, in most scenarios, small AI models remain the more feasible choice and should not be prematurely overlooked in the tide of FOMO around large models.
