Kernel Ventures: Data Availability and Historical Data Layer Design

Intermediate | 1/11/2024, 3:45:08 PM
This article explores and interprets DA performance indicators, DA-related technologies, and DA layer storage solutions.
  1. In the early stages of blockchain, maintaining data consistency across all nodes was considered essential to security and decentralization. As the blockchain ecosystem has developed, however, storage pressure has grown as well, pushing node operation toward centralization. The storage cost brought by TPS growth on Layer1 therefore needs to be solved urgently.
  2. Faced with this problem, developers should propose a solution that takes security, storage cost, data reading speed, and DA layer versatility fully into account.
  3. In the process of solving this problem, many new technologies and ideas have emerged, including Sharding, DAS, Verkle Tree, DA intermediate components, and so on. They try to optimize the storage scheme of the DA layer by reducing data redundancy and improving data validation efficiency.
  4. From the perspective of where data is stored, DA solutions fall into two broad categories: main-chain DAs and third-party DAs. Main-chain DAs reduce the storage pressure on nodes through periodic data pruning and sharded data storage, while third-party DAs are purpose-built storage services designed to handle large amounts of data. Within third-party DAs, the main trade-off is between single-chain compatibility and multi-chain compatibility, which yields three kinds of solutions: main-chain-specific DAs, modularized DAs, and storage public-chain DAs.
  5. Payment-type public chains have very high requirements for historical data security and are thus suited to using the main chain as the DA layer. However, for public chains that have been running for a long time and have a large number of miners running the network, it is more suitable to adopt a third-party DA that offers relatively high security without requiring changes to the consensus layer. For comprehensive public chains, a main-chain-specific DA with larger data capacity, lower cost, and good security is more suitable. However, considering the demand for cross-chain interaction, modular DA is also a good option.
  6. Overall, blockchain is moving towards reducing data redundancy as well as multi-chain division of labor.

1. Background

As a distributed ledger, a blockchain needs to store historical data on all nodes to ensure the security and sufficient decentralization of data storage. Since the correctness of each state change depends on the previous state (the source of the transaction), a blockchain should in principle store all historical records from the first transaction to the current one. Taking Ethereum as an example, even if the average block size is estimated at 20 KB, the total size of Ethereum blocks has already reached 370 GB. In addition to the blocks themselves, a full node also needs to record state and transaction receipts. Counting these, the total storage of a single node exceeds 1 TB, which concentrates node operation in the hands of a few.

Ethereum’s latest block height, image source: Etherscan

2. DA performance indicators

2.1 Safety

Compared with database or linked-list storage structures, the immutability of blockchain comes from the ability to verify newly generated data against historical data. Ensuring the security of historical data is therefore the first issue to consider in DA layer storage. When judging the data security of a blockchain system, we usually look at two things: the amount of data redundancy and the method used to verify data availability.

  1. Amount of redundancy: Data redundancy in a blockchain system mainly plays the following roles. First, the more redundant copies there are in the network, the more samples a verifier can consult when it needs to check the account state in a certain historical block to verify a transaction, and it can select the data recorded by the majority of nodes. In a traditional database, since data is stored only as key-value pairs on a certain node, historical data can be changed on a single node, and the cost of attack is extremely low. In theory, the more redundant copies there are, the more credible the data is. At the same time, the more nodes store the data, the less likely the data is to be lost. This can be compared to the centralized servers that host Web2 games: once all the backend servers are shut down, the game is shut down completely. However, more is not always better, because each redundant copy takes additional storage space, and excessive redundancy puts excessive storage pressure on the system. A good DA layer should choose a suitable redundancy scheme that balances security and storage efficiency.
  2. Data availability verification: Redundancy ensures that there are enough records of the data in the network, but the accuracy and completeness of the data to be used must still be verified. The verification method commonly used in blockchains today is the cryptographic commitment, in which the entire network records a small commitment derived from the transaction data. To test the authenticity of a piece of historical data, you recompute the commitment from that data and check whether the result is consistent with the commitment recorded by the network. If it is, the verification passes. Commonly used commitments include the Merkle Root and the Verkle Root. A high-security data availability verification algorithm requires only a small amount of verification data and can verify historical data quickly.

2.2 Storage cost

Once basic security is ensured, the next core goal of the DA layer is to reduce costs and increase efficiency. The first step is to reduce storage cost, that is, setting hardware performance differences aside, to reduce the storage space occupied per unit of data. At this stage, the main ways to reduce storage cost in blockchains are to adopt sharding and to use reward-based storage, which ensures that data is stored effectively while reducing the number of data backups. However, it is not hard to see from these methods that there is a trade-off between storage cost and data security: reducing storage occupancy often means reducing security. An excellent DA layer therefore needs to strike a balance between storage cost and data security. In addition, if the DA layer is a separate public chain, it needs to reduce cost by minimizing the intermediate steps of data exchange. Each transfer step leaves index data behind for subsequent queries, so the longer the call process, the more index data is left and the higher the storage cost. Finally, the cost of data storage is directly linked to the durability of the data. Generally speaking, the higher the storage cost of data, the more difficult it is for a public chain to store that data persistently.

2.3 Data reading speed

After cost reduction, the next step is to increase efficiency, that is, the ability to retrieve data from the DA layer quickly when it is needed. This process involves two steps. The first is to search for the nodes that store the data. This step mainly applies to public chains that have not achieved data consistency across the entire network; if a public chain synchronizes data across all of its nodes, the time consumed by this step can be ignored. The second is reading the data from the node itself. In the current mainstream blockchain systems, including Bitcoin, Ethereum, and Filecoin, nodes store data in a Leveldb database. In Leveldb, data is stored in three forms. Newly written data is first stored in a Memtable. When the Memtable is full, it is converted into an Immutable Memtable. Both are kept in memory, but an Immutable Memtable can no longer be changed and can only be read. The hot storage used in the IPFS network keeps data in this part, so that it can be read quickly from memory when called. However, the memory of an ordinary node is often only at the GB level and fills up easily, and when a node crashes or another abnormal situation occurs, the data in memory is permanently lost. For data to be stored persistently, it needs to be stored as an SST file on a solid-state drive (SSD). When such data is read, however, it must first be loaded into memory, which greatly reduces the data indexing speed. Finally, for systems that use sharded storage, restoring data requires sending requests to multiple nodes and reassembling the results, which also reduces the data reading speed.
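
The write path described above can be illustrated with a toy model. This is only a sketch and not Leveldb's actual implementation: the class name `ToyLSMStore` and the `memtable_limit` parameter are hypothetical, and the "SST files" are modeled as sorted in-memory lists rather than files on disk.

```python
class ToyLSMStore:
    """Toy model of the Memtable -> Immutable Memtable -> SST write path."""

    def __init__(self, memtable_limit=2):
        self.memtable_limit = memtable_limit  # hypothetical threshold, in entries
        self.memtable = {}      # mutable, in memory
        self.immutable = None   # frozen Memtable, in memory, read-only
        self.sst_files = []     # stand-in for sorted files on disk, newest last

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._freeze()

    def _freeze(self):
        # Persist the previously frozen table before replacing it.
        if self.immutable is not None:
            self._flush()
        self.immutable = self.memtable
        self.memtable = {}

    def _flush(self):
        # Write the Immutable Memtable out as one sorted run.
        self.sst_files.append(sorted(self.immutable.items()))
        self.immutable = None

    def get(self, key):
        # In-memory tables are checked first, then runs from newest to oldest.
        if key in self.memtable:
            return self.memtable[key], "memtable"
        if self.immutable is not None and key in self.immutable:
            return self.immutable[key], "immutable"
        for run in reversed(self.sst_files):
            for k, v in run:
                if k == key:
                    return v, "sst"
        return None, "missing"
```

The second element returned by `get` shows which tier served the read, which makes the latency argument visible: the most recent writes come from memory, older ones from the slower persisted runs.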

Leveldb data storage method, picture source: Leveldb-handbook

2.4 DA Generalization

With the development of DeFi and the various problems with CEXs, users’ demand for cross-chain transactions of decentralized assets is also growing. Whether the cross-chain mechanism is hash locking, a notary, or a relay chain, the historical data on both chains has to be confirmed simultaneously. The key to this problem is that the data on the two chains is separated, and different decentralized systems cannot communicate directly. A solution proposed at this stage is to change the way the DA layer stores data: the historical data of multiple public chains is stored on the same trusted public chain, and only the data on this public chain needs to be called during verification. This requires the DA layer to be able to establish secure communication with different types of public chains, which means that the DA layer must have good versatility.

3. Techniques Concerning DA

3.1 Sharding

  1. In a traditional distributed system, a file is not stored in complete form on a single node. Instead, the original data is divided into multiple Blocks, and each node stores one Block. A Block is usually not stored on only one node; appropriate backups are kept on other nodes. In existing mainstream distributed systems, this number of backups is usually set to 2. This Sharding mechanism reduces the storage pressure on a single node, expands the total capacity of the system to the sum of the storage capacity of all nodes, and at the same time ensures storage security through appropriate data redundancy. The Sharding scheme adopted in blockchains is broadly similar but differs in the details. First, because every node in a blockchain is untrusted by default, Sharding requires enough backups to judge data authenticity later, so the number of backups needs to be much greater than 2. Ideally, in a blockchain system using this storage scheme, if the total number of verification nodes is T and the number of shards is N, then the number of backups should be T/N. The second difference is the storage process of a Block. Traditional distributed systems have fewer nodes, so one node is often assigned multiple data blocks: the data is first mapped onto a hash ring by a consistent hashing algorithm, each node then stores the data blocks numbered within a certain range, and it is acceptable for a node to receive no storage task in a given round. On a blockchain, whether a node is assigned a Block is no longer a random event but a certainty: each node randomly selects one Block to store. The selection is made by hashing the original data together with the block and node information and taking the result modulo the number of shards. Assuming that each piece of data is divided into N Blocks, the actual storage size of each node is only 1/N of the original. By setting N appropriately, a balance between growing TPS and node storage pressure can be achieved.
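
The hash-then-modulo selection described above can be sketched as follows. This is an illustrative sketch only: the function names, the use of SHA-256, and the way the data and node identifiers are concatenated are assumptions, not the scheme of any particular chain.

```python
import hashlib
from collections import Counter


def assign_shard(data_id: bytes, node_id: bytes, num_shards: int) -> int:
    """Pick the shard a node stores: hash(data info + node info) mod N."""
    digest = hashlib.sha256(data_id + b"|" + node_id).digest()
    return int.from_bytes(digest, "big") % num_shards


def shard_population(data_id: bytes, node_ids, num_shards: int) -> Counter:
    """Count how many nodes end up holding each shard of one piece of data."""
    return Counter(assign_shard(data_id, n, num_shards) for n in node_ids)


# T = 1000 hypothetical nodes, N = 10 shards.
nodes = [f"node-{i}".encode() for i in range(1000)]
population = shard_population(b"block-42", nodes, 10)
# Each node stores exactly one shard (1/N of the data), and on average
# each shard is backed up by about T/N = 100 nodes.
```

Because the assignment is a pure function of public inputs, any verifier can recompute which shard a given node is expected to hold.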

Data storage method after Sharding, image source: Kernel Ventures

3.2 DAS(Data Availability Sampling)

DAS builds on and further optimizes the Sharding storage method. During Sharding, because nodes simply store Blocks at random, a certain Block may be lost. In addition, for sharded data it is very important to confirm the authenticity and integrity of the data when it is restored. In DAS, these two problems are solved through erasure codes and KZG polynomial commitments.

  1. Erasure code: Considering the huge number of verification nodes in Ethereum, the probability that a certain Block is not stored by any node is almost 0, but in theory such an extreme situation can still happen. To mitigate this threat of storage loss, the original data is usually not divided directly into Blocks for storage. Instead, the original data is first mapped to the coefficients of an n-order polynomial, 2n points are then taken on the polynomial, and each node randomly selects one of them to store. Only n+1 points are needed to restore an n-order polynomial, so the original data can be restored as long as roughly half of the Blocks have been selected by nodes. Erasure coding thus improves both the security of data storage and the network’s ability to recover data.
  2. KZG polynomial commitment: A very important aspect of data storage is verifying data authenticity. In networks that do not use erasure codes, various methods can be used for verification, but once erasure codes are introduced to improve data security, the KZG polynomial commitment is the more appropriate choice. It can verify the content of a single block directly in polynomial form, eliminating the need to convert the polynomial back to binary data. The overall form of verification is similar to that of a Merkle Tree, but it does not require the node data along a specific path; the KZG Root and the block data are enough to verify the authenticity of the block.
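
The erasure-coding step can be made concrete with a small sketch. It treats the data as the n+1 coefficients of an n-order polynomial over a prime field, evaluates it at 2n points, and recovers the coefficients from any n+1 of them by Lagrange interpolation. The modulus `P` and all function names are illustrative assumptions; production schemes use a much larger field, and the KZG commitment itself is not implemented here.

```python
P = 2**31 - 1  # a small prime modulus chosen for illustration (assumption)


def evaluate(coeffs, x):
    """Evaluate a polynomial (coefficients lowest degree first) at x, mod P."""
    result = 0
    for c in reversed(coeffs):
        result = (result * x + c) % P
    return result


def encode(coeffs, num_points):
    """Produce num_points shares (x, y) on the polynomial, at x = 1, 2, ..."""
    return [(x, evaluate(coeffs, x)) for x in range(1, num_points + 1)]


def poly_mul_linear(poly, root):
    """Multiply poly (lowest degree first) by (x - root), mod P."""
    out = [0] * (len(poly) + 1)
    for i, c in enumerate(poly):
        out[i] = (out[i] - c * root) % P
        out[i + 1] = (out[i + 1] + c) % P
    return out


def recover(points):
    """Lagrange-interpolate the coefficients from len(points) distinct shares."""
    k = len(points)
    coeffs = [0] * k
    for i, (xi, yi) in enumerate(points):
        basis, denom = [1], 1
        for j, (xj, _) in enumerate(points):
            if j != i:
                basis = poly_mul_linear(basis, xj)
                denom = (denom * (xi - xj)) % P
        scale = yi * pow(denom, -1, P) % P  # modular inverse (Python 3.8+)
        for d in range(k):
            coeffs[d] = (coeffs[d] + basis[d] * scale) % P
    return coeffs


data = [5, 17, 42, 7]      # n = 3, so n + 1 = 4 coefficients
shares = encode(data, 6)   # 2n = 6 shares handed out to nodes
```

With these values, any four of the six shares are enough: `recover(shares[:4])` and `recover([shares[5], shares[1], shares[3], shares[4]])` both return the original `data`.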

3.3 Data Validation Method in DA

Data validation ensures that the data called from a node is accurate and complete. To minimize the amount of data and the computational cost required for validation, the DA layer currently uses tree structures as its mainstream validation method. The simplest form is the Merkle Tree, which records data as a complete binary tree: to verify a node, only the Merkle Root and the hashes of the sibling subtrees along the node’s path are needed, so the time complexity of verification is O(logN) (where logN defaults to log2(N)). Although this greatly simplifies validation, the amount of data needed for validation still grows as the data grows. To address this, another validation method, the Verkle Tree, has been proposed. Each node in a Verkle Tree stores not only a value but also a Vector Commitment, so the authenticity of the data can be validated quickly from the value of the original node and the commitment proof, without calling the values of sibling nodes. The number of computations for each verification therefore depends only on the depth of the Verkle Tree, which is a fixed constant, and this greatly accelerates verification. However, computing a Vector Commitment requires the participation of all sibling nodes in the same layer, which greatly increases the cost of writing and changing data. For historical data, which is stored permanently, cannot be tampered with, and is only read and never written, the Verkle Tree is extremely suitable. In addition, both the Merkle Tree and the Verkle Tree have K-ary variants, whose mechanisms are similar and differ only in the number of subtrees under each node. The following table compares their performance.
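
The Merkle Tree check described above can be sketched in a few lines. This is a minimal illustration that assumes SHA-256 and a leaf count that is a power of two; the function names are hypothetical and no specific chain's encoding is implied.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return every level of the tree, leaf hashes first.
    Assumes len(leaves) is a power of two."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def make_proof(levels, index):
    """Collect the sibling hash at each level on the path from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))  # (hash, sibling_is_left)
        index //= 2
    return proof


def verify(leaf: bytes, proof, root: bytes) -> bool:
    """Recompute the root from the leaf and its sibling hashes."""
    node = h(leaf)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root


leaves = [f"tx{i}".encode() for i in range(8)]
levels = build_levels(leaves)
root = levels[-1][0]
proof = make_proof(levels, 5)  # 3 sibling hashes for 8 leaves: log2(8) = 3
```

The proof length equals the tree depth, which is the O(logN) growth the text refers to: doubling the number of leaves adds one more sibling hash to every proof.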

Time performance comparison of data verification methods, picture source: Verkle Trees

3.4 Generic DA Middleware

The continuous expansion of the blockchain ecosystem has brought a continuous increase in the number of public chains. Because each public chain has its own advantages and is irreplaceable in its field, it is almost impossible for Layer 1 public chains to unify in the short term. However, with the development of DeFi and the various problems with CEXs, users’ demand for decentralized cross-chain asset trading is also growing. Multi-chain data storage at the DA layer, which can eliminate security issues in cross-chain data interaction, has therefore received more and more attention. To accept historical data from different public chains, however, the DA layer needs to provide a decentralized protocol for the standardized storage and verification of data streams. For example, KYVE, a storage middleware based on Arweave, actively fetches data from the chains and stores all of it on Arweave in a standard form, minimizing differences in the data transmission process. By contrast, a Layer2 that provides DA layer data storage specifically for a certain public chain exchanges data through internally shared nodes. Although this reduces the cost of interaction and improves security, it is relatively limited and can only provide data to the specific public chain it serves.

4. Storage Methods of DA

4.1 Main chain DA

4.1.1 DankSharding-like

This type of storage solution has no definite name yet. Its most prominent representative is DankSharding on Ethereum, so this article uses the term DankSharding-like to refer to it. This type of solution mainly uses the two DA storage technologies mentioned above, Sharding and DAS. First, the data is divided into an appropriate number of shares through Sharding, and then each node extracts a data Block in the form of DAS for storage. If there are enough nodes in the entire network, a larger number of shards N can be chosen, so that the storage pressure on each node is only 1/N of the original, expanding the overall storage space N times. At the same time, to prevent the extreme situation in which a certain Block is not stored by any node, DankSharding encodes the data with an erasure code, so that the data can be fully restored from only half of it. The last step is data verification, which uses the Verkle tree structure and polynomial commitments to achieve fast verification.

4.1.2 Temporary storage

For the DA of the main chain, one of the simplest ways to handle data is to store historical data only for a short period. In essence, a blockchain plays the role of a public ledger: changes to the ledger need to be witnessed by the entire network, but they do not need to be stored permanently. Taking Solana as an example, although its historical data is synchronized to Arweave, mainnet nodes retain only the transaction data of the past two days. On a public chain based on account records, the historical data at each moment retains the final state of the accounts on the blockchain, which is enough to provide a verification basis for the changes at the next moment. Projects that need data from before this period can store it themselves on other decentralized public chains or entrust it to a trusted third party. In other words, those who need additional data pay for the storage of historical data.

4.2 Third-party DA

4.2.1 Main chain-specific DA: EthStorage

  1. Main chain-specific DA: The most important property of the DA layer is the security of data transmission, and in this respect the most secure option is the main chain’s own DA. However, main chain storage is subject to storage space limits and competition for resources. When the amount of network data grows rapidly, a third-party DA is therefore a better choice for long-term data storage. If the third-party DA is highly compatible with the main network, the two can share nodes, and data interaction will also be more secure. With security in mind, then, a main chain-specific DA has a huge advantage. Taking Ethereum as an example, a basic requirement for a main chain-specific DA is to be compatible with the EVM and to ensure interoperability with Ethereum data and contracts. Representative projects include Topia and EthStorage. Of these, EthStorage is currently the most developed in terms of compatibility: in addition to compatibility at the EVM level, it provides dedicated interfaces to Ethereum development tools such as Remix and Hardhat, achieving compatibility at the development tool level as well.
  2. EthStorage: EthStorage is a public chain independent of Ethereum, but the nodes running it are a superset of Ethereum nodes; that is, a node running EthStorage can also run Ethereum at the same time. EthStorage can be operated directly through opcodes on Ethereum. In EthStorage’s storage model, only a small amount of metadata is retained on the Ethereum mainnet for indexing, which essentially creates a decentralized database for Ethereum. In the current solution, EthStorage implements the interaction between the Ethereum mainnet and EthStorage by deploying an EthStorage Contract on the Ethereum mainnet. To store data, Ethereum calls the put() function in the contract. Its input parameters are two byte-array variables, key and data, where data is the data to be stored and key is its identifier on the Ethereum network, which can be regarded as analogous to a CID in IPFS. After the (key, data) pair is successfully stored on the EthStorage network, EthStorage generates a kvldx, returns it to the Ethereum mainnet, and associates it with the key on Ethereum. This value corresponds to the storage address of the data on EthStorage, so the problem of storing a large amount of data becomes that of storing a single (key, kvldx) pair, greatly reducing the storage cost on the Ethereum mainnet. To call previously stored data, you use the get() function in EthStorage with the key as its parameter, and the data can be looked up quickly on EthStorage through the kvldx stored on Ethereum.
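
The put()/get() flow can be modeled with a toy in-memory example. This is a sketch of the idea only, not the real EthStorage contract interface: the class names are hypothetical, and a sequential integer stands in for the index that the article calls kvldx.

```python
class ToyStorageNetwork:
    """Stand-in for the off-chain storage network that holds the actual data."""

    def __init__(self):
        self._slots = []

    def store(self, data: bytes) -> int:
        self._slots.append(data)
        return len(self._slots) - 1  # plays the role of the article's kvldx

    def load(self, kv_idx: int) -> bytes:
        return self._slots[kv_idx]


class ToyStorageContract:
    """Stand-in for the on-chain contract: keeps only key -> index metadata."""

    def __init__(self, network: ToyStorageNetwork):
        self.network = network
        self.index_of = {}  # the only state kept "on chain"

    def put(self, key: bytes, data: bytes) -> None:
        self.index_of[key] = self.network.store(data)

    def get(self, key: bytes) -> bytes:
        return self.network.load(self.index_of[key])


contract = ToyStorageContract(ToyStorageNetwork())
contract.put(b"profile-image", b"\x00" * 100_000)  # large payload goes off chain
payload = contract.get(b"profile-image")
```

However large the payload, the "on-chain" state in this model grows by a single small (key, index) entry per stored item, which is the cost saving the text describes.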

EthStorage contract, image source: Kernel Ventures

  1. In terms of how nodes actually store data, EthStorage draws on the Arweave model. First, the large number of (k, v) pairs from ETH is sharded. Each shard contains a fixed number of (k, v) pairs, and the size of each (k, v) pair is also limited, which ensures that the workload is fair to miners in the subsequent storage reward process. Before rewards are issued, it is necessary to verify that a node actually stores the data. For this, EthStorage divides a shard (TB-level in size) into many chunks and retains a Verkle root on the Ethereum mainnet for verification. The miner first provides a nonce, which is combined with the hash of the previous block on EthStorage by a random algorithm to generate the addresses of several chunks. The miner must then provide the data of these chunks to prove that it indeed stores the entire shard. The nonce cannot be chosen arbitrarily, however; otherwise, a node could pick a convenient nonce that maps only to the chunks it stores and so pass verification. The nonce must therefore be such that the hash obtained by mixing the generated chunks meets the network’s difficulty requirement, and only the first node to submit the nonce and the random access proof receives the reward.
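
The nonce-and-sampling procedure above can be sketched as follows. This is a simplified illustration under stated assumptions: SHA-256 stands in for the real hash and random algorithm, all function names and parameters are hypothetical, and the check of each chunk against the root kept on the Ethereum mainnet is omitted.

```python
import hashlib


def sample_chunk_indices(prev_block_hash: bytes, nonce: int, num_chunks: int, samples: int):
    """Derive pseudo-random chunk indices from the previous block hash and the nonce."""
    indices = []
    for i in range(samples):
        seed = hashlib.sha256(
            prev_block_hash + nonce.to_bytes(8, "big") + i.to_bytes(4, "big")
        ).digest()
        indices.append(int.from_bytes(seed, "big") % num_chunks)
    return indices


def proof_hash(prev_block_hash: bytes, nonce: int, chunks, indices) -> bytes:
    """Mix the sampled chunk data with the nonce into a single hash."""
    hasher = hashlib.sha256(prev_block_hash + nonce.to_bytes(8, "big"))
    for idx in indices:
        hasher.update(chunks[idx])
    return hasher.digest()


def mine(prev_block_hash, chunks, samples, difficulty_bits, max_nonce=1_000_000):
    """Search for a nonce whose mixed hash falls below the difficulty target."""
    target = 1 << (256 - difficulty_bits)
    for nonce in range(max_nonce):
        indices = sample_chunk_indices(prev_block_hash, nonce, len(chunks), samples)
        digest = proof_hash(prev_block_hash, nonce, chunks, indices)
        if int.from_bytes(digest, "big") < target:
            return nonce, indices
    return None


def verify(prev_block_hash, nonce, claimed_chunks, num_chunks, samples, difficulty_bits):
    """Recompute the sampled indices and the difficulty check from the submitted chunks."""
    indices = sample_chunk_indices(prev_block_hash, nonce, num_chunks, samples)
    if any(i not in claimed_chunks for i in indices):
        return False  # the miner could not supply a sampled chunk
    digest = proof_hash(prev_block_hash, nonce, claimed_chunks, indices)
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))


prev_hash = b"\x00" * 32
shard = [bytes([i]) * 32 for i in range(64)]  # 64 hypothetical chunks
nonce, indices = mine(prev_hash, shard, samples=4, difficulty_bits=4)
submitted = {i: shard[i] for i in indices}    # chunk data sent with the proof
```

Because the sampled indices depend on the nonce, a miner holding only part of the shard can use only the nonces that happen to sample chunks it owns, so it finds a valid nonce more slowly than a miner holding everything.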

4.2.2 Modularization DA: Celestia

  1. Blockchain module: At this stage, the tasks a Layer1 public chain has to perform fall into four parts: (1) designing the underlying logic of the network, selecting verification nodes in a certain way, writing blocks, and allocating rewards to network maintainers; (2) packaging and processing transactions and publishing them; (3) verifying the transactions to be put on chain and determining the final state; (4) storing and maintaining the historical data of the blockchain. According to these functions, a blockchain can be divided into four modules: the consensus layer, the execution layer, the settlement layer, and the data availability layer (DA layer).
  2. Modular blockchain design: For a long time, these four modules were integrated into a single public chain. Such a blockchain is called a monolithic blockchain. This form is more stable and easier to maintain, but it also puts huge pressure on the single public chain. In actual operation, the four modules constrain each other and compete for the public chain’s limited computing and storage resources. For example, increasing the processing speed of the execution layer brings greater storage pressure to the data availability layer, while ensuring the security of the execution layer requires a more complex verification mechanism that slows down transaction processing. The development of public chains therefore often faces trade-offs among these four modules. To break through this performance bottleneck, developers have proposed the modular blockchain. Its core idea is to separate one or more of the four modules mentioned above and implement them on a separate public chain. Each public chain can then focus solely on improving transaction speed or storage capacity, overcoming the weakest-link limits that previously constrained the overall performance of the blockchain.
  3. Modular DA: Separating the DA layer from the rest of the blockchain’s business and handing it over to a dedicated public chain is considered a feasible solution to the growing historical data of Layer 1. Exploration in this area is still in its early stages, and the most representative project at present is Celestia. In terms of the specific storage method, Celestia draws on Danksharding: it also divides the data into multiple blocks, has each node extract a part for storage, and uses KZG polynomial commitments to verify the integrity of the data. In addition, Celestia uses an advanced two-dimensional RS erasure code: the original data is rewritten in the form of a k × k matrix, and the original data can be recovered from only 25% of the encoded data. However, sharded data storage essentially just multiplies the storage pressure of the network’s nodes by a coefficient relative to the total data volume; the storage pressure on a node still grows linearly with the data volume. As Layer 1 continues to improve its transaction speed, the storage pressure on nodes may still reach an unacceptable level one day. To solve this problem, Celestia introduces the IPLD component. The data in the k × k matrix is not stored directly on Celestia but on the LL-IPFS network, and only the CID of the data on IPFS is retained in the node. When a user requests a piece of historical data, the node sends the corresponding CID to the IPLD component, which uses this CID to call the original data on IPFS. If the data exists on IPFS, it is returned via the IPLD component and the node; if it does not exist, the data cannot be returned.
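
The CID-based read path can be illustrated with a toy content-addressed store. This sketch makes several simplifying assumptions: the class names are hypothetical, a hex SHA-256 digest stands in for a real CID (real CIDs use a multihash encoding), and the IPLD component is reduced to a direct lookup.

```python
import hashlib


class ToyContentStore:
    """Stand-in for an IPFS-like network: data is addressed by the hash of its content."""

    def __init__(self):
        self._blobs = {}

    def add(self, data: bytes) -> str:
        cid = hashlib.sha256(data).hexdigest()  # simplified stand-in for a CID
        self._blobs[cid] = data
        return cid

    def fetch(self, cid: str):
        return self._blobs.get(cid)


class ToyDANode:
    """Keeps only CIDs locally and resolves them through the content store."""

    def __init__(self, store: ToyContentStore):
        self.store = store
        self.cids = {}  # block height -> CID

    def record(self, height: int, data: bytes) -> None:
        self.cids[height] = self.store.add(data)

    def read(self, height: int):
        cid = self.cids.get(height)
        if cid is None:
            return None
        data = self.store.fetch(cid)
        # Content addressing lets the node check what it received.
        if data is None or hashlib.sha256(data).hexdigest() != cid:
            return None
        return data


store = ToyContentStore()
node = ToyDANode(store)
node.record(100, b"rollup batch for height 100")
```

In this model the node's own state is one short identifier per block regardless of the block's size, and a read succeeds only if the external store still holds matching content.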

Celestia data reading method, image source: Celestia Core

  1. Celestia: Taking Celestia as an example, we can get a glimpse of how modular blockchains can be applied to Ethereum’s storage problem. A Rollup node sends packaged and verified transaction data to Celestia, which stores it. During this process, Celestia only stores the data and does not interpret it. The Rollup node then pays Celestia a storage fee in tia tokens according to the amount of storage space used. Storage in Celestia uses DAS and erasure codes similar to those in EIP4844, but the polynomial erasure code of EIP4844 is upgraded to a two-dimensional RS erasure code, which raises storage security again: only 25% of the fragments are needed to restore the entire transaction data. Celestia is essentially just a POS public chain with low storage costs. If it is to be used to solve Ethereum’s historical data storage problem, many other specific modules are needed to cooperate with it. In terms of rollups, for example, the mode highly recommended on the Celestia official website is the Sovereign Rollup. Unlike the common Rollup on Layer 2, which only computes and verifies transactions, that is, completes the execution-layer operations, a Sovereign Rollup includes the entire execution and settlement process, which minimizes the processing of transactions on Celestia. When Celestia’s overall security is weaker than Ethereum’s, this measure can maximize the security of the overall transaction process. As for ensuring the security of the data that the Ethereum mainnet calls from Celestia, the most mainstream solution at the moment is the quantum gravity bridge smart contract. For the data stored on Celestia, a Verkle Root (proof of data availability) is generated and kept in the quantum gravity bridge contract on the Ethereum mainnet. Every time Ethereum calls historical data on Celestia, the hash of that data is compared with the Verkle Root, and if they match, the data is indeed the real historical data.

4.2.3 Storage Chain DA

In terms of technical principles, main chain DA borrows many Sharding-like technologies from storage public chains. Among third-party DAs, some use a storage public chain directly to complete part of their storage tasks; for example, the actual transaction data in Celestia is placed on the LL-IPFS network. Besides building a separate public chain to solve Layer1’s storage problem, a more direct third-party DA approach is to connect a storage public chain to Layer1 and store Layer1’s huge historical data there. For high-performance blockchains, the volume of historical data is even larger. When running at full speed, the data volume of the high-performance public chain Solana is close to 4 PB, which is completely beyond the storage range of ordinary nodes. The solution Solana chose is to store historical data on the decentralized storage network Arweave and to retain only 2 days of data on mainnet nodes for verification. To ensure the security of the storage process, Solana and Arweave designed a dedicated storage bridge protocol, Solar Bridge. The data verified by Solana nodes is synchronized to Arweave, and the corresponding tag is returned. With this tag alone, a Solana node can view the historical data of the Solana blockchain at any time. Arweave does not require all network nodes to maintain data consistency as a threshold for participating in network operation. Instead, it adopts reward-based storage. First of all, Arweave does not build blocks in a traditional chain structure but in something closer to a graph structure. In Arweave, a new block points not only to the previous block but also to a randomly chosen, previously generated block called the Recall Block. The specific location of the Recall Block is determined by the hash of the previous block and its block height, so the location of the Recall Block is unknown until the previous block has been mined. To generate a new block, however, a node needs the Recall Block data in order to compute a hash of the specified difficulty under the POW mechanism. Only the first miner to compute a hash that meets the difficulty gets the reward, which encourages miners to store as much historical data as possible. At the same time, the fewer nodes store a certain historical block, the fewer competitors a node faces when generating a nonce that meets the difficulty, which encourages miners to store the blocks that have fewer backups in the network. Finally, to ensure that nodes store data permanently, Arweave introduces WildFire, a node scoring mechanism. Nodes tend to communicate with nodes that can provide more historical data faster, while nodes with lower ratings are often unable to obtain the latest block and transaction data promptly and thus cannot gain an advantage in the POW competition…

Arweave block construction method, image source: Arweave Yellow-Paper
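
The Recall Block rule described above can be illustrated with a short sketch. It is a simplified model written for this article, not Arweave's actual implementation: the function names `recall_index` and `can_mine`, the use of SHA-256, and the way the previous block hash and the height are combined are all assumptions made for illustration.

```python
import hashlib


def recall_index(prev_block_hash: bytes, height: int) -> int:
    """Pick the index of the Recall Block for the block at `height`.

    Assumption: the previous block's hash is hashed together with the
    height and reduced modulo the height, so the result always points
    at an already generated block (0 .. height - 1).
    """
    if height <= 0:
        raise ValueError("the genesis block has no Recall Block")
    digest = hashlib.sha256(prev_block_hash + height.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") % height


def can_mine(stored_blocks: set, prev_block_hash: bytes, height: int) -> bool:
    """A miner can only compete for the next block if it stores the Recall Block."""
    return recall_index(prev_block_hash, height) in stored_blocks


prev_hash = hashlib.sha256(b"block-99").digest()
full_node = set(range(100))   # stores every historical block
empty_node = set()            # stores nothing
print(can_mine(full_node, prev_hash, 100), can_mine(empty_node, prev_hash, 100))
```

Because the index cannot be computed before the previous block is known, a miner cannot tell in advance which historical block it will need, so storing more blocks raises its chance of being eligible to compete.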

5. Synthesized Comparison

Next, we will compare the advantages and disadvantages of the five storage solutions based on the four dimensions of DA performance indicators.

  1. Security: The biggest sources of data security problems are loss during data transmission and malicious tampering by dishonest nodes. In the cross-chain process, because the two public chains are independent and do not share state, data transmission is one of the areas hit hardest by security problems. In addition, a Layer 1 that currently needs a dedicated DA layer often has a strong consensus group, so its security is much higher than that of an ordinary storage public chain. Therefore, the main chain DA solution has higher security. Once the security of data transmission is ensured, the next step is to ensure the security of the data being called. Considering only the short-term historical data used to verify transactions, temporary storage backs up the same data across the entire network, whereas in a DankSharding-like solution the average number of backups is only 1/N of the number of nodes in the entire network. More data redundancy makes data less likely to be lost and provides more reference samples during verification. Therefore, temporary storage has relatively higher data security. Among third-party DA solutions, the main chain-specific DA shares nodes with the main chain, and data can be transmitted directly through these relay nodes during the cross-chain process, so it has relatively higher security than other third-party DA solutions.
  2. Storage costs: The biggest factor affecting storage costs is the amount of data redundancy. In the short-term storage solution of the main chain DA, data is stored by synchronizing it across all network nodes. Any newly stored data must be backed up on every node in the network, which gives this solution the highest storage cost. The high storage cost in turn means that this method is only suitable for temporary storage in high-TPS networks. Next comes Sharding, including Sharding in the main chain and Sharding in third-party DA. Since the main chain often has more nodes, each Block also has more backups, so the main chain Sharding solution has higher costs. The lowest storage cost belongs to the storage public chain DA, which adopts reward-based storage. Under this scheme, the amount of data redundancy tends to fluctuate around a fixed constant. At the same time, the storage public chain DA introduces a dynamic adjustment mechanism that raises rewards to attract nodes to store data with fewer backups, thereby ensuring data security.
  3. Data reading speed: The reading speed of data is mainly affected by where the data sits in the storage space, the data index path, and the distribution of the data among nodes. Among these, where the data is stored on the node has the greatest impact, because reading from memory and reading from an SSD can differ in speed by dozens of times. Storage public chain DA mostly uses SSD storage, because the load on such a chain includes not only DA layer data but also storage-heavy personal data such as videos and pictures uploaded by users. If the network did not use SSDs as storage space, it would struggle to carry the huge storage pressure and meet long-term storage needs. Secondly, between third-party DA and main chain DA that both keep data in memory, the third-party DA first needs to look up the corresponding index data on the main chain, then transfer the index data across chains to the third-party DA and return the data through the storage bridge. In contrast, the main chain DA can query data directly from its nodes and therefore retrieves data faster. Finally, within the main chain DA, the Sharding method requires calling Blocks from multiple nodes and restoring the original data. Therefore, it is slower than short-term storage, which does not shard the data.
  4. DA layer universality: The universality of main chain DA is close to zero, because it is impossible to transfer data from one public chain with insufficient storage space to another public chain with insufficient storage space. In third-party DA, the versatility of a solution and its compatibility with a specific main chain are conflicting indicators. For example, a main chain-specific DA solution designed for a certain main chain makes many modifications at the node type and network consensus levels to adapt to that public chain. These modifications therefore become a huge hindrance when communicating with other public chains. Within third-party DA, storage public chain DA performs better in terms of versatility than modular DA. The storage public chain DA has a larger developer community and more supporting facilities, which allow it to adapt to different public chains. At the same time, the storage public chain DA acquires data actively through packet capture rather than passively receiving information transmitted from other public chains. Therefore, it can encode data in its own way, achieve standardized storage of data streams, facilitate the management of data from different main chains, and improve storage efficiency.

Storage solution performance comparison, image source: Kernel Ventures

6. Summary

The blockchain industry is currently transforming from Crypto to the more inclusive Web3, and this process brings a growing richness of projects onto the blockchain. To accommodate the simultaneous operation of so many projects on Layer1 while ensuring the experience of Gamefi and Socialfi projects, Layer1s represented by Ethereum have adopted methods such as Rollup and Blobs to improve TPS. Among new blockchains, the number of high-performance blockchains is also growing. But higher TPS means not only higher performance but also greater storage pressure on the network. For massive historical data, various DA methods based on the main chain and on third parties have been proposed to cope with the growth of on-chain storage pressure. Each method has advantages and disadvantages and suits different situations.

Blockchains that focus on payment have extremely high requirements for the security of historical data and do not pursue particularly high TPS. If such a public chain is still in the preparation stage, it can adopt a DankSharding-like storage method, which achieves a huge increase in storage capacity while ensuring security. However, for a public chain like Bitcoin that has already taken shape and has a large number of nodes, rash changes at the consensus layer carry huge risks. Therefore, the main chain-specific DA, which offers higher security among off-chain storage options, can be used to balance security and storage issues… It is worth noting, however, that the functions of a blockchain are not static but constantly changing. For example, the early functions of Ethereum were mainly limited to payments and to simple automated processing of assets and transactions through smart contracts. As the blockchain landscape expanded, various Socialfi and Defi projects were gradually added to Ethereum, pushing it to develop in a more comprehensive direction. Recently, with the explosion of the inscription ecosystem on Bitcoin, the transaction fees of the Bitcoin network have surged nearly 20 times since August. This reflects that the transaction speed of the Bitcoin network at this stage cannot meet transaction demand, and traders can only raise fees to have their transactions processed as quickly as possible. The Bitcoin community now faces a trade-off: accept high fees and slow transaction speeds, or reduce network security to increase transaction speeds and thereby defeat the original intention of the payment system. If the Bitcoin community chooses the latter, then in the face of increasing data pressure, the corresponding storage solution will also need to be adjusted.

Bitcoin mainnet transaction fees fluctuate, image source: OKLINK

Public chains with comprehensive functions pursue higher TPS, and their historical data grows even faster. In the long run, a DankSharding-like solution can hardly keep up with the rapid growth of TPS. Therefore, a more appropriate way is to migrate the data to a third-party DA for storage. Among third-party DAs, the main chain-specific DA has the highest compatibility and may have more advantages if only the storage issues of a single public chain are considered. But today, when Layer 1 public chains are flourishing, cross-chain asset transfer and data interaction have become a common pursuit of the blockchain community. If the long-term development of the entire blockchain ecosystem is taken into account, storing the historical data of different public chains on the same public chain can eliminate many security issues in the data exchange and verification process. Therefore, modular DA or storage public chain DA might be a better choice. With comparable versatility, modular DA focuses on providing blockchain DA layer services and introduces more refined index data to manage historical data, which allows it to classify the data of different public chains reasonably. It therefore has more advantages than storage public chain DA. However, the above solution does not take into account the cost of adjusting the consensus layer on an existing public chain. This process is extremely risky. Once problems occur, they may lead to systemic vulnerabilities and cause the public chain to lose community consensus. Therefore, as a transitional solution during blockchain expansion, the simplest temporary storage on the main chain may be more suitable. Finally, the above discussion is based on performance during actual operation. However, if the goal of a certain public chain is to develop its ecosystem and attract more project teams and participants, it may also prefer projects that are supported and funded by its foundation… For example, even when the overall performance is equivalent to or slightly lower than that of storage public chain solutions, the Ethereum community will tend toward Layer 2 projects supported by the Ethereum Foundation, such as EthStorage, to continue developing the Ethereum ecosystem.

All in all, the functions of today’s blockchains are becoming more and more complex, which also brings greater storage space requirements. When there are enough Layer1 verification nodes, historical data does not need to be backed up by all nodes in the entire network; relative security can be guaranteed as long as the number of backups reaches a certain value. At the same time, the division of labor among public chains has become more and more detailed: Layer 1 is responsible for consensus and execution, Rollup is responsible for calculation and verification, and a separate blockchain is used for data storage. Each part can focus on a certain function without being limited by the performance of the other parts. However, how much storage, or what proportion of nodes, should be devoted to historical data to balance security and efficiency, and how to ensure secure interoperability between different blockchains, are issues that blockchain developers need to think about and continuously improve. Investors can pay attention to main chain-specific DA projects on Ethereum, because Ethereum already has enough supporters at this stage and does not need to rely on other communities to expand its influence. What it needs more is to improve and develop its own community and attract more projects to the Ethereum ecosystem. However, for public chains in a catch-up position, such as Solana and Aptos, the single chain itself does not have such a complete ecosystem, so it may be more inclined to join forces with other communities and build a large cross-chain ecosystem to expand its influence. Thus, for emerging Layer1s, general-purpose third-party DA deserves more attention.


Kernel Ventures is a crypto venture capital fund driven by the research and development community, with over 70 early-stage investments focused on infrastructure, middleware, and dApps, especially ZK, Rollup, DEX, modular blockchains, and vertical areas that will onboard billions of crypto users in the future, such as account abstraction, data availability, and scalability. For the past seven years, we have been committed to supporting the growth of core development communities and university blockchain associations around the world.

Disclaimer:

  1. This article is reprinted from [mirror]. All copyrights belong to the original author [Kernel Ventures Jerry Luo]. If there are objections to this reprint, please contact the Gate Learn team, and they will handle it promptly.
  2. Liability Disclaimer: The views and opinions expressed in this article are solely those of the author and do not constitute any investment advice.
  3. Translations of the article into other languages are done by the Gate Learn team. Unless mentioned, copying, distributing, or plagiarizing the translated articles is prohibited.


1. Background

As a distributed ledger, a blockchain needs to store historical data on all nodes to ensure the security and sufficient decentralization of data storage. Since the correctness of each state change depends on the previous state (the source of the transaction), a blockchain should in principle store all historical records from the first transaction to the current one in order to ensure the correctness of transactions. Taking Ethereum as an example, even if the average block size is estimated at 20 KB, the current total size of Ethereum blocks has reached 370 GB. In addition to the blocks themselves, a full node also needs to record state and transaction receipts. Counting this part, the total storage of a single node has exceeded 1 TB, which concentrates node operation in the hands of a few.

Ethereum’s latest block height, image source: Etherscan

2. DA performance indicators

2.1 Safety

Compared with database or linked list storage structures, what sets blockchain apart is the ability to verify newly generated data against historical data. Therefore, ensuring the security of historical data is the first issue to consider in DA layer storage. When judging the data security of a blockchain system, we often analyze it in terms of the amount of data redundancy and the method used to verify data availability.

  1. Amount of redundancy: Data redundancy in a blockchain system mainly plays the following roles. First, the more redundant copies there are in the network, the more samples a verifier can consult when it needs to view the account state in a certain historical block to verify a transaction, and it can select the data recorded by most nodes. In a traditional database, data is stored only as key-value pairs on a certain node, so historical data can be changed by altering a single node, and the cost of an attack is extremely low. In theory, the greater the number of redundant copies, the less likely the data is to be tampered with and the higher its credibility. At the same time, the more nodes store the data, the less likely the data is to be lost. This can be compared to the centralized servers that host Web2 games: once all the backend servers are shut down, the game is shut down completely. However, more is not always better, because each redundant copy takes up additional storage space, and excessive data redundancy puts excessive storage pressure on the system. A good DA layer should choose a suitable redundancy approach that balances security and storage efficiency.
  2. Data availability verification: The amount of redundancy ensures that there are enough records of the data in the network, but the accuracy and completeness of the data to be used must still be verified. The verification method commonly used in blockchains today is the cryptographic commitment, in which the entire network records only a small cryptographic commitment derived from the transaction data. To test the authenticity of a certain piece of historical data, you recompute the cryptographic commitment from the data and check whether it is consistent with the commitment recorded by the network. If it is, the verification passes. Commonly used cryptographic verification structures include the Merkle Root and the Verkle Root. A high-security data availability verification algorithm requires only a small amount of verification data and can verify historical data quickly.
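
The two ideas in this list, picking the version of the data that most nodes report and then checking it against a recorded commitment, can be sketched as follows. This is a minimal illustration under stated assumptions: a plain SHA-256 digest stands in for the commitment (a real chain would use a Merkle Root or Verkle Root), and the names `commitment`, `majority_copy`, and `verify` are hypothetical.

```python
import hashlib
from collections import Counter


def commitment(data: bytes) -> str:
    """Stand-in commitment: a SHA-256 digest (assumption, not a Merkle/Verkle root)."""
    return hashlib.sha256(data).hexdigest()


def majority_copy(copies: list) -> bytes:
    """Return the version of the data reported by the most nodes."""
    if not copies:
        raise ValueError("no copies available")
    value, _count = Counter(copies).most_common(1)[0]
    return value


def verify(data: bytes, recorded_commitment: str) -> bool:
    """Recompute the commitment from the data and compare it with the network's record."""
    return commitment(data) == recorded_commitment


honest = b"alice->bob:5"
recorded = commitment(honest)                          # what the whole network keeps
copies = [honest, honest, b"alice->bob:500", honest]   # one node returns tampered data
candidate = majority_copy(copies)
print(verify(candidate, recorded))
```

The network only has to keep the small `recorded` value, while the full data can live on a subset of nodes and still be checked when it is read back.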

2.2 Storage cost

On the premise of ensuring basic security, the next core goal of the DA layer is to reduce costs and increase efficiency. The first is to reduce storage costs, that is, leaving hardware performance differences aside, to reduce the storage space occupied per unit of data. At this stage, the main ways to reduce storage costs in blockchain are to adopt sharding and to use reward-based storage, which reduces the number of data backups while ensuring that data is still stored effectively. However, it is not hard to see from these methods that there is a trade-off between storage cost and data security: reducing storage occupancy often means a decrease in security. Therefore, an excellent DA layer needs to strike a balance between storage cost and data security. In addition, if the DA layer is a separate public chain, it needs to reduce cost by minimizing the intermediate steps of data exchange. Each transfer leaves index data behind for subsequent queries, so the longer the call process, the more index data is left and the higher the storage cost. Finally, the cost of data storage is directly linked to the durability of the data. Generally speaking, the higher the storage cost of data, the harder it is for the public chain to store that data persistently.

2.3 Data reading speed

After reducing cost, the next step is to increase efficiency, that is, the ability to call data out of the DA layer quickly when it is needed. This process involves two steps. The first is to search for the nodes that store the data. This step mainly concerns public chains that have not achieved data consistency across the entire network; if a public chain synchronizes data across all of its nodes, the time consumed by this step can be ignored. The second is to read the data from the node. In the current mainstream blockchain systems, including Bitcoin, Ethereum, and Filecoin, nodes store data in a LevelDB database. In LevelDB, data is stored in three ways. First, newly written data is stored in Memtable-type files. When a Memtable is full, its file type changes from Memtable to Immutable Memtable. Both types of files are stored in memory, but Immutable Memtable files can no longer be modified and can only be read. The hot storage used in the IPFS network keeps data in this part, so it can be read quickly from memory when called. However, the available memory of an ordinary node is often only at the GB level and fills up quickly, and when a node crashes or another abnormal situation occurs, the data in memory is permanently lost. For data to be stored persistently, it must be stored as SST files on a solid-state drive (SSD). However, reading such data requires loading it into memory first, which greatly reduces the data indexing speed. Finally, for systems that use sharded storage, restoring data requires sending data requests to multiple nodes and reassembling the results, which also reduces the data reading speed.

Leveldb data storage method, picture source: Leveldb-handbook
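
The Memtable, Immutable Memtable, and SST flow described above can be modeled with a toy class. This is only a sketch of the idea and does not use the LevelDB API: `ToyLSM` and its attributes are hypothetical names, a Python list stands in for SST files on an SSD, and real LevelDB additionally has a write-ahead log, sorted files, and compaction.

```python
from typing import Optional, Tuple


class ToyLSM:
    """Toy model of the Memtable -> Immutable Memtable -> SST flow."""

    def __init__(self, memtable_limit: int = 2):
        self.memtable_limit = memtable_limit
        self.memtable = {}    # in memory, writable
        self.immutable = {}   # in memory, read-only
        self.disk = []        # stand-in for persisted SST files, oldest first

    def put(self, key: str, value: str) -> None:
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._rotate()

    def _rotate(self) -> None:
        # Flush the previous immutable table to "disk", then freeze the full memtable.
        if self.immutable:
            self.disk.append(self.immutable)
        self.immutable = self.memtable
        self.memtable = {}

    def get(self, key: str) -> Optional[Tuple[str, str]]:
        """Return (value, where_found), checking the fastest layers first."""
        if key in self.memtable:
            return self.memtable[key], "memtable"
        if key in self.immutable:
            return self.immutable[key], "immutable"
        for sst in reversed(self.disk):   # newest file first
            if key in sst:
                return sst[key], "sst"
        return None


db = ToyLSM(memtable_limit=2)
for k in "abcde":
    db.put(k, k.upper())
print(db.get("a"), db.get("c"), db.get("e"))
```

Recent keys are answered from memory, while older keys require a lookup in the persisted layer, which is the source of the read-speed gap discussed above.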

2.4 DA Generalization

With the development of DeFi and the various problems with CEXs, users’ demand for cross-chain transactions of decentralized assets is also growing. Whether the cross-chain mechanism is hash locking, a notary, or a relay chain, the historical data on both chains must be confirmed at the same time. The key to this problem is that the data on the two chains is separated, and different decentralized systems cannot communicate directly. Therefore, a solution has been proposed at this stage that changes how the DA layer stores data: the historical data of multiple public chains is stored on the same trusted public chain, so that verification only needs to call data on this one chain. This requires the DA layer to be able to establish secure communication with different types of public chains, which means that the DA layer needs good versatility.

3. Techniques Concerning DA

3.1 Sharding

  1. In a traditional distributed system, a file is not stored in complete form on a single node. Instead, the original data is divided into multiple Blocks, and each node stores one Block. A Block is often not stored on only one node but has appropriate backups on other nodes. In existing mainstream distributed systems, the number of backups is usually set to 2. This Sharding mechanism reduces the storage pressure on a single node, expands the total capacity of the system to the sum of the storage capacity of all nodes, and at the same time ensures the security of storage through appropriate data redundancy. The Sharding scheme adopted in blockchain is broadly similar but differs in the details. First, because each node in a blockchain is untrusted by default, Sharding requires enough data backups to judge the authenticity of the data later, so the number of backups needs to be far more than 2. Ideally, in a blockchain system using this storage scheme, if the total number of verification nodes is T and the number of shards is N, the number of backups should be T/N. The second difference is the storage process of a Block. Traditional distributed systems have fewer nodes, so one node often holds multiple data blocks. The data is first mapped onto a hash ring by a consistent hashing algorithm, and each node then stores the data blocks whose numbers fall in a certain range; it is acceptable for a node to be assigned no storage task in a given round. On a blockchain, whether a node is assigned a Block is no longer a random event but a certainty. Each node randomly selects one Block to store, which is done by hashing the block’s original data together with the node’s own information and taking the result modulo the number of shards. Assuming each piece of data is divided into N Blocks, the actual storage size of each node is only 1/N of the original. By setting N appropriately, a balance between growing TPS and node storage pressure can be achieved.

Data storage method after Sharding, image source: Kernel Ventures
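
The assignment rule described above (hash the block data together with the node's information, then take the result modulo the number of shards) can be sketched as below. SHA-256, the node identifier format, and the values of T and N are assumptions chosen for illustration; `assigned_shard` and `backup_counts` are hypothetical names.

```python
import hashlib
from collections import Counter


def assigned_shard(block_data: bytes, node_id: str, num_shards: int) -> int:
    """Shard index a node stores: hash(block data + node info) mod N."""
    digest = hashlib.sha256(block_data + node_id.encode()).digest()
    return int.from_bytes(digest, "big") % num_shards


def backup_counts(block_data: bytes, node_ids: list, num_shards: int) -> Counter:
    """Count how many nodes end up storing each shard."""
    return Counter(assigned_shard(block_data, n, num_shards) for n in node_ids)


T, N = 1000, 10                          # T verification nodes, N shards (illustrative)
nodes = [f"node-{i}" for i in range(T)]
counts = backup_counts(b"block #1 payload", nodes, N)
print(sum(counts.values()) / N)          # average backups per shard, i.e. T / N
```

Every node stores exactly one shard, so each node holds 1/N of the data and each shard receives T/N backups on average, matching the ideal figure given in the text.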

3.2 DAS (Data Availability Sampling)

DAS is a further optimization of the Sharding storage method. During Sharding, because nodes store Blocks by simple random selection, a certain Block may be lost. Moreover, for sharded data, it is very important to confirm the authenticity and integrity of the data during restoration. In DAS, these two problems are solved through Erasure code and KZG polynomial commitment.

  1. Erasure code: Considering the huge number of verification nodes in Ethereum, the probability that a certain Block is not stored by any node is almost 0, but in theory such an extreme situation is still possible. To mitigate this threat of storage loss, under this scheme the original data is usually not divided directly into Blocks for storage. Instead, the original data is first mapped to the coefficients of an n-order polynomial, 2n points are then taken on the polynomial, and each node randomly selects one of them to store. An n-order polynomial can be restored from only n+1 points. Therefore, the original data can be restored as long as roughly half of the Blocks (n+1 of the 2n) have been selected by nodes. Erasure code thus improves both the security of data storage and the network’s data recovery capability.
  2. KZG polynomial commitment: A very important aspect of data storage is verifying the authenticity of data. In networks that do not use Erasure code, various methods can be used for verification, but once the Erasure code above is introduced to improve data security, it is more appropriate to use the KZG polynomial commitment. KZG polynomial commitment can verify the content of a single block directly in polynomial form, which removes the need to convert the polynomial back into binary data. The overall form of verification is similar to that of a Merkle Tree, but it does not require specific Path node data; the KZG Root and the block data alone are enough to verify the authenticity of the block.
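
The polynomial encoding behind the Erasure code can be illustrated with a small sketch: the data becomes the n+1 coefficients of an n-order polynomial, 2n points are published, and any n+1 of them recover the coefficients by Lagrange interpolation. The prime modulus, the evaluation points x = 1..2n, and the function names are assumptions made for this example; production systems use different fields and much faster algorithms, and the KZG commitment step is not shown.

```python
P = 2**31 - 1  # a prime modulus; all arithmetic is done modulo P


def evaluate(coeffs, x):
    """Evaluate the polynomial (coefficients lowest degree first) at x."""
    result = 0
    for c in reversed(coeffs):      # Horner's rule
        result = (result * x + c) % P
    return result


def encode(coeffs):
    """Map n+1 coefficients (an n-order polynomial) to 2n points, x = 1 .. 2n."""
    n = len(coeffs) - 1
    return [(x, evaluate(coeffs, x)) for x in range(1, 2 * n + 1)]


def mul_linear(poly, root):
    """Multiply a polynomial by (x - root)."""
    out = [0] * (len(poly) + 1)
    for i, c in enumerate(poly):
        out[i] = (out[i] - c * root) % P
        out[i + 1] = (out[i + 1] + c) % P
    return out


def recover(points):
    """Lagrange-interpolate the coefficients from n+1 distinct points."""
    coeffs = [0] * len(points)
    for i, (xi, yi) in enumerate(points):
        basis, denom = [1], 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                basis = mul_linear(basis, xj)
                denom = denom * (xi - xj) % P
        scale = yi * pow(denom, -1, P) % P   # modular inverse, Python 3.8+
        for d, b in enumerate(basis):
            coeffs[d] = (coeffs[d] + b * scale) % P
    return coeffs


data = [7, 3, 5, 11]                 # n = 3, so four coefficients
shares = encode(data)                # 2n = 6 points handed out to nodes
print(recover(shares[2:]) == data)   # any 4 of the 6 points are enough
```

Losing up to n-1 of the 2n points therefore does not affect recoverability, which is the property the scheme relies on.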

3.3 Data Validation Method in DA

Data validation ensures that the data called from a node is accurate and complete. To minimize the amount of data and the computational cost required for validation, the DA layer now mainly uses tree structures for validation. The simplest form is the Merkle Tree, which records data in a complete binary tree. Verification only requires keeping the Merkle Root and the hashes of the sibling subtrees along the path of the node, and its time complexity is O(logN) (where logN defaults to log2(N)). Although this greatly simplifies validation, the amount of data needed for validation still grows as the data grows. To solve this problem, another validation method, the Verkle Tree, has been proposed at this stage. Each node in a Verkle Tree not only stores its value but also carries a Vector Commitment, which makes it possible to validate the authenticity of the data quickly using only the value of the original node and the commitment proof, without calling the values of its sibling nodes. The number of computations for each verification therefore depends only on the depth of the Verkle Tree, which is a fixed constant, and this greatly accelerates verification. However, computing a Vector Commitment requires the participation of all sibling nodes in the same layer, which greatly increases the cost of writing and changing data. For historical data, which is stored permanently, cannot be tampered with, and is only read and never written, the Verkle Tree is therefore extremely suitable. In addition, both the Merkle Tree and the Verkle Tree have K-ary variants, which work in a similar way and only change the number of subtrees under each node. The specific performance comparison can be seen in the following table.

Time performance comparison of data verification methods, picture source: Verkle Trees
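
The Merkle Tree check described above, keeping only the root and the sibling hashes along a leaf's path, can be sketched as follows. This is a minimal illustration that assumes SHA-256 and a leaf count that is a power of two; the function names are hypothetical, and the Verkle Tree's vector commitments are not modeled here.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Build all tree levels bottom-up; the leaf count is assumed to be a power of two."""
    level = [h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def make_proof(levels, index):
    """Collect the sibling hash at each level on the path from the leaf to the root."""
    proof = []
    for level in levels[:-1]:
        proof.append(level[index ^ 1])   # the sibling sits next to the node
        index //= 2
    return proof


def verify_proof(leaf, index, proof, root):
    """Recompute the root from the leaf and its sibling hashes, then compare."""
    node = h(leaf)
    for sibling in proof:
        node = h(node + sibling) if index % 2 == 0 else h(sibling + node)
        index //= 2
    return node == root


leaves = [f"tx{i}".encode() for i in range(8)]
levels = build_levels(leaves)
root = levels[-1][0]
proof = make_proof(levels, 5)
print(len(proof), verify_proof(b"tx5", 5, proof, root))
```

For 8 leaves the proof holds log2(8) = 3 hashes, which shows why the proof size still grows with the amount of data even though it grows slowly.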

3.4 Generic DA Middleware

The continuous expansion of the blockchain ecosystem has brought a continuous increase in the number of public chains. Because each public chain has its own advantages and is irreplaceable in its field, it is almost impossible for Layer 1 public chains to be unified in the short term. However, with the development of DeFi and the various problems with CEXs, users’ demand for decentralized cross-chain asset trading is also growing. Therefore, multi-chain data storage at the DA layer, which can eliminate security issues in cross-chain data interactions, has received more and more attention. To accept historical data from different public chains, however, the DA layer needs to provide a decentralized protocol for the standardized storage and verification of data streams. For example, KYVE, a storage middleware based on Arweave, actively grabs data from the chains and stores all of it on Arweave in a standard form to minimize differences in the data transmission process. By comparison, a Layer2 that provides DA layer data storage specifically for a certain public chain exchanges data through internal shared nodes. Although this reduces the cost of interaction and improves security, it is quite limited and can only serve specific public chains.

4. Storage Methods of DA

4.1 Main chain DA

4.1.1 DankSharding-like

This type of storage solution has no definite name yet. Its most prominent representative is DankSharding on Ethereum, so this article uses the term DankSharding-like to refer to it. This type of solution mainly uses the two DA storage technologies mentioned above, Sharding and DAS. First, the data is divided into an appropriate number of shares through Sharding, and then each node samples a data block through DAS for storage. If there are enough nodes in the entire network, a larger number of shards N can be chosen, so that the storage pressure on each node is only 1/N of the original, which expands the overall storage space N times. At the same time, to guard against the extreme case in which a certain block is not stored by any node, DankSharding encodes the data with an erasure code, so that the complete data can be restored from only half of the encoded data. The last step is data verification, which uses the Verkle tree structure and polynomial commitments to achieve fast verification.
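To make the "half of the data is enough" property concrete, here is a minimal Reed-Solomon-style sketch. It is not DankSharding's actual encoding: the small prime field `P`, the evaluation points 0..2k-1, and the function names are assumptions chosen to keep the example short. The idea it shows is the general one: k symbols define a polynomial of degree below k, the polynomial is evaluated at 2k points, and any k of those evaluations determine it again.

```python
P = 2**31 - 1  # prime modulus; the field choice is an assumption for illustration

def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial through `points`, working modulo P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def encode(data):
    """Extend k symbols to 2k: positions 0..k-1 hold the data, k..2k-1 hold parity."""
    k = len(data)
    points = list(enumerate(data))
    return data + [lagrange_eval(points, x) for x in range(k, 2 * k)]

def recover(shares, k):
    """Rebuild the k original symbols from any k surviving (position, value) shares."""
    if len(shares) < k:
        raise ValueError("need at least k shares")
    return [lagrange_eval(shares[:k], x) for x in range(k)]

data = [10, 20, 30, 40]
coded = encode(data)                                   # 8 symbols in total
surviving = [(i, coded[i]) for i in (1, 4, 6, 7)]      # any 4 of the 8
restored = recover(surviving, k=len(data))
```

With N shards spread across nodes, this kind of encoding means a block stays recoverable as long as any half of its encoded pieces can still be sampled.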

4.1.2 Temporary storage

For main chain DA, one of the simplest ways to handle data is to store historical data only for a short period. In essence, the blockchain plays the role of a public ledger: it lets the entire network witness changes to the ledger, and that does not require permanent storage. Taking Solana as an example, although its historical data is synchronized to Arweave, mainnet nodes only retain the transaction data of the past two days. On a public chain based on account records, the historical data at each moment contains the final state of every account on the blockchain, which is enough to serve as a verification basis for the changes at the next moment. Projects that need data from before this window can store it themselves on other decentralized public chains or with a trusted third party. In other words, those who need additional data must pay for historical data storage themselves.

4.2 Third-party DA

4.2.1 Main chain-specific DA: EthStorage

  1. Main chain-specific DA: The most important thing about the DA layer is the security of data transmission, and on this point the main chain’s own DA is the most secure. However, main chain storage is subject to storage space limits and competition for resources. Therefore, when the amount of network data grows rapidly, third-party DA is a better choice for long-term storage. If the third-party DA is highly compatible with the main network, it can share nodes with it and will also be more secure during data interaction. Therefore, when security is taken into account, main chain-specific DA has huge advantages. Taking Ethereum as an example, a basic requirement for main chain-specific DA is to be compatible with the EVM and to ensure interoperability with Ethereum data and contracts. Representative projects include Topia, EthStorage, etc. Among them, EthStorage is currently the most developed in terms of compatibility: in addition to EVM-level compatibility, it provides dedicated interfaces for Ethereum development tools such as Remix and Hardhat, achieving compatibility at the tooling level as well.
  2. EthStorage: EthStorage is a public chain independent of Ethereum, but the nodes running on it are a superset of Ethereum nodes. That is, a node running EthStorage can also run Ethereum at the same time, and EthStorage can be operated directly through opcodes on Ethereum. In the storage model of EthStorage, only a small amount of metadata is retained on the Ethereum mainnet for indexing, essentially creating a decentralized database for Ethereum. In the current solution, EthStorage implements the interaction between the Ethereum mainnet and EthStorage by deploying an EthStorage Contract on the Ethereum mainnet. To store data, Ethereum calls the put() function in the contract. Its input parameters are two bytes variables, key and data, where data is the data to be stored and key is its identifier in the Ethereum network, which can be regarded as similar to a CID in IPFS. After the (key, data) pair is successfully stored in the EthStorage network, EthStorage generates a kvIdx, returns it to the Ethereum mainnet, and associates it with the key on Ethereum. This value corresponds to the storage address of the data on EthStorage, so the original problem of storing a large amount of data becomes one of storing a single (key, kvIdx) pair, which greatly reduces the storage cost on the Ethereum mainnet. To read previously stored data, call the get() function in EthStorage with the key parameter; the data can then be located quickly on EthStorage through the kvIdx stored on Ethereum.

EthStorage contract, image source: Kernel Ventures
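The put()/get() flow can be modelled with two small Python classes. This is a toy model rather than EthStorage's contract code: `StorageNetwork`, `StorageContract`, the list-backed storage, and the `kv_idx` naming are hypothetical stand-ins, and only the names put() and get() come from the description above. It shows the key point that the mainnet side keeps a small (key, index) pair while the bulk data lives elsewhere.

```python
class StorageNetwork:
    """Stand-in for the off-mainnet storage network (hypothetical)."""

    def __init__(self):
        self._blobs = []

    def store(self, data: bytes) -> int:
        """Store the data and return the index at which it can be found."""
        self._blobs.append(data)
        return len(self._blobs) - 1

    def load(self, kv_idx: int) -> bytes:
        return self._blobs[kv_idx]


class StorageContract:
    """Stand-in for the mainnet contract: it keeps only (key, index) pairs."""

    def __init__(self, network: StorageNetwork):
        self._network = network
        self._index = {}  # key -> kv_idx, the only state held on the "mainnet" side

    def put(self, key: bytes, data: bytes) -> int:
        kv_idx = self._network.store(data)
        self._index[key] = kv_idx
        return kv_idx

    def get(self, key: bytes) -> bytes:
        return self._network.load(self._index[key])

    def mainnet_entries(self) -> int:
        return len(self._index)


contract = StorageContract(StorageNetwork())
contract.put(b"block-100", b"\x00" * 100_000)  # 100 kB of data stored off-mainnet
payload = contract.get(b"block-100")
```

However large `data` is, the contract's own state grows by one small entry per key, which is the source of the cost reduction described above.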

  1. In terms of how nodes actually store data, EthStorage draws on the Arweave model. First, the large number of (k, v) pairs from ETH are sharded. Each shard contains a fixed number of (k, v) pairs, and the size of each pair is also capped, which keeps the workload fair for miners in the subsequent storage reward process. Before rewards are issued, it is necessary to verify that a node really stores the data. For this, EthStorage divides a shard (TB-level in size) into many chunks and keeps a Verkle root on the Ethereum mainnet for verification. The miner then provides a nonce, which is combined with the hash of the previous block on EthStorage through a random algorithm to generate the addresses of several chunks, and must provide the data of these chunks to prove that it indeed stores the entire shard. The nonce cannot be chosen arbitrarily, however; otherwise a node could pick a convenient nonce that only maps to the chunks it happens to store and still pass verification. The nonce must therefore be one for which the hash obtained by mixing and hashing the selected chunks meets the network’s difficulty requirement, and only the first node to submit the nonce and the random access proof obtains the reward.
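The nonce-based random access proof can be sketched as follows. Everything concrete here is an assumption made for illustration: SHA-256 for both the index derivation and the mixing hash, four sampled chunks, an 8-bit difficulty, and the function names. The real protocol also checks each submitted chunk against the root kept on the mainnet, which this sketch leaves out.

```python
import hashlib

def chunk_indices(prev_block_hash: bytes, nonce: int, num_chunks: int, samples: int):
    """Derive pseudo-random chunk positions from the previous block hash and the nonce."""
    indices = []
    for i in range(samples):
        seed = hashlib.sha256(
            prev_block_hash + nonce.to_bytes(8, "big") + i.to_bytes(4, "big")
        ).digest()
        indices.append(int.from_bytes(seed, "big") % num_chunks)
    return indices

def mix_hash(prev_block_hash: bytes, nonce: int, chunks, indices) -> bytes:
    """Mix the selected chunk contents into one digest (chunks may be a list or a dict)."""
    hasher = hashlib.sha256(prev_block_hash + nonce.to_bytes(8, "big"))
    for idx in indices:
        hasher.update(chunks[idx])
    return hasher.digest()

def meets_difficulty(digest: bytes, difficulty_bits: int) -> bool:
    """True when the digest starts with `difficulty_bits` zero bits."""
    return int.from_bytes(digest, "big") >> (256 - difficulty_bits) == 0

def mine(prev_block_hash, chunks, samples=4, difficulty_bits=8, max_nonce=1_000_000):
    """Search for a nonce whose sampled chunks hash below the difficulty target."""
    for nonce in range(max_nonce):
        indices = chunk_indices(prev_block_hash, nonce, len(chunks), samples)
        if meets_difficulty(mix_hash(prev_block_hash, nonce, chunks, indices), difficulty_bits):
            return nonce, indices
    return None

def verify_storage_proof(prev_block_hash, nonce, provided, num_chunks,
                         samples=4, difficulty_bits=8) -> bool:
    """Re-derive the positions and check the submitted chunks (a dict index -> bytes)."""
    indices = chunk_indices(prev_block_hash, nonce, num_chunks, samples)
    if any(i not in provided for i in indices):
        return False
    return meets_difficulty(mix_hash(prev_block_hash, nonce, provided, indices), difficulty_bits)

shard = [bytes([i]) * 32 for i in range(64)]  # a toy shard of 64 chunks
prev_hash = b"\x00" * 32
found = mine(prev_hash, shard)
```

Because the positions depend on the nonce and the digest depends on the chunk contents, a miner holding only part of the shard must discard every nonce that touches a missing chunk, so it finds a valid nonce more slowly than a miner holding everything.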

4.2.2 Modularization DA: Celestia

  1. Blockchain module: At this stage, the work a Layer1 public chain has to perform falls mainly into the following four parts: (1) designing the underlying logic of the network, selecting validator nodes in a certain way, writing blocks, and allocating rewards to network maintainers; (2) packaging and processing transactions and publishing them; (3) verifying the transactions to be put on chain and determining the final state; (4) storing and maintaining historical data on the blockchain. According to these functions, we can divide the blockchain into four modules, namely the consensus layer, execution layer, settlement layer, and data availability layer (DA layer).
  2. Modular blockchain design: For a long time, these four modules have been integrated into a single public chain. Such a blockchain is called a monolithic blockchain. This form is more stable and easier to maintain, but it also puts huge pressure on the single public chain. In actual operation, the four modules constrain each other and compete for the chain’s limited computing and storage resources. For example, increasing the processing speed of the execution layer puts greater storage pressure on the data availability layer, while ensuring the security of the execution layer requires a more complex verification mechanism, which slows down transaction processing. The development of a public chain therefore often involves trade-offs among these four modules. To break through this performance bottleneck, developers have proposed modular blockchains. The core idea of a modular blockchain is to separate one or more of the four modules and implement them on a separate public chain. Each public chain can then focus solely on improving transaction speed or storage capacity, removing the earlier situation in which the weakest module limited the overall performance of the blockchain.
  3. Modular DA: Separating the DA layer from the rest of the blockchain’s business and handing it over to a dedicated public chain is considered a feasible solution to the growing historical data of Layer 1. Exploration in this area is still at an early stage, and the most representative project at present is Celestia. For storage, Celestia draws on Danksharding: it also divides the data into multiple blocks, has each node sample a part for storage, and uses KZG polynomial commitments to verify the integrity of the data. In addition, Celestia uses an advanced two-dimensional RS erasure code that rewrites the original data in the form of a k × k matrix, so that the original data can be recovered from only 25% of the encoded data. However, sharded storage essentially just multiplies the total data volume by a coefficient to obtain the storage pressure on each node; that pressure still grows linearly with the data volume. As Layer 1 continues to improve its transaction speed, the storage pressure on nodes may one day still reach an unacceptable level. To solve this problem, Celestia introduces the IPLD component. The data in the k × k matrix is not stored directly on Celestia but in the LL-IPFS network, and the node retains only the CID of the data on IPFS. When a user requests a piece of historical data, the node sends the corresponding CID to the IPLD component, which uses it to fetch the original data from IPFS. If the data exists on IPFS, it is returned via the IPLD component and the node; if it does not exist, the data cannot be returned.

Celestia data reading method, image source: Celestia Core
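The read path described above, where a node keeps only a content identifier and resolves it against a content-addressed store, can be sketched like this. The classes `ContentStore` and `DANode` are hypothetical, and the toy identifier is simply a hex SHA-256 digest; real IPFS CIDs use a multihash encoding, and this is not Celestia's code.

```python
import hashlib
from typing import Optional

def cid_of(data: bytes) -> str:
    """Toy content identifier: the hex SHA-256 of the data (an assumption)."""
    return hashlib.sha256(data).hexdigest()

class ContentStore:
    """Stand-in for a content-addressed network such as IPFS."""

    def __init__(self):
        self._blocks = {}

    def add(self, data: bytes) -> str:
        cid = cid_of(data)
        self._blocks[cid] = data
        return cid

    def get(self, cid: str) -> Optional[bytes]:
        return self._blocks.get(cid)

    def drop(self, cid: str) -> None:
        self._blocks.pop(cid, None)

class DANode:
    """A node that keeps only identifiers and resolves them on demand."""

    def __init__(self, store: ContentStore):
        self._store = store
        self._cids = {}  # block height -> content identifier

    def publish(self, height: int, data: bytes) -> str:
        cid = self._store.add(data)
        self._cids[height] = cid
        return cid

    def read(self, height: int) -> Optional[bytes]:
        cid = self._cids.get(height)
        if cid is None:
            return None
        data = self._store.get(cid)
        # Content addressing lets the node check what it received against the identifier.
        if data is None or cid_of(data) != cid:
            return None
        return data

store = ContentStore()
node = DANode(store)
cid = node.publish(1, b"rollup batch #1")
```

The node's own storage grows by one identifier per block instead of by the block contents, and a read fails cleanly with `None` when the underlying store no longer holds the data.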

  1. Celestia: Taking Celestia as an example, we can get a glimpse of how modular blockchains can be applied to Ethereum’s storage problem. The Rollup node sends the packaged and verified transaction data to Celestia, which stores it. During this process, Celestia only stores the data and does not interpret it. Finally, the Rollup node pays Celestia the corresponding TIA tokens as a storage fee according to the amount of storage space used. Storage in Celestia uses DAS and erasure codes similar to those in EIP-4844, but it upgrades the polynomial erasure code of EIP-4844 to a two-dimensional RS erasure code, which raises storage security again: only 25% of the fragments are needed to restore the entire transaction data. Celestia is essentially just a PoS public chain with low storage costs. To use it to solve Ethereum’s historical data storage problem, many other specific modules need to cooperate with it. For example, in terms of rollups, a mode highly recommended on the Celestia official website is the Sovereign Rollup. A common Rollup on Layer 2 only computes and verifies transactions, that is, it only completes the execution-layer operations. A Sovereign Rollup, by contrast, includes the entire execution and settlement process, which minimizes the processing of transactions on Celestia. When Celestia’s overall security is weaker than Ethereum’s, this measure can maximize the security of the overall transaction process. To secure the data that the Ethereum mainnet calls from Celestia, the most mainstream solution at the moment is the Quantum Gravity Bridge smart contract. For the data stored on Celestia, a Verkle Root (a proof of data availability) is generated and kept in the Quantum Gravity Bridge contract on the Ethereum mainnet. Every time Ethereum calls historical data on Celestia, the hash of that data is compared with the Verkle Root; if they match, the data is indeed genuine historical data.

4.2.3 Storage Chain DA

In terms of technical principles, main chain DA borrows many Sharding-like techniques from storage public chains, and some third-party DAs directly use a storage public chain for part of their storage tasks. For example, the actual transaction data in Celestia is placed on the LL-IPFS network. Among third-party DA solutions, besides building a separate public chain to solve Layer1’s storage problem, a more direct approach is to connect a storage public chain to Layer1 and store Layer1’s huge historical data there. For high-performance blockchains, the volume of historical data is even larger. When running at full speed, the data volume of the high-performance public chain Solana is close to 4 PB, which is completely beyond the storage capacity of ordinary nodes. The solution Solana chose is to store historical data on the decentralized storage network Arweave and to retain only 2 days of data on mainnet nodes for verification. To secure the storage process, Solana and Arweave designed a dedicated storage bridge protocol, Solar Bridge. Data verified by Solana nodes is synchronized to Arweave, and a corresponding tag is returned. With this tag alone, a Solana node can view the historical data of the Solana blockchain at any time. Arweave does not require all network nodes to keep consistent copies of the data as a threshold for participating in the network. Instead, it adopts reward-based storage. First, Arweave does not build blocks in a traditional chain structure; its structure is closer to a graph. In Arweave, a new block points not only to the previous block but also to a randomly selected, previously generated block, the Recall Block. The location of the Recall Block is determined by the hash of the previous block and the block height, so it is unknown until the previous block has been mined. However, to generate a new block, a node needs the Recall Block’s data in order to compute a hash of the specified difficulty under the PoW mechanism. Only the first miner to compute a hash that meets the difficulty gets the reward, which encourages miners to store as much historical data as possible. At the same time, the fewer nodes store a given historical block, the fewer competitors a node has when generating a nonce that meets the difficulty, which encourages miners to store the blocks that have fewer backups in the network. Finally, to ensure that nodes store data permanently, Arweave introduces the Wildfire node scoring mechanism. Nodes tend to communicate with nodes that can provide more historical data faster, while lower-rated nodes often cannot obtain the latest block and transaction data promptly and thus lose their edge in the PoW competition…

Arweave block construction method, image source: Arweave Yellow-Paper
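The Recall Block incentive can be illustrated with a short simulation. The derivation used here (SHA-256 of the previous block hash plus the new block height, reduced modulo the height) is an assumed stand-in for Arweave's actual rule, and the function names are hypothetical. The sketch only shows why a miner that stores more history can compete in more rounds.

```python
import hashlib

def recall_height(prev_block_hash: bytes, new_height: int) -> int:
    """Pick an earlier block (0 .. new_height - 1) from the previous hash and the height."""
    if new_height < 1:
        raise ValueError("the genesis block has no recall block")
    digest = hashlib.sha256(prev_block_hash + new_height.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") % new_height

def can_mine(stored_heights, prev_block_hash: bytes, new_height: int) -> bool:
    """A miner can attempt the PoW only if it holds the recall block's data."""
    return recall_height(prev_block_hash, new_height) in stored_heights

def eligible_rounds(stored_heights, new_height: int, rounds: int) -> int:
    """Count in how many simulated rounds the miner holds the required recall block."""
    count = 0
    for r in range(rounds):
        prev_hash = hashlib.sha256(r.to_bytes(8, "big")).digest()  # simulated previous hash
        if can_mine(stored_heights, prev_hash, new_height):
            count += 1
    return count

full_archive = set(range(100))           # stores every historical block
half_archive = set(range(0, 100, 2))     # stores only every second block
full_rounds = eligible_rounds(full_archive, new_height=100, rounds=200)
half_rounds = eligible_rounds(half_archive, new_height=100, rounds=200)
```

Since the recall position cannot be known before the previous block is mined, the only way to be eligible in every round is to hold every historical block, and a rarely stored block gives its few holders less competition in the rounds that select it.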

5. Synthesized Comparison

Next, we will compare the advantages and disadvantages of the five storage solutions based on the four dimensions of DA performance indicators.

  1. Security: The biggest sources of data security problems are loss during data transmission and malicious tampering by dishonest nodes. In cross-chain scenarios, because the two public chains are independent and do not share state, data transmission is one of the most vulnerable points. In addition, a Layer 1 that currently needs a dedicated DA layer often has a strong consensus group, so its security is much higher than that of an ordinary storage public chain. The main chain DA solutions therefore have higher security. Once data transmission is secure, the next step is to secure the data being called. Considering only the short-term historical data used to verify transactions, in a temporary storage network the same data is backed up by the entire network, whereas in a DankSharding-like solution the average number of backups is only 1/N of the number of nodes in the entire network. More redundancy makes data less likely to be lost and provides more reference samples during verification, so temporary storage has relatively higher data security. Among third-party DA solutions, main chain-specific DA shares nodes with the main chain, and data can be transmitted directly through these relay nodes during the cross-chain process, so it is relatively more secure than the other DA solutions.
  2. Storage costs: The biggest factor affecting storage costs is the amount of data redundancy. In the short-term storage solution of main chain DA, data is stored by synchronizing it across all network nodes. Any newly stored data must be backed up on every node, which gives the highest storage cost. This high cost in turn means that the method is only suitable for temporary storage in high-TPS networks. Next comes Sharding-based storage, including Sharding on the main chain and Sharding in third-party DA. Since the main chain usually has more nodes, each block also has more backups, so main chain Sharding is the more expensive of the two. The lowest storage cost belongs to storage public chain DA, which adopts reward-based storage. Under this scheme, the amount of data redundancy tends to fluctuate around a fixed constant. Storage public chain DA also introduces a dynamic adjustment mechanism that raises rewards to attract nodes to store data with fewer backups, which safeguards data security.
  3. Data reading speed: Data reading speed is mainly affected by where the data is located in the storage space, the data index path, and how the data is distributed across nodes. Of these, the storage location on the node has the greatest impact, because reading data from memory and reading it from SSD can differ in speed by dozens of times. Storage public chain DA mostly uses SSD storage, because the load on such a chain includes not only DA-layer data but also space-hungry personal data such as videos and pictures uploaded by users. If the network did not use SSDs as storage space, it could hardly bear the huge storage pressure or meet long-term storage needs. Next, between third-party DA and main chain DA that both keep data in memory, the third-party DA first has to look up the corresponding index data on the main chain, then transfer that index data across chains to the third-party DA, and finally return the data through the storage bridge. In contrast, main chain DA can query data directly from nodes and therefore retrieves it faster. Finally, within main chain DA, the Sharding approach has to fetch blocks from multiple nodes and restore the original data, so it is slower than short-term storage, which does not shard data.
  4. DA layer universality: The universality of main chain DA is close to zero, because it is impossible to move data from one public chain with insufficient storage space to another public chain with insufficient storage space. In third-party DA, the versatility of a solution and its compatibility with a specific main chain are conflicting goals. For example, a main chain-specific DA solution designed for a certain main chain makes many adaptations to that chain at the node type and network consensus level, and these adaptations become a huge hindrance when communicating with other public chains. Within third-party DA, storage public chain DA is more versatile than modular DA. It has a larger developer community and more supporting facilities, which let it adapt to the conditions of different public chains. It also acquires data more actively, by capturing it from the chains itself, rather than passively receiving information transmitted from other public chains. It can therefore encode data in its own way, store data streams in a standardized form, manage data from different main chains more easily, and improve storage efficiency.

Storage solution performance comparison, image source: Kernel Ventures
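The redundancy argument behind the security and storage cost comparison reduces to simple arithmetic, sketched below. The node count, shard count, and replica target are made-up numbers for illustration, and the three scheme labels are shorthand introduced here for temporary full replication, Sharding, and reward-based storage on a storage public chain.

```python
def replicas_per_block(scheme: str, nodes: int, shards: int = 1, target_replicas: int = 1) -> int:
    """Average number of copies of one block kept by the network under each scheme."""
    if scheme == "temporary":        # every node keeps every block
        return nodes
    if scheme == "sharded":          # each node keeps 1/N of the data
        return max(1, nodes // shards)
    if scheme == "storage_chain":    # rewards hold redundancy near a constant
        return min(nodes, target_replicas)
    raise ValueError(f"unknown scheme: {scheme}")

def network_storage_bytes(block_bytes: int, replicas: int) -> int:
    """Total bytes the whole network spends on one block."""
    return block_bytes * replicas

NODES = 10_000  # hypothetical network size
copies = {
    "temporary": replicas_per_block("temporary", NODES),
    "sharded": replicas_per_block("sharded", NODES, shards=100),
    "storage_chain": replicas_per_block("storage_chain", NODES, target_replicas=20),
}
cost_per_megabyte = {name: network_storage_bytes(1_000_000, c) for name, c in copies.items()}
```

Under these assumed numbers the ordering matches the discussion above: full replication gives the most copies and the highest cost, Sharding divides both by N, and reward-based storage keeps the copy count near a constant regardless of network size.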

6. Summary

The blockchain is currently undergoing a transformation from Crypto to the more inclusive Web3, a process that brings a wealth of projects onto the blockchain. To accommodate the simultaneous operation of so many projects on Layer1 while preserving the experience of GameFi and SocialFi projects, Layer1s represented by Ethereum have adopted methods such as Rollups and Blobs to improve TPS. Among new blockchains, the number of high-performance chains is also growing. But higher TPS means not only higher performance but also greater storage pressure on the network. For massive historical data, various DA methods based on the main chain and on third parties have been proposed to cope with the growing on-chain storage pressure. Each method has advantages and disadvantages and suits different situations.

Blockchains that focus on payment have extremely high requirements for the security of historical data and do not pursue particularly high TPS. If such a public chain is still in the preparation stage, it can adopt a DankSharding-like storage method, which achieves a huge increase in storage capacity while ensuring security. However, for a public chain like Bitcoin that has already taken shape and has a large number of nodes, rash changes at the consensus layer carry huge risks. In that case, main chain-specific DA, which offers higher security among off-chain storage options, can be used to balance security and storage… However, it is worth noting that the functions of a blockchain are not static but constantly changing. For example, the early functions of Ethereum were mainly limited to payments and simple automated processing of assets and transactions through smart contracts. As the blockchain landscape expanded, various SocialFi and DeFi projects were gradually added to Ethereum, pushing it in a more comprehensive direction. Recently, with the explosion of the inscription ecosystem on Bitcoin, transaction fees on the Bitcoin network have surged nearly 20 times since August. This shows that the transaction speed of the Bitcoin network at this stage cannot meet transaction demand, and traders can only raise fees to get their transactions processed as quickly as possible. The Bitcoin community now faces a trade-off: accept high fees and slow transactions, or reduce network security to increase transaction speed at the expense of the payment system’s original intent. If the Bitcoin community chooses the latter, the corresponding storage solution will also need to be adjusted in the face of increasing data pressure.

Bitcoin mainnet transaction fees fluctuate, image source: OKLINK

Public chains with comprehensive functions pursue higher TPS, and their historical data grows even faster. A DankSharding-like solution can hardly keep up with rapid TPS growth in the long run, so a more appropriate approach is to migrate the data to a third-party DA. Among these, main chain-specific DA has the highest compatibility and may have the edge if only the storage issues of a single public chain are considered. But today, with Layer 1 public chains flourishing, cross-chain asset transfer and data interaction have become a common pursuit of the blockchain community. Taking the long-term development of the entire blockchain ecosystem into account, storing the historical data of different public chains on the same public chain can eliminate many security issues in data exchange and verification. Modular DA or storage public chain DA might therefore be a better choice. With comparable versatility, modular DA focuses on providing DA-layer services for blockchains and introduces more refined index data to manage historical data, which lets it classify the data of different public chains sensibly and gives it an advantage over storage public chain DA. However, the above solutions do not take into account the cost of adjusting the consensus layer of an existing public chain. That process is extremely risky: once problems occur, they may lead to systemic vulnerabilities and cause the public chain to lose community consensus. Therefore, as a transitional solution during blockchain scaling, the simplest option, temporary storage on the main chain, may be more suitable. Finally, the above discussion is based on performance in actual operation. However, if the goal of a public chain is to develop its ecosystem and attract more projects and participants, it may also prefer projects that are supported and funded by its own foundation… For example, even when their overall performance is equal to or slightly lower than that of storage public chain solutions, the Ethereum community will still lean toward Layer 2 projects supported by the Ethereum Foundation, such as EthStorage, in order to keep developing the Ethereum ecosystem.

All in all, the functions of today’s blockchains are becoming more and more complex, which also brings greater storage requirements. When there are enough Layer1 validator nodes, historical data does not need to be backed up by every node in the network; relative security can be guaranteed as long as the number of backups reaches a certain value. At the same time, the division of labor among public chains has become more and more fine-grained: Layer 1 is responsible for consensus and execution, Rollups are responsible for computation and verification, and a separate blockchain is used for data storage. Each part can focus on one function without being limited by the performance of the others. However, how much storage, or what proportion of nodes storing historical data, strikes a balance between security and efficiency, and how to ensure secure interoperability between different blockchains, are questions that blockchain developers need to keep thinking about and improving on. Investors should still pay attention to main chain-specific DA projects on Ethereum, because Ethereum already has enough supporters at this stage and does not need to rely on other communities to expand its influence. What it needs more is to improve and develop its own community and attract more projects to the Ethereum ecosystem. However, for public chains in a catch-up position, such as Solana and Aptos, the single chain itself does not have such a complete ecosystem, so it may be more inclined to join forces with other communities and build a large cross-chain ecosystem to expand its influence. For emerging Layer1s, therefore, general-purpose third-party DA deserves more attention.


Kernel Ventures is a crypto venture capital fund driven by a research and development community, with over 70 early-stage investments focused on infrastructure, middleware, and dApps, especially ZK, Rollup, DEX, and modular blockchains, as well as verticals that will onboard billions of crypto users in the future, such as account abstraction, data availability, and scalability. For the past seven years, we have been committed to supporting the growth of core development communities and university blockchain associations around the world.

Disclaimer:

  1. This article is reprinted from [mirror]. All copyrights belong to the original author [Kernel Ventures Jerry Luo]. If there are objections to this reprint, please contact the Gate Learn team, and they will handle it promptly.
  2. Liability Disclaimer: The views and opinions expressed in this article are solely those of the author and do not constitute any investment advice.
  3. Translations of the article into other languages are done by the Gate Learn team. Unless mentioned, copying, distributing, or plagiarizing the translated articles is prohibited.