An Overview of the Web3 Data Indexing Track: Reading, Indexing, and Analyzing On-Chain Data

This article traces the evolution of data accessibility in blockchain, compares The Graph, Chainbase, and Space and Time in terms of architecture and their application of AI technology, and argues that blockchain data services are moving toward greater intelligence and security and will continue to serve as essential industry infrastructure.

1. Introduction

From the first wave of dApps in 2017, such as Etheroll, ETHLend, and CryptoKitties, to today's flourishing financial, gaming, and social dApps built on different blockchains, have you ever wondered, when we talk about decentralized on-chain applications, where all the data these dApps consume in their interactions actually comes from?

In 2024, AI and Web3 are in the spotlight. In the world of artificial intelligence, data is the lifeblood of growth and evolution. Just as plants rely on sunlight and water to thrive, AI systems rely on massive amounts of data to continuously "learn" and "think." Without data, no matter how sophisticated an AI algorithm is, it remains a castle in the air, unable to deliver its intended intelligence and efficiency.

From the perspective of blockchain data accessibility, this article analyzes how blockchain data indexing has evolved over the course of the industry's development and compares the established data indexing protocol The Graph with the emerging blockchain data service protocols Chainbase and Space and Time. In particular, it explores the similarities and differences in data services and product architecture between these two newer protocols, both of which incorporate AI technology.

2. Data Indexing from Complex to Simple: From Blockchain Nodes to Full-Chain Databases

2.1 Data Source: Blockchain Nodes

From the moment we first learn what a blockchain is, we encounter a sentence like this: a blockchain is a decentralized ledger. Blockchain nodes are the foundation of the entire network, responsible for recording, storing, and propagating all on-chain transaction data. Each node holds a complete copy of the blockchain data, ensuring the network's decentralized character is preserved. For ordinary users, however, building and maintaining a node is not easy. It demands specialized technical skills and carries high hardware and bandwidth costs, and an ordinary node's query capabilities are limited, unable to return data in the formats developers need. So although in theory everyone can run their own node, in practice users usually prefer to rely on third-party services.

To address this issue, RPC (remote procedure call) node providers have emerged. They bear the cost and management of the nodes and expose the data through RPC endpoints, letting users access blockchain data without running nodes of their own. Public RPC endpoints are free but rate-limited, which can degrade the user experience of dApps. Private RPC endpoints offer better performance by reducing congestion, but even simple data retrieval requires substantial back-and-forth communication, making them request-heavy and inefficient for complex queries. In addition, private RPC endpoints are often hard to scale and lack compatibility across networks. Still, the standardized API interfaces of node providers lower the barrier to accessing on-chain data, laying the groundwork for the data parsing and applications that follow.
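To make the round-trip problem concrete, here is a minimal TypeScript sketch of raw JSON-RPC access, assuming a placeholder endpoint URL: even the simple task of collecting the latest block's transaction receipts costs 2 + N requests.

```typescript
// Minimal sketch of raw JSON-RPC access. The endpoint URL is a placeholder.
const RPC_URL = "https://rpc.example.com";

async function rpc(method: string, params: unknown[] = []): Promise<any> {
  const res = await fetch(RPC_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ jsonrpc: "2.0", id: 1, method, params }),
  });
  return (await res.json()).result;
}

async function latestBlockReceipts() {
  const blockNumber: string = await rpc("eth_blockNumber");              // round trip 1
  const block = await rpc("eth_getBlockByNumber", [blockNumber, false]); // round trip 2
  // ...and one more round trip per transaction just to read its receipt:
  return Promise.all(
    (block.transactions as string[]).map((hash) =>
      rpc("eth_getTransactionReceipt", [hash])
    )
  );
}
```

Anything more analytical, such as "all transfers of this token last month," multiplies these round trips further, which is exactly the inefficiency described above.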

2.2 Data Parsing: From Raw Data to Usable Data

The data obtained from blockchain nodes is raw data that has undergone encoding and encryption. While this preserves the blockchain's integrity and security, its complexity also makes the data harder to parse. For ordinary users or developers, processing this raw data directly requires considerable technical knowledge and computing resources.

The process of data parsing is therefore essential. By parsing complex raw data into formats that are easier to understand and work with, users can comprehend and make use of the data far more intuitively. How well parsing is done directly determines the efficiency and effectiveness of blockchain data applications, making it a key step in the entire data indexing pipeline.
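As an illustration of what parsing involves, the sketch below decodes a raw ERC-20 Transfer event log into readable fields by hand. The log shape is the standard Ethereum log format; no real transaction data is used.

```typescript
// Sketch: turning a raw ERC-20 Transfer log into readable fields.
interface RawLog {
  topics: string[]; // indexed event parameters, 32 bytes each
  data: string;     // ABI-encoded non-indexed parameters
}

// topic0 is keccak256("Transfer(address,address,uint256)") for ERC-20 transfers.
const TRANSFER_TOPIC =
  "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef";

function decodeTransfer(log: RawLog) {
  if (log.topics[0] !== TRANSFER_TOPIC) throw new Error("not a Transfer event");
  return {
    from: "0x" + log.topics[1].slice(26), // address is right-aligned in the 32-byte topic
    to: "0x" + log.topics[2].slice(26),
    value: BigInt(log.data),              // uint256 amount decoded from hex
  };
}
```

Multiply this by thousands of event signatures across thousands of contracts and the need for dedicated parsing infrastructure becomes obvious.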

2.3 The Evolution of Data Indexers

As blockchain data volume grows, so does the need for data indexers. Indexers play a crucial role in organizing on-chain data and delivering it to databases for querying. They work by indexing blockchain data and making it continuously available through SQL-like query languages and APIs such as GraphQL. By offering a unified interface for data queries, indexers let developers retrieve the information they need quickly and accurately using standardized query languages, greatly simplifying the process.

Different types of indexers optimize data retrieval in different ways:

· Full-node indexers: These run full blockchain nodes and extract data directly from them, guaranteeing completeness and accuracy but demanding substantial storage and processing power.

· Lightweight indexers: These rely on full nodes to fetch specific data on demand, reducing storage requirements at the cost of potentially longer query times.

· Dedicated indexers: These specialize in certain data types or specific blockchains, optimizing retrieval for particular use cases such as NFT data or DeFi transactions.

· Aggregators: These pull data from multiple blockchains and sources, including off-chain information, and offer a unified query interface, which is particularly useful for multi-chain dApps.

Currently, an Ethereum archive node run with the Geth client occupies about 13.5 TB of storage in archive mode, while the same node under the Erigon client needs about 3 TB. As the blockchain keeps growing, the storage demands of archive nodes will rise with it. Facing data at this scale, mainstream indexers not only support multi-chain indexing but also tailor data parsing frameworks to the needs of different applications; The Graph's "subgraph" framework is a typical example.

The emergence of indexers greatly improves the efficiency of indexing and querying data. Compared with traditional RPC endpoints, indexers can index large volumes of data efficiently and support high-speed queries. They allow users to run complex queries and to filter and analyze data easily after extraction. Some indexers also aggregate data sources from multiple blockchains, sparing multi-chain dApps from having to deploy multiple APIs. And by running distributed across many nodes, indexers offer stronger security and performance while reducing the risk of interruptions and downtime that centralized RPC providers can bring.

In short, an indexer lets users obtain the information they need directly through a predefined query language, without wrestling with the complex underlying data. This mechanism significantly improves the efficiency and reliability of data retrieval and is an important innovation in blockchain data access.
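For contrast with the RPC sketch in section 2.1, here is what retrieval looks like against an indexer: one declarative query, already filtered, sorted, and decoded server-side. The endpoint and entity names are illustrative, since every subgraph defines its own schema.

```typescript
// Sketch: one GraphQL query replaces many RPC round trips.
// Endpoint and entity names are illustrative placeholders.
const SUBGRAPH_URL = "https://api.example.com/subgraphs/name/some-dex";

const query = `{
  swaps(first: 5, orderBy: timestamp, orderDirection: desc) {
    id
    amountUSD
    timestamp
  }
}`;

async function latestSwaps() {
  const res = await fetch(SUBGRAPH_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query }),
  });
  const { data } = await res.json();
  return data.swaps; // already filtered, sorted, and decoded by the indexer
}
```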

2.4 Full-Chain Databases: Aligning with the Stream-First Approach

Querying data through indexer nodes often means the API becomes the sole gateway for digesting on-chain data. But when a project enters its scaling phase, it frequently needs more flexible data sources than a standardized API can provide. As application requirements grow more complex, first-generation data indexers and their standardized index formats struggle to cover an ever-wider variety of query needs, such as search, cross-chain access, or mapping to off-chain data.

In modern data pipeline architectures, the "stream-first" approach has emerged as the answer to the limits of traditional batch processing, enabling real-time data ingestion, processing, and analysis. This paradigm shift lets organizations react to incoming data immediately, gaining insights and making decisions almost instantly. Blockchain data service providers are evolving in the same direction, toward building blockchain data streams: traditional indexing providers have successively launched products that deliver real-time blockchain data as streams, such as The Graph's Substreams and Goldsky's Mirror, alongside real-time data lakes built from blockchain data streams, such as Chainbase and SubSquid.

These services aim to meet the need for real-time parsing of blockchain transactions and to provide more comprehensive query capabilities. Just as the stream-first architecture improves data processing and consumption in traditional pipelines by reducing latency and enhancing responsiveness, these blockchain data stream providers hope to support the development of more applications and power on-chain data analysis through more advanced, mature data sources.

By reframing the challenges of on-chain data through the lens of modern data pipelines, we can see the management, storage, and delivery of on-chain data in a new light. Once we treat subgraphs and Ethereum ETL-style indexers as data streams within a pipeline rather than as final outputs, we can envision a world where high-performance datasets are tailored to any business case.
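The sketch below illustrates the stream-first idea in TypeScript: blocks are consumed as an async stream rather than as batch queries. Polling a JSON-RPC helper (like the one sketched in section 2.1) stands in here for a real streaming transport, such as the gRPC streams used by Substreams-style services.

```typescript
// Sketch of "stream-first" consumption: downstream code reacts to each block
// the moment it arrives. Polling is a stand-in for a true streaming transport.
async function* blockStream(
  rpc: (method: string, params?: unknown[]) => Promise<any>
) {
  let last = Number(await rpc("eth_blockNumber"));
  while (true) {
    const head = Number(await rpc("eth_blockNumber"));
    for (let n = last + 1; n <= head; n++) {
      // Yield each new block (with full transactions) as it is produced.
      yield await rpc("eth_getBlockByNumber", ["0x" + n.toString(16), true]);
    }
    last = head;
    await new Promise((r) => setTimeout(r, 3_000)); // wait for the next block
  }
}

// Usage: for await (const block of blockStream(rpc)) { handle(block); }
```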

3. AI + Database? An In-Depth Comparison of The Graph, Chainbase, and Space and Time

3.1 The Graph

The Graph network provides multi-chain data indexing and query services through a decentralized network of nodes, making it easy for developers to index blockchain data and build decentralized applications. Its main product models are the data query execution market and the data indexing cache market, both of which ultimately serve users' query needs. In the data query execution market, consumers pay suitable indexer nodes for the data they need; in the data indexing cache market, indexer nodes allocate resources according to a subgraph's historical indexing popularity, the query fees it collects, and on-chain curators' demand for its output.

Subgraphs are the basic data structures of The Graph network. They define how to extract data from the blockchain and transform it into queryable formats (such as GraphQL schemas). Anyone can create a subgraph, and multiple applications can reuse the same one, which improves data reusability and efficiency.
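As a rough illustration, a subgraph pairs a GraphQL schema with event-handler mappings written in AssemblyScript (a TypeScript subset). The contract, entity, and import names below are illustrative; in a real project the Graph CLI generates these bindings from the contract ABI and the schema.

```typescript
// schema.graphql (illustrative):
//   type TransferEvent @entity {
//     id: ID!
//     from: Bytes!
//     to: Bytes!
//     value: BigInt!
//   }

// mapping.ts — an AssemblyScript handler the indexer runs for each matching event.
// `Transfer` and `TransferEvent` are generated from the ABI and schema above.
import { Transfer } from "../generated/Token/Token";
import { TransferEvent } from "../generated/schema";

export function handleTransfer(event: Transfer): void {
  // One entity per log, keyed by transaction hash + log index.
  let id = event.transaction.hash.toHex() + "-" + event.logIndex.toString();
  let entity = new TransferEvent(id);
  entity.from = event.params.from;
  entity.to = event.params.to;
  entity.value = event.params.value;
  entity.save(); // persisted into the queryable store
}
```

Once deployed, any application can query the resulting `TransferEvent` entities over GraphQL, which is what makes subgraphs reusable across dApps.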

The Graph network comprises four key roles: indexers, curators, delegators, and developers, who together provide data support for web3 applications. Their responsibilities are as follows:

· Indexers: Indexers are node operators in The Graph network. They stake GRT (The Graph's native token) to participate in the network and provide indexing and query processing services.

· Delegators: Delegators stake GRT tokens to indexer nodes to support their operation, earning a share of the rewards generated by the indexers they delegate to.

· Curators: Curators signal which subgraphs are worth indexing, helping ensure that valuable subgraphs are processed first.

· Developers: Unlike the three supply-side roles above, developers are the demand side and the main users of The Graph. They create and submit subgraphs to the network, which then fulfills their data needs.

The Graph has now transitioned to a fully decentralized subgraph hosting service, with economic incentives circulating among the different participants to keep the system running smoothly:

· Indexer rewards: Indexers earn revenue from the query fees paid by consumers plus a share of GRT block rewards.

· Delegator rewards: Delegators receive a portion of the rewards earned by the indexer nodes they support.

· Curator rewards: Curators who signal valuable subgraphs receive a share of the query fees as a reward.

The Graph's products are also evolving rapidly in the AI wave. Semiotic Labs, one of the core development teams in The Graph ecosystem, has been dedicated to applying AI to indexing pricing and the query experience. The AutoAgora, Allocation Optimizer, and AgentC tools it has developed have improved the ecosystem's performance in several respects.

· AutoAgora introduces a dynamic pricing mechanism that adjusts prices in real time based on query volume and resource usage, optimizing pricing strategy to keep indexers competitive and maximize their revenue (a toy sketch of the idea follows this list).

· Allocation Optimizer tackles the complex problem of allocating resources across subgraphs, helping indexers achieve the configuration that best balances revenue and performance.

· AgentC is an experimental tool that lets users query The Graph's blockchain data in natural language, improving the user experience.
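As a toy illustration of demand-driven pricing (this is not AutoAgora's actual model, just the general idea of adjusting price from observed demand), an indexer might nudge its price per query according to how fully its capacity is used:

```typescript
// Toy sketch: raise the price when utilization is high, lower it when
// capacity sits idle. AutoAgora's real model is more sophisticated.
function updatePrice(
  currentPrice: number,  // GRT per query
  queriesServed: number, // queries handled in the last window
  capacity: number,      // queries the indexer could have served
  step = 0.05            // 5% adjustment per window
): number {
  const utilization = queriesServed / capacity;
  if (utilization > 0.9) return currentPrice * (1 + step); // high demand
  if (utilization < 0.5) return currentPrice * (1 - step); // idle capacity
  return currentPrice;
}
```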

With these tools, The Graph further enhances the intelligence and user-friendliness of its system with the help of AI.

3.2 Chainbase

Chainbase is a full-chain data network that integrates all blockchain data onto one platform, making it easier for developers to build and maintain applications. Its distinctive features include:

· Real-time data lake: Chainbase provides a real-time data lake purpose-built for blockchain data streams, so data becomes accessible the moment it is generated.

· Dual-chain architecture: Chainbase has built an execution layer on EigenLayer AVS, forming a parallel dual-chain architecture with CometBFT consensus. This design strengthens the programmability and composability of cross-chain data, supports high throughput, low latency, and finality, and improves network security through a dual-staking model.

· Innovative data format standard: Chainbase has introduced a data format standard called "manuscripts" that optimizes how data in the crypto industry is structured and utilized.

· Crypto world model: Drawing on its vast blockchain data resources, Chainbase combines AI model technology to build a model that can effectively understand and predict blockchain transactions and interact with them. A base version of the model, Theia, has been released for public use.

These features set Chainbase apart among blockchain indexing protocols, with a particular focus on real-time data accessibility, innovative data formats, and smarter models that combine on-chain and off-chain data for deeper insights.
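As a sketch of what a data-lake-style service enables, the snippet below posts plain SQL to a hosted query endpoint. The URL, header, table, and column names are placeholders, not Chainbase's documented API; consult the official docs for the real interface.

```typescript
// Sketch of querying a hosted blockchain data lake with plain SQL.
// Endpoint, header, and schema names below are illustrative placeholders.
const API_URL = "https://api.chainbase.example/v1/query"; // placeholder
const API_KEY = "YOUR_API_KEY";

async function topTokenHolders(token: string): Promise<any[]> {
  const res = await fetch(API_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json", "x-api-key": API_KEY },
    body: JSON.stringify({
      // Table and column names are illustrative:
      query: `SELECT holder, balance
              FROM erc20_balances
              WHERE token = '${token}'
              ORDER BY balance DESC
              LIMIT 10`,
    }),
  });
  return (await res.json()).data;
}
```

The point of contrast with section 2 is that aggregation, decoding, and freshness are handled inside the data lake, so the client's job reduces to a single analytical query.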

Chainbase's AI model, Theia, is the key highlight distinguishing it from other data service protocols. Built on NVIDIA's DORA model and combining on-chain and off-chain data with temporal and spatial activity, Theia learns and analyzes crypto patterns and responds through causal inference, thereby uncovering the latent value and regularities of on-chain data and offering users more intelligent data services.

AI-empowered data services make Chainbase more than a blockchain data service platform: they make it a far more competitive intelligent data provider. With powerful data resources and proactive AI-driven analysis, Chainbase can offer broader data insights and streamline users' data processing.

3.3 Space and Time

Space and Time (SxT) aims to build a verifiable compute layer that extends zero-knowledge proofs over a decentralized data warehouse, delivering trustworthy data processing for smart contracts, large language models, and enterprises. Space and Time recently raised $20 million in its latest Series A round, led by Framework Ventures, Lightspeed Faction, Arrington Capital, and Hivemind Capital.

In data indexing and validation, Space and Time introduces a new technical approach: Proof of SQL. This is an innovative zero-knowledge proof (ZKP) technology developed by Space and Time to guarantee that SQL queries executed on its decentralized data warehouse are tamper-proof and verifiable. When a query runs, Proof of SQL generates a cryptographic proof of the integrity and accuracy of the query results. Attached to the results, this proof lets any verifier (such as a smart contract) independently confirm that the data was not tampered with during processing.

Traditional blockchain networks usually rely on consensus mechanisms to verify the authenticity of data; Space and Time's Proof of SQL offers a more efficient alternative. In its system, one node handles data retrieval while other nodes verify the data's authenticity using zk-SNARKs. This replaces the resource cost of many nodes repeatedly indexing the same data to reach consensus, improving the overall performance of the system. As the technology matures, it lays a cornerstone for traditional industries that depend on data reliability to build products on on-chain data.
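The flow can be summarized in a short conceptual sketch; every type and function here is a hypothetical stand-in, not Space and Time's actual SDK.

```typescript
// Conceptual sketch of a Proof-of-SQL-style flow. All names are hypothetical.
interface ProvenResult {
  rows: unknown[];   // the query results
  proof: Uint8Array; // ZK proof tying the rows to the committed table data
}

// One node executes the query and attaches a proof (stubbed here):
async function runQueryWithProof(sql: string): Promise<ProvenResult> {
  // ...in reality this would hit the decentralized data warehouse...
  return { rows: [], proof: new Uint8Array() };
}

// Any verifier (even a smart contract) checks the proof against a commitment
// to the underlying data instead of re-executing the query (stubbed here):
function verifyProof(proof: Uint8Array, commitment: Uint8Array): boolean {
  return proof.length >= 0 && commitment.length >= 0; // placeholder check
}

async function queryVerified(sql: string, commitment: Uint8Array) {
  const { rows, proof } = await runQueryWithProof(sql);
  if (!verifyProof(proof, commitment)) {
    throw new Error("result failed verification: possible tampering");
  }
  return rows; // tampering anywhere in the pipeline would invalidate the proof
}
```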

Meanwhile, SxT has been working closely with Microsoft's AI Joint Innovation Lab to accelerate the development of generative AI tools that let users process blockchain data through natural language. In Space and Time Studio today, users can type a natural-language query, and the AI automatically converts it into SQL, executes it on their behalf, and presents the final result.

3.4 Comparison of Key Differences

The table below summarizes the three protocols along the dimensions discussed above:

| | The Graph | Chainbase | Space and Time |
|---|---|---|---|
| Core product | Decentralized subgraph indexing and query markets | Full-chain real-time data lake with a dual-chain (EigenLayer AVS + CometBFT) architecture | Decentralized data warehouse with a verifiable compute layer |
| Query interface | GraphQL over subgraphs | SQL-style queries and the "manuscripts" data standard | SQL, verified by Proof of SQL (ZKP) |
| AI application | Semiotic Labs tools: AutoAgora pricing, Allocation Optimizer, AgentC natural-language queries | Theia crypto world model, built on NVIDIA's DORA | Natural-language-to-SQL via the Microsoft AI collaboration |

4. Conclusion and Prospects

In conclusion, blockchain data indexing technology has undergone a gradual process of refinement: from raw node data sources, through the development of data parsing and indexers, to AI-empowered full-chain data services. The continuous evolution of these technologies has not only improved the efficiency and accuracy of data access but also given users an unprecedented level of intelligent experience.

Looking ahead, as AI and emerging technologies such as zero-knowledge proofs continue to develop, blockchain data services will become more intelligent and more secure. We have every reason to believe that, as infrastructure, blockchain data services will keep playing an essential role and provide strong support for the industry's progress and innovation.

Statement:

  1. This article is reproduced from [Trustless Labs](https://x.com/TrustlessLabs/status/1833815530647834843); the copyright belongs to the original author, [Trustless Labs]. If you have any objections to the reprint, please contact the Gate Learn team, which will handle the matter promptly according to the relevant procedures.

  2. Disclaimer: The views and opinions expressed in this article are solely the author's and do not constitute investment advice.

  3. Other language versions of this article are translated by the Gate Learn team. Copying, distributing, or plagiarizing the translated articles without mentioning Gate.io is prohibited.
