We previously discussed how AI and Web3 can complement each other across vertical industries such as computational networks, intermediary platforms, and consumer applications. When focusing on data resources as a vertical field, emerging Web3 projects offer new possibilities for the acquisition, sharing, and utilization of data.
Data has become the key driver of innovation and decision-making across industries. UBS predicts that global data volume will grow roughly tenfold between 2020 and 2030, reaching 660 ZB. By 2025, an estimated 463 EB (exabytes; 1 EB = 1 billion GB) of data will be generated globally every day. The Data-as-a-Service (DaaS) market is expanding rapidly: according to Grand View Research, the global DaaS market was valued at $14.36 billion in 2023 and is expected to grow at a compound annual growth rate (CAGR) of 28.1%, reaching $76.8 billion by 2030.
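As a rough sanity check on that projection, the short calculation below compounds the 2023 figure at the quoted CAGR; it lands in the same ballpark as the cited 2030 value, with the residual gap attributable to rounding and to whichever base year the report actually compounds from (not stated here).

```python
# Rough sanity check of the DaaS projection quoted above.
# Figures: $14.36B in 2023, 28.1% CAGR, horizon 2030 (7 compounding periods).
base_value_2023 = 14.36        # USD billions, as cited from Grand View Research
cagr = 0.281                   # 28.1% compound annual growth rate
years = 2030 - 2023

projected_2030 = base_value_2023 * (1 + cagr) ** years
print(f"Projected 2030 market size: ${projected_2030:.1f}B")
# ≈ $81B -- the same order of magnitude as the cited $76.8B; the exact figure
# depends on which base year the report applies the CAGR from.
```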
AI model training relies heavily on large datasets to identify patterns and adjust parameters. After training, datasets are also needed to test the models’ performance and generalization capabilities. Additionally, AI agents, as emerging intelligent application forms, require real-time and reliable data sources to ensure accurate decision-making and task execution.
(Source: Leewayhertz)
Demand for business analytics is also becoming more diverse and widespread, serving as a core tool driving enterprise innovation. For instance, social media platforms and market research firms need reliable user behaviour data to formulate strategies and analyze trends, integrating diverse data from multiple social platforms to build a more comprehensive picture.
For the Web3 ecosystem, reliable and authentic data is also needed on-chain to support new financial products. As more innovative assets are tokenized, flexible and reliable data interfaces are required to support product development and risk management, allowing smart contracts to execute based on verifiable real-time data.
In addition, use cases in scientific research, IoT, and other fields highlight the skyrocketing demand for diverse, authentic, and real-time data. Traditional systems may struggle to cope with the rapidly growing data volume and ever-changing demands.
A typical data ecosystem includes data collection, storage, processing, analysis, and application. In centralized models, data collection and storage are concentrated under a core IT team with strict access controls. For example, Google’s data ecosystem spans sources such as its search engine, Gmail, and the Android operating system. These platforms collect user data, store it in globally distributed data centers, and process it with algorithms to support the development and optimization of a wide range of products and services.
In financial markets, LSEG’s data business (formerly Refinitiv) gathers real-time and historical data from global exchanges, banks, and other major financial institutions, while drawing on the Reuters News network to collect market-related news. It processes this information with proprietary algorithms and models to generate analytics and risk-assessment products as value-added services.
(Source: kdnuggets.com)
While traditional data architectures are effective for professional services, the limitations of centralized models are becoming increasingly evident, particularly around coverage of emerging data sources, transparency, and user privacy protection.
For example, the 2021 GameStop event revealed the limitations of traditional financial data providers in analyzing social media sentiment. Investor sentiment on platforms like Reddit swiftly influenced market trends, but data terminals like Bloomberg and Reuters failed to capture these dynamics in time, leading to delayed market forecasts.
Beyond these issues, traditional data providers face challenges related to cost efficiency and flexibility. Although they are actively addressing these problems, emerging Web3 technologies provide new perspectives and possibilities to tackle them.
Since the launch of decentralized storage solutions like IPFS (InterPlanetary File System) in 2014, a series of emerging projects have aimed to address the limitations of traditional data ecosystems. Decentralized data solutions have evolved into a multi-layered, interconnected ecosystem covering all stages of the data lifecycle, including data generation, storage, exchange, processing and analysis, verification and security, as well as privacy and ownership.
As data exchange and utilization increase, ensuring authenticity, credibility, and privacy has become critical. This drives the Web3 ecosystem to innovate in data verification and privacy protection, leading to groundbreaking solutions.
Many Web3 technologies and native projects focus on addressing issues of data authenticity and privacy protection. Beyond the widespread adoption of technologies like Zero-Knowledge Proofs (ZK) and Multi-Party Computation (MPC), TLS Notary has emerged as a noteworthy new verification method.
Introduction to TLS Notary
The Transport Layer Security (TLS) protocol is a widely used encryption protocol for network communications. Its primary purpose is to ensure the security, integrity, and confidentiality of data transmission between a client and a server. TLS is a common encryption standard in modern network communications, applied in scenarios such as HTTPS, email, and instant messaging.
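For a concrete anchor, the snippet below opens an ordinary TLS-protected connection using only Python’s standard library; everything discussed in the rest of this section builds on top of exactly this kind of session.

```python
# Minimal TLS client using only the Python standard library. The handshake
# (certificate verification, key exchange, cipher negotiation) happens inside
# the ssl-wrapped socket; the application just reads and writes plaintext.
import socket
import ssl

hostname = "example.com"
context = ssl.create_default_context()   # verifies the server certificate by default

with socket.create_connection((hostname, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as tls_sock:
        print("Negotiated protocol:", tls_sock.version())    # e.g. 'TLSv1.3'
        request = f"GET / HTTP/1.1\r\nHost: {hostname}\r\nConnection: close\r\n\r\n"
        tls_sock.sendall(request.encode())
        print(tls_sock.recv(200))   # first bytes of the encrypted-in-transit response
```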
(TLS Encryption Principles, Source: TechTarget)
When TLS Notary was first introduced a decade ago, its goal was to verify the authenticity of TLS sessions by introducing a third-party “notary” outside of the client (prover) and server.
TLS Notary uses key-splitting: the master key of a TLS session is divided into two parts, held separately by the client and the notary. This design allows the notary to participate as a trusted third party in the verification process without ever accessing the actual communication content. The mechanism aims to detect man-in-the-middle attacks, prevent fraudulent certificates, and ensure that communication data is not tampered with in transit, while still enabling a trusted third party to confirm the legitimacy of the communication without compromising privacy.
Thus, TLS Notary offers secure data verification and effectively balances verification needs with privacy protection.
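To make the key-splitting idea concrete, here is a deliberately simplified sketch. The production protocol runs secure multi-party computation over the TLS key schedule rather than plain XOR sharing, but the intuition is the same: neither the client nor the notary ever holds the complete session secret on its own.

```python
# Conceptual illustration only -- NOT the real TLS Notary construction, which uses
# MPC over the TLS key schedule. The point being sketched: a session secret split
# into two shares, so neither party alone can decrypt or forge the session.
import secrets

def split_secret(master_secret: bytes) -> tuple[bytes, bytes]:
    """Split a secret into two XOR shares; each share alone reveals nothing."""
    client_share = secrets.token_bytes(len(master_secret))
    notary_share = bytes(a ^ b for a, b in zip(master_secret, client_share))
    return client_share, notary_share

def reconstruct(client_share: bytes, notary_share: bytes) -> bytes:
    """Only the combination of both shares recovers the original secret."""
    return bytes(a ^ b for a, b in zip(client_share, notary_share))

master_secret = secrets.token_bytes(32)            # e.g. a 256-bit session secret
client_share, notary_share = split_secret(master_secret)

assert reconstruct(client_share, notary_share) == master_secret
assert client_share != master_secret and notary_share != master_secret
print("Neither the client nor the notary alone holds the full session key.")
```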
In 2022, the TLS Notary project was restructured by the Ethereum Foundation’s Privacy and Scaling Explorations (PSE) research lab. The new version of the protocol was rewritten from scratch in Rust and integrates more advanced cryptographic protocols such as MPC. These updates enable users to prove to a third party that data received from a server is authentic, without revealing the data’s content. While retaining its core verification capabilities, the new TLS Notary significantly enhances privacy protection, making it better suited to current and future data privacy requirements.
In recent years, TLS Notary technology has continued to evolve, resulting in various derivatives that further enhance its privacy and verification capabilities.
Web3 projects leverage these cryptographic technologies to enhance data verification and privacy protection, tackling issues like data monopolies, silos, and trusted transmission. Users can securely verify ownership of social media accounts, shopping records for financial loans, banking credit history, professional background, and academic credentials without compromising their privacy. Examples include:
(Projects Working on TLS Oracles, Source: Bastian Wetzel)
Data verification in Web3 is an essential link in the data ecosystem, with vast application prospects. The flourishing of this ecosystem is steering the digital economy toward a more open, dynamic, and user-centric model. However, the development of authenticity verification technologies is only the beginning of constructing next-generation data infrastructure.
Some projects have combined the aforementioned data verification technologies with further exploration of upstream data ecosystems, such as data traceability, distributed data collection, and trusted transmission. Below, we highlight three representative projects—OpenLayer, Grass, and Vana—that showcase unique potential in building next-generation data infrastructure.
OpenLayer, one of the projects in the a16z Crypto 2024 Spring Startup Accelerator, positions itself as the first modular authentic data layer. It aims to provide an innovative modular solution for coordinating data collection, verification, and transformation, addressing the needs of both Web2 and Web3 companies. OpenLayer has garnered support from well-known funds and angel investors, including Geometry Ventures and LongHash Ventures.
Traditional data layers face multiple challenges: lack of reliable verification mechanisms, reliance on centralized architectures that limit accessibility, lack of interoperability and flow between different systems, and the absence of fair data value distribution mechanisms.
A more specific issue is the growing scarcity of AI training data. On the public internet, many websites now deploy anti-scraping measures to prevent large-scale harvesting by AI companies. With private, proprietary data the situation is even more complex: valuable data is often locked away because of its sensitive nature and the absence of effective incentive mechanisms, so users cannot safely monetize their private data and are reluctant to share sensitive information.
To address these problems, OpenLayer combines data verification technologies to build a Modular Authentic Data Layer. Through decentralization and economic incentives, it coordinates the processes of data collection, verification, and transformation, providing a safer, more efficient, and flexible data infrastructure for Web2 and Web3 companies.
OpenLayer provides a modular platform that simplifies data collection, trustworthy verification, and transformation processes.
a) OpenNodes
OpenNodes are the core components responsible for decentralized data collection in the OpenLayer ecosystem. Users collect data through mobile apps, browser extensions, and other channels, and different operators/nodes can optimize their rewards by taking on the tasks best suited to their hardware specifications.
OpenNodes support three main types of data.
Developers can easily add new data types, specify data sources, and define requirements and retrieval methods. Users can provide anonymized data in exchange for rewards. This design allows the system to expand continuously to meet new data demands, and the diverse data sources make OpenLayer suitable for various application scenarios while lowering the threshold for data provision.
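OpenLayer has not published a public task schema, so the structure below is purely hypothetical; it is only meant to illustrate the kind of declarative specification described above, where a developer names a data type, its source, and the requirements and retrieval method nodes should use.

```python
# Hypothetical illustration of a declarative data-collection task. OpenLayer's
# actual task format is not public; every field name here is invented purely to
# show how a developer might register a new data type for nodes to fulfil.
from dataclasses import dataclass, field

@dataclass
class DataTask:
    data_type: str                                      # what kind of data is requested
    source: str                                         # where nodes should retrieve it
    requirements: dict = field(default_factory=dict)    # freshness, format, region, ...
    retrieval_method: str = "browser_extension"         # or "mobile_app", "api", ...
    reward_per_record: float = 0.0                      # incentive paid to contributors

# Example: a developer registering a (hypothetical) e-commerce price-feed task.
task = DataTask(
    data_type="ecommerce_price_snapshot",
    source="https://example-shop.com/api/prices",       # placeholder URL
    requirements={"max_age_seconds": 300, "anonymized": True},
    reward_per_record=0.002,
)
print(task)
```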
b) OpenValidators
OpenValidators handle the verification of collected data, enabling data consumers to confirm the accuracy of user-provided data against its source. Verification methods use cryptographic proofs, and results can be retrospectively validated. Multiple providers can offer verification services for the same type of proof, allowing developers to select the best-suited provider for their needs.
In initial use cases, particularly for public or private data from internet APIs, OpenLayer employs TLS Notary as a verification solution. It exports data from any web application and verifies its authenticity without compromising privacy.
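OpenLayer’s proof formats are likewise not public, so the sketch below is neither TLS Notary nor OpenLayer’s API; it only illustrates, with an ordinary Ed25519 signature over a hash of the payload, the general pattern the paragraph describes: data travels with a proof bound to it, and the consumer checks that binding before trusting the data.

```python
# Hypothetical consumer-side check of an attested payload (requires `pip install cryptography`).
# This is NOT the TLS Notary protocol -- just the generic attest-then-verify pattern.
import hashlib
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# --- Attestation side (a validator / notary in this sketch) -------------------
validator_key = Ed25519PrivateKey.generate()
validator_pub = validator_key.public_key()

payload = b'{"account": "anon-123", "balance_usd": 1520}'   # invented example data
attestation = validator_key.sign(hashlib.sha256(payload).digest())

# --- Consumer side -------------------------------------------------------------
def verify_attested(data: bytes, proof: bytes) -> bool:
    """Return True only if the proof covers exactly this payload."""
    try:
        validator_pub.verify(proof, hashlib.sha256(data).digest())
        return True
    except InvalidSignature:
        return False

print(verify_attested(payload, attestation))                 # True
print(verify_attested(payload + b"tampered", attestation))   # False -- data was altered
```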
Beyond TLS Notary, thanks to its modular design, the verification system can easily integrate other methods to accommodate diverse data types and verification needs.
c) OpenConnect
OpenConnect is the module responsible for data transformation and usability within the OpenLayer ecosystem. It processes data from various sources and ensures interoperability across different systems to meet diverse application requirements. For example, it provides privacy-preserving anonymization of users’ private account data while strengthening security during data sharing to reduce leaks and misuse. To meet the real-time data demands of AI and blockchain applications, it also supports efficient real-time data transformation.
Currently, through its integration with EigenLayer, OpenLayer AVS (Actively Validated Service) operators monitor data request tasks, collect the data, verify it, and report the results back to the system. Operators stake or restake assets on EigenLayer to provide an economic guarantee for their behaviour, and malicious behaviour results in their assets being slashed. As one of the earliest AVS projects on the EigenLayer mainnet, OpenLayer has attracted over 50 operators and $4 billion in restaked assets.
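The economic-security loop described above can be summarized in a toy model. This is a conceptual sketch, not EigenLayer’s or OpenLayer’s actual contract logic, and the slashing fraction is invented.

```python
# Toy model of the stake-and-slash loop: operators back their reports with stake,
# and provably incorrect reports forfeit part of it. Parameters are illustrative.
from dataclasses import dataclass

SLASH_FRACTION = 0.5   # invented penalty; real values are protocol-defined

@dataclass
class Operator:
    name: str
    stake: float        # restaked assets backing the operator's honesty

    def report(self, claimed_result: bool, ground_truth: bool) -> None:
        """Submit a verification result; misreporting costs part of the stake."""
        if claimed_result != ground_truth:
            penalty = self.stake * SLASH_FRACTION
            self.stake -= penalty
            print(f"{self.name}: slashed {penalty:.1f}, remaining stake {self.stake:.1f}")
        else:
            print(f"{self.name}: honest report, stake intact at {self.stake:.1f}")

honest = Operator("honest-op", stake=100.0)
cheater = Operator("cheating-op", stake=100.0)

honest.report(claimed_result=True, ground_truth=True)    # stake intact
cheater.report(claimed_result=False, ground_truth=True)  # gets slashed
```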
Grass, the flagship project developed by Wynd Network, is designed to create a decentralized network crawler and AI training data platform. By the end of 2023, Grass completed a $3.5 million seed funding round led by Polychain Capital and Tribe Capital. In September 2024, it secured Series A funding, with $5 million led by HackVC and additional participation from Polychain, Delphi, Lattice, and Brevan Howard.
As AI training increasingly relies on diverse and expansive data sources, Grass addresses this need by creating a distributed web crawler node network. This network leverages decentralized physical infrastructure and idle user bandwidth to collect and provide verifiable datasets for AI training. Nodes route web requests through user internet connections, accessing public websites and compiling structured datasets. Initial data cleaning and formatting are performed using edge computing technology, ensuring high-quality outputs.
Grass utilizes the Solana Layer 2 Data Rollup architecture to enhance processing efficiency. Validators receive, verify, and batch-process web transactions from nodes, generating Zero-Knowledge (ZK) proofs to confirm data authenticity. Verified data is stored on the Grass Data Ledger (L2), with corresponding proofs linked to the Solana L1 blockchain.
a) Grass Nodes:
Users install the Grass app or browser extension, allowing their idle bandwidth to power decentralized web crawling. Nodes route web requests, access public websites, and compile structured datasets. Using edge computing, they perform initial data cleaning and formatting. Users earn GRASS tokens as rewards based on their bandwidth contribution and the volume of data provided.
b) Routers:
Acting as intermediaries, routers connect Grass nodes to validators. They manage the node network, relay bandwidth, and are incentivized based on the total verified bandwidth they facilitate.
c) Validators:
Validators receive and verify the web transactions relayed by routers. They generate ZK proofs to confirm the validity of the data, using unique key sets to establish secure TLS connections and negotiate encryption suites. While Grass currently relies on centralized validators, there are plans to transition to a decentralized validator committee.
d) ZK Processors:
These processors validate node session data proofs and batch all web request proofs for submission to Solana Layer 1.
e) Grass Data Ledger (Grass L2):
The Grass Data Ledger stores comprehensive datasets and links them to their corresponding L1 proofs on Solana, ensuring transparency and traceability.
f) Edge Embedding Models:
These models transform unstructured web data into structured datasets suitable for AI training.
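Putting the components above together, the sketch below walks one request through the pipeline (node → router → validator → ZK processor → Solana L1, with the dataset landing on the Data Ledger). The function names and data shapes are invented for illustration and are not Grass’s actual interfaces; a hash stands in for the real ZK proof.

```python
# Conceptual walk-through of the Grass pipeline described above. All names are
# invented; a SHA-256 digest stands in for the real ZK proof.
import hashlib

def node_scrape(url: str) -> dict:
    """A node routes the request over the user's idle bandwidth and does
    initial cleaning/formatting at the edge."""
    raw_html = f"<html>public content of {url}</html>"   # placeholder fetch
    return {"url": url, "cleaned": raw_html.strip()}

def router_relay(record: dict) -> dict:
    """Routers relay node traffic to validators and account for verified bandwidth."""
    record["relayed_bytes"] = len(record["cleaned"])
    return record

def validator_verify(record: dict) -> dict:
    """Validators check the web transaction and emit a proof commitment."""
    record["proof"] = hashlib.sha256(record["cleaned"].encode()).hexdigest()
    return record

def settle(record: dict, data_ledger: list, l1_proofs: list) -> None:
    """ZK processors batch proofs to Solana L1; the dataset lands on the Data Ledger (L2)."""
    l1_proofs.append(record["proof"])
    data_ledger.append(record)

data_ledger, solana_l1 = [], []
settle(validator_verify(router_relay(node_scrape("https://example.com"))),
       data_ledger, solana_l1)
print(len(data_ledger), "dataset(s) stored,", len(solana_l1), "proof(s) anchored on L1")
```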
(Source: Grass)
Grass and OpenLayer share a commitment to leveraging distributed networks to provide companies with access to open internet data and authenticated private data. Both utilize incentive mechanisms to promote data sharing and the production of high-quality datasets, but their technical architectures and business models differ.
Technical Architecture:
Grass uses a Solana Layer 2 Data Rollup architecture with centralized validation, currently relying on a single validator. OpenLayer, as an early adopter of EigenLayer’s AVS (Actively Validated Service), employs a decentralized validation mechanism backed by economic incentives and slashing penalties, and its modular design emphasizes scalability and flexibility in data verification services.
Product Focus:
Both projects allow users to monetize data through nodes, but their business use cases diverge:
Grass primarily targets AI companies and data scientists needing large-scale, structured datasets, as well as research institutions and enterprises requiring web-based data. OpenLayer caters to Web3 developers needing off-chain data sources, AI companies requiring real-time, verifiable streams, and businesses pursuing innovative strategies like verifying competitor product usage.
While both projects currently occupy distinct niches, their functionalities may converge as the industry evolves.
Both projects could, for example, integrate data labelling as a critical step in preparing training datasets. Grass, with its vast network of over 2.2 million active nodes, could quickly deploy Reinforcement Learning from Human Feedback (RLHF) services to optimize AI models. OpenLayer, with its expertise in real-time data verification and processing, could maintain an edge in data credibility and quality, particularly for private datasets.
Despite the potential overlap, their unique strengths and technological approaches may allow them to dominate different niches within the decentralized data ecosystem.
(Source: IOSG, David)
Vana is a user-centric data pool network designed to provide high-quality data for AI and related applications. Compared to OpenLayer and Grass, Vana takes a distinct technological and business approach. In September 2024, Vana secured $5 million in funding led by Coinbase Ventures, following an $18 million Series A round in which Paradigm served as the lead investor, with participation from Polychain and Casey Caruso.
Originally launched in 2018 as an MIT research project, Vana is a Layer 1 blockchain dedicated to private user data. Its innovations in data ownership and value distribution allow users to profit from AI models trained on their data. Vana achieves this through trustless, private, and attributable Data Liquidity Pools (DLPs) and an innovative Proof of Contribution mechanism that facilitates the flow and monetization of private data.
Vana introduces a unique concept of Data Liquidity Pools (DLPs), which are at the core of the Vana network. Each DLP is an independent peer-to-peer network aggregating specific types of data assets. Users can upload their private data—such as shopping records, browsing habits, and social media activity—into designated DLPs and decide whether to authorize specific third-party usage.
Data within these pools undergoes de-identification to protect user privacy while remaining usable for commercial applications, such as AI model training and market research. Users contributing data to a DLP are rewarded with corresponding DLP tokens. These tokens represent the user’s contribution to the pool, grant governance rights, and entitle the user to a share of future profits.
Unlike the traditional one-time sale of data, Vana allows data to participate continuously in the economic cycle, enabling users to receive ongoing rewards with transparent, visualized usage tracking.
The Proof of Contribution (PoC) mechanism is a cornerstone of Vana’s approach to ensuring data quality. Each DLP can define a unique PoC function tailored to its characteristics, verifying the authenticity and completeness of submitted data and evaluating its contribution to improving AI model performance. This mechanism quantifies user contributions, recording them for reward allocation. Similar to the “Proof of Work” concept in cryptocurrency, PoC rewards users based on data quality, quantity, and usage frequency. Smart contracts automate this process, ensuring contributors are compensated fairly and transparently.
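As an illustration of how such a mechanism could be wired up, the sketch below scores contributions in a single hypothetical DLP. Vana lets each pool define its own PoC logic, so the fields and weights here are invented and only show how quality, quantity, and usage might combine into a reward share.

```python
# Hypothetical Proof-of-Contribution scoring for one DLP. Each DLP defines its own
# PoC function in practice; the weights and fields below are invented.
from dataclasses import dataclass

@dataclass
class Contribution:
    contributor: str
    records: int           # quantity of submitted data points
    quality_score: float   # 0..1, e.g. from validation or model-improvement tests
    usage_count: int       # how often downstream consumers used this data

def proof_of_contribution(c: Contribution) -> float:
    """Toy PoC score: quality-weighted quantity plus a usage bonus."""
    return c.records * c.quality_score + 0.5 * c.usage_count

contributions = [
    Contribution("alice", records=1_000, quality_score=0.9, usage_count=40),
    Contribution("bob",   records=5_000, quality_score=0.3, usage_count=5),
]

total = sum(proof_of_contribution(c) for c in contributions)
for c in contributions:
    share = proof_of_contribution(c) / total
    print(f"{c.contributor}: {share:.1%} of this epoch's DLP token rewards")
```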
a) Data Liquidity Layer
This core layer enables the contribution, verification, and recording of data into DLPs, transforming data into transferable digital assets on-chain. DLP creators deploy smart contracts to set purposes, verification methods, and contribution parameters. Data contributors submit data for validation, and the PoC module evaluates data quality and assigns governance rights and rewards.
b) Data Portability Layer
Serving as Vana’s application layer, this platform facilitates collaboration between data contributors and developers. It provides infrastructure for building distributed AI training models and AI DApps using the liquidity in DLPs.
c) Connectome
A decentralized ledger that underpins the Vana ecosystem, Connectome acts as a real-time data flow map. It records all real-time data transactions using Proof-of-Stake consensus, ensuring the efficient transfer of DLP tokens and enabling cross-DLP data access. Fully EVM-compatible, it allows interoperability with other networks, protocols, and DeFi applications.
(Source: Vana)
Vana provides a fresh approach by focusing on the liquidity and empowerment of user data. This decentralized data exchange model not only supports AI training and data marketplaces but also enables seamless cross-platform data sharing and ownership in the Web3 ecosystem. Ultimately, it fosters an open internet where users can own and manage their data and the intelligent products created from it.
In 2006, data scientist Clive Humby famously remarked, “Data is the new oil.” Over the past two decades, we have witnessed the rapid evolution of technologies that “refine” this resource, such as big data analytics and machine learning, which have unlocked unprecedented value from data. According to IDC, by 2025, the global data sphere will expand to 163 ZB, with the majority coming from individuals. As IoT, wearable devices, AI, and personalized services become more widespread, much of the data required for commercial use will originate from individuals.
Web3 data solutions overcome the limitations of traditional infrastructure by leveraging distributed node networks. These networks enable broader, more efficient data collection while improving the real-time accessibility and verifiability of specific datasets. Web3 technologies ensure data authenticity and integrity while protecting user privacy, fostering a fairer data utilization model. This decentralized architecture democratizes data access and empowers users to share in the economic benefits of the data economy.
Both OpenLayer and Grass rely on user-node models to enhance specific data collection processes, while Vana enables users to monetize their private data directly. These approaches not only improve efficiency but also let ordinary users share in the value created by the data economy, creating a win-win scenario for users and developers.
Through tokenomics, Web3 data solutions redesign incentive models, establishing a fairer value distribution mechanism. These systems attract significant user participation, hardware resources, and capital investment, optimizing the operation of the entire data network.
Web3 solutions offer modularity and scalability, allowing for technological iteration and ecosystem expansion. For example: OpenLayer’s modular design provides flexibility for future advancements; Grass’ distributed architecture optimizes AI model training by providing diverse and high-quality datasets.
From data generation, storage, and verification to exchange and analysis, Web3-driven solutions address the shortcomings of traditional infrastructures. By enabling users to monetize their data, these solutions fundamentally transform the data economy.
As technologies evolve and application scenarios expand, decentralized data layers are poised to become a cornerstone of next-generation infrastructure. They will support a wide range of data-driven industries while empowering users to take control of their data and its economic potential.