Breaking Down AI Data Barriers: Why Data DAOs Are Crucial Now

Intermediate | Jul 14, 2024
This article examines the current limitations of AI data sources and suggests that Data DAOs can provide new, high-quality datasets to advance AI models. Data DAOs can enhance AI training with real-world data, personal health data, and human feedback, but they also face challenges like incentive distortion, data verification, and benefit evaluation.

Recent high-profile data licensing deals, such as OpenAI's agreements with News Corp and Reddit, underscore the need for high-quality data in AI. Leading AI models have already been trained on a significant portion of the internet. For example, Common Crawl, a corpus widely used for large language model training, has indexed about 10% of web pages and contains over 100 trillion tokens.

To further improve AI models, it’s essential to expand and enhance the data available for training. We have been discussing ways to aggregate data, especially through decentralized methods. We are particularly interested in how decentralized approaches can help create new datasets and offer economic incentives to contributors and creators.

In recent years, one of the hot topics in the crypto world has been the concept of Data DAOs, which are groups of people who create, organize, and manage data. While this topic has been discussed by Multicoin and others, the rapid advancement of AI raises a new question: “Why is now the right time for Data DAOs?”

In this article, we will share our insights on Data DAOs to address the question: How can Data DAOs accelerate AI development?

1. The Current State of Data in AI

Today, AI models are primarily trained on public data, either through partnerships with companies like News Corp and Reddit or by scraping data from the open internet. For instance, Meta’s Llama 3 was trained using 15 trillion tokens from public sources. While these methods are effective for quickly gathering large amounts of data, they have limitations regarding what types of data are collected and how this data is obtained.

First, regarding what data should be collected: AI development is hindered by bottlenecks in data quality and quantity. Leopold Aschenbrenner discussed the “data wall” that limits further algorithm improvements: “Soon, the simple approach of pre-training larger language models on more scraped data may face significant bottlenecks.”

One way to overcome the data wall is to make new datasets available. For example, model companies cannot scrape login-protected data without violating most websites’ terms of service, and they cannot access data that hasn’t been collected. Currently, there is a vast amount of private data that AI training cannot access, such as data from Google Drive, Slack, personal health records, and other private information.

Second, regarding how data is collected: In the current model, data collection companies capture most of the value. Reddit’s S-1 filing highlights data licensing as a major anticipated revenue source: “We expect our growing data advantage and intellectual property to remain key elements in future LLM training.” However, the end users who generate the actual content do not receive any economic benefits from these licensing agreements or the AI models themselves. This misalignment could discourage participation—there are already movements to sue generative AI companies or opt out of training datasets. Additionally, concentrating revenue in the hands of model companies or platforms without sharing it with end users has significant socio-economic implications.

2. The Impact of Data DAOs

The data issues mentioned earlier share a common theme: they benefit from substantial contributions from diverse and representative user samples. While any single data point might have negligible impact on model performance, collectively, a large group of users can generate new datasets that are highly valuable for AI training. This is where Data DAOs (Decentralized Autonomous Organizations) come into play. With Data DAOs, data contributors can earn economic rewards for providing data and can control how their data is used and monetized.

In what areas can Data DAOs make a significant impact in the current data landscape? Here are a few ideas—this is not an exhaustive list, and Data DAOs certainly have other opportunities:

(1) Real-World Data
In the field of decentralized physical infrastructure networks (DePIN), projects like Hivemapper aim to collect up-to-date global map data by incentivizing dashcam owners to share their footage and encouraging users to contribute data through their applications (e.g., information about road closures or repairs). DePIN can be seen as a real-world Data DAO, where datasets are generated from hardware devices and/or user networks. This data has commercial value for many companies, and contributors are rewarded with tokens.

(2) Personal Health Data
Biohacking is a social movement where individuals and communities adopt a DIY approach to studying biology, often experimenting on themselves. For example, someone might use different nootropic drugs to boost brain performance, try various treatments or environmental changes to improve sleep, or even inject themselves with experimental substances.

Data DAOs can support these biohacking efforts by organizing participants around shared experiments and systematically collecting results. The income generated by these personal health DAOs, such as from research labs or pharmaceutical companies, can be returned to participants who contributed their personal health data.

(3) Reinforcement Learning from Human Feedback
Reinforcement Learning from Human Feedback (RLHF) uses human input to fine-tune AI models and improve their performance. Typically, feedback comes from experts in specific fields who can effectively evaluate the model’s output. For instance, a research lab might seek assistance from a mathematics PhD to enhance its AI’s mathematical capabilities. Token rewards can attract and incentivize experts to participate, offering speculative value and global access through crypto payment systems. Companies like Sapien, Fraction, and Sahara are actively working in this area.

(4) Private Data
As public data available for AI training becomes scarcer, the focus may shift to proprietary datasets, including private user data. Behind login walls lies a wealth of high-quality data that remains inaccessible, such as private messages and documents. This data can be highly effective for training personalized AI and contains valuable information not found on the public internet.

Accessing and using this data presents significant legal and ethical challenges. Data DAOs can offer a solution by allowing willing participants to upload and monetize their data while managing its usage. For example, a Reddit Data DAO could enable users to upload their exported Reddit data, including comments, posts, and voting history, which could be sold or leased to AI companies in a privacy-protective manner. Token incentives allow users to earn not only from a one-time transaction but also from the ongoing value generated by AI models trained with their data.
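
As a minimal sketch of how ongoing earnings could be shared, the snippet below splits one licensing payment among token holders in proportion to their balances. The function name, the use of integer cents, and the rule that rounding dust goes back to a treasury are all illustrative assumptions, not a description of how any existing Data DAO works.

```python
from typing import Dict, Tuple


def distribute_revenue(
    revenue_cents: int, token_balances: Dict[str, int]
) -> Tuple[Dict[str, int], int]:
    """Split a payment pro rata by token balance (hypothetical helper).

    Returns (payout per holder in cents, undistributed remainder in cents).
    Integer arithmetic is used so that no cents are created or lost.
    """
    total_tokens = sum(token_balances.values())
    if total_tokens <= 0:
        raise ValueError("at least one holder must have a positive balance")

    # Floor division rounds each payout down; the leftover is returned
    # separately (assumed here to stay in the DAO treasury).
    payouts = {
        holder: revenue_cents * balance // total_tokens
        for holder, balance in token_balances.items()
    }
    remainder = revenue_cents - sum(payouts.values())
    return payouts, remainder


# Hypothetical example: a $100.00 payment split across three contributors.
payouts, remainder = distribute_revenue(10_000, {"a": 3, "b": 1, "c": 2})
print(payouts, remainder)
```

Calling the same function each time a buyer pays would give contributors a share of recurring revenue rather than a single upfront payment.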

3. Open Issues and Challenges

While Data DAOs offer significant potential benefits, there are several important considerations and challenges to address.

(1) Distortion of Incentives
A key lesson from the history of using token incentives in crypto is that external rewards can alter user behavior. This has direct implications for using token incentives to gather data: incentives might distort the participant pool and the types of data they contribute.

Introducing token incentives also opens up the possibility of participants exploiting the system, such as by submitting low-quality or fabricated data to maximize their income. This is critical because the success of Data DAOs depends on the quality of the data. If contributions deviate from the desired goal, the value of the dataset can be compromised.

(2) Measuring and Rewarding Data
The central idea of Data DAOs is to reward contributors with tokens for their data submissions, on the expectation that this data will generate revenue for the DAO in the long run. However, because the value of data is subjective, determining the appropriate reward for different contributions is highly challenging. For example, in the biohacking scenario: Is some users’ data more valuable than others’? If so, what factors determine this? For map data: Is information from certain areas more valuable than from others? How should these differences be quantified? (Research into measuring data value in AI by assessing the incremental contribution of data to model performance is ongoing but can be computationally intensive.)
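
One simple way to picture "incremental contribution to model performance" is leave-one-out valuation: train with everyone's data, then retrain without one contributor and see how much validation accuracy drops. The sketch below uses a toy 1-nearest-neighbour classifier on made-up one-dimensional points; the contributor names, the data, and the choice of model are assumptions for illustration only, and published valuation methods are considerably more involved.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, int]  # (feature, label)


def accuracy_1nn(train: List[Point], val: List[Point]) -> float:
    """Validation accuracy of a 1-nearest-neighbour classifier."""
    if not train:
        return 0.0
    correct = 0
    for x, label in val:
        nearest = min(train, key=lambda p: abs(p[0] - x))
        correct += int(nearest[1] == label)
    return correct / len(val)


def leave_one_out_values(
    contributions: Dict[str, List[Point]], val: List[Point]
) -> Dict[str, float]:
    """Value of each contributor = accuracy with everyone - accuracy without them."""
    everyone = [p for pts in contributions.values() for p in pts]
    full_score = accuracy_1nn(everyone, val)
    values = {}
    for name in contributions:
        rest = [
            p
            for other, pts in contributions.items()
            if other != name
            for p in pts
        ]
        values[name] = full_score - accuracy_1nn(rest, val)
    return values


# Made-up data: alice and carol cover the same region, bob covers a new one.
contributions = {
    "alice": [(0.0, 0), (1.0, 0)],
    "bob": [(9.0, 1), (10.0, 1)],
    "carol": [(0.5, 0)],
}
validation = [(0.2, 0), (0.8, 0), (9.5, 1), (10.5, 1)]
print(leave_one_out_values(contributions, validation))
# {'alice': 0.0, 'bob': 0.5, 'carol': 0.0}
```

In this toy case alice and carol each score zero because either one can stand in for the other, which shows why naive leave-one-out can undervalue redundant but useful data. The cost is also visible: one retraining per contributor, which is what makes this family of methods computationally intensive at scale.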

Furthermore, it is essential to establish robust mechanisms to verify the authenticity and accuracy of the data. Without these measures, the system could be vulnerable to fraudulent submissions and Sybil attacks (e.g., one person creating many fake accounts). DePIN networks address this issue by integrating verification at the hardware device level, but other types of Data DAOs relying on user contributions might be more susceptible to manipulation.
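
To make the verification problem concrete, here is a deliberately naive first-pass check: flag content that several accounts have submitted in identical form after light normalization. The function names and sample submissions are hypothetical, and a check like this only catches exact resubmissions; it is not Sybil resistance on its own.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def normalise(payload: str) -> str:
    """Lower-case and collapse whitespace so trivial edits do not hide copies."""
    return " ".join(payload.lower().split())


def find_duplicate_submissions(
    submissions: Iterable[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Map content hash -> accounts, for content sent by more than one account."""
    accounts_by_hash: Dict[str, Set[str]] = defaultdict(set)
    for account, payload in submissions:
        digest = hashlib.sha256(normalise(payload).encode("utf-8")).hexdigest()
        accounts_by_hash[digest].add(account)
    return {
        digest: sorted(accounts)
        for digest, accounts in accounts_by_hash.items()
        if len(accounts) > 1
    }


# Hypothetical submissions: acct1 and acct2 sent the same report.
submissions = [
    ("acct1", "Road closed on Main St"),
    ("acct2", "road closed   on main st"),
    ("acct3", "Bridge repair on 5th Ave"),
]
for accounts in find_duplicate_submissions(submissions).values():
    print("possible duplicate from:", accounts)
```

Flagged groups would still need human or statistical review, since independent contributors can legitimately report the same fact.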

(3) Incremental Value of New Data
Most of the open web has already been used for training, so Data DAO operators must consider whether datasets collected in a decentralized manner truly add incremental value beyond what is already publicly available, and whether researchers could obtain the same data directly from the platform or through other means. This underscores the importance of gathering genuinely new data that goes beyond what is currently available, which leads to the next consideration: the scale of impact and revenue opportunities.

(4) Assessing Revenue Opportunities
Fundamentally, Data DAOs are building a two-sided marketplace that connects data buyers with data contributors. Therefore, the success of a Data DAO depends on its ability to attract a stable and diverse customer base willing to pay for data.

Data DAOs need to identify and confirm the demand for their data and ensure that the revenue opportunities are significant enough (whether in total or per contributor) to motivate the necessary quantity and quality of data. For instance, the concept of creating a user data DAO to aggregate personal preferences and browsing data for advertising purposes has been discussed for years, but the potential returns for users may be minimal. (For context, Meta’s global ARPU was $13.12 at the end of 2023.) With AI companies planning to invest trillions of dollars in training, the potential earnings from data might be enough to incentivize large-scale contributions, raising an intriguing question for Data DAOs: “Why now?”

4. Breaking Through the Data Wall

Data DAOs offer a promising solution for creating new, high-quality datasets and breaking through the data wall that challenges artificial intelligence. While the exact methods for achieving this are still to be determined, we are excited to see how this field evolves.

Disclaimer:

  1. This article is reprinted from [Jinse Finance], and the copyright belongs to the original author [Li Jin]. If you have any objections to this reprint, please contact the Gate Learn team at [email protected]. The team will promptly address any concerns according to the relevant procedures.
  2. Disclaimer: The views and opinions expressed in this article are those of the author alone and do not constitute any investment advice.
  3. Other language versions of this article have been translated by the Gate Learn team. Without mentioning Gate.io, the translated articles may not be copied, distributed, or plagiarized.
