DALL·E 2 pre-training mitigations

We observed that our internal predecessors to DALL·E 2 would sometimes reproduce training images verbatim. This behavior was undesirable, since we would like DALL·E 2 to create original, unique images by default and not just “stitch together” pieces of existing images. Additionally, reproducing training images verbatim can raise legal questions around copyright infringement, ownership, and privacy (if people’s photos were present in training data).

To better understand the issue of image regurgitation, we collected a dataset of prompts that frequently resulted in duplicated images. To do this, we used a trained model to sample images for 50,000 prompts from our training dataset, and sorted the samples by perceptual similarity to the corresponding training image. Finally, we inspected the top matches by hand, finding only a few hundred true duplicate pairs out of the 50k total prompts. Even though the regurgitation rate appeared to be less than 1%, we felt it was necessary to push the rate down to 0 for the reasons stated above.
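The post doesn't specify the perceptual-similarity metric, but the search itself is simple to sketch. Below is a minimal illustration, assuming each generated and training image is represented by an embedding vector (rows aligned by prompt) and using cosine similarity as a stand-in for the actual perceptual metric; all names are illustrative:

```python
import numpy as np

def rank_regurgitation_candidates(gen_embs: np.ndarray, train_embs: np.ndarray):
    """Rank prompts by how similar each generated image is to the
    training image for the same prompt (rows are aligned by prompt)."""
    gen = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    train = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = np.sum(gen * train, axis=1)   # cosine similarity per prompt
    order = np.argsort(-sims)            # most similar first, for manual review
    return order, sims

# Toy check: the generation for prompt 1 is a near-copy of its training image.
rng = np.random.default_rng(0)
train_embs = rng.normal(size=(4, 8))
gen_embs = rng.normal(size=(4, 8))
gen_embs[1] = train_embs[1] + 0.01 * rng.normal(size=8)  # simulated regurgitation
order, sims = rank_regurgitation_candidates(gen_embs, train_embs)
print(order[0])  # → 1
```

Sorting puts the suspected duplicates at the top of the list, which is what makes inspecting only the top matches by hand practical at 50k-prompt scale.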

When we studied our dataset of regurgitated images, we noticed two patterns. First, the images were almost all simple vector graphics, which were likely easy to memorize due to their low information content. Second, and more importantly, the images all had many near-duplicates in the training dataset. For example, there might be a vector graphic which looks like a clock showing the time 1 o’clock—but then we would discover a training sample containing the same clock showing 2 o’clock, and then 3 o’clock, etc. Once we realized this, we used a distributed nearest neighbor search to verify that, indeed, all of the regurgitated images had perceptually similar duplicates in the dataset. Other works have observed a similar phenomenon in large language models, finding that data duplication is strongly linked to memorization.

The above finding suggested that, if we deduplicated our dataset, we might solve the regurgitation problem. To achieve this, we planned to use a neural network to identify groups of images that looked similar, and then remove all but one image from each group.[^footnote-2]

However, this would require checking, for each image, whether it is a duplicate of every other image in the dataset. Since our whole dataset contains hundreds of millions of images, we would naively need to check hundreds of quadrillions of image pairs to find all the duplicates. While this is technically within reach, especially on a large compute cluster, we found a much more efficient alternative that works almost as well at a small fraction of the cost.

Consider what happens if we cluster our dataset before performing deduplication. Since nearby samples often fall into the same cluster, most of the duplicate pairs would not cross cluster decision boundaries. We could then deduplicate samples within each cluster without checking for duplicates outside of the cluster, while only missing a small fraction of all duplicate pairs. This is much faster than the naive approach, since we no longer have to check every single pair of images.[^footnote-3]
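The cluster-then-deduplicate idea can be sketched as follows. This is an assumption-laden toy version (helper names are made up, and centroids sampled from the data stand in for a real K-means fit), but it shows why the pairwise cost collapses: comparisons happen only inside each cluster.

```python
import numpy as np

def dedup_within_clusters(embs, n_clusters=8, threshold=0.97, seed=0):
    """Cluster the embeddings, then search for duplicate pairs only
    inside each cluster; return the indices of images to keep."""
    rng = np.random.default_rng(seed)
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    # Centroids sampled from the data stand in for a real K-means fit.
    centroids = unit[rng.choice(len(unit), size=n_clusters, replace=False)]
    assign = np.argmax(unit @ centroids.T, axis=1)  # nearest centroid

    drop = set()
    for c in range(n_clusters):
        idx = np.flatnonzero(assign == c)
        sims = unit[idx] @ unit[idx].T  # pairwise cosine, cluster-local only
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                if sims[a, b] > threshold:
                    drop.add(int(idx[b]))  # keep the earlier image of the pair
    return [i for i in range(len(embs)) if i not in drop]

# Toy check: image 3 is a near-duplicate of image 0, so it should be dropped.
rng = np.random.default_rng(1)
data = rng.normal(size=(20, 16))
data[3] = data[0] + 1e-3
keep = dedup_within_clusters(data)
print(0 in keep, 3 in keep)  # → True False
```

With N images in K roughly equal clusters, each cluster holds about N/K images, so the pair count drops from O(N²) to O(N²/K).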

When we tested this approach empirically on a small subset of our data, it found 85% of all duplicate pairs when using K=1024 clusters.

To improve the success rate of the above algorithm, we leveraged one key observation: when you cluster different random subsets of a dataset, the resulting cluster decision boundaries are often quite different. Therefore, if a duplicate pair crosses a cluster boundary for one clustering of the data, the same pair might fall inside a single cluster in a different clustering. The more clusterings you try, the more likely you are to discover a given duplicate pair. In practice, we settled on using five clusterings, which means that we search for duplicates of each image in the union of five different clusters; on a subset of our data, this found 97% of all duplicate pairs.
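The union-over-clusterings trick can be sketched in the same toy setting (again with illustrative names, and with centroids sampled from the data standing in for K-means on random subsets). The key property is that adding clusterings can only grow the set of candidate pairs that get checked:

```python
import numpy as np
from itertools import combinations

def candidate_pairs(embs, n_clusters, seed):
    """Pairs of images that share a cluster under one random clustering
    (centroids sampled from the data stand in for K-means on a subset)."""
    rng = np.random.default_rng(seed)
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    centroids = unit[rng.choice(len(unit), size=n_clusters, replace=False)]
    assign = np.argmax(unit @ centroids.T, axis=1)
    pairs = set()
    for c in range(n_clusters):
        idx = np.flatnonzero(assign == c)
        pairs.update(combinations(map(int, idx), 2))
    return pairs

# Recall over true duplicate pairs grows monotonically with the number of
# clusterings, since the union of candidate pairs can only get larger.
rng = np.random.default_rng(2)
data = rng.normal(size=(50, 16))
single = candidate_pairs(data, n_clusters=10, seed=0)
union = set()
for seed in range(5):
    union |= candidate_pairs(data, n_clusters=10, seed=seed)
print(single <= union)  # → True
```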

Surprisingly, almost a quarter of our dataset was removed by deduplication. When we looked at the near-duplicate pairs that were found, many of them included meaningful changes. Recall the clock example from above: the dataset might include many images of the same clock at different times of day. While these images are likely to make the model memorize this particular clock’s appearance, they might also help the model learn to distinguish between times of day on a clock. Given how much data was removed, we were worried that removing images like this might have hurt the model’s performance.

To test the effect of deduplication on our models, we trained two models with identical hyperparameters: one on the full dataset, and one on the deduplicated version of the dataset. To compare the models, we used the same human evaluations we used to evaluate our original GLIDE model. Surprisingly, we found that human evaluators slightly preferred the model trained on deduplicated data, suggesting that the large amount of redundant images in the dataset was actually hurting performance.

Once we had a model trained on deduplicated data, we reran the regurgitation search we had previously done over 50k prompts from the training dataset. We found that the new model never regurgitated a training image when given the exact prompt for the image from the training dataset. To take this test another step further, we also performed a nearest neighbor search over the entire training dataset for each of the 50k generated images. This way, we thought we might catch the model regurgitating a different image than the one associated with a given prompt. Even with this more thorough check, we never found a case of image regurgitation.
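A brute-force form of that second, more thorough check is easy to express over embeddings (an illustrative sketch; at the scale described above this would be a distributed nearest-neighbor search rather than a single matrix product):

```python
import numpy as np

def nearest_training_image(gen_embs, train_embs):
    """For each generated image, the index of its most similar training
    image and the similarity score (brute-force over the whole set)."""
    gen = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    train = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = gen @ train.T                       # (n_generated, n_training)
    nn = sims.argmax(axis=1)
    return nn, sims[np.arange(len(gen_embs)), nn]

# Toy check: generation 2 copies training image 40, so the search flags it
# even though image 40 may belong to a completely different prompt.
rng = np.random.default_rng(3)
train_embs = rng.normal(size=(100, 32))
gen_embs = rng.normal(size=(5, 32))
gen_embs[2] = train_embs[40] + 1e-3
nn, score = nearest_training_image(gen_embs, train_embs)
print(nn[2])  # → 40
```

Unlike the per-prompt comparison, this catches the case where the model regurgitates a different training image than the one paired with the prompt.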



We’re bringing the Financial Times’ world-class journalism to ChatGPT

Editor’s note: This news was originally shared by the Financial Times and can be read here.  

The Financial Times today announced a strategic partnership and licensing agreement with OpenAI, a leader in artificial intelligence research and deployment, to enhance ChatGPT with attributed content, help improve its models’ usefulness by incorporating FT journalism, and collaborate on developing new AI products and features for FT readers. 

Through the partnership, ChatGPT users will be able to see select attributed summaries, quotes and rich links to FT journalism in response to relevant queries. 

In addition, the FT became a customer of ChatGPT Enterprise earlier this year, purchasing access for all FT employees to ensure its teams are well-versed in the technology and can benefit from the creativity and productivity gains made possible by OpenAI’s tools. 

“This is an important agreement in a number of respects,” said FT Group CEO John Ridding. “It recognises the value of our award-winning journalism and will give us early insights into how content is surfaced through AI. We have long been a leader in news media innovation, pioneering the subscription model and engagement technologies, and this partnership will help to keep us at the forefront of developments in how people access and use information.” 

“The FT is committed to human journalism, as produced by our unrivalled newsroom, and this agreement will broaden the reach of that work, while deepening our understanding of reader demands and interests,” Ridding added. “Apart from the benefits to the FT, there are broader implications for the industry. It’s right, of course, that AI platforms pay publishers for the use of their material. OpenAI understands the importance of transparency, attribution, and compensation – all essential for us. At the same time, it’s clearly in the interests of users that these products contain reliable sources.” 

Brad Lightcap, COO of OpenAI, expressed enthusiasm about the evolving relationship with the Financial Times, stating: “Our partnership and ongoing dialogue with the FT is about finding creative and productive ways for AI to empower news organisations and journalists, and enrich the ChatGPT experience with real-time, world-class journalism for millions of people around the world.” 

“We’re keen to explore the practical outcomes regarding news sources and AI through this partnership,” said Ridding. “We value the opportunity to be inside the development loop as people discover content in new ways. As with any transformative technology, there is potential for significant advancements and major challenges, but what’s never possible is turning back time. It’s important for us to represent quality journalism as these products take shape – with the appropriate safeguards in place to protect the FT’s content and brand. 

We have always embraced new technologies and disruption, and we’ll continue to operate with both curiosity and vigilance as we navigate this next wave of change.”



Introducing more enterprise-grade features for API customers

To help organizations scale their AI usage without over-extending their budgets, we’ve added two new ways to reduce costs on consistent and asynchronous workloads:

  • Discounted usage on committed throughput: Customers with a sustained level of tokens per minute (TPM) usage on GPT-4 or GPT-4 Turbo can request access to provisioned throughput to get discounts ranging from 10–50% based on the size of the commitment.
  • Reduced costs on asynchronous workloads: Customers can use our new Batch API to run non-urgent workloads asynchronously. Batch API requests are priced at 50% off shared prices, offer much higher rate limits, and return results within 24 hours. This is ideal for use cases like model evaluation, offline classification, summarization, and synthetic data generation.
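A Batch API job starts from a JSONL file with one request per line; per the API documentation, each line carries a custom_id, method, url, and body. A minimal sketch of assembling that file (the prompts and model name here are illustrative):

```python
import json

def build_batch_input(prompts, model="gpt-4-turbo", path="batch_input.jsonl"):
    """Write one Batch API request per line: custom_id, method, url, body."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            request = {
                "custom_id": f"request-{i}",      # used to match results back
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            f.write(json.dumps(request) + "\n")
    return path

path = build_batch_input(["Summarize this article.", "Classify this review."])
lines = [json.loads(line) for line in open(path)]
print(len(lines), lines[0]["custom_id"])  # → 2 request-0
```

The file is then uploaded and a batch is created against it with a 24-hour completion window; see the Batch API documentation for the exact upload and creation calls.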


We plan to keep adding new features focused on enterprise-grade security, administrative controls, and cost management. For more information on these launches, visit our API documentation or get in touch with our team to discuss custom solutions for your enterprise.



Adopting safety by design principles


OpenAI, alongside industry leaders including Amazon, Anthropic, Civitai, Google, Meta, Metaphysic, Microsoft, Mistral AI, and Stability AI, has committed to implementing robust child safety measures in the development, deployment, and maintenance of generative AI technologies as articulated in the Safety by Design principles. This initiative, led by Thorn, a nonprofit dedicated to defending children from sexual abuse, and All Tech Is Human, an organization dedicated to tackling tech and society’s complex problems, aims to mitigate the risks generative AI poses to children. By adopting comprehensive Safety by Design principles, OpenAI and our peers are ensuring that child safety is prioritized at every stage in the development of AI. To date, we have made significant efforts to minimize the potential for our models to generate content that harms children and have set age restrictions for ChatGPT, and we actively engage with the National Center for Missing and Exploited Children (NCMEC), the Tech Coalition, and other government and industry stakeholders on child protection issues and enhancements to reporting mechanisms.

As part of this Safety by Design effort, we commit to:

  1. Develop: Develop, build, and train generative AI models that proactively address child safety risks.
     • Responsibly source our training datasets, detect and remove child sexual abuse material (CSAM) and child sexual exploitation material (CSEM) from training data, and report any confirmed CSAM to the relevant authorities.
     • Incorporate feedback loops and iterative stress-testing strategies in our development process.
     • Deploy solutions to address adversarial misuse.
  2. Deploy: Release and distribute generative AI models after they have been trained and evaluated for child safety, providing protections throughout the process.
     • Combat and respond to abusive content and conduct, and incorporate prevention efforts.
     • Encourage developer ownership in safety by design.
  3. Maintain: Maintain model and platform safety by continuing to actively understand and respond to child safety risks.
     • Remove new AIG-CSAM generated by bad actors from our platform.
     • Invest in research and future technology solutions.
     • Fight CSAM, AIG-CSAM, and CSEM on our platforms.

This commitment marks an important step in preventing the misuse of AI technologies to create or spread AI-generated child sexual abuse material (AIG-CSAM) and other forms of sexual harm against children. As part of the working group, we have also agreed to release progress updates every year.


