DALL·E 2 pre-training mitigations

We observed that our internal predecessors to DALL·E 2 would sometimes reproduce training images verbatim. This behavior was undesirable, since we would like DALL·E 2 to create original, unique images by default and not just “stitch together” pieces of existing images. Additionally, reproducing training images verbatim can raise legal questions around copyright infringement, ownership, and privacy (if people’s photos were present in training data).

To better understand the issue of image regurgitation, we collected a dataset of prompts that frequently resulted in duplicated images. To do this, we used a trained model to sample images for 50,000 prompts from our training dataset, and sorted the samples by perceptual similarity to the corresponding training image. Finally, we inspected the top matches by hand, finding only a few hundred true duplicate pairs out of the 50k total prompts. Even though the regurgitation rate appeared to be less than 1%, we felt it was necessary to push the rate down to 0 for the reasons stated above.
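The post doesn't specify the perceptual-similarity metric, but the search itself is simple to sketch. Below is a minimal illustration, assuming each generated and training image is represented by an embedding vector (rows aligned by prompt) and using cosine similarity as a stand-in for the actual perceptual metric; all names are illustrative:

```python
import numpy as np

def rank_regurgitation_candidates(gen_embs: np.ndarray, train_embs: np.ndarray):
    """Rank prompts by how similar each generated image is to the
    training image for the same prompt (rows are aligned by prompt)."""
    gen = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    train = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = np.sum(gen * train, axis=1)   # cosine similarity per prompt
    order = np.argsort(-sims)            # most similar first, for manual review
    return order, sims

# Toy check: the generation for prompt 1 is a near-copy of its training image.
rng = np.random.default_rng(0)
train_embs = rng.normal(size=(4, 8))
gen_embs = rng.normal(size=(4, 8))
gen_embs[1] = train_embs[1] + 0.01 * rng.normal(size=8)  # simulated regurgitation
order, sims = rank_regurgitation_candidates(gen_embs, train_embs)
print(order[0])  # → 1
```

Sorting puts the suspected duplicates at the top of the list, which is what makes inspecting only the top matches by hand practical at 50k-prompt scale.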

When we studied our dataset of regurgitated images, we noticed two patterns. First, the images were almost all simple vector graphics, which were likely easy to memorize due to their low information content. Second, and more importantly, the images all had many near-duplicates in the training dataset. For example, there might be a vector graphic which looks like a clock showing the time 1 o’clock—but then we would discover a training sample containing the same clock showing 2 o’clock, and then 3 o’clock, etc. Once we realized this, we used a distributed nearest neighbor search to verify that, indeed, all of the regurgitated images had perceptually similar duplicates in the dataset. Other works have observed a similar phenomenon in large language models, finding that data duplication is strongly linked to memorization.

The above finding suggested that, if we deduplicated our dataset, we might solve the regurgitation problem. To achieve this, we planned to use a neural network to identify groups of images that looked similar, and then remove all but one image from each group.[^footnote-2]

However, this would require checking, for each image, whether it is a duplicate of every other image in the dataset. Since our whole dataset contains hundreds of millions of images, we would naively need to check hundreds of quadrillions of image pairs to find all the duplicates. While this is technically within reach, especially on a large compute cluster, we found a much more efficient alternative that works almost as well at a small fraction of the cost.

Consider what happens if we cluster our dataset before performing deduplication. Since nearby samples often fall into the same cluster, most of the duplicate pairs would not cross cluster decision boundaries. We could then deduplicate samples within each cluster without checking for duplicates outside of the cluster, while only missing a small fraction of all duplicate pairs. This is much faster than the naive approach, since we no longer have to check every single pair of images.[^footnote-3]
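The cluster-then-deduplicate idea can be sketched as follows. This is an assumption-laden toy version (helper names are made up, and centroids sampled from the data stand in for a real K-means fit), but it shows why the pairwise cost collapses: comparisons happen only inside each cluster.

```python
import numpy as np

def dedup_within_clusters(embs, n_clusters=8, threshold=0.97, seed=0):
    """Cluster the embeddings, then search for duplicate pairs only
    inside each cluster; return the indices of images to keep."""
    rng = np.random.default_rng(seed)
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    # Centroids sampled from the data stand in for a real K-means fit.
    centroids = unit[rng.choice(len(unit), size=n_clusters, replace=False)]
    assign = np.argmax(unit @ centroids.T, axis=1)  # nearest centroid

    drop = set()
    for c in range(n_clusters):
        idx = np.flatnonzero(assign == c)
        sims = unit[idx] @ unit[idx].T  # pairwise cosine, cluster-local only
        for a in range(len(idx)):
            for b in range(a + 1, len(idx)):
                if sims[a, b] > threshold:
                    drop.add(int(idx[b]))  # keep the earlier image of the pair
    return [i for i in range(len(embs)) if i not in drop]

# Toy check: image 3 is a near-duplicate of image 0, so it should be dropped.
rng = np.random.default_rng(1)
data = rng.normal(size=(20, 16))
data[3] = data[0] + 1e-3
keep = dedup_within_clusters(data)
print(0 in keep, 3 in keep)  # → True False
```

With N images in K roughly equal clusters, each cluster holds about N/K images, so the pair count drops from O(N²) to O(N²/K).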

When we tested this approach empirically on a small subset of our data, it found 85% of all duplicate pairs when using K=1024 clusters.

To improve the success rate of the above algorithm, we leveraged one key observation: when you cluster different random subsets of a dataset, the resulting cluster decision boundaries are often quite different. Therefore, if a duplicate pair crosses a cluster boundary for one clustering of the data, the same pair might fall inside a single cluster in a different clustering. The more clusterings you try, the more likely you are to discover a given duplicate pair. In practice, we settled on using five clusterings, which means that we search for duplicates of each image in the union of five different clusters; on a subset of our data, this found 97% of all duplicate pairs.
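The union-over-clusterings trick can be sketched in the same toy setting (again with illustrative names, and with centroids sampled from the data standing in for K-means on random subsets). The key property is that adding clusterings can only grow the set of candidate pairs that get checked:

```python
import numpy as np
from itertools import combinations

def candidate_pairs(embs, n_clusters, seed):
    """Pairs of images that share a cluster under one random clustering
    (centroids sampled from the data stand in for K-means on a subset)."""
    rng = np.random.default_rng(seed)
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    centroids = unit[rng.choice(len(unit), size=n_clusters, replace=False)]
    assign = np.argmax(unit @ centroids.T, axis=1)
    pairs = set()
    for c in range(n_clusters):
        idx = np.flatnonzero(assign == c)
        pairs.update(combinations(map(int, idx), 2))
    return pairs

# Recall over true duplicate pairs grows monotonically with the number of
# clusterings, since the union of candidate pairs can only get larger.
rng = np.random.default_rng(2)
data = rng.normal(size=(50, 16))
single = candidate_pairs(data, n_clusters=10, seed=0)
union = set()
for seed in range(5):
    union |= candidate_pairs(data, n_clusters=10, seed=seed)
print(single <= union)  # → True
```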

Surprisingly, almost a quarter of our dataset was removed by deduplication. When we looked at the near-duplicate pairs that were found, many of them included meaningful changes. Recall the clock example from above: the dataset might include many images of the same clock at different times of day. While these images are likely to make the model memorize this particular clock’s appearance, they might also help the model learn to distinguish between times of day on a clock. Given how much data was removed, we were worried that removing images like this might have hurt the model’s performance.

To test the effect of deduplication on our models, we trained two models with identical hyperparameters: one on the full dataset, and one on the deduplicated version of the dataset. To compare the models, we used the same human evaluations we used to evaluate our original GLIDE model. Surprisingly, we found that human evaluators slightly preferred the model trained on deduplicated data, suggesting that the large amount of redundant images in the dataset was actually hurting performance.

Once we had a model trained on deduplicated data, we reran the regurgitation search we had previously done over 50k prompts from the training dataset. We found that the new model never regurgitated a training image when given the exact prompt for the image from the training dataset. To take this test another step further, we also performed a nearest neighbor search over the entire training dataset for each of the 50k generated images. This way, we thought we might catch the model regurgitating a different image than the one associated with a given prompt. Even with this more thorough check, we never found a case of image regurgitation.
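A brute-force form of that second, more thorough check is easy to express over embeddings (an illustrative sketch; at the scale described above this would be a distributed nearest-neighbor search rather than a single matrix product):

```python
import numpy as np

def nearest_training_image(gen_embs, train_embs):
    """For each generated image, the index of its most similar training
    image and the similarity score (brute-force over the whole set)."""
    gen = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    train = train_embs / np.linalg.norm(train_embs, axis=1, keepdims=True)
    sims = gen @ train.T                       # (n_generated, n_training)
    nn = sims.argmax(axis=1)
    return nn, sims[np.arange(len(gen_embs)), nn]

# Toy check: generation 2 copies training image 40, so the search flags it
# even though image 40 may belong to a completely different prompt.
rng = np.random.default_rng(3)
train_embs = rng.normal(size=(100, 32))
gen_embs = rng.normal(size=(5, 32))
gen_embs[2] = train_embs[40] + 1e-3
nn, score = nearest_training_image(gen_embs, train_embs)
print(nn[2])  # → 40
```

Unlike the per-prompt comparison, this catches the case where the model regurgitates a different training image than the one paired with the prompt.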



We’re bringing the Financial Times’ world-class journalism to ChatGPT

Editor’s note: This news was originally shared by the Financial Times and can be read here.  

The Financial Times today announced a strategic partnership and licensing agreement with OpenAI, a leader in artificial intelligence research and deployment, to enhance ChatGPT with attributed content, help improve its models’ usefulness by incorporating FT journalism, and collaborate on developing new AI products and features for FT readers. 

Through the partnership, ChatGPT users will be able to see select attributed summaries, quotes and rich links to FT journalism in response to relevant queries. 

In addition, the FT became a customer of ChatGPT Enterprise earlier this year, purchasing access for all FT employees to ensure its teams are well-versed in the technology and can benefit from the creativity and productivity gains made possible by OpenAI’s tools. 

“This is an important agreement in a number of respects,” said FT Group CEO John Ridding. “It recognises the value of our award-winning journalism and will give us early insights into how content is surfaced through AI. We have long been a leader in news media innovation, pioneering the subscription model and engagement technologies, and this partnership will help to keep us at the forefront of developments in how people access and use information.” 

“The FT is committed to human journalism, as produced by our unrivalled newsroom, and this agreement will broaden the reach of that work, while deepening our understanding of reader demands and interests,” Ridding added. “Apart from the benefits to the FT, there are broader implications for the industry. It’s right, of course, that AI platforms pay publishers for the use of their material. OpenAI understands the importance of transparency, attribution, and compensation – all essential for us. At the same time, it’s clearly in the interests of users that these products contain reliable sources.” 

Brad Lightcap, COO of OpenAI, expressed enthusiasm about the evolving relationship with the Financial Times, stating: “Our partnership and ongoing dialogue with the FT is about finding creative and productive ways for AI to empower news organisations and journalists, and enrich the ChatGPT experience with real-time, world-class journalism for millions of people around the world.” 

“We’re keen to explore the practical outcomes regarding news sources and AI through this partnership,” said Ridding. “We value the opportunity to be inside the development loop as people discover content in new ways. As with any transformative technology, there is potential for significant advancements and major challenges, but what’s never possible is turning back time. It’s important for us to represent quality journalism as these products take shape – with the appropriate safeguards in place to protect the FT’s content and brand. 

We have always embraced new technologies and disruption, and we’ll continue to operate with both curiosity and vigilance as we navigate this next wave of change.”



Introducing more enterprise-grade features for API customers

To help organizations scale their AI usage without over-extending their budgets, we’ve added two new ways to reduce costs on consistent and asynchronous workloads:

  • Discounted usage on committed throughput: Customers with a sustained level of tokens per minute (TPM) usage on GPT-4 or GPT-4 Turbo can request access to provisioned throughput to get discounts ranging from 10–50% based on the size of the commitment.
  • Reduced costs on asynchronous workloads: Customers can use our new Batch API to run non-urgent workloads asynchronously. Batch API requests are priced at 50% off shared prices, offer much higher rate limits, and return results within 24 hours. This is ideal for use cases like model evaluation, offline classification, summarization, and synthetic data generation.
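A Batch API job starts from a JSONL file with one request per line; per the API documentation, each line carries a custom_id, method, url, and body. A minimal sketch of assembling that file (the prompts and model name here are illustrative):

```python
import json

def build_batch_input(prompts, model="gpt-4-turbo", path="batch_input.jsonl"):
    """Write one Batch API request per line: custom_id, method, url, body."""
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            request = {
                "custom_id": f"request-{i}",      # used to match results back
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                },
            }
            f.write(json.dumps(request) + "\n")
    return path

path = build_batch_input(["Summarize this article.", "Classify this review."])
lines = [json.loads(line) for line in open(path)]
print(len(lines), lines[0]["custom_id"])  # → 2 request-0
```

The file is then uploaded and a batch is created against it with a 24-hour completion window; see the Batch API documentation for the exact upload and creation calls.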


We plan to keep adding new features focused on enterprise-grade security, administrative controls, and cost management. For more information on these launches, visit our API documentation or get in touch with our team to discuss custom solutions for your enterprise.



Adopting safety by design principles


OpenAI, alongside industry leaders including Amazon, Anthropic, Civitai, Google, Meta, Metaphysic, Microsoft, Mistral AI, and Stability AI, has committed to implementing robust child safety measures in the development, deployment, and maintenance of generative AI technologies as articulated in the Safety by Design principles. This initiative, led by Thorn, a nonprofit dedicated to defending children from sexual abuse, and All Tech Is Human, an organization dedicated to tackling tech and society’s complex problems, aims to mitigate the risks generative AI poses to children. By adopting comprehensive Safety by Design principles, OpenAI and our peers are ensuring that child safety is prioritized at every stage in the development of AI. To date, we have made significant efforts to minimize the potential for our models to generate content that harms children and have set age restrictions for ChatGPT, and we actively engage with the National Center for Missing and Exploited Children (NCMEC), the Tech Coalition, and other government and industry stakeholders on child protection issues and enhancements to reporting mechanisms.

As part of this Safety by Design effort, we commit to:

  1. Develop: Develop, build, and train generative AI models that proactively address child safety risks.
     • Responsibly source our training datasets, detect and remove child sexual abuse material (CSAM) and child sexual exploitation material (CSEM) from training data, and report any confirmed CSAM to the relevant authorities.
     • Incorporate feedback loops and iterative stress-testing strategies in our development process.
     • Deploy solutions to address adversarial misuse.
  2. Deploy: Release and distribute generative AI models after they have been trained and evaluated for child safety, providing protections throughout the process.
     • Combat and respond to abusive content and conduct, and incorporate prevention efforts.
     • Encourage developer ownership in safety by design.
  3. Maintain: Maintain model and platform safety by continuing to actively understand and respond to child safety risks.
     • Remove new AIG-CSAM generated by bad actors from our platform.
     • Invest in research and future technology solutions.
     • Fight CSAM, AIG-CSAM, and CSEM on our platforms.

This commitment marks an important step in preventing the misuse of AI technologies to create or spread AI-generated child sexual abuse material (AIG-CSAM) and other forms of sexual harm against children. As part of the working group, we have also agreed to release progress updates every year.


