Data Mining for AI Training

Artificial intelligence is based on statistical models that are developed and adjusted through training. Training and validation require large amounts of high-quality data, especially for generative AI systems. Freely accessible content from the internet is often used for this purpose. So-called “scrapers” or “crawlers” collect training data at scale: they search the internet for content and copy the source code of the relevant websites together with any metadata. In a further step, this content is analyzed, structured and prepared for AI training.


This means that the content you intend to publish on your service may be used for AI training.


Please take the time to decide whether it is in Bayer’s interest for an AI model to be trained on the content of your service in this way. If you decide that AI models should not be trained on the content you intend to publish, here is what you can do to protect it:

  • Include the text block from the Terms of Use template in the Terms of Use of your service.

  • It is also advisable to publish the disclaimer on the website in a machine-readable form that tells web scrapers they are not wanted on your site. Several technical solutions are available; here are some you can consider implementing in your service:

    • The Robots Exclusion Standard (robots.txt), which has been in use since 1994. Take care not to block desired crawlers, such as those of search engines (e.g. Googlebot), from your own website: overly broad robots.txt rules are an easy mistake to make and can lead to a dramatic drop in search engine rankings;

    • The newer “TDM Reservation Protocol” (TDMRep). It is simple to implement in the HTML source code of the website, standardised and granular, and it does not carry the potential negative effects on search engine crawlers that the robots.txt solution does;

    • The rights protocol developed by the Coalition for Content Provenance and Authenticity (C2PA). The project is backed by Adobe, Arm, Intel, Microsoft and Truepic, among others, and aims to prevent the dissemination of misleading information. The protocol attaches metadata, the "manifest", to media files and signs it with a cryptographic key, so that changes to the data can be tracked. In particular, the manifest can specify whether data mining or the training of AI systems on this data is permitted;

    • The IPTC's RightsML standard. RightsML offers a data model in a machine-readable language that is based on the W3C's ODRL standard and has been adapted to the requirements of the media industry.

  • The most effective way to prevent crawlers and scrapers from collecting content from your service (including those that ignore the wishes expressed above) is to place the content behind a paywall, or at least behind a login or a click-through page where users must accept terms of service that prohibit the materials on the site from being used for AI training.
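
The first two machine-readable options above, robots.txt and the TDM Reservation Protocol, can be sketched as follows. This is a minimal Python sketch under stated assumptions, not an official template: the helper names (build_robots_txt, is_allowed, build_tdmrep_meta) are hypothetical, the crawler tokens (GPTBot, CCBot, Google-Extended) are examples that should be checked against each operator's current documentation, and the tdm-reservation and tdm-policy meta names should be verified against the current TDMRep specification. The sketch blocks the listed AI crawlers while leaving search engine crawlers such as Googlebot allowed, and checks the result with Python's standard urllib.robotparser module.

```python
import html
from urllib.robotparser import RobotFileParser

# Assumption: example tokens of AI-related crawlers. The list is illustrative
# and changes over time; check each operator's current documentation.
AI_CRAWLER_TOKENS = ["GPTBot", "CCBot", "Google-Extended"]


def build_robots_txt(blocked_agents):
    """Return robots.txt text that blocks only the listed agents."""
    lines = []
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    # All other crawlers, including search engines, remain allowed.
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, user_agent, url):
    """Evaluate robots.txt text with Python's standard parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def build_tdmrep_meta(policy_url=None):
    """Return HTML meta tags that reserve text and data mining rights.

    Assumption: meta names as understood from the TDMRep specification;
    policy_url is an optional, hypothetical link to a licensing policy.
    """
    tags = ['<meta name="tdm-reservation" content="1">']
    if policy_url:
        escaped = html.escape(policy_url, quote=True)
        tags.append(f'<meta name="tdm-policy" content="{escaped}">')
    return "\n".join(tags)


if __name__ == "__main__":
    robots_txt = build_robots_txt(AI_CRAWLER_TOKENS)
    print(robots_txt)
    # Sanity check before deployment: search engines in, AI crawlers out.
    for agent in ["Googlebot", "GPTBot"]:
        print(agent, is_allowed(robots_txt, agent, "https://example.com/page"))
    # Placeholder policy URL, for illustration only.
    print(build_tdmrep_meta("https://example.com/tdm-policy.json"))
```

With the generated file, is_allowed returns True for "Googlebot" and False for "GPTBot". Running such a check before publishing a robots.txt helps to avoid accidentally blocking search engines.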