Value-Laden Areas for Standardisation in the AI Act
A note from Michael Veale, UCL (m.veale@ucl.ac.uk)
As I have previously noted, the Commission's proposed AI Act gives a significant role to standards bodies, likely CEN and CENELEC, in implementing its essential requirements; I have written about that before here. I have often been asked recently to give examples of the value-laden choices a standardisation body would have to make. Below I select some articles from the Commission's AI Act proposal to analyse.
Article 9: Risk Management System
Article 9
Risk management system
1. A risk management system shall be established, implemented, documented and maintained in relation to high-risk AI systems.
2. The risk management system shall consist of a continuous iterative process run throughout the entire lifecycle of a high-risk AI system, requiring regular systematic updating. It shall comprise the following steps:
(a) identification and analysis of the known and foreseeable risks associated with each high-risk AI system;
(b) estimation and evaluation of the risks that may emerge when the high-risk AI system is used in accordance with its intended purpose and under conditions of reasonably foreseeable misuse;
(c) evaluation of other possibly arising risks based on the analysis of data gathered from the post-market monitoring system referred to in Article 61;
(d) adoption of suitable risk management measures in accordance with the provisions of the following paragraphs.
3. The risk management measures referred to in paragraph 2, point (d) shall give due consideration to the effects and possible interactions resulting from the combined application of the requirements set out in this Chapter 2. They shall take into account the generally acknowledged state of the art, including as reflected in relevant harmonised standards or common specifications.
4. The risk management measures referred to in paragraph 2, point (d) shall be such that any residual risk associated with each hazard as well as the overall residual risk of the high-risk AI systems is judged acceptable, provided that the high-risk AI system is used in accordance with its intended purpose or under conditions of reasonably foreseeable misuse. Those residual risks shall be communicated to the user.
In identifying the most appropriate risk management measures, the following shall be ensured:
(a) elimination or reduction of risks as far as possible through adequate design and development;
(b) where appropriate, implementation of adequate mitigation and control measures in relation to risks that cannot be eliminated;
(c) provision of adequate information pursuant to Article 13, in particular as regards the risks referred to in paragraph 2, point (b) of this Article, and, where appropriate, training to users.
In eliminating or reducing risks related to the use of the high-risk AI system, due consideration shall be given to the technical knowledge, experience, education, training to be expected by the user and the environment in which the system is intended to be used.
5. High-risk AI systems shall be tested for the purposes of identifying the most appropriate risk management measures. Testing shall ensure that high-risk AI systems perform consistently for their intended purpose and they are in compliance with the requirements set out in this Chapter.
6. Testing procedures shall be suitable to achieve the intended purpose of the AI system and do not need to go beyond what is necessary to achieve that purpose.
7. The testing of the high-risk AI systems shall be performed, as appropriate, at any point in time throughout the development process, and, in any event, prior to the placing on the market or the putting into service. Testing shall be made against preliminarily defined metrics and probabilistic thresholds that are appropriate to the intended purpose of the high-risk AI system.
8. When implementing the risk management system described in paragraphs 1 to 7, specific consideration shall be given to whether the high-risk AI system is likely to be accessed by or have an impact on children.
9. For credit institutions regulated by Directive 2013/36/EU, the aspects described in paragraphs 1 to 8 shall be part of the risk management procedures established by those institutions pursuant to Article 74 of that Directive.
Firstly, it should be noted that the rights and freedoms to which an AI system can pose a risk under EU law alone are wide. Page 11 of the Commission proposal states:
With a set of requirements for trustworthy AI and proportionate obligations on all value chain participants, the proposal will enhance and promote the protection of the rights protected by the Charter: the right to human dignity (Article 1), respect for private life and protection of personal data (Articles 7 and 8), non-discrimination (Article 21) and equality between women and men (Article 23). It aims to prevent a chilling effect on the rights to freedom of expression (Article 11) and freedom of assembly (Article 12), to ensure protection of the right to an effective remedy and to a fair trial, the rights of defence and the presumption of innocence (Articles 47 and 48), as well as the general principle of good administration. Furthermore, as applicable in certain domains, the proposal will positively affect the rights of a number of special groups, such as the workers’ rights to fair and just working conditions (Article 31), a high level of consumer protection (Article 28), the rights of the child (Article 24) and the integration of persons with disabilities (Article 26). The right to a high level of environmental protection and the improvement of the quality of the environment (Article 37) is also relevant, including in relation to the health and safety of people.
This is a staggeringly wide array of issues to identify risks to. Identifying risks is therefore clearly a value-laden task. Would such a standard create a typology of the risks that might exist in AI systems? If so, which risks would be included, and which excluded? The proposal is silent on which entities the risks should be considered as endangering. Individuals? Groups or communities? If the latter, which types of groups, and how should these be identified in practice? Would the standard choose some of these areas, would it choose all of them, or would it be entirely silent and let the provider decide which to even check or consider?
A standard would presumably be more specific about the appropriate methods for identifying risks. Some of these will be more inclusive than others. Do the proposed methods involve talking to stakeholders, for example? What type of skillset is needed for the individuals assessing risks? How will such risks be assessed with regard to the entire Union, for example when systems are designed in particular national contexts, where languages, institutions and culture may differ?
What types of risk would be included? For example, research has highlighted a difference between allocative harms, which affect outcomes relating to individuals, such as a decision on a loan or a job, and representational harms, where individuals are stereotyped or misrepresented within systems without any final effect on them needing to be shown, such as stereotyping within search or information retrieval systems, or the inability of text analysis systems to parse or generate certain dialects or vernaculars. What about the risk that the system does not function over time, or the risk arising from how AI systems interact with the organisations they are placed within? AI systems have complex interactions with their users, affected individuals and communities, and the wider institutions that surround them. For example, such systems can promote a "target culture" within an organisation, where the people generating input data, or responding to output data, change their own work in order to respond to the system.[2] Would this be a type of risk that would be considered, or would the language of a standard cast it out of the scope of the system being analysed?
This article requires that high-risk AI systems be tested in order to identify the most appropriate risk management measures. What kinds of test would be proposed? Given that these are engineering standards bodies, they are likely to fall back on tests relating to checking systems for bias or similar issues, discussed further below. But efforts to mitigate risks go beyond technical tests, and it remains an open question how far a standard would acknowledge that.
Article 10: Data Governance
Article 10
Data and data governance
1. High-risk AI systems which make use of techniques involving the training of models with data shall be developed on the basis of training, validation and testing data sets that meet the quality criteria referred to in paragraphs 2 to 5.
2. Training, validation and testing data sets shall be subject to appropriate data governance and management practices. Those practices shall concern in particular,
(a) the relevant design choices;
(b) data collection;
(c) relevant data preparation processing operations, such as annotation, labelling, cleaning, enrichment and aggregation;
(d) the formulation of relevant assumptions, notably with respect to the information that the data are supposed to measure and represent;
(e) a prior assessment of the availability, quantity and suitability of the data sets that are needed;
(f) examination in view of possible biases;
(g) the identification of any possible data gaps or shortcomings, and how those gaps and shortcomings can be addressed.
3. Training, validation and testing data sets shall be relevant, representative, free of errors and complete. They shall have the appropriate statistical properties, including, where applicable, as regards the persons or groups of persons on which the high-risk AI system is intended to be used. These characteristics of the data sets may be met at the level of individual data sets or a combination thereof.
4. Training, validation and testing data sets shall take into account, to the extent required by the intended purpose, the characteristics or elements that are particular to the specific geographical, behavioural or functional setting within which the high-risk AI system is intended to be used.
5. To the extent that it is strictly necessary for the purposes of ensuring bias monitoring, detection and correction in relation to the high-risk AI systems, the providers of such systems may process special categories of personal data referred to in Article 9(1) of Regulation (EU) 2016/679, Article 10 of Directive (EU) 2016/680 and Article 10(1) of Regulation (EU) 2018/1725, subject to appropriate safeguards for the fundamental rights and freedoms of natural persons, including technical limitations on the re-use and use of state-of-the-art security and privacy-preserving measures, such as pseudonymisation, or encryption where anonymisation may significantly affect the purpose pursued.
6. Appropriate data governance and management practices shall apply for the development of high-risk AI systems other than those which make use of techniques involving the training of models in order to ensure that those high-risk AI systems comply with paragraph 2.
This article requires datasets to be "relevant, representative, free of errors and complete". It is first worth noting, regardless of any further amendments decided upon, that "free of errors" has been widely misinterpreted in earlier commentary: recital 44 of the proposal has always clarified that data should be free of errors "in view of the intended purpose of the system".
Firstly, what counts as "appropriate" data collection and similar processes? Will a standard elaborate on this, or will it simply reduce it to compliance with other laws? AI systems require specific types of data collection. They are often experimental in nature, requiring dynamic data collection over time through methods such as A/B testing, which turn social domains into experiments. Would that be discussed in the standard, given that these are high-risk areas? Would labelling in a high-risk area require any expertise? How many of these criteria would the organisation be required to know and validate itself if it were using a second-hand dataset? Would it be prohibited from using a dataset if it could not evidence that data collection was "appropriate", or would the standard decide, in a way the law has not mentioned, to allow providers a lower level of assurance?
"(T)he formulation of relevant assumptions" is an inherently political activity. It requires a political vision of how certain factors in the world link together. What makes a criminal? What makes a good essay? What makes an uncreditworthy person? Some say that machine learning reduces the needs for assumptions, but this is not true. The type of data that goes into a machine learning system is inherently mapping onto some phenomenon of interest. When we use "grades", we assume they mean "achievement". They do not, at least not always. When we use "past crimes", we measn "past observed crime". When we build a system that inherits financial, social, economic, textual data — or frankly whatever data you like — we make assumptions. What level of evidence or satisfaction is needed for these assumptions? At what point, when they are shown to be wrong, do they fail to be "relevant". In whose eyes is "relevance" analysed from — the provider? The user? The affected individuals? From scientific research in the area?
The "examination in view of possible biases" is the most obviously value-laden section. Bias and discrimination can be measured in many ways, including methods that are more shallow, methods that are focussed on legal discrimination definitions, methods that looks for intersectional bias at the crossover of two or more characteristics, and more. Ideas of desert, and worthiness, and merit, all feed into this question. It will be a hugely value-laden question to answer, but it is important not to be distracted and think it is the core of the value-laden aspects of standardisation in the AI Act — it is just the most visible.
The provider must look at "the characteristics or elements that are particular to the specific geographical, behavioural or functional setting within which the high-risk AI system is intended to be used". Given that they are the provider, not the user, they do not have a clear idea of the intended context. How far should they go? If new issues are flagged to them, does their system no longer meet the standard until those concerns are resolved? How should they gather information on geographical, behavioural or functional settings? Do they need to investigate these contexts in qualitative and quantitative ways? Do they need to run user studies to understand them? What kind of geographic expertise is acceptable?
In general, this section says nothing about the rigour or quality of evidence needed to analyse the datasets or systems based on them.
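Even a deceptively simple "representativeness" check illustrates the gap. The sketch below is a hypothetical one I have invented: the age bands, the reference figures and the tolerance are all assumptions, not anything in the proposal. It forces two contested choices that the law leaves open: which reference population to compare against, and how much deviation to tolerate.

```python
# Hypothetical sketch of a "representativeness" check. The reference figures
# and the 0.05 tolerance are invented; choosing them is itself value-laden.

training_shares = {"18-30": 0.45, "31-50": 0.40, "51+": 0.15}

# Which reference population? The EU as a whole? The Member State of
# deployment? The provider's expected user base? Each gives different answers.
reference_shares = {"18-30": 0.25, "31-50": 0.40, "51+": 0.35}

TOLERANCE = 0.05  # hypothetical threshold

for group, ref in reference_shares.items():
    gap = training_shares.get(group, 0.0) - ref
    flag = "OK" if abs(gap) <= TOLERANCE else "UNDER/OVER-REPRESENTED"
    print(f"{group}: training {training_shares.get(group, 0.0):.2f} "
          f"vs reference {ref:.2f} -> {flag}")
```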
Article 12: Record-keeping
Article 12
Record-keeping
1. High-risk AI systems shall be designed and developed with capabilities enabling the automatic recording of events (‘logs’) while the high-risk AI systems is operating. Those logging capabilities shall conform to recognised standards or common specifications.
2. The logging capabilities shall ensure a level of traceability of the AI system’s functioning throughout its lifecycle that is appropriate to the intended purpose of the system.
3. In particular, logging capabilities shall enable the monitoring of the operation of the high-risk AI system with respect to the occurrence of situations that may result in the AI system presenting a risk within the meaning of Article 65(1) or lead to a substantial modification, and facilitate the post-market monitoring referred to in Article 61.
4. For high-risk AI systems referred to in paragraph 1, point (a) of Annex III, the logging capabilities shall provide, at a minimum:
(a) recording of the period of each use of the system (start date and time and end date and time of each use);
(b) the reference database against which input data has been checked by the system;
(c) the input data for which the search has led to a match;
(d) the identification of the natural persons involved in the verification of the results, as referred to in Article 14 (5).
This is a standard within a standard, indicating that logging shall conform to 'recognised standards or common specifications'. A challenge here is the level of data collection. Given the huge array of risks highlighted above, monitoring them calls for large-scale data collection. How can the right to good administration be monitored for any risks posed to it, for example, without collecting data that enables analysis of whether the administrative processes are indeed 'good'? A concern is that this provision could provide a legal basis for huge data collection in order to analyse and improve AI systems. Furthermore, the logging standards actually chosen and recognised here are crucial in setting the floor.
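To show how thin the statutory minimum is, here is a sketch of a log record containing only the fields Article 12(4) requires for Annex III point 1(a) systems. The field names and example values are my own illustrative assumptions; everything else, such as retention, granularity, what counts as a 'use', and whether raw inputs are stored, is left to whichever logging standard is recognised.

```python
# Sketch of a record covering only the Article 12(4) minimum fields for
# Annex III point 1(a) (remote biometric identification) systems.
# Field names and example values are illustrative assumptions, not a standard.

from dataclasses import dataclass
from datetime import datetime
from typing import List

@dataclass
class BiometricUseLogRecord:
    use_start: datetime              # Art. 12(4)(a): start of each use
    use_end: datetime                # Art. 12(4)(a): end of each use
    reference_database: str          # Art. 12(4)(b): database checked against
    matched_input_refs: List[str]    # Art. 12(4)(c): input data that led to a match
    verifying_persons: List[str]     # Art. 12(4)(d): persons verifying results (Art. 14(5))

record = BiometricUseLogRecord(
    use_start=datetime(2021, 9, 1, 9, 0),
    use_end=datetime(2021, 9, 1, 9, 12),
    reference_database="watchlist-v3",     # illustrative identifier
    matched_input_refs=["frame-000183"],   # a reference, not the raw biometric data
    verifying_persons=["officer-214", "officer-377"],
)
print(record)
```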
Article 14: Human Oversight
Article 14
Human oversight
1. High-risk AI systems shall be designed and developed in such a way, including with appropriate human-machine interface tools, that they can be effectively overseen by natural persons during the period in which the AI system is in use.
2. Human oversight shall aim at preventing or minimising the risks to health, safety or fundamental rights that may emerge when a high-risk AI system is used in accordance with its intended purpose or under conditions of reasonably foreseeable misuse, in particular when such risks persist notwithstanding the application of other requirements set out in this Chapter.
3. Human oversight shall be ensured through either one or all of the following measures:
(a) identified and built, when technically feasible, into the high-risk AI system by the provider before it is placed on the market or put into service;
(b) identified by the provider before placing the high-risk AI system on the market or putting it into service and that are appropriate to be implemented by the user.
4. The measures referred to in paragraph 3 shall enable the individuals to whom human oversight is assigned to do the following, as appropriate to the circumstances:
(a) fully understand the capacities and limitations of the high-risk AI system and be able to duly monitor its operation, so that signs of anomalies, dysfunctions and unexpected performance can be detected and addressed as soon as possible;
(b) remain aware of the possible tendency of automatically relying or over-relying on the output produced by a high-risk AI system (‘automation bias’), in particular for high-risk AI systems used to provide information or recommendations for decisions to be taken by natural persons;
(c) be able to correctly interpret the high-risk AI system’s output, taking into account in particular the characteristics of the system and the interpretation tools and methods available;
(d) be able to decide, in any particular situation, not to use the high-risk AI system or otherwise disregard, override or reverse the output of the high-risk AI system;
(e) be able to intervene on the operation of the high-risk AI system or interrupt the system through a “stop” button or a similar procedure.
5. For high-risk AI systems referred to in point 1(a) of Annex III, the measures referred to in paragraph 3 shall be such as to ensure that, in addition, no action or decision is taken by the user on the basis of the identification resulting from the system unless this has been verified and confirmed by at least two natural persons.
Explanation systems are value-laden in important ways that these provisions do not elaborate. For example, when two or more explanations are equally plausible, which one should be chosen? This is a serious area of research and consideration.
In general, when AI systems allow users to input new data sources, building explanations can be extremely tricky. Such data may not have human-interpretable meaning; and even if it does, the provider does not have an advance key to what each variable means. Where users can bring their own data, what are the obligations of providers?
Furthermore, a major value-laden task here is identifying the acceptable level of uncertainty for the purposes of interpretability. If a system is "unsure" about its output, how "unsure" is "unsure"? Is a confidence level of 80% enough to warrant a red warning? In what situations? If standards bodies make statements on calibration, they are making a very value-laden decision. If they leave it up to providers, they are not really placing a standard on the market at all.
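As a sketch of how arbitrary this can be in practice (the 0.80 cut-off, the two warning levels and the example scores are all my own invented assumptions), consider how the same model output is presented to a human overseer depending on where the threshold sits.

```python
# Hypothetical sketch: the 0.80 threshold and warning levels are invented,
# and whoever fixes them (a standard, or each provider) makes a value-laden call.

def oversight_flag(confidence: float, red_threshold: float = 0.80) -> str:
    """Decide what the human overseer sees, based only on reported confidence."""
    if confidence < red_threshold:
        return "RED: treat output as unreliable, manual review required"
    return "GREEN: output shown without warning"

# Near-identical scores fall either side of the line. A poorly calibrated model
# may also report 0.85 while being correct far less than 85% of the time,
# which no threshold on its own fixes.
for score in (0.79, 0.81, 0.95):
    print(score, "->", oversight_flag(score))
```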
Article 15: Accuracy, Robustness and Cybersecurity
Article 15
Accuracy, robustness and cybersecurity
1. High-risk AI systems shall be designed and developed in such a way that they achieve, in the light of their intended purpose, an appropriate level of accuracy, robustness and cybersecurity, and perform consistently in those respects throughout their lifecycle.
2. The levels of accuracy and the relevant accuracy metrics of high-risk AI systems shall be declared in the accompanying instructions of use.
3. High-risk AI systems shall be resilient as regards errors, faults or inconsistencies that may occur within the system or the environment in which the system operates, in particular due to their interaction with natural persons or other systems.
The robustness of high-risk AI systems may be achieved through technical redundancy solutions, which may include backup or fail-safe plans.
High-risk AI systems that continue to learn after being placed on the market or put into service shall be developed in such a way to ensure that possibly biased outputs due to outputs used as an input for future operations (‘feedback loops’) are duly addressed with appropriate mitigation measures.
4. High-risk AI systems shall be resilient as regards attempts by unauthorised third parties to alter their use or performance by exploiting the system vulnerabilities.
The technical solutions aimed at ensuring the cybersecurity of high-risk AI systems shall be appropriate to the relevant circumstances and the risks.
The technical solutions to address AI specific vulnerabilities shall include, where appropriate, measures to prevent and control for attacks trying to manipulate the training dataset (‘data poisoning’), inputs designed to cause the model to make a mistake (‘adversarial examples’), or model flaws.
The term "accuracy" is extremely value-laden. Accuracy itself refers to a specific measure, and so it is unlikely to survive with that term in the final piece, with the more generic phrase usually described as "performance". However, even with that generic term, one has to decide exactly what kind of performance metrics are permissible. Some metrics look at performance on subgroups, for example, and penalise systems that perform basly on certain groups of people. What an "appropriate" level is seems difficult to define, given that the performance has to be "declared" — is there an impermissibly low performance rate?
Relatedly, both performance and robustness can differ considerably in real-life conditions, as paragraph 3 acknowledges. A standard could state that systems have to be trialled, piloted, or subjected to a setup which simulates real-life difficulties; or it could indicate that "in vitro" rather than "in vivo" testing is sufficient.
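The gap between the two can be sketched very simply. The toy classifier, the data and the noise model below are all invented assumptions; the point is only that a standard has to say which of the two numbers counts.

```python
# Hypothetical sketch of "in vitro" versus simulated "in vivo" testing.
# The model, data and noise are invented; only the contrast matters.
import random
from statistics import mean

random.seed(0)

def classifier(income: float) -> int:
    """Toy model: approve (1) if reported income is above a cut-off."""
    return 1 if income >= 30_000 else 0

# Clean "in vitro" test set: (income, true_label)
clean = [(20_000, 0), (25_000, 0), (35_000, 1), (50_000, 1), (60_000, 1)]

def with_measurement_noise(rows, scale=10_000):
    """Simulated deployment conditions: noisy, self-reported income."""
    return [(income + random.uniform(-scale, scale), label) for income, label in rows]

def accuracy(rows):
    return mean(1 if classifier(x) == y else 0 for x, y in rows)

print("in vitro accuracy:", accuracy(clean))
print("simulated in vivo accuracy:", accuracy(with_measurement_noise(clean)))
```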
In general, this part will be challenging to standardise because in many of these areas there are no solid, reliable, state-of-the-art methods mature enough to deploy. As a result, the standard risks being weak, and that weakness risks being entrenched going forward.
General concluding thoughts
This was a rapid look at just some of the value-laden areas in this chapter of the AI Act proposal. A standard might, in effect, simply copy the chapter and provide little more specificity. If the Commission decides to make one standard for all AI systems, this seems relatively inevitable. It also seems ridiculous: the same standard for a system helping judges as for one monitoring the security of a telecoms network, or engaging in lie detection? But the more specific you get, the more value-laden choices have to be made. Even leaving matters up to providers to decide is a value-laden choice, given that the standard reverses the burden of proof and offers a presumption of conformity. Standards bodies are extremely unrepresentative of broader, non-technical interests, and not even especially balanced when it comes to technical ones. This proposal needs significant rethinking, particularly regarding the delegation of core policy issues.