Ruthenium complexes comprising nitrogen donor ligands: synthesis and investigation of their cytotoxicity potential

Artificial Intelligence Review (2025) 58:238 https://doi.org/10.1007/s10462-025-11242-6 Rethinking and recomputing the value of machine learning models Burcu Sayin1 · Jie Yang2 · Xinyue Chen2 · Andrea Passerini1 · Fabio Casati1,3 Accepted: 17 April 2025 / Published online: 8 May 2025 © The Author(s) 2025 Abstract In this paper, we argue that the prevailing approach to training and evaluating machine learning models often fails to consider their real-world application within organizational or societal contexts, where they are intended to create beneficial value for people. We propose a shift in perspective, redefining model assessment and selection to emphasize integration into workflows that combine machine predictions with human expertise, particularly in scenarios requiring human intervention for low-confidence predictions. Traditional metrics like accuracy and f-score fail to capture the beneficial value of models in such hybrid settings. To address this, we introduce a simple yet theoretically sound “value” metric that incorporates task-specific costs for correct predictions, errors, and rejections, offering a practical framework for real-world evaluation. Through extensive experiments, we show that existing metrics fail to capture real-world needs, often leading to suboptimal choices in terms of value when used to rank classifiers. Furthermore, we emphasize the critical role of calibration in determining model value, showing that simple, well-calibrated models can often outperform more complex models that are challenging to calibrate. Keywords Machine learning · Hybrid intelligence · Selective classification · Costsensitive learning Burcu Sayin burcu.sayin@unitn.it Jie Yang j.yang-3@tudelft.nl Xinyue Chen xinyuechen223@gmail.com Andrea Passerini andrea.passerini@unitn.it Fabio Casati fabio.casati@servicenow.com 1 Department of Information Engineering and Computer Science, University of Trento, Via Sommarive 9, Povo, 38123 Trento, Italy 2 Department of Software Technology, TU Delft, Mekelweg 5, 2628 XE Delft, The Netherlands 3 Servicenow, Zurich, Switzerland 13 238 Page 2 of 23 B. Sayin et al. 1 Introduction Recently, a few position papers (Casati et al. 2021; Sayin et al. 2021a, 2021b; Gunel 2022; Sayin et al. 2022, 2023a, 2023b) have challenged the underlying assumptions of quality in Machine Learning (ML), particularly the overemphasis on accuracy-based metrics and various measures of calibration errors (i.e. the difference between a model’s predicted probabilities and the actual likelihood of its predictions being correct). At the heart of this stance, there are two observations: (i) ML models are almost always applied in hybrid human– machine settings, where the model can abstain or its prediction be rejected for insufficient confidence (i.e. the model’s estimation of the correctness of its prediction) as in Fig. 1, and (ii) the beneficial value of correct inferences, as well as the detrimental value of incorrect inferences and rejections, is determined by the use case, not by the model. In our experience (see Sect. 4.3), we have found that the majority of AI deployments in the enterprise consist of selective models or selective classifiers (Geifman and El-Yaniv 2017), which is more of a rule than an exception. An example where this commonly occurs is in customer support requests, where the goal is to identify the customer’s intent to trigger an automated request processing workflow if possible. Failing to comprehend the customer’s intent and resorting to human agents is not ideal. However, it’s even more problematic to misinterpret the customer’s intent and guide them down the wrong path toward a resolution. This is why intent classifications are filtered based on prediction confidence. How “good” or “useful” a model is therefore depends on the beneficial value it brings when inserted in ML solution workflows (Fig. 1). This beneficial value depends on how often the workflow rejects the predictions, on the correctness patterns of the predictions that are not rejected, and on the detrimental value of errors vs benefits of correct predictions. While in this paper we only marginally discuss the plethora of Large Language Models (LLMs), the problem is exactly the same if not worse: With generative AI the question of whether to show an answer or to withhold it is crucial and there are many things that can be wrong in an answer, the most common being hallucination. We add that the fact that some APIs do not reveal likelihood/confidence level makes model evaluation more difficult. Fig. 1 A typical implementation of ML models into an ML solution workflow involves using a rejection function that filters predictions based on a confidence threshold. This approach generally assumes that the classifier is trained independently of the rejection logic. However, this is not a necessity-the classifier can be designed to be aware of the associated costs, which may make it less “general” but more tailored to specific needs 13 Rethinking and recomputing the value of machine learning models Page 3 of 23 238 To some extent, all this is trivial. There is no inherent difficulty in developing use casebased value functions, selecting the best model from a set of well-performing models based on the value function, or evaluating a model’s performance across multiple value functions. Moreover, one could contend that accuracy metrics are a sufficient substitute for evaluating model improvements in data science, or for selecting models to deploy in an AI platform designed to meet specific use cases. Thus, the practical approach would be to choose the model with the best accuracy or F1 score and enable users to filter out predictions with a confidence level lower than a set threshold. Accuracy and similar metrics are easy to comprehend and do not require us to determine parameters such as the “cost of errors”, which can be difficult to estimate, especially when considering the use case. In this paper, we show that this reasoning is wrong. If we accept that classifiers are mostly applied as selective models, then the method we use to measure, compare, and even train models must change. The implications of models being almost always applied as selective classifiers are often neglected in the literature, and this is also reflected in model leaderboards. We also show that the simplicity of not having to choose a cost parameter is an illusion: when we use accuracy to compare models, i) we do implicitly choose a cost parameter, often without realizing it, and ii) this implicitly selected cost is probably one of the worst choices possible: that of setting the relative cost of errors to zero. Despite being counter-intuitive, we show that accuracy is a quality metric that may be selected when the consequences of model errors are not critical. When a model is likely to be used across multiple use cases, relying solely on accuracy-based metrics can have significant implications. Overall, we show that: ● Universal metrics used for model evaluation are poor indicators of model value, potentially leading to incorrect decisions such as choosing models with negative value; ● Metrics designed to account for cost-sensitive errors are also inappropriate as they fail to consider the reject option; ● Lack of calibration substantially affects model value, and poorly calibrated complex models can be outperformed by simple, decades-old models that are easier to calibrate; ● Operating in an out-of-distribution setting further reduces the reliability of standard performance metrics. It is worth underlining that the notion of value we introduce in this paper is not a radically different metric, but rather a combination of existing metrics, such as accuracy, detrimental value of errors and rejection rate, into a single measure accounting for the “value” of the predictor for a user. Importantly, the metric is normalized in such a way that a value of zero indicates a classifier that is completely useless, a negative value a classifier that is harmful (with respect to always ignoring it and resorting to the default path) and any value larger than zero indicates the gain that is obtained by using the classifier. The remainder of this paper is structured as follows: in Sect. 2, we review related works on our concept of model value. Then, in Sect. 3, we formalize this notion and introduce the rejection threshold maximizing value, along with its extension to the cost-sensitive setting where different errors have different costs. Section 4 presents our experimental analysis comparing our value metric with standard performance measures, while Sect. 5 offers our conclusions. 13 238 Page 4 of 23 B. Sayin et al. 2 Related work Selective classification. Mimicking the typical use of ML models in many practical applications, a number of approaches rely on the combination of an ML model making an initial prediction and a human annotator taking over when the model’s confidence is not high enough (Callaghan et al. 2018). Selective classifiers are specifically conceived for this use, by including a rejection mechanism to decide when to abstain from making a prediction. The literature on selective classifiers is extensive, encompassing a broad range of learning algorithms, including nearest-neighbor classifiers (Hellman 1970), SVM (Fumera and Roli 2002), and neural networks (Cordella et al. 1995; De Stefano et al. 2000; Geifman and ElYaniv 2017) (see Hendrickx et al. Hendrickx et al. 2021 for a recent survey). The effectiveness of this solution is, however, heavily dependent on the reliability of machine confidence, which has shown to be very poor, especially for deep learning (Balda et al. 2020; Guo et al. 2017). Classifier confidence. To effectively use a classifier (Jiang et al. 2018), it is important to understand its properties and have confidence in its individual predictions. The literature proposes various confidence-based methods, including measuring the entropy of the softmax predictions (Teerapittayanon et al. 2017), calculating trust scores based on the distance of samples to a calibration set (Jiang et al. 2018), determining a confidence threshold (via Shannon entropy Shannon 1948, Gini coefficient Bendel et al. 1989, or norm-based methods Ng 2004) that maximizes coverage for a given accuracy (Bukowski et al. 2021), and using semantics-preserving data transformation to estimate confidence (Bahat and Shakhnarovich 2020). Post-hoc recalibration is a popular strategy for improving classifier confidence, with techniques ranging from temperature scaling (Guo et al. 2017) to Dirichlet calibration (Kull et al. 2019 (see a recent survey by Filho et al. de Menezes e Silva Filho et al. 2021). However, as we will show in our experimental evaluation (see Sect. 4.2.3), it’s essential to complement these solutions with a proper value metric to assess the classifier’s beneficial value in real-world applications. Cost-sensitive learning addresses the challenge of training classifiers by considering the varying costs associated with different types of errors, particularly in scenarios with significant class imbalance (Elkan 2001; Ling and Sheng 2010; Thai-Nghe et al. 2010; Tu and Lin 2020; Charoenphakdee et al. 2021). Existing work includes (Tu and Lin 2020): (i) data-level approaches Ting (1998); Zadrozny et al. (2003) where the class distribution of training data is balanced via sampling methods, and (ii) algorithm-level approaches, that use a thresholding scheme Chai et al. (2004); Domingos (1999); Elkan (2001); Ling and Sheng (2010); Sayin et al. (2021); Sheng and Ling (2006); Suri (2022) to improve the prediction performance on the minority class (e.g. in binary classification, the threshold is set such that the prediction is 1 only if the expected cost associated with this prediction is lower than or equal to that of predicting 0). Although this line of work is closely related to our setting as it also considers the impact of errors on the downstream pipeline, it assumes that the classifier provides a prediction for every instance without any rejection mechanism. This assumption can significantly impact the evaluation of the resulting classifier’s quality, as our experimental evaluation will demonstrate (see Sect. 4.2.2). Finally, Charoenphakdee et al. (2021) introduces a novel approach to classification with rejection option by training an ensemble of cost-sensitive classifiers. In contrast, our goal is not to develop a novel cost-sensitive classifier. Instead, we aim to introduce a metric designed to evaluate such classifiers. 13 Rethinking and recomputing the value of machine learning models Page 5 of 23 238 Hybrid Human-AI systems aim at solving classification problems with humans and machines (Dellermann et al. 2019a, 2019b; Raghu et al. 2019; Wilder et al. 2021), but effectively combining human and machine intelligence has many challenges. For example, trust in humans requires a deep understanding of how to design crowdsourcing tasks and model their complexity (Gadiraju et al. 2017; Qarout et al. 2018; Wu and Quinn 2017; Yang et al. 2016), test and filter crowd workers (Bragg et al. 2016), aggregate results into a decision (Han et al. 2020; Kamar et al. 2012; Krivosheev et al. 2018; Li 2013; Liu et al. 2013; Whitehill et al. 2009; Zhou et al. 2012), improve the engagement (Han et al. 2019, 2021); Qiu et al. 2020), or leverage crowds to learn features of ML models (Cheng and Bernstein 2015; Rodriguez et al. 2014). Furthermore, the effective aggregation of human and machine decisions (Nagar and Malone 2011, 2012; Nuñez 2022) depends on many factors, such as training, explaining, sustaining, interacting, and amplifying. The value metric is defined in the context of hybrid human-AI systems, where humans intervene whenever the AI defers a decision due to low confidence in its prediction. This metric accounts for the value of deferral, along with the impact of both correct and incorrect machine predictions, in assessing the overall value of the system. We believe that defining appropriate measures of the beneficial value of the joint human–machine system is a major prerequisite to keep research in the field on the right course. 3 Measuring model “value” In this section, we formally define the notion of model “value”, and show how thresholdbased selective classifiers, by far the most popular class of classifiers in practical ML workflows, can be adjusted to maximize value. 3.1 The setting Selective classifiers are ML models that generate output only when they are sufficiently confident in their prediction accuracy; otherwise, they abstain from making a decision, guided by a predefined rejection function. Selective classifiers can be implemented as follows: (a) We take a model f that outputs a prediction y and a confidence cy (or a vector c of confidence for a set of possible answers). Then, we filter the predictions to take only those above a certain confidence threshold (Fig. 2a). (b) The model f outputs predictions and confidence, but we apply a selector model s that decides whether to accept the prediction or not, based on features of the input x (Fig. 2b). (c) A hybrid of the two above cases is where the selector is a recalibrator r that can either take as input the prediction and confidence measure (feature-agnostic calibrator) or also the input features of x and adjust the confidence vector (feature-aware calibrator), typically applying threshold-based selection on the resulting confidence (Fig. 2c). (d) The model f is already trained to only output predictions that are “good enough” and includes an “I don’t know” class (Fig. 2d). The first case is the most common, at least in our experience (see Sect. 4.3). The second case is an extension and generalization of the first case in two ways: it can take features 13 238 Page 6 of 23 B. Sayin et al. Fig. 2 Common approaches to selectivity in classification: a filtering predictions based on a confidence threshold, b employing an input-based selector model to decide on prediction acceptance, c using a confidence recalibrator followed by threshold-based filtering, or d incorporating built-in abstention with an ‘I don’t know’ class as input (s can be trained as opposed to “just” being a formula), and it can filter based on any formula. It however requires some form of “training” or machine teaching, which is highly non-trivial. The recalibrator also typically requires some form of training. However, a feature-agnostic calibrator can be easily set up by post-hoc calibration strategies de Menezes e (Silva Filho et al. 2021), the most common being temperature scaling (Guo et al. 2017). Finally, the last case is what is being addressed by the recent literature on learning to reject (Hendrickx et al. 2021), which is currently confined to the academic world, but could greatly benefit from incorporating the notion of value that we introduce here. In this paper, we focus on classifiers that can integrate threshold-based filtering mechanisms, enabling the use of the “value” metric with any model capable of providing confidence scores alongside its predictions. In formalizing “value”, we will progressively make a few assumptions that (i) allow to simplify the presentation of the problem without altering the essence of the concepts, (ii) are reasonable in many if not most use cases, and (iii) make the definition of the value function easier to understand and interpret for the users who eventually have to deploy ML into their companies. We scope the conversation on classification problems as it makes it easy to ground the examples and terminology, and because it is easier to define a notion of accuracy. This is important: people understand accuracy because it is simple, and that has beneficial value even if accuracy is “inaccurate” as a metric, and most users will not be able to express complex value functions. Note however that our results also apply to other performance measures, like F1-score, as we will show in our experimental evaluation. 13 Rethinking and recomputing the value of machine learning models Page 7 of 23 238 3.2 Definition of value We have a classifier g that operates on test examples x ∈ D and returns either a predicted class y ∈ Y or a special label yr , denoting “rejection” of the prediction. Then, we can compute the average value per the prediction of applying a model g over D as follows: V (g, D) = ρVr + (1 − ρ)(αVc + (1 − α)Vw )(1) where ρ is the proportion of items in D that are rejected by g (classified as yr ). The term α denotes the accuracy of predictions that exceed the threshold. Vr refers to the value associated with rejecting an item, independently of the correctness of its prediction, and thus resorting to a default path, typically involving a human expert. Vc is the value of correctly classifying an item, which is only granted for non-rejected items. Finally, Vw is the value of an incorrect classification, which again is only granted (or rather, paid) for non-rejected items. Although these values can be expressed in monetary terms, such as dollars, we focus on their relative values to facilitate comparison between different models and learning strategies. We define the baseline scenario as one in ML is not utilized, or equivalently, where all predictions are rejected. We set this baseline value to 0 (Vr = 0), which simplifies the process of evaluating a model by determining (i) whether it improves upon the baseline, and (ii) whether adopting AI is beneficial for the specific problem at hand. V (g, D) = (1 − ρ)(αVc + (1 − α)Vw )(2) We also express Vw in terms of Vc , as in Vw = −kVc , where k is a constant telling us how bad is an error with respect to getting the correct prediction: V (g, D) = Vc (1 − ρ)(α − k(1 − α))(3) In the value formula, Vc acts as a scaling factor. When evaluating an AI-powered solution workflow, the specific magnitude of this factor is less critical. Instead, we consider the value relative to a unit of Vc dollars, effectively normalizing Vc to focus primarily on value. Thus, we can discuss value in terms of “value per dollar unit of rejection cost (detrimental value)” denoted as V ′ = V /Vc . To simplify further without deviating from the equations, we set Vc = 1. Therefore, we obtain: V (g, D) = (1 − ρ)(α − k(1 − α))(4) Eq. 4 embodies the same concepts as Equation 1, streamlining our presentation. 3.3 Filtering by threshold We now focus on the most common situation observed in practice; the model selectivity is applied by thresholding confidence values and rejecting predictions that have confidence cy less than a threshold τ (case (a) in Fig. 2). We are given a model m that processes items x ∈ D and returns a vector of confidences (one per class). This is the output of a softmax; 13 238 Page 8 of 23 B. Sayin et al. for each x, we consider the pair y, cy corresponding to the top-level prediction of m(x) and the confidence associated with the prediction. Given a threshold τ , we define a function s: s(y, cy , τ ) = { y yr if cy ≥ τ, otherwise. where yr is the special class label denoting “rejection” of the prediction. Our classifier g is therefore now expressed in terms of m and τ . This means that we can express the value as a function of m, D, τ . In a given use case, when we are given m and have knowledge of k, we select the threshold τ ∈ [0, 1] that optimizes V (g, D) (We assume τ is unique or we randomly pick one if not). Thus, we can express the value of our classification logic as a function of (m, D, k): V (m, D, k) = (1 − ρτ )(ατ − k(1 − ατ ))(5) Notice that τ can be set empirically on some tuning dataset D (it depends on m, D, k ), and ρτ and ατ reflect the proportions ρ and α given τ . However, if we are aware of the properties of confidence vectors, we can set τ regardless of D. For example, if we assume perfect calibration (where the expected accuracy for a prediction of confidence c is c) de Menezes e (Silva Filho et al. 2021), then we know that the threshold is at the point where the value of accepting a prediction is greater than zero, and ατ = τ . This means that to have V (m, D, k) > 0 we need τ − k + kτ > 0, which means τ > k/(k + 1)(6) This conforms to intuition: if k is large, it never makes sense to predict, better go with the default. If k=0 (no cost for errors), we might always predict since there is no penalty for applying inaccurate predictions. Perhaps paradoxically, this case where inaccurate predictions are harmless is when accuracy is the metric we want to use. If k=1 (errors are the mirror image of correct predictions), then our threshold is 0.5. Figure 3a shows how a simple threshold-based selector can be adapted to maximize model value. In most real-world settings, especially for complex models, the available classifier will not be perfectly calibrated. In these cases, the threshold can be chosen by either recalibrating the model first using existing recalibration approaches de Menezes e Silva Filho et al. (2021) and then applying Eq. (6), or directly maximizing Eq. (5) over a separate validation set before testing the classifier. We will evaluate both strategies in our experimental evaluation (see Sect. 4.2.3). In deriving the threshold, we initially assumed that all errors incur equal costs. However, we will next demonstrate how this derivation can be readily adapted to cost-sensitive settings. 3.4 Cost-sensitive value and thresholds In this section, we extend the discussion on the value and optimal threshold to the setting in which different errors have different costs (and possibly, different correct predictions have different beneficial values). We focus on the binary classification setting for simplicity, but the reasoning can be easily generalized to multiclass classification. In cost-sensitive learn- 13 Rethinking and recomputing the value of machine learning models Page 9 of 23 238 Fig. 3 Adapting selective classifiers to maximize value: a threshold-based selector, b cost-sensitive threshold-based selector; c recalibrator + threshold-based selector. Changes with respect to standard counterparts are highlighted in red ing, the standard approach is that of giving a specific cost to each type of error and correct prediction (in which case the "cost" is the benefit Ling and Sheng (2010). We adapt this strategy to the value case, by providing a specific value for each possible type of error and correct prediction. The cumulative value of a selective classifier g on a dataset D can be written as (setting Vr = 0 as in the cost-insensitive case): V (g, D) = (1 − ρ)(Ntp Vtp + Ntn Vtn + Nf p Vf p + Nf n Vf n ) where Ntp , Ntn , Nf p , Nf n are the numbers of true positives, true negatives, false positives, and false negatives in D, and Vtp , Vtn , Vf p , Vf n are the values associated to the corresponding predictions. Let Vc be the base cost for a correct prediction. This is typically associated with a correctly predicted negative instance, i.e., Vtn = Vc . We can define the other values as multiples of this base cost as follows: Vtp = ktp Vc , Vf p = −kf p Vc , Vf n = −kf n Vc for some user-defined and application-specific constants ktp , kf p , kf n . The cumulative value simplifies as: V (g, D) = (1 − ρ)(Ntp ktp Vc + Ntn Vc − Nf p kf p Vc − Nf n kf n Vc ) = (1 − ρ)Vc (ktp Ntp + Ntn − kf p Nf p − kf n Nf n ) Setting Vc = 1 (unit of value) as in the cost-insensitive case, we get: V (g, D) = (1 − ρ)(ktp Ntp + Ntn − kf p Nf p − kf n Nf n ) Let’s now focus on the standard setting of a classifier rejecting by threshold. Note that we need to set class-specific thresholds τp and τn for positive and negative predictions respec- 13 238 Page 10 of 23 B. Sayin et al. tively to account for the different costs. Consider an instance x predicted as positive by the classifier. Its expected value (according to the predictions in D) is given by: V (g, x) = (1 − ρ)(ktp Ntp /Np − kf p (Nf p /Np ) = (1 − ρ)(ktp Ntp /Np − kf p (1 − Ntp /Np )) = (1 − ρ)(Ntp /Np (ktp + kf p ) − kf p ) where we normalized Ntp and Nf p by Np , the number of positive instances in D, to turn them into probabilities, and we removed the terms containing Ntn and Nf n as their corresponding probabilities are zero if the instance is predicted as positive. If the classifier is perfectly calibrated, we know that Ntp /Np = τp . A positive value for the instance is thus achieved by setting τp as: τp > kf p (7) ktp + kf p Similarly, if x is predicted as negative by the classifier, it is expected value is given by: V (g, x) = (1 − ρ)(Ntn /Nn − kf n Nf n /Nn ) = (1 − ρ)(Ntn /Nn − kf n (1 − Ntn /Nn )) = (1 − ρ)(Ntn /Nn (1 + kf n ) − kf n ) where Nn is the number of negative instances in the training set. If the classifier is perfectly calibrated, we know that Ntn /Nn = τn . A positive value for the instance is thus achieved by setting τn as: τn > kf n (8) 1 + kf n Figure 3b shows how to adjust a threshold-based selector to maximize value in a costsensitive setting. We assumed a binary classification setting for simplicity, but the derivation can be easily extended to account for class-specific thresholds in multiclass classification. 4 Experiments We now explore how adopting a value-oriented perspective influences model evaluation and application. Specifically, we aim to address the following questions: Q1 Is model accuracy (or F1-score) a sensible indicator of the value of a model? Q2 Is cost-sensitive error a sensible indicator of the value of a model in cost-sensitive settings? Q3 How does calibration affect the value of a model? Q4 How does predicting in an out-of-distribution setting affect the value of a model? Our experimental evaluation is focused on NLP classification tasks, for which we analyze the behavior of simple as well as state-of-the-art models over various datasets, models, and text encoders. This choice stems from the broad diffusion of NLP models in companies, and 13 Page 11 of 23 238 Rethinking and recomputing the value of machine learning models from our experience (see Sect. 4.3) in industrial use cases that were all NLP-based. However, the concept of value can be applied to any ML model deployed in a practical application, and we believe that the main results of our experimental evaluation hold for many other domains. We refer the reader to our GitHub repo1 for the companion code. 4.1 Experimental Setup Datasets and Tasks Table 1 presents a summary of the characteristics of the datasets we employed and their corresponding classification tasks. Additional information is provided in the following. ● Hate-speech detection on Twitter. We replicated the original tests from Arango et al. (2019) where we analyzed two widely used models (Agrawal and Awekar 2018; Badjatiya et al. 2017) and tested them on the Waseem et al. Waseem and Hovy (2016) dataset. However, we could only recover 9668 of the tweets as of October 2021 (the dataset size is 14949 in the original paper). ● Clickbait detection. The Clickbait Challenge on the Webis Clickbait Corpus 20172 was classifying Twitter posts as a clickbait or not. Both training and test sets are publicly available3, while each team was free to choose a subset of the training set for validation (we followed the “blobfish” team). ● Multi-Domain Sentiment Analysis - and Dataset (MDS). Sentiment analysis based on a dataset for domain adaptation.4 The data includes four categories of Amazon products (DVD, Books, Electronics, and Kitchen). The task is to learn sentiment from one of these domains and test it on the others. Models and text encoders. For each task in our experiments, we use different models (see Table 2 and the accompanying code repository for details). Since we do not train models and use the validation set only to determine the optimal threshold, we do not perform standard cross-validation. The optimal threshold is selected by evaluating the model 1 https://github.com/burcusayin/value-of-ml-models/ Table 1 Statistics of the datasets used in the experiments Task Classifying tweets as “hate”, and “non-hate” (binary) Classify Twitter posts to detect clickbait (binary) Sentiment analysis on Amazon product reviews (3-class; positive, negative, and neutral) 2 https://webis.de/data/webis-clickbait-17.html. 3 https://zenodo.org/record/5530410#.YWcFtC8RrRV. 4 http://nlpprogress.com/english/domain_adaptation.html. Dataset Hate Speech Train/Val/Test size 7734/967/967 Clickbait 17600/4395/18979 MDS Electronics MDS DVD MDS Books MDS Kitchen 2000/200/3386 2000/200/4265 2000/200/5481 2000/200/5745 13 238 Page 12 of 23 Table 2 Models used in the experiments B. Sayin et al. Dataset Hate-speech detection Clickbait detection MDS Models Badjatiya et al. (2017), Agrawal and Awekar (2018) fullnetconc, weNet, lingNet, fullNet mttri (Ruder and Plank 2018) Google’s T5-base SieBERT LogR, MLP1, MLP4 GPT-3 Model details Leader-board models Leader-board models Leader-board Fine-tuned for sentiment analysis Fine-tuned RoBERTa-large From scikit-learn library Fine-tuned for sentiment analysis on the validation set across a range of candidate thresholds and choosing the one that maximizes performance. ● For the hate-speech dataset, we test the following leaderboard models: (i) Badjatiya et al. (2017) which uses an RNN to construct word embeddings and then classify them with Gradient-Boosted Decision Tree. In the original paper, test accuracy is measured as the average of the ten folds in cross-validation; however, in our reproduction, we separated validation and test set before cross-validation, and they are used for evaluation only after training. (ii) one model from Agrawal and Awekar (2018) which is composed of an embedding layer followed by a Bidirectional LSTM and a fully connected layer with softmax activation. ● For the clickbait detection dataset, we test 4 models from one leaderboard team on clickbait challenge: fullnetconc, weNet, lingNet, and fullNet which are published on Github.5 This team modified the task into binary classification - they categorized items with a score under 0.5 into “non-clickbaiting”, and vice versa. ● For the MDS dataset, we referred to the leaderboard for the sentiment analysis task of Domain adaptation6 and tested the best-performing leader-board model, Multi-task tri-training (mttri) by Ruder and Plank (2018), that leverages multi-task learning strategies to improve the performance of tri-training. As the source code of other competing approaches was not publicly available, we compared mttri with three baseline models from the scikit-learn library7: (i) a simple Logistic Regression model (LogR); (ii) a basic MLP with a single hidden layer (MLP1); (iii) an MLP with four hidden layers (MLP4). All models where tested with a simple TF-IDF encoding. 5 https://github.com/clickbait-challenge/blobfish. 6 nlpprogress.com/english/domain_adaptation.html. 7 https://scikit-learn.org/ 13 Rethinking and recomputing the value of machine learning models Page 13 of 23 238 4.2 Results 4.2.1 Q1: Accuracy and F1-score are poor indicators of model value We first investigate whether standard performance metrics, like accuracy and F1-score, are sensible indicators of the value of the model, and how this depends on the magnitude of the cost factor k. Following the simplification in Sect. 3.2, we set Vr = 0 and Vc = 1, and use the threshold in Eq. 6 to decide whether to accept or reject each prediction given a certain k. Table 3 report results in terms of accuracy, F1-score, and value for different values of k ∈ [0, 10]. As expected, the value of a model decreases substantially with the increase of the cost factor, with many models achieving negative value for larger values of k. Note that a model is useful only if its value exceeds 0; otherwise, it is deemed unnecessary, and the system can proceed without it. We want to stress that the cost factors we considered are fairly small and definitely realistic. For instance, setting k = 4 means that “being wrong is 4 times as bad” with respect to the advantage of being right. Many scenarios have values of k way more extreme (e.g., in medical decision support systems Sutton et al. 2020). Notice that accuracy corresponds to the case where we do not reject any predictions, which corresponds to setting k = 0, a rather unrealistic scenario. Another major finding is that accuracy is a quite poor proxy of value even in relative terms. Boldface numbers indicate the best performing model in terms of the different metTable 3 Accuracy and F1-score results compared with value computed for increasing values of the cost factor k Task Model Accuracy F1 Value k = 0 k = 1 k = 2 k = 4 k = 8 k = 10 Hate Speech Badj et al. 0.822 0.626 0.822 0.644 0.51 0.362 0.272 0.217 Agr et al 0.732 0.621 0.732 0.464 0.22 − 0.213 − 1.081 −1.499 Clickbait Fullnetconc 0.857 0.684 0.857 0.715 0.564 0.286 0.041 0.013 weNet 0.852 0.672 0.852 0.703 0.561 0.306 0.04 0.011 LingNet 0.82 0.565 0.82 0.64 0.442 0.079 0.0 0.0 FullNet 0.856 0.663 0.856 0.713 0.588 0.367 0.061 0.015 MDS Electronics LogReg 0.762 0.736 0.762 0.524 0.339 0.162 0.053 0.033 MLP1 0.749 0.711 0.749 0.497 0.327 0.18 0.081 0.062 MLP4 0.735 0.713 0.735 0.47 0.24 − 0.143 − 0.78 − 1.06 mttri 0.808 0.786 0.808 0.616 0.441 0.148 − 0.354 − 0.58 MDS DVD LogReg 0.74 0.739 0.74 0.48 0.283 0.122 0.038 0.027 MLP1 0.728 0.732 0.728 0.457 0.274 0.133 0.054 0.038 MLP4 0.72 0.724 0.72 0.439 0.202 − 0.158 − 0.737 −0.981 mttri 0.753 0.725 0.753 0.506 0.28 − 0.123 − 0.84 −1.166 MDS Books LogReg 0.704 0.678 0.704 0.408 0.228 0.102 0.022 0.015 MLP1 0.691 0.662 0.691 0.382 0.134 0.013 − 0.017 −0.013 MLP4 0.696 0.681 0.696 0.393 0.154 − 0.171 − 0.666 − 0.86 mttri 0.742 0.712 0.742 0.484 0.254 − 0.16 − 0.869 −1.215 MDS Kitchen LogReg 0.782 0.771 0.782 0.565 0.374 0.176 0.06 0.034 MLP1 0.765 0.752 0.765 0.53 0.337 0.164 0.07 0.044 MLP4 0.761 0.758 0.761 0.521 0.312 0.003 − 0.478 −0.685 mttri 0.821 0.832 0.821 0.642 0.489 0.235 − 0.192 −0.384 For each dataset and metric, the best performance is highlighted in bold 13 238 Page 14 of 23 B. Sayin et al. rics. It is clear that the best performing model is largely dependent on the cost factor, and that accuracy quickly becomes totally unreliable as a metric to identify the most appropriate model to employ. Replacing accuracy with F1-score does not change much. While we do observe substantially lower values for the unbalanced datasets (Hate Speech and Clickbait), the best performing model is unchanged almost everywhere. 4.2.2 Q2: Cost-sensitive error is a poor indicator of model value in cost-sensitive settings The previous evaluation assumed equal cost for the different types of error. This is however rarely the case in practical applications, where false negative errors (e.g., undiagnosed diseases) can be far more costly than false positive ones (i.e., false alarms). Section 3.4 shows how to adapt value to this cost-sensitive setting, and how to determine cost-sensitive thresholds that are specific for each predicted class. In the following we evaluate the value of models in this cost-sensitive setting. We replace accuracy and F1, which are clearly inappropriate in this setting, with cost-sensitive error Elkan (2001) a popular performance measure in the cost-sensitive learning literature. Cost-sensitive error is obtained by computing the weighted sum of errors, with the weights given by the corresponding cost, i.e. (Nf n kf n + Nf p kf p )/|D|, where we divide by |D| to remove the dependency on the size of the dataset. For simplicity, and consistently with common practice in the literature, we set kf p = 1 and vary kf n ∈ [1, 10]. Results are shown in Table 4. While cost-sensitive error identifies different best performing models for different values of the cost, in only one case (MDS Kitchen) it consistently agrees with value across the spectrum of costs. What is worse, for large values of kf n it often detects as best performing models that actually achieve negative value, making it a poor overall indicator of model value. The problem is not how it treats the costs of different errors, but in the fact that it does not assume a selective classifier and a corresponding cost-sensitive rejection threshold, which is the main practical contribution of our definition of value. This also implies that cost sensitive learning (He and Ma 2013), that aims at training classifiers to minimize (a certain notion of) cost-sensitive error, should be coupled with learning to reject mechanisms Hendrickx et al. (2021) in order to be fully effective in optimizing the value of the learned models. 4.2.3 Q3: Lack of calibration substantially affects model value The threshold in Eq. 6 assumes that models are perfectly calibrated, which is often far from being true for trained models, and deep learning models in particular (Guo et al. 2017). In order to evaluate the role of calibration in determining value of a model, we apply temperature scaling (Guo et al. 2017), a simple yet effective recalibration technique, to each model before applying the threshold (the resulting selector is shown in Fig. 3c). Table 5 reports the results in exactly the same setting as Table 3, but using recalibrated models. Notice that accuracy and F1-score are unchanged, as temperature scaling affects the confidence in the prediction but not how classes are being ranked. In terms of value, however, we observe an overall improvement, quite substantial for larger values of k. Note that the effectiveness of calibration in improving the model’s “value” depends on the accuracy of the calibrated model. The degenerate behaviour of models with negative values is almost completely eliminated, with “useless" models receiving a value of zero, as expected. These results suggest 13 Rethinking and recomputing the value of machine learning models Page 15 of 23 238 Table 4 Comparison between cost-sensitive error and value for different values of k = kf n (with kf p = 1) Task Model Cost-sensitive error Value k = 1 k = 2 k = 4 k = 8 k = 10k = 1 k = 2 k = 4 k = 8 k = 10 Hate Speech Badj et al. 0.178 0.297 0.535 1.01 1.248 0.644 0.545 0.389 0.315 0.278 Agr et al. 0.268 0.322 0.429 0.644 0.752 0.464 0.405 0.32 0.157 0.098 Clickbait fullnetconc 0.143 0.221 0.377 0.689 0.845 0.715 0.608 0.368 0.131 0.103 WeNet 0.148 0.228 0.388 0.707 0.867 0.703 0.604 0.381 0.124 0.094 LingNet 0.18 0.295 0.524 0.983 1.213 0.64 0.467 0.125 0.052 0.052 FullNet 0.144 0.234 0.416 0.779 0.961 0.713 0.631 0.446 0.15 0.103 MDS LogReg 0.238 0.406 0.742 1.413 1.749 0.524 0.442 0.355 0.293 0.282 Electronics MLP1 0.259 0.436 0.791 1.5 1.854 0.497 0.413 0.338 0.284 0.274 MLP4 0.254 0.418 0.745 1.4 1.727 0.47 0.33 0.09 − − 0.492 0.313 mttri 0.192 0.33 0.607 1.159 1.436 0.616 0.495 0.286 − − 0.245 0.085 MDS DVD LogReg 0.26 0.394 0.663 1.201 1.47 0.48 0.375 0.295 0.255 0.251 MLP1 0.271 0.404 0.67 1.203 1.469 0.457 0.36 0.298 0.26 0.251 MLP4 0.278 0.392 0.62 1.075 1.303 0.439 0.327 0.16 − − 0.193 0.089 mttri 0.247 0.412 0.744 1.406 1.737 0.506 0.352 0.072 − − 0.663 0.431 MDS Books LogReg 0.296 0.489 0.874 1.645 2.03 0.408 0.332 0.269 0.222 0.219 MLP1 0.303 0.492 0.87 1.627 2.005 0.382 0.272 0.197 0.18 0.183 MLP4 0.312 0.486 0.832 1.525 1.871 0.393 0.258 0.081 − − 0.283 0.183 mttri 0.258 0.45 0.834 1.603 1.987 0.484 0.32 0.018 − 0.52 − 0.789 MDS LogReg 0.218 0.345 0.599 1.108 1.363 0.565 0.466 0.365 0.306 0.295 Kitchen MLP1 0.242 0.375 0.64 1.171 1.436 0.53 0.433 0.339 0.292 0.279 MLP4 0.248 0.387 0.665 1.22 1.498 0.521 0.416 0.263 0.026 − 0.076 mttri 0.179 0.238 0.355 0.59 0.708 0.642 0.589 0.503 0.376 0.31 For each dataset and metric, the best performance is highlighted in bold that learning models should always be recalibrated before being incorporated in practical workflows. This does not mean that one can then resort on standard accuracy or F1-score to choose which model to employ. The best performing model is still largely dependent on the cost factor. Notice that in the domain adaptation scenarios (MDS tasks), simple logistic regression (LogReg) consistently outperforms all other models for large values of k. This result should not be unexpected. Logistic regression is known to be a well-calibrated model per-se Kull et al. (2017), and temperature scaling likely further improves this behaviour, while more complex models struggle to achieve comparable calibration with simple recalibration strategies. The lively research area of calibration in machine learning and especially deep learning can provide useful solutions to this problem de Menezes e (Silva Filho et al. 2021). 13 238 Page 16 of 23 B. Sayin et al. Table 5 Comparison between accuracy, F1-score and value for recalibrated models Task Model Accuracy F1 Value k=0 k=1 k=2 k=4 k = 8 k = 10 Hate Speech Badj et al. 0.822 0.626 0.822 0.644 0.513 0.359 0.268 0.218 Agr et al. 0.732 0.621 0.732 0.464 0.207 0.0 0.0 0.0 Clickbait Fullnetconc 0.857 0.684 0.857 0.715 0.608 0.488 0.374 0.331 WeNet 0.852 0.672 0.852 0.703 0.597 0.472 0.357 0.326 LingNet 0.82 0.565 0.82 0.64 0.499 0.348 0.173 0.115 FullNet 0.856 0.663 0.856 0.713 0.6 0.488 0.372 0.335 MDS Electronics LogReg 0.762 0.736 0.762 0.524 0.362 0.226 0.119 0.098 MLP1 0.745 0.711 0.745 0.491 0.33 0.174 0.096 0.062 MLP4 0.745 0.713 0.745 0.491 0.291 0.11 0.0 0.0 mttri 0.808 0.786 0.808 0.616 0.447 0.192 0.112 0.0 MDS DVD LogReg 0.74 0.739 0.74 0.48 0.315 0.17 0.09 0.062 MLP1 0.729 0.732 0.729 0.459 0.28 0.148 0.037 0.023 MLP4 0.722 0.724 0.722 0.443 0.235 0.056 0.0 0.0 mttri 0.753 0.725 0.753 0.506 0.292 0.08 0.0 0.0 MDS Books LogReg 0.704 0.678 0.704 0.408 0.234 0.111 0.01 0.001 MLP1 0.697 0.662 0.697 0.395 0.199 0.002 0.0 0.0 MLP4 0.688 0.681 0.688 0.375 0.095 0.0 0.0 0.0 mttri 0.742 0.712 0.742 0.484 0.264 − 0.011 0.0 0.0 MDS Kitchen LogReg 0.782 0.771 0.782 0.565 0.41 0.267 0.153 0.127 MLP1 0.758 0.752 0.758 0.515 0.345 0.197 0.096 0.011 MLP4 0.752 0.758 0.752 0.504 0.305 0.122 0.0 0.0 mttri 0.821 0.832 0.821 0.642 0.493 0.227 0.102 0.0 Comparison between accuracy, F1-score and value for recalibrated models. For each dataset and metric, the best performance is highlighted inbold 4.2.4 Q4: Operating in an out-of-distribution setting substantially affects model value The lack of calibration in machine learning models is known to be particularly harmful when the model operates in an out-of-distribution (OOD) setting (Tomani and Buettner 2019; Wu et al. 2022), and the results on the domain adaptation tasks in Table 5 confirm this issue. To better understand the role of the OOD setting in determining the value of models, we thus focused on the MDS tasks and complemented the set of models presented in Table 5 with some state-of-the-art transformer models, which should be less affected by the problem given the huge corpora on which they are trained. The transformer models that we employed are the following: ● Google’s T5-base8 Raffel et al. (2020) (12-layers, 768-hidden-state, 3072 feed-forward hidden-state, 12-heads, 220 M parameters) fine-tuned on IMDB dataset9 Maas et al. (2011) for sentiment analysis task. ● SieBERT10Heitmann et al. (2020): a fine-tuned version of RoBERTa-large11 model Liu 8 https://tinyurl.com/t5-base-finetuned-sentiment. 9 https://huggingface.co/datasets/stanfordnlp/imdb. 10 https://tinyurl.com/SieBERT-sentiment. 11 https://huggingface.co/FacebookAI/roberta-large. 13 Rethinking and recomputing the value of machine learning models Page 17 of 23 238 et al. (2019) (24-layer, 1024-hidden-state, 16-heads, 355 M parameters) for sentiment analysis task that is fine-tuned and evaluated on 15 diverse text sources. ● GPT-3 Brown et al. (2020). Since it is producing human-like text for a given input, we fine-tuned it using the OpenAI API12. First, we prepared the MDS dataset for GPT3; we cleaned sentences that have more than 2049 tokens, and renamed the text column as “prompt" and the ground truth column as “completion". Then, we used OpenAI API to fine-tune GPT-3 separately on each of the 4 domains (DVD, books, electronics, and kitchen). We specified “classif ication_n_classes" parameter as 2 and classif ication_positive_class as ‘1’ so that the API tunes GPT-3 for binary sentiment analysis. Fine-tuning 4 models on the MDS dataset costs a total of $7.15. In order to test the fine-tuned models on different target domains, we specified the prompt in the format of “sentence + -> " because the API itself uses “ ->" sign to teach GPT-3 that the sentiment for a prompt is (‘ ->’) the completion. Thus, fine-tuned GPT-3 models produce either 0 or 1 for the given input. Testing each fine-tuned model on the other 3 domains (so, 12 cases in total) costs $43.89. We provide our source code on Github13 to show every step of using GPT-3 in our experiments. Table 6 reports the results of all models on the MDS tasks. As expected, large pre-trained language models tend to perform well across the board. This can be due to two reasons (besides the models being very powerful): (i) we know that very large models with very large train datasets are reasonably well calibrated (e.g. Jiang et al. 2021), and (ii) when the training data is so large, fewer examples are out of distribution in terms of language. For example, GPT-3 Brown et al. (2020) is trained on about 45TB of text data from various datasets, and the vocabulary of the MDS datasets is most likely already present in its training set. Notice however that even for these models, accuracy is a poor proxy of value when k is large. Indeed, SieBERT slightly outperforms GPT-3 in terms of both accuracy and F1 in all tasks. However, the situation is reversed for large values of k, with SieBERT reaching negative values in most cases, most likely because of a poorer calibration with respect to GPT-3. Finally, simple linear models occasionally outperform these powerful (and very expensive to employ) large-language models for the largest values of k, again confirming the importance of value in determining the most appropriate model for the situation at hand. 4.3 Key takeaways for AI-assisted decision-making While our experiments are conducted in a controlled setting, they are designed to reflect realistic decision-making scenarios relevant to enterprise environments, such as those in ServiceNow. Consider, for instance, an application that assesses or explains risk levels (e.g., the risk of applying a system patch). The utility of AI outputs in this context depends on the nature of potential errors: ● Correct Assessments: AI provides accurate risk evaluations, aiding decision-makers (e.g., Change Approvers) in making informed choices. ● Low-Value Outputs: AI offers insights that, while accurate, do not significantly aid de12 https://openai.com/api/ 13 https://github.com/burcusayin/value-of-ml-models/ 13 238 Page 18 of 23 B. Sayin et al. Table 6 Comparison between accuracy, F1-score and value in an OOD setting. LogRef, MLP1, ML4 and mttri are trained to perform domain adaptation and thus operate in a OOD setting, while transformer models (T5, SieBERT, GPT-3) are pre-trained or large corpora and thus likely operate in-distribution Task Model Accuracy F1 Value k=1 k=2 k=4 k=8 k = 10 MDS Electronics LogReg 0.762 0.736 0.524 0.339 0.162 0.053 0.033 MLP1 0.745 0.711 0.497 0.327 0.18 0.081 0.062 MLP4 0.745 0.713 0.47 0.24 − 0.143 − 0.78 − 1.06 mttri 0.808 0.786 0.616 0.441 0.148 − 0.354 − 0.58 T5 0.784 0.765 0.568 0.352 − 0.08 − 0.944 −1.376 SieBERT 0.842 0.831 0.685 0.527 0.217 − 0.397 −0.705 GPT-3 0.82 0.803 0.641 0.499 0.322 0.127 0.051 MDS DVD LogReg 0.74 0.739 0.48 0.283 0.122 0.038 0.027 MLP1 0.729 0.732 0.457 0.274 0.133 0.054 0.038 MLP4 0.722 0.724 0.439 0.202 − 0.158 − 0.737 −0.981 mttri 0.753 0.725 0.506 0.28 − 0.123 − 0.84 −1.166 T5 0.789 0.788 0.578 0.367 − 0.056 − 0.9 −1.323 SieBERT 0.836 0.832 0.672 0.508 0.193 − 0.436 −0.747 GPT-3 0.832 0.825 0.664 0.534 0.367 0.164 0.089 MDS Books LogReg 0.704 0.678 0.408 0.228 0.102 0.022 0.015 MLP1 0.697 0.662 0.382 0.134 0.013 − 0.017 −0.013 MLP4 0.688 0.681 0.393 0.154 − 0.171 − 0.666 − 0.86 mttri 0.742 0.712 0.484 0.254 − 0.16 − 0.869 −1.215 T5 0.77 0.791 0.541 0.311 − 0.148 − 1.066 −1.525 SieBERT 0.826 0.827 0.652 0.479 0.136 − 0.547 −0.879 GPT-3 0.806 0.808 0.613 0.46 0.272 0.077 0.004 MDS Kitchen LogReg 0.782 0.771 0.565 0.374 0.176 0.06 0.034 MLP1 0.758 0.752 0.53 0.337 0.164 0.07 0.044 MLP4 0.752 0.758 0.521 0.312 0.003 − 0.478 −0.685 mttri 0.821 0.832 0.642 0.489 0.235 − 0.192 −0.384 T5 0.777 0.768 0.555 0.332 − 0.113 − 1.004 −1.449 SieBERT 0.865 0.859 0.73 0.595 0.328 − 0.195 −0.454 GPT-3 0.853 0.851 0.706 0.599 0.464 0.308 0.251 For each dataset and metric, the best performance is highlighted in bold cision-making. ● Erroneous Assessments: AI produces misleading risk evaluations (e.g., downplaying a high-risk change), potentially leading to poor decisions. To mitigate the impact of errors, we apply a cost-based evaluation framework that assigns heavily negative weights to erroneous assessments-especially those that underestimate risks-relative to the positive weights for correct assessments. This reflects a deliberate design principle: it is preferable to provide no assistance than to offer misleading guidance. We determine whether to deploy a model by setting penalties such that a positive overall score indicates a net beneficial impact. While this introduces cost as an additional parameter, it aligns with standard model evaluation practices, where accuracy and utility thresholds guide deployment decisions. Importantly, this framework prioritizes the decision-maker’s 13 Rethinking and recomputing the value of machine learning models Page 19 of 23 238 needs, resulting in more instances of model rejection (when thresholds are unmet) rather than erroneous inferences. Notably, we have yet to encounter a use case where correct and erroneous assessments are assigned equal absolute weights by product managers. Similarly, non-inference (a model opting out) is rarely considered as detrimental as providing incorrect guidance. These observations suggest that our evaluation framework aligns with real-world utility considerations. 5 Limitations and conclusion In this paper, we investigated whether (i) model accuracy or F1-score serves as a reliable proxy for evaluating the true value of ML models, (ii) cost-sensitive error provides a meaningful measure of model value in cost-sensitive scenarios, (iii) calibration influences the value of ML models, and (iv) predictions in out-of-distribution settings impact model value. Our study focused on binary and multi-class classification tasks, employing various models under different cost settings. The findings revealed that (i) accuracy and F1-score are poor indicators of model value, (ii) cost-sensitive error is also an inadequate measure of model value, (iii) poor calibration significantly diminishes model value, and (iv) operating in outof-distribution settings considerably undermines model value. The takeaway from our experiments is that using accuracy-oriented metrics (that is, metrics that assume models are applied without rejection) is as a minimum a risky proposition and this is true even for models widely acknowledged as “leaders”. We should always assess models over a range of cost factors, and at least for reasonable cost factors we expect based on the set of application use cases we are targeting. k = 0 (accuracy) is almost never a reasonable one. We also saw how applying models without thresholding can lead to a negative value, and that threshold tuning seems to perform better than calibration. We also hypothesize and have obtained some support for identifying complexity and out-of-distribution as factors that may lead to rapid model quality degradation for higher cost factors. This being said, we see this work more as providing evidence of a problem and outlining the research needs: more studies (especially with large models and in vs out of distribution datasets) are needed to validate the hypothesis and a deeper understanding of how calibration, confidence distribution, and size of validation set affect model value. Author contributions Burcu Sayin: Conceptualization, Implementation, Experimental Evaluation, Writing— original draft, Writing—review & editing. Jie Yang: Conceptualization, Supervision, Writing - review & editing. Xinyue Chen: Experimental Evaluation, Writing—original draft. Andrea Passerini: Conceptualization, Funding acquisition, Supervision, Writing—original draft, Writing—review & editing. Fabio Casati: Conceptualization, Funding acquisition, Supervision, Writing - original draft, Writing - review & editing. Funding Open access funding provided by Università degli Studi di Trento within the CRUI-CARE Agreement. Funded by the European Union. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Health and Digital Executive Agency (HaDEA). Neither the European Union nor the granting authority can be held responsible for them. Grant Agreement no. 101120763 - TANGO. Grant Agreement No. 952215 - TAILOR. The work of Burcu Sayin was partially supported by the project AI@Trento (FBK-Unitn). AP also acknowledges the support of the MUR PNRR project FAIR - Future AI Research (PE00000013) funded by the NextGenerationEU. 13 238 Page 20 of 23 B. Sayin et al. Declarations Conflict of interest The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential Conflict of interest. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. References Agrawal S, Awekar A (2018) Deep learning for detecting cyberbullying across multiple social media platforms. In: Pasi G, Piwowarski B, Azzopardi L, Hanbury A (eds) Advances in information retrieval. Springer, Cham, pp 141–153 Arango A, Pérez J, Poblete B (2019) Hate speech detection is not as easy as you may think: A closer look at model validation. In: Proceedings of the 42nd International ACM SIGIR Conference on research and development in information retrieval. SIGIR’19, pp. 45–54. Association for Computing Machinery New York, NY, USA.https://doi.org/10.1145/3331184.3331262 Badjatiya P, Gupta S, Gupta M, Varma V (2017) Deep learning for hate speech detection in tweets. In: Proceedings of the 26th International Conference on World Wide Web Companion. WWW ’17 Companion, pp. 759–760. International World Wide Web Conferences Steering Committee Republic and Canton of Geneva, CHE.https://doi.org/10.1145/3041021.3054223 Bahat Y, Shakhnarovich G (2020) Classification confidence estimation with test-time data-augmentation. ArXiv abs/2006.16705. https://doi.org/10.48550/ARXIV.2006.16705 Balda E, Behboodi A, Mathar R (2020) Adversarial examples in deep neural networks: An overview. In: Deep learning: algorithms and applications, pp. 31–65. https://doi.org/10.1007/978-3-030-31760-7_2 Bendel R, Higgins S, Teberg J, Pyke D (1989) Comparison of skewness coefficient, coefficient of variation, and Gini coefficient as inequality measures within populations. Oecologia 78:394–400. https://doi.org /10.1007/BF00379115 Bragg J, Mausam Weld D.S (2016) Optimal testing for crowd workers. In: Proceedings of the 2016 international conference on autonomous agents & multiagent systems. AAMAS ’16, pp. 966–974. International foundation for autonomous agents and multiagent systems Richland, SC Brown T.B, Mann B, Ryder N, Subbiah M, Kaplan J, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A, Agarwal S, Herbert-Voss A, Krueger G, Henighan T, Child R, Ramesh A, Ziegler D.M, Wu J, Winter C, Hesse C, Chen M, Sigler E, Litwin M, Gray S, Chess B, Clark J, Berner C, McCandlish S, Radford A, Sutskever I, Amodei D (2020) Language models are few-shot learners. ArXiv abs/2005.14165. http s://doi.org/10.48550/ARXIV.2005.14165 Bukowski M, Kurek J, Antoniuk I, Jegorowa A (2021) Decision confidence assessment in multi-class classification. Sensors 21:3834. https://doi.org/10.3390/s21113834 Callaghan W, Goh J, Mohareb M, Lim A, Law E (2018) Mechanicalheart: a human-machine framework for the classification of phonocardiograms. In: CSCW’18, vol. 2, pp. 28–12817. https://doi.org/10.1145/3 274297 Casati F, Noel P, Yang J (2021) On the value of ml models. In: Neurips workshop on human decisions. https ://doi.org/10.48550/ARXIV.2112.06775 Chai X, Deng L, Yang Q, Ling CX (2004) Test-cost sensitive naive bayes classification. In: Fourth IEEE international conference on data mining (ICDM’04), pp. 51–58. https://doi.org/10.1109/ICDM.2004.10092 Charoenphakdee N, Cui Z, Zhang Y, Sugiyama M (2021) Classification with rejection based on cost-sensitive classification. In: Proceedings of the 38th international conference on machine learning, vol. 139, pp. 1507–1517. https://proceedings.mlr.press/v139/charoenphakdee21a.html Cheng J, Bernstein M.S (2015) Flock: Hybrid crowd-machine learning classifiers. In: Proceedings of the 18th Acm conference on computer supported cooperative work & social computing. https://doi.org/10.114 5/2675133.2675214 13 Rethinking and recomputing the value of machine learning models Page 21 of 23 238 Cordella LP, De Stefano C, Tortorella F, Vento M (1995) A method for improving classification reliability of multilayer perceptrons. IEEE Trans Neural Netw 6(5):1140–1147. https://doi.org/10.1109/72.410358 De Stefano C, Sansone C, Vento M (2000) To reject or not to reject: that is the question-an answer in case of neural classifiers. IEEE Trans Syst Man Cybern 30(1):84–94. https://doi.org/10.1109/5326.827457 Dellermann D, Ebel P, Söllner M, Leimeister JM (2019) Hybrid intelligence. Business Inform Syst Eng 61:637–643. https://doi.org/10.1007/s12599-019-00595-2 Dellermann D, Calma A, Lipusch N, Weber T, Weigel S, Ebel PA (2019) The future of human-ai collaboration: a taxonomy of design knowledge for hybrid intelligence systems. ArXiv abs/2105.03354 Domingos P (1999) Metacost: A general method for making classifiers cost-sensitive. In: Proceedings of the Fifth ACM SIGKDD international conference on knowledge discovery and data mining. KDD ’99, pp. 155–164. Association for computing machinery New York, NY, USA. https://doi.org/10.1145/312129 .312220 Elkan C (2001) The foundations of cost-sensitive learning. In: Proceedings of the 17th International Joint Conference on Artificial Intelligence - Volume 2. IJCAI’01, pp. 973–978. Morgan Kaufmann Publishers Inc. San Francisco, CA, USA Elkan C (2001) The foundations of cost-sensitive learning. In: Proceedings of the 17th international joint conference on artificial intelligence, pp. 973–978 Fumera G, Roli F (2002) Support vector machines with embedded reject option. In: Proceedings of the first international workshop on pattern recognition with support vector machines. SVM ’02, pp. 68–82. Springer Berlin, Heidelberg. https://doi.org/10.1007/3-540-45665-1_6 Gadiraju U, Yang J, Bozzon A (2017) Clarity is a worthwhile quality: On the role of task clarity in microtask crowdsourcing. In: Proceedings of the 28th ACM conference on hypertext and social media. HT ’17, pp. 5–14. Association for computing machinery New York, NY, USA. https://doi.org/10.1145/307871 4.3078715 Geifman Y, El-Yaniv R (2017) Selective classification for deep neural networks. Adv Neural Inform Proc Syst. https://doi.org/10.48550/ARXIV.1705.08500 Gunel B.S (2022) Towards reliable hybrid human-machine classifiers. PhD thesis at University of Trento. https://hdl.handle.net/11572/349843 Guo C, Pleiss G, Sun Y, Weinberger K.Q (2017) On calibration of modern neural networks. In: proceedings of the 34th international conference on machine learning - Volume 70. ICML’17, pp. 1321–1330.https: //doi.org/10.48550/ARXIV.1706.04599 Han L, Roitero K, Gadiraju U, Sarasua C, Checco A, Maddalena E, Demartini G (2021) The impact of task abandonment in crowdsourcing. IEEE Trans Knowl Data Eng 33(5):2266–2279. https://doi.org/10.110 9/TKDE.2019.2948168 Han L, Maddalena E, Checco A, Sarasua C, Gadiraju U, Roitero K, Demartini G (2020) Crowd worker strategies in relevance judgment tasks. In: Proceedings of the 13th International conference on web search and data mining. WSDM ’20, pp. 241–249. Association for computing machinery New York, NY, USA. https://doi.org/10.1145/3336191.3371857 Han L, Roitero K, Gadiraju U, Sarasua C, Checco A, Maddalena E, Demartini G (2019) All those wasted hours: On task abandonment in crowdsourcing. In: Proceedings of the twelfth ACM international conference on web search and data mining. WSDM ’19, pp. 321–329. Association for Computing Machinery New York, NY, USA. https://doi.org/10.1145/3289600.3291035 Heitmann M, Siebert C, Hartmann J, Schamp C (2020) More than a feeling: benchmarks for sentiment analysis accuracy. In: Communication & Computational Methods eJournal Hellman ME (1970) The nearest neighbor classification rule with a reject option. IEEE Trans Syst Sci Cybern 6(3):179–185. https://doi.org/10.1109/TSSC.1970.300339 He H, Ma Y (2013) Imbalanced learning: foundations, algorithms, and applications Hendrickx K, Perini L, Plas D, Meert W, Davis J (2021) Machine learning with a reject option: a survey. arXiv Jiang Z, Araki J, Ding H, Neubig G (2021) How can we know when language models know? On the calibration of language models for question answering. Trans Assoc Comput Linguist 9:962–977. https://doi. org/10.1162/tacl_a_00407 Jiang H, Kim B, Guan M.Y, Gupta M (2018) To trust or not to trust a classifier. In: Proceedings of the 32nd international conference on neural information processing systems. NIPS’18, pp. 5546–5557. Curran Associates Inc. Red Hook, NY, USA. https://doi.org/10.48550/ARXIV.1805.11783 Kamar E, Hacker S, Horvitz E (2012) Combining human and machine intelligence in large-scale crowdsourcing. In: AAMAS’12 - Volume 1, pp. 467–474 Krivosheev E, Casati F, Benatallah B (2018) Crowd-based multi-predicate screening of papers in literature reviews. In: Proceedings of the 2018 World wide web conference. WWW ’18, pp. 55–64. International world wide web conferences steering committee republic and canton of Geneva, CHE. https://doi.org/ 10.1145/3178876.3186036 13 238 Page 22 of 23 B. Sayin et al. Kull M, Silva Filho T, Flach PA (2017) Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration. Electr J Stat 11:5052–5080 Kull M, Perello Nieto M, Kängsepp M, Silva Filho T, Song H, Flach P.: Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. In: Wallach H, Larochelle H, Beygelzimer A, Alché-Buc F, Fox E, Garnett R. (eds.) (2019) Advances in neural information processing systems, vol. 32. https://proceedings.neurips.cc/paper/2019/file/8ca01ea920679a0fe372844149404 1b9-Paper.pdf Li H (2013) Error rate analysis of labeling by crowdsourcing. In: International conference on machine learning (ICML2013), Workshop on machine learning meets crowdsourcing Ling C, Sheng V (2010) Cost-sensitive learning and the class imbalance problem. Encycl Mach Learn Liu Q, Ihler A.T, Steyvers M (2013) Scoring workers in crowdsourcing: How many control questions are enough? In: Burges C.J, Bottou L, Welling M, Ghahramani Z, Weinberger K.Q. (eds.) Advances in neural information processing systems, vol. 26. https://proceedings.neurips.cc/paper/2013/file/cc1aa43 6277138f61cda703991069eaf-Paper.pdf Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V (2019) Roberta: a robustly optimized bert pretraining approach. ArXiv abs/1907.11692. https://doi.org/10.48 550/ARXIV.1907.11692 Maas A.L, Daly R.E, Pham P.T, Huang D, Ng A.Y, Potts C (2011) Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual meeting of the association for computational linguistics: human language technologies, pp. 142–150. Association for computational linguistics Portland, Oregon, USA. https://aclanthology.org/P11-1015 Nagar Y, Malone TW (2011) Making business predictions by combining human and machine intelligence in prediction markets. In: International conference on interaction sciences Nagar Y, Malone T.W (2012) Improving predictions with hybrid markets. In: AAAI fall symposium: machine aggregation of human judgment Ng AY (2004) Feature selection, l1 vs. l2 regularization, and rotational invariance. In: Proceedings of the Twenty-First International Conference on Machine Learning. ICML ’04, p. 78. Association for Computing Machinery New York, NY, USA. https://doi.org/10.1145/1015330.1015435 Nuñez A.C (2022) Combining diverse forms of human and machine intelligence. In: PhD thesis at Massachusetts institute of technology Qarout R, Checco A, Bontcheva K (2018) Investigating stability and reliability of crowdsourcing output. In: Proceedings of the 1st workshop on disentangling the relation between crowdsourcing and bias management (CrowdBias 2018) Co-located the 6th AAAI conference on human computation and crowdsourcing (HCOMP 2018) Qiu S, Gadiraju U, Bozzon A (2020) Improving worker engagement through conversational microtask crowdsourcing. In: Proceedings of the 2020 CHI conference on human factors in computing systems. CHI ’20, pp. 1–12. Association for computing machinery New York, NY, USA. https://doi.org/10.114 5/3313831.3376403 Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu PJ (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. J Mach Learn Res. https://doi.org/10. 48550/ARXIV.1910.10683 Raghu M, Blumer K, Corrado G, Kleinberg J.M, Obermeyer Z, Mullainathan S (2019) The algorithmic automation problem: prediction, triage, and human effort. CoRR abs/1903.12220. arXiv:1903.12220 Rodriguez C, Daniel F, Casati F (2014) Crowd-based mining of reusable process model patterns. In: Business process management, pp. 51–66. https://doi.org/10.1007/978-3-319-10172-9_4 Ruder S, Plank B (2018) Strong baselines for neural semi-supervised learning under domain shift. In: The 56th annual meeting of the association for computational linguistics (ACL 2018), pp. 1044–1054. https://doi.org/10.18653/v1/P18-1096 Sayin B, Krivosheev E, Passerini JYA, Casati F (2021) A review and experimental analysis of active learning over crowdsourced data. Artif Intel Rev 54:5283–5305. https://doi.org/10.1007/s10462-021-10021-3 Sayin B, Casati F, Passerini A, Yang J, Chen X (2022) Rethinking and recomputing the value of ml models. arXiv preprint arXiv:2209.15157 Sayin B, Krivosheev E, Ramírez J, Casati F, Taran E, Malanina V, Yang J (2021) Crowd-powered hybrid classification services: Calibration is all you need. In: 2021 IEEE International conference on web services (ICWS), pp. 42–50. https://doi.org/10.1109/ICWS53863.2021.00019 Sayin B, Yang J, Passerini A, Casati F (2021) The science of rejection: a research area for human computation. In: The 9th AAAI conference on human computation and crowdsourcing. HCOMP 2021. https://d oi.org/10.48550/ARXIV.2111.06736 Sayin B, Yang J, Passerini A, Casati F (2023) Value-aware active learning. In: Frontiers in artificial intelligence and applications. Volume 368: HHAI 2023: Augmenting Human Intellect, pp. 215–223 13 Rethinking and recomputing the value of machine learning models Page 23 of 23 238 Sayin B, Yang J, Passerini A, Casati F (2023) Value-based hybrid intelligence. In: Frontiers in artificial intelligence and applications. Volume 368: HHAI 2023: Augmenting Human Intellect, pp. 366–370 Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27(3):379–423. https://doi.o rg/10.1002/j.1538-7305.1948.tb01338.x Sheng V.S, Ling C.X (2006) Thresholding for making classifiers cost-sensitive. In: Proceedings of the 21st National conference on artificial intelligence - Volume 1. AAAI’06, pp. 476–481 Silva Filho T, Song H, Perelló-Nieto M, Santos-Rodríguez R, Kull M, Flach P.A (2021) Classifier calibration: How to assess and improve predicted class probabilities: a survey. CoRR abs/2112.10327 Suri M (2022) PiCkLe at SemEval-2022 task 4: Boosting pre-trained language models with task specific metadata and cost sensitive learning. In: Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), pp. 464–472. Association for Computational Linguistics Seattle, United States. https://doi.org/10.18653/v1/2022.semeval-1.63 Sutton R.T, Pincock D, Baumgart D.C, Sadowski D, Fedorak R, Kroeker K (2020) An overview of clinical decision support systems: benefits, risks, and strategies for success. npj Digital Medicine 3 Teerapittayanon S, McDanel B, Kung H.T (2017) Branchynet: fast inference via early exiting from deep neural networks. ArXiv abs/1709.01686. https://doi.org/10.48550/ARXIV.1709.01686 Thai-Nghe N, Gantner Z, Schmidt-Thieme L (2010) Cost-sensitive learning methods for imbalanced data. In: The 2010 international joint conference on neural networks (IJCNN), pp. 1–8. https://doi.org/10.11 09/IJCNN.2010.5596486 Ting KM (1998) Inducing cost-sensitive trees via instance weighting. In: Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery. PKDD ’98, pp. 139–147. Springer Berlin, Heidelberg. https://doi.org/10.1007/BFb0094814 Tomani C, Buettner F (2019) Towards trustworthy predictions from deep neural networks with fast adversarial calibration. In: AAAI conference on artificial intelligence Tu C.-Y, Lin H.-T (2020) Cost learning network for imbalanced classification. In: 2020 international conference on technologies and applications of artificial intelligence (TAAI), pp. 47–51. https://doi.org/10.11 09/TAAI51410.2020.00017 Waseem Z, Hovy D (2016) Hateful symbols or hateful people? predictive features for hate speech detection on Twitter. In: Proceedings of the NAACL Student Research Workshop, pp. 88–93. Association for Computational Linguistics San Diego, California. https://doi.org/10.18653/v1/N16-2013. https://aclan thology.org/N16-2013 Whitehill J, Wu T.-f, Bergsma J, Movellan J, Ruvolo P (2009) Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In: Bengio Y, Schuurmans D, Lafferty J, Williams C, Culotta A. (eds.) Advances in neural information processing systems, vol. 22. https://proceedings.ne urips.cc/paper/2009/file/f899139df5e1059396431415e770c6dd-Paper.pdf Wilder B, Horvitz E, Kamar E (2021) Learning to complement humans. In: Proceedings of the Twenty-Ninth international joint conference on artificial intelligence. IJCAI’20. https://doi.org/10.48550/ARXIV.20 05.00582 Wu M.-H, Quinn A.J (2017) Confusing the crowd: Task instruction quality on amazon mechanical turk. In: AAAI Conference on human computation & crowdsourcing Wu Y, Zeng Z, He K, Mou Y, Wang P, Xu W (2022) Distribution calibration for out-of-domain detection with Bayesian approximation. In: Proceedings of the 29th International Conference on Computational Linguistics, pp. 608–615. International committee on computational linguistics Gyeongju, Republic of Korea. https://aclanthology.org/2022.coling-1.50 Yang J, Redi J, Demartini G, Bozzon A (2016) Modeling task complexity in crowdsourcing. In: AAAI conference on human computation & crowdsourcing Zadrozny B, Langford J, Abe N (2003) Cost-sensitive learning by cost-proportionate example weighting. In: Proceedings of the Third IEEE international conference on data mining. ICDM ’03, p. 435. IEEE Computer Society USA. https://doi.org/10.1109/ICDM.2003.1250950 Zhou D, Basu S, Mao Y, Platt J (2012) Learning from the wisdom of crowds by minimax entropy. In: Pereira F, Burges C.J, Bottou L, Weinberger K.Q. (eds.) Advances in neural information processing systems, vol. 25. https://proceedings.neurips.cc/paper/2012/file/46489c17893dfdcf028883202cefd6 d1-Paper.pdf Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. 13