# Straw Men Revised

#### About The Authors

### Eric-Jan Wagenmakers

# SpaceX Starship SN10 Landing

#### About The Authors

### František Bartoš

# Strong Public Claims May Not Reflect Researchers’ Private Convictions

### Abstract

### Questionnaire

### Results

### Concluding Comments

### References

#### About The Authors

### Johnny van Doorn

# Preprint: Bayesian Estimation of Single-Test Reliability Coefficients

*This post is a synopsis of Pfadt, J. M., van den Bergh, D., Sijtsma, K., Moshagen, M., & Wagenmakers, E.-J. (in press). Bayesian estimation of single-test reliability coefficients. Multivariate Behavioral Research. Preprint available at https://psyarxiv.com/exg2y*

### Abstract

### Overview

### Reliability Coefficients

### Simulation Results

### Example Data Set

### Conclusion

### References

#### About The Authors

### Julius M. Pfadt

Julius M. Pfadt is a PhD student in the Research Methods group at Ulm University.

# Preprint: Expert Agreement in Prior Elicitation and its Effects on Bayesian Inference

### Abstract

### Different experts – different priors?

### Different priors – different hypothesis testing results?

### Conclusions

### References

#### About The Authors

### Angelika Stefan

### Dimitris Katsimpokis

### Quentin F. Gronau

### Eric-Jan Wagenmakers

Posted on Mar 25th, 2021

Last week’s post contained hyperbole, an unfortunate phrase involving family members, and reference to sensitive political opinions. I am grateful to everyone who suggested improvements, which I have incorporated to the best of my ability. In addition, I have made a series of more substantial changes to that blog post, because I could see how the overall tone was needlessly confrontational. Indeed, parts of my earlier post were interpreted as a personal attack on Devezer et al., and although I have of course denied this, it is entirely possible that some of my more snarky sentences were motivated by a desire to “retaliate” for what I believed was an unjust characterization of my position and that of a movement with which I identify. I hope the present version is more mature and balanced. You can find the revised blog post here.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.

Posted on Feb 19th, 2021

In the past months, SpaceX has been moving quickly forward with the development of its interplanetary rocket, Starship. The last two prototypes even went ahead with a test flight to ~10 km. Both test flights were highly successful, apart from ending in an RUD (rapid unscheduled disassembly) during the landing. That was not unexpected, since according to Elon Musk the previous prototypes had a low chance of landing successfully. Nevertheless, many people (ourselves included) are wondering whether the next prototype (SN10), scheduled to attempt the test flight and landing procedure in the upcoming weeks, will finally stick the landing.

A recent Twitter poll of almost 40,000 people estimated the probability of SN10 successfully landing at 77.5% (after removing people who abstained from voting).

That is a much higher chance than Elon’s own estimate of ~60%, which is comparable to the Metaculus prediction market, where 295 predictions converged to a 56% median probability of a successful landing.

Here, we also try to predict whether the next Starship, SN10, will successfully land. Like all statisticians, we start by replacing a difficult problem with a simpler one: instead of the landing, we will predict whether SN10 will successfully fire at least two of its engines as it approaches landing. Since a rocket engine can either fire up or malfunction, we treat each engine firing up as a binary (Bernoulli) event with success probability θ. Starship prototypes have 3 rocket engines, of which 2 are needed for a successful landing. However, in previous landing attempts, SpaceX tried lighting up only 2 engines, both of which had to fire up successfully. Now, in order to improve their landing chances, SpaceX decided to try lighting up all 3 engines and shutting down 1 of them if all fire successfully^{1}. We will therefore approximate a successful landing as observing at least 2 successes in 3 binomial trials.

To obtain the predictions, we will use Bayesian statistics and specify a prior distribution for the binomial probability parameter θ, an engine successfully firing up. Luckily, we can easily obtain the prior distribution from the two previous landing attempts:

- The first Starship prototype attempting landing, SN8, managed to fire both engines but crashed due to low oxygen pressure, which resulted in insufficient thrust and a far too fast approach to the landing site. Video [here].
- The second Starship prototype attempting landing, SN9, did not manage to fire the second engine which, again, resulted in an RUD on approach. Video [here].

Adding the assumption that the events are independent, we can summarize the previous firing up attempts with a beta(4, 2) distribution, corresponding to observing 3 successful events and 1 unsuccessful event. In JASP, we can use the Learn Bayes module to plot our prior distribution for θ

and generate predictions for 3 future events. Since the prior distribution for θ is a beta distribution and we observe binomial events, the number of future successes in 3 trials follows a beta-binomial(3, 4, 2) distribution. We obtain a figure depicting the predicted number of successes from JASP, and we further request the probability of observing at least two of them. Finally, we arrive at an optimistic prediction of a 71% chance of observing at least 2 of the engines fire up on the landing approach. Of course, we should treat our estimate as an upper bound on the actual probability of a successful landing. There are many other things that can go wrong (see SpaceX’s demonstration [here]) that we did not account for (in contrast to SpaceX, we are not trying to do rocket science here).

We can also ask how much trying to fire up all 3 engines instead of 2 (as in previous attempts) increases the chance of a successful landing. For that, we just need to obtain the probability of observing 2 successful events in 2 trials, which is 48% (analogously, from a beta-binomial(2, 4, 2) distribution), and subtract it from the previous estimate of 71%. That is a 23 percentage point higher chance of landing when trying to use all 3 instead of only 2 engines.
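The two predictive probabilities can be checked by hand. The sketch below is not the JASP implementation; it simply evaluates the beta-binomial probability mass function with exact fractions. The helper names `beta_fn` and `beta_binomial_pmf` are ours, and reading beta(4, 2) as a uniform beta(1, 1) prior updated with 3 successes and 1 failure is our interpretation of the post.

```python
from fractions import Fraction
from math import comb, factorial


def beta_fn(a: int, b: int) -> Fraction:
    """Beta function B(a, b) for positive integers, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


def beta_binomial_pmf(k: int, n: int, a: int, b: int) -> Fraction:
    """P(k successes in n future trials) when theta ~ beta(a, b)."""
    return comb(n, k) * beta_fn(k + a, n - k + b) / beta_fn(a, b)


# Prior from the post: beta(4, 2). We read this as a uniform beta(1, 1)
# updated with 3 successful and 1 unsuccessful engine start (assumption).
a, b = 1 + 3, 1 + 1

# All 3 engines are lit and at least 2 must fire.
p_three_engines = sum(beta_binomial_pmf(k, 3, a, b) for k in (2, 3))

# Only 2 engines are lit and both must fire.
p_two_engines = beta_binomial_pmf(2, 2, a, b)

print(round(float(p_three_engines), 3))                  # 0.714
print(round(float(p_two_engines), 3))                    # 0.476
print(round(float(p_three_engines - p_two_engines), 3))  # 0.238
```

The exact values are 5/7 and 10/21. Their unrounded difference is 5/21, or about 0.238, which is slightly larger than the figure obtained by subtracting the rounded percentages.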

František Bartoš is a Research Master student in psychology at the University of Amsterdam.

Posted on Feb 18th, 2021

*This post is an extended synopsis of van Doorn, J., van den Bergh, D., Dablander, F., Derks, K., van Dongen, N.N.N., Evans, N. J., Gronau, Q. F., Haaf, J.M., Kunisato, Y., Ly, A., Marsman, M., Sarafoglou, A., Stefan, A., & Wagenmakers, E.‐J. (2021), Strong public claims may not reflect researchers’ private convictions. Significance, 18, 44-45. https://doi.org/10.1111/1740-9713.01493. Preprint available on PsyArXiv: https://psyarxiv.com/pc4ad*

How confident are researchers in their own claims? Augustus De Morgan (1847/2003) suggested that researchers may initially present their conclusions modestly, but afterwards use them as if they were a “moral certainty”^{1}. To prevent this from happening, De Morgan proposed that whenever researchers make a claim, they accompany it with a number that reflects their degree of confidence (Goodman, 2018). Current reporting procedures in academia, however, usually present claims without the authors’ assessment of confidence.

Here we report the partial results from an anonymous questionnaire on the concept of evidence that we sent to 162 corresponding authors of research articles and letters published in *Nature Human Behaviour* (NHB). We received 31 complete responses (response rate: 19%). A complete overview of the questionnaire can be found in online Appendices B, C, and D. As part of the questionnaire, we asked respondents two questions about the claim in the title of their NHB article: *“In your opinion, how plausible was the claim before you saw the data?”*

Figure 1 shows the responses to both questions. The blue dots quantify the assessment of prior plausibility. The highest prior plausibility is 75, and the lowest is 20, indicating that (albeit with the benefit of hindsight) the respondents did not set out to study claims that they believed to be either outlandish or trivial. Compared to the heterogeneity in the topics covered, this range of prior plausibility is relatively narrow.

From the ratio of posterior to prior odds we can derive the Bayes factor, that is, the extent to which the data changed researchers’ conviction. The median of this informal Bayes factor is 3, corresponding to the interpretation that the data are 3 times more likely to have occurred under the hypothesis that the claim is true than under the hypothesis that the claim is false.
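To make the informal Bayes factor concrete, here is a minimal sketch of the arithmetic. It assumes plausibility is reported on a 0 to 100 scale, as in Figure 1. The function name `informal_bayes_factor` and the example respondent are hypothetical and not taken from the questionnaire data.

```python
def informal_bayes_factor(prior_pct: float, posterior_pct: float) -> float:
    """Posterior odds divided by prior odds, from plausibility ratings in percent."""
    if not (0 < prior_pct < 100 and 0 < posterior_pct < 100):
        raise ValueError("plausibility must lie strictly between 0 and 100")
    prior_odds = prior_pct / (100 - prior_pct)
    posterior_odds = posterior_pct / (100 - posterior_pct)
    return posterior_odds / prior_odds


# Hypothetical respondent: plausibility 50 before and 75 after seeing the data.
print(informal_bayes_factor(50, 75))  # 3.0
```

A respondent who moves from 50 to 75 therefore reports an informal Bayes factor of 3. Note that the median of the individual Bayes factors need not equal the Bayes factor computed from the median plausibilities.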

*Figure 1. All 31 respondents indicated that the data made the claim in the title of their NHB article more likely than it was before. However, the size of the increase is modest. Before seeing the data, the plausibility centers around 50 (median = 56); after seeing the data, the plausibility centers around 75 (median = 80). The gray lines connect the responses for each respondent.*

The authors’ modesty appears excessive. It is not reflected in the declarative title of their NHB articles, and it could not reasonably have been gleaned from the content of the articles themselves. Empirical disciplines do not ask authors to express the confidence in their claims, even though this could be relatively simple. For instance, journals could ask authors to estimate the prior/posterior plausibility, or the probability of a replication yielding a similar result (e.g., (non)significance at the same alpha level and sample size), for each claim or hypothesis under consideration, and present the results on the first page of the article. When an author publishes a strong claim in a top-tier journal such as NHB, one may expect this author to be relatively confident. While the current academic landscape does not allow authors to express their uncertainty publicly, our results suggest that they may well be aware of it. Encouraging authors to express this uncertainty openly may lead to more honest and nuanced scientific communication (Kousta, 2020).

De Morgan, A. (1847/2003). Formal Logic: The Calculus of Inference, Necessary and Probable. Honolulu: University Press of the Pacific.

Goodman, S. N. (2018). How sure are you of your result? Put a number on it. *Nature, 564*, 7.

Kousta, S. (Ed.). (2020). Editorial: Tell it like it is. *Nature Human Behaviour, 4*, 1.

Johnny van Doorn is a PhD candidate at the Psychological Methods department of the University of Amsterdam.

Posted on Feb 13th, 2021

Popular measures of reliability for a single-test administration include coefficient α, coefficient λ2, the greatest lower bound (glb), and coefficient ω. First, we show how these measures can be easily estimated within a Bayesian framework. Specifically, the posterior distribution for these measures can be obtained through Gibbs sampling – for coefficients α, λ2, and the glb one can sample the covariance matrix from an inverse Wishart distribution; for coefficient ω one samples the conditional posterior distributions from a single-factor CFA-model. Simulations show that – under relatively uninformative priors – the 95% Bayesian credible intervals are highly similar to the 95% frequentist bootstrap confidence intervals. In addition, the posterior distribution can be used to address practically relevant questions, such as “what is the probability that the reliability of this test is between .70 and .90?”, or, “how likely is it that the reliability of this test is higher than .80?”. In general, the use of a posterior distribution highlights the inherent uncertainty with respect to the estimation of reliability measures.

Reliability analysis aims to disentangle the amount of variance of a test score that is due to systematic influences (i.e., true-score variance) from the variance that is due to random influences (i.e., error-score variance; Lord & Novick, 1968).

When one estimates a parameter such as a reliability coefficient, the point estimate can be accompanied by an uncertainty interval. In the context of reliability analysis, substantive researchers almost always ignore uncertainty intervals and present only point estimates. This common practice disregards sampling error and the associated estimation uncertainty and should be seen as highly problematic. In this preprint, we show how the Bayesian credible interval can provide researchers with a flexible and straightforward method to quantify the uncertainty of point estimates in a reliability analysis.

Coefficient α, coefficient λ2, and the glb are based on classical test theory (CTT) and are lower bounds to reliability. To determine the error-score variance of a test, the coefficients estimate an upper bound for the error variances of the items. The estimators differ in the way they estimate this upper bound. The basis for the estimation is the covariance matrix Σ of multivariate observations. The CTT-coefficients estimate error-score variance from the variances of the items and true-score variance from the covariances of the items.
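As an illustration of how the CTT coefficients use the covariance matrix, the sketch below computes coefficient α and Guttman's λ2 from a covariance matrix given as nested lists. It uses the standard textbook formulas and is not the code from the preprint. The function names and the example matrix are hypothetical.

```python
from math import sqrt


def coefficient_alpha(cov):
    """Coefficient alpha from a k x k covariance matrix (list of lists)."""
    k = len(cov)
    total_var = sum(sum(row) for row in cov)       # variance of the sum score
    item_var = sum(cov[i][i] for i in range(k))    # sum of the item variances
    return k / (k - 1) * (1 - item_var / total_var)


def coefficient_lambda2(cov):
    """Guttman's lambda2 from a k x k covariance matrix (list of lists)."""
    k = len(cov)
    total_var = sum(sum(row) for row in cov)
    off_diag = [cov[i][j] for i in range(k) for j in range(k) if i != j]
    correction = sqrt(k / (k - 1) * sum(c * c for c in off_diag))
    return (sum(off_diag) + correction) / total_var


# Hypothetical covariance matrix of three items with unequal covariances.
example_cov = [
    [1.0, 0.2, 0.6],
    [0.2, 1.0, 0.4],
    [0.6, 0.4, 1.0],
]
print(round(coefficient_alpha(example_cov), 3))
print(round(coefficient_lambda2(example_cov), 3))
```

Both functions take the item variances from the diagonal and the item covariances from the off-diagonal entries. When all covariances are equal the two coefficients coincide; otherwise λ2 is at least as large as α.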

Coefficient ω is based on the single-factor model. Specifically, the single-factor model assumes that a common factor explains the covariances between the items (Spearman, 1904). Following CTT, the common factor variance replaces the true-score variance and the residual variances replace the error-score variance.
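For the single-factor model, the replacement of true-score and error-score variance can be written in a few lines. This is a sketch of the usual formula for ω under the assumption that the factor variance is fixed at 1. The function name and the example loadings are hypothetical.

```python
def coefficient_omega(loadings, residual_variances):
    """Omega for a single-factor model with the factor variance fixed at 1."""
    common_variance = sum(loadings) ** 2         # stands in for true-score variance
    error_variance = sum(residual_variances)     # stands in for error-score variance
    return common_variance / (common_variance + error_variance)


# Hypothetical standardized solution: four items, each loading 0.7.
print(round(coefficient_omega([0.7] * 4, [0.51] * 4), 3))
```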

A straightforward way to obtain a posterior distribution of a CTT-coefficient is to estimate the posterior distribution of the covariance matrix and use it to calculate the estimate. Thus, we sample the posterior covariance matrices from an inverse Wishart distribution (Murphy, 2007; Padilla & Zhang, 2011).
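The sampling idea can be sketched with NumPy. This is an illustration under stated assumptions and not the Bayesrel implementation: we assume an inverse Wishart prior with identity scale matrix and k degrees of freedom, center the data at the sample mean, and simulate a hypothetical data set. All function names are ours.

```python
import numpy as np


def alpha_from_cov(cov):
    """Coefficient alpha computed from a covariance matrix."""
    k = cov.shape[0]
    return k / (k - 1) * (1 - np.trace(cov) / cov.sum())


def sample_inverse_wishart(rng, df, scale):
    """One inverse Wishart draw (integer df >= k).

    If W ~ Wishart(df, inv(scale)), then inv(W) ~ inverse Wishart(df, scale).
    """
    k = scale.shape[0]
    chol = np.linalg.cholesky(np.linalg.inv(scale))
    z = rng.standard_normal((df, k)) @ chol.T   # rows ~ N(0, inv(scale))
    return np.linalg.inv(z.T @ z)


def posterior_alpha_draws(data, n_draws=2000, seed=1):
    """Posterior draws of alpha via posterior draws of the covariance matrix."""
    rng = np.random.default_rng(seed)
    n, k = data.shape
    centered = data - data.mean(axis=0)
    scatter = centered.T @ centered
    # Assumed prior: inverse Wishart with identity scale and k degrees of freedom.
    post_df = k + n
    post_scale = np.eye(k) + scatter
    return np.array([
        alpha_from_cov(sample_inverse_wishart(rng, post_df, post_scale))
        for _ in range(n_draws)
    ])


# Hypothetical data: 300 respondents, 5 items driven by one common factor.
sim_rng = np.random.default_rng(0)
factor = sim_rng.normal(size=(300, 1))
data = 0.7 * factor + sim_rng.normal(scale=0.7, size=(300, 5))

draws = posterior_alpha_draws(data)
lower, upper = np.percentile(draws, [2.5, 97.5])
print(round(float(draws.mean()), 3), round(float(lower), 3), round(float(upper), 3))
```

Each posterior covariance matrix is turned into one value of α, so the collection of values approximates the posterior distribution of the coefficient. The interval printed here is an equal-tailed 95% credible interval.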

For coefficient ω we sample from the conditional posterior distributions of the parameters in the single-factor model by means of a Gibbs sampling algorithm (Lee, 2007).

The results suggest that the Bayesian reliability coefficients perform as well as the frequentist ones. The figure below depicts the simulation results for the condition with medium correlations among items. The endpoints of the bars are the average 95% uncertainty interval limits. The 25% and 75% quantiles are indicated with vertical line segments.

The figures below show the reliability results of an empirical data set from Cavalini (1992) with eight items, once for the full sample of n = 828 and once for n = 100 randomly chosen observations. Depicted are posterior distributions of estimators with dotted prior densities and 95% credible interval bars. One can easily see the change in the uncertainty of reliability values when the sample size increases.

For example, the 95% credible interval for λ2 contains 95% of the posterior mass. Since λ2 = .784, 95% HDI [.761, .806], we are 95% certain that λ2 lies between .761 and .806. Yet, how certain are we that the reliability is larger than .80? Using the posterior distribution of coefficient λ2, we can calculate the probability that it exceeds the cutoff of .80: p(λ2 > .80 | data) = .075.
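Given posterior draws of a coefficient, such probabilities are simply proportions of draws. The sketch below shows the idea. The function name is hypothetical, and the draws are a stand-in generated from a normal distribution chosen to roughly resemble the interval reported above, not the actual posterior draws from the preprint.

```python
import random


def posterior_probability(draws, lower=float("-inf"), upper=float("inf")):
    """Proportion of posterior draws that fall strictly between lower and upper."""
    hits = sum(1 for d in draws if lower < d < upper)
    return hits / len(draws)


# Stand-in for real posterior draws (assumption: normal, mean .784, sd .0115).
rng = random.Random(2021)
stand_in_draws = [rng.gauss(0.784, 0.0115) for _ in range(10_000)]

# How likely is it that the reliability exceeds .80?
print(posterior_probability(stand_in_draws, lower=0.80))

# What is the probability that the reliability lies between .70 and .90?
print(posterior_probability(stand_in_draws, lower=0.70, upper=0.90))
```

With real output from a sampler, the stand-in list would be replaced by the posterior draws of the coefficient of interest.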

The Bayesian reliability estimation adds an essential measure of uncertainty to simple point-estimated coefficients. Adequate credible intervals for single-test reliability estimates can be easily obtained by applying the procedures described in the preprint, as implemented in the R-package *Bayesrel*. Whereas the R-package addresses substantive researchers who have some experience in programming, we admit that it will probably not reach scientists whose software experience is limited to graphical user interface programs such as SPSS. For this reason we have implemented the Bayesian reliability coefficients in the open-source statistical software JASP (JASP Team, 2020). Whereas we cannot stress the importance of reporting uncertainty enough, the question of the appropriateness of certain reliability measures cannot be answered by the Bayesian approach. No single reliability estimate can be generally recommended over all others. Nonetheless, practitioners are faced with the decision of which reliability estimates to compute and report. Based on a single test administration, the procedure should involve an assessment of dimensionality. Ideally, practitioners report multiple reliability coefficients with an accompanying measure of uncertainty that is based on the posterior distribution.


Posted on Feb 5th, 2021

*This post is an extended synopsis of Stefan, A. M., Katsimpokis, D., Gronau, Q. F. & Wagenmakers, E.-J. (2021). Expert agreement in prior elicitation and its effects on Bayesian inference. Preprint available on PsyArXiv: https://psyarxiv.com/8xkqd/*

Bayesian inference requires the specification of prior distributions that quantify the pre-data uncertainty about parameter values. One way to specify prior distributions is through prior elicitation, an interview method guiding field experts through the process of expressing their knowledge in the form of a probability distribution. However, prior distributions elicited from experts can be subject to idiosyncrasies of experts and elicitation procedures, raising the spectre of subjectivity and prejudice. In a new pre-print, we investigate the effect of interpersonal variation in elicited prior distributions on the Bayes factor hypothesis test. We elicited prior distributions from six academic experts with a background in different fields of psychology and applied the elicited prior distributions as well as commonly used default priors in a re-analysis of 1710 studies in psychology. The degree to which the Bayes factors vary as a function of the different prior distributions is quantified by three measures of concordance of evidence: We assess whether the prior distributions change the Bayes factor direction, whether they cause a switch in the category of evidence strength, and how much influence they have on the value of the Bayes factor. Our results show that although the Bayes factor is sensitive to changes in the prior distribution, these changes rarely affect the qualitative conclusions of a hypothesis test. We hope that these results help researchers gauge the influence of interpersonal variation in elicited prior distributions in future psychological studies. Additionally, our sensitivity analyses can be used as a template for Bayesian robustness analyses that involve prior elicitation from multiple experts.

The goal of a prior elicitation effort is to formulate a probability distribution that represents the subjective knowledge of an expert. This probability distribution can then be used as a prior distribution on parameters in a Bayesian model. Parameter values the expert deems plausible receive a higher probability density, and parameter values the expert deems implausible receive a lower probability density. Of course, most of us know from personal experience that experts can differ in their opinions. But to what extent will these differences influence elicited prior distributions? Here, we asked six experts from different fields in psychology about plausible values for small-to-medium effect sizes in their field. Below, you can see the elicited prior distribution for Cohen’s d for all experts along with their respective fields of research.

As can be expected, no two elicited distributions are exactly alike. However, the prior distributions, especially the distributions of Experts 2-5, are remarkably similar. Expert 1 deviated from the other experts in that they expected substantially lower effect sizes. Expert 6 displayed less uncertainty than the other experts.

After eliciting prior distributions from experts, the next question we ask is: To what extent do differences in priors influence the results of Bayesian hypothesis testing? In other words, how sensitive is the Bayes factor to interpersonal variation in the prior? This question addresses a frequently voiced concern about Bayesian methods: Results of Bayesian analyses could be influenced by arbitrary features of the prior distribution.

To investigate the sensitivity of the Bayes factor to the interpersonal variation in elicited priors, we applied the elicited prior distributions to a large number of re-analyses of studies in psychology. Specifically, for elicited priors on Cohen’s d, we re-analyzed t-tests from a database assembled by Wetzels et al. (2011) that contains 855 t-tests from the journals Psychonomic Bulletin & Review and the Journal of Experimental Psychology: Learning, Memory, and Cognition. In each test, we used the elicited priors as prior distribution on Cohen’s d in the alternative model.

What does it mean if a Bayes factor is sensitive to the prior? Here, we used three criteria: First, we checked for all combinations of prior distributions how often a change in priors led to a change in the direction of the Bayes factor. We recorded a change in direction if the Bayes factor showed evidence for the null model (i.e., BF10 < 1) for one prior and evidence for the alternative model (i.e., BF10 > 1) for a different prior. Agreement was conversely defined as both Bayes factors being larger or smaller than one. As can be seen below, agreement rates were generally high for all combinations of prior distributions.
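The direction criterion can be sketched as follows. This is an illustration and not the analysis code from the preprint. The function name, the prior labels, and the Bayes factors are hypothetical, and a Bayes factor of exactly 1 is counted here as not favoring the alternative.

```python
from itertools import combinations


def direction_agreement(bf_a, bf_b):
    """Share of studies where two priors give Bayes factors on the same side of 1."""
    same_side = [(a > 1) == (b > 1) for a, b in zip(bf_a, bf_b)]
    return sum(same_side) / len(same_side)


# Hypothetical BF10 values for four studies under three different priors.
bf10_by_prior = {
    "Expert A": [0.5, 3.0, 12.0, 0.9],
    "Expert B": [0.4, 2.0, 15.0, 1.2],
    "Default": [0.6, 2.5, 9.0, 0.8],
}

# Agreement rate for every pair of priors.
agreement = {
    (p, q): direction_agreement(bf10_by_prior[p], bf10_by_prior[q])
    for p, q in combinations(bf10_by_prior, 2)
}
for pair, rate in agreement.items():
    print(pair, rate)
```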

As a second sensitivity criterion, we recorded changes in the evidence category of the Bayes factor. Often, researchers are interested in whether a hypothesis test provides strong evidence in favor of the alternative hypothesis (e.g., BF10 > 10), strong evidence in favor of the null hypothesis (e.g., BF10 < 1/10), or inconclusive evidence (e.g., 1/10 < BF10 < 10). Thus, they classify the Bayes factor as belonging to one of three evidence categories. We recorded whether different priors led to a change in these evidence categories, that is, whether one Bayes factor would be classified as strong evidence, while a Bayes factor using a different prior would be classified as inconclusive evidence or strong evidence in favor of the other hypothesis. From the figure below, we can see that overall the agreement of Bayes factors with regard to evidence category is slightly lower than the agreement with regard to direction. However, this can be expected since evaluating agreement across two cut-points will generally result in lower agreement than evaluating agreement across a single cut-point.
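A minimal sketch of the category criterion, using the example thresholds of 10 and 1/10 mentioned above. The function names and the Bayes factors are hypothetical and not taken from the re-analysis.

```python
def evidence_category(bf10, threshold=10.0):
    """Classify a Bayes factor into one of three evidence categories."""
    if bf10 > threshold:
        return "strong evidence for H1"
    if bf10 < 1 / threshold:
        return "strong evidence for H0"
    return "inconclusive"


def category_agreement(bf_a, bf_b, threshold=10.0):
    """Share of studies where two priors lead to the same evidence category."""
    same = [
        evidence_category(a, threshold) == evidence_category(b, threshold)
        for a, b in zip(bf_a, bf_b)
    ]
    return sum(same) / len(same)


# Hypothetical BF10 values for four studies under two elicited priors.
expert_a = [0.05, 3.0, 12.0, 0.9]
expert_b = [0.08, 2.0, 8.0, 1.2]
print(category_agreement(expert_a, expert_b))  # 0.75
```

In this toy example the third study switches from strong evidence to inconclusive evidence, even though both Bayes factors point in the same direction. This illustrates why agreement across two cut-points tends to be lower than agreement across one.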

As a third aspect of Bayes factor sensitivity we investigated changes in the exact Bayes factor value. The figure below shows the correspondence of log Bayes factors for all experts and all tests in the Wetzels et al. (2011) database. What becomes clear is that Bayes factors are not always larger or smaller for one prior distribution compared to another, but that the relation differs per study. In fact, the effect size in the sample determines which prior distribution yields the highest Bayes factor in a study. Sample size has an additional effect, with larger sample sizes leading to more pronounced differences between Bayes factors for different prior distributions.

The sensitivity of the Bayes factor has often been a subject of discussion in previous research. Our results show that the Bayes factor is sensitive to the interpersonal variability between elicited prior distributions. Even for moderate sample sizes, differences between Bayes factors with different prior distributions can easily range in the thousands. However, our results also indicate that the use of different elicited prior distributions rarely changes the direction of the Bayes factor or the category of evidence strength. Thus, the qualitative conclusions of hypothesis tests in psychology rarely change based on the prior distribution. This insight may increase the support for informed Bayesian inference among researchers who were sceptical because they feared that the subjectivity of prior distributions might determine the qualitative outcomes of their Bayesian hypothesis tests.

Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E.-J. (2011). Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. *Perspectives on Psychological Science, 6*(3), 291–298. https://doi.org/10.1177/1745691611406923

Stefan, A., Katsimpokis, D., Gronau, Q. F., & Wagenmakers, E.-J. (2021). Expert agreement in prior elicitation and its effects on Bayesian inference. *PsyArXiv Preprint.* https://doi.org/10.31234/osf.io/8xkqd

Icons made by Freepik from www.flaticon.com

Angelika is a PhD candidate at the Psychological Methods Group of the University of Amsterdam.

Dimitris Katsimpokis is a PhD student at the University of Basel.

Quentin is a PhD candidate at the Psychological Methods Group of the University of Amsterdam & postdoctoral fellow working on stop-signal models for how we cancel and modify movements and on cognitive models for improving the diagnosticity of eyewitness memory choices.

Eric-Jan (EJ) Wagenmakers is professor at the Psychological Methods Group at the University of Amsterdam.