Hidden Aspects of the Research ADOS Are Bound to Affect Autism Science

The research-grade Autism Diagnostic Observation Schedule (ADOS) is a broadly used instrument that informs and steers much of the science of autism. Despite its broad use, little is known about the empirical variability inherent in ADOS scores, or about whether those scores are appropriate for defining change and its rate when the test is used repeatedly to characterize neurodevelopmental trajectories. Here we examine the empirical distributions of research-grade ADOS scores from 1324 records in a cross-section of the population comprising participants with autism between five and 65 years of age. We find that these empirical distributions violate the theoretical requirements of normality and homogeneous variance, which are essential for independence between bias and sensitivity. Further, we assess a subset of 52 typical controls versus those with autism and find a lack of proper elements to characterize neurodevelopmental trajectories in a coping nervous system changing at nonuniform, nonlinear rates. We repeated the assessments over four visits in a subset of the participants with autism for whom verbal criteria retained the same appropriate ADOS modules over the time span of the four visits. This reveals that switching the clinician changes the cutoff scores and consequently influences the diagnosis, despite maintaining fidelity in the test's modules, room conditions, and task fluidity per visit. Given the changes in probability distribution shape and dispersion of these ADOS scores, the lack of appropriate metric spaces to define similarity measures to characterize change, and the impact that these elements have on sensitivity-bias codependencies and on longitudinal tracking of autism, we invite a discussion on readjusting the use of this test for scientific purposes.


Introduction
Several debates have been published surrounding controversial reliance on the research-grade Autism Diagnostic Observation Schedule (ADOS) [1] for use in scientific Autism [...] expressing error bars with +/- two standard deviations from the often-assumed Gaussian mean.

Indeed, inherent in the use of the Autism clinical model that has been transferred to basic science as a research-appropriate model are the assumptions of normal distribution and linear processes with stationary statistics that emerge independent of the observer's bias when rating rather complex and dynamic social behaviors (Figure 2A, B). In research adopting such scoring scales, the scales have yet to be mapped to biophysical data from the somatic- [...]

The ADOS range of scores is based on non-negative integer values, with the 0-value at the lower bound representing that the behavior is not present as specified. Behaviors coded on the ADOS are assumed to be behaviors that occur in non-spectrum individuals (e.g., eye contact, pointing, shared enjoyment) as well as behaviors that could occur in ASD (e.g., stereotyped/idiosyncratic language, complex mannerisms). We note that those behaviors present in non-autistic individuals have not been experimentally assessed under ADOS conditions, as noted in the last sentence of the classical paper [1]. Aspects of social interactions such as eye contact, pointing movements, face-processing micro-motions, [...]

It may be possible that the discrete data from the observational scores statistically align well with the non-normally distributed, non-linear complex dynamics data that we find in the continuous biophysical data underlying the physiological and neurobiological studies of Autism. If that were the case, we would have to rethink the scales currently defining the cut-off criteria adopted in research to detect Autism, as such discrete (static) scores are often expected to correlate with the continuous (dynamic) biophysical data from scientific studies. [...] because the clinical version of the ADOS is not a norm-referenced test, but rather a criterion-referenced one.

As such, the research-grade ADOS, which was adopted in science from its clinical counterpart, has no proper metric to measure relative changes away from typical levels. In this sense, although the ADOS modules were designed to account for possible disparities in cognitive/verbal capacity, there are no age-dependent criteria in the research-adopted version to ascertain physical rate of change and to measure departure from typical physiologic maturation, i.e. maturation with respect to normative trends (Figure 1A). Even the so-called "standardized" ADOS severity scores do not address this point: because they were developed as criterion-referenced scores for clinical use, the scale was not built using typical controls, as a norm-referenced test would be [35; 36; 37]. The score is designed to compare an individual with ASD to other individuals with ASD of the same age and language level. It also has a range of 6-10 for individuals with ASD and is not meant to represent ASD on a range of 1-10. Autism is not only highly heterogeneous. It also has neurodevelopmental asynchronies within a group of the same age, meaning that two individuals may both be 10 years old, but one may have the signatures of neuromotor control of normative 3-year-old children [6] (Figure 1A, B, D). Thus, aging with autism is different from typical aging.

The absence of a similarity metric for the adopted research ADOS poses a challenge to the scientific community. What is the normative range of scores that reflects age-appropriate typical social interactions?

Because of the prevalent influence that the research-grade ADOS test has on basic science at all levels, it is imperative to examine the inherent theoretical assumptions that adopters of this test have made and to verify that the outcome of this test (as administered in research settings) empirically matches the theoretical assumptions of the users.

In the first part of this paper, we use the Autism Brain Imaging Data Exchange (ABIDE) [...] behavior that would be otherwise present; or if they significantly depart from 0. In the last part of the paper, we longitudinally track a subset of the individuals with Autism who returned to the lab for 4 visits. We tested these participants using the same module (twice), each module administered by different clinicians, to assess the extent to which such changes impacted the scores that they received, despite adjusting for physical age. We report our [...] Psychologist. Further, two clinically certified raters independently video-taped and discussed the sessions to ensure module administration fidelity.

The two raters who administered the ADOS to the participants were research reliable. [...] owing to the test's assumption that no significant sensory and motor issues are present. As such, it is never possible to assess causality. For all these reasons, and because our lab [...]

The response of the child determines the score. Likewise, the way in which the rater evokes the response influences the child's choice of actions that are consequential to the rater's provoking actions. To probe the extent to which a change in the tester influenced the scoring, our experiment manipulated the rater as a parameter, while holding all other conditions constant in two different visits. Participants came to the lab for a total of four visits.

For each participant, two modules were selected, and research-reliable testers were employed. One module was deemed the most adequate one, while the other was deemed the feasible one. By most adequate, we mean that the module was at the child's verbal and developmental level, while feasible means that the child could perform the entire module, but it would not be the adequate one to perform a diagnosis or to aid a clinician in performing a diagnosis of autism. We note that previous research indicates that inappropriate ADOS module use invalidates the assessment, and the scores then do not accurately reflect the child's performance on the assessment. Nevertheless, this study is not about diagnosing autism but about evaluating the use of the ADOS test in basic science, specifically assessing the variability that using different raters may add to the scores. Therefore, in addition to changing the rater, we also manipulated the use of the modules across different visits.

Each module took between 40 and 60 minutes to complete. Both the rater and the participant were recorded by two video cameras from different angles and by wearable smart sensors, embedded in their clothing and worn like wristwatches on the wrists and ankles. The digital data will be the subject of a different paper. Additional information about the ADOS test can be found in the Supplementary Table.

Open Access Data: The ABIDE records contain ADOS-2 and ADOS-G scores that we [...] pieces of information that will add to our understanding of its most appropriate use." (emphasis added). Table 1 shows the 52 participants' scores and ages at the baseline visit, where the most appropriate ADOS-2 module was selected for each child.

We re-assessed 14 of the individuals with ASD (mean age 9.3 ± 3.0 years) across 4 visits taking place within 1.3 years on average (± 6 months), to ask the extent to which switching the clinician and/or the ADOS module would change the outcome of the test for the same child. To that end, for each child, in the first two visits we used the same clinician but two different ADOS modules. According to each assessment in each visit, the raters determined the modules that were the most appropriate and feasible. From these assessments [...]

The first module (visit 1) determined the most appropriate module at baseline for the given child. The second module (visit 2) was feasible (the child could do it) but was not appropriate. For example, if the most appropriate module in visit 1 was module 3, we would choose module 2 for visit 2. Then the same clinician would give these two modules whenever the participant retained the set of modules from visits 1 and 2, according to the raters' evaluation; and the same two modules were then administered in subsequent visits (by a different rater): module 3 in visit 3 and module 2 in visit 4 (see Table 2).

We switched clinician and maintained module fidelity and identical room setup. In this way, each child had a chance to become familiar with the 2 modules by the time that we switched the clinician. Those same two modules, in the same order as in the first two visits, would then be administered by the new clinician to give us a chance to probe the influences of the clinician on the child's response. The flexibility in task administration according to the child's responses was respected to ensure fluid responses.
We hypothesized that this switching of clinicians (despite the use of the same module and task order in each administration) would have a substantial effect on the ADOS sub-scores, thus significantly impacting the reliability of the total score and the cut-off for the diagnosis given to the child by the rater (clinician). To test this hypothesis, we used non-parametric statistics, whereby we do not assume any distribution a priori. Table 2 [...]

Despite the well-established physiological and neurobiological distinctions between males and females in the spectrum of Autism, here we could not find any statistically significant difference between the ADOS-G (or the ADOS-2) total scores to automatically separate these two distinct phenotypes. Nor could we find statistically significant differences between the sub-scores of each ADOS test when comparing males and females using the Wilcoxon rank-sum test, as all p-values were above 0.5.
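
As an illustration of the kind of non-parametric comparison described above, the following is a minimal sketch of a two-sided Wilcoxon rank-sum test written with the Python standard library only. It uses midranks for tied values and a tie-corrected normal approximation, which matters for integer-valued scores such as those of the ADOS. The function names and the two small score lists are hypothetical and serve only as an example; they are not the ABIDE data, and this is not necessarily the implementation used in the analyses reported here.

```python
from collections import Counter
from math import sqrt
from statistics import NormalDist


def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    ordered = sorted(values)
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        first_position.setdefault(value, position)
    counts = Counter(ordered)
    return [first_position[v] + (counts[v] - 1) / 2 for v in values]


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (tie-corrected normal approximation).

    Returns (z, p): the standardized rank sum of sample x and its p-value.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = midranks(pooled)
    w = sum(ranks[:n1])                      # rank sum of the first sample
    mean_w = n1 * (n + 1) / 2                # expected rank sum under H0
    tie_term = sum(t**3 - t for t in Counter(pooled).values())
    var_w = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_w == 0:                           # all observations identical
        return 0.0, 1.0
    z = (w - mean_w) / sqrt(var_w)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical total scores for two groups (illustrative values only).
group_a = [14, 11, 17, 9, 13, 12, 16, 10]
group_b = [12, 15, 10, 13, 11, 14]
z, p = rank_sum_test(group_a, group_b)
print(f"z = {z:.3f}, p = {p:.3f}")
```

Because the test operates on ranks, it makes no assumption of normality or equal variance, which is the reason for preferring it given the distributions of scores examined here. For very small samples an exact test would be preferable to the normal approximation used in this sketch.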

However, we did notice a significant difference in the distributions of the total scores for males when comparing those of the ADOS-G vs. those of the ADOS-2 (center panel in Figure 6B). This difference was assessed using the Kolmogorov-Smirnov test for 2 empirical [...]

[...] and AS

We expected that, despite their subtle differences, the variants of the same test would provide consistent results for participants with AS and for those with ASD. In the case of ADOS-G, we found a statistically significant difference between the scores of males with ASD and those with AS, using the Wilcoxon rank-sum test, which yielded a p-value of 3.7 x 10^-7. [...] Asperger's syndrome.

We note that, given the differences in sample size and the non-normality in the distributions of scores, we used the Wilcoxon rank-sum test in all the above comparisons. This [...] score. However, the RRB score changed significantly across visits (Chi-sq 21.01, p-value 0.0001), with major differences when switching clinician in visits 3-4. Despite the use of the same modules, room setup and task order for each child, the differences in ADOS-2 scores for RRB were marked as systematically different by the post hoc multiple-comparison test. These outcomes can be appreciated in Figure 10 for total (A), social affect (B) and RRB (C).

We further tested all the scores for each clinician by pooling across all children and score types, to examine the types of distributions best fitting their frequency histograms. Figure 10A shows this analysis for each clinician, while Figure 10B shows the output of the non-parametric Kruskal-Wallis test, which revealed a statistically significant difference. Figure 10C shows the failure of normality for scores by Clinician 1, while Figure 10D shows the same for Clinician 2. The use of maximum likelihood estimation (MLE) to ascertain the fit of several probability distribution functions confirmed that the normal is not a good fit for either (Figure 10E). Further, the Gamma distribution was used as per the MLE outcome to fit the data and compare the scores of the two clinicians for the same children, same modules, same module order per visit and same task order. Figure 10F shows the fits of the Normal distribution (left-hand side) and that of the [...]

The relative changes in score/age (with the age measured in years, months) were obtained for each of the 14 participants that we tracked over 4 visits. When examining these derivative data, we found significant differences in the RRB ADOS-2 scores. Consistent with the effect that the change of clinician in visit 3 had revealed for the size data given by absolute ADOS-2 scores, the derivative data, which consider the age change of the participant from visit to visit, also reveal significant changes in the RRB scores. The same individuals were rated significantly differently by the clinicians, thus yielding different scores for the same module and tasks. The Kruskal-Wallis test for the comparison of the ADOS-2 RRB scores across visits yielded significant differences (Chi-sq 13.45, p-value 0.003).
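
The across-visit comparison of age-adjusted scores described above can be sketched as follows, using only the Python standard library. Two points are assumptions made for illustration: the score/age quantity is taken here to be the score divided by the age in years at each visit, and the p-value comes from the usual chi-square approximation with (number of visits - 1) degrees of freedom. The function names and the scores and ages in the example are hypothetical and are not the data of the 14 participants.

```python
from collections import Counter
from math import erfc, exp, gamma, sqrt


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum_{k<df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return exp(-half) * total
    # Closed form for odd df, built on the complementary error function.
    total = sum(half ** (k - 0.5) / gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return erfc(sqrt(half)) + exp(-half) * total


def midranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    ordered = sorted(values)
    first_position = {}
    for position, value in enumerate(ordered, start=1):
        first_position.setdefault(value, position)
    counts = Counter(ordered)
    return [first_position[v] + (counts[v] - 1) / 2 for v in values]


def kruskal_wallis(groups):
    """Tie-corrected Kruskal-Wallis H statistic and chi-square p-value.

    Assumes the pooled values are not all identical.
    """
    pooled = [v for g in groups for v in g]
    n = len(pooled)
    ranks = midranks(pooled)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        h += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
    ties = sum(t**3 - t for t in Counter(pooled).values())
    h /= 1 - ties / (n**3 - n)
    return h, chi2_sf(h, len(groups) - 1)


# Hypothetical RRB scores and ages (years) for three children over four visits.
rrb = [[2, 3, 5, 6], [1, 1, 4, 5], [3, 2, 6, 7]]
age = [[8.0, 8.4, 8.9, 9.3], [10.1, 10.5, 11.0, 11.4], [6.5, 6.9, 7.3, 7.8]]
per_visit = [[rrb[c][v] / age[c][v] for c in range(3)] for v in range(4)]
h, p = kruskal_wallis(per_visit)
print(f"H = {h:.2f}, p = {p:.4f}")
```

In this sketch each visit is treated as one group, so the test asks whether the age-adjusted scores differ across visits without assuming any particular distribution for them.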

Figure 11A shows the evolution of the clinician's diagnostic criteria over time. [...]

The individual evolution for each child is seen in Figure 11B, where the different clinicians' styles of scoring can be seen. There is no inter-rater reliability rendering these ADOS-2 criteria robust. For the same child and ADOS-2 module, we see changes in the classification of Autism vs. ASD. These differences in perception biases add to the findings on non-normality of the distributions previously described in Figure 10 for this lab cohort, and for the large cross-sectional population data from ABIDE.

[...] are adaptable, we would be missing self-correcting mechanisms by imposing theoretical models without empirically informing those models.

In the context of this test, changes in the distributions' shape and dispersion (across ages and sex) imply lack of independence between sensitivity and bias. As such, the rater's [...]

As stated, neurodevelopment occurs at highly non-linear accelerated rates. In a coping nervous system, age is a non-uniform quantity in that any two given children with the same [...]

[...] does not produce an official diagnosis, its use in research labs in the US could serve as a flag to send parents to federally certified clinics that offer services upon multi-pronged criteria involving other tests. However, owing to copyright issues, Western Psychological Services (WPS) does not allow researchers to copy, reproduce or share the ADOS booklet with important details of the outcome. In other words, those children that come to our labs, receive the research-grade ADOS and pass the cut-off scores are labeled autistic by this test. However, as researchers, we are not allowed to share details with their parents.

This obstructs their ability to go to a proper clinic and pursue the diagnosis that will give them access to Early Intervention Programs or to Individualized Education Programs in cases when the child is of school age. If the WPS and the trainers of the ADOS allowed this, the test adopted by researchers would serve as a warning to parents that some aspects of the child's neurodevelopment may be off track.

Unlike the statistical confounds that the ADOS total scores and sub-scores surely bring to research, in its current form at the clinic, the ADOS provides psychological comfort to adults who had never been previously diagnosed and could not understand their place in the social scene. Many adults, newly diagnosed at the clinic, express a sense of relief upon learning that they are on the Autism spectrum and as such have social-interaction differences.

Further, the ADOS adds important information to the coarser DSM diagnosis. Thus, the clinical value of this instrument is highly appreciated. However, its use as a research instrument to inform physiological studies is now clearly questionable and should be discussed among the scientific community, particularly the community with the skill set to fully understand statistics beyond black-box approaches that utilize software packages without ever verifying the assumptions of the methods implemented by those packages.

We should point out that the SDT framework used by the ADOS to test validity was introduced to science in the late 1950s to address very different problems in engineering. It [...] be the answer to the start of a new era in Autism scientific research aimed at a physiological characterization for medical use. In our present study, it was evident that, whether using absolute scores or derivative, age-dependent data accounting for longitudinal dynamic changes from visit to visit, the RRB score, which reflects sensory-motor issues, best picked up the switching of the clinician. If we were to combine this structured social test with wearable biosensors, we could automatically stratify Autism spectrum disorders and provide objective criteria of use to the community doing basic scientific work (e.g. geneticists, electrophysiologists, neuroimaging researchers, etc.) and to the physicians treating the medical issues.

Some collaborative work along those lines has been done between clinicians and researchers, but more research is needed to fully validate and replicate the use of digital ADOS within the smart-mobile and personalized health concepts.

We invite the readership to consider that science in Autism needs to retake the path of independence and reclaim its agency to be able to conduct proper scientific research. This