Fixation-related Brain Potentials during Semantic Integration of Object–Scene Information

Abstract In vision science, a particularly controversial topic is whether and how quickly the semantic information about objects is available outside foveal vision. Here, we aimed at contributing to this debate by coregistering eye movements and EEG while participants viewed photographs of indoor scenes that contained a semantically consistent or inconsistent target object. Linear deconvolution modeling was used to analyze the ERPs evoked by scene onset as well as the fixation-related potentials (FRPs) elicited by the fixation on the target object (t) and by the preceding fixation (t − 1). Object–scene consistency did not influence the probability of immediate target fixation or the ERP evoked by scene onset, which suggests that object–scene semantics was not accessed immediately. However, during the subsequent scene exploration, inconsistent objects were prioritized over consistent objects in extrafoveal vision (i.e., looked at earlier) and were more effortful to process in foveal vision (i.e., looked at longer). In FRPs, we demonstrate a fixation-related N300/N400 effect, whereby inconsistent objects elicit a larger frontocentral negativity than consistent objects. In line with the behavioral findings, this effect was already seen in FRPs aligned to the pretarget fixation t − 1 and persisted throughout fixation t, indicating that the extraction of object semantics can already begin in extrafoveal vision. Taken together, the results emphasize the usefulness of combined EEG/eye movement recordings for understanding the mechanisms of object–scene integration during natural viewing.


INTRODUCTION
In our daily activities, for example, when we search for something in a room, our attention is mostly oriented to objects. The time course of object recognition and the role of overt attention in this process are therefore topics of considerable interest in the visual sciences. In the context of real-world scene perception, what constitutes an object is a more complex question than intuition would suggest (e.g., Wolfe, Alvarez, Rosenholtz, Kuzmova, & Sherman, 2011). An object is likely a hierarchical construct (e.g., Feldman, 2003), with both low-level features (e.g., visual saliency) and high-level properties (e.g., semantics) contributing to its identity. Accordingly, when a natural scene is inspected with eye movements, the observer's attentional selection is thought to be based either on objects (e.g., Nuthmann & Henderson, 2010), image features (saliency; Itti, Koch, & Niebur, 1998), or some combination of the two (e.g., Stoll, Thrun, Nuthmann, & Einhäuser, 2015).
An early and uncontroversial finding is that the recognition of objects is mediated by their semantic consistency. For example, an object that the observer would not expect to occur in a particular scene (e.g., a toothbrush in a kitchen) is recognized less accurately (e.g., Fenske, Aminoff, Gronau, & Bar, 2006; Davenport & Potter, 2004; Biederman, 1972) and looked at for longer than an expected object (e.g., Cornelissen & Võ, 2017; Henderson, Weeks, & Hollingworth, 1999; De Graef, Christiaens, & d'Ydewalle, 1990).
What is more controversial, however, is the exact time course along which the meaning of an object is processed and how this semantic processing then influences the overt allocation of visual attention (see Wu, Wick, & Pomplun, 2014, for a review). Two interrelated questions are at the core of this debate: (1) How much time is needed to access the meaning of objects after a scene is displayed, and (2) Can object semantics be extracted before the object is overtly attended, that is, while the object is still outside high-acuity foveal vision (> 1° eccentricity) or even in the periphery (> 5° eccentricity)?
Evidence that the meaning of not-yet-fixated objects can capture overt attention comes from experiments that have used sparse displays of several standalone objects (e.g., Cimminella, Della Sala, & Coco, in press; Nuthmann, de Groot, Huettig, & Olivers, 2019; Belke, Humphreys, Watson, Meyer, & Telling, 2008; Moores, Laiti, & Chelazzi, 2003). For example, across three different experiments, Nuthmann et al. found that the very first saccade in the display was directed more frequently to objects that were semantically related to a target object rather than to unrelated objects.
Whether such findings generalize to objects embedded in real-world scenes is currently an open research question. The size of the visual span, that is, the area of the visual field from which observers can take in useful information (see Rayner, 2014, for a review), is large in scene viewing. For object-in-scene search, it corresponded to approximately 8° in each direction from fixation (Nuthmann, 2013). This opens up the possibility that both low- and high-level object properties can be processed outside the fovea. This is clearly the case for low-level visual features: Objects that are highly salient (i.e., visually distinct) are preferentially selected for fixation (e.g., Stoll et al., 2015). If semantic processing also takes place in extrafoveal vision, then objects that are inconsistent with the scene context (which are also thought to be more informative; Antes, 1974) should be fixated earlier in time than consistent ones (Loftus & Mackworth, 1978; Mackworth & Morandi, 1967).
However, results from eye-movement studies on this issue have been mixed. A number of studies have indeed reported evidence for an inconsistent object advantage (e.g., Borges, Fernandes, & Coco, 2019; LaPointe & Milliken, 2016; Bonitz & Gordon, 2008; Underwood, Templeman, Lamming, & Foulsham, 2008; Loftus & Mackworth, 1978). Among these studies, only Loftus and Mackworth (1978) have reported evidence for immediate extrafoveal attentional capture (i.e., within the first fixation) by object-scene semantics. In that study, which used relatively sparse line drawings of scenes, the mean amplitude of the saccade into the critical object was more than 7°, suggesting that viewers could process semantic information based on peripheral information obtained in a single fixation. In contrast, other studies have failed to find any advantage for inconsistent objects in attracting overt attention (e.g., Võ & Henderson, 2009; Henderson et al., 1999; De Graef et al., 1990). In these experiments, only measures of foveal processing, such as gaze duration, were influenced by object-scene consistency, with longer fixation times on inconsistent than on consistent objects.
Interestingly, a similar controversy exists in the literature on eye guidance in sentence reading. Although some degree of parafoveal processing during reading is uncontroversial, it is less clear whether semantic information is acquired from the parafovea (see Andrews & Veldre, 2019, for a review). Most evidence from studies involving readers of English has been negative (e.g., Rayner, Balota, & Pollatsek, 1986), whereas results from reading German (e.g., Hohenstein & Kliegl, 2014) and Chinese (e.g., Yan, Richter, Shu, & Kliegl, 2009) suggest that parafoveal processing can advance up to the level of semantic processing.
The processing of object-scene inconsistencies and its time course have also been investigated in electrophysiological studies (e.g., Mudrik, Lamy, & Deouell, 2010; Ganis & Kutas, 2003). In ERPs, it is commonly found that scene-inconsistent objects elicit a larger negative brain response compared with consistent ones. This long-lasting negative shift typically starts as early as 200-250 msec after stimulus onset (e.g., Draschkow, Heikel, Võ, Fiebach, & Sassenhagen, 2018; Mudrik, Shalgi, Lamy, & Deouell, 2014) and has its maximum at frontocentral scalp sites, in contrast to the centroparietal N400 effect for words (e.g., Kutas & Federmeier, 2011). The effect was found for objects that appeared at a cued location after the scene background was already shown (Ganis & Kutas, 2003), for objects that were photoshopped into the scene (Coco, Araujo, & Petersson, 2017; Mudrik et al., 2010, 2014), and for objects that were part of realistic photographs (Võ & Wolfe, 2013). These ERP effects of object-scene consistency have typically been subdivided into two distinct components: N300 and N400. The earlier part of the negative response, usually referred to as N300, has been taken to reflect the context-dependent difficulty of object identification, whereas the later N400 has been linked to semantic integration processes after the object is identified (e.g., Dyck & Brodeur, 2015). The present study was not designed to differentiate between these two subcomponents, especially considering that their scalp distribution is strongly overlapping or even topographically indistinguishable (Draschkow et al., 2018). Thus, for reasons of simplicity, we will in most cases simply refer to all frontocentral negativities as "N400."
One limiting factor of existing ERP studies is that the data were gathered using steady-fixation paradigms in which the free exploration of the scene through eye movements was not permitted. Instead, the critical object was typically large and/or located relatively close to the center of the screen, and ERPs were time-locked to the onset of the image (e.g., Mudrik et al., 2010). Because of these limitations, it remains unclear whether foveation of the object is a necessary condition for processing object-scene consistencies or whether such processing can at least begin in extrafoveal vision.
In the current study, we used fixation-related potentials (FRPs), that is, EEG waveforms aligned to fixation onset, to shed new light on the controversial findings of the role of foveal versus extrafoveal vision in extracting object semantics, while providing insights into the patterns of brain activity that underlie them (for reviews about FRPs, see Nikolaev, Meghanathan, & van Leeuwen, 2016; Dimigen, Sommer, Hohlfeld, Jacobs, & Kliegl, 2011).
In this study, we simultaneously recorded eye movements and FRPs during the viewing of real-world scenes to distinguish between three alternative hypotheses on object-scene integration that can be derived from the literature: (A) One glance of the scene is sufficient to extract object semantics from extrafoveal vision (e.g., Loftus & Mackworth, 1978), (B) extrafoveal processing of object-scene semantics is possible but takes some time to unfold (e.g., Bonitz & Gordon, 2008; Underwood et al., 2008), and (C) the processing of object semantics requires foveal vision, that is, a direct fixation of the critical object (e.g., Võ & Henderson, 2009; Henderson et al., 1999; De Graef et al., 1990). We note that these possibilities are not mutually exclusive, an issue we elaborate on in the Discussion section.
For the behavioral data, these hypotheses translate as follows: under Hypothesis A, the probability of immediate target fixation should reveal that already the first saccade on the scene goes more often toward inconsistent than consistent objects. Under Hypothesis B, there should be no effect on the first eye movement, but the latency to first fixation on the critical object should be shorter for inconsistent than consistent objects. Under Hypothesis C, only fixation times on the critical object itself should differ as a function of object-scene consistency, with longer gaze durations on inconsistent objects.
For the electrophysiological data analysis, we used a novel regression-based analysis approach (linear deconvolution modeling; Cornelissen, Sassenhagen, & Võ, 2019; Kristensen, Rivet, & Guérin-Dugué, 2017; Smith & Kutas, 2015b; Dandekar, Privitera, Carney, & Klein, 2012), which allowed us to control for the confounding influences of overlapping potentials and oculomotor covariates on the neural responses during natural viewing. In the EEG, Hypothesis A can be tested by computing the ERP time-locked to the onset of the scene on the display, following the traditional approach. Given that the critical objects in our study were not placed directly in the center of the screen from which observers started their exploration of the scene, any effect of object-scene congruency in this ERP would suggest that object semantics is rapidly processed in extrafoveal vision, even before the first eye movement is generated, in line with Loftus and Mackworth (1978). Under Hypothesis B, we would not expect to see an effect in the scene-onset ERP. Instead, we should find a negative brain potential (N400) for inconsistent as compared with consistent objects in the FRP aligned to the fixation that precedes the one that first lands on the critical object. Finally, if Hypothesis C is correct, an N400 for inconsistent objects should only arise once the critical object is foveated, that is, in the FRP aligned to the target fixation (fixation t). In contrast, no consistency effects should appear in the scene-onset ERP or in the FRP aligned to the pretarget fixation (fixation t − 1). To preview the results, both the eye movement and the EEG data lend support for Hypothesis B.

Design and Task Overview
We designed a short-term visual working memory change detection task, illustrated in Figures 1 and 2. During the study phase, participants were exposed to photographs of indoor scenes (e.g., a bathroom), each of which contained a target object that was either semantically consistent (e.g., toothpaste) or inconsistent (e.g., a flashlight) with the scene context. In the following recognition phase, after a short retention interval of 900 msec, the same scene was shown again, but in half of the trials, either the identity, the location, or both the identity and location of the target object had changed relative to the study phase. The participants' task was to indicate with a keyboard press whether or not a change had happened to the scene (see also LaPointe & Milliken, 2016). All eye-movement and EEG analyses in the present article focus on the semantic consistency manipulation of the target object during the study phase.
Figure 1. Participants viewed photographs of indoor scenes that contained a target object (highlighted with a red circle) that was either semantically consistent (here, toothpaste) or semantically inconsistent (here, flashlight) with the context of the scene. The target object could be placed at different locations within the scene, on either the left or right side. The example gaze path plotted on the right illustrates the three types of fixations analyzed in the study: (a) t − 1, the fixation preceding the first fixation to the target object; (b) t, the first fixation to the target; and (c) nt, all other (nontarget) fixations. Fixation duration is proportional to the diameter of the circle, which is red for the critical fixations and black for the nontarget fixations.

Participants
Twenty-four participants (nine men) between the ages of 18 and 33 years (M = 25.0 years) took part in the experiment after providing written informed consent. They were compensated with £7 per hour. All participants had normal or corrected-to-normal vision. Data from an additional two participants were recorded but removed from the analysis because of excessive scalp muscle (EMG) activity or skin potentials in the raw EEG. Ethics approval was obtained from the Psychology Research Ethics Committee of the University of Edinburgh.

Apparatus and Recording
Scenes were presented on a 19-in. CRT monitor (Iiyama Vision Master Pro 454) at a vertical refresh rate of 75 Hz. At the viewing distance of 60 cm, each scene subtended 35.8° × 26.9° (width × height). Eye movements were recorded monocularly from the dominant eye using an SR Research EyeLink 1000 desktop-mounted system at a sampling rate of 1000 Hz. Eye dominance for each participant was determined with a parallax test. A chin-and-forehead rest was used to stabilize the participant's head. Nine-point calibrations were run at the beginning of each session and whenever the participant's fixation deviated by > 0.5° horizontally or > 1° vertically from a drift correction point presented at trial onset.
The EEG was recorded from 64 active electrodes at a sampling rate of 512 Hz using BioSemi ActiveTwo amplifiers. Four electrodes, located near the left and right canthus and above and below the right eye, recorded the EOG. All channels were referenced against the BioSemi common mode sense (active electrode) and grounded to a passive electrode. The BioSemi hardware is DC coupled and applies digital low-pass filtering through the A/D-converter's decimation filter, which has a fifth-order sinc response with a −3 dB point at one fifth of the sample rate (corresponding approximately to a 100-Hz low-pass filter).
Offline, the EEG was rereferenced to the average of all scalp electrodes and filtered using EEGLAB's (Delorme & Makeig, 2004) Hamming-windowed sinc finite impulse response filter (pop_eegfiltnew.m) with default settings. The lower edge of the filter's passband was set to 0.2 Hz (with −6 dB attenuation at 0.1 Hz); and the upper edge, to 30 Hz (with −6 dB attenuation at 33.75 Hz). Eye tracking and EEG data were synchronized using shared triggers sent via the parallel port of the stimulus presentation PC to the two recording computers. Synchronization was performed offline using the EYE-EEG extension (v0.8) for EEGLAB (Dimigen et al., 2011). All data sets were aligned with a mean synchronization error ≤ 2 msec as computed based on trigger alignment after synchronization.

Materials and Rating
Stimuli consisted of 192 color photographs of indoor scenes (e.g., bedrooms, bathrooms, offices). Real target objects were placed in the physical scene, before each picture was taken with a tripod under controlled lighting conditions and with a fixed aperture (i.e., there was no photo-editing). One scene is shown in Figure 1; miniature versions of all stimuli used in this study are found online at https://osf.io/sjprh/. Of the 192 scenes, 96 were conceived as change items and 96 were conceived as no-change items. Each one of the 96 change scenes was created in four versions. In particular, the scene (e.g., a bathroom) was photographed with two alternative target objects in it, one that was consistent with the scene context (e.g., a toothbrush) and one that was not (e.g., a flashlight). Moreover, each of these two objects was placed at two alternative locations (left or right side) within the scene (e.g., either on the sink or on the bathtub). Accordingly, three types of change were implemented during the recognition phase (Congruency, Location, and Both; see Procedure section below).
Figure 2. Trial scheme. After a drift correction, the study scene appeared. The display duration of the scene was controlled by a gaze-contingent mechanism, and it disappeared, on average, 2000 msec after the target object was fixated. In the following retention interval, only a fixation cross was presented. During the recognition phase, the scene was presented again until participants pressed a button to indicate whether or not a change had occurred within the scene. All analyses in the present article focus on eye-movement and EEG data collected during the study phase.
Each of the 96 no-change scenes was also a real photograph with either a consistent or an inconsistent object in it, which was again located in either the left or right half of the scene. Across the 96 no-change scenes, the factors consistency (consistent vs. inconsistent objects) and location (left and right) were also balanced. However, each no-change scene was unique; that is, we did not create four different versions of each no-change scene. The data of the 96 no-change scenes, which were originally conceived to be filler trials, were included to improve the signal-to-noise ratio of the EEG analyses, as these scenes also had a balanced distribution of inconsistent and consistent objects.
As explained above, scenes contained a critical object that was either consistent or inconsistent with the scene context. Object consistency was assessed in a pretest rating study by eight naive participants who were not involved in any other aspect of the study. Each participant rated all of the no-change scenes as well as one of the four versions of each change-scene (counterbalanced across raters). Together with the scene, raters saw a box with a cropped image of the critical object. They were asked (a) to write down the name for the displayed object and (b) to respond to the question "How likely is it that this object would be found in this room?" using a 6-point Likert scale (1-6). For the object naming, a mean naming agreement of 96.35% was obtained. Furthermore, consistent objects were judged as significantly more likely (M = 5.78, SD = 0.57) to appear in the scene than inconsistent objects (M = 1.88, SD = 1.11), as confirmed by an independent-samples Kruskal-Wallis H test, χ²(1) = 616.09, p < .001.
In addition, we ensured that there was no difference between consistent and inconsistent objects on three important low-level variables: object size (pixels square), distance from the center of the scene (degrees of visual angle), and mean visual saliency of the object as computed using the Adaptive Whitening Saliency model (Garcia-Diaz, Fdez-Vidal, Pardo, & Dosil, 2012). Table 1 provides additional information about the target object. Independent t tests showed no significant difference between inconsistent and consistent objects in size, t(476) = −1.27, p = .2; visual saliency, t(476) = 0.82, p = .41; and distance from the center, t(476) = −1.75, p = .08.
The position of each target object was marked with an invisible rectangular bounding box, which was used to implement the gaze contingency mechanism (described in the Procedure section below) and to determine whether a fixation was inside the target object. The average width of the bounding box was 6.1° ± 2.0° for consistent objects and 6.1° ± 2.1° for inconsistent objects (see Table 1); the average height was 5.1° ± 1.8° and 5.4° ± 2.2°, respectively. The average distance of the object centroid from the center of the scene was 12.1° (± 2.8°) for consistent and 11.7° (± 3.0°) for inconsistent objects.

Procedure
A schematic representation of the task is shown in Figure 2. Each trial started with a drift correction of the eye tracker. Afterward, the study scene was presented (e.g., a bathroom). The display duration of the study scene was controlled by a gaze-contingent mechanism that ensured that participants fixated the target object (e.g., toothbrush or flashlight) at least once during the trial. Specifically, the study scene disappeared, on average, 2000 msec (with a random jitter of ± 200 msec, drawn from a uniform distribution) after the participant's eyes left the invisible bounding box of the target object (and provided that the target had been fixated for at least 150 msec). The jittered delay of about 2000 msec was implemented to prevent participants from learning to associate the last fixated object during the study phase with the changed object during the recognition phase. If the participant did not fixate the target object within 10 sec, the study scene disappeared from the screen and the retention interval was triggered, which lasted for 900 msec.
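The gaze-contingent offset rule described above can be sketched as follows. This is only an illustration in Python, not the Experiment Builder implementation used in the study: the function name, the fixation record format, and the choice to accumulate dwell time over consecutive fixations within one visit to the bounding box are assumptions made for the example.

```python
import random

def scene_offset_time(fixations, box, rng=None, min_dwell_ms=150,
                      delay_ms=2000, jitter_ms=200, timeout_ms=10_000):
    """Return the scene offset time (msec after scene onset) for one trial.

    fixations: list of (start_ms, end_ms, x, y) tuples in temporal order
               (hypothetical format).
    box: (left, top, right, bottom) of the invisible target bounding box.
    """
    rng = rng or random.Random()
    left, top, right, bottom = box
    dwell = 0          # time spent on the target during the current visit
    last_exit = None   # end of the most recent fixation inside the box
    for start, end, x, y in fixations:
        if start >= timeout_ms:
            break
        if left <= x <= right and top <= y <= bottom:
            dwell += end - start
            last_exit = end
        elif dwell >= min_dwell_ms:
            break      # gaze left the box after a sufficiently long visit
        else:
            dwell = 0  # visit was too short (or target not yet fixated)
    if dwell >= min_dwell_ms:
        # Delay of 2000 msec with uniform jitter of +/- 200 msec
        return last_exit + rng.uniform(delay_ms - jitter_ms, delay_ms + jitter_ms)
    return timeout_ms  # target never fixated: scene removed after 10 sec
```

In this sketch, the end of the last fixation inside the box stands in for the moment the eyes leave the box; an online implementation would presumably work on gaze samples instead.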
In the following recognition phase (data not analyzed here), the scene was presented again, either with (50% of trials) or without (50% of trials) a change to an object in the scene. Three types of object changes occurred with equal probability: Location, Consistency, or Both. In the (a) Location condition, the target object changed its position and moved either from left to right or from right to left to another plausible location within the scene (e.g., a toothbrush was placed elsewhere within the bathroom scene). In the (b) Consistency condition, the object remained in the same location but was replaced with another object of opposite semantic consistency (e.g., the toothbrush was replaced by a flashlight). Finally, in the (c) Both condition, the object was both replaced and moved within the scene (e.g., a toothbrush was replaced by a flashlight at a different location).
During the recognition phase, participants had to indicate whether they noticed any kind of change within the scene by pressing the arrow keys on the keyboard. Afterward, the scene disappeared, and the next trial began. If participants did not respond within 10 sec, a missing response was recorded.
The type of change between trials was fully counterbalanced using a Latin Square rotation. Specifically, the 96 change trials were distributed across 12 different lists, implementing the different types of change. This implies that each participant was exposed to an equal number of consistent and inconsistent change trials. The 96 no-change trials also were composed of an equal number of consistent and inconsistent scenes and were the same for each participant. During the experiment, all 192 trials were presented in a randomized order. They were preceded by four practice trials at the start of the session. Written instructions were given to explain the task, which took 20-40 min to complete. The experiment was implemented using the SR Research Experiment Builder software.

Eye-movement Events and Data Exclusion
Fixations and saccade events were extracted from the raw gaze data using the SR Research Data Viewer software, which performs saccade detection based on velocity and acceleration thresholds of 30°/sec and 9500°/sec², respectively. To provide directly comparable results for eye-movement behavior and FRP analyses, we discarded all trials on which we did not have clean data from both recordings. Specifically, from 4608 trials (24 participants × 192 trials), we excluded 10 trials (0.2%) because of machine error (i.e., no data were recorded for those trials), 689 trials (15.0%) because the participant responded incorrectly after the recognition phase, and 494 trials (10.7%) because the target object was not fixated during the study phase. Finally, we removed an additional 97 trials (2.1%) for which the target fixation overlapped with intervals of the EEG that contained nonocular artifacts (see below). The final data set for the behavioral and FRP analyses therefore was composed of 3318 unique trials: 1567 for the consistent condition and 1751 for the inconsistent condition. Per participant, this corresponded to an average of 65.3 trials (± 6.9, range = 48-78) for consistent and 73.0 trials (± 6.9, range = 59-82) for inconsistent items. Because of the fixation check, participants were always fixating at the screen center when the scene appeared on the display. This ongoing central fixation was removed from all analyses.
Table 1 (excerpt)
Incoming saccade amplitude to t − 1 (°)    6.1 ± 5.2    6 ± 4.8
Incoming saccade amplitude to t (°)    8.5 ± 5.2    8.3 ± 4.8
Incoming saccade amplitude to t + 1 (°)    9.5 ± 5.9    10.2 ± 5.8
Distance of fixation t − 1 from the closest edge of target (°)
Note. Target object size and distance to target are based on the bounding box around the object. The fixation t + 1 is the first fixation after leaving the bounding box of the target object.
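The trial counts reported in this section can be cross-checked with a few lines of arithmetic. The snippet below is a bookkeeping sketch only; the dictionary keys are shorthand labels introduced here, and all numbers are taken from the text.

```python
# Total number of recorded trials: 24 participants x 192 trials
total = 24 * 192

# Exclusion counts as reported in the text (labels are shorthand)
exclusions = {
    "machine error": 10,
    "incorrect response": 689,
    "target not fixated": 494,
    "nonocular EEG artifact": 97,
}

remaining = total - sum(exclusions.values())
percent = {label: round(100 * n / total, 1) for label, n in exclusions.items()}

# Final trials per condition and their per-participant averages
consistent, inconsistent = 1567, 1751
mean_consistent = round(consistent / 24, 1)
mean_inconsistent = round(inconsistent / 24, 1)

print(total, remaining, percent, mean_consistent, mean_inconsistent)
```

The remaining count equals 3318, which matches the sum of the two condition totals (1567 + 1751), and the rounded percentages and per-participant means agree with the values given above.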

EEG Ocular Artifact Correction
EEG recordings during free viewing are contaminated by three types of ocular artifacts (Plöchl, Ossandón, & König, 2012) that need to be removed to get at the genuine brain activity. Here, we applied an optimized variant (Dimigen, 2020) of independent component analysis (ICA; Jung et al., 1998), which uses the information provided by the eye tracker to objectively identify ocular ICA components (Plöchl et al., 2012). In a first step, we created optimized ICA training data by high-pass filtering a copy of the EEG at 2 Hz (Dimigen, 2020; Winkler, Debener, Müller, & Tangermann, 2015) and segmenting it into epochs lasting from scene onset until 3 sec thereafter. These high-pass-filtered training data were entered into an extended Infomax ICA using EEGLAB, and the resulting unmixing weights were then transferred to the original (i.e., less strictly filtered) recording (Debener, Thorne, Schneider, & Viola, 2010). From this original EEG data set, we then removed all independent components whose time course varied more strongly during saccade intervals (defined as lasting from 20 msec before saccade onset until 20 msec after saccade offset) than during fixations, with the threshold for the variance ratio (saccade/fixation; see Plöchl et al., 2012) set to 1.3. Finally, the artifact-corrected continuous EEG was back-projected to the sensor space. For a validation of the ICA procedure, please refer to Supplementary Figure S1.
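The variance-ratio criterion for flagging ocular components can be illustrated with a short sketch. It is written in plain Python rather than with the EEGLAB/EYE-EEG functions actually used, all function names are hypothetical, and it simplifies the criterion by pooling all saccade samples and all fixation samples before computing the two variances, which may differ from the published procedure.

```python
from statistics import pvariance

def build_saccade_mask(n_samples, saccades_ms, srate=512, pad_ms=20):
    """Mark samples from 20 msec before saccade onset to 20 msec after offset."""
    mask = [False] * n_samples
    for onset_ms, offset_ms in saccades_ms:
        first = max(0, round((onset_ms - pad_ms) * srate / 1000))
        last = min(n_samples - 1, round((offset_ms + pad_ms) * srate / 1000))
        for i in range(first, last + 1):
            mask[i] = True
    return mask

def variance_ratio(activation, mask):
    """Variance of a component during saccade samples divided by the rest."""
    sacc = [a for a, m in zip(activation, mask) if m]
    fix = [a for a, m in zip(activation, mask) if not m]
    return pvariance(sacc) / pvariance(fix)

def flag_ocular_components(activations, mask, threshold=1.3):
    """Return indices of components whose saccade/fixation ratio exceeds 1.3."""
    return [i for i, act in enumerate(activations)
            if variance_ratio(act, mask) > threshold]
```

A component that fluctuates much more strongly inside the padded saccade intervals than outside them is flagged for removal, whereas a component with similar variance in both kinds of intervals is kept.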
In a next step, intervals with residual nonocular artifacts (e.g., EMG bursts) were detected by shifting a 2000-msec moving window in steps of 100 msec across the continuous recording. Whenever the voltages within the window exceeded a peak-to-peak threshold of 100 μV in at least one of the channels, all data within the window were marked as "bad" and subsequently excluded from analysis. Within the linear deconvolution framework (see below), this can easily be done by setting all predictors to zero during these bad EEG intervals (Smith & Kutas, 2015b), meaning that the data in these intervals will not affect the computation.
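As an illustration of the moving-window criterion, the following sketch marks every sample that falls inside at least one window whose peak-to-peak amplitude exceeds the threshold. It is not the code used in the study; the function name and the list-of-lists data layout are assumptions, and the window and step sizes are simply rounded to whole samples.

```python
def mark_bad_samples(data, srate=512, window_ms=2000, step_ms=100,
                     threshold_uv=100.0):
    """Flag samples inside any window with a peak-to-peak amplitude > threshold.

    data: list of channels, each a list of voltages in microvolts
          (hypothetical layout).
    """
    n_samples = len(data[0])
    win = round(window_ms * srate / 1000)
    step = max(1, round(step_ms * srate / 1000))
    bad = [False] * n_samples
    for start in range(0, max(1, n_samples - win + 1), step):
        stop = min(start + win, n_samples)
        # One offending channel is enough to reject the whole window
        if any(max(ch[start:stop]) - min(ch[start:stop]) > threshold_uv
               for ch in data):
            for i in range(start, stop):
                bad[i] = True
    return bad
```

Because consecutive windows overlap heavily, a single brief artifact leads to the exclusion of up to roughly 2 sec of data on either side of it.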

Eye-movement Data
Dependent measures. Behavioral analyses focused on four eye-movement measures commonly reported in the semantic consistency literature: (a) the cumulative probability of having fixated the target object as a function of the ordinal fixation number, (b) the probability of immediate object fixation, (c) the latency to first fixation on the target object, and (d) the gaze duration on the target object (cf. Võ & Henderson, 2009).
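For concreteness, the four measures can be derived from a trial's fixation sequence roughly as sketched below. The record format and function names are hypothetical, and the operational definitions used here (latency measured to the onset of the first target fixation; gaze duration as the sum of consecutive fixations on the target from first entry until the eyes leave it) are common conventions assumed for the example rather than definitions quoted from the study.

```python
def trial_measures(fixations):
    """Compute target-related measures for one trial.

    fixations: list of (start_ms, end_ms, on_target) tuples in temporal
               order, with the initial central fixation already removed.
    Returns None if the target was never fixated.
    """
    for i, (start, _end, on_target) in enumerate(fixations):
        if on_target:
            gaze = 0
            for s, e, hit in fixations[i:]:
                if not hit:
                    break          # first-pass visit to the target has ended
                gaze += e - s
            return {
                "ordinal": i + 1,          # ordinal number of fixation t
                "immediate": i == 0,       # first saccade landed on target
                "latency_ms": start,       # time from scene onset
                "gaze_duration_ms": gaze,
            }
    return None

def cumulative_probability(ordinals, max_n):
    """Proportion of trials with the target fixated by fixation 1..max_n."""
    n_trials = len(ordinals)
    return [sum(1 for o in ordinals if o is not None and o <= n) / n_trials
            for n in range(1, max_n + 1)]
```

Entries of `None` in `ordinals` stand for trials on which the target was never fixated; they count toward the denominator but never toward the numerator.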
Linear mixed-effects modeling. Eye-movement data were analyzed using linear mixed-effects models (LMMs) and generalized LMMs (GLMM) as implemented in the lme4 package in R (Bates, Mächler, Bolker, & Walker, 2015). The only exception was the cumulative probability of first fixations on the target for which a generalized linear model (GLM) was used. One advantage of (G)LMM modeling is that it allows one to simultaneously model the intrinsic variability of both participants and scenes. In all analyses, the main predictor was the consistency of the critical object (contrast coding: consistent = −0.5, inconsistent = 0.5) in the study scene. In the (G)LMMs, Participant (24) and Scene (192) were included as random intercepts.¹ The cumulative probability of object fixation was analyzed using a GLM with a binomial (probit) link. This model included the Ordinal Number of Fixation on the scene as a predictor; it was entered as a continuous variable ranging from 1 to a maximum of 28 (the 99th quantile).
In the tables of results, we report the beta coefficients, t values (LMM), z values (GLMM), and p values for each model. For LMMs, the level of significance was calculated from an F test based on the Satterthwaite approximation to the effective degrees of freedom (Satterthwaite, 1946), whereas p values in GLMMs are based on asymptotic Wald tests.

Electrophysiological Data
Linear deconvolution modeling (first level of analysis). EEG measurements during active vision are associated with two major methodological problems: overlapping potentials and low-level signal variability. Overlapping potentials arise from the rapid pace of active information sampling through eye movements, which causes the neural responses that are evoked by subsequent fixations on the stimulus to overlap with each other. Because the average fixation duration usually varies between conditions, this changing overlap can easily confound the measured waveforms. A related issue is the mutual overlap between the ERP elicited by the initial presentation of the stimulus and the FRPs evoked by the subsequent fixations on it. This second type of overlap is especially important in experiments like ours, in which the critical fixations occurred at different latencies after scene onset in the two experimental conditions.
The problem of signal variability refers to the fact that low-level visual and oculomotor variables can also influence the morphology of the predominantly visually evoked fixation-related neural responses (e.g., Kristensen et al., 2017; Nikolaev et al., 2016; Dimigen et al., 2011). The most relevant of these variables, which is known to modulate the entire FRP waveform, is the amplitude of the saccade that precedes fixation onset (e.g., Dandekar et al., 2012; Thickbroom, Knezevič, Carroll, & Mastaglia, 1991). One option for controlling the effect of saccade amplitude is to include it as a continuous covariate in a massive univariate regression model (Smith & Kutas, 2015a, 2015b), in which a separate regression model is computed for each EEG time point and channel (Weiss, Knakker, & Vidnyánszky, 2016; Hauk, Davis, Ford, Pulvermüller, & Marslen-Wilson, 2006). However, this method does not account for overlapping potentials.
An approach that allows one to simultaneously control for overlapping potentials and low-level covariates is deconvolution within the linear model (for tutorial reviews, see Smith & Kutas, 2015a, 2015b), sometimes also called "continuous-time regression" (Smith & Kutas, 2015b). Initially developed to separate overlapping BOLD responses (e.g., Serences, 2004), linear deconvolution has also been applied to separate overlapping potentials in ERP (Smith & Kutas, 2015b) and FRP (Cornelissen et al., 2019; Kristensen et al., 2017; Dandekar et al., 2012) paradigms. Another elegant property of this approach is that the ERPs elicited by scene onset and the FRPs elicited by fixations on the scene can be disentangled and simultaneously estimated in the same regression model. The benefits of deconvolution are illustrated in more detail in Supplementary Figures S2 and S3.
Here, we applied this technique by using the new unfold toolbox, which represents the first-level analysis and provides us with the partial effects (i.e., the beta coefficients or "regression ERPs"; Smith & Kutas, 2015a, 2015b) for each predictor of interest. In a first step, both stimulus onset events and fixation onset events were included as stick functions (also called "finite impulse responses") in the design matrix of the regression model. To account for overlapping activity from adjacent experimental events, the design matrix was then time-expanded in a time window between −300 and +800 msec around each stimulus and fixation onset event. Time expansion means that the time points within this window are added as predictors to the regression model. Because the temporal distance between subsequent events in the experiment is variable, it is possible to disentangle their overlapping responses. Time expansion with stick functions is explained in Serences (2004) and Ehinger and Dimigen (2019; see their Figure 2). The model was run on EEG data sampled at the original 512 Hz; that is, no down-sampling was performed.
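The logic of time expansion can be illustrated with a minimal sketch. It is not the unfold implementation: the function names, the toy signal, and the three-sample response are hypothetical, and the betas are obtained from the normal equations by Gauss-Jordan elimination rather than with the LSMR solver used in the actual analysis. The sketch only shows that, when event onsets are jittered, a least-squares fit on the time-expanded design matrix recovers the underlying response from an overlapped signal.

```python
def time_expand(event_samples, n_samples, lag_min, lag_max):
    """Build a time-expanded design matrix for one event type.

    Column j codes the lag (lag_min + j): the entry at [row][j] is 1 when an
    event occurred exactly (lag_min + j) samples before that row's sample.
    """
    n_lags = lag_max - lag_min + 1
    X = [[0.0] * n_lags for _ in range(n_samples)]
    for onset in event_samples:
        for j in range(n_lags):
            row = onset + lag_min + j
            if 0 <= row < n_samples:
                X[row][j] += 1.0
    return X


def solve_least_squares(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination.

    Stand-in for an iterative solver such as LSMR; fine for tiny toy problems.
    """
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    rhs = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    aug = [A[i] + [rhs[i]] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(p):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][p] / aug[i][i] for i in range(p)]


# Toy data (hypothetical): a 3-sample response evoked at samples 2, 3, and 7.
true_response = [1.0, 2.0, 3.0]
fixation_onsets = [2, 3, 7]  # the responses to onsets 2 and 3 overlap
X = time_expand(fixation_onsets, n_samples=12, lag_min=0, lag_max=2)

# Simulated continuous EEG: the sum of all (overlapping) evoked responses.
eeg = [sum(x * b for x, b in zip(row, true_response)) for row in X]

# Deconvolution: the betas recover the non-overlapped response.
estimated = solve_least_squares(X, eeg)
```

In this toy case, samples 3 and 4 of the simulated signal contain summed activity from two fixations, yet the estimated betas match the true response. In the real analysis, the window would also include pre-event lags (a negative `lag_min`), corresponding to the −300 msec window edge.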
Using Wilkinson notation, the model formula for scene onset events was defined as
ERP ∼ 1 + Consistency
In this formula, the beta coefficients for the intercept (1) capture the shape of the overall waveform of the stimulus ERP in the consistent condition, which was used as the reference level, whereas those for Consistency capture the differential effect of presenting an inconsistent object in the scene (relative to a consistent object) on the ERP. The coefficients for the predictor Consistency are therefore analogous to a difference waveform in a traditional ERP analysis (Smith & Kutas, 2015a, 2015b) and would reveal if semantic processing already occurs immediately after the initial presentation of the scene. In the same regression model, we also included the onsets of all fixations made on the scene. Fixation onsets were modeled with the formula
FRP ∼ 1 + Consistency * Type + Sacc Amplitude
Thus, we predicted the FRP for each time point as a function of the semantic Consistency of the target object (consistent vs. inconsistent; consistent as the reference level) in interaction with the Type of fixation (critical fixation vs. nontarget fixation; nontarget fixation as the reference level). In this model, any FRP consistency effects elicited by the pretarget or target fixation would appear as an interaction between Consistency and Fixation Type. In addition, we included the incoming Saccade Amplitude (in degrees of visual angle) as a continuous linear covariate to control for the effect of saccade size on the FRP waveform.² Thus, the full model was as follows:
{ERP ∼ 1 + Consistency; FRP ∼ 1 + Consistency * Type + Sacc Amplitude}
This regression model was then solved for the betas using the LSMR algorithm in MATLAB (without regularization).
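To make the fixation formula concrete, the following sketch shows how one fixation could be translated into a row of predictors under treatment (dummy) coding, with consistent objects and nontarget fixations as reference levels as stated above. The function name and the saccade amplitudes are hypothetical illustration values, and the sketch omits the time expansion that is applied to each of these columns in the full model.

```python
def frp_design_row(inconsistent, critical, sacc_amplitude_deg):
    """Predictors for one fixation under treatment coding.

    Returns [intercept, Consistency, Type, Consistency:Type, SaccAmplitude].
    """
    cons = 1.0 if inconsistent else 0.0  # reference level: consistent
    typ = 1.0 if critical else 0.0       # reference level: nontarget fixation
    return [1.0, cons, typ, cons * typ, float(sacc_amplitude_deg)]


# Four hypothetical fixations covering the 2 x 2 design.
rows = [
    frp_design_row(inconsistent=False, critical=False, sacc_amplitude_deg=4.2),
    frp_design_row(inconsistent=True, critical=False, sacc_amplitude_deg=6.0),
    frp_design_row(inconsistent=False, critical=True, sacc_amplitude_deg=3.1),
    frp_design_row(inconsistent=True, critical=True, sacc_amplitude_deg=5.5),
]
```

Only the last row has a nonzero interaction term. The betas of that column therefore carry the consistency effect that is specific to the critical fixation, whereas the Consistency column alone captures any global difference between scenes with consistent and inconsistent objects at nontarget fixations.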
The deconvolution model specified by the formula above was run twice: In one version, we treated the pretarget fixation (t − 1) as the critical fixation; in the other version, we treated the target fixation (t) as the critical fixation. In a given model, all fixations but the critical ones were defined as nontarget fixations. FRPs for fixation t − 1 and for fixation t were estimated in two separate runs of the model, rather than simultaneously within the same model, because the estimation of overlapping activity was much more stable in this case. In other words, although the deconvolution method allowed us to control for much of the overlapping brain activity from other fixations, we were not able to use the model to directly separate the (two) N400 consistency effects elicited by the fixations t − 1 and t.³ Both runs of the model (the one for t − 1 and the one for t) also yield an estimate for the scene-onset ERP, but because the results for the scene-onset ERP were virtually identical, we present the betas from the first run of the model.
Baseline placement for FRPs. Another challenging issue for free-viewing EEG experiments is the choice of an appropriate neutral baseline interval for the FRP waveforms (Nikolaev et al., 2016). Baseline placement is particularly relevant for experiments on extrafoveal processing where we do not know in advance when EEG differences will arise and whether they may already develop before fixation onset.
For the pretarget fixation t − 1 and nontarget fixations nt, we used a standard baseline interval by subtracting the mean channel voltages between −200 and 0 msec before the event (note that the saccadic spike potential ramping up at the end of this interval was almost completely removed by our ICA procedure; see Supplementary Figure S1). For fixation t, we cannot use such a baseline because semantic processing may already be ongoing by the time the target object is fixated. Thus, to apply a neutral baseline to fixation t, we subtracted the mean channel voltages in the 200-msec interval before the preceding fixation t − 1 also from the FRP aligned to the target fixations t (see Nikolaev et al., 2016, for similar procedures). The scene-onset ERP was corrected with a standard prestimulus baseline (−200 to 0 msec).
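The two baseline schemes can be sketched as follows for a single channel. The waveforms, the coarse 100-msec sampling grid, and the function names are hypothetical and serve only to show which interval supplies the baseline value for which waveform.

```python
def mean_in_window(times_ms, values, start_ms, end_ms):
    """Mean of the values whose time stamps fall in [start_ms, end_ms)."""
    picked = [v for t, v in zip(times_ms, values) if start_ms <= t < end_ms]
    return sum(picked) / len(picked)


def subtract_baseline(values, baseline):
    """Shift a waveform so that the baseline value becomes zero."""
    return [v - baseline for v in values]


# Hypothetical single-channel FRPs (microvolts), each aligned to its own event.
times = [-200, -100, 0, 100, 200]
frp_pretarget = [1.0, 3.0, 2.0, 4.0, 6.0]  # aligned to fixation t - 1
frp_target = [5.0, 5.0, 4.0, 2.0, 1.0]     # aligned to fixation t

# Standard baseline for t - 1: mean voltage from -200 to 0 msec before t - 1.
baseline_pre = mean_in_window(times, frp_pretarget, -200, 0)
corrected_pre = subtract_baseline(frp_pretarget, baseline_pre)

# Neutral baseline for t: reuse the interval before t - 1 instead of the
# interval directly before t, which may already contain a consistency effect.
corrected_target = subtract_baseline(frp_target, baseline_pre)
```

Because the same pre-(t − 1) value is subtracted from both waveforms, any negativity that builds up between fixation t − 1 and fixation t remains visible in the target-locked FRP instead of being subtracted out.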

Group statistics for EEG (second level of analysis).
To perform second-level group statistics, averaged EEG waveforms at the single-participant level ("regression ERPs") were reconstructed from the beta coefficients of the linear deconvolution model. These regression-based ERPs are directly analogous to participant-level averages in a traditional ERP analysis (Smith & Kutas, 2015a). We then used two complementary statistical approaches to examine consistency effects in the EEG at the group level: linear mixed models and a cluster-based permutation test.
LMM in a priori defined time windows. LMMs were used to provide hypothesis-based testing motivated by existing literature. Specifically, we adopted the spatio-temporal definitions by Võ and Wolfe (2013) and compared the consistent and inconsistent conditions in the time windows from 250 to 350 msec (early effect) and 350 to 600 msec (late effect) at a midcentral ROI of nine electrodes (comprising FC1, FCz, FC2, C1, Cz, C2, CP1, CPz, and CP2). Because the outputs provided by the linear deconvolution model (the first-level analysis) are already aggregated at the level of participant averages, the only predictor included in these LMMs was the Consistency of the object. Furthermore, to minimize the risk of Type I error (Barr, Levy, Scheepers, & Tily, 2013), we started with a random effects structure with Participant as random intercept and slope for the Consistency predictor. This random effects structure was then evaluated and backwards-reduced using the step function of the lmerTest package (Kuznetsova, Brockhoff, & Christensen, 2017) to retain the model that was justified by the data; that is, it converged, and it was parsimonious in the number of parameters (Matuschek, Kliegl, Vasishth, Baayen, & Bates, 2017).
Cluster permutation tests. It is still largely unknown to what extent the topography of traditional ERP effects translates to natural viewing. Therefore, to test for consistency effects across all channels and time points, we additionally applied the Threshold-Free Cluster Enhancement (TFCE) procedure developed by Smith and Nichols (2009) and adapted to EEG data by Mensen and Khatami (2013; http://github.com/Mensen/ept_TFCE-matlab). In a nutshell, TFCE is a nonparametric permutation test that controls for multiple comparisons across time and space, while maintaining relatively high sensitivity (e.g., compared with a Bonferroni correction). Its advantage over previous cluster permutation tests (e.g., Maris & Oostenveld, 2007) is that it does not require the experimenter to set an arbitrary cluster-forming threshold. In the first stage of the TFCE procedure, a raw statistical measure (here, t values) is weighted according to the support provided by clusters of similar values at surrounding electrodes and time points. In the second stage, these cluster-enhanced t values are then compared with the maximum cluster-enhanced values observed under the null hypotheses (based on n = 2000 random permutations of the data). In the present article (Figures 4 and 5), we not only report the global result of the test but also plot the spatio-temporal extent of the first-stage clusters, because they provide some indication about which time points and electrodes likely contributed to the overall significant effect established by the test. Please note, however, that unlike the global test result, these first-stage values are not stringently controlled for false positives and do not establish precise effect onsets or offsets (Sassenhagen & Draschkow, 2019). We report them here as a descriptive statistic.
Finally, for purely descriptive purposes and to provide a priori information for future studies, we also plot the 95% between-participant confidence interval for the consistency effects at the central ROI (corresponding to sample-by-sample paired t testing without correction for multiple comparisons; see also Mudrik et al., 2014).
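The two stages of the TFCE procedure described above can be sketched in simplified form. This is not the ept_TFCE implementation: it enhances clusters along the time dimension only (no electrode neighborhood), uses sign-flipping of participant difference waves as the permutation scheme, and assumes the parameters E = 0.5 and H = 2 with a step size of 0.1, none of which are specified in the text. The simulated data and all function names are hypothetical.

```python
import math
import random


def tfce_1d(stat, dh=0.1, E=0.5, H=2.0):
    """Stage 1: cluster-enhance a 1-D series of statistics (time only).

    For each threshold h, every point in a supra-threshold cluster gains
    extent**E * h**H * dh. Positive and negative values are handled separately.
    """
    out = [0.0] * len(stat)
    for sign in (1.0, -1.0):
        vals = [sign * s for s in stat]
        n_steps = int(max(vals + [0.0]) / dh)
        for step in range(1, n_steps + 1):
            h = step * dh
            start = None
            for i in range(len(vals) + 1):
                above = i < len(vals) and vals[i] >= h
                if above and start is None:
                    start = i
                elif not above and start is not None:
                    extent = i - start
                    for k in range(start, i):
                        out[k] += sign * (extent ** E) * (h ** H) * dh
                    start = None
    return out


def t_values(diffs):
    """One-sample t value per time point; diffs[p][t] = participant p, time t."""
    n = len(diffs)
    out = []
    for t in range(len(diffs[0])):
        col = [d[t] for d in diffs]
        mean = sum(col) / n
        var = sum((x - mean) ** 2 for x in col) / (n - 1)
        out.append(mean / math.sqrt(var / n))
    return out


def tfce_permutation_p(diffs, n_perm=200, seed=0):
    """Stage 2: compare the observed maximum with permutation maxima."""
    rng = random.Random(seed)
    observed = max(abs(v) for v in tfce_1d(t_values(diffs)))
    hits = 0
    for _ in range(n_perm):
        flipped = []
        for d in diffs:
            s = rng.choice((1.0, -1.0))  # randomly swap condition labels
            flipped.append([s * x for x in d])
        if max(abs(v) for v in tfce_1d(t_values(flipped))) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Simulated difference waves (inconsistent minus consistent): 10 participants,
# 20 time points, with a negative effect at time points 8-12.
rng = random.Random(1)
diffs = [[rng.gauss(0.0, 1.0) + (-3.0 if 8 <= t <= 12 else 0.0)
          for t in range(20)] for _ in range(10)]
p_global = tfce_permutation_p(diffs)
```

The returned value corresponds to the global test result for the most extreme enhanced statistic. As noted above, the extent of the first-stage clusters is descriptive and does not establish precise effect onsets or offsets.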

Task Performance (Change Detection Task)
After the recognition phase, participants pressed a button to indicate whether or not a change had taken place within the scene. Response accuracy in this task was high (M = 85.0 ± 5.16%) and did not differ as a function of whether the study scene contained a consistent (84.6 ± 5.28%) or an inconsistent (85.3 ± 5.12%) target object.

Eye-movement Behavior
Figure 3A shows the cumulative probability of having fixated the target object as a function of the ordinal number of fixation and semantic consistency, and Table 2 reports the corresponding GLM coefficients. We found a significant main effect of Consistency; overall, inconsistent objects were looked at with a higher probability than consistent objects. As expected, the cumulative probability of looking at the critical object increased as a function of the Ordinal Number of Fixation. There was also a significant interaction between the two variables.
Complementing this global analysis, we analyzed the very first eye movement during scene exploration to assess whether observers had immediate extrafoveal access to object-scene semantics (Loftus & Mackworth, 1978). The mean probability of immediate object fixation was 12.93%; we observed a numeric advantage of inconsistent objects over consistent objects (Figure 3B), but this difference was not significant (Table 3). The latency to first fixation on the target object is another measure to capture the potency of an object in attracting early attention in extrafoveal vision (e.g., Võ & Henderson, 2009; Underwood & Foulsham, 2006). This measure is defined as the time elapsed between the onset of the scene image and the first fixation on the critical object. Importantly, this latency was significantly shorter for inconsistent as compared with consistent objects (Figure 3C, Table 3).
Moreover, we analyzed gaze duration as a measure of foveal object processing time (e.g., Henderson et al., 1999). First-pass gaze duration for a critical object is defined as the sum of all fixation durations from first entry to first exit. On average, participants looked longer at inconsistent (519 msec) than consistent (409 msec) objects before leaving the target object for the first time, and this difference was significant (Table 3). Table 1 summarizes additional oculomotor characteristics in the two conditions of object consistency. Supplementary Figures S4 and S5 visualize the locations of the pretarget, target, and posttarget fixations for two example scene stimuli.
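The two definitions given above (latency to first fixation and first-pass gaze duration) can be expressed as a short sketch. The trial data, the tuple layout, and the function names are hypothetical; each fixation is represented as its onset relative to scene onset, its duration, and whether it landed on the target object.

```python
def latency_to_first_fixation(fixations):
    """Onset (msec after scene onset) of the first fixation on the target."""
    for onset_ms, duration_ms, on_target in fixations:
        if on_target:
            return onset_ms
    return None  # target was never fixated in this trial


def first_pass_gaze_duration(fixations):
    """Sum of all fixation durations from first entry to first exit."""
    total, entered = 0, False
    for onset_ms, duration_ms, on_target in fixations:
        if on_target:
            total += duration_ms
            entered = True
        elif entered:
            break  # first exit: later refixations are not counted
    return total if entered else None


# Hypothetical trial: (onset_ms, duration_ms, on_target)
trial = [
    (0, 250, False),
    (290, 210, False),
    (540, 180, True),    # first entry into the target object
    (750, 230, True),    # immediate refixation, still first pass
    (1020, 200, False),  # first exit
    (1260, 300, True),   # second pass, excluded from gaze duration
]

latency = latency_to_first_fixation(trial)
gaze_duration = first_pass_gaze_duration(trial)
```

In this example the latency is the onset of the third fixation, and the gaze duration sums only the two consecutive fixations made before the eyes left the object for the first time.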

Electrophysiological Results
Figures 4 and 5 depict the ERP evoked by the presentation of the scene as well as the FRPs for the three types of fixation that were analyzed. Results focus on the midcentral ROI for which effects of object-scene consistency have been reported. Waveforms for other scalp sites are depicted in Supplementary Figures S6-S9.

Scene-onset ERP
The left panels of Figure 4 show the grand-averaged ERP aligned to scene onset. Although inspection of the scalp maps indicated slightly more positive amplitudes over central right-hemispheric electrodes in the inconsistent condition, these differences were not statistically significant. Specifically, no effect of Consistency was found with the LMM analysis in the early or late time window (see Table 4 for detailed LMM results). Similarly, the TFCE test across all channels and time points yielded no significant Consistency effect (all ps > .2; see Figure 4C). Thus, we found no evidence that the semantic consistency of the target object influences the neural response to the initial presentation of the scene.

Nontarget Fixations, nt
Next, we tested whether fixations on scenes with an inconsistent object evoke a globally different neural response than those on scenes containing a consistent object. As the right panels of Figure 4 show, this was not the case: Consistency had no effect on the FRP for nontarget (nt) fixations, neither in the LMM analysis (see Table 4) nor in the TFCE statistic (all ps > .2; see Figure 4G).

Pretarget Fixation, t − 1
Figure 5 depicts the FRPs aligned to the pretarget and target fixations. Importantly, in the FRP aligned to the pretarget fixation t − 1, waveforms began to clearly diverge between the two consistency conditions, developing into a long-lasting frontocentral negativity in the inconsistent as compared with the consistent condition (Figure 5A and B; see also Supplementary Figure S8). The scalp distribution of this difference, shown in Figure 6, closely resembled the frontocentral N400 (and N300) previously reported in ERP studies on object-scene consistency (e.g., Mudrik et al., 2014; Võ & Wolfe, 2013). In the LMM analyses conducted on the midcentral ROI, this effect was marginally significant (p < .1) for the early time window (250-350 msec) but became highly significant between 350 and 600 msec (p < .001; Table 4). The TFCE test across all channels and time points also revealed a significant effect of consistency on the pretarget FRP (p < .05). Figure 5C also shows the extents of the underlying spatio-temporal clusters, computed in the first stage of the TFCE procedure. Between 372 and 721 msec after fixation onset, we observed a cluster of 14 frontocentral electrodes that was shifted slightly to the left hemisphere. This N400 modulation on the pretarget fixation could be seen even in traditionally averaged FRP waveforms without any control of overlapping potentials (see Supplementary Figure S3). In summary, we were able to measure a significant frontocentral N400 modulation during natural scene viewing that already emerged in FRPs aligned to the pretarget fixation.
The target fixation t occurred at a median latency of 240 msec (± 18 msec) after fixation t − 1, as marked by the vertical dashed line in Figure 5B. If we take the extent of the cluster from the TFCE test as a rough approximation for the likely onset of the effect in the FRP, this means that, on average, at the time when the electrophysiological consistency effect started (372 msec), the eyes had been looking at the target object for only 132 msec (372 minus 240 msec).

Target Fixation, t
An anterior N400 effect was also clearly visible in the FRP aligned to fixation t. In the LMM analysis at the central ROI, the effect was significant in both the early (250-350 msec, p < .01) and late (350-600 msec, p < .05) windows (see Table 4). However, compared with the effect aligned to the pretarget fixation, this N400 was significant at only a few electrodes in the TFCE statistic (Cz, FCz, and FC1; see Figure 6). Aligned to the target fixation t, the N400 also peaked extremely early, with the maximum of the difference curve already observed at 200 msec after fixation onset (Figure 5F). Qualitatively, a frontocentral negativity was already visible much earlier than that, within the first 100 msec after fixation onset (Figure 5H). The TFCE permutation test confirmed an overall effect of consistency (p < .05) on the target-locked FRP. Figure 5G also shows the extents of the underlying first-stage clusters. For the target fixation, clusters only extended across a brief interval between 151 and 263 msec after fixation onset, an interval during which the N400 effect also reached its peak.
Figure 5F shows that, numerically, voltages at the central ROI were more negative in the inconsistent condition during the baseline interval already, that is, before the critical object was fixated. To understand the role of activity already present before fixation onset, we repeated the FRP analyses for fixation t after applying a standard baseline correction, with the baseline placed immediately before the target fixation itself (−200 to 0 msec). This way, we eliminate any weak N400-like effects that may have already been ongoing before target fixation onset. Interestingly, in the resulting FRP waveforms, the target-locked N400 effects were weakened: The N400 effect now failed to reach significance in the TFCE statistic and in the LMM analysis for the second window (350-600 msec; see the last row of Table 4) and only remained significant for the early window (250-350 msec). This indicates that some N400-like negativity was already ongoing before target fixation onset.
To summarize, we found no immediate influences of object-scene consistency in ERPs time-locked to scene onset. However, N400 consistency effects were found in FRPs aligned to the target fixation (t) and in those aligned to the pretarget fixation (t − 1).

DISCUSSION
Substantial research in vision science has been devoted to understanding the behavioral and neural mechanisms underlying object recognition (e.g., Loftus & Mackworth, 1978; Biederman, 1972). At the core of this debate are the type of object features that are accessed (e.g., low-level vs. high-level), the time course of their processing (e.g., preattentive vs. attentive), and the region of the visual field in which these features can be acquired (e.g., foveal vs. extrafoveal). A particularly controversial topic is whether and how quickly semantic properties of objects are available outside foveal vision.
In the current study, we approached these questions from a new perspective by coregistering eye movements and EEG while participants freely inspected images of real-world scenes in which a critical object was either consistent or inconsistent with the scene context. As a novel finding, we demonstrate a fixation-related N400 effect during natural scene viewing. Moreover, behavioral and electrophysiological measures converge to suggest that the extraction of object-scene semantics can already begin in extrafoveal vision, before the critical object is fixated.
It is a rather undisputed finding that inconsistent objects, such as a flashlight in a bathroom, require increased processing when selected as targets of overt attention. Accordingly, several eye-movement studies have reported longer gaze durations on inconsistent than consistent objects, probably reflecting the greater effort required to resolve the conflict between object meaning and scene context (e.g., Cornelissen & Võ, 2017;Henderson et al., 1999;De Graef et al., 1990). In addition, a number of traditional ERP studies using steady-fixation paradigms have found that inconsistent objects elicit a larger negative brain response at frontocentral channels (an N300/N400 complex) as compared with consistent objects (e.g., Coco et al., 2017;Mudrik et al., 2010;Ganis & Kutas, 2003).
However, previous research with eye movements remained inconclusive on whether semantic processing can take place before foveal inspection of the object. Evidence in favor of extrafoveal processing of object-scene semantics comes from studies in which inconsistent objects were selected for fixation earlier than consistent ones (e.g., Borges et al., 2019; LaPointe & Milliken, 2016; Underwood et al., 2008). However, other studies have not found evidence for earlier selection of inconsistent objects (e.g., Võ & Henderson, 2009; Henderson et al., 1999; De Graef et al., 1990). Parafoveal and peripheral vision are known to be crucial for saccadic programming (e.g., Nuthmann, 2014). Therefore, any demonstration that semantic information can act as a source of guidance for fixation selection in scenes implies that some semantic processing must have occurred prior to foveal fixation, that is, in extrafoveal vision.
ERPs are highly sensitive to semantic processing (Kutas & Federmeier, 2011) and provide an excellent temporal resolution to investigate the time course of object processing. However, an obvious limitation of existing ERP studies is that observers were not allowed to explore the scene with saccadic eye movements, thereby constraining their normal attentional dynamics. Instead, the critical object was usually large and/or placed near the point of fixation. Hence, these studies were unable to establish whether semantic processing can take place before foveal inspection of the critical object.
Figure 6. Scalp distribution of frontocentral N400 effects in the time windows significant in the TFCE statistic (see also Figure 5). White asterisks highlight the spatial extent of the clusters observed in the first stage of the TFCE permutation test for both intervals. In the FRP aligned to the pretarget fixation (left), clusters extended from 372 to 721 msec and across 14 frontocentral channels. In the FRP aligned to the target fixation (right), clusters extended from 151 to 263 msec at three frontocentral channels. Consist. = consistent; Inconsist. = inconsistent.
In the current study, we addressed this problem by simultaneously recording behavioral and brain-electric correlates of object processing. Specifically, we analyzed different eye-movement responses that tap into extrafoveal and foveal processing along with FRPs time-locked to the first fixation on the critical object (t) and the fixation preceding it (t − 1). We also analyzed the scene-onset ERP evoked by the trial-initial presentation of the image. Recent advances in linear deconvolution methods for EEG allowed us to disentangle the overlapping brain potentials produced by the scene onset and the subsequent fixations and to control for the modulating influence of saccade amplitude on the FRP.
The eye-movement behavior showed no evidence for Hypothesis A, as outlined in the Introduction, according to which semantic information can exert an immediate effect on eye-movement control (Loftus & Mackworth, 1978). Specifically, the mean probability of immediate object fixation was fairly low (12.9%) and not modulated by Consistency. Instead, the data lend support to Hypothesis B, according to which extrafoveal processing of object-scene semantics is possible but takes some time to unfold. In particular, the results for the latency to first fixation of the critical object show that inconsistent objects were, on average, looked at sooner than consistent objects (cf. Bonitz & Gordon, 2008; Underwood et al., 2008). At the same time, we observed longer gaze durations on inconsistent objects, replicating previous findings (e.g., Võ & Henderson, 2009; Henderson et al., 1999; De Graef et al., 1990). Thus, we found not only behavioral evidence for the extrafoveal processing of object-scene (in)consistencies but also differences in the subsequent foveal processing.
The question then remains why existing eye-movement studies have provided very different results, ranging from rapid processing of semantic information in peripheral vision to a complete lack of evidence for extrafoveal semantic processing. Researchers have suggested that the outcome may depend on factors related to the critical object or the scene in which it is located. Variables that may (or may not) facilitate the appearance of the incongruency effect include visual saliency (e.g., Underwood & Foulsham, 2006;Henderson et al., 1999), image clutter (Henderson & Ferreira, 2004), and the critical object's size and eccentricity (Gareze & Findlay, 2007). Therefore, an important question for future research is to identify the specific conditions under which extrafoveal semantic information can be extracted or when the three outlined hypotheses and/or outcomes would prevail.
Returning to the present data, the FRP waveforms showed a negative shift over frontal and central scalp sites when participants fixated a scene-inconsistent object. This result is in agreement with traditional ERP studies that have shown a frontocentral N300/N400 complex after passive foveal stimulation (e.g., Coco et al., 2017;Mudrik et al., 2014;Võ & Wolfe, 2013;Ganis & Kutas, 2003) and extends this finding for the first time to a natural viewing situation with eye movements. Regarding the time course, the present data suggest that the effect was already initiated during the preceding fixation (t − 1) but then carried on through fixation (t) on the target object.
As a cautionary note, we emphasize that it is not trivial to unambiguously ascribe typical N400 (and N300) effects in the EEG to either extrafoveal or foveal processing. The reason is that these canonical congruency effects only begin 200-250 msec after stimulus onset (Draschkow et al., 2018;Mudrik et al., 2010). This means that even a purely extrafoveal effect would be almost impossible to measure during the pretarget fixation (t − 1) itself, because it would only emerge at a time when the eyes are already moving to the target object. That being said, three properties of the observed FRP consistency effect suggest that it was already initiated during the pretarget fixation.
First, because of the temporal jitter introduced by variable fixation durations, an effect that only arises in foveal vision should be the most robust in the FRP averages aligned to fixation t but latency-jittered and attenuated in those aligned to fixation t − 1. However, the opposite was the case: At least qualitatively, a frontocentral N400 effect was seen at more electrodes (Figure 6) and for longer time intervals (Figure 5) in the FRP aligned to the pretarget fixation as compared with the actual target fixation. The second argument for extrafoveal contributions to the effect is the forward shift in its time course. Relative to fixation t, the observed N400 occurred almost instantly: As the effect topographies in Figure 5H show, the frontocentral negativity for inconsistent objects was qualitatively visible within the first 100 msec after fixation onset, and the effect reached its peak after only 200 msec. Clusters underlying the TFCE test were also restricted to an early time range between 151 and 263 msec after fixation onset and therefore to a much earlier interval than what we would expect from the canonical N300 or N400 effect elicited by foveal stimulation.
Of course, it is possible that even purely foveal N400 effects may emerge earlier during active scene exploration with eye movements as compared with the latencies established in traditional ERP research. For example, it is reasonable to assume that, during natural vision, observers preprocess some low-level (nonsemantic) features of the soon-to-be fixated object in extrafoveal vision (cf. Nuthmann, 2017). This nonsemantic preview benefit might then speed up the timeline of foveal processing (including the latency of semantic access) once the object is fixated (cf. Dimigen, Kliegl, & Sommer, 2012, for reading). Moreover, if eye movements are permitted, observers have more time to build a representation of the scene before they foveate the target, and this increased contextual constraint may also affect the N400 timing (but see Kutas & Hillyard, 1984). Importantly, however, neither of these two accounts could explain why the N400 effect is stronger-rather than much weaker-in the waveforms aligned to fixation t − 1 as compared with fixation t. The fact that the eye movement data also provided clear evidence in favor of extrafoveal processing further strengthens our interpretation of the N400 timing.
Finally, we found that the N400 consistency effect aligned to the target fixation (t) became weaker (and nonsignificant in two of the three statistical measures considered) if the baseline interval for the FRP analysis was placed directly before this target fixation. Again, this indicates that at least a weak frontocentral negativity in the inconsistent condition was already present during the baseline period before the target was fixated. Together, these results are difficult to reconcile with a pure foveal processing account and are more consistent with the notion that semantic processing of the object was at least initiated in extrafoveal vision (and then continued after it was foveated).
Crucially, we did not find any effect of target consistency in the traditional ERP aligned to scene onset. In line with the behavioral results, this goes against the most extreme Hypothesis A postulating that object semantics can be extracted from peripheral vision already at the first glance of a scene (Loftus & Mackworth, 1978). Similarly, there was no effect of consistency on the FRPs evoked by the nontarget fixations on the scene (Figure 4); this was also the case in a control analysis that only included nontarget fixations that occurred earlier than t − 1 and at an extrafoveal distance between 3° and 7° from the target object (see Supplementary Figure S10). All these analyses suggest that the processing of the critical object's semantic information started during fixation t − 1. However, from any given fixation, there are many candidate locations that could potentially be chosen for the next saccade (cf. Tatler, Brockmole, & Carpenter, 2017). Thus, it is conceivable that observers may have partially acquired semantic information of the critical object outside foveal vision before fixation t − 1, but without selecting it as a saccade target. Such reasoning leaves open the possibility that observers may have already picked up some information about the target object's semantics during these occasions.
Taken together, our behavioral and electrophysiological findings are consistent with the claim formulated in Hypothesis B that objects can be recognized outside the fovea or even in the visual periphery, at least to some degree. Indirectly, our results also speak to the debate about the unit of saccade targeting and, by inference, attentional selection during scene viewing. Finding effects of object-scene semantics on eye guidance is evidence in favor of object- and meaning-based, rather than image-based, guidance of attention in scenes (e.g., Henderson, Hayes, Peacock, & Rehrig, 2019; Hwang, Wang, & Pomplun, 2011).
In summary, our findings converge to suggest that the visual system is capable of accessing semantic features of objects in extrafoveal vision to guide attention toward objects that do not fit to the scene's overall meaning. They also highlight the utility of investigating attentional and neural mechanisms in parallel to uncover the mechanisms underlying object recognition during the unconstrained exploration of naturalistic scenes.
1. We did not include random slopes for two reasons: For Participant, the inclusion of a random slope led to a small variance and a perfect correlation between intercept and slope. For the random effect Scene, only the change trials were fully counterbalanced in terms of location and consistency, meaning that the slope for Consistency could not be estimated for the no-change trials.
2. Other low-level variables, such as local image features in the currently foveated image region (e.g., luminance, spatial frequency), are also known to modulate the FRP waveform. In the model presented here, we did not include these other covariates because (1) their influence on the FRP waveform is small compared with that of saccade amplitude and (2) the properties of the target object (such as its visual saliency) did not differ between the two levels of object consistency (see Materials and Rating section). For reasons of simplicity, saccade amplitude was included as a linear predictor in the current model, although its influence on the FRP becomes nonlinear for large saccades (e.g., Dandekar et al., 2012). However, virtually identical results were obtained when we included it as a nonlinear (spline) predictor instead.
3. In theory, a more elegant model would include Type as a three-level predictor, with the levels of pretarget, target, and nontarget fixation. In principle, this would allow us to dissociate which parts of the N400 consistency effects are elicited by fixation t − 1 versus fixation t. The practical disadvantage of this approach is that the overlapping activities from both t − 1 and t would then be estimated on comparatively fewer observations (compared with the extremely stable estimate for the numerous nontarget fixations). This is critical because, compared with the limited amount of jitter in natural fixation durations, N400 effects are a long-lasting response, which makes the deconvolution more challenging.
Specifically, we found that, with the three-level model, model outputs became extremely noisy and did not yield significant consistency effects for any EEG time-locking point. By defining either fixation t − 1 or fixation t as the critical fixation in two separate runs of the model and by treating all other fixations as nontarget fixations, the estimation becomes very robust. This simpler model still removes most of the overlapping activity from other fixations. However, the consistency-specific activity evoked by fixation t − 1 (i.e., the N400 effect) will not be removed from the FRP aligned to the fixation t and vice versa.
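The two-type scheme described in Note 3 (one critical fixation type, all other fixations pooled as nontarget) can be illustrated with a minimal sketch of overlap correction by linear deconvolution. This is not the analysis pipeline used in the study: the function names (`time_expand`, `deconvolve`, `naive_average`, `simulate`), the simulated single-channel signal, the response shapes, and all parameter values are hypothetical and chosen only for illustration. The sketch assumes noise-free data and intercept-only predictors per fixation type, and it omits covariates such as saccade amplitude.

```python
import numpy as np

def time_expand(onsets, n_samples, n_lags):
    """Time-expanded (FIR) design matrix: column `lag` is 1 at every sample
    that lies `lag` samples after one of the event onsets."""
    X = np.zeros((n_samples, n_lags))
    lags = np.arange(n_lags)
    for onset in onsets:
        rows = onset + lags
        keep = rows < n_samples
        X[rows[keep], lags[keep]] += 1.0
    return X

def deconvolve(eeg, onsets_by_type, n_lags):
    """Estimate one overlap-corrected response per event type by least squares."""
    names = list(onsets_by_type)
    X = np.hstack([time_expand(onsets_by_type[n], eeg.shape[0], n_lags)
                   for n in names])
    beta, *_ = np.linalg.lstsq(X, eeg, rcond=None)
    return {n: beta[i * n_lags:(i + 1) * n_lags] for i, n in enumerate(names)}

def naive_average(eeg, onsets, n_lags):
    """Classical averaging without overlap correction, for comparison."""
    return np.mean([eeg[o:o + n_lags] for o in onsets], axis=0)

def simulate(n_fix=300, n_lags=60, seed=0):
    """Hypothetical data: fixations shorter than the response, so responses overlap."""
    rng = np.random.default_rng(seed)
    durations = rng.integers(20, 45, size=n_fix)    # jittered fixation durations (samples)
    onsets = np.cumsum(durations)
    is_critical = (np.arange(n_fix) % 10) == 9      # every 10th fixation is "critical"
    t = np.arange(n_lags)
    base = np.sin(2 * np.pi * t / n_lags) * np.exp(-t / 20.0)
    kernels = {
        "nontarget": base,
        # Illustrative late negativity added for the critical fixation type
        "critical": base - 0.5 * np.exp(-((t - 40) / 10.0) ** 2),
    }
    onsets_by_type = {"nontarget": onsets[~is_critical],
                      "critical": onsets[is_critical]}
    eeg = np.zeros(int(onsets[-1]) + n_lags)
    for name, ons in onsets_by_type.items():
        for o in ons:
            eeg[o:o + n_lags] += kernels[name]      # overlapping responses sum linearly
    return eeg, onsets_by_type, kernels

eeg, onsets_by_type, kernels = simulate()
estimates = deconvolve(eeg, onsets_by_type, n_lags=60)
naive = naive_average(eeg, onsets_by_type["critical"], n_lags=60)
```

In this simulation, the jitter in fixation durations is what allows the least-squares fit to separate the overlapping responses, whereas the naive average for the critical fixation remains contaminated by activity from neighboring fixations. As the note explains, the estimate for a given fixation type becomes less stable when fewer events of that type are available, which is the motivation for pooling all noncritical fixations into a single nontarget type.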