Estimating feedforward and feedback effective connections from fMRI time series: Assessments of statistical methods

We test the adequacy of several proposed and two new statistical methods for recovering the causal structure of systems with feedback from synthetic BOLD time series. We compare an adaptation of the first correct method for recovering cyclic linear systems; Granger causal regression; a multivariate autoregressive model with a permutation test; the Group Iterative Multiple Model Estimation (GIMME) algorithm; the Ramsey et al. non-Gaussian methods; two non-Gaussian methods by Hyvärinen and Smith; a method due to Patel et al.; and the GlobalMIT algorithm. We introduce and also compare two new methods, Fast Adjacency Skewness (FASK) and Two-Step, both of which exploit non-Gaussian features of the BOLD signal. We give theoretical justifications for the latter two algorithms. Our test models include feedback structures with and without direct feedback (2-cycles), excitatory and inhibitory feedback, models using experimentally determined structural connectivities of macaques, and empirical human resting-state and task data. We find that, averaged over all of our simulations, including those with 2-cycles, several of these methods have better than 80% orientation precision (i.e., the probability that a directed edge is in the true structure given that a procedure estimates it to be so), and the two new methods also have better than 80% recall (the probability of recovering an orientation in the true structure).


Supplementary Material A The FASK Algorithm
The Fast Adjacency Skewness (FASK) procedure is given as Algorithms 1, 2, 3 and 4. The idea is as follows: First, we run Fast Adjacency Search stable (FAS-stable) on the data, producing an undirected graph. FAS-stable is the order-independent adjacency search of the PC-stable algorithm (Spirtes et al., 2000; Colombo & Maathuis, 2014). We use the linear, Gaussian BIC score as a conditional independence test with a specified penalty discount (see section Methods: Cyclic Search Procedures, in the main text). This yields an undirected graph G_0. The reason FAS-stable works for sparse cyclic models where the linear coefficients are all less than 1 is that correlations induced by long cyclic paths are statistically judged as zero, since they are products of multiple coefficients less than 1. One then orients each of the adjacencies in G_0 as a 2-cycle or as a directed edge left or right. Taking up each adjacency in turn, one tests whether the adjacency is a 2-cycle using the 2-cycle Detection Rule. If so, one adds edges X → Y and X ← Y to the output graph G_1. If not, one applies the Left-Right orientation rule and orients X → Y or X ← Y in the output graph G_1. G_1 is a fully oriented graph. For some models, where the true coefficients of a 2-cycle between X and Y are close in magnitude but opposite in sign, a correlation test may eliminate the edge between X and Y when in fact a 2-cycle exists. In this case, we check explicitly whether corr(X, Y | X > 0) and corr(X, Y | Y > 0) differ by more than a set amount, 0.3. If so, we add the adjacency to the graph and orient it using the aforementioned rules. The idea behind this heuristic extra-adjacency rule is that if X and Y are independent, then |corr(X, Y | X > 0) − corr(X, Y | Y > 0)| = 0, so if this difference is not zero, there must be a trek between X and Y. If the difference is very different from zero, we infer a very short path and so add an adjacency. The specific threshold 0.3 is a parameter that can be tuned.
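The heuristic extra-adjacency check can be sketched in a few lines. This is an illustrative sketch, not the reference FASK implementation; the helper names are hypothetical, and corr(X, Y | X > 0) is computed as the sample correlation restricted to rows where X > 0:

```python
import numpy as np

def cond_corr(x, y, mask):
    """Sample correlation of x and y restricted to the rows where mask holds."""
    return np.corrcoef(x[mask], y[mask])[0, 1]

def skewness_adjacency_check(x, y, threshold=0.3):
    """Heuristic extra-adjacency rule: declare X and Y adjacent if
    corr(X, Y | X > 0) and corr(X, Y | Y > 0) differ by more than
    threshold, even when the unconditional test removed the edge."""
    d = abs(cond_corr(x, y, x > 0) - cond_corr(x, y, y > 0))
    return bool(d > threshold)
```

For independent X and Y both conditional correlations are near zero, so the difference falls well below the 0.3 default; a near-cancelling 2-cycle, by contrast, can push the two conditional correlations apart even when corr(X, Y) is close to zero.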
A comment on notation: When we write E(X|X > 0) we mean E(X) under the condition X > 0, and similarly for corr(X, Y |X > 0). When we write corr(X, Y |Z, X > 0) we mean corr(X, Y |Z) under the condition X > 0.
Algorithm 3 2-Cycle Detection Rule
1: D a continuous dataset, α a threshold level, G_0 an undirected graph, X, Y a pair of adjacent variables.
2: procedure TwoCycle(D, X, Y, G_0, α) ▷ Returns true if X ⇄ Y (a 2-cycle).

The Left-Right Orientation Rule

Assuming X → Y or X ← Y, without a 2-cycle between X and Y, we give a procedure to orient the adjacency left or right. To do this, we define a relationship that must obtain if the probability distributions of the variables X and Y and of their residuals, e_X and e_Y, are smoothly positively skewed, defined thus:

Definition 1. We say a variable X with p.d.f. f_X(x), centered around c, is smoothly positively skewed if for every b > 0, |∫_{c−b}^{c} x f_X(x) dx| > |∫_{c}^{c+b} x f_X(x) dx|. X is smoothly negatively skewed if −X is smoothly positively skewed.
Variables that are smoothly positively skewed are positively skewed in Pearson's sense, E[(X − μ_X)^3]/(E[(X − μ_X)^2])^{3/2} > 0 (Pearson, 1894). There may be variables skewed positively in Pearson's sense that are not smoothly positively skewed: for instance, a case where f_X(x) has a small bump directly to the right of the center c but is zero to the left of c within some interval (c − b, c), while remaining positively skewed overall in Pearson's sense. In such cases, for small b it may be that |∫_{c−b}^{c} x f_X(x) dx| < |∫_{c}^{c+b} x f_X(x) dx|, whereas for sufficiently larger b this will not hold. These cases we set aside.
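Definition 1 can be checked numerically for a concrete density. The sketch below uses a centered exponential distribution (chosen here for illustration; it is not a distribution used in the paper) and compares the two partial moments of Definition 1 on a grid of b values:

```python
import numpy as np

def _trapz(y, x):
    """Trapezoidal rule (avoids depending on np.trapz/np.trapezoid naming)."""
    return float(np.sum((y[:-1] + y[1:]) * np.diff(x)) / 2.0)

def partial_moments(f, c, b, n=20001):
    """Return (|int_{c-b}^{c} x f(x) dx|, |int_{c}^{c+b} x f(x) dx|)."""
    xl = np.linspace(c - b, c, n)
    xr = np.linspace(c, c + b, n)
    return abs(_trapz(xl * f(xl), xl)), abs(_trapz(xr * f(xr), xr))

def f_centered_exp(x):
    """Centered exponential X = E - 1, E ~ Exp(1): p.d.f. exp(-(x+1)) for
    x >= -1. Mean zero and positively skewed in Pearson's sense."""
    return np.where(x >= -1.0, np.exp(-(x + 1.0)), 0.0)
```

For this density the left partial moment dominates the right one at every b, so the distribution is smoothly positively skewed around c = 0 in the sense of Definition 1.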
We assume the following for pairwise direct causal relationships:
1. Y is a linear function of X: Y = aX + e_Y.
2. X and e_Y are independent: X ⫫ e_Y.
3. X and e_Y are centered.
4. X is smoothly positively skewed.
5. e_Y is smoothly positively skewed.
6. The coefficient a is positive: a > 0.
Lemma 1. Under assumptions 1-6, E(Xe_Y | Y > 0) < 0.

Proof. See Figure A1 for X and e_Y centered around zero. The expectation E(Xe_Y | Y > 0) integrates over regions A, B, C and D, with x_0, e_{y0} > 0. The total expectation can be decomposed into the expectation for regions A and B taken together and the expectation for regions C and D taken together. The expectation for regions A and B taken together is given by:

E(Xe_Y | Y > 0)_{A,B} = (∫_{−x_0}^{0} x f_X(x) dx + ∫_{0}^{x_0} x f_X(x) dx)(∫_{|−ax|}^{∞} e_y f_{e_Y}(e_y) de_y)

The sum inside the first parenthesis we know to be negative because of assumption 4 and Definition 1. The term inside the second parenthesis is positive since e_y integrates over a positive range. Thus, the total product is negative. Similarly, for regions C and D taken together:

E(Xe_Y | Y > 0)_{C,D} = (∫_{−e_{y0}}^{0} e_y f_{e_Y}(e_y) de_y + ∫_{0}^{e_{y0}} e_y f_{e_Y}(e_y) de_y)(∫_{|−e_y/a|}^{∞} x f_X(x) dx)

By assumption 5 and Definition 1 the sum inside the first parenthesis is negative, and the term inside the second parenthesis is positive since x integrates over a positive range. Thus, the total product is negative. Therefore, E(Xe_Y | Y > 0) = E(Xe_Y | Y > 0)_{A,B} + E(Xe_Y | Y > 0)_{C,D} < 0.
Lemma 1 may be weakened in several ways. All that is necessary is that E(Xe_Y | Y > 0) integrates to a negative number over regions A, B, C, and D. So if X is smoothly positively skewed but e_Y is symmetric about zero, this integral will still be negative, and the same is true in reverse. Also, X and e_Y clearly do not need to be smoothly positively skewed over their entire domains; if one or the other is smoothly positively skewed under some restriction of its domain, the conclusion will follow, so long as it is constant over the rest of its domain. In addition, small aberrations in the smooth skewness condition may be tolerated, so long as they are overpowered sufficiently soon in the integration.
Note that additional paths between X and Y do not alter the conclusion of Lemma 1, so long as assumptions 2, 3, 4, 5 and 6 are upheld. (Confounder paths do not satisfy these conditions, since X is not independent of e_Y in those cases.) Here, we simply use the marginal relationship between X, Y, and e_Y depicted in Figure A1.
Theorem 1. Assume 1-6, and additionally that Y and e_X are smoothly positively skewed and that X and Y do not form a 2-cycle. Then X → Y if and only if

E(XY | X > 0)/√(E(X² | X > 0) E(Y² | X > 0)) > E(XY | Y > 0)/√(E(X² | Y > 0) E(Y² | Y > 0)).

Proof. If X → Y, we have by Lemma 2 that E(XY | X > 0)/E(X² | X > 0) = a and E(XY | Y > 0)/E(X² | Y > 0) < a; following the same strategy as in Lemma 2 but multiplying the terms by Y instead of X, we can derive the corresponding relationships normalized by Y. From this, the conclusion of the theorem follows.
Theorem 1 may be weakened as indicated for Lemma 1. A comment should be added as to how negative coefficients and negative skewness (violations of assumptions 4-6) can be handled. We may fairly assume in most cases that X and e_X are either both smoothly positively skewed or both smoothly negatively skewed, and that the same is true of Y and e_Y. Then if X and Y are both smoothly positively skewed and a > 0, Theorem 1 tells us that X → Y. If a < 0, E(Xe_Y) is negative for regions A and B and positive for the regions now symmetrically opposite to regions C and D, so whether the total integral for E(Xe_Y) is positive or negative will depend on the value of a. There is a threshold δ, empirically tuned for particular types of data, such that for a greater than δ this integral is negative and for a less than δ it is positive. A δ of about −0.2 is suitable for our fMRI simulations. If a < δ, then the orientation needs to be reversed. If X and Y are both smoothly negatively skewed, the same conclusion follows, since X and Y will both need to be replaced by their negations in Lemmas 1 and 2 and Theorem 1. For judging the δ threshold condition, one will need to replace X or Y or both by their negations when using the correlation of X and Y as an estimate of the coefficient a in Lemmas 1 and 2 and Theorem 1. The formulation of these ideas as an algorithm is given in Algorithm 4.
If the data are centered, the formulation in Theorem 1 is very close to the claim that X → Y if and only if corr(X, Y | X > 0) > corr(X, Y | Y > 0). The latter claim will also be true if X and e_Y are not centered around zero but are smoothly positively skewed. In practice, we find that the formulation in Theorem 1 works better.
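As a concrete sketch, the comparison in Theorem 1 can be computed directly from the conditional subsamples. This is an illustrative implementation under the stated assumptions (centered, positively skewed variables, positive coefficient, no 2-cycle); `uncentered_corr` and `left_right` are hypothetical helper names, not the reference FASK code:

```python
import numpy as np

def uncentered_corr(x, y):
    """Uncentered correlation E(XY) / sqrt(E(X^2) E(Y^2)) over the samples given."""
    return np.mean(x * y) / np.sqrt(np.mean(x ** 2) * np.mean(y ** 2))

def left_right(x, y):
    """Orient X -> Y when the uncentered correlation conditional on X > 0
    exceeds the one conditional on Y > 0 (the comparison in Theorem 1)."""
    lr = uncentered_corr(x[x > 0], y[x > 0]) - uncentered_corr(x[y > 0], y[y > 0])
    return bool(lr > 0)

# Demo on a skewed linear model X -> Y with centered exponential terms:
rng = np.random.default_rng(0)
n = 100_000
x = rng.exponential(1.0, n) - 1.0            # centered, positively skewed
y = 0.6 * x + (rng.exponential(1.0, n) - 1.0)
```

With such strongly skewed variables the statistic recovers the generating direction; as the distributions approach symmetry the difference shrinks toward zero and the rule becomes unreliable.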
The Left-Right rule is similar in many ways to the pairwise orientation rules proposed by Hyvärinen and Smith (2013), though differently motivated. It is closest to their Robust Skew rule, substituting h(x) = max(0, x) for g(x) = log(cosh(max(0, x))), a very similar function. There are some differences suggested by the algebra, and in simulation the Left-Right rule has a noticeable advantage for the data we are concerned with. Also, the assumptions of Theorem 1 may be weakened as indicated. For these reasons, we prefer it.

Left-Right Orientation Rule with Confounding
Although it is not exact, the above argument can be adjusted for the case where we have a linearly added confounding term:

Y = aX + bZ + e_Y and X = cZ + e_X.

Assume X ⫫ e_Y, Z ⫫ e_X; a + bc > 0; and Z, e_X and e_Y centered and smoothly positively skewed. In the equation for Y, multiply each term by X and take expectations to get

E(XY) = aE(X²) + bE(XZ) + E(Xe_Y) = aE(X²) + bcE(Z²) + bE(Ze_X) + E(Xe_Y).

Conditioning on X > 0, together with the X ⫫ e_Y assumption, we have

E(XY | X > 0) = aE(X² | X > 0) + bcE(Z² | X > 0) + bE(Ze_X | X > 0).

When conditioning on Y > 0 we get

E(XY | Y > 0) = aE(X² | Y > 0) + bcE(Z² | Y > 0) + bE(Ze_X | Y > 0) + E(Xe_Y | Y > 0).

The question is whether, in the presence of a confounder, E(XY | X > 0) > E(XY | Y > 0) still holds; this reduces to comparing the corresponding terms on the right-hand sides, in particular bcE(Z² | X > 0) with bcE(Z² | Y > 0). Algebraically, these expressions are not identical, but the conclusion of Lemma 1 still either holds directly (in the cases we investigate in this paper) or is very close. As an example where one expects strong confounding, let the errors all be distributed as Beta(1, 5) (which has small variance and positive skew), and let the coefficients a, b, c be drawn from (−0.9, −0.1) ∪ (0.1, 0.9). Then over 100 runs of the simulation for this linear model, we have mean(E(Z² | X > 0)) = 0.0300, variance(E(Z² | X > 0)) = 0.0002, mean(E(Z² | Y > 0)) = 0.0377, variance(E(Z² | Y > 0)) = 0.0003, with a mean difference of −0.0077. If the product of the coefficients bc is positive, as is generally the case in our simulations, the break is in the right direction; but even if the product bc is negative, bc(E(Z² | X > 0) − E(Z² | Y > 0)) is quite small, since b and c are small in absolute value.
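The reported means can be checked with a short simulation. The sketch below mirrors the described setup (centered Beta(1, 5) errors, coefficients drawn from (−0.9, −0.1) ∪ (0.1, 0.9)); exact values depend on the random seed and sample size, so only the orders of magnitude should be compared with the numbers above:

```python
import numpy as np

rng = np.random.default_rng(1)

def one_run(n=5000):
    """One draw of the confounded model X = cZ + e_X, Y = aX + bZ + e_Y."""
    z   = rng.beta(1, 5, n) - 1 / 6   # centered Beta(1, 5): small variance, positive skew
    e_x = rng.beta(1, 5, n) - 1 / 6
    e_y = rng.beta(1, 5, n) - 1 / 6
    # Coefficients from (-0.9, -0.1) U (0.1, 0.9): magnitude times a random sign.
    a, b, c = rng.uniform(0.1, 0.9, 3) * rng.choice([-1, 1], 3)
    x = c * z + e_x
    y = a * x + b * z + e_y
    return np.mean(z[x > 0] ** 2), np.mean(z[y > 0] ** 2)

vals = np.array([one_run() for _ in range(100)])
mean_zx, mean_zy = vals.mean(axis=0)   # E(Z^2 | X > 0) and E(Z^2 | Y > 0), averaged
```

Both conditional second moments stay small and close to each other, which is what makes the bc(E(Z² | X > 0) − E(Z² | Y > 0)) term negligible.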

2-Cycle Detection Rule
In order to apply the Left-Right rule described above, we need to rule out the possibility that X → Y and X ← Y form a 2-cycle. Fortunately, we can detect 2-cycles using ideas from the Left-Right rule, though extra conditioning on combinations of adjacent variables is needed to block the effect of other possible cyclic paths between X and Y. The 2-cycle detection rule without conditioning only discovers whether a cyclic path exists between X and Y; it cannot distinguish whether such a path is a 2-cycle or a higher degree cycle. The 2-cycle detection rule is based on the claim that if there are no additional cycles in the graph and X → Y and X ← Y, then corr(X, Y) ≠ corr(X, Y | X > 0) and corr(X, Y) ≠ corr(X, Y | Y > 0). Here, we use centered correlations as opposed to the uncentered correlations of Theorem 1 because the calculations are very close, and with centered correlations we can calculate z-statistics. As a result, the calculations are approximations, but very good ones.
The justification is based on the arguments for the Left-Right rule. Assume X, Y, e_X, and e_Y centered and smoothly positively skewed. If X → Y, with a linear model Y = aX + e_Y, a > 0, and X ⫫ e_Y, then E(XY)/E(X²) = a. Additionally, by Lemma 2 we know that E(XY | X > 0)/E(X² | X > 0) = a, and E(XY | Y > 0)/E(X² | Y > 0) < a. Considering these relationships, if X → Y then these two conditions will hold:

(1) E(XY | X > 0)/E(X² | X > 0) = E(XY)/E(X²)
(2) E(XY | Y > 0)/E(X² | Y > 0) < E(XY)/E(X²)

But for a 2-cycle, in which X → Y and X ← Y, with linear models Y = aX + e_Y and X = bY + e_X, these two conditions no longer hold. Condition (1) fails because X is no longer independent of e_Y, the assumption necessary for the equality. Condition (2) will still be an inequality, but possibly with its direction flipped depending on the relative strength of the coefficients. The important fact is that under a 2-cycle, conditions (1) and (2) become inequalities that can be tested. In general, if X → Y and X ← Y, we expect:

(3) E(XY | X > 0)/E(X² | X > 0) ≠ E(XY)/E(X²)
(4) E(XY | Y > 0)/E(X² | Y > 0) ≠ E(XY)/E(X²)

We have seen two problems in simulations with testing (3) and (4) as they stand. First, small numerical differences can make the conditions hold even when no 2-cycle is present. Second, the conditions may hold without a 2-cycle if the sample is not large enough. To overcome these problems we approximate the ratios in (3) and (4) with conditional and unconditional correlations and perform statistical tests of the difference of correlations. In particular, using centered rather than uncentered correlations, we test whether corr(X, Y) = corr(X, Y | X > 0) and corr(X, Y) = corr(X, Y | Y > 0).
Specifically, we convert these correlations to z-scores and perform two-tailed hypothesis tests of the difference between the conditional z-score and the unconditional z-score, for X > 0 and Y > 0 respectively. If both differences are judged significantly different from zero, we may conclude that X and Y form a 2-cycle.
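One standard way to carry out such a two-tailed difference test is via Fisher's z-transform, with each correlation's variance approximated as 1/(n − 3). The sketch below is illustrative, not the paper's exact implementation; in particular it treats the conditional and unconditional estimates as independent, which is itself an approximation since the conditional sample is a subset of the full sample:

```python
import math
import numpy as np

def corr_diff_pvalue(x, y, mask):
    """Two-tailed p-value for the difference between corr(X, Y) and
    corr(X, Y | mask), using Fisher z-transforms with variance 1/(n - 3)."""
    r_all = np.corrcoef(x, y)[0, 1]
    r_cond = np.corrcoef(x[mask], y[mask])[0, 1]
    z_all, z_cond = np.arctanh(r_all), np.arctanh(r_cond)
    se = math.sqrt(1.0 / (len(x) - 3) + 1.0 / (int(mask.sum()) - 3))
    stat = (z_all - z_cond) / se
    return math.erfc(abs(stat) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|stat|))
```

A small p-value for both the X > 0 and Y > 0 conditionings is the evidence the rule uses to declare a 2-cycle.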
The one problem with this approach is that the coefficients in the two directions of the 2-cycle may have opposite signs, a configuration called a control cycle. In control 2-cycles, one or the other of the conditional correlations might not be distinguishable from the unconditional correlation because of cancellation or near cancellation of coefficients. That is, one might reject the null hypothesis of the difference test only for X > 0 or only for Y > 0, but not both, thereby missing the presence of a 2-cycle. However, in control 2-cycles the differences in correlations will be of opposite sign, unlike the unidirectional X → Y case, in which these differences will be of the same sign. Leveraging this information, it is possible to distinguish a control 2-cycle from a unidirectional edge. The exact condition is given in the pseudo-code for Algorithm 3. Following this approach, all 2-cycles in a causal graph that contains no other cycles are identifiable.
If the graph contains other cyclic paths between X and Y, we simply need to condition on some combination of the variables adjacent to X and the variables adjacent to Y, other than X and Y themselves, to block additional d-connecting cyclic paths between X and Y. The theory for this is given in Richardson (1996) for the Cyclic Causal Discovery (CCD) algorithm. For the 2-cycle detection rule, we may try every combination of such adjacent variables, looking for a conditioning set under which the 2-cycle condition above fails to obtain. If we find such a conditioning set, we stop and apply the Left-Right rule. If there is no such conditioning set, we conclude that the adjacency is a 2-cycle. This idea is incorporated into the pseudo-code for Algorithm 3.

Supplementary Material B

B1 Parameter Tuning
All the search procedures tested here require at least one user-input parameter; some require two or three. In order to give fair comparisons under parameters optimal for the respective algorithms and kinds of data, training datasets were created exactly as described in the main text section Methods: Data Description, but with the coefficient values sampled from a Gaussian distribution with mean = 0.4 and standard deviation = 0.1, truncated between 0.2 and 0.6. We refer to these data as training datasets.
Ten datasets were selected at random without replacement from 60 individual training datasets for Network 5 and Network 7 exclusively. Network 5 is a representative case for a structure with 2-cycles; and Network 7 represents structures with higher degree cycles.
Each of the search algorithms was run on the training datasets with varying parameters. The Matthews correlation coefficient (MCC) (Matthews, 1975), computed from the confusion matrix, was used to assess the performance of the methods. The MCC ranges from +1, when an algorithm perfectly recovers the true graph without false positive edges, to −1, when the algorithm produces only false positive edges without recovering any of the true edges in the graph. The average of the MCC for adjacency estimation and orientation estimation was computed for each combination of parameters, and the best-performing parameters were selected. In cases where two or more combinations of parameters performed equally well, the combination present in the list of highest performances for both networks was selected. If multiple combinations appeared in both high-performance lists, each such combination was used in the test phase. Note that the MCC contains an implicit utility trade-off between recall and precision that may not be appropriate for some settings.
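For reference, the MCC can be computed directly from the edge confusion matrix (true/false positives and negatives of the estimated edges); a minimal sketch:

```python
import math

def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from a confusion matrix.
    +1 = perfect recovery, -1 = total disagreement."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when one margin of the table is empty
    return (tp * tn - fp * fn) / denom
```

For example, an estimate matching the true graph exactly gives mcc = 1, while one that outputs only false positives and misses every true edge gives mcc = −1.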
For both implementations of Two-Step (FAS-stable and Alasso), multiple combinations of parameters performed equally well, but there was no overlap between the high-performing parameter lists for Networks 5 and 7, and thus no combination of parameters could be selected following the above procedure. To solve this problem, Two-Step with FAS-stable and with Alasso were run on ten training datasets, selected at random without replacement, generated from the SmallDegree macaque network with coefficients sampled from a Gaussian distribution with mean = 0.4 and standard deviation = 0.1, truncated between 0.2 and 0.6. The highest-performing combination of parameters on this network was selected for both implementations of Two-Step.
Using the parameter values obtained under the process described above, all the algorithms were run on test data from Networks 1 to 9 from Figure 1.

B2 Parameter Tuning for Macaque-Based Simulations
As noted in the main text section Methods: Data Description, for the macaque-based simulations the coefficients were restricted to very small values to guarantee the stability of the synthetic BOLD signals. Given this restriction on the coefficients, we did not find it feasible to create a training set based on a different range for the coefficients, as we did in Section B1. Instead, we followed this procedure: we centered and concatenated the first ten datasets for each macaque-based simulation and used the concatenated data as training data. The remaining 50 individual datasets were used as the test set. FASK and Two-Step with Alasso were run for different parameter values, and we chose those that maximize the average of the Matthews correlation coefficient for adjacencies and 2-cycle orientations in the case of FASK, and for adjacencies and the complete orientations for Two-Step; we explain the reason for this difference below. For FASK, the penalty discount c controls the sparsity of the adjacency search, while the α threshold controls the sensitivity for the difference of correlations used in the 2-cycle detection step. The Left-Right orientation rule does not require a parameter. Thus, we look for the pair of parameters that maximizes the average of the Matthews correlation coefficient for adjacencies and 2-cycles (MCCa2c). This process is illustrated in Figure B1, with heatmaps of the MCCa2c values obtained under different combinations of values for the penalty discount c and the α threshold for 2-cycle detection, for each of the four macaque-based networks.

Figure B1: Average Matthews correlation coefficient for adjacencies and 2-cycles (MCCa2c) for the FASK algorithm, under different combinations of the penalty discount c for the BIC* score (Y axis) and the α significance threshold for the 2-cycle detection rule (X axis). Results for the four macaque-based simulations (SmallDegree, LongRange, LongRangeControl, Full).
For Two-Step with Alasso, we vary the sparsity penalty λ of the B matrix estimation and the threshold for the absolute values in the final B matrix. In contrast to FASK, the detection of 2-cycles is part of the global ICA estimation procedure of Two-Step, and so we use the average of the Matthews correlation coefficient for adjacencies and orientations (MCCao). Figure B2 shows the parameter tuning heatmaps for combinations of λ and the threshold for the absolute values of the B matrix, for each of the four macaque-based network simulations.

Figure B2: Average Matthews correlation coefficient for adjacencies and orientations (MCCao) for the Two-Step(Alasso) algorithm, under different combinations of the sparsity parameter λ (Y axis) and the threshold for the absolute B matrix values (X axis). Results for the four macaque-based simulations (SmallDegree, LongRange, LongRangeControl, Full).

Supplementary Material C

C1 Results for simple networks with 2-cycles

Complete results for the analysis shown in Figure 2 of the main text, for Networks 1 to 6 and their amplifying and control variants. "amp" and "cont" after each network label indicate whether the network has an amplifying (excitatory) or a control (inhibitory) cycle. For Network 5 two additional control variants were tested: one where the 2-cycle is a control cycle with coefficients +0.3 and −0.7 (Net5 c p3n7), and its inverse where the 2-cycle coefficients are +0.7 and −0.3 (Net5 c p7n3). Tables show, for each algorithm tested, the average results for 60 repetitions of ten concatenated datasets from the corresponding networks, for precision and recall of adjacencies, orientations, and 2-cycle detection, and running times in seconds. The average and standard deviation over all ten networks for each algorithm are presented at the right of each average table. These averages of averages are plotted in Figure 2. Corresponding tables showing the standard deviation across the 60 repetitions of ten concatenated datasets for each network and algorithm are also presented.

C2 Results for simple networks with higher degree cycles

Complete results for the analysis shown in Figure 3 of the main text, for Networks 7 to 9 and their amplifying and control variants. "amp" and "cont" after each network label indicate whether the network has an amplifying (excitatory) or a control (inhibitory) cycle. Tables show, for each algorithm tested, the average results over 60 repetitions of ten concatenated datasets from the corresponding simulations, for precision and recall of adjacencies, orientations, and 2-cycle false positives, and running times in seconds. The average over all eight networks for each algorithm is shown to the right of each table, followed by the corresponding standard deviation. These averages are plotted in Figure 3.

C4 Results for resting state medial temporal lobe data

C4.1 Left hemisphere results
Results for 23 repetitions of ten concatenated subjects, for seven regions of interest of the medial temporal lobe in the left hemisphere. For FASK and Two-Step(Alasso) we report the frequency of appearance, in percentage, of each edge across the 23 repetitions. A value of 1.00 indicates that the edge appears in all 23 repetitions; a value of 0.04 indicates that it appears in only one repetition. FASK was run with penalty c = 1 and α = 0.05; Two-Step(Alasso) was run with λ = 2 and threshold 0.10.

C4.2 Right hemisphere results
Results for 23 repetitions of ten concatenated subjects, for seven regions of interest of the medial temporal lobe in the right hemisphere. Figure C1 shows graphs depicting robust directed edges that appear in 48% or more of the 23 repetitions, for FASK and Two-Step(Alasso). For FASK and Two-Step(Alasso) we also report the frequency of appearance, in percentage, of each edge across the 23 repetitions. A value of 1.00 indicates that the edge appears in all 23 repetitions; a value of 0.04 indicates that it appears in only one repetition. FASK was run with penalty c = 1 and α = 0.05; Two-Step(Alasso) was run with λ = 2 and threshold 0.10.

Figure C1: Comparison of the most robust edges estimated by FASK and Two-Step(Alasso) in 23 repetitions of ten concatenated subjects for seven regions of interest from the right hemisphere medial temporal lobe. A directed edge is depicted if it is estimated in 48% or more of the 23 repetitions of ten concatenated subjects. Regions of interest include: Cornu Ammonis 1 (CA1); CA2, CA3 and dentate gyrus together (CA23DG); entorhinal cortex (ERC); perirhinal cortex divided into Brodmann areas (BA35 and BA36); and parahippocampal cortex (PHC).

Table C30: FASK: medial temporal lobe edges ordered by frequency of appearance (right hemisphere). In bold, the edges with frequency equal to 48% or more.

Table C31: Two-Step(Alasso): medial temporal lobe edges ordered by frequency of appearance (right hemisphere). In bold, the edges with frequency equal to 48% or more.

C4.3 2-cycles frequency of appearance
For FASK and Two-Step(Alasso) we report 2-cycles by frequency of appearance, in percentage, across the 23 repetitions of ten concatenated subjects, for seven regions of interest from the medial temporal lobe in the left and right hemispheres respectively. A value of 1.00 indicates that the 2-cycle appears in all 23 repetitions; a value of 0.04 indicates that it appears in only one repetition. FASK was run with penalty c = 1 and α = 0.05; Two-Step(Alasso) was run with λ = 2 and threshold 0.10.

C4.4 Individual datasets results for medial temporal lobe data
Results for 23 individual subject datasets, for seven regions of interest of the medial temporal lobe in the left and right hemispheres respectively. For FASK and Two-Step(Alasso) we report the frequency of appearance, in percentage, of each edge across the 23 individual subjects. A value of 1.00 indicates that the edge appears in all 23 subjects; a value of 0.04 indicates that it appears in only one subject. We also report the frequency of appearance of 2-cycles. FASK was run with penalty c = 1 and α = 0.10; Two-Step(Alasso) was run with λ = 2 and threshold 0.10.

C5 Results for rhyming task individual data
FASK and Two-Step(Alasso) were run on 9 individual subjects' standardized datasets (160 datapoints each) for eight regions of interest in the cerebral cortex and one Input variable modeling the dynamics of the visual stimuli presentation in the rhyming task. We report the frequency of appearance, in percentage, of individual edges across the 9 subjects, as well as the frequency of appearance of 2-cycles. A value of 1.00 indicates the edge appears in all 9 subjects; a value of 0.11 indicates it appears in just one subject. FASK was run with a penalty discount of c = 1 and α = 0.10 to account for the reduction in sample size relative to the concatenated data (1,440 datapoints). Two-Step(Alasso) was run with λ = 2 and a threshold for the absolute values of the B matrix of 0.10.

Table C43: Two-Step(Alasso): Rhyming task data 2-cycles ordered by frequency of appearance (individual datasets).

C6 Measurement noise and sampling resolution results
Values used to compute the percentage changes reported in Figures 4 and 5 of the main text, for data with and without measurement noise (abbreviated as m.n.), and for a fast sampling rate of TR = 1.2 seconds and a slower rate of TR = 3 seconds.

Supplementary Material D Spectral density of simulated neural signals
The neural signals z (from Eq. 1 in the main text, dz/dt = σAz + Cu) come from a Poisson process that controls two states with mean durations of 2.5 seconds (up) and 10 seconds (down), together with the connectivity matrix A. We plot (overlapped) spectral density estimates (left) and corresponding log-log plots (right) from 91 synthetic neural signals from the full macaque network, with a black line representing the mean log spectral density. The mean log spectral density (black line in the log-log plot) was fitted to a power-law function p(x) = x^(−a), which resulted in an exponent of a = 2.4. We consider that the spectral densities shown in our log-log plot resemble, in shape and mean exponent value, the log-log plots for the spectral density of human electrocorticography (ECoG) reported in Figs. 1, 2, and S1 of He, B. J., et al. (2010), The temporal structures and functional significance of scale-free brain activity, Neuron, 66(3), 353-369.

Figure D1: Spectral density estimates (left) and log-log plots (right) for simulated neural signals of 91 nodes from the full macaque network. The black line in the log-log plot represents the mean log spectral density estimate. The mean log spectral density was fitted to a power-law function p(x) = x^(−a), which resulted in an exponent of a = 2.4.
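The exponent fit can be reproduced by least-squares regression of log power on log frequency. The sketch below is illustrative (the spectral estimation and fitting details used for Figure D1 may differ) and verifies the regression on an exact power law:

```python
import numpy as np

def powerlaw_exponent(freqs, psd):
    """Fit log(psd) = -a * log(f) + const by least squares and return a,
    the exponent of the power law p(f) = f^(-a)."""
    mask = freqs > 0                      # log is undefined at f = 0
    slope, _ = np.polyfit(np.log(freqs[mask]), np.log(psd[mask]), 1)
    return -slope

f = np.linspace(0.01, 1.0, 200)           # frequency grid (arbitrary units)
a_hat = powerlaw_exponent(f, f ** -2.4)   # recovers a = 2.4 on an exact power law
```

On real spectral estimates the fit is only approximate, and restricting the regression to the scale-free frequency band matters for the estimated exponent.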

Supplementary Material E Performance at increasing sample size
To analyze the dependence of the performance of FASK and Two-Step on the sample size, we ran FASK and Two-Step(Alasso) at increasing sample sizes (obtained by concatenating datasets) for the Network 4 simulation. The minimum sample size tested was 500 and the maximum 30,000. FASK was run with a penalty discount of c = 2 and α = 10^(−6) for the 2-cycle detection; Two-Step(Alasso) was run with λ = 32 and a threshold of 0.15 for the absolute values of the estimated B matrix.

Supplementary Material F Subsampling strategy application
We apply the Bühlmann & van de Geer (2011, ch. 10) subsampling strategy for the control of false positives to our particular problem of inferring a cyclic directed network from fMRI data. This strategy consists of creating k random subsamples of size n/2, drawn without replacement (within each subsample) from an original dataset of sample size n; inferring k graphs; and counting the number of times a particular edge appears across the k graphs. A final graph is built from those edges that appear in more than a proportion t of the k graphs, where 0.50 < t ≤ 1. The value of t is a threshold (which may be different for each network and dataset) that depends on the number of tolerated false positives. To obtain t we first order the edges, from high to low, according to f, the proportion of times an edge appears across the k graphs. Then we assign to each ordered edge an index q, starting at q = 1 for the edge with the highest f. Finally, for each q and f we compute E(a) = (q²/p)(1/(2f − 1)), which is a bound on the expected number of false positives if we set the threshold t = f. In this equation, p = v(v − 1), where v is the number of nodes, so p is the total number of possible directed edges in the network. Once we build this table, we can find the value of f corresponding to the value of E(a) closest to the number of false positives we are tolerating, and set the threshold t equal to that f, as long as 0.50 < f ≤ 1. For example, if we tolerate a number of false positives equal to 5% of the total number of possible edges p, then we look in the E(a) column of our table for the value closest to 0.05p and set t equal to the corresponding f. Intuitively, the smaller the number of tolerated false positives, the higher the selected value of t. Using datasets with 500 datapoints, we test this subsampling strategy for FASK and Two-Step(Alasso) on the 18 simple simulations and on the four Macaque-based simulations.
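The threshold selection described above can be sketched as follows; `stability_threshold` is a hypothetical helper name, and ties and edge cases are handled simplistically:

```python
def stability_threshold(freqs, v, tolerated_fp):
    """Given edge appearance proportions `freqs` across k subsample graphs,
    the number of nodes v, and a tolerated expected number of false
    positives, return the selection threshold t, i.e. the f in (0.5, 1]
    whose bound E(a) = (q^2 / p) * 1/(2f - 1) is closest to tolerated_fp.
    Returns None if no f in (0.5, 1] is available."""
    p = v * (v - 1)                       # total possible directed edges
    fs = sorted(freqs, reverse=True)      # order edges by proportion f
    best_t, best_gap = None, float("inf")
    for q, f in enumerate(fs, start=1):   # q = rank of the edge, starting at 1
        if not (0.5 < f <= 1.0):
            continue
        e_a = (q ** 2 / p) * (1.0 / (2 * f - 1))
        gap = abs(e_a - tolerated_fp)
        if gap < best_gap:
            best_gap, best_t = gap, f
    return best_t
```

For instance, with v = 5 nodes (p = 20) and appearance proportions [1.0, 0.9, 0.8, 0.6, 0.3], tolerating one expected false positive (0.05p) selects t = 0.8, since its bound E(a) = (9/20)(1/0.6) = 0.75 is the closest to 1.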
We compare precision and recall for these datasets against those obtained with the concatenated datasets of 5,000 datapoints, for which results were previously reported in Figures 2, 3 and Table 1 in the main text. For the subsampling strategy we use k = 100 to guarantee stability of the results. In our simulations, for smaller problems, such as the 18 simple networks, a tolerated number of false positives equal to 5% of the possible directed edges p produces good results, but for larger problems, such as the Macaque-based simulations, we obtained better results with a proportion of 1/p.
For the datasets with 500 datapoints, we use FASK with a penalty of c = 1 and α = 0.01 for the 18 simple simulations, and α = 0.10 for the Macaque-based; Two-Step(Alasso) was run with λ = 1 and varying B matrix thresholds, 0.10 for the 18 simple simulations and Macaque SmallDegree network; 0.05 for the Macaque LongRange and Macaque LongRangeControl cases; and 0.005 for the Macaque Full network. These threshold values for Two-Step correspond to the ones derived previously in the parameter tuning process.