Impacts of an Information and Communication Technology-Assisted Program on Attitudes and English Communication Abilities: An Experiment in a Japanese High School

We conducted a randomized experiment targeting 322 Japanese high school students to examine the impacts of a newly developed English-language learning program. The treated students were offered an opportunity to communicate for 25 minutes with English-speaking Filipino teachers via Skype several times a week over a 5-month period as an extracurricular activity. The results show that the Skype program increased the interest of the treated students in an international vocation and in foreign affairs. However, the students did not improve their English communication abilities, as measured by standardized tests, probably because of the program's low utilization rate. Further investigation showed that the utilization rate was particularly low among students demonstrating a tendency to procrastinate. These results suggest the importance of maintaining students’ motivation to keep using such information and communication technology-assisted learning programs if they are not already incorporated into the existing curriculum. Having procrastinators self-regulate may be especially crucial.

Information and communication technology (ICT) has increasingly been used as an alternative to more conventional resources (e.g., Gee and Hayes 2011; Levy 2009). Such ICT-assisted educational resources can be best used to help overcome the limitations of conventional resources. In particular, because ICT can provide customized and self-paced learning opportunities, the use of ICT in education has huge potential to improve the effectiveness of home learning.
According to surveys by Bulman and Fairlie (2016) and Snilstveit et al. (2016), the classroom use of ICT generally has positive impacts, especially for students in lower grades studying math or science. While earlier observational studies found large positive impacts of home use of ICT on students' academic outcomes, these studies suffered from the selection bias that students or teachers with unobserved high ability or motivation tended to introduce the new ICT-assisted resources. More recent experimental studies tended to find smaller or even no impacts. Such mixed results for the home use of ICT partly reflect differences in the grades of the sampled students, their proficiency levels, sampled countries, and studied or targeted subjects; however, we particularly need evidence on whether the home use of ICT can compensate for the weaknesses of conventional education resources.
To test the usefulness of the home use of ICT in complementing current education programs, we conducted a randomized controlled trial (RCT) that provided ICT-assisted resources for Japanese high school students learning English. In contrast to the high internationally normed performance of Japanese students in reading, math, and science (as measured by the Organisation for Economic Co-operation and Development's Program for International Student Assessment for Grade 9 students), their performance in English has been far from satisfactory. According to a nationwide English test conducted in 2014 by the Ministry of Education, Culture, Sports, Science and Technology, Japan (MEXT), a majority of Grade 12 students ranked at the lowest level (A1) in the Common European Framework of Reference for Languages, with their speaking performance lowest among the four skills measured. Based on these results, MEXT recognized that the quality of English education, particularly in nurturing speaking ability, should be improved (MEXT 2015a). As conventional English education programs in Japan have been unsuccessful, there is scope for the use of ICT-assisted resources to improve the quality of such education.
We experimentally introduced a newly developed online English learning program as an extracurricular activity to 322 Japanese students in Grade 10. This online program is an individualized, self-paced program in which students communicate with English-speaking Filipino interlocutors, mostly consisting of current students or graduates of the University of the Philippines, the top national university in the country. The students can communicate with them at mutually convenient times via Skype using learning materials of their own choice. This program is an example of human resource arbitrage from developing to developed countries with the help of modern ICT. Although it is beyond the scope of this paper, the program may have positive impacts not only on the Japanese-student side but also on the Filipino-instructor side by creating earning opportunities.
We introduced the Skype English program with a crossover design. First, we randomly selected half of our sample (161 students) to be given the opportunity to use the program for 5 months from July to November 2015, while the remaining 161 students were given the opportunity to use the program for 5 months from January to May 2016. While all the students had an equal opportunity to use the program by May 2016, only half of them had been given this opportunity as of December 2015, when we conducted the endline survey. We therefore refer to the students exposed to the program in the first round (July-November 2015) as the treatment group and those exposed to it in the second round (January-May 2016) as the control group.
Combining program usage records and panel data collected before and after the introduction of the program to the treatment group (but not yet to the control group), two main findings emerge. First, the program changed the attitudes of the treated students positively, especially in terms of their interest in an international vocation and in foreign affairs. In particular, our estimates of the local average treatment effect (LATE) suggest that the effects were large for students with greater program utilization. This finding is important because past longitudinal studies suggested that it is difficult to change students' attitudes toward an international vocation and foreign affairs when they study a foreign language (Ortega and Iberri-Shea 2005). This may be particularly the case in the Japanese school environment, which is known to have a monocultural and monolingual orientation. Furthermore, Sasaki (2011); Yashima (2002); and Yashima, Zenuk-Nishide, and Shimizu (2004) found that such attitudinal change among Japanese students will eventually lead to improvements in their English communication skills.
Second, despite the positive impacts on the students' attitudes, there is no measured impact on their English communication skills. This may be attributed to the low intensity of the program (25 minutes per lesson) in comparison with the students' concurrent regular English classes (50 minutes per lesson on most weekdays) as well as the program's low utilization rate. Only 10 of the 161 students in the treatment group took 50 or more lessons over the 5-month period, as recommended by the program provider, and 31 students took no lessons over the same period. In addition, regression analyses show that the utilization rate was particularly low among students with a tendency to procrastinate, which is consistent with the emerging literature on self-control problems (e.g., Duckworth, Milkman, and Laibson 2018). These findings warrant further research on how to improve and maintain students' motivation, particularly those with a tendency to procrastinate, to adopt home-use ICT programs such as the one targeted in this study.
The remainder of this paper is organized as follows. Section II describes our experiment, including the sample, timeline, and details of the intervention. Section III discusses sample balance and program utilization, and section IV presents the estimated program impacts. Finally, section V contains a summary of the findings and implications for future studies.

A. Sample
We collaborated with a public high school that is a top-tier school in central Japan. This school was selected by the Government of Japan in 2015 as one of the 112 Super Global High Schools among the 4,939 high schools in Japan. Super Global High Schools receive extra budgetary support to nurture globalized leaders with high levels of interest in societal problems, communication skills, and problem-solving abilities, who will play internationally active roles in the future (MEXT 2015b). The school agreed to introduce the online program as an extracurricular activity.
Our sample consisted of all 322 first-year high school students (Grade 10) who were newly admitted to the school a few months before the experiment. In Japan, high school admissions, whether public or private, are mostly based on students' academic performance on the entrance examination, with students subsequently tracked into different high schools of varying quality. After our sample students were admitted to our target high school, they were randomly assigned to one of eight classrooms, each consisting of 40 or 41 students. Classroom assignment was not affected by any preexisting peer groups; we took advantage of this to attain randomization in our experiment.
Further, each of the four full-time English teachers in the school was randomly assigned to teach two of these eight classes. To achieve balance in the quality of the English teachers in the classroom, we stratified the sample of students at the teacher-classroom level, randomly assigning one of the two classes instructed by each English teacher to the treatment group and the other to the control group (Figure 1). In sum, we have four treatment classes (with 160 students) and four control classes (with 161 students). Although our experiment may suffer from a small number of clusters (i.e., eight classes), the classroom-level intracluster correlation coefficients for outcome variables at the baseline survey are close to 0, indicating that there is little correlation of responses within a cluster, and thus, our randomization can be considered as being close to the student-level randomization. 5
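The stratified assignment described above can be illustrated with a short sketch. It is not the authors' procedure: the class labels, the seed, and the function name `assign_classes` are hypothetical, and the only features taken from the text are that each of four teachers teaches two classes and that one class per teacher is assigned to treatment.

```python
import random


def assign_classes(classes_by_teacher, seed=None):
    """Pick one treatment class per teacher; the teacher's other class is control."""
    rng = random.Random(seed)
    assignment = {}
    for teacher, classes in classes_by_teacher.items():
        if len(classes) != 2:
            raise ValueError(f"teacher {teacher} must teach exactly two classes")
        treated = rng.choice(classes)
        for class_id in classes:
            assignment[class_id] = "treatment" if class_id == treated else "control"
    return assignment


# Hypothetical class labels: four teachers, two classes each (eight classes in total).
classes_by_teacher = {
    "A": ["class_1", "class_2"],
    "B": ["class_3", "class_4"],
    "C": ["class_5", "class_6"],
    "D": ["class_7", "class_8"],
}

assignment = assign_classes(classes_by_teacher, seed=2015)
for class_id, arm in sorted(assignment.items()):
    print(class_id, arm)
```

Because the draw is made within each teacher's pair of classes, every teacher contributes exactly one treatment and one control class, which is what balances teacher quality across the two arms.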

B. Timeline
Before introducing the program, we conducted a baseline survey designed to collect information on the students' characteristics and attitudes toward English communication. The survey was conducted in June 2015, using a mark-sheet questionnaire we developed. The timeline of our research is presented in Table 1.
Soon after the baseline survey, the sample students took the Versant speaking test (Pearson Inc. 2008), a standardized test designed to evaluate the oral English skills (integrated listening and speaking) of nonnative English speakers. 6 The test was administered solely for this research project (although the results were shared with the students as feedback) to construct our measure of English communication ability. Following the survey and the Versant test administered in June 2015, we commenced the intervention on 1 July 2015. The students in the treatment classes were provided with opportunities to use the online program free of charge, although the market price of the program was ¥5,800 (about $52) per month. This included one 25-minute lesson for every day of the intervention period.
5 The classroom-level randomization will help us mitigate the violation of the Stable Unit Treatment Value Assumption caused by spillover effects among students in the same classroom. While admitting that it is technically difficult to separate the direct effect of our intervention from the indirect effect through their peers in the classroom-level randomization, as pointed out by Imbens and Wooldridge (2009), we think that the degree of such indirect effect is limited because our outcome variables are individual measures of attitudes and test scores, which are more likely to be affected by interactions with English teachers than by those with their classmates.
Soon after our intervention commenced, the students took a nationally administered English test developed and distributed by Benesse Co. The test is a mock university entrance exam designed primarily to measure students' English reading ability. The sample students took a similar test again in November, toward the end of our intervention. Although the tests were not taken for the purpose of our study, the school agreed to share the results with us to be used as another measure of the students' English abilities. In addition, in November, the students took the Global Test of English Communication (GTEC), a standardized test developed and distributed by Benesse Co. to evaluate reading, listening, writing, and speaking skills in English. 7 The school also agreed to share the results of this test with us.
In December 2015, when only the treated students had received the program, we conducted an endline survey and Versant test. In other words, to investigate the effects of the online program, the treatment and the control classes were compared using a difference-in-differences (DiD) design. To mitigate inequality between the two groups (as mentioned above), we provided the same amount of intervention with a time lag, with the program being made available to the control classrooms from January to May 2016. By the end of May 2016, all 322 students had been exposed to the same intensity of intervention (or lack thereof).
6 We chose this particular test because of its reported high validity and reliability among populations similar to the sample in the present study and because it requires a relatively short time (20 minutes) to conduct compared with other English communication tests (e.g., TOEFL iBT). During the Versant test, the students listened to questions spoken in English and provided verbal answers in English. Their answers were recorded and automatically marked online. The test was conducted by class in a computer room inside the school, and thus, the test-taking environment was essentially the same for all students. The Versant test scores ranged from 20 to 80 and involved four criteria: (i) sentence mastery, (ii) vocabulary, (iii) fluency, and (iv) pronunciation. The scores correspond with the levels of the Common European Framework of Reference for Languages: for example, a Versant score of 20-25 is equivalent to the lowest (A1) level, while a score of 79-80 is equivalent to the highest (C2) level.
7 The test consists of 30 multiple-choice reading items (24 minutes), 30 multiple-choice listening items (13 minutes), 3 performative writing items (26 minutes), and 4 performative speaking items (12 minutes).

C. Intervention
Our intervention consisted of providing the sampled students with opportunities to use the online program. In contrast to conventional face-to-face English learning methods, in this program, learners and teachers do not have to be present in the same space. In addition, learners can be matched with teachers on a more flexible basis because learners can select among available teachers at a time of their convenience. Such online English programs have become increasingly popular among Japanese businesspeople, partly because of time flexibility advantages and partly because of the low cost of such programs relative to similar face-to-face English learning programs offered by commercial conversation or cramming schools. However, according to the baseline survey that we conducted before the beginning of the intervention, 65% of the students had never heard of this type of online English learning program, only one student was using such a program, and another 10 had used one in the past. In this baseline survey, 30% of the students responded that they would be very willing to use the program if given the opportunity, and another 50% responded that they were moderately willing to use it. Hence, while the program was new to most of the students, it was favorably perceived at the beginning of our intervention.
The online program was provided to the students outside of their regular English classes. Each lesson took 25 minutes, and the students were advised to take one lesson every 3 days (i.e., 10 lessons a month, or 50 in total) to take full advantage of the program. The students could make an appointment for a lesson at any time between 6 a.m. and 1 a.m. on the following day and could choose any of the available teachers. If the student's preferred teacher was not available at the time of their convenience, they were able to choose another time slot or another available teacher in the same slot. The pool of teachers consisted mostly of current students or graduates of the University of the Philippines. Because English is the language of instruction in their home university and also because they were screened on the basis of the company's strict hiring criteria, we judged that the quality of the teachers was reasonably guaranteed. While some of the teachers spoke Japanese, participating students had to communicate entirely in English with the help of the chat (texting) function in Skype. Students were free to choose appropriate study materials for each lesson from a wide range of materials provided by the program, including daily conversation, academic talk, grammar and vocabulary, and business English. In other words, the choice of teachers, time slots, and study materials was entirely up to the participants. Most importantly, while we provided the students with opportunities to use the program at home, it was ultimately up to them whether and how often to take the lessons, especially because their participation did not affect their grades.
One of the biggest advantages of this online program is its cost-effectiveness. The government launched the Japan Exchange and Teaching Program in 1987, which involved providing English-speaking aides known as Assistant English Teachers (AETs) to Japanese English teachers in primary, middle, and high schools (Grades 1-12). This program has expanded since then: a total of 5,163 AETs were employed as of 2017. The individual annual cost for an AET is approximately $53,000, including salary, coordination, and transportation, while the market price of this English program is $600 per year. Based on the program provider's back-of-the-envelope calculation, the program enables students to devote 15 times more minutes to speaking with English-speaking partners than speaking with an AET for every dollar spent.

A. Balance
Table 2 presents the basic characteristics of students that could potentially influence the take-up rate and effects of the online program. As the literature finds that a lack of self-control, including procrastination, can result in poor test performance or low grades (e.g., Golsteyn, Grönqvist, and Lindahl 2014; Onji and Kikuchi 2011), we constructed an index of procrastination as a control variable based on six questions that rate students' perceptions of themselves, taken from Osaka University (2013) and Honda and Nishijima (2007). The questions (originally written in Japanese and translated by the authors) included items such as "Are you a person who postpones plans even when you make them?" and "Are you a person who is happy as long as you are having fun now?" The students answered all six questions with categorical responses: (i) yes, (ii) moderately yes, (iii) 50/50, (iv) moderately no, or (v) no. We assigned a score of 4 to the answer yes, 3 to moderately yes, 2 to 50/50, 1 to moderately no, and 0 to no. We then aggregated the scores for all six questions to construct a single index of procrastination, which ranged from 0 to 24 (maximum of 4 multiplied by 6 items). These aggregated scores were normalized by subtracting the sample mean and then dividing by the standard deviation. The mean z-score of the procrastination index is −0.02 among the treatment group and 0.021 among the control group; importantly, these means are not statistically different.
Other control variables include gender, past exposure to English (whether the student has been abroad and the grade at which they started learning English in primary school), and current study environment (having their own room and electronic device, such as a personal computer connected to the internet or a tablet, commuting time to school, and membership of a school sports club), as well as their family background (number of books at home and parental educational attainment). We also collected information on smartphone ownership, but almost all of the students (96%) owned one, so we do not include this variable as a control. The differences in means between the two groups are statistically insignificant at the 5% level for all the variables, indicating that randomization was performed successfully.
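The construction of the procrastination index (six items scored 0 to 4, summed, then normalized) can be sketched as follows. The function names and the toy inputs are hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not say which variant was used.

```python
from statistics import mean, stdev

# Scores assigned to each categorical response, as stated in the text.
RESPONSE_SCORES = {
    "yes": 4,
    "moderately yes": 3,
    "50/50": 2,
    "moderately no": 1,
    "no": 0,
}


def procrastination_index(answers):
    """Sum the scores of the six items; the result lies between 0 and 24."""
    if len(answers) != 6:
        raise ValueError("expected answers to exactly six questions")
    return sum(RESPONSE_SCORES[answer] for answer in answers)


def standardize(values):
    """Subtract the sample mean and divide by the standard deviation.

    Assumption: the sample (n - 1) standard deviation is used.
    """
    centre = mean(values)
    spread = stdev(values)
    return [(value - centre) / spread for value in values]


# Hypothetical raw index values for three students.
z_scores = standardize([0, 12, 24])
```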

B. Program Utilization
Figure 2 shows daily changes in the number of students who took the lessons based on program usage records. Of the 160 students assigned to the treatment group, the average number of students who took lessons each day was 25 in July 2015. However, if all students had completed the recommended 10 lessons a month, that number would be 52 (10 lessons multiplied by 160 students and divided by 31 days). Thus, the take-up rate in the first month of the intervention was about 50%. Moreover, the number of students taking lessons decreased gradually, presumably because the novelty effect faded and peer pressure was muted by the summer vacation, which started during the last week of July, with the average number falling to 15 in August, 12 in September, 6 in October, and 5 in November. While Figure 2 shows daily changes in program utilization, Figure 3 shows the student-level number of lessons taken during the intervention period. Thirty-one (19%) of the 160 students never took any lessons in the 5-month period, and 57 (36%) took five or fewer lessons. Only 23 students (14%) completed 25 or more lessons, one-half of the recommended number, of whom only 10 (6%) completed the recommended 50 or more lessons.
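The take-up arithmetic above can be reproduced with a few lines. This is a sketch under the assumption that the recommended pace of 10 lessons a month applies uniformly to every month; the function names are hypothetical, and the daily averages are the figures quoted in the text.

```python
STUDENTS = 160                  # students in the treatment classes
RECOMMENDED_PER_MONTH = 10      # recommended lessons per student per month

# Month -> (days in month, average number of students taking a lesson per day).
USAGE = {
    "July": (31, 25),
    "August": (31, 15),
    "September": (30, 12),
    "October": (31, 6),
    "November": (30, 5),
}


def expected_daily_lessons(students, lessons_per_month, days):
    """Daily lessons if every student followed the recommended pace."""
    return students * lessons_per_month / days


def take_up_rate(avg_daily, students, lessons_per_month, days):
    """Observed daily lessons as a share of the recommended daily lessons."""
    return avg_daily / expected_daily_lessons(students, lessons_per_month, days)


for month, (days, avg_daily) in USAGE.items():
    rate = take_up_rate(avg_daily, STUDENTS, RECOMMENDED_PER_MONTH, days)
    print(f"{month}: {rate:.0%} of the recommended pace")
```

For July, the expected number is 1,600 / 31, or about 52 lessons a day, so 25 observed lessons correspond to a take-up rate of roughly 48%, in line with the "about 50%" stated above.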
To identify the factors associated with program utilization, we estimated ordinary least squares models while controlling for the English teacher dummies. Column 1 of Table 3 shows that the coefficient on the procrastination index is negative and significant, illustrating the detrimental effect of procrastination on program utilization. The significance of this variable remains robust even after the variables listed in Table 2 are controlled for (column 2). In terms of the size of the effect, a 1 standard deviation increase in the procrastination index reduces the number of lessons taken by about 4, against a mean of 12.2 lessons; thus, the influence of procrastination seems nonnegligible.
As the program was new to most of the students and the first few trials of the program are critical for subsequent utilization, we estimated a linear probability model, where the dependent variable is coded as a dummy variable that equals 1 if a student has ever used this Skype program and 0 otherwise. Indeed, according to our informal interviews with some of the students, regular Skype users started to like the program as they proceeded through the initial few talks with Filipino interlocutors, whereas nonusers felt hesitant to take the first lesson. Columns 4-6 show the results, and the procrastination variable is negative and significant.
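The two utilization models described above can be written out as follows. This is a sketch of one plausible specification, not the authors' stated equations: all symbols are introduced here for illustration, with Teacher A as the base category and the vector of Table 2 controls entering only in the columns that include them.

```latex
\begin{align}
\text{Lessons}_i &= \alpha + \beta\,\text{Procrast}_i
  + \sum_{k \in \{B, C, D\}} \delta_k\,\text{Teacher}_{ik}
  + X_i'\gamma + \varepsilon_i, \\
\text{Used}_i &= a + b\,\text{Procrast}_i
  + \sum_{k \in \{B, C, D\}} d_k\,\text{Teacher}_{ik}
  + X_i'c + u_i.
\end{align}
```

Here $\text{Lessons}_i$ is the number of lessons student $i$ took, $\text{Used}_i$ equals 1 if the student ever used the program and 0 otherwise, $\text{Procrast}_i$ is the standardized procrastination index, and $X_i$ collects the controls in Table 2. Under this reading, the reported reduction of about 4 lessons per standard deviation corresponds to $\beta$, and because the second equation is a linear probability model, $b$ and $d_k$ are read as changes in the probability of ever using the program.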
Table 3 also shows that the English teacher dummies are large in magnitude and statistically significant. For instance, a student with English teacher D was about 40 percentage points less likely to have ever used the program than a student with English teacher A (base category). The degree of in-class encouragement and reminders differed substantially from one teacher to another, with teacher A, who is the most senior and experienced among the four teachers, providing more encouragement and more frequent reminders to students to participate in the Skype tasks. According to our informal interviews, this teacher frequently asked the students whether they used the program, both to put gentle pressure on them and to have them share their experiences with other classmates. This teacher also posted an eye-catching message in the classroom urging students to use the program regularly. These observations suggest that the frequency of such promotive acts from teachers may be critical to the home use of ICT-assisted inputs.

A. Descriptive Analyses: Attitudes
We included two sets of outcome measures to evaluate the impacts of the online program: (i) attitudes and (ii) English communication abilities.
Notes to Table 3: Estimated coefficients are reported here. ***, **, and * indicate 1%, 5%, and 10% levels of statistical significance, respectively. Numbers in parentheses are t-statistics based on heteroscedasticity-robust standard errors. The base category for the English-since variable is "English since Grade 1 or 2," for the commuting time variable it is "Commuting time 20 minutes or less," and for the teacher dummies it is "Teacher A." Source: Authors' calculations.
To quantitatively measure any changes in students' attitudes toward English communication before and after the intervention, we employed two motivational attributes that have been found to influence students' second-language development: (i) international posture and (ii) willingness to communicate (WTC) (e.g., Yashima, Zenuk-Nishide, and Shimizu 2004). First, the construct of international posture was operationally defined as a composite of four subconstructs: (i) intercultural orientation; (ii) interest in an international vocation; (iii) reactions to different customs, values, or behaviors; and (iv) interest in foreign affairs. These subcomponents and corresponding items were adapted from those made available on the homepage of Professor Tomoko Yashima, who first introduced this construct to the field of applied linguistics. 9 This construct has proved to be one of the most distinct and significant factors explaining students' motivation, especially in English-as-a-foreign-language contexts (see, for example, Dörnyei and Ryan 2015). Using all 22 available items (seven for subcomponent 1, six for subcomponent 2, five for subcomponent 3, and four for subcomponent 4), we then created questions requiring either yes or no answers. Although the original versions of the 22 questions required responses using a six-point Likert scale, we simplified it to yes-no answers to avoid causing excessive fatigue among the students, who had to respond to many questions in our survey. We computed a score for each of the four subcomponents of international posture and then computed total scores, which ranged from 0 to 22, with a higher score indicating a more internationally oriented student. Finally, we computed z-scores for the total score as well as for the four subcomponents. 10
Panel A of Table 4 presents the means of the international posture scores by group, before and after our intervention with the treatment group (but not yet with the control group). First, the means of all the scores before the intervention were not statistically different between the two groups (see the p-values reported on the right). For instance, the baseline mean z-score for the treatment group was 0.042, which was slightly higher than the control group mean of −0.041, but the scores are not statistically different. After the intervention, however, the total score became higher among the treatment group than the control group, and the difference is statistically significant at the 5% level. If we examine the subcomponents, a significant difference is observed for subcomponent 2 (interest in an international vocation) and subcomponent 4 (interest in foreign affairs).
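The scoring of international posture described above (22 yes-no items grouped into four subcomponents, then standardized against the baseline sample) can be sketched as follows. Several details are assumptions made for illustration: that the items are ordered by subcomponent, that every "yes" counts as one point with no reverse-coded items, and that the sample standard deviation is used. The function names are hypothetical.

```python
from statistics import mean, stdev

# Number of items in subcomponents 1 to 4, as stated in the text.
SUBCOMPONENT_SIZES = (7, 6, 5, 4)


def posture_scores(answers):
    """Return (list of subcomponent scores, total score) for 22 yes/no answers.

    Assumption: answers are ordered by subcomponent and True means "yes".
    """
    if len(answers) != sum(SUBCOMPONENT_SIZES):
        raise ValueError("expected 22 yes/no answers")
    sub_scores = []
    start = 0
    for size in SUBCOMPONENT_SIZES:
        block = answers[start:start + size]
        sub_scores.append(sum(1 for answer in block if answer))
        start += size
    return sub_scores, sum(sub_scores)


def z_score(value, baseline_values):
    """Standardize a score with the baseline sample's mean and standard deviation."""
    return (value - mean(baseline_values)) / stdev(baseline_values)
```

Standardizing endline scores with baseline moments, rather than with endline moments, keeps the two survey rounds on the same scale, so a fall in the mean z-score can be read as a fall in the underlying score.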
Interestingly, the total score dropped from the baseline mean of −0.041 to an endline mean of −0.172 among the control group (z-scores were computed using the means and standard deviations among the baseline samples), which is a decline of 0.13 standard deviations. This declining trend was particularly observable for subcomponents 1 and 2, which suggests that the students' motivation to learn English shifted from a more internationally oriented one to a less internationally oriented one: preparation for university entrance exams. In the top-tier high school where we conducted the experiment, the curriculum focuses on exam preparation even for first-year students (Sasaki 2018). Hence, panel A appears to suggest that our program helped mitigate the worsening attitudes among sampled students by stimulating their interest in an international vocation and foreign affairs (subcomponents 2 and 4, respectively).
The second motivational variable, WTC, also has significant and complex relationships with second-language learner confidence, motivation, and actual language use (e.g., MacIntyre 2007). As in the case of international posture, we took the eight items that measured WTC from the above-mentioned homepage because they have been successfully used in the past with Japanese high school students learning English as a second language (e.g., Yashima 2009). The questions asked whether the students would be willing to communicate in English in hypothetical situations such as "group discussions on an English course," "giving a speech in public," and "a chance meeting with a foreign friend in the street." A six-point Likert scale offered the following choices: always, usually, sometimes, not very often, seldom, and never. We assigned 5 points to the answer always, 4 to usually, 3 to sometimes, 2 to not very often, 1 to seldom, and 0 to never, and computed the z-score of the total points.
9 Tomoko Yashima. Kokusai. http://www2.ipcku.kansai-u.ac.jp/∼yashima/data/kokusai.pdf (accessed April 15, 2019).
10 Appendix Table A1 presents regression results that analyze the baseline correlates of the international posture z-score as well as the baseline correlates of our other outcome variables discussed below.
Notes to Table 4: z-scores are computed using the means and standard deviations among the baseline samples for international posture, willingness to communicate, and Versant score. The level of the Benesse test is different from one test to another, as it is in accordance with the school curriculum; z-score is separately computed for baseline and endline samples. For the GTEC score, we only have observations at the endline; z-scores are computed using the means and standard deviations among the endline samples. ***, **, and * indicate 1%, 5%, and 10% levels of statistical significance, respectively. Source: Authors' calculations.
The means of the z-scores are reported toward the bottom of panel A in Table 4. Similar to international posture, the control mean dropped from the baseline to the endline. However, the drop was smaller among the treatment group, and the initially nondifferent means became marginally different in the endline. This finding suggests that although the students' WTC tended to decline as a result of an English curriculum, such as the one followed in the top-tier high school under study, the Skype program played a role in mitigating the declining WTC.
As an additional variable to examine the attitudes of sample students, we use the Cambodia study tour dummy variable reported at the bottom of panel A. The school organized a 1-week study tour to Cambodia in December 2017 and the students had a chance to voluntarily apply for inclusion. The school provided us with a list of students who applied, and we constructed a dummy variable that equals 1 if a student applied and 0 otherwise. Sixteen (10.1%) of the treated students and 11 (6.8%) of the control students applied. Although the difference is not statistically significant, the application rate was 4.2 percentage points higher among the treatment group. Importantly, the correlation between the application dummy and the total endline international posture score was positive with a correlation coefficient of 0.21 (not reported). Thus, the ICT program may have encouraged more students to apply by improving their international posture, which we may not be able to detect because of the weak statistical power.

B. Descriptive Analyses: English Communication Abilities
To quantify the students' English abilities, we use three sets of English tests: Versant, Benesse, and GTEC. We conducted the Versant tests both before and after our intervention to measure the development. In addition, the Benesse test was taken soon after our intervention started and toward the end of it, so the Benesse test score can also be used for the comparison using a DiD design. The GTEC test only measures cross-sectional differences after the intervention. All the test scores are presented as standardized z-scores. The scores of the standardized Versant test are comparable over time, and we computed z-values using the means and standard deviations among the baseline samples. Thus, we can measure the improvement in English communication abilities by looking at the changes in those abilities. However, the Benesse test score differs from one round to the other, as it is designed in accordance with the school curriculum and the difficulty of the test increases as students proceed with the curriculum. Thus, the z-scores are computed separately for the baseline and endline samples, and the changes in the z-scores before and after the treatment do not necessarily indicate changes in students' levels of English abilities because the Benesse test is likely to be more difficult in the endline.
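The difference between the two standardization approaches can be seen in a small sketch. The raw scores below are hypothetical, and the population standard deviation is again an assumption. When endline scores are standardized with baseline moments (as for Versant), a common improvement shows up as a positive mean; when each round is standardized separately (as for Benesse), the mean is zero by construction in every round.

```python
from statistics import mean, pstdev

def standardize(scores, reference):
    """z-scores of `scores` using the mean and SD of `reference`."""
    m, s = mean(reference), pstdev(reference)
    return [(x - m) / s for x in scores]

# Hypothetical raw scores: every student improves by 2 points.
baseline = [40.0, 44.0, 48.0, 52.0]
endline = [42.0, 46.0, 50.0, 54.0]

# Versant-style: endline standardized with baseline moments.
anchored = standardize(endline, baseline)
# Benesse-style: each round standardized on its own sample.
round_specific = standardize(endline, endline)
```

Only the baseline-anchored series carries information about changes in levels, which is why changes in the round-specific z-scores cannot be read as changes in ability.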
Panel B in Table 4 shows the results of the treatment and control groups' respective scores in the international posture and English tests. Although we primarily intended to use the Versant test as our measure of English communication abilities, the answers provided by some students were not properly recorded because of overburdened internet connections. That is, the test was conducted in a computer room inside the school in order to provide the same test-taking environment for all students, but we ultimately organized a follow-up session for the students whose answers were not recorded. Because not all students attended the follow-up session, the problem is that scores were unrecorded for students who were less confident and more hesitant to retake the test. Appendix Table A2 presents the regression results, where the left-hand-side variable is a dummy variable equal to 1 if the student took the Versant test. The results show that the Versant take-up was not correlated with the observable characteristics at the baseline, but was correlated with the baseline Versant score at the endline (column 6). This suggests that poorly performing students were less likely to have taken the endline Versant test, and we should therefore interpret the results cautiously.
For the Versant score, there is a slight difference between the two groups at the baseline, but it is not statistically significant. The score at the endline is statistically different between the two groups, with the treatment group having a higher score. However, this difference may be due to the types of students choosing to take the test, particularly among the treated students. Panel B also shows that the control mean increased from −0.093 to 0.406, which is a one-half standard deviation increase over 6 months. This is equivalent to a 2-point increase in the Versant score (out of a full score of 80), which is quite large according to the service provider. This improvement is most likely the consequence of the regular curriculum. By contrasting this result with our discussion above, we argue that while the regular school curriculum was unsuccessful in making the students' motivation to learn English more internationally oriented, it did improve their English communication abilities. The Skype program has the potential to sustain the students' intrinsic motivation and therefore supplement the regular curriculum.
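The arithmetic behind the control-group improvement can be written out as follows. The first expression uses only the two means reported above; the second is a back-of-the-envelope implication of the stated 2-point equivalence and should be read as an approximation rather than a reported figure.

```latex
\Delta \bar{z}_{\text{control}} = 0.406 - (-0.093) = 0.499 \approx 0.5 \text{ SD},
\qquad
\sigma_{\text{baseline}} \approx \frac{2 \text{ points}}{0.499} \approx 4 \text{ points}.
```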
The mean scores of the Benesse test, reported in the middle of panel B, were balanced at the baseline and there was no significant difference at the endline. One possible reason for this null result is that the Benesse test primarily measures reading abilities, whose improvement was not the main focus of the Skype program. The same logic applies to the overall GTEC score, which comprehensively measures four English-language skills. Yet, even when we look at the subcomponents of the GTEC, there was no statistical difference in subcomponent 2 (listening ability) or in subcomponent 4 (speaking ability). Taken together, the results shown in Panel B suggest that our intervention did not improve the English communication abilities of the treated students.

C. Econometric Specification
To rigorously analyze the impacts of the online program while controlling for the baseline level of the outcome variables and other characteristics, we applied two econometric specifications: analysis of covariance (ANCOVA) and DiD regression. Let $y_{ijkt}$ be an outcome variable of student $i$ in classroom $j$ with English teacher $k$ at time $t$. The ANCOVA specification is written as

$$y_{ijkt} = \alpha + \beta\,\mathit{Treatment}_j + \gamma\,y_{ijkt-1} + \eta_k + \varepsilon_{ijkt}, \qquad (1)$$

where $\mathit{Treatment}_j$ is a dummy variable equal to 1 for the student in treated class $j$, $y_{ijkt-1}$ is an outcome variable at $t-1$ (since we have only two time periods, $t-1$ represents the baseline and $t$ the endline), $\eta_k$ is a set of English teacher dummies, and $\varepsilon_{ijkt}$ is an error term, for which we compute heteroscedasticity-robust standard errors. The standard errors are not clustered because the number of clusters is much smaller than the rule-of-thumb number of 42 (Angrist and Pischke 2009). To control for possible intracluster correlations, together with correcting for the small number of clusters, we report the 95% confidence intervals (CIs) based on the wild cluster bootstrap method suggested in Cameron, Gelbach, and Miller (2008). We used the boottest Stata command developed by Roodman et al. (2019) for the computation of the bootstrapped CIs.
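As an illustration only, the ANCOVA regression and a wild cluster bootstrap test can be sketched in Python with NumPy. This is not the boottest implementation: the data are synthetic, the classroom and teacher layout is hypothetical, Rademacher weights are used instead of the gamma weights mentioned in the table notes, and the sketch returns a bootstrap p-value for the null of no treatment effect rather than inverting the test to obtain a CI.

```python
import numpy as np

def ols(X, y):
    """Return OLS coefficients and residuals."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return beta, y - X @ beta

def cluster_t(X, y, idx, n_clusters, j):
    """t-statistic for coefficient j with a cluster-robust sandwich variance."""
    beta, resid = ols(X, y)
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros_like(bread)
    for g in range(n_clusters):
        score = X[idx == g].T @ resid[idx == g]
        meat += np.outer(score, score)
    var = bread @ meat @ bread
    return beta[j] / np.sqrt(var[j, j])

def wild_cluster_pvalue(X, y, cluster, j, reps=999, seed=0):
    """Wild cluster bootstrap p-value for H0: beta_j = 0 (null imposed)."""
    rng = np.random.default_rng(seed)
    groups, idx = np.unique(cluster, return_inverse=True)
    t_obs = cluster_t(X, y, idx, len(groups), j)
    # Restricted fit: dropping column j imposes beta_j = 0.
    X_r = np.delete(X, j, axis=1)
    beta_r, resid_r = ols(X_r, y)
    fitted_r = X_r @ beta_r
    t_star = np.empty(reps)
    for b in range(reps):
        # One Rademacher weight per cluster flips all residuals in that cluster.
        w = rng.choice([-1.0, 1.0], size=len(groups))
        y_star = fitted_r + resid_r * w[idx]
        t_star[b] = cluster_t(X, y_star, idx, len(groups), j)
    return float(np.mean(np.abs(t_star) >= abs(t_obs)))

# Synthetic data: 8 hypothetical classrooms of 40 students, 4 teachers.
rng = np.random.default_rng(1)
n_classes, per_class = 8, 40
classroom = np.repeat(np.arange(n_classes), per_class)
n = classroom.size
treatment = (classroom % 2 == 0).astype(float)   # class-level assignment
teacher = classroom // 2                          # each teacher has 2 classes
y_base = rng.normal(size=n)
class_shock = rng.normal(scale=0.2, size=n_classes)[classroom]
y_end = 0.15 * treatment + 0.6 * y_base + class_shock + rng.normal(size=n)

# Design matrix: intercept, treatment, baseline outcome, teacher dummies.
teacher_dummies = (teacher[:, None] == np.arange(1, 4)).astype(float)
X = np.column_stack([np.ones(n), treatment, y_base, teacher_dummies])

beta_hat, _ = ols(X, y_end)          # beta_hat[1] is the ITT estimate
p_value = wild_cluster_pvalue(X, y_end, classroom, j=1)
```

With only eight clusters, Rademacher weights generate at most 256 distinct bootstrap samples, which is one reason other weight distributions are sometimes preferred when clusters are few.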
In equation (1), $\beta$ is the parameter of interest, which captures the intention-to-treat (ITT) impacts of the program. In addition to the ANCOVA specification, we also estimate a standard DiD model to control for unobserved, time-invariant, student-level heterogeneity, $\upsilon_i$, using the following specification:

$$y_{ijkt} = \beta\,(\mathit{Treatment}_j \times \mathit{Endline}_t) + \delta\,\mathit{Endline}_t + \upsilon_i + \varepsilon_{ijkt}, \qquad (2)$$

where $\mathit{Endline}_t$ is a dummy variable equal to 1 if the data are collected in the endline (i.e., after the intervention). $\beta$ in equation (2) is the parameter of interest, whereas $\delta$ measures the changes in the outcome variable from the baseline to the endline, which are mainly consequences of the regular school curriculum, as well as other changes that are common to all students.
To analyze the different impacts of the online program by level of utilization, we use an instrumental variable approach to estimate the LATE (Imbens and Angrist 1994). Specifically, we replace $\mathit{Treatment}_j$ in equations (1) and (2) with $\mathit{Lessons}^k_i$, which equals 1 if student $i$ took at least $k$ lessons during the intervention period. We use $\mathit{Treatment}_j$ as an instrument for $\mathit{Lessons}^k_i$ to estimate the program impact for compliers while varying the threshold number of lessons. Since the assignment of treatment was randomized and the control students could not take any lessons, $\mathit{Treatment}_j$ works as a valid instrument. We, however, suffer from the weak instrument problem since the take-up rate was not high. To correct for this problem, we report the 95% CIs based on the wild cluster bootstrap because it also corrects for weak instruments (Roodman et al. 2019). In addition, as a robustness check, we perform the conditional likelihood ratio tests developed by Moreira (2003), using the condivreg Stata command by Moreira and Poi (2003).

D. Econometric Analyses: Intention to Treat
Table 5 shows the ITT estimates of the program impacts. Odd-numbered columns present the ANCOVA estimation results based on equation (1), while even-numbered columns present the DiD results based on equation (2).
Panel A presents the estimated impacts on the attitude measures. Column 1 shows a positive and significant coefficient of the treatment on the total international posture score, and the wild cluster bootstrap CI excludes 0, supporting our discussion in the previous section. In the DiD estimation reported in column 2, the impact is positive but insignificant, although the t-statistic is as large as 1.41, with a corresponding p-value of 0.148 (not reported). The point estimate is 0.12 and that of Endline is −0.11, which is statistically significant; these coefficients suggest that the overall international posture score declined from the baseline survey in June 2015 to the endline survey in December of the same year, but the Skype program offset the declining international posture score among the treated students. Furthermore, the significant teacher dummy suggests the presence of substantial teacher heterogeneity, as discussed in section III.B. We report our results on WTC in columns 3 and 4. While not statistically significant, the point estimate is positive in both the ANCOVA and DiD estimations. In columns 5 and 6, we report results on the Cambodia tour. The point estimate is not significant, but the CI barely includes 0 in column 5 and excludes 0 in column 6. This suggests that the treated students may have been more likely to voluntarily apply for the opportunity to study abroad.
Panel B shows positive and significant impacts on subcomponents 2 (columns 3 and 4) and 4 (columns 7 and 8). The CIs for these two subcomponents exclude 0 (except for column 8, where the CI barely includes 0). With the point estimates for subcomponents 1 and 3 being close to 0, the impact on international posture comes from the changes in subcomponents 2 and 4. In particular, we find that while the Grade 10 students tended to become less interested in an international vocation (the size of the effect being 0.12 standard deviations; see column 4), such a tendency was compensated for by our intervention.
Panel C of Table 5 shows the ITT estimates of the program impacts on students' English communication abilities in the same manner as panel A. The point estimates are small or even negative, particularly for the Benesse (columns 3 and 4) and GTEC tests (columns 5 and 6), and the corresponding t-statistics are close to 0. In addition, all the CIs include zero. Even if we look at the subcomponents of the GTEC shown in panel D, particularly subcomponents 2 (listening) and 4 (speaking), we find similar patterns of small coefficients with small t-statistics and CIs including zero. Hence, our regression analyses show that the Skype program had limited impacts on the students' English communication abilities.
However, attitudinal attributes have been reported to lead to eventual improvement in students' second-language skills (e.g., Sasaki 2011, Yashima 2002); therefore, the Skype program may have significant impacts over the long term. Unfortunately, all of the sample students had received the same amount of online intervention by the end of May 2016, and thus, we do not have variation to evaluate such long-term impacts. In addition, we might have detected an effect if our intervention had been implemented for a longer period because Ross (2000), among others, finds that the duration is a major determinant of the effectiveness of second-language learning. Another important point to note from panel C is the significant coefficient of the endline dummy in column 2. As the scores of the standardized Versant test are intertemporally comparable, the positive and significant coefficient suggests that students' communication abilities significantly improved over time, most likely due to the regular school curriculum in this top-tier high school.
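To make the roles of the treatment and endline coefficients in the DiD specification concrete, a minimal sketch follows. It relies on the fact that, with two periods and a balanced panel, the student fixed-effects estimator reduces to a comparison of mean changes: the endline coefficient is the mean change among control students, and the treatment coefficient is the difference in mean changes between the two groups. The four students and their scores are hypothetical, and no covariates or inference are included.

```python
from statistics import mean

def did_two_period(baseline, endline, treated):
    """Two-period DiD from within-student changes.

    Returns (delta, beta): the control-group mean change and the
    treatment-minus-control difference in mean changes.
    """
    changes = [e - b for b, e in zip(baseline, endline)]
    change_treated = mean([c for c, t in zip(changes, treated) if t == 1])
    change_control = mean([c for c, t in zip(changes, treated) if t == 0])
    return change_control, change_treated - change_control

# Hypothetical z-scores for two treated and two control students.
baseline = [0.0, 0.2, -0.1, 0.1]
endline = [0.1, 0.3, -0.3, -0.1]
treated = [1, 1, 0, 0]

delta, beta = did_two_period(baseline, endline, treated)
```

In this toy example the control group declines by 0.2 while the treated group rises by 0.1, so the estimated program effect is 0.3 even though the common time trend is negative.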

E. Econometric Analyses: Local Average Treatment Effect
Table 6 reports the LATE estimates of program impacts on attitudes in panel A and on English communication abilities in panel B. In columns 1, 4, and 7 (where k = 5), the lesson dummy equals 1 if a student took at least five lessons in the intervention period; thus, the coefficient captures the impacts of the online program for students who completed at least five lessons.
In panel A, the size of the coefficient increases with k, indicating that the students who took more lessons benefited more from the program. For instance, the students who took 25 or more lessons (half of the number recommended by the service provider) have an international posture z-score that is 1.01 standard deviations higher than the average of the control students (column 3). However, the first-stage F-statistics decrease and the CIs widen as k increases because only 23 students (14%) completed 25 or more lessons, and the standard errors increase with k. This is one of the reasons why we do not find statistically significant coefficients for WTC (columns 4-6). In columns 7-9, although the coefficients are insignificant, the CIs exclude or barely include zero, indicating a positive impact on students' participation in the overseas study tour.
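The relationship between the ITT and LATE estimates can be illustrated with the simple Wald estimator, which divides the ITT effect on the outcome by the first-stage effect of assignment on the lesson dummy. This is a sketch without covariates or teacher dummies, so it is not the specification estimated in Table 6, and the eight students below are hypothetical. It shows that, in this simplified setting, the same ITT effect is divided by a smaller compliance share as the threshold k rises.

```python
def mean(values):
    return sum(values) / len(values)

def wald_late(outcome, assigned, took_k):
    """Wald estimator: ITT on the outcome divided by the first stage."""
    y1 = [y for y, z in zip(outcome, assigned) if z == 1]
    y0 = [y for y, z in zip(outcome, assigned) if z == 0]
    d1 = [d for d, z in zip(took_k, assigned) if z == 1]
    d0 = [d for d, z in zip(took_k, assigned) if z == 0]
    itt = mean(y1) - mean(y0)
    first_stage = mean(d1) - mean(d0)   # share of compliers
    return itt, first_stage, itt / first_stage

# Hypothetical data: four assigned students, four controls.
assigned = [1, 1, 1, 1, 0, 0, 0, 0]
outcome = [1.0, 0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1]
took_5 = [1, 1, 0, 0, 0, 0, 0, 0]    # at least 5 lessons
took_25 = [1, 0, 0, 0, 0, 0, 0, 0]   # at least 25 lessons

itt, share_5, late_5 = wald_late(outcome, assigned, took_5)
_, share_25, late_25 = wald_late(outcome, assigned, took_25)
```

Because control students cannot take lessons, the first stage equals the share of assigned students who reach the threshold; a small share also means a weak instrument, consistent with the widening CIs noted above.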
In panel B, we find a similar increasing pattern for the Versant test (columns 1-3), but not for the Benesse test (columns 4-6) or the overall GTEC scores (columns 7-9). Unfortunately, none of the three indicators is a perfect measure of English communication abilities: (i) the Versant test suffers from nonrandom attrition, (ii) the Benesse test focuses primarily on reading skills, and (iii) the GTEC is only cross-sectional. Our tentative conclusion is that the impacts of our intervention on English communication abilities were at most limited.

F. Additional Analyses
We conducted two sets of additional analyses. First, we analyzed the heterogeneous treatment effects by interacting the treatment dummy with the control variables, including procrastination, gender, past exposure to English, family background, and baseline levels of the outcome variable. Panel A of Table 7 reports results for the international posture score; no interaction term is statistically significant, including those not reported (Table 7 reports only the results for the variables that were found to be correlated with some outcome variables in Appendix Table A1). This may be because of the moderate size of the average treatment effects. Panel B reports the results for the Benesse test score. We found that only the interaction with the abroad dummy is positive and marginally significant, suggesting that the program may have widened the gap between strongly performing students with greater degrees of international exposure and those showing no such orientation because the former are more likely to take advantage of learning opportunities to further improve their English communication abilities.
The second set of analyses examines the impact of the Skype program on the students' school performance based on their self-reported information. Although we do not have more objective data based on assessments by their teachers, the treated students reported being more likely to work hard and actively participate in English classes at school (Table 8, columns 1-4). In addition, the treated students may be more likely to work hard in classes other than English classes (columns 5-6). Therefore, the program appears to have had positive impacts on overall self-reported school performance. In addition, the possibility of a crowding-out effect, where the students spend more time studying English while spending less time on other subjects, seems limited.
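The interaction approach used in the first set of analyses can be illustrated for the simplest case of a binary characteristic. In a fully saturated model with a treatment dummy, a binary characteristic, and their product, the interaction coefficient equals the difference between the treatment effects in the two subgroups, so it can be computed from four cell means. The data below are hypothetical, and the sketch omits the additional controls and inference used in Table 7.

```python
from statistics import mean

def cell_mean(y, t, x, t_val, x_val):
    """Mean outcome among students with treatment t_val and characteristic x_val."""
    return mean([yi for yi, ti, xi in zip(y, t, x) if ti == t_val and xi == x_val])

def interaction_effect(y, t, x):
    """Return (effect for x = 0, interaction term) from the four cell means."""
    effect_x1 = cell_mean(y, t, x, 1, 1) - cell_mean(y, t, x, 0, 1)
    effect_x0 = cell_mean(y, t, x, 1, 0) - cell_mean(y, t, x, 0, 0)
    return effect_x0, effect_x1 - effect_x0

# Hypothetical outcomes; x = 1 marks students with experience abroad.
y = [0.5, 0.7, 0.1, 0.1, 0.2, 0.2, 0.1, 0.1]
t = [1, 1, 0, 0, 1, 1, 0, 0]
x = [1, 1, 1, 1, 0, 0, 0, 0]

base_effect, interaction = interaction_effect(y, t, x)
```

A positive interaction term, as in this toy example, means the treatment effect is larger for the subgroup with the characteristic.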

IV. Conclusion
We conducted a unique and rare field experiment in collaboration with a Japanese public high school to provide students with a home-use, ICT-assisted program for English. Through the examination of program usage records and panel data, we analyzed the factors associated with program utilization and estimated the program impacts. In our descriptive and econometric analyses, we found that the program significantly changed the internationally oriented attitudes of the treated students but not their English communication abilities. We could justifiably speculate that the insignificant improvement in their communication abilities was due to the low take-up rate of the targeted program. As we found that students showing a tendency to procrastinate were less likely to start and continue using the program, more research is warranted on how to improve and maintain students' motivation, particularly those with a tendency to procrastinate, and encourage them to use ICT-assisted programs such as the one targeted in this study. In addition, as improved internationally oriented attitudes could have a positive impact on students' English development on a long-term basis, future studies need to evaluate the long-term impacts of such programs.
We also found that although the entrance-exam-oriented regular school curriculum did improve the students' English (oral) communication abilities, it seemed to have negative effects on their international orientation. As we identified the positive causal effects of the online English learning program on the students' attitudes, given that it supplemented the weaknesses of the regular curriculum, future research should consider how to combine regular English lessons and such ICT-based programs in a complementary manner. In addition to interventions designed to encourage home use, using such programs during regular English lessons might also be an option.

Notes: All outcomes are measured using a five-point Likert scale with 1 = not at all, 2 = no, 3 = neutral, 4 = yes, and 5 = definitely yes. Estimated coefficients are reported. ***, **, and * indicate 1%, 5%, and 10% levels of statistical significance, respectively. Numbers in parentheses are t-statistics based on heteroscedasticity-robust standard errors. Wild cluster bootstrap (95% CI) is for the treatment variable. Using the boottest Stata command developed by Roodman et al. (2019), we repeated wild cluster bootstrapping 1,000 times. In so doing, we used the gamma distribution with the shape parameter of 4 and the scale parameter of 0.5 for weights. Source: Authors' calculations.

Notes: Estimated coefficients are reported. ***, **, and * indicate 1%, 5%, and 10% levels of statistical significance, respectively. Numbers in parentheses are t-statistics based on heteroscedasticity-robust standard errors. The base category for the English-since variable is "English since Grade 1 or 2," for the commuting time variable it is "Commuting time 20 minutes or less," and for the teacher dummies it is "Teacher A." Source: Authors' calculations.