Detecting Cross-Cultural Differences Using a Multilingual Topic Model

Understanding cross-cultural differences has important implications for world affairs and many aspects of the life of society. Yet, the majority of text-mining methods to date focus on the analysis of monolingual texts. In contrast, we present a statistical model that simultaneously learns a set of common topics from multilingual, non-parallel data and automatically discovers the differences in perspectives on these topics across linguistic communities. We perform a behavioural evaluation of a subset of the differences identified by our model in English and Spanish to investigate their psychological validity.


Introduction
Recent years have seen a growing interest in text-mining applications aimed at uncovering public opinions and social trends (Fader et al., 2007; Monroe et al., 2008; Gerrish and Blei, 2011; Pennacchiotti and Popescu, 2011). They rest on the assumption that the language we use is indicative of our underlying worldviews. Research in cognitive and sociolinguistics suggests that linguistic variation across communities systematically reflects differences in their cultural and moral models and goes beyond lexicon and grammar (Kövecses, 2004; Lakoff and Wehling, 2012). Cross-cultural differences manifest themselves in text in a multitude of ways, most prominently through the use of explicit opinion vocabulary with respect to a certain topic (e.g. "policies that benefit the poor"), idiomatic and metaphorical language (e.g. "the company is spinning its wheels") and other types of figurative language, such as irony or sarcasm.
The connection between language, culture and reasoning remains one of the central research questions in psychology.
Thibodeau and Boroditsky (2011) investigated how metaphors affect our decision-making. They presented two groups of human subjects with two different texts about crime. In the first text, crime was metaphorically portrayed as a virus and in the second as a beast. The two groups were then asked a set of questions on how to tackle crime in the city. As a result, while the first group tended to opt for preventive measures (e.g. stronger social policies), the second group converged on punishment- or restraint-oriented measures. According to Thibodeau and Boroditsky, their results demonstrate that metaphors have a profound influence on how we conceptualize and act with respect to societal issues. This suggests that in order to gain a full understanding of social trends across populations, one needs to identify subtle but systematic linguistic differences that stem from the groups' cultural backgrounds, expressed both literally and figuratively. Performing such an analysis by hand is labor-intensive and often impractical, particularly in a multilingual setting where expertise in all of the languages of interest may be rare.
With the rise of blogging and social media, NLP techniques have been successfully used for a number of tasks in political science, including automatically estimating the influence of particular politicians in the US Senate (Fader et al., 2007), identifying lexical features that differentiate political rhetoric of opposing parties (Monroe et al., 2008), predicting voting patterns of politicians based on their use of language (Gerrish and Blei, 2011), and predicting political affiliation of Twitter users (Pennacchiotti and Popescu, 2011). Fang et al. (2012) addressed the problem of automatically detecting and visualising the contrasting perspectives on a set of topics attested in multiple distinct corpora. While successful in their tasks, all of these approaches focused on monolingual data and did not reach beyond literal language. In contrast, we present a method that detects fine-grained cross-cultural differences from multilingual data, where such differences abound, expressed both literally and figuratively. Our method brings together opinion mining and cross-lingual topic modelling techniques for this purpose. Previous approaches to cross-lingual topic modelling (Boyd-Graber and Blei, 2009; Jagarlamudi and Daumé III, 2010) addressed the problem of mining common topics from multilingual corpora. We present a model that learns such common topics, while simultaneously identifying lexical features that are indicative of the underlying differences in perspectives on these topics by speakers of English, Spanish and Russian. These differences are mined from multilingual, non-parallel datasets of Twitter and news data. In contrast to previous work, our model does not merely output a list of monolingual lexical features for manual comparison, but also automatically infers multilingual contrasts.
Our system (1) uses word-document co-occurrence data as input, where the words are labeled as topic words or perspective words; (2) finds the highest-likelihood dictionary between topic words in the two languages given the co-occurrence data; (3) finds cross-lingual topics specified by distributions over topic-words and perspective-words; and (4) automatically detects differences in perspective-word distributions in the two languages. We perform a behavioural evaluation of a subset of the differences identified by the model and demonstrate their psychological validity. Our data and dictionaries are available from the first author upon request.

Related work
View detection. Identifying different viewpoints is related to the well-studied area of subjectivity detection, which aims at exposing opinion, evaluation, and speculation in text (Wiebe et al., 2004) and attributing it to specific people (Awadallah et al., 2011; Abu-Jbara et al., 2012). In our work, we are less interested in explicit local forms of subjectivity, instead aiming at detecting more general contrasts across linguistic communities.
Another line of research has focused on inferring author attributes such as gender, age (Garera and Yarowsky, 2009), location (Jones et al., 2007), or political affiliation (Pennacchiotti and Popescu, 2011). Such studies make use of syntactic style, discourse characteristics, as well as lexical choice. The models used for this are typically binary classifiers trained in a fully supervised fashion. In contrast, in our task, we automatically infer the topic distributions and find topic-specific contrasts.
Probabilistic topic models. Probabilistic topic models have proven useful for a variety of semantic tasks, such as selectional-preference induction (Ó Séaghdha, 2010; Ritter et al., 2010), sentiment analysis (Boyd-Graber and Resnik, 2010) and studying the evolution of concepts and ideas (Hall et al., 2008). The goal of a topic model is to characterize observed data in terms of a much smaller set of unobserved, semantically coherent topics. A particularly popular probabilistic topic model is Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Under its assumptions, each document has a unique mix of topics, and each topic is a distribution over terms in the vocabulary. A topic is chosen for every word token according to the topic mix of the document to which it belongs, and then the word's identity is drawn from the corresponding topic's distribution.
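The generative story of LDA described above can be illustrated with a minimal sketch. It assumes symmetric Dirichlet priors with hypothetical hyperparameter names `alpha` and `beta`, and the function names are illustrative only; it shows how synthetic documents would be generated, not how topics are inferred.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(n_docs, doc_len, n_topics, vocab, alpha, beta, seed=0):
    """Generate (topic, word) tokens following the LDA generative process."""
    rng = random.Random(seed)
    # Each topic is a distribution over the vocabulary.
    topics = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for _ in range(n_docs):
        # Each document has its own mix of topics.
        theta = sample_dirichlet([alpha] * n_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = sample_index(theta, rng)      # topic chosen for this token
            w = sample_index(topics[z], rng)  # word drawn from that topic
            doc.append((z, vocab[w]))
        corpus.append(doc)
    return topics, corpus
```

Calling `generate_corpus(3, 5, 2, ["tax", "debt", "vote"], 0.5, 0.5)` would return two topic distributions and three five-token documents.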
Handling multilingual corpora. LDA is designed for monolingual text and thus it lacks the structure necessary to model cross-lingually valid topics. While topic models can be trained individually on two languages and then the acquired topics can be matched, the correspondences between the topics for the two languages will be highly unstable. To address this, Boyd-Graber and Blei (2009) (MUTO) and Jagarlamudi and Daumé III (2010) (JOINTLDA) introduced the notion of cross-lingually valid concepts associated with different terms in different languages, using bilingual dictionaries to model topics across languages. Based on a model by Haghighi et al. (2008), MUTO is capable of learning translations, i.e., matchings between terms in the different languages being compared. The Polylingual Topic Model of Mimno et al. (2009) is another approach to finding topics in multilingual corpora, but it requires tuples composed of comparable documents in each language of the corpus.
Topic models for view detection. LDA also assumes that the distribution of each topic is fixed across all documents in a corpus. Therefore, a topic associated with, e.g., war will have the same distribution over the lexicon regardless of whether the document was taken from a pro-war editorial or an anti-war speech. However, in reality we may expect a single topic to exhibit systematic and predictable variations in its distribution based on authorship.
The cross-collection LDA model by Paul and Girju (2009) addresses this by specifically aiming to expose viewpoint differences across different document collections. Ahmed and Xing (2010) proposed a similar model for detecting ideological differences. Fang et al. (2012)'s Cross-Perspective Topic (CPT) model breaks up the terms in the vocabulary into topic terms and perspective terms with different generative processes, and differentiates between different collections of documents within the corpus. The topic terms are assumed to be generated as in LDA. However, the distribution of perspective terms in a document is taken to be dependent on both the topic mixture of the document as well as the collection from which the document is drawn.
Recent works proposed models for specific types of data. Some use user identities and interactions in threaded discussions, while Gottipati et al. (2013) developed a topic model for Debatepedia, a semi-structured resource in which arguments are explicitly enumerated. However, all of these models perform their analyses on monolingual datasets. Thus, they are useful for comparing different ideologies expressed in the same language, but not for cross-linguistic comparisons.

Method
The goal of our model is to analyse large, non-parallel, multilingual corpora and present cross-lingually valid topics and the associated perspectives, automatically inferring the differences in conceptualization of these topics across cultures. Following Boyd-Graber and Blei (2009) and Jagarlamudi and Daumé III (2010), our distributions of latent topics range over latent, cross-lingual topic concepts that manifest themselves as language-specific topic words. We use bilingual dictionaries, containing words in one language and their translations in another language, to represent the topic concepts. These are represented as a bipartite graph, with each translation entry being an edge and each topic word in the two languages being a vertex. While the topic words are tied together by the translation dictionary, the perspective words can vary freely across languages. Following Fang et al. (2012), we treat nouns as topic words and verbs and adjectives as perspective words.¹ The model assumes that adjective and verb tokens in each document are assigned to topics in proportion to the topic assignments of the topic word tokens. Then, the perspective term for this topic is drawn depending on the topic assignment and the language of the speaker.
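As an illustration of the two data structures described above, the following sketch splits tagged tokens into topic words and perspective words and stores dictionary entries as a bipartite graph. The tag names (`NOUN`, `VERB`, `ADJ`) and function names are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict

# Hypothetical coarse part-of-speech labels.
TOPIC_TAGS = {"NOUN"}
PERSPECTIVE_TAGS = {"VERB", "ADJ"}

def split_tokens(tagged_tokens):
    """Split (word, tag) pairs into topic words and perspective words."""
    topic = [w for w, tag in tagged_tokens if tag in TOPIC_TAGS]
    perspective = [w for w, tag in tagged_tokens if tag in PERSPECTIVE_TAGS]
    return topic, perspective

def build_bipartite(entries):
    """Store translation entries (u, v) as adjacency sets on both sides.

    Each topic word is a vertex and each dictionary entry is an edge
    between a word of language a and a word of language b.
    """
    a_to_b = defaultdict(set)
    b_to_a = defaultdict(set)
    for u, v in entries:
        a_to_b[u].add(v)
        b_to_a[v].add(u)
    return dict(a_to_b), dict(b_to_a)
```

Keeping adjacency sets for both languages makes it cheap to look up all candidate translations of a word from either side of the graph.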

Basic Generative Model
Given the languages $\ell \in \{a, b\}$, our model infers the distributions of multilingual topics and language-specific perspective-words (Fig. 2), as follows:
1. Draw a set $C$ of concepts $(u, v)$ matching topic word $u$ from language $a$ to topic word $v$ from language $b$, where the probability of concept $(u, v)$ is proportional to a prior $\pi_{u,v}$ (e.g. based on information from a translation dictionary).

Model Variants
We have experimented with several variants of our model, in order to account for the translation of polysemous words, adapt the translation model to the corpus used, and handle words for which no translation is found.
a) SINGLE variants of the model match each topic term in a language with at most one topic term in the other language. MULTIPLE variants allow each term to match multiple words in the other language.
b) INFER variants allow higher-likelihood matchings to be inferred from the data. STATIC variants treat the matchings as fixed, which is equivalent to assigning a probability of 0 or 1 to every edge in our bipartite graph C.
c) RELEGATE variants relegate all unmatched words in each language to a single separate background topic distinct from the topics that are learned for the matched topic words. This is akin to forcing the probability for currently unmatched words to 0 in all topics except for one, and forcing the probability of all currently matched words to 0 in this topic. INCLUDE variants do not restrict the assignment of unmatched words; they are assigned to the same set of topics as the matched words.
We test the following six variants: SINGLESTATICRELEGATE, SINGLESTATICINCLUDE, SINGLEINFERRELEGATE, SINGLEINFERINCLUDE, MULTIPLESTATICRELEGATE, and MULTIPLESTATICINCLUDE. We do not test MULTIPLEINFER variants because of the complexity of inferring a multiple matching in a bipartite graph.
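The three binary design choices described above, together with the exclusion of the MULTIPLE and INFER combination, determine the set of tested variants. A small sketch (the function name is hypothetical) enumerates them:

```python
from itertools import product

def model_variants():
    """Enumerate variant names, skipping the untested MULTIPLE+INFER combination."""
    variants = []
    for matching, inference, unmatched in product(
        ("SINGLE", "MULTIPLE"), ("STATIC", "INFER"), ("RELEGATE", "INCLUDE")
    ):
        if matching == "MULTIPLE" and inference == "INFER":
            continue
        variants.append(matching + inference + unmatched)
    return variants
```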

Learning & Inference
For all variants, a collapsed Gibbs sampler can be used to infer topics $\phi^{\ell,o}$ and $\phi^{w}$, per-document topic distributions $\theta$, as well as topic assignments $z$ and $x$. This corresponds to the S-step below. For INFER variants, we follow Boyd-Graber and Blei in using an M-step involving a bipartite graph matching algorithm to infer the matching $m$ that maximizes the posterior likelihood of the matching.
S-Step: Sample topics for words in the corpus using a collapsed Gibbs sampler. For topic-word $w_i = u$ belonging to document $d$, if the word occurs in concept $c_i = (u, v)$, then the topic and entry are sampled jointly from a conditional that is a product of two terms. The sum in the denominator of the first term is over all topics, and the sum in the denominator of the second term is over all words matched to $u$. $N_{dk}$ is the count of topic-words of topic $k$ in document $d$, and $N_{k(u,v)}$ is the count of topic-words either of type $u$ or of type $v$ assigned to topic $k$ in all the corpora.² In RELEGATE variants, $z_i$ for an unmatched $u$ is sampled from a modified conditional, which can be seen as letting $\beta^{w}_{u\cdot} \to \infty$ for unmatched terms.
For perspective-word $o_i = n$, the topic is sampled from a conditional in which the sum in the second term of the denominator is over the perspective-word vocabulary of language $\ell_d$; $N_{dk}$ is the count of topic words in document $d$ with topic $k$; and $N^{\ell_d}_{km}$ is the count of perspective-word $m$ being assigned topic $k$ in language $\ell_d$. Note that in all the counts above, the current word token $i$ is omitted from the count.
Given our sampling assignments, we can then estimate $\theta_d$, $\phi^{\ell,o}$, and $\phi^{w}$ from the resulting counts.
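For illustration only, the sketch below shows the smoothed-count point estimates commonly used with collapsed Gibbs samplers for LDA-style models. It is an assumption that the estimators here take this standard form; the function name and the symmetric prior argument are hypothetical.

```python
def smoothed_rows(counts, prior):
    """Turn each row of counts into a probability vector.

    For per-document topic counts N_dk and prior alpha this gives
    theta[d][k] = (N_dk + alpha) / (sum_k' N_dk' + K * alpha);
    the same form applies to topic-word counts with prior beta.
    """
    estimates = []
    for row in counts:
        total = sum(row) + len(row) * prior
        estimates.append([(c + prior) / total for c in row])
    return estimates
```

For example, a document with topic counts `[3, 1]` and a prior of 1 would be assigned proportions 4/6 and 2/6.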

M-Step (for INFER variants only): Run the Jonker-Volgenant (Jonker and Volgenant, 1987) bipartite matching algorithm to find the optimal matching $C$ given some weights. For topic-term $u$ from language $a$ and topic-term $v$ from language $b$, our weights correspond to the log of the posterior odds that the occurrences of $u$ and $v$ come from a matched topic distribution, as opposed to coming from unmatched distributions. The weights are computed from corpus counts, including $N_u$, the count of topic-term $u$ in the corpus. This expression can also be interpreted as a kind of pointwise mutual information (Haghighi et al., 2008). The Jonker-Volgenant algorithm has time complexity of at most $O(V^3)$, where $V$ is the size of the lexicon (Jonker and Volgenant, 1987).
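To make the matching objective concrete, the following sketch finds the one-to-one matching with maximal total weight by exhaustive search. This is deliberately not the Jonker-Volgenant algorithm: it is a brute-force stand-in that is only feasible for a handful of terms, and the weight matrix is a hypothetical input rather than the log posterior odds used in the paper.

```python
from itertools import permutations

def best_matching(weights):
    """Return the highest-scoring one-to-one matching and its total weight.

    weights[i][j] is the score for matching term i of language a to
    term j of language b; the matrix is assumed to be square.
    """
    n = len(weights)
    best_score, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        score = sum(weights[i][perm[i]] for i in range(n))
        if score > best_score:
            best_score, best_perm = score, perm
    return list(enumerate(best_perm)), best_score
```

Exhaustive search costs on the order of n! evaluations, which is why a polynomial-time assignment solver is needed for lexicons of realistic size.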

Inference of Perspective-Word Contrasts
Having learned our model and inferred how likely perspective-terms are for a topic in a given language, we seek to know whether these perspectives differ significantly in the two languages. More precisely, can we infer whether word $m$ in language $a$ and the equivalent word $n$ in language $b$ have significantly different distributions under a topic $k$? To do this, we make the assumption that the perspective-words in languages $a$ and $b$ are in one-to-one correspondence to each other. Recall that, for a given topic $k$ and language $\ell$, $N^{\ell}_{km}$ is the count for term $m$ and $\phi^{\ell,o}_{k,m}$ is the probability for word $m$ in language $\ell$. Just as we collect the probabilities into word-topic distribution vectors $\phi^{\ell,o}_{k}$, we collect the counts into word-topic count vectors $[N^{\ell}_{k1}, N^{\ell}_{k2}, \ldots]$. Then, since our model assumes a prior over the parameter vectors $\phi^{\ell,o}_{k}$, we can infer the likelihood that the observed word-topic counts $N^{a}_{km}$ and $N^{b}_{kn}$ were drawn from a single word-topic-distribution prior denoted by $\bar{\phi} := \phi^{a,o}_{km} = \phi^{b,o}_{kn}$. Below all our probabilities are conditioned implicitly on this event as well as on $N^{a}_{k}$ and $N^{b}_{k}$ being fixed. Denote the total count of word tokens in topic $k$ from language $\ell$ by $N^{\ell}_{k} = \sum_{m} N^{\ell}_{km}$. Now, we derive the probability that we observe a ratio greater than $\delta$ between the proportion of words in topic $k$ that belong to word type $m$ in language $a$ and to corresponding word type $n$ in language $b$:
$$P\left(\frac{N^{a}_{km}}{N^{a}_{k}} > \delta\,\frac{N^{b}_{kn}}{N^{b}_{k}}\right) + P\left(\frac{N^{b}_{kn}}{N^{b}_{k}} > \delta\,\frac{N^{a}_{km}}{N^{a}_{k}}\right).$$
By symmetry, it suffices to derive an expression for the first term. We note that the inequality in the probability is equivalent to a sum over a range of values of $N^{a}_{km}$ and $N^{b}_{kn}$. By rearranging terms, applying the law of conditional probability to condition on the term $\bar{\phi}$, and exploiting the conditional independence of $N^{a}_{km}$ and $N^{b}_{kn}$ given $\bar{\phi}$, $N^{a}_{k}$, and $N^{b}_{k}$, we can rewrite this first term as an integral over $\bar{\phi}$ under our model (expression (1)). Assume a symmetric Dirichlet distribution for simplicity.
It can then be shown that the marginal distribution of $\bar{\phi}$ is $\bar{\phi} \sim \mathrm{Beta}(\beta^{o}, (V - 1)\beta^{o})$, where $V$ is the total size of the perspective-word vocabulary. Similarly, it can be shown that the marginal distribution of $N^{\ell}_{km}$ given $\phi^{\ell,o}_{k}$ is $N^{\ell}_{km} \sim \mathrm{Binom}(N^{\ell}_{k}, \phi^{\ell,o}_{k,m})$ for $\ell \in \{a, b\}$. Therefore, the integrand in expression (1) is proportional to the beta-binomial distribution with number of trials $N^{a}_{k} + N^{b}_{k}$, successes $x + y$, and parameters $\beta^{o}$ and $(V - 1)\beta^{o}$, but with partition function $\binom{N^{a}_{k}}{y}\binom{N^{b}_{k}}{x}$. Denote the PMF of this distribution by $f(N^{a}_{k} + N^{b}_{k}, x + y, \beta^{o})$. Expression (1) can then be rewritten in terms of $f$, giving expression (2). We cannot observe $N^{a}_{km}$, $N^{b}_{kn}$, $N^{a}_{k}$ and $N^{b}_{k}$ explicitly, but we can estimate them by obtaining posterior samples from our Gibbs sampler. We substitute these estimates into expression (2).
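The probability discussed above can be evaluated exactly for small counts. The sketch below assumes that both counts are binomial draws sharing a single success probability with a Beta(β, (V − 1)β) prior, so that the joint probability of observing x occurrences in language a and y in language b is C(N_a, x) C(N_b, y) B(x + y + β, N_a + N_b − x − y + (V − 1)β) / B(β, (V − 1)β). The function and argument names are hypothetical, and the code is a direct summation rather than the paper's own expression.

```python
from math import exp, lgamma

def log_beta(p, q):
    """Logarithm of the Beta function B(p, q)."""
    return lgamma(p) + lgamma(q) - lgamma(p + q)

def log_comb(n, k):
    """Logarithm of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)

def joint_pmf(x, y, n_a, n_b, beta, vocab_size):
    """P(count in a = x, count in b = y) under a shared Beta-distributed rate."""
    a0, b0 = beta, (vocab_size - 1) * beta
    log_p = (log_comb(n_a, x) + log_comb(n_b, y)
             + log_beta(x + y + a0, n_a + n_b - x - y + b0)
             - log_beta(a0, b0))
    return exp(log_p)

def prob_ratio_exceeds(delta, n_a, n_b, beta, vocab_size):
    """P(x / n_a > delta * y / n_b), summed over all qualifying (x, y)."""
    total = 0.0
    for x in range(n_a + 1):
        for y in range(n_b + 1):
            # Cross-multiplied form of the inequality avoids division.
            if x * n_b > delta * y * n_a:
                total += joint_pmf(x, y, n_a, n_b, beta, vocab_size)
    return total

def prob_contrast(delta, n_a, n_b, beta, vocab_size):
    """Two-sided version: the ratio exceeds delta in either direction."""
    return (prob_ratio_exceeds(delta, n_a, n_b, beta, vocab_size)
            + prob_ratio_exceeds(delta, n_b, n_a, beta, vocab_size))
```

The double loop is quadratic in the topic totals, so for large counts one would restrict the summation range or work with sampled estimates instead.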

Data
Twitter Data. We gathered Twitter data in English, Spanish and Russian during the first two weeks of December 2013 using the Twitter API. Following previous work (Puniyani et al., 2010), we treated each Twitter user account as a document. We then tagged each document for part-of-speech, and divided the word tokens in it into topic-words and perspective-words. We constructed a lexicon of 2,000 topic terms and 1,500 perspective-terms for each language by filtering out any terms that occurred in more than 10% of the documents in that language, and then selecting the remaining terms with the highest frequency. Finally, we kept only documents that contained 4 or more topic words from our lexicon. This left us with 847,560 documents in English (4,742,868 topic-word and 1,907,685 perspective-word tokens); 756,036 documents in Spanish (4,409,888 topic-word and 1,668,803 perspective-word tokens); and 260,981 documents in Russian (1,621,571 topic-word and 981,561 perspective-word tokens).
News Data. We gathered all the articles published online during the year 2013 by the state-run media agencies of the United States (Voice of America or "VOA", English), Russia (RIA Novosti or "RIA", Russian), and Venezuela (Agencia Venezolana de Noticias or "AVN", Spanish). These three news agencies were chosen because they not only provide media in three distinct languages, but they are guided by the political world-views of three distinct governments. We treated each news article as a document, and removed duplicates. Once again, we constructed a lexicon of 2,000 topic terms and 1,500 perspective-terms using the same criteria as for Twitter, and kept only documents that contained 4 or more topic words from our lexicon. This left us with 23,159 articles (10,410,949 tokens) from VOA, 41,116 articles (11,726,637 tokens) from RIA, and 8,541 articles (2,606,796 tokens) from AVN.
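The lexicon construction and document filtering described above can be sketched as follows. The function names are hypothetical, ties in frequency are broken alphabetically for determinism, and counting topic-word tokens (rather than distinct types) towards the threshold of 4 is an assumption.

```python
from collections import Counter

def build_lexicon(docs, size, max_doc_fraction=0.10):
    """Keep the `size` most frequent terms after dropping very common ones.

    docs is a list of token lists; a term is dropped if it occurs in more
    than max_doc_fraction of the documents.
    """
    term_freq = Counter()
    doc_freq = Counter()
    for doc in docs:
        term_freq.update(doc)
        doc_freq.update(set(doc))
    limit = max_doc_fraction * len(docs)
    kept = [t for t in term_freq if doc_freq[t] <= limit]
    kept.sort(key=lambda t: (-term_freq[t], t))
    return set(kept[:size])

def filter_documents(docs, lexicon, min_topic_words=4):
    """Keep documents with at least min_topic_words tokens from the lexicon."""
    return [d for d in docs
            if sum(1 for t in d if t in lexicon) >= min_topic_words]
```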
Dictionaries. To create the translation dictionaries, we extracted translations from the English, Spanish, and Russian editions of Wiktionary, both from the translation sections and the gloss sections if the latter contained single words as glosses. Multiword expressions were universally removed. We added inverse translations for every original translation. From the resulting collection of translations, we then created separate translation dictionaries for each language and part-of-speech tag combination.
In order to give preference to more important translations, we assigned each translation an initial weight of $1 + \frac{1}{r}$, where $r$ was the rank of the translation within the page. Since a translation (or its inverse) can occur on multiple pages, we aggregated these initial weights and then assigned final weights of $1 + \frac{1}{r}$, where $r$ was the rank after aggregation and sorting in descending order of weights.
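A minimal sketch of this two-stage weighting is given below. Two details are assumptions rather than facts from the text: aggregation is taken to be a sum of the initial weights, and the final rank is computed among the translations of the same source word. The function name and the example words are illustrative.

```python
from collections import defaultdict

def final_weights(page_entries):
    """Compute final translation weights from (source, target, rank) triples.

    rank is the 1-based position of the translation within its page.
    """
    aggregated = defaultdict(float)
    for source, target, rank in page_entries:
        aggregated[(source, target)] += 1.0 + 1.0 / rank  # initial weight
    by_source = defaultdict(list)
    for (source, target), weight in aggregated.items():
        by_source[source].append((weight, target))
    final = {}
    for source, candidates in by_source.items():
        # Descending aggregated weight, ties broken alphabetically.
        candidates.sort(key=lambda c: (-c[0], c[1]))
        for new_rank, (_, target) in enumerate(candidates, start=1):
            final[(source, target)] = 1.0 + 1.0 / new_rank
    return final
```

With the made-up entries `("house", "casa", 1)`, `("house", "hogar", 2)` and `("house", "hogar", 1)`, the pair house/hogar accumulates the larger aggregated weight and therefore receives the top final weight of 2.0, while house/casa receives 1.5.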

Experimental Conditions
To evaluate the different variants of our model, we held out 30,000 documents (test set) during training. We plugged in the estimates of $\phi^{w}$ and $C$ acquired during training using the rest of the corpus to produce a likelihood estimate for these held-out documents. All models were initialized with the prior matching determined by the dictionary data. For each number of topics $K$, we set $\alpha$ to $50/K$ and the $\beta$ variables to 0.02, as in Fang et al. (2012). For the MULTIPLE variants, we set $\pi_{i,j} = 1$ if $i$ and $j$ share an entry and 0 otherwise. For INFER variants, only three M-steps were performed to avoid overfitting, at 250, 500, and 750 iterations of Gibbs sampling, following the procedure in Boyd-Graber and Blei (2009).

Comparison of model variants
In order to compare the variants of our model, we computed the perplexity and coherence for each variant on TWITTER and NEWS, for English-Spanish and English-Russian language pairs. Perplexity is a measure of how well a model trained on a training set predicts the co-occurrence of words on an unseen test set $H$. Lower perplexity indicates better model fit. We evaluate the held-out perplexity for topic words $w_i$ and perspective-words $o_i$ separately. For topic words, the perplexity is defined as $\exp\left(-\sum_{w_i \in H} \log p(w_i) / N^{w}\right)$. As for standard LDA, exact inference of $p(w_i)$ is intractable under this model. Therefore we adapted the estimator developed by Murray and Salakhutdinov (2009) to our models.
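Once per-token probabilities are available, the perplexity itself is a one-line computation. The sketch below assumes the held-out probabilities have already been estimated by some other means (estimating them is the hard part and is not shown); the function name is hypothetical.

```python
from math import exp, log

def perplexity(token_probs):
    """exp of the negative mean log-probability of the held-out tokens."""
    if not token_probs:
        raise ValueError("need at least one held-out token")
    return exp(-sum(log(p) for p in token_probs) / len(token_probs))
```

A model that assigns every held-out token a probability of 0.25 has a perplexity of about 4, i.e. it is as uncertain as a uniform choice among four words.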
Coherence is a measure inspired by pointwise mutual information (Newman et al., 2010). Let $D(v)$ be the number of documents with at least one token of type $v$ and let $D(v, w)$ be the number of documents containing at least one token of type $v$ and at least one token of type $w$. Then Mimno et al. (2011) define the coherence of topic $k$ as
$$C(k; V^{(k)}) = \sum_{m=2}^{M} \sum_{l=1}^{m-1} \log \frac{D(v^{(k)}_{m}, v^{(k)}_{l}) + \epsilon}{D(v^{(k)}_{l})},$$
where $V^{(k)} = (v^{(k)}_{1}, \ldots, v^{(k)}_{M})$ is a list of the $M$ most probable words in topic $k$ and $\epsilon$ is a small smoothing constant used to avoid taking the logarithm of zero. Mimno et al. (2011) find that coherence correlates better with human judgments than do likelihood-based measures. Coherence is a topic-specific measure, so for each model variant we trained, we computed the median topic coherence across all the topics learned by the model. We set $\epsilon = 0.1$.
Model performance and analysis. Fig. 2 shows perplexity for the variants as a function of the number of iterations of Gibbs sampling on the English-Spanish NEWS corpus. The figure confirms that 1000 iterations of Gibbs sampling on the NEWS corpus was sufficient for convergence across model variants. We omit figures for English-Russian and for the TWITTER corpus, since the patterns were nearly identical. Figure 3 shows how perplexity varies as a function of the number of topics. We used this information to choose optimal models for the different corpora. The optimal number of topics was K = 175 for the English-Spanish NEWS corpus, K = 200 for the English-Russian NEWS, K = 325 for the English-Spanish TWITTER, and K = 300 for the English-Russian TWITTER. Although the optimal number of topics varied across corpora, the relative performance of the different models was the same. In all of our corpora, the MULTIPLE variants provided better fits than their corresponding SINGLE variants. There are several explanations for this. For one, the MULTIPLE variants are able to exploit the information from multiple translations, unlike the SINGLE variants, which discarded all but one translation per word. For another, the matchings produced by the SINGLEINFER variants can be purely coincidental and the result of overfitting (see some examples below). INCLUDE variants performed markedly better than RELEGATE variants. INFER variants improved model fit compared to STATIC variants, but required more topics to produce optimal fit.
Recall that we performed an M-step in the INFER variants 3 times, at 250, 500, and 750 iterations. As noted in §3.3, the M-step in the INFER variants maximizes the posterior likelihood of the matching. However, Fig. 2 shows that this maximization causes held-out perplexity to increase substantially just after the first matching M-step, around 250 iterations, before decreasing again after about 50 more iterations of Gibbs sampling. We believe that this happens because the M-step is maximizing over expectations that are approximate, since they are estimated using Gibbs sampling. If the sampler has not yet converged, then the M-step's maximization will be unstable. We found support for this explanation when we re-ran the INFER variants using 1000 iterations between M-steps, giving the Markov chain enough time to converge. After this change, perplexity went down immediately after the M-step and kept decreasing monotonically, rather than increasing after the M-step before decreasing. However, this did not result in a significantly lower final perplexity or coherence and thus did not change the relative performance of the models. In addition, Fig. 2 suggests that the second and third M-steps (at 500 and 750 iterations, respectively) had little effect on perplexity. In light of the high computational expense of each inference step, this suggests that in practice a single inference step may be sufficient. Fig. 4 shows that the MULTIPLESTATICINCLUDE variant was also the superior model as measured by median topic coherence. Once again, this general pattern held true for the English-Russian pair and TWITTER corpora. Overall, the results show that MULTIPLESTATICINCLUDE provides superior performance across measures, corpora, topic numbers, and languages. We therefore used this variant in further data analysis and evaluation.
Incidentally, the observed decrease in topic coherence as K increases is expected, because as K increases, lower-likelihood topics tend to be more incoherent (Mimno et al., 2011). Experiments by Stevens et al. (2012) show that this effect is observed for LDA-, NMF-, and SVD-based topic models.
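The coherence measure used in this section can be sketched in a few lines. The code assumes the document-co-occurrence form of Mimno et al. (2011), with the smoothing constant (here `eps`, defaulting to the value 0.1 reported above) added to the joint document count, and it assumes that every top word occurs in at least one document. Function names are hypothetical.

```python
from math import log
from statistics import median

def topic_coherence(top_words, docs, eps=0.1):
    """Sum of log smoothed co-document ratios over ordered pairs of top words.

    top_words is ordered from most to least probable; docs is a list of
    token lists. Each top word must occur in at least one document.
    """
    doc_sets = [set(d) for d in docs]

    def doc_count(*words):
        return sum(1 for s in doc_sets if all(w in s for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            joint = doc_count(top_words[m], top_words[l])
            score += log((joint + eps) / doc_count(top_words[l]))
    return score

def median_coherence(topics_top_words, docs, eps=0.1):
    """Median coherence over all topics of one model."""
    return median(topic_coherence(t, docs, eps) for t in topics_top_words)
```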

Cross-linguistic matchings.
The matchings inferred by the SINGLEINFERINCLUDE variant were of mixed quality. Some of the matchings corrected low-quality translations in the original dictionary. For instance, our prior dictionary matched passage in English to pasaje in Spanish. Though technically correct, the dominant meaning of pasaje is [travel] ticket. The TWITTER model correctly matched passage to ruta instead. Many of the matchings learned by the model did not provide technically correct translations, yet were still revelatory and interesting. For instance, the dictionary translated the Spanish word pito as cigarette in English. However, in informal usage this word refers specifically to cannabis cigarettes, not tobacco cigarettes. The TWITTER

Data analysis and discussion
We have conducted a qualitative analysis of the topics, perspectives and contrasts produced by our models for English-Spanish and English-Russian, TWITTER and NEWS datasets. While the topics were coherent and consistent across languages, sets of perspective words manifested systematic differences revealing interesting cross-cultural contrasts. Fig. 5 and 7 show the top perspective words discovered by the model for the topic of finance and economy in English and Spanish NEWS and TWITTER corpora, respectively. While some of the perspective words are neutral, mostly literal and occur in both English and Spanish (e.g. balance or authorize), many others represent metaphorical vocabulary (e.g. saddle, gut, evaporate in English, or incendiar, sangrar, abatir in Spanish) pointing at distinct models of conceptualization of the topic. When we applied the contrast detection method (described in §3.4) to these perspective words, it highlighted the differences in metaphorical perspectives, rather than the literal ones, as shown in Fig. 6 and 8. English speakers tend to discuss economic and financial processes using motion terms, such as "slow, drive, boost or sluggish", or a related metaphor of horse-riding, e.g. "rein in debt", "saddle with debt", or even "breed money". In contrast, Spanish speakers tend to talk about the economy in terms of size rather than motion, using verbs such as ampliar or disminuir, and other metaphors, such as sangrar (to bleed) and incendiar (to light up). These examples demonstrate coherent conceptualization patterns that differ in the two languages. Interestingly, this difference manifested itself in both NEWS and TWITTER corpora and echoes the findings of a previous corpus-linguistic study of Charteris-Black and Ennis (2001), who manually analysed metaphors used in English and Spanish financial discourse and reported that motion and navigation metaphors that abound in English were rarely observed in Spanish.
For the majority of the topics we analysed, the model revealed interesting cross-cultural differences. For instance, the Spanish corpora exhibited metaphors of battle when talking about poverty (with poverty seen as an enemy), while in the English corpus poverty was discussed more neutrally as a social problem that needs a practical solution. English-Russian NEWS experiments revealed a surprising difference with respect to the topic of protests. They suggested that while US media tend to use stronger metaphorical vocabulary, such as clash, erupt or fire, in Russian protests are discussed more neutrally. Generally, the NEWS corpora contained more abstract topics and richer information about conceptual structure and sentiment in all languages. Many of the topics discovered in TWITTER related to everyday concepts, such as pets or concerts, with fewer topics covering societal issues. Yet, a few TWITTER-specific contrasts could be observed: e.g., the sports topic tends to be discussed using war and battle vocabulary in Russian to a greater extent than in English.
[Figure: topic and perspective words in English (EN) and Spanish (ES)]
Topic EN: budget debt deficit reduction spend balance cut increase limit downtown tax stress addition planet
Topic ES: presupuesto deficit deuda reduccion equilibrio disminucion gasto aumentacion tasa sacerdote
Perspective EN: balance default triple rein accumulate accrue trim incur saddle slash prioritize avert gut burden evaporate borrow pile cap cut tackle
Perspective ES: renegociar mejora etiquetado desplomar recortar endeudar incendiar destinar asignar autorizar aprobado ascender sangrar augurar abatir
Our models tend to identify two general kinds of differences: (1) cross-corpus differences representing world views of particular populations whom the corpora characterize (such differences exist both across and within languages, e.g. the metaphors used in the progressive New York Times would be different from the ones in the more conservative Wall Street Journal); and (2) deeply entrenched cross-linguistic differences, such as the motion versus expansion metaphors for the economy in English and Spanish. Such systematic cross-linguistic contrasts can be associated with contrastive behavioural patterns across the different linguistic communities (Casasanto and Boroditsky, 2008; Fuhrman et al., 2011). In both NEWS and TWITTER data, our model effectively identifies and summarises such contrasts, simplifying the manual analysis of the data by highlighting linguistic trends that are indicative of the underlying conceptual differences. However, the conceptual differences are not straightforward to evaluate based on the surface vocabulary alone. In order to investigate this further, we conducted a behavioural experiment testing a subset of the contrasts discovered by our model.
[Figure: topic and perspective words in English (EN) and Spanish (ES)]
Topic EN: economy growth rate percent bank economist interest reserve market policy
Topic ES: economía crecimiento tasa banco política mercado interés inflación empleo economista
Perspective EN: economic financial grow global expect remain cut boost low slow drive
Perspective ES: económico mundial agregar financiero informal pequeño significar interno bajar

Behavioural evaluation
We assessed the relevance of the contrasts through an experimental study with native English-speaking and native Spanish-speaking human subjects. We focused on a linguistic difference in the metaphors used by English speakers versus Spanish speakers when discussing changes in a nation's economy. While English speakers tend to use metaphors involving both locative motion verbs (e.g. slow) and expansive/contractive motion verbs (e.g. shrink), Spanish speakers preferentially employ expansive/contractive motion verbs (e.g. disminuir) to describe changes in the economy. These differences could reflect linguistic artefacts (such as collocation frequencies) or could reflect entrenched conceptual differences. Our experiment addresses the question of whether such patterns of behaviour arise cross-linguistically in response to non-linguistic stimuli. If the linguistic differences are indicative of entrenched conceptual differences, then we expect to see responses to the non-linguistic stimuli that correspond to the usage differences in the two languages.

Experimental setup
We recruited 60 participants from one English-speaking country (the US) and 60 participants from three Spanish-speaking countries (Chile, Mexico, and Spain) using the CrowdFlower crowdsourcing platform. Participants first read a brief description of the experimental task, which introduced them to a fictional country in which economists are devising a simple but effective graphic for "representing change in [the] economy". They then completed a demographic questionnaire including information about their native language. Results from 9 US and 3 non-US participants were discarded for failure to meet the language requirement.
Participants navigated to a new page to complete the experimental task. Stimuli were presented in a 1200 × 700-pixel frame. The center of the frame contained a sphere with a 64-pixel diameter. For each trial, participants clicked on a button to activate an animation of the sphere which involved (1) a positive displacement (in rightward pixels) of 10% or 20%, or a negative displacement (in leftward pixels) of 10% or 20%;[3] and (2) an expansion (in increased pixel diameter) of 10% or 20%, or a contraction (in decreased pixel diameter) of 10% or 20%.[4] Participants saw each of the resulting conditions 3 times. The displacement and size conditions were drawn from a random permutation of 16 conditions using a Fisher-Yates shuffle (Fisher and Yates, 1963). Crucially, half of the stimuli contained conflicts of information with respect to the size and displacement metaphors for economic change (e.g. the sphere could both grow and move to the left). Overall we expected the Spanish speakers' responses to be more closely associated with changes in diameter due to the presence and salience of the size metaphor, and the English speakers' responses to be influenced by both conditions. We expected these differences to be most prominent in the conflicting trials, which force English speakers (unlike Spanish speakers) to choose between two available metaphors. We focus on these conflicting trials in our analysis and discussion of the results.
[3] The use of leftward/rightward horizontal displacement to represent decreases/increases in magnitude is supported by research in numerical cognition showing that people associate smaller magnitudes with the left side of space and larger magnitudes with the right side (Dehaene, 1992; Fias et al., 1995).
[4] A demonstration of the English experimental interface can be accessed at http://goo.gl/W3YVfC. The Spanish interface is identical, but for a direct translation of the guidelines provided by a native Spanish/fluent English speaker.
Figure 9: "Economy Improved" response rate in conflicting stimulus conditions.
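The stimulus design described above can be illustrated with a minimal Python sketch. It is not the authors' experiment code. It assumes that the 16 conditions are the full crossing of four displacement levels with four size levels, and that the three repetitions are presented as three blocks, each an independent Fisher-Yates permutation of the 16 conditions; the text does not spell out the block structure. All names (`LEVELS`, `build_schedule`, `is_conflicting`) are hypothetical.

```python
import random
from itertools import product

# Percent change; negative means leftward displacement or contraction.
LEVELS = (-20, -10, 10, 20)


def fisher_yates_shuffle(items, rng):
    """Return a uniformly random permutation of items (Fisher-Yates)."""
    shuffled = list(items)
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


def is_conflicting(displacement, size):
    """True when the displacement and size cues point in opposite directions."""
    return (displacement > 0) != (size > 0)


def build_schedule(repetitions=3, seed=None):
    """Assumed design: one fresh permutation of the 16 conditions per block."""
    rng = random.Random(seed)
    conditions = list(product(LEVELS, LEVELS))  # 16 (displacement, size) pairs
    schedule = []
    for _ in range(repetitions):
        schedule.extend(fisher_yates_shuffle(conditions, rng))
    return schedule


schedule = build_schedule(seed=0)
n_conflicting = sum(is_conflicting(d, s) for d, s in schedule)
print(len(schedule), n_conflicting)  # 48 trials, 24 of them conflicting
```

Under these assumptions each participant sees 48 trials, and exactly half of them pair a rightward move with a contraction or a leftward move with an expansion, which matches the statement that half of the stimuli contain conflicting information.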

Results
In trials in which stimuli moving rightward were simultaneously contracting, English speakers responded that the economy improved 66% of the time, whereas Spanish speakers judged the economy to have improved 43% of the time. In trials in which stimuli moving leftward were simultaneously expanding, English speakers judged the economy to have improved 34% of the time, and Spanish speakers responded that the economy improved 55% of the time. The results are illustrated in Figure 9.
These results indicate three effects: (1) English speakers exhibit a pronounced bias for using horizontal displacement rather than expansion/contraction during the decision-making process; (2) Spanish speakers are more biased toward expansion/contraction in formulating a decision; and (3) across the two languages the responses show contrasting patterns. The results support our expectation that English and Spanish speakers rely on different metaphors when reasoning about the economy.
To examine the significance of these effects, we fit a binary logit mixed effects model[5] to the data. The full analysis modeled judgment with native language, displacement, and size as fully crossed fixed effects and participant as a random effect. This analysis confirmed that native language was associated with judgments about economic change. In particular, it indicated that changes in size affected English speakers' judgments and Spanish speakers' judgments differently (p < 0.001), with an increase in size increasing the odds ($e^{\beta} = 2.5$) of a judgment of IMPROVED by Spanish speakers and decreasing the odds ($e^{\beta} = 0.44$) of a judgment of IMPROVED by English speakers. A Type II Wald test revealed the interaction between language and size to be highly statistically significant ($\chi^2(1)$, $p < 0.001$).
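To make the reported odds multipliers easier to read, the following Python sketch converts them into logit coefficients and shows their effect on a response probability. Only the two multipliers (2.5 and 0.44) come from the text. The baseline probability of 0.5 is a hypothetical value chosen for illustration, and the function names are ours; this is not a re-analysis of the experimental data.

```python
import math


def log_odds_coefficient(odds_ratio):
    """Logit-scale coefficient beta corresponding to an odds multiplier e^beta."""
    return math.log(odds_ratio)


def apply_odds_ratio(baseline_prob, odds_ratio):
    """Probability obtained after multiplying the baseline odds by odds_ratio."""
    odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Odds multipliers for an increase in size, as reported in the text.
REPORTED = {"Spanish": 2.5, "English": 0.44}
# Hypothetical baseline probability of an IMPROVED judgment (not from the paper).
BASELINE = 0.5

for language, ratio in REPORTED.items():
    beta = log_odds_coefficient(ratio)
    prob = apply_odds_ratio(BASELINE, ratio)
    print(f"{language}: beta = {beta:.3f}, P(IMPROVED) = {prob:.3f}")
# Spanish: beta = 0.916, P(IMPROVED) = 0.714
# English: beta = -0.821, P(IMPROVED) = 0.306
```

Starting from even odds, a multiplier of 2.5 raises the probability of an IMPROVED judgment to about 0.71, while a multiplier of 0.44 lowers it to about 0.31. The opposite signs of the two coefficients are what the language-by-size interaction captures.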
In summary, the patterns we see in the behavioural data are consistent with the patterns uncovered in the output of our model. While much territory remains to be investigated to delimit the nature of this relationship, our results represent a first step toward establishing an association between information mined from large textual data collections and information observed through behavioural responses on a human scale.

Conclusion
We presented the first model that detects common topics from multilingual, non-parallel data and automatically uncovers differences in perspectives on these topics across linguistic communities. Our data analysis and behavioural evaluation offer evidence of a symbiotic relationship between ecologically sound corpus experiments and scientifically controlled human subject experiments, paving the way for the use of large-scale text mining to inform cognitive linguistics and psychology research.
We believe that our model represents a good foundation for future projects in this area. A promising area for further work is in developing better methods for identifying contrasts in perspective terms. This could perhaps involve modifying the generative process for perspective terms or incorporating syntactic dependency information. It would also be interesting to investigate the effect of dictionary quality and corpus size on the relative performance of STATIC and INFER variants. Finally, we note that the model can be applied to identify contrastive perspectives in monolingual as well as multilingual data, providing a general tool for the analysis of subtle, yet important, cross-population differences.