Dynamic Language Models for Streaming Text

We present a probabilistic language model that captures temporal dynamics and conditions on arbitrary non-linguistic context features. These context features serve as important indicators of language changes that are otherwise difficult to capture using text data by itself. We learn our model in an efficient online fashion that is scalable for large, streaming data. With five streaming datasets from two different genres—economics news articles and social media—we evaluate our model on the task of sequential language modeling. Our model consistently outperforms competing models.


Introduction
Language models are a key component in many NLP applications, such as machine translation and exploratory corpus analysis. Language models are typically assumed to be static; that is, the word-given-context distributions do not change over time. Examples include n-gram models (Jelinek, 1997) and probabilistic topic models like latent Dirichlet allocation (Blei et al., 2003); we use the term "language model" to refer broadly to probabilistic models of text.
Recently, streaming datasets (e.g., social media) have attracted much interest in NLP. Since such data evolve rapidly based on events in the real world, assuming a static language model becomes unrealistic. In general, more data is seen as better, but treating all past data equally runs the risk of distracting a model with irrelevant evidence. On the other hand, cautiously using only the most recent data risks overfitting to short-term trends and missing important time-insensitive effects (Blei and Lafferty, 2006; Wang et al., 2008). Therefore, in this paper, we take steps toward methods for capturing long-range temporal dynamics in language use.
Our model also exploits observable context variables to capture temporal variation that is otherwise difficult to capture using only text. Specifically for the applications we consider, we use stock market data as exogenous evidence on which the language model depends. For example, when an important company's price moves suddenly, the language model should be based not on the very recent history, but should be similar to the language model for a day when a similar change happened, since people are likely to say similar things (either about that company, or about conditions relevant to the change). Non-linguistic contexts such as stock price changes provide useful auxiliary information that might indicate the similarity of language models across different timesteps.
We also turn to a fully online learning framework (Cesa-Bianchi and Lugosi, 2006) to deal with nonstationarity and dynamics in the data that necessitate adaptation of the model to data in real time. In online learning, streaming examples are processed only when they arrive. Online learning also eliminates the need to store large amounts of data in memory. Strictly speaking, online learning is distinct from stochastic learning, which for language models built on massive datasets has been explored by Hoffman et al. (2013) and Wang et al. (2011). Those techniques are still for static modeling. Language modeling for streaming datasets in the context of machine translation was considered by Levenberg and Osborne (2009) and Levenberg et al. (2010). Goyal et al. (2009) introduced a streaming algorithm for large-scale language modeling by approximating n-gram frequency counts. We propose a general online learning algorithm for language modeling that draws inspiration from regret minimization in sequential predictions (Cesa-Bianchi and Lugosi, 2006) and online variational algorithms (Sato, 2001; Honkela and Valpola, 2003).
To our knowledge, our model is the first to bring together temporal dynamics, conditioning on nonlinguistic context, and scalable online learning suitable for streaming data and extensible to include topics and n-gram histories. The main idea of our model is independent of the choice of the base language model (e.g., unigrams, bigrams, topic models, etc.). In this paper, we focus on unigram and bigram language models in order to evaluate the basic idea on well understood models, and to show how it can be extended to higher-order n-grams. We leave extensions to topic models for future work.
We propose a novel task to evaluate our proposed language model. The task is to predict economics-related text at a given time, taking into account the changes in stock prices up to the corresponding day. This can be seen as an inverse of the setup considered by Lavrenko et al. (2000), where news is assumed to influence stock prices. We evaluate our model on economics news in various languages (English, German, and French), as well as Twitter data.

Background
In this section, we first discuss the background for sequential predictions, then describe how to formulate online language modeling as sequential predictions.

Sequential Predictions
Let $w_1, w_2, \ldots, w_T$ be a sequence of response variables, revealed one at a time. The goal is to design a good learner to predict the next response, given previous responses and additional evidence, which we denote by $x_t \in \mathbb{R}^M$ (at time $t$). Throughout this paper, we use the term features for $x$. Specifically, at each round $t$, the learner receives $x_t$ and makes a prediction $\hat{w}_t$ by choosing a parameter vector $\alpha_t \in \mathbb{R}^M$. In this paper, we refer to $\alpha$ as feature coefficients.
There has been an enormous amount of work on online learning for sequential predictions, much of it building on convex optimization. For a sequence of loss functions $\ell_1, \ell_2, \ldots, \ell_T$ (parameterized by $\alpha$), an online learning algorithm is a strategy to minimize the regret with respect to the best fixed $\alpha^*$ in hindsight.¹ Regret guarantees assume a Lipschitz condition on the loss function that can be prohibitive for complex models. See Cesa-Bianchi and Lugosi (2006), Rakhlin (2009), Bubeck (2011), and Shalev-Shwartz (2012) for in-depth discussion and review. There has also been work on online and stochastic learning for Bayesian models (Sato, 2001; Honkela and Valpola, 2003; Hoffman et al., 2013), based on variational inference. The goal is to approximate posterior distributions of latent variables when examples arrive one at a time.
¹ Formally, the regret is defined as $\mathrm{Regret}_T(\alpha^*) = \sum_{t=1}^{T} \ell_t(x_t, \alpha_t, w_t) - \inf_{\alpha^*} \sum_{t=1}^{T} \ell_t(x_t, \alpha^*, w_t)$.
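To make the notion of regret concrete, the following minimal sketch computes it for a toy scalar problem with squared loss. The loss, the finite grid that stands in for the infimum over $\alpha^*$, and all names are illustrative assumptions; they are not part of our model.

```python
def regret(losses, played, candidates):
    """Cumulative loss of the played parameters minus that of the best fixed candidate.

    losses:     one function per round, mapping a parameter to a loss value
    played:     the parameter the learner chose at each round
    candidates: finite grid approximating the set over which the infimum is taken
    """
    learner_loss = sum(loss(a) for loss, a in zip(losses, played))
    best_fixed = min(sum(loss(a) for loss in losses) for a in candidates)
    return learner_loss - best_fixed


# Toy example (assumed): squared loss against the targets 1, 2, 3.
targets = [1.0, 2.0, 3.0]
losses = [lambda a, y=y: (a - y) ** 2 for y in targets]
played = [0.0, 1.0, 2.0]                      # the learner lags one step behind
candidates = [i / 10 for i in range(0, 41)]   # grid over [0, 4]
example_regret = regret(losses, played, candidates)
```

Here the learner accumulates a loss of 3, while the best fixed choice on the grid (2.0) accumulates 2, so the regret is 1.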
In this paper, we will use both kinds of techniques to learn language models for streaming datasets.

Problem Formulation
Consider an online language modeling problem, in the spirit of sequential predictions. The task is to build a language model that accurately predicts the texts generated on day $t$, conditioned on observable features up to day $t$, $x_{1:t}$. Every day, after the model makes a prediction, the actual texts $w_t$ are revealed and we suffer a loss. The loss is defined as the negative log likelihood of the model, $\ell_t = -\log p(w_t \mid \alpha, \beta_{1:t-1}, x_{1:t-1}, n_{1:t-1})$, where $\alpha$ and $\beta_{1:T}$ are the model parameters and $n$ is a background distribution (details are given in §3.2). We can then update the model and proceed to day $t + 1$. Notice the similarity to the sequential prediction described above. Importantly, this is a realistic setup for building evolving language models from large-scale streaming datasets.
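The predict, reveal, suffer-loss, update protocol can be sketched as a simple loop. As a stand-in for our model, this sketch uses an add-one smoothed unigram model over a closed vocabulary; the function name and the toy stream are assumptions made for illustration.

```python
import math
from collections import Counter


def online_unigram_losses(days, vocab):
    """Process a stream of daily token lists and return the loss suffered each day.

    The model is an add-one smoothed unigram model (an illustrative stand-in).
    Every token is assumed to belong to `vocab` (closed vocabulary).
    """
    counts = Counter()
    losses = []
    for tokens in days:
        total = sum(counts.values()) + len(vocab)   # add-one smoothing mass
        # Predict first: negative log likelihood of today's text under the
        # model estimated from previous days only.
        loss = -sum(math.log((counts[w] + 1) / total) for w in tokens)
        losses.append(loss)
        # Update only after the day's text has been revealed.
        counts.update(tokens)
    return losses
```

For example, `online_unigram_losses([["a", "b"], ["a", "a"]], {"a", "b"})` scores the second day using counts from the first day only.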

Notation
We index timesteps by $t \in \{1, \ldots, T\}$ and word types by $v \in \{1, \ldots, V\}$; both are always given as subscripts. We denote vectors in boldface and use $1{:}T$ as a shorthand for $\{1, 2, \ldots, T\}$. We assume words of the form $\{w_t\}_{t=1}^{T}$ for $w_t \in \mathbb{R}^V$, which is the vector of word frequencies at timestep $t$. Non-linguistic context features are $\{x_t\}_{t=1}^{T}$ for $x_t \in \mathbb{R}^M$. The goal is to learn parameters $\alpha$ and $\beta_{1:T}$, which will be described in detail next.

Generative Story
The main idea of our model is illustrated by the following generative story for the unigram language model. (We will discuss the extension to higher-order language models later.) A graphical representation of our proposed model is given in Figure 1.
1. Draw feature coefficients $\alpha \sim \mathcal{N}(0, \lambda I)$.² Here $\alpha$ is a vector in $\mathbb{R}^M$, where $M$ is the dimensionality of the feature vector.
2. For each timestep $t$:
(a) Observe non-linguistic context features $x_t$.
(b) Draw the deviation vector $\beta_t$ from a Gaussian with variance parameter $\phi$ whose mean is a weighted average of earlier deviation vectors $\beta_k$ ($k < t$), with weights determined by $\alpha$, the similarity $f(x_t, x_k)$, and $\delta_k$. Here, $\beta_t$ is a vector in $\mathbb{R}^V$, where $V$ is the size of the word vocabulary, $\phi$ is the variance parameter, and $\delta_k$ is a fixed hyperparameter; we discuss them below.
(c) For each word $w_{t,v}$, draw $w_{t,v} \sim \mathrm{Categorical}\left(\frac{\exp(n_{1:t-1,v} + \beta_{t,v})}{\sum_{j \in V} \exp(n_{1:t-1,j} + \beta_{t,j})}\right)$.
In the last step, $\beta_t$ and $n$ are mapped to the $V$-dimensional simplex, forming a distribution over words. $n_{1:t-1} \in \mathbb{R}^V$ is a background (log) distribution, inspired by a similar idea in Eisenstein et al. (2011). In this paper, we set $n_{1:t-1,v}$ to be the log-frequency of $v$ up to time $t - 1$. We can interpret $\beta$ as a time-dependent deviation from the background log-frequencies that incorporates world-context. This deviation comes in the form of a weighted average of earlier deviation vectors.
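The mapping to the simplex in the last step can be sketched as a softmax over the sum of background log-frequencies and deviations. The function name is ours for illustration, and subtracting the maximum score is a standard numerical-stability choice rather than something the model prescribes.

```python
import math


def word_distribution(background_log_freq, beta):
    """Map background log-frequencies n and deviations beta to a distribution over words.

    Computes exp(n_v + beta_v) / sum_j exp(n_j + beta_j) for every word type v.
    """
    scores = [n + b for n, b in zip(background_log_freq, beta)]
    m = max(scores)                            # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

With all deviations set to zero, the result is the background distribution itself; a positive deviation for a word raises its probability relative to the background.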
The intuition behind the model is that the probability of a word appearing at day $t$ depends on the background log-frequencies, the deviation coefficients of the word at previous timesteps $\beta_{1:t-1}$, and the similarity of current conditions of the world (based on observable features $x$) to previous timesteps through $f(x_t, x_k)$. That is, $f$ is a function that takes $d$-dimensional feature vectors at two timesteps $x_t$ and $x_k$ and returns a similarity vector $f(x_t, x_k) \in \mathbb{R}^M$ (see §6.1.1 for an example of $f$ that we use in our experiments). The similarity is parameterized by $\alpha$, and decays over time with rate $\delta_k$. In this work, we assume a fixed window size $c$ (i.e., we consider the $c$ most recent timesteps), so that $\delta_{1:t-c-1} = 0$ and $\delta_{t-c:t-1} = 1$. This allows up to $c$th-order dependencies.³ Setting $\delta$ this way allows us to bound the number of past vectors $\beta$ that need to be kept in memory. We set $\beta_0$ to 0.
² Feature coefficients $\alpha$ can also be drawn from other distributions, such as $\alpha \sim \mathrm{Laplace}(0, \lambda)$.
³ In online Bayesian learning, it is known that forgetting inaccurate estimates from earlier timesteps is important (Sato, 2001).
Figure 1: Graphical representation of the model. The subscript indices $q$, $r$, $s$ are shorthands for the previous timesteps $t - 3$, $t - 2$, $t - 1$. Only four timesteps are shown here. There are arrows from previous timesteps within the window, where $c$ is the window size as described in §3.2. They are not shown here, for readability.
Although the generative story described above is for unigram language models, extensions can be made to more complex models (e.g., mixture of unigrams, topic models, etc.) and to longer n-gram contexts. In the case of topic models, the model will be related to dynamic topic models (Blei and Lafferty, 2006) augmented by context features, and the learning procedure in §4 can be used to perform online learning of dynamic topic models. However, our model captures longer-range dependencies than dynamic topic models, and can condition on non-linguistic features or metadata. In the case of higher-order n-grams, one simple way is to draw more $\beta$, one for each history. For example, for a bigram model, $\beta$ is in $\mathbb{R}^{V^2}$, rather than $\mathbb{R}^V$ in the unigram model. We consider both unigram and bigram language models in our experiments in §6. However, the main idea presented in this paper is largely independent of the base model.
Prior approaches that condition text on observable features (e.g., author, publication venue, geography, and other document-level metadata) conducted inference in a batch setting, so they are not suitable for streaming data. It is not immediately clear how to generalize them to dynamic settings. Algorithmically, our work comes closest to the online dynamic topic model of Iwata et al. (2010), except that we also incorporate context features.

Learning and Inference
The goal of the learning procedure is to minimize the overall negative log likelihood. However, this quantity is intractable. Instead, we derive an upper bound for this quantity, using Jensen's inequality, and minimize that upper bound. Specifically, we use mean-field variational inference, where the variables in the variational distribution $q$ are completely independent. We use Gaussian distributions as our variational distributions for $\beta$, denoted by $\gamma$ in the bound in Eq. 4. We denote the parameters of the Gaussian variational distribution for $\beta_{t,v}$ (word $v$ at timestep $t$) by $\mu_{t,v}$ (mean) and $\sigma_{t,v}$ (variance). Figure 2 shows the functional form of the variational bound that we seek to minimize, denoted by $\hat{B}$. The two main steps in the optimization of the bound are inferring $\beta_t$ and updating feature coefficients $\alpha$. We next describe each step in detail.

Learning
The goal of the learning procedure is to minimize the upper bound in Figure 2 with respect to α. However, since the data arrives in an online fashion, and speed is very important for processing streaming datasets, the model needs to be updated at every timestep t (in our experiments, daily).
Notice that at timestep $t$, we only have access to $x_{1:t}$ and $w_{1:t}$, and we perform learning at every timestep after the text for the current timestep $w_t$ is revealed. We do not know $x_{t+1:T}$ and $w_{t+1:T}$. Nonetheless, we want to update our model so that it can make a better prediction at $t + 1$. Therefore, we can only minimize the bound until timestep $t$.
Our learning algorithm is a variational Expectation-Maximization algorithm (Wainwright and Jordan, 2008).

E-step
Recall that we use variational inference and the variational parameters for $\beta$ are $\mu$ and $\sigma$. As shown in Figure 2, since the log-sum-exp in the last term of $B$ is problematic, we introduce additional variational parameters $\zeta$ to simplify $B$ and obtain $\hat{B}$ (Eqs. 2-3). The E-step deals with all the local variables $\mu$, $\sigma$, and $\zeta$.
Fixing other variables, taking the derivative of the bound $\hat{B}$ w.r.t. $\zeta_t$, and setting it to zero, we obtain a closed-form update for $\zeta_t$. To minimize with respect to $\mu_t$ and $\sigma_t$, we apply gradient-based methods, which require the derivative of the bound w.r.t. each $\mu_{t,v}$, since there are no closed-form solutions. Although we require iterative methods in the E-step, we find it to be reasonably fast in practice. Specifically, we use the L-BFGS quasi-Newton algorithm (Liu and Nocedal, 1989). We can further improve the bound by updating the variational parameters for timesteps $1{:}t-1$, i.e., $\mu_{1:t-1}$ and $\sigma_{1:t-1}$, as well. However, this will require storing the texts from previous timesteps. Additionally, this will complicate the M-step update described below. Therefore, for each $s < t$, we choose to fix $\mu_s$ and $\sigma_s$ once they are learned at timestep $s$.
Figure 2: The variational bound that we seek to minimize, $\hat{B}$. $H(q)$ is the entropy of the variational distribution $q$. The derivation from line 1 to line 2 is done by replacing the probability distributions $p(\beta_t \mid \beta_k, \alpha, x_t)$ and $p(w_t \mid \beta_t, n_t)$ by their respective functional forms. Notice that in line 3 we compute the expectations under the variational distributions and further bound $B$ by introducing additional variational parameters $\zeta$ using Jensen's inequality on the log-sum-exp in the last term. We denote the new bound $\hat{B}$.
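To illustrate how an auxiliary parameter $\zeta$ can tame an expected log-sum-exp, the sketch below uses the standard first-order bound $\log x \le x/\zeta - 1 + \log \zeta$ together with the Gaussian identity $E[\exp(\beta)] = \exp(\mu + \sigma/2)$, where $\sigma$ is the variance. That the term in Figure 2 takes exactly this form is an assumption of this sketch, and the function names are illustrative.

```python
import math


def expected_exp_sum(n, mu, var):
    """Sum over words j of E[exp(n_j + beta_j)] for beta_j ~ N(mu_j, var_j)."""
    return sum(math.exp(nj + mj + vj / 2.0) for nj, mj, vj in zip(n, mu, var))


def log_sum_exp_bound(n, mu, var, zeta):
    """Upper bound on E[log sum_j exp(n_j + beta_j)] for any zeta > 0.

    Uses log x <= x / zeta - 1 + log zeta (assumed form of the bound).
    """
    return expected_exp_sum(n, mu, var) / zeta - 1.0 + math.log(zeta)


def optimal_zeta(n, mu, var):
    """Closed-form minimizer: setting the derivative w.r.t. zeta to zero."""
    return expected_exp_sum(n, mu, var)
```

Under this assumed form, the bound is tightest when $\zeta$ equals the expected sum itself, which is why a closed-form update is available while $\mu$ and $\sigma$ still need iterative optimization.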

M-step
In the M-step, we update the global parameter $\alpha$, fixing $\mu_{1:t}$. Fixing other parameters, we take the derivative of $\hat{B}$ w.r.t. $\alpha$.⁵
We follow the convex optimization strategy and simply perform a stochastic gradient update: $\alpha_{t+1} = \alpha_t + \eta_t \frac{\partial \hat{B}}{\partial \alpha_t}$ (Zinkevich, 2003). While the variational bound $\hat{B}$ is not convex, given the local variables $\mu_{1:t}$ and $\sigma_{1:t}$, optimizing $\alpha$ at timestep $t$ without knowing the future becomes a convex problem. Since we do not reestimate $\mu_{1:t-1}$ and $\sigma_{1:t-1}$ in the E-step, the choice to perform online gradient descent instead of iteratively performing batch optimization at every timestep is theoretically justified.
⁵ In our implementation, we augment $\alpha$ with a squared L2 regularization term (i.e., we assume that $\alpha$ is drawn from a normal distribution with mean zero and variance $\lambda$) and use the FOBOS algorithm (Duchi and Singer, 2009). The derivative of the regularization term is simple and is not shown here. Of course, other regularizers (e.g., the L1-norm, which we use for other parameters, or the $L_{1/\infty}$-norm) can also be explored.
Notice that our overall learning procedure is still to minimize the variational upper bound $\hat{B}$. All these choices are made to make the model suitable for learning in real time from large streaming datasets. Preliminary experiments showed that performing more than one EM iteration per day does not considerably improve performance, so in our experiments we perform one EM iteration per day.
To learn the parameters of the model, we rely on approximations and optimize an upper bound $\hat{B}$. We have opted for this approach over alternatives (such as MCMC methods) because of our interest in the online, large-data setting. Our experiments show that we are still able to learn reasonable parameter estimates by optimizing $\hat{B}$. Like online variational methods for other latent-variable models such as LDA (Sato, 2001; Hoffman et al., 2013), open questions remain about the tightness of such approximations and the identifiability of model parameters. We note, however, that our model does not include latent mixtures of topics and may be generally easier to estimate.

Prediction
As described in §2.2, our model is evaluated by the loss suffered at every timestep, where the loss is defined as the negative log likelihood of the model on the text $w_t$ at timestep $t$. Therefore, at each timestep $t$, we need to predict (the distribution of) $w_t$. In order to do this, for each word $v \in V$, we simply compute the deviation means $\beta_{t,v}$ as weighted combinations of previous means, where the weights are determined by the world-context similarity encoded in $x$. Recall that the word distribution that we use for prediction is obtained by applying the operator $\pi$ that maps $\beta_t$ and $n$ to the $V$-dimensional simplex, forming a distribution over words: $\pi(\beta_t, n_{1:t-1})_v = \frac{\exp(n_{1:t-1,v} + \beta_{t,v})}{\sum_{j \in V} \exp(n_{1:t-1,j} + \beta_{t,j})}$, where $n_{1:t-1} \in \mathbb{R}^V$ is a background distribution ($n_{1:t-1,v}$ is the log-frequency of word $v$ observed up to time $t - 1$).
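The weighted combination of previous means can be sketched as follows. The exact form of the weights is not reproduced in this section, so the sketch assumes they are proportional to $\delta_k \exp(\alpha^\top f(x_t, x_k))$ and normalized over previous timesteps; this softmax form, like the function names, is an assumption made for illustration.

```python
import math


def mixing_weights(alpha, sims, deltas):
    """Normalized weights over previous timesteps k.

    alpha:  feature coefficients (length M)
    sims:   one similarity vector f(x_t, x_k) per previous timestep k
    deltas: window indicators delta_k (1 inside the window, 0 outside);
            at least one entry must be nonzero
    Assumed form: weight_k proportional to delta_k * exp(alpha . f(x_t, x_k)).
    """
    scores = [sum(a * s for a, s in zip(alpha, f_k)) for f_k in sims]
    m = max(scores)                                   # for numerical stability
    unnorm = [d * math.exp(sc - m) for d, sc in zip(deltas, scores)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def predict_deviation(prev_means, weights):
    """Deviation means for day t: weighted combination of previous means, per word."""
    vocab_size = len(prev_means[0])
    return [sum(w * mu[v] for w, mu in zip(weights, prev_means))
            for v in range(vocab_size)]
```

The predicted deviation vector would then be combined with the background log-frequencies through the operator $\pi$ to obtain the word distribution for day $t$.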

Experiments
In our experiments, we consider the problem of predicting economy-related text appearing in news and microblogs, based on observable features that reflect current economic conditions in the world at a given time. In the following, we describe our dataset in detail, then show experimental results on text prediction. In all experiments, we set the window size $c = 7$ (one week) or $c = 14$ (two weeks), $\lambda = \frac{1}{2|V|}$ ($V$ is the size of vocabulary of the dataset under consideration), and $\phi = 1$.

Dataset
Our data contains metadata and text corpora. The metadata is used as our features, whereas the text corpora are used for learning language models and predictions. The dataset (excluding Twitter) can be downloaded at http://www.ark.cs.cmu.edu/DynamicLM.

Metadata
We use end-of-day stock prices gathered from finance.yahoo.com for each stock included in the Standard & Poor's 500 index (S&P 500). The index includes large (by market value) companies listed on US stock exchanges. We calculate daily (continuously compounded) returns for each stock $o$: $r_{o,t} = \log P_{o,t} - \log P_{o,t-1}$, where $P_{o,t}$ is the closing stock price. We make a simplifying assumption that text for day $t$ is generated after $P_{o,t}$ is observed. In general, stocks trade Monday to Friday (except for federal holidays and natural disasters). For days when stocks do not trade, we set $r_{o,t} = 0$ for all stocks since no price change is observed.
We transform returns into similarity values as follows: $f(x_{o,t}, x_{o,k}) = 1$ iff $\mathrm{sign}(r_{o,t}) = \mathrm{sign}(r_{o,k})$ and 0 otherwise. While this limits the model by ignoring the magnitude of price changes, it is still reasonable to capture the similarity between two days. There are 500 stocks in the S&P 500, so $x_t \in \mathbb{R}^{500}$ and $f(x_t, x_k) \in \mathbb{R}^{500}$.
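The return computation for a single stock can be sketched as below. Representing a non-trading day as `None`, measuring the first return after a gap against the last observed close, and assigning a zero return to the first day of the series are all assumptions of this sketch.

```python
import math


def daily_log_returns(prices):
    """Continuously compounded daily returns for one stock.

    prices: closing prices in calendar order; None marks a day without trading.
    Days without an observed price change get a return of 0.
    """
    returns = []
    last_price = None
    for p in prices:
        if p is None or last_price is None:
            returns.append(0.0)                       # no observed price change
        else:
            returns.append(math.log(p) - math.log(last_price))
        if p is not None:
            last_price = p                            # carry the last close forward
    return returns
```

For instance, `daily_log_returns([100.0, None, 110.0])` assigns zero to the first two days and a positive return to the third.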
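The sign-agreement similarity can be sketched in a few lines; the helper names are illustrative. Note that under this definition two days on which a stock has zero return (for example, two non-trading days) also count as similar.

```python
def sign(r):
    """Return -1, 0, or 1 according to the sign of r."""
    return (r > 0) - (r < 0)


def similarity(returns_t, returns_k):
    """Per-stock similarity between two days: 1 iff the returns have the same sign."""
    return [1 if sign(a) == sign(b) else 0 for a, b in zip(returns_t, returns_k)]
```

With 500 stocks, both arguments would have length 500 and the result is the 500-dimensional vector $f(x_t, x_k)$.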

Text data
We have five streams of text data. The first four corpora are news streams tracked through Reuters. Two of them are written in English, North American Business Report (EN:NA) and Japanese Investment News (EN:JP). The remaining two are German Economic News Service (DE, in German) and French Economic News Service (FR, in French). For all four of the Reuters streams, we collected news data over a period of thirteen months (392 days), 2012-05-26 to 2013-06-21. See Table 1 for descriptive statistics of these datasets. Numerical terms are mapped to a single word, and all letters are downcased.
The last text stream comes from the Decahose/Gardenhose stream from Twitter. We collected public tweets that contain ticker symbols (i.e., symbols that are used to denote stocks of a particular company in a stock market), preceded by the dollar sign $ (e.g., $GOOG, $MSFT, $AAPL, etc.). These tags are generally used to indicate tweets about the stock market. We look at tweets from the period 2011-01-01 to 2012-09-30 (639 days). As a result, we have approximately 100-800 tweets per day. We tokenized the tweets using the CMU ARK TweetNLP tools;¹² numerical terms are mapped to a single word, and all letters are downcased.
We perform two experiments using unigram and bigram language models as the base models. For each dataset, we consider the top 10,000 unigrams after removing corpus-specific stopwords (the top 100 words with highest frequencies). For the bigram experiments, we only use 5,000 words to limit the number of unique bigrams so that we can simulate experiments for the entire time horizon in a reasonable amount of time. In standard open-vocabulary language modeling experiments, the treatment of unknown words deserves care. We have opted for a controlled, closed-vocabulary experiment, since standard smoothing techniques will almost surely interact with temporal dynamics and context in interesting ways that are out of scope in the present work.

Baselines
Since this is a forecasting task, at each timestep, we only have access to data from previous timesteps. Our model assumes that all words in all documents in a corpus come from a single multinomial distribution. Therefore, we compare our approach to the corresponding base models (standard unigram and bigram language models) over the same vocabulary (for each stream). The first one maintains counts of every word and updates the counts at each timestep. This corresponds to a base model that uses all of the available data up to the current timestep ("base all"). The second one replaces counts of every word with the counts from the previous timestep ("base one"). Additionally, we also compare with a base model whose counts decay exponentially ("base exp"). That is, the counts from previous timesteps decay by $\exp(-\gamma s)$, where $s$ is the distance between previous timesteps and the current timestep and $\gamma$ is the decay constant. We set the decay constant $\gamma = 1$. We put a symmetric Dirichlet prior on the counts ("add-one" smoothing); this is analogous to our treatment of the background frequencies $n$ in our model. Note that our model, similar to "base all," uses all available data up to timestep $t - 1$ when making predictions for timestep $t$. The window size $c$ only determines which previous timesteps' models can be chosen for making a prediction today. The past models themselves are estimated from all available data up to their respective timesteps.
¹² https://www.ark.cs.cmu.edu/TweetNLP
We also compare with two strong baselines: a linear interpolation of "base one" models for the past week ("int. week") and a linear interpolation of "base all" and "base one" ("int. one all"). The interpolation weights are learned online using the normalized exponentiated gradient algorithm (Kivinen and Warmuth, 1997), which has been shown to enjoy a stronger regret guarantee compared to standard online gradient descent for learning a convex combination of weights.
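The exponentially decayed counts with add-one smoothing can be sketched as follows. Treating the most recent previous day as being at distance $s = 1$ is an assumption of this sketch, as are the function names.

```python
import math
from collections import Counter


def decayed_counts(daily_counts, gamma=1.0):
    """Combine per-day word counts, weighting a day at distance s by exp(-gamma * s).

    daily_counts: one Counter per previous day, oldest first.
    Assumption: the most recent previous day is at distance s = 1.
    """
    total = Counter()
    num_days = len(daily_counts)
    for i, counts in enumerate(daily_counts):
        s = num_days - i
        for word, c in counts.items():
            total[word] += c * math.exp(-gamma * s)
    return total


def smoothed_prob(word, counts, vocab_size):
    """Add-one smoothed unigram probability from (possibly fractional) counts."""
    return (counts[word] + 1.0) / (sum(counts.values()) + vocab_size)
```

With $\gamma = 1$, counts from two days ago are weighted by $e^{-2}$, so the estimate is dominated by the most recent days while older evidence fades quickly.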
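One step of a normalized exponentiated gradient update for interpolation weights can be sketched as below, using the log loss of the mixture on a single observed token. The learning rate `eta`, the per-token granularity, and the function name are assumptions for illustration; they are not settings reported in this paper.

```python
import math


def eg_update(weights, component_probs, eta=0.1):
    """One normalized exponentiated-gradient step for mixture weights.

    weights:         current interpolation weights (nonnegative, summing to 1)
    component_probs: probability each component model assigned to the observed token
    The loss is -log(sum_i w_i * p_i); its gradient w.r.t. w_i is -p_i / mixture.
    """
    mixture = sum(w * p for w, p in zip(weights, component_probs))
    # Multiplicative update with the negative gradient in the exponent,
    # followed by renormalization onto the simplex.
    unnorm = [w * math.exp(eta * p / mixture)
              for w, p in zip(weights, component_probs)]
    z = sum(unnorm)
    return [u / z for u in unnorm]
```

Starting from uniform weights, a component that assigns higher probability to the observed text gains weight after each step, and the weights always remain a convex combination.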

Results
We evaluate the performance of our model by its perplexity on unseen data. Specifically, we use per-word predictive perplexity, where the denominator is the number of tokens up to timestep $T$. Lower perplexity is better. Table 2 and Table 3 show the perplexity results for each of the datasets for unigram and bigram experiments respectively. Our model outperformed other competing models in all cases but one. Recall that we only define the similarity function of world context as: $f(x_{o,t}, x_{o,k}) = 1$ iff $\mathrm{sign}(r_{o,t}) = \mathrm{sign}(r_{o,k})$ and 0 otherwise. A better similarity function (e.g., one that takes into account market size of the company and the magnitude of increase or decrease in the stock price) might be able to improve the performance further. We leave this for future work. Furthermore, the variations can be captured using models from the past week. We discuss why increasing $c$ from 7 to 14 did not improve performance of the model in more detail in §6.4. We can also see how the models performed over time. Figure 4 traces perplexity for four Reuters news stream datasets.¹³ We can see that in some cases the performance of the "base all" model degraded over time, whereas our model is more robust to temporal shifts.
Dataset | base all | base one | base exp | int. week | int. one all | c = 7 | c = 14
EN:NA | 3,341 | 3,677 | 3,486 | 3,403 | 3,271 | 3,262* | 3,285
EN:JP | 2,802 | 3,212 | 2,750 | 2,949 | 2,708 | 2,656* | 2,689
FR | 3,603 | 3,910 | 3,678 | 3,625 | 3,416 | 3,404* | 3,438
DE | 3,789 | 4,199 | 3,979 | 3,926 | 3,634* | 3,649 | 3,687
Twitter | 3,880 | 6,168 | 5,133 | 5,859 | 4,047 | 3,801* | 3,819
Table 2: Perplexity results for our five data streams in the unigram experiments. The base models in "base all," "base one," and "base exp" are unigram language models. "int. week" is a linear interpolation of "base one" from the past week. "int. one all" is a linear interpolation of "base one" and "base all". The rightmost two columns are versions of our model. Best results are marked with an asterisk.
Table 3: Perplexity results for our five data streams in the bigram experiments. The base models in "base all," "base one," and "base exp" are bigram language models. "int. week" is a linear interpolation of "base one" from the past week. "int. one all" is a linear interpolation of "base one" and "base all". The rightmost column is a version of our model with $c = 7$. Best results are highlighted in bold.
¹³ In both experiments, in order to manage the time and space complexities of updating $\beta$, we apply a sparsity shrinkage technique by using OWL-QN (Andrew and Gao, 2007) when maximizing it, with regularization constant set to 1. Intuitively, this is equivalent to encouraging the deviation vector to be sparse (Eisenstein et al., 2011).
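Per-word predictive perplexity, as used in this section, can be sketched as the exponentiated negative average log likelihood per token. The sketch assumes natural logarithms and that each day's log likelihood was computed before that day's text was used for updating; the function name is illustrative.

```python
import math


def per_word_perplexity(daily_log_likelihoods, daily_token_counts):
    """Perplexity over a stream: exp of the negative log likelihood per token.

    daily_log_likelihoods: log p(w_t | model before day t), one value per day
    daily_token_counts:    number of tokens observed on each day
    """
    total_ll = sum(daily_log_likelihoods)
    total_tokens = sum(daily_token_counts)
    return math.exp(-total_ll / total_tokens)
```

As a sanity check, a model that assigns every token a probability of 1/4 has a perplexity of 4 regardless of how the tokens are spread over days.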
In the bigram experiments, we only ran our model with $c = 7$, since we need to maintain $\beta$ in $\mathbb{R}^{V^2}$, instead of $\mathbb{R}^V$ in the unigram model. The goal of this experiment is to determine whether our method still adds benefit to more expressive language models. Note that the weights of the linear interpolation models are also learned in an online fashion since there are no classical training, development, and test sets in our setting. Since the "base one" model performed poorly in this experiment, the performance of the interpolated models also suffered. For example, the "int. one all" model needed time to learn that the "base one" model has to be downweighted (we started with all interpolated models having uniform weights), so it was not able to outperform even the "base all" model.

Analysis and Discussion
It should not be surprising that conditioning on world-context reduces perplexity (Cover and Thomas, 1991). A key attraction of our model, we believe, lies in the ability to inspect its parameters.
Deviation coefficients. Inspecting the model allows us to gain insight into temporal trends. We investigate the deviations learned by our model on the Twitter dataset. Examples are shown in Figure 3. The left plot shows $\beta$ for four words related to Google: goog, #goog, @google, google+. For comparison, we also show the return of Google stock for the corresponding timestep (scaled by 50 and centered at 0.5 for readability, smoothed using loess (Cleveland, 1979), denoted by $r_{\mathrm{GOOG}}$ in the plot). We can see that significant changes of return of Google stocks (e.g., the $r_{\mathrm{GOOG}}$ spikes between timesteps 50-100, 150-200, 490-550 in the plot) occurred alongside an increase in $\beta$ of Google-related words. Similar trends can also be observed for Microsoft-related words in the right plot. The most significant loss of return of Microsoft stocks (the downward spike near timestep 500 in the plot) is followed by a sudden sharp increase in $\beta$ of the words #microsoft and microsoft.
Feature coefficients. We can also inspect the learned feature coefficients $\alpha$ to investigate which stocks have higher associations with the text that is generated. Our feature coefficients are designed to reflect which changes (or lack of changes) in stock prices influence the word distribution more, not which stocks are talked about more often. We find that the feature coefficients do not correlate with obvious company characteristics like market capitalization (firm size). For example, on the Twitter dataset with bigram base models, the five stocks with the highest weights are: ConAgra Foods Inc., Intel Corp., Bristol-Myers Squibb, Frontier Communications Corp., and Amazon.com Inc. Strongly negative weights tended to align with streams with less activity, suggesting that these were being used to smooth across all $c$ days of history. A higher weight for stock $o$ implies an increase in probability of choosing models from previous timesteps $s$, when the state of the world for the current timestep $t$ and timestep $s$ is the same (as represented by our similarity function) with respect to stock $o$ (all other things being equal), and a decrease in probability for a lower weight.
Selected models. Besides feature coefficients, our model captures temporal shift by modeling similarity across the most recent $c$ days. During inference, our model weights different word distributions from the past. The similarity is encoded in the pairwise features $f(x_t, x_k)$ and the parameters $\alpha$. Figure 5 shows the distributions of the strongest-posterior models from previous timesteps, based on how far in the past they are at the time of use, aggregated across rounds on the EN:NA dataset, for window size $c = 14$. It shows that the model tends to favor models from days closer to the current date, with the $t - 1$ models selected the most, perhaps because the state of the world today is more similar to dates closer to today compared to more distant dates. The plot also explains why increasing $c$ from 7 to 14 did not improve performance of the model, since most of the variation in our datasets can be captured with models from the past week.
Topics. Latent topic variables have often figured heavily in approaches to dynamic language modeling. In preliminary experiments incorporating single-membership topic variables (i.e., each document belongs to a single topic, as in a mixture of unigrams), we saw no benefit to perplexity. Incorporating topics also increases computational cost, since we must maintain and estimate one language model per topic, per timestep. It is straightforward to design models that incorporate topics with single- or mixed-membership as in LDA (Blei et al., 2003), an interesting future direction.
Potential applications. Dynamic language models like ours can be potentially useful in many applications, either as a standalone language model, e.g., predictive text input, whose performance may depend on the temporal dimension; or as a component in applications like machine translation or speech recognition. Additionally, the model can be seen as a step towards enhancing text understanding with numerical, contextual data.

Conclusion
We presented a dynamic language model for streaming datasets that allows conditioning on observable real-world context variables, exemplified in our experiments by stock market data. We showed how to perform learning and inference in an online fashion for this model. Our experiments showed the predictive benefit of such conditioning and online learning by comparing to similar models that ignore temporal dimensions and observable variables that influence the text.