Unsupervised Discovery of Biographical Structure from Text

We present a method for discovering abstract event classes in biographies, based on a probabilistic latent-variable model. Taking as input timestamped text, we exploit latent correlations among events to learn a set of event classes (such as Born, Graduates High School, and Becomes Citizen), along with the typical times in a person’s life when those events occur. In a quantitative evaluation at the task of predicting a person’s age for a given event, we find that our generative model outperforms a strong linear regression baseline, along with simpler variants of the model that ablate some features. The abstract event classes that we learn allow us to perform a large-scale analysis of 242,970 Wikipedia biographies. Though it is known that women are greatly underrepresented on Wikipedia—not only as editors (Wikipedia, 2011) but also as subjects of articles (Reagle and Rhue, 2011)—we find that there is a bias in their characterization as well, with biographies of women containing significantly more emphasis on events of marriage and divorce than biographies of men.


Introduction
The written text that we interact with on an everyday basis (news articles, emails, social media, books) contains a vast amount of information centered on people: news (including common NLP corpora such as the New York Times and the Wall Street Journal) details the roles of actors in current events, social media (including Twitter and Facebook) documents the actions and attitudes of friends, and books chronicle the stories of fictional characters and real people alike. This focus on people gives us an abundance of information on how the lives of those portrayed unfold; for corpora that include historically deep biographical information (such as Wikipedia, book-length biographies and autobiographies, and even newspaper obituaries) this data includes the actors involved in particular historical events and the times and places in which they occur. The life events described in these texts have natural structure: event classes exhibit correlations with each other (e.g., those who DIVORCE must have been MARRIED), can occur at roughly similar times in the lives of different individuals (MARRIAGE is more likely to occur earlier in one's life than later), and can be bound to historical moments as well (FIGHTS IN WORLD WAR II peaks in the early 1940s).
Social scientists have long been interested in the structure of these events in investigating the role that individual agency and larger social forces play in shaping the course of an individual's life. Life stages marking "transitions to adulthood" (such as LEAVING SCHOOL, ENTERING THE WORKFORCE and MARRIAGE) have important correlates with demographic variables (Modell et al., 1976; Hogan and Astone, 1986; Shanahan, 2000); and researchers study the interactional effects that life events have on each other, such as the relationship between divorce and pre-marital cohabitation (Lillard et al., 1995; Reinhold, 2010) or having children (Lillard and Waite, 1993).
The data on which these studies draw, however, has largely been restricted to categorical surveys and observational data; we present here a latent-variable model that exploits the correlations of event descriptions in text to learn the structure of abstract events, grounded in time, from text alone. While our model can be estimated on any set of texts where the birth dates of a set of mentioned entities are known, we illustrate our method on a large-scale dataset of 242,970 biographies extracted from Wikipedia. This paper makes two contributions: first, we present a general unsupervised model for learning life event classes from biographical text, along with the structure that binds them; second, in using this method to learn event classes from Wikipedia, we uncover evidence of systematic bias in the presentation of male and female biographies (with biographies of women containing significantly disproportionate emphasis on the personal events of marriage and divorce). In addition to these contributions, we also present a range of other analyses that uncovering life events in text can make possible. Data and code to support this work can be found at http://www.ark.cs.cmu.edu/bio/.

Data
The data for this analysis originates in the January 2, 2014 dump of English-language Wikipedia. We extract biographies by identifying all articles with persondata metadata in which the DATE OF BIRTH field is known. This results in a set of 927,403 biographies.
For each biography, we perform part-of-speech tagging using the Stanford POS tagger (Toutanova et al., 2003) and named entity recognition using the Stanford named entity recognizer (Finkel et al., 2005), cluster all mentions of co-referring proper names (Davis et al., 2003; Elson et al., 2010) and resolve pronominal co-reference (Bamman et al., 2014). Co-reference resolution is aided by gender inference: each entity is assigned the gender corresponding to the maximum number of gendered pronouns (i.e., he and she) mentioned in the article, a heuristic also used by Reagle and Rhue (2011). In a random test set of 500 articles, this method of gender inference is overwhelmingly accurate, achieving 100% precision with 97.6% recall (12 articles had no pronominal mentions and so gender is not assigned).
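The pronoun-counting heuristic can be sketched as follows. This is only an illustration under stated assumptions, not the released implementation: the pronoun lists beyond "he" and "she", the tokenization, and the treatment of ties as unassigned are all hypothetical choices.

```python
import re
from collections import Counter

# Hypothetical pronoun lists; the text names only "he" and "she" explicitly.
MALE = {"he", "him", "his", "himself"}
FEMALE = {"she", "her", "hers", "herself"}

def infer_gender(text):
    """Return 'M', 'F', or None when no gendered pronoun dominates."""
    counts = Counter()
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in MALE:
            counts["M"] += 1
        elif token in FEMALE:
            counts["F"] += 1
    # Assumption: articles with no pronouns (or a tie) are left unassigned.
    if counts["M"] == counts["F"]:
        return None
    return "M" if counts["M"] > counts["F"] else "F"
```

Returning None for articles without pronominal mentions mirrors the 12 unassigned articles reported above, which lower recall without affecting precision.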
As further preprocessing, we identify multiword expressions in all texts as maximal sequences of adjective + noun part-of-speech tags (yielding, for example, New York, United States, early life and high school), as first described in Justeson and Katz (1995). For each biographical article, we then extract all sentences in which the subject of the article is mentioned along with a single date, and retain only the terms in each sentence that are among the 10,000 most frequent unigrams and multiword expressions in all documents, excluding stopwords (such as the) and all numbers (including dates). An "event" is the bag of these unigrams and multiword expressions extracted from one such sentence, along with a corresponding timestamp measured as the difference between the observed date in the sentence and the date of birth of the entity.
Table 1 illustrates the actual form of the data with a sample of extracted sentences from the biography of Frank Lloyd Wright, along with the data as input to the model. In the terminology of the model described below, each sentence constitutes one "event" in the subject's life.
For the final dataset we retain all biographies where the subject of the article was born after the year 1800 and for which there exist at least 5 events (242,970 people). The complete data consists of 2,313,867 events across these 242,970 people.
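To make the definition of an event concrete, the following sketch builds one event from one sentence and applies the final dataset filter. It simplifies the pipeline described above in several ways that are assumptions of this example: dates are reduced to four-digit years, multiword expressions and the check that the subject is mentioned are omitted, and the stopword list and function names are hypothetical.

```python
import re
from dataclasses import dataclass

# Toy stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "was", "he", "she"}

@dataclass
class Event:
    terms: list  # bag of vocabulary terms (repeats kept)
    age: int     # year of the event minus the subject's birth year

def extract_event(sentence, birth_year, vocab):
    """Return an Event if the sentence contains exactly one year, else None."""
    years = re.findall(r"\b(1[0-9]{3}|20[0-9]{2})\b", sentence)
    if len(years) != 1:
        return None
    # Alphabetic tokens only, so numbers (including the date) are dropped.
    tokens = re.findall(r"[a-z]+", sentence.lower())
    terms = [t for t in tokens if t in vocab and t not in STOPWORDS]
    return Event(terms=terms, age=int(years[0]) - birth_year)

def keep_biography(birth_year, events, min_events=5):
    """Dataset filter: born after 1800 and at least five events."""
    return birth_year > 1800 and len(events) >= min_events
```

For a subject born in 1867, a hypothetical sentence such as "In 1887 Wright arrived in Chicago." would yield the terms wright, arrived, chicago with a timestamp of 20, provided those terms are in the vocabulary.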

Model
The quantities of interest that we want to learn from the data are: 1.) a broad set of major life events recorded in Wikipedia biographies that people experience at similar stages in their lives (such as BEING BORN, GRADUATING HIGH SCHOOL, SERVING IN THE ARMY, GETTING MARRIED, and so on); 2.) correlations among those life events (e.g., knowing that an individual who WINS A NOBEL PRIZE is more likely to RECEIVE AN HONORARY DOCTORATE); and 3.) an attribution of those classes of events to particular moments in a specific individual's life (e.g., John Nash RECEIVED AN HONORARY DOCTORATE in 1999).
We cast this problem as an unsupervised learning one: given no labeled instances, can we infer these quantities from text alone? One possible alternative approach would be to leverage the categorical [...] (Matthew and Harrison, 2004).
Figure 1a illustrates the graphical form of our hierarchical Bayesian model, which articulates the relationship between an entity's set of events (where each event is an observation defined as the bag of terms in text and the difference between the year it was recorded as happening and the birth year), an abstract set of event classes, correlations among those abstract classes, and the distribution of vocabulary terms that defines each one. To capture correlations among different classes, we place a logistic normal prior on each biography's distribution over event classes (Blei and Lafferty, 2006a; Blei and Lafferty, 2007; Mimno et al., 2008); unlike a Dirichlet, a logistic normal is able to capture arbitrary correlations between elements through the structure of the covariance matrix of its underlying multivariate normal. We take a Bayesian approach to estimating the mean $\mu_\eta$ and covariance $\Sigma_\eta$, drawing them from a conjugate Normal-Inverse Wishart prior.
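The generative story for the model runs as follows: let $K$ be the number of latent event classes, $P$ be the number of biographies, and $E_p$ be the number of events in biography $p$.
• Draw event class means and covariances $\mu_\eta \in \mathbb{R}^{K}$, $\Sigma_\eta \in \mathbb{R}^{K \times K}$ $\sim$ Normal-Inverse Wishart($\mu_0, \lambda, \Psi, \nu$)
• For each event class $k \in \{1, \ldots, K\}$:
  - Draw event-term distribution $\phi_k \sim \mathrm{Dir}(\gamma)$
• For each biography $p$:
  - Draw $\eta_p \sim \mathcal{N}(\mu_\eta, \Sigma_\eta)$
  - For each event $e$ in biography $p$:
    - Draw event class indicator $z \sim \mathrm{Mult}(\mathrm{softmax}(\eta_p))$
    - Draw timestamp $t \sim \mathcal{N}(\mu_z, \sigma^2_z)$
    - For each token in event $e$, draw term $w \sim \mathrm{Mult}(\phi_z)$

Figure 1: Graphical form of the full model (described in §3) and models with ablations (described in §4).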
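As an illustration of the generative story, the sketch below samples a single synthetic biography with NumPy. It is a toy under stated assumptions: $\mu_\eta$ and $\Sigma_\eta$ are passed in as fixed values rather than drawn from the Normal-Inverse Wishart prior, every event has the same number of tokens, and all function and argument names are hypothetical.

```python
import numpy as np

def softmax(x):
    """Map real-valued weights to a probability vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample_biography(rng, mu_eta, sigma_eta, phi, mu_t, sigma2_t,
                     n_events, n_tokens):
    """Sample one biography as a list of (z, t, term_ids) events.

    phi: (K, V) event-term distributions; mu_t, sigma2_t: (K,) per-class
    mean and variance of the timestamp.
    """
    # Logistic normal: Gaussian draw followed by a softmax.
    eta = rng.multivariate_normal(mu_eta, sigma_eta)
    theta = softmax(eta)
    events = []
    for _ in range(n_events):
        z = int(rng.choice(len(theta), p=theta))            # event class
        t = float(rng.normal(mu_t[z], np.sqrt(sigma2_t[z])))  # age at event
        terms = rng.choice(phi.shape[1], size=n_tokens, p=phi[z])
        events.append((z, t, terms))
    return events
```

The event-term distributions can be drawn beforehand with, for example, `rng.dirichlet(np.full(V, gamma), size=K)`. The off-diagonal entries of `sigma_eta` are what allow two classes to co-occur within a biography more often than a Dirichlet prior would permit.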
Inference proceeds via stochastic EM: after initializing all variables to random values, we alternate between collapsed Gibbs sampling for the latent class indicators and maximization steps over all other parameters:
1. Sample all $z$ using collapsed Gibbs sampling conditioned on current values for $\eta$ and all other $z$.
2. For each biography $p$, maximize likelihood with respect to $\eta_p$ via gradient ascent given the current samples of $z$ and priors $\mu_\eta$ and $\Sigma_\eta$.
3. Assign MAP estimates of $\mu_\eta$ and $\Sigma_\eta$ given current values of $\eta$ and the Normal-Inverse Wishart prior. Update $\mu$ and $\sigma^2$ according to their maximum likelihood estimates given $z$.
We describe the technical details of each step below.
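Sampling $z$. Given fixed biography-event class proportions $\eta$, observed tokens $w$, timestamp $t$, and current samples $z^{-}$ for all other events, the probability of a given event belonging to event class $k$ is as follows:
$$P(z = k \mid z^{-}, \eta, w, t, \gamma, \mu, \sigma^2) \propto \frac{\exp(\eta_k)}{\sum_{k'=1}^{K} \exp(\eta_{k'})} \times \mathcal{N}(t \mid \mu_k, \sigma^2_k) \times \frac{\prod_{v} \prod_{n=1}^{e(v)} \left(\gamma + c^{-}(k, v) + (n - 1)\right)}{\prod_{n=1}^{N_e} \left(V\gamma + c^{-}(k, \cdot) + (n - 1)\right)}$$
Here $V$ is the size of the vocabulary, $c^{-}(k, v)$ is the count of the number of times vocabulary term $v$ shows up in all events whose current sample $z = k$ (excepting the current one being sampled), $c^{-}(k, \cdot)$ is the total count of all terms in all events whose current $z = k$ (again excepting the current one), $N_e$ is the number of terms in event $e$, and $e(v)$ is the count of vocabulary term $v$ in the current event. (Note the complexity of the last term is due to drawing multiple observations from a single collapsed multinomial; Carpenter, 2010.)
Maximizing $\eta$. Under our model, the terms in the likelihood function that involve $\eta$ include the likelihood of the samples drawn from it and its own probability given the multivariate Normal prior:
$$L(\eta_p) = \prod_{e=1}^{E_p} \frac{\exp(\eta_{p, z_e})}{\sum_{k=1}^{K} \exp(\eta_{p,k})} \times \mathcal{N}(\eta_p \mid \mu_\eta, \Sigma_\eta)$$
Up to an additive constant, the log likelihood is:
$$\ell(\eta_p) = \sum_{e=1}^{E_p} \eta_{p, z_e} - E_p \log \sum_{k=1}^{K} \exp(\eta_{p,k}) - \frac{1}{2} (\eta_p - \mu_\eta)^{\top} \Sigma_\eta^{-1} (\eta_p - \mu_\eta)$$
Given samples of the latent event class $z$ for all events in biography $p$, we maximize the value of $\eta_p$ using gradient ascent. We can think of this as maximizing the likelihood of the observations $z$ subject to a regularizer given by the multivariate normal prior, which pulls $\eta_p$ toward $\mu_\eta$. The covariance matrix in the regularizer encourages correlations in $\eta$: if a document contains many examples of $z = k$ and class $k$ is highly correlated with class $j$, then the optimal $\eta$ is encouraged to contain high weights at both $\eta_k$ and $\eta_j$ rather than simply $\eta_k$ alone.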
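The two steps just described can be sketched in NumPy. This is a minimal illustration under stated assumptions rather than the released code: $\gamma$ is taken to be a symmetric scalar, the fixed step size and iteration count of the gradient ascent are arbitrary, and all function names are hypothetical.

```python
import numpy as np
from collections import Counter

def class_log_probs(eta, t, event_terms, mu, sigma2, c_kv, c_k, gamma):
    """Unnormalized log P(z = k | ...) for one event, for every class k.

    c_kv: (K, V) term counts per class, excluding the current event;
    c_k: (K,) row sums of c_kv; event_terms: list of term ids.
    """
    V = c_kv.shape[1]
    logp = eta - np.logaddexp.reduce(eta)                 # softmax prior
    logp = logp - 0.5 * np.log(2 * np.pi * sigma2) \
                - (t - mu) ** 2 / (2 * sigma2)            # Gaussian on time
    for v, n_v in Counter(event_terms).items():          # collapsed multinomial
        for n in range(n_v):
            logp = logp + np.log(gamma + c_kv[:, v] + n)
    for n in range(len(event_terms)):
        logp = logp - np.log(V * gamma + c_k + n)
    return logp

def eta_objective(eta, z_counts, mu_eta, sigma_eta_inv):
    """Log likelihood of eta for one biography, up to a constant."""
    diff = eta - mu_eta
    return (z_counts @ eta - z_counts.sum() * np.logaddexp.reduce(eta)
            - 0.5 * diff @ sigma_eta_inv @ diff)

def eta_gradient(eta, z_counts, mu_eta, sigma_eta_inv):
    """Gradient of eta_objective; z_counts[k] = events assigned to class k."""
    theta = np.exp(eta - np.logaddexp.reduce(eta))
    return z_counts - z_counts.sum() * theta - sigma_eta_inv @ (eta - mu_eta)

def maximize_eta(eta, z_counts, mu_eta, sigma_eta_inv, step=0.01, n_steps=200):
    """Plain gradient ascent; step size and iteration count are assumptions."""
    for _ in range(n_steps):
        eta = eta + step * eta_gradient(eta, z_counts, mu_eta, sigma_eta_inv)
    return eta
```

A new class for the event can then be drawn by exponentiating and normalizing the output of `class_log_probs`. In `eta_gradient`, the term `sigma_eta_inv @ (eta - mu_eta)` is where the learned covariance couples the weights of correlated classes.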
Maximizing $\mu_\eta$, $\Sigma_\eta$, $\mu$, $\sigma^2$. Given values for $\eta$, we then find maximum a posteriori estimates of $\mu_\eta$ and $\Sigma_\eta$ conditioned on the Normal-Inverse Wishart (NIW) prior. The NIW is a conjugate prior to a multivariate Gaussian, parameterized by dimensionality $K$, initial mean $\mu_0$, positive-definite scale matrix $\Psi$, and scalars $\nu > K - 1$ and $\lambda > 0$. The prior parameters $\Psi$ and $\nu$ have an intuitive interpretation as the scatter matrix $\sum_{i=1}^{\nu} (x_i - \bar{x})(x_i - \bar{x})^{\top}$ of $\nu$ pseudo-observations. The expected value of the covariance matrix drawn from a NIW distribution parameterized by $\Psi$ and $\nu$ is $\Psi / (\nu - K - 1)$. To disprefer correlations among topics in the absence of strong evidence, we fix $\mu_0 = 0$ and set $\Psi$ so that this prior expectation over $\Sigma_\eta$ is the product of a scalar value $\rho$ and the identity matrix $I$: $\Psi = (\nu - K - 1)\rho I$; $\rho$ defines the expected variance, and the higher the value of $\nu$, the more strongly the prior dominates the posterior estimate of the covariance matrix (i.e., the more the covariance matrix is shrunk toward $\rho I$). $\lambda$ likewise has an intuitive understanding as a dampening parameter: the higher its value, the more the posterior estimate of the mean $\mu_\eta$ shrinks toward 0. For $n$ data points, we set $\lambda = n/10$, $\nu = K + 2$, and $\rho = 1$.
Since the NIW is conjugate with the multivariate normal, posterior updates to $\mu_\eta$ and $\Sigma_\eta$ have closed-form expressions given values of $\eta$ (here, $\bar{\eta}$ denotes the mean value of $\eta$ over all biographies).
Since we have no meaningful prior information on the values of $\mu$ and $\sigma^2$, we calculate their maximum likelihood estimates given current samples $z$.
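One way these closed-form updates can be written is sketched below, using the standard conjugate Normal-Inverse Wishart posterior. Two points are assumptions of this example rather than statements from the text: the MAP estimate is taken to be the joint mode of the posterior, which divides the posterior scale matrix by $\nu + n + K + 2$, and every event class is assumed to have at least one assigned event when estimating $\mu$ and $\sigma^2$. Function names are hypothetical.

```python
import numpy as np

def niw_map(eta, mu0, lam, psi, nu):
    """Joint-mode MAP estimates of (mu_eta, sigma_eta) under an NIW prior.

    eta: (n, K) matrix holding one row of weights per biography.
    """
    n, K = eta.shape
    eta_bar = eta.mean(axis=0)
    centered = eta - eta_bar
    scatter = centered.T @ centered
    diff = (eta_bar - mu0)[:, None]
    # Standard conjugate posterior parameters.
    mu_n = (lam * mu0 + n * eta_bar) / (lam + n)
    psi_n = psi + scatter + (lam * n / (lam + n)) * (diff @ diff.T)
    nu_n = nu + n
    # Assumption: joint mode of the NIW posterior over (mean, covariance).
    return mu_n, psi_n / (nu_n + K + 2)

def time_mle(t, z, K):
    """Per-class maximum likelihood mean and variance of event ages.

    Assumes each class has at least one assigned event.
    """
    mu = np.array([t[z == k].mean() for k in range(K)])
    sigma2 = np.array([t[z == k].var() for k in range(K)])
    return mu, sigma2
```

With the settings given above ($\mu_0 = 0$, $\rho = 1$, $\nu = K + 2$), the scale matrix passed as `psi` is simply the identity, and `lam` is `n / 10`.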

Evaluation
While the goal of this work is to learn qualitative categories of life events from text, we can quantitatively evaluate the performance of our model on the empirical task of predicting the age in a person's life when an event occurs.
For this task, we compare the full model described above with a strong baseline of $\ell_2$-regularized linear regression and also with comparable models with feature ablations, in order to quantify the extent to which various aspects of the full model are contributing to its empirical performance. The comparable ablated models include the following:
• -CORRELATION, Figure 1b. Rather than a logistic normal prior on the entity-specific distribution over event types ($\eta$), we draw $\eta$ from a symmetric Dirichlet distribution parameterized by a global $\alpha$. In a Dirichlet distribution, arbitrary correlations cannot be captured.
• -TIME, Figure 1c. In the full model, the timestamps of the observed events influence the event classes we learn by encouraging them to be internally coherent and time-sensitive. To test this design choice, we ablate time as a feature during inference.
• -CORRELATION,-TIME, Figure 1d. We also test a model that ablates both the correlation structure in the prior and the influence of time; this model corresponds to smoothed, unsupervised naïve Bayes.
As during inference, we define an event to be the set of terms, excluding stopwords and numbers, that are present in the vocabulary of the 10,000 most frequent words and multiword expressions in the data overall. Each event is accompanied by the year of its occurrence, from which we calculate the gold target prediction (the age of the person at the time of the event) as the year minus the entity's year of birth. For all of the four models described above (the full model and three ablations), we train the model on 4/5 of the biographies (194,376 entities, on average 1,851,094 events); we split the remaining 1/5 of the biographies into development data (where $t$ is observed) and test data (where $t$ is predicted). The details of inference for each model are as follows:
1. FULL. Inference as above for a burn-in period of 100 iterations, using slice sampling (Neal, 2003) to optimize the value of the Dirichlet hyperparameter $\gamma$ every 10 iterations; after inference, the parameters $\mu_\eta$, $\Sigma_\eta$, $\mu$, $\sigma^2$ and $\phi$ are estimated from samples drawn at the final iteration and held fixed. For test entities, we infer the MAP value of $\eta$ using development data, and predict the age of each test event as the mean time marginalizing over the event type indicator $z$.
2. -CORRELATION. Here we perform collapsed Gibbs sampling for 100 iterations, using slice sampling to optimize the value of $\alpha$ and $\gamma$ every 10 iterations; after inference, the parameters $\mu$, $\sigma^2$ and $\phi$ are estimated from single final samples and held fixed. For development and test data, we run Gibbs sampling on event indicators $z$ for 10 iterations and predict the age of each test event as the mean time marginalizing over the event type indicator $z$.
3. -TIME. Inference as above for 100 iterations, using slice sampling to optimize the value of $\gamma$ every 10 iterations; after inference, the parameters $\mu_\eta$, $\Sigma_\eta$ and $\phi$ are estimated from single final samples and held fixed. Since time is not known to this model during inference, we create post hoc estimates of $\mu_z$ as the empirical mean age of events sampled to event class $z$ using single samples for each event in the training data from the final sampling iteration. For test entities, we infer the MAP value of $\eta$ using development data, and predict the age of each test event as the average empirical age marginalizing over the event type indicator $z$.
4. -CORRELATION,-TIME. We perform inference as above for the -CORRELATION model, and time prediction as in the -TIME model.
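The prediction step shared by these models, taking the mean time while marginalizing over the event type indicator $z$, can be sketched as follows. The text does not spell out the posterior over $z$ used at test time; this example assumes it combines the biography's softmax prior with the likelihood of the event's terms under the fixed $\phi$, and the function name is hypothetical.

```python
import numpy as np

def predict_age(eta, event_terms, phi, mu):
    """Expected age of an event, marginalizing over its class z.

    eta: (K,) biography weights; event_terms: list of term ids;
    phi: (K, V) event-term distributions; mu: (K,) mean age per class.
    """
    # log P(z = k | terms, eta) up to a constant: prior plus term likelihood.
    log_post = eta + np.log(phi[:, event_terms]).sum(axis=1)
    log_post = log_post - np.logaddexp.reduce(log_post)
    return float(np.exp(log_post) @ mu)
```

The prediction is a posterior-weighted average of the class means, so it always falls between the smallest and largest entries of `mu`. For the -TIME variants, `mu` would hold the post hoc empirical mean ages described above.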
To compare against a potentially more powerful discriminative model, we also evaluate linear regression with $\ell_2$ (ridge) regularization, using binary indicators of the same unigrams and multiword expressions available to the models above.

LINEAR REGRESSION. Train on training and development data, optimizing the regularization coefficient $\lambda$ in three-fold cross-validation.
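A minimal sketch of such a baseline is given below, using the closed-form ridge solution on a dense binary feature matrix. The candidate grid of $\lambda$ values, the use of mean absolute error as the selection criterion, and the unpenalized intercept are assumptions of this example; a vocabulary of 10,000 terms would in practice call for sparse matrices and an iterative solver.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w

def mean_abs_error(X, y, w, b):
    """Average absolute difference between predicted and true ages."""
    return float(np.mean(np.abs(X @ w + b - y)))

def choose_lambda(X, y, lambdas, n_folds=3, seed=0):
    """Return the candidate lambda with the lowest cross-validated error."""
    order = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(order, n_folds)
    scores = []
    for lam in lambdas:
        errs = []
        for i in range(n_folds):
            held_out = folds[i]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            w, b = fit_ridge(X[train], y[train], lam)
            errs.append(mean_abs_error(X[held_out], y[held_out], w, b))
        scores.append(np.mean(errs))
    return lambdas[int(np.argmin(scores))]
```

Each row of `X` holds the binary term indicators for one event and `y` holds the corresponding ages. Large positive or negative entries of the fitted `w` then identify terms associated with late or early life events.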
During training, linear regression learns that the terms most indicative of events that take place later in life are stamp, descendant, commemorated, died, plaque, grandson, and lifetime achievement award, while those that denote early events are born, baptised, apprenticed, and acting debut.
We evaluate all models on identical splits using 5-fold cross-validation. For an interpretable error score, we use mean absolute error, which corresponds to the number of years, on average, by which each model is incorrect.
Figure 2 presents the results of this evaluation for all models and different choices of the number of latent event classes K ∈ {10, 25, 50, 100, 250, 500}. Linear regression represents a powerful model, achieving a mean absolute error of 11.87 years across all folds, but is eclipsed by the latent variable model at K ≥ 50. The correlations captured by the logistic normal prior make a clear difference, uniformly yielding improvements over otherwise equivalent Dirichlet models across all K. As expected, models trained without knowledge of time during inference perform less well than models that contain that information.

Analysis
To analyze the latent event classes in Wikipedia biographies, we train our full model (with a logistic normal prior and time as an observable variable) on the full dataset of 242,970 biographies with K = 500 event classes; as above, we run inference for a burn-in period of 100 iterations and collect 50 samples from the posterior distributions for z (the event class indicator for each event).
Table 2 illustrates a sample of 20 event classes along with the mean time µ and standard deviation σ, the gender distribution (calculated from the posterior distribution over z for all entities whose gender is known) and the most probable terms in the class.
The latent classes that we learn span a mix of major life events of Wikipedia notable figures (including events that we might characterize as GRADUATING HIGH SCHOOL, BECOMING A CITIZEN, DIVORCE, BEING CONVICTED OF A CRIME, and DYING) and more fine-grained events (such as BEING DRAFTED BY A SPORTS TEAM and BEING INDUCTED INTO THE HALL OF FAME).
Emerging immediately from this summary is an imbalance in the gender distribution for many of these event classes. Among the 242,858 biographies whose gender is known, 14.8% are of women; we would therefore expect around 14.8% of the participants in most event classes to be female. Tables 3 and 4 illustrate five of the most highly skewed classes in both directions, ranked according to the z score of a two-tailed binomial proportion test ($H_0 = 14.8\%$).
While some of these classes reflect a biased world in which more men are drafted into sports teams, serve in the armed forces, and are ordained as priests, one latent class that calls out for explanation is that surrounding DIVORCE (divorce, marriage, divorced, filed, married, wife, separated, years, ended, later), whose female proportion of 39.4% is nearly triple that of the data overall (and whose z-score reveals it to be strongly statistically different [p < 0.0001] from the $H_0$ mean, even accounting for the Bonferroni correction we must make when considering the K = 500 tests we implicitly perform when ranking). While we did not approach this analysis with any a priori hypotheses to test, our unsupervised model reveals an interesting hypothesis to pursue with confirmatory analysis: biographies of women on Wikipedia disproportionately focus on marriage and divorce compared to those of men.
Table 3:
z      %Fem.  Most frequent terms
60.46  76.9%  miss, pageant, title, usa, miss universe, beauty, held, teen, crowned, competed
57.21  49.9%  birth, gave, daughter, son, born, first child, named, wife, announced, baby
55.63  59.8%  fashion, model, show, campaign, week, appeared, face, career, became, modeling
37.89  39.4%  divorce, marriage, divorced, married, filed, wife, separated, years, ended, later
36.70  36.5%  summer olympics, competed, olympics, team, finished, event, final, world championships

To test this hypothesis with more traditional means, we estimated the empirical gender proportions of biographies containing terms explicitly denoting divorce (divorced, divorce, divorces and divorcing). The result of this analysis confirms that of the model. Of the 4,608 biographies in which at least one of these terms appears, 38.8% are those of a woman, far more than the 14.8% we would expect (in a two-tailed binomial proportion test against $H_0 = 14.8\%$, this difference is significant at p < 0.0001); this corresponds to divorce being mentioned in 5.0% of all 35,932 women's biographies, and 1.4% of all 206,926 men's; on average, a woman's biography is 3.66 times more likely to mention divorce than a man's.
We repeat the gender proportion experiment with terms denoting marriage (married, marry, marries, marrying and marriage) and find a similar trend: of the 39,142 biographies where at least one of these terms is mentioned, 23.6% belong to women; again, in a two-tailed proportion test, this difference is significant at p < 0.0001. This corresponds to marriage appearing in 25.7% of all women's biographies, and 14.5% of men's; a woman's biography is 1.78 times more likely to mention marriage than a man's.
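The confirmatory test can be reproduced approximately from the reported figures. The sketch below uses the normal approximation to a one-sample binomial proportion test; the raw counts are reconstructed here from the rounded percentages in the text (for example, 38.8% of 4,608 is roughly 1,788) and are therefore approximate, and the function names are hypothetical.

```python
import math

def proportion_z(successes, n, p0):
    """z statistic of a one-sample binomial proportion test against p0."""
    p_hat = successes / n
    return (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

def two_tailed_p(z):
    """Two-tailed p-value under the standard normal approximation."""
    return math.erfc(abs(z) / math.sqrt(2))

def mention_ratio(women_hits, women_total, men_hits, men_total):
    """How many times more likely a woman's biography is to contain a term."""
    return (women_hits / women_total) / (men_hits / men_total)

# Approximate counts reconstructed from the rounded percentages above.
divorce_total = 4608
divorce_women = round(0.388 * divorce_total)
z = proportion_z(divorce_women, divorce_total, p0=0.148)
ratio = mention_ratio(divorce_women, 35932,
                      divorce_total - divorce_women, 206926)
```

Because the inputs are rounded, `ratio` comes out close to, but not exactly equal to, the reported 3.66.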

Additional Analyses
The analysis above represents one substantive result that mining life events from biographical data makes possible.To illustrate the range of other analyses that this method can occasion, we briefly present two other directions that can be pursued: investigating correlations among event classes and the distribution of event classes over historical time.

Correlations among events
In our full model with a logistic normal prior over a document's set of events, correlations among latent event classes are learned during inference. From the covariance matrix $\Sigma_\eta$, we can directly read off correlations among events; for other models (such as those with a Dirichlet prior), we can infer correlations using the posterior estimates for $\eta$.
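Reading correlations off the covariance matrix amounts to normalizing each entry by the two corresponding standard deviations. A minimal sketch, with hypothetical function names:

```python
import numpy as np

def correlation_from_covariance(sigma):
    """Convert a covariance matrix into a correlation matrix."""
    std = np.sqrt(np.diag(sigma))
    return sigma / np.outer(std, std)

def most_correlated(sigma, k, top_n=5):
    """Classes most positively correlated with class k, excluding k itself."""
    corr = correlation_from_covariance(sigma)
    order = np.argsort(-corr[k])
    return [(int(j), float(corr[k, j])) for j in order if j != k][:top_n]
```

Applied to the learned $\Sigma_\eta$ and the index of a class of interest, `most_correlated` returns the kind of ranked list of related classes that a correlation table reports.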
Table 5 illustrates the event classes that have the highest correlations to the event class defined by family, boss, murder, crime, mafia, became, arrested, john, gang, chicago. The structure that we learn here neatly corresponds to a CRIMINAL ACTION frame, with common events for KILLING, BEING SUBJECT TO FEDERAL INVESTIGATION, BEING ARRESTED and BEING BROUGHT TO TRIAL.
Grounding specific life events in history has the potential to enable analysis of how historical time affects the life histories of individuals, including both the influence of the general passage of time, as on transitions to adulthood (Modell et al., 1976; Hogan, 1981; Modell, 1980), and the influence of specific historical moments like the Great Depression (Elder, 1974) or World War II (Mayer, 1988; Elder, 1991).

Related Work
In learning general classes of events from text, our work draws on a rich background spanning several research traditions. By considering the structure that exists between event classes, we draw on the original work on procedural scripts and schemas (Minsky, 1974; Schank and Abelson, 1977) and narrative chains (Chambers and Jurafsky, 2008; Chambers and Jurafsky, 2009), including more recent advances in the unsupervised learning of frame semantic representations (Modi et al., 2012; O'Connor, 2013; Cheung et al., 2013; Chambers, 2013).
In learning latent classes from text, our work is also clearly related to research on topic modeling (Blei et al., 2003; Griffiths and Steyvers, 2004). This work differs from that tradition by scoping our data only over text that we have reason to believe describes events (by including absolute dates). While other topic models have leveraged temporal information in the learning of latent topics, such as the dynamic topic model (Blei and Lafferty, 2006b; Wang et al., 2012) and "topics over time" (Wang and McCallum, 2006), our model is the first to infer classes of events whose contours are shaped by the time in a person's life that they take place.
While the information extraction tasks of template filling (Hobbs et al., 1993) and relation detection (Banko et al., 2007; Fader et al., 2011; Carlson et al., 2010) generally fall into a paradigm of classifying text segments into a predetermined ontology, they too have been informed by unsupervised approaches to learning relation classes (Yao et al., 2011) and events (Ritter et al., 2012). Our work here differs from this past work in leveraging explicit absolute temporal information in the unsupervised learning of event classes (and their structure). Reasoning about the temporal ordering of events likewise has a long tradition of its own, both in NLP (Pustejovsky et al., 2003; Mani et al., 2006; Verhagen et al., 2007; Chambers et al., 2007) and information extraction (Talukdar et al., 2012). Rather than attempting to model the ordering of events relative to each other, we focus instead on their occurrence relative to the beginning of a person's life.
Wikipedia likewise has been used extensively in NLP; Wikipedia biographies in particular have been used for the task of training summarization models (Biadsy et al., 2008), recognizing biographical sentences (Conway, 2010), learning correlates of "success" (Ng, 2012), and disambiguating named entities (Bunescu and Pasca, 2006; Cucerzan, 2007). In our work in mining biographical structure from it, we draw on previous research into automatically uncovering latent structure in résumés (Mimno and McCallum, 2007a) and approaches to learning life path trajectories from categorical survey data (Massoni et al., 2009; Ritschard et al., 2013).
In using Wikipedia as a dataset for analysis, we must note that the subjects of biographies are not a representative sample of the population, nor are their contents unbiased representations. Nearly all encyclopedias necessarily prefer the historically notorious (if due to nothing else than inherent biases in the preservation of historical records); many, like Wikipedia, also have disproportionately low coverage of women, minorities, and other demographic groups, in part because of biases in community membership. Estimates of the percentage of female editors on Wikipedia, for example, range from 9% to 16.1% (Collier and Bear, 2012; Reagle and Rhue, 2011; Cassell, 2011; Hill and Shaw, 2013; Wikipedia, 2011). Different language editions of Wikipedia have a natural geographic bias in article selection (Hecht and Gergle, 2009), with each emphasizing their own "local heroes" (Kolbitsch and Maurer, 2006), and also differ in the kind of information they present (Pfeil et al., 2006; Callahan and Herring, 2011). This extends to selection of biographies as well, with one study finding approximately 16% of 1000 sampled biographies being those of women (Reagle and Rhue, 2011), a figure very close to the 14.8% we observe in our analysis here.

Conclusion
We present a method for mining life events from biographies, leveraging the correlation structure of event descriptions. Unlike prior work that has focused on inferring "life trajectories" from categorical survey data, we learn relevant structure in an unsupervised manner directly from text, opening the door to applying this method to a broad set of biographies beyond Wikipedia (including full-text books from the Internet Archive or Hathi Trust, and other encyclopedic biographies as well). In a quantitative analysis, the model we present outperforms a strong baseline at the task of event time prediction, and surfaces a substantive qualitative distinction in the content of the biographies of men and women on Wikipedia: in contrast to previous work that uses computational methods to measure a difference in coverage, we show that such methods are able to tease apart differences in characterization as well.
While the task of event time prediction provides a quantitative means to compare different models, we expect the real application of this work will lie in the latent event classes themselves, and the information they provide both about the subjects and authors of biographies. Latent topics have provided one way of organizing large document collections in the past (Mimno and McCallum, 2007b); in addition to occasioning data analysis of the kind we describe here, we expect that personal event classes can have a practical application in helping to organize data describing people as well. Data and code to support this work, including an interface to explore event classes in Wikipedia, can be found at http://www.ark.cs.cmu.edu/bio/.
dation grant CAREER IIS-1054319 to N.A.S. and Google's support of the Reading is Believing project at CMU. This work was made possible through the use of computing resources made available by the Open Science Data Cloud (OSDC), an Open Cloud Consortium (OCC)-sponsored project.

Figure 2: Mean absolute error (in years) for time prediction.

Figure 3
Figure 3 likewise illustrates the distribution over time for a set of learned event classes. While the only notion of time that our model has access to during inference is that of time relative to a person's birth, we can estimate the empirical distribution of event classes in historical time by charting the density plot of their observed absolute dates. Several historically relevant event classes are legible, including SERVING IN THE ARMY (with peaks dur-

Table 2: Salient event classes learned from 242,970 Wikipedia biographies. All 500 event classes can be viewed at http://www.ark.cs.cmu.edu/bio.

Table 3: Female-skewed event classes, ranked by z-score in a two-tailed binomial proportion test.

Table 4: Male-skewed event classes, ranked by z-score in a two-tailed binomial proportion test.

Table 5: Highest correlations between the family, boss, murder, crime, mafia class and other events.