Named Entity Recognition with Bidirectional LSTM-CNNs

Named entity recognition is a challenging task that has traditionally required large amounts of knowledge in the form of feature engineering and lexicons to achieve high performance. In this paper, we present a novel neural network architecture that automatically detects word- and character-level features using a hybrid bidirectional LSTM and CNN architecture, eliminating the need for most feature engineering. We also propose a novel method of encoding partial lexicon matches in neural networks and compare it to existing approaches. Extensive evaluation shows that, given only tokenized text and publicly available word embeddings, our system is competitive on the CoNLL-2003 dataset and surpasses the previously reported state of the art performance on the OntoNotes 5.0 dataset by 2.13 F1 points. By using two lexicons constructed from publicly-available sources, we establish new state of the art performance with an F1 score of 91.62 on CoNLL-2003 and 86.28 on OntoNotes, surpassing systems that employ heavy feature engineering, proprietary lexicons, and rich entity linking information.


Introduction
Named entity recognition is an important task in NLP. High performance approaches have been dominated by applying CRF, SVM, or perceptron models to hand-crafted features (Ratinov and Roth, 2009; Passos et al., 2014; Luo et al., 2015). However, Collobert et al. (2011b) proposed an effective neural network model that requires little feature engineering and instead learns important features from word embeddings trained on large quantities of unlabelled text, an approach made possible by recent advancements in unsupervised learning of word embeddings on massive amounts of data (Collobert and Weston, 2008; Mikolov et al., 2013) and neural network training algorithms permitting deep architectures (Rumelhart et al., 1986).
Unfortunately, there are many limitations to the model proposed by Collobert et al. (2011b). First, it uses a simple feed-forward neural network, which restricts the use of context to a fixed-size window around each word, an approach that discards useful long-distance relations between words. Second, by depending solely on word embeddings, it is unable to exploit explicit character-level features such as prefixes and suffixes, which could be especially useful for rare words whose word embeddings are poorly trained. We seek to address these issues by proposing a more powerful neural network model.
A well-studied solution for a neural network to process variable length input and have long term memory is the recurrent neural network (RNN) (Goller and Kuchler, 1996). Recently, RNNs have shown great success in diverse NLP tasks such as speech recognition (Graves et al., 2013), machine translation (Cho et al., 2014), and language modeling (Mikolov et al., 2011). The long short-term memory (LSTM) unit with the forget gate allows highly non-trivial long-distance dependencies to be easily learned (Gers et al., 2000). For sequential labelling tasks such as NER and speech recognition, a bi-directional LSTM model can take into account an effectively infinite amount of context on both sides of a word and eliminates the problem of limited context that applies to any feed-forward model (Graves et al., 2013). While LSTMs have been studied in the past for the NER task by Hammerton (2003), the lack of computational power (which led to the use of very small models) and quality word embeddings limited their effectiveness.
Convolutional neural networks (CNN) have also been investigated for modeling character-level information, among other NLP tasks. Santos et al. (2015) and Labeau et al. (2015) successfully employed CNNs to extract character-level features for use in NER and POS-tagging respectively. Collobert et al. (2011b) also applied CNNs to semantic role labeling, and variants of the architecture have been applied to parsing and other tasks requiring tree structures (Blunsom et al., 2014). However, the effectiveness of character-level CNNs has not been evaluated for English NER. While we considered using character-level bi-directional LSTMs, which were recently proposed by Ling et al. (2015) for POS-tagging, preliminary evaluation shows that they do not perform significantly better than CNNs while being more computationally expensive to train.
Our main contribution lies in combining these neural network models for the NER task. We present a hybrid model of bi-directional LSTMs and CNNs that learns both character- and word-level features, presenting the first evaluation of such an architecture on well-established English language evaluation datasets. Furthermore, as lexicons are crucial to NER performance, we propose a new lexicon encoding scheme and matching algorithm that can make use of partial matches, and we compare it to the simpler approach of Collobert et al. (2011b). Extensive evaluation shows that our proposed method establishes a new state of the art on both the CoNLL-2003 NER shared task and the OntoNotes 5.0 datasets.

Model
Our neural network is inspired by the work of Collobert et al. (2011b), where lookup tables transform discrete features such as words and characters into continuous vector representations, which are then concatenated and fed into a neural network. Instead of a feed-forward network, we use the bi-directional long short-term memory (BLSTM) network. To induce character-level features, we use a convolutional neural network, which has been successfully applied to Spanish and Portuguese NER (Santos et al., 2015) and German POS-tagging (Labeau et al., 2015).

Sequence-labelling with BLSTM
Following the speech-recognition framework outlined by Graves et al. (2013), we employed a stacked 1 bi-directional recurrent neural network with long short-term memory units to transform word features into named entity tag scores. Figures 1, 2, and 3 illustrate the network in detail.
The extracted features of each word are fed into a forward LSTM network and a backward LSTM network. The output of each network at each time step is decoded by a linear layer and a log-softmax layer into log-probabilities for each tag category. These two vectors are then simply added together to produce the final output.
We tried minor variants of output layer architecture and selected the one that performed the best in preliminary experiments.

Extracting Character Features Using a Convolutional Neural Network
For each word we employ a convolution and a max layer to extract a new feature vector from the per-character feature vectors such as character embeddings (Section 2.3.2) and (optionally) character type (Section 2.5). Words are padded with a number of special PADDING characters on both sides depending on the window size of the CNN.
The hyper-parameters of the CNN are the window size and the output vector size.
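As a rough illustration of the convolution and max layer, the sketch below slides a window over padded per-character vectors and keeps the maximum activation of each filter, so every word yields a vector whose length equals the number of filters. The padding amount of (window - 1) // 2 per side, the odd window size, and all names and values are assumptions made for this example.

```python
def char_cnn(char_vectors, filters, biases, window, pad_vector):
    """Convolution over character vectors followed by max over positions.

    char_vectors: one feature vector per character of the word (non-empty).
    filters: one flat weight list per output feature, of length
             window * len(pad_vector).
    """
    pad = (window - 1) // 2  # assumed padding per side (odd window size)
    padded = [pad_vector] * pad + char_vectors + [pad_vector] * pad
    n_positions = len(padded) - window + 1
    features = []
    for filt, bias in zip(filters, biases):
        best = float("-inf")
        for start in range(n_positions):
            # Flatten the window of character vectors into one input vector.
            flat = [v for vec in padded[start:start + window] for v in vec]
            activation = sum(w * x for w, x in zip(filt, flat)) + bias
            best = max(best, activation)
        features.append(best)
    return features


# Toy word of three characters with 1-dimensional character vectors.
word = [[1.0], [2.0], [3.0]]
filters = [[1.0, 1.0, 1.0], [1.0, 0.0, 0.0]]
features = char_cnn(word, filters, [0.0, 0.0], window=3, pad_vector=[0.0])
```

The two hyper-parameters named above correspond to `window` and the number of entries in `filters`.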
1 For each direction (forward and backward), the input is fed into multiple layers of LSTM units connected in sequence (i.e., LSTM units in the second layer take in the output of the first layer, and so on); the number of layers is a tuned hyperparameter. Figure 1 shows only one unit for simplicity.
We also experimented with two other sets of published embeddings, namely Stanford's GloVe embeddings 3 trained on 6 billion words from Wikipedia and Web text (Pennington et al., 2014) and Google's word2vec embeddings 4 trained on 100 billion words from Google News (Mikolov et al., 2013).

In addition, as we hypothesized that word embeddings trained on in-domain text may perform better, we also used the publicly available GloVe (Pennington et al., 2014) program and an in-house re-implementation 5 of the word2vec (Mikolov et al., 2013) program to train word embeddings on Wikipedia and Reuters RCV1 datasets as well. 6 Following Collobert et al. (2011b), all words are lower-cased before passing through the lookup table to convert to their corresponding embeddings. The pre-trained embeddings are allowed to be modified during training. 7

Character Embeddings
We randomly initialized a lookup table with values drawn from a uniform distribution with range [−0.5, 0.5] to output a character embedding of 25 dimensions. The character set includes all unique characters in the CoNLL-2003 dataset 8 plus the special tokens PADDING and UNKNOWN. The PADDING token is used for the CNN, and the UNKNOWN token is used for all other characters (which appear in OntoNotes). The same set of random embeddings was used for all experiments. 9

Capitalization Feature
As capitalization information is erased during lookup of the word embedding, we evaluate Collobert's method of using a separate lookup table to add a capitalization feature with the following options: allCaps, upperInitial, lowercase, mixedCaps, noinfo (Collobert et al., 2011b).This method is compared with the character type feature (Section 2.5) and character-level CNNs.
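The five capitalization options can be computed with a small helper like the one below. The text only lists the option names, so the exact rule behind each option (for example, that upperInitial requires the remaining letters to be lower case) is an assumption of this sketch.

```python
def capitalization_feature(word):
    """Map a word to one of the five capitalization options (assumed rules)."""
    letters = [c for c in word if c.isalpha()]
    if not letters:
        return "noinfo"        # digits, punctuation, empty strings
    if all(c.isupper() for c in letters):
        return "allCaps"
    if all(c.islower() for c in letters):
        return "lowercase"
    if word[0].isupper() and all(c.islower() for c in letters[1:]):
        return "upperInitial"
    return "mixedCaps"


examples = {w: capitalization_feature(w)
            for w in ["EU", "German", "lamb", "iPhone", "1996"]}
```

The feature is computed on the original token, before the lower-casing applied for the word embedding lookup.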

Lexicons
Most state of the art NER systems make use of lexicons as a form of external knowledge (Ratinov and Roth, 2009;Passos et al., 2014).
For each of the four categories (Person, Organization, Location, Misc) defined by the CoNLL 2003 NER shared task, we compiled a list of known named entities from DBpedia (Auer et al., 2007), by extracting all descendants of DBpedia types corresponding to the CoNLL categories. 14 We did not construct separate lexicons for the OntoNotes tagset because correspondences between DBpedia categories and its tags could not be found in many instances. In addition, for each entry we first removed parentheses and all text contained within, then stripped trailing punctuation, 15 and finally tokenized it with the Penn Treebank tokenization script for the purpose of partial matching. Table 1 shows the size of each category in our lexicon compared to Collobert's lexicon, which we extracted from their SENNA system. Figure 4 shows an example of how the lexicon features are applied. 16 For each lexicon category, we match every n-gram (up to the length of the longest lexicon entry) against entries in the lexicon. A match is successful when the n-gram matches the prefix or suffix of an entry and is at least half the length of the entry. Because of the high potential for spurious matches, for all categories except Person, we discard partial matches less than 2 tokens in length. When there are multiple overlapping matches within the same category, we prefer exact matches over partial matches, and then longer matches over shorter matches, and finally earlier matches in the sentence over later matches. All matches are case insensitive.
For each token in the match, the feature is encoded in BIOES annotation (Begin, Inside, Outside, End, Single), indicating the position of the token in the matched entry. In other words, B will not appear in a suffix-only partial match, and E will not appear in a prefix-only partial match.
As we will see in Section 4.5, we found that this more sophisticated method outperforms the method presented by Collobert et al. (2011b), which treats partial and exact matches equally, allows prefix but not suffix matches, allows very short partial matches, and marks tokens with YES/NO.
In addition, since Collobert et al. (2011b) released their lexicon with their SENNA system, we also applied their lexicon to our model for comparison and investigated using both lexicons simultaneously as distinct features. We found that the two lexicons complement each other and improve performance on the CoNLL-2003 dataset.
Our best model uses the SENNA lexicon with exact matching and our DBpedia lexicon with partial matching, with BIOES annotation in both cases.
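The partial matching and BIOES encoding rules for a single lexicon category can be sketched as follows. This is an illustrative reading of the description, not the authors' code: entry length is assumed to be measured in tokens, an n-gram that is both a prefix of one entry and a suffix of another is arbitrarily labeled as a prefix match, and the function names and the `is_person` flag are hypothetical.

```python
def match_kind(ngram, entry):
    """Return 'exact', 'prefix', 'suffix', or None for an n-gram vs. an entry."""
    n, m = len(ngram), len(entry)
    if ngram == entry:
        return "exact"
    if n < m and 2 * n >= m:          # partial: at least half the entry
        if entry[:n] == ngram:
            return "prefix"
        if entry[m - n:] == ngram:
            return "suffix"
    return None


def bioes_tags(n, kind):
    """Position of each matched token within the lexicon entry."""
    if kind == "exact":
        return ["S"] if n == 1 else ["B"] + ["I"] * (n - 2) + ["E"]
    if kind == "prefix":
        return ["B"] + ["I"] * (n - 1)   # entry continues, so no E
    return ["I"] * (n - 1) + ["E"]       # suffix: entry began earlier, no B


def lexicon_features(tokens, entries, is_person=False):
    """Tag each token with B/I/O/E/S for one lexicon category."""
    toks = [t.lower() for t in tokens]                       # case insensitive
    entries = [tuple(w.lower() for w in e) for e in entries]
    max_len = max((len(e) for e in entries), default=0)
    rank = {"exact": 0, "prefix": 1, "suffix": 1}
    candidates = []
    for start in range(len(toks)):
        for n in range(1, min(max_len, len(toks) - start) + 1):
            ngram = tuple(toks[start:start + n])
            kinds = {match_kind(ngram, e) for e in entries} - {None}
            if not kinds:
                continue
            kind = min(kinds, key=lambda k: (rank[k], k))
            if kind != "exact" and n < 2 and not is_person:
                continue              # short partial matches are too noisy
            candidates.append((rank[kind], -n, start, kind))
    tags = ["O"] * len(toks)
    # Preference order: exact before partial, longer first, earlier first.
    for _, neg_n, start, kind in sorted(candidates):
        span = range(start, start - neg_n)
        if all(tags[i] == "O" for i in span):
            for i, tag in zip(span, bioes_tags(-neg_n, kind)):
                tags[i] = tag
    return tags


loc_lexicon = [("New", "York", "City"), ("United", "States")]
sentence = "He flew to New York from the United States".split()
loc_tags = lexicon_features(sentence, loc_lexicon)
```

In this toy sentence, "New York" is a prefix-only partial match of "New York City" and is tagged B I, while "United States" is an exact match and is tagged B E.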

Additional Character-level Features
A lookup table was used to output a 4-dimensional vector representing the type of the character (upper case, lower case, punctuation, other).

Implementation
We implement the neural network using the torch7 library (Collobert et al., 2011a). Training and inference are done on a per-sentence level. The initial states of the LSTM are zero vectors. Except for the character and word embeddings whose initialization has been described previously, all lookup tables are randomly initialized with values drawn from the standard normal distribution.

Objective Function and Inference
We train our network to maximize the sentence-level log-likelihood from Collobert et al. (2011b). 17 First, we define a tag-transition matrix $A$, where $A_{i,j}$ represents the score of jumping from tag $i$ to tag $j$ in successive tokens and $A_{0,i}$ is the score for starting with tag $i$. This matrix of parameters is also learned. Define $\theta$ as the set of parameters for the neural network, and $\tilde{\theta} = \theta \cup \{A_{i,j} \; \forall i, j\}$ as the set of all parameters to be trained. Given an example sentence $[x]_1^T$ of length $T$, define $[f_\theta]_{i,t}$ as the score output by the neural network for the $t$-th word and $i$-th tag given parameters $\theta$. Then the score of a sequence of tags $[i]_1^T$ is given as the sum of network and transition scores:

$$S([x]_1^T, [i]_1^T, \tilde{\theta}) = \sum_{t=1}^{T} \left( A_{[i]_{t-1},[i]_t} + [f_\theta]_{[i]_t,t} \right)$$

where $[i]_0 = 0$ denotes the start of the sentence. Then, letting $[y]_1^T$ be the true tag sequence, the sentence-level log-likelihood is obtained by normalizing the above score over all possible tag sequences $[j]_1^T$ using a softmax:

$$\log P([y]_1^T \mid [x]_1^T, \tilde{\theta}) = S([x]_1^T, [y]_1^T, \tilde{\theta}) - \log \sum_{\forall [j]_1^T} e^{S([x]_1^T, [j]_1^T, \tilde{\theta})}$$

This objective function and its gradients can be efficiently computed by dynamic programming (Collobert et al., 2011b).
At inference time, given neural network outputs $[f_\theta]_{i,t}$, we use the Viterbi algorithm to find the tag sequence $[i]_1^T$ that maximizes the score $S([x]_1^T, [i]_1^T, \tilde{\theta})$.
17 Much later, we discovered that training with cross entropy objective while performing Viterbi decoding to restrict output to valid tag sequences also appears to work just as well.
18 OntoNotes results taken from (Durrett and Klein, 2014).
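To make the objective and the inference step concrete, here is a small pure-Python sketch of the sequence score, the sentence-level log-likelihood (with the normalizer computed by the forward algorithm), and Viterbi decoding. It is a sketch under stated assumptions: network scores are stored time-major as `f[t][i]`, the start scores (the row written as A 0,i in the text) are passed as a separate list `start`, and all numeric values are made up.

```python
import math


def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def sequence_score(f, A, start, tags):
    """Sum of transition and network scores along one tag sequence."""
    score = start[tags[0]] + f[0][tags[0]]
    for t in range(1, len(tags)):
        score += A[tags[t - 1]][tags[t]] + f[t][tags[t]]
    return score


def log_partition(f, A, start):
    """Log-sum-exp of the scores of all tag sequences (forward algorithm)."""
    n = len(start)
    alpha = [start[i] + f[0][i] for i in range(n)]
    for t in range(1, len(f)):
        alpha = [logsumexp([alpha[i] + A[i][j] for i in range(n)]) + f[t][j]
                 for j in range(n)]
    return logsumexp(alpha)


def log_likelihood(f, A, start, gold):
    return sequence_score(f, A, start, gold) - log_partition(f, A, start)


def viterbi(f, A, start):
    """Highest-scoring tag sequence by dynamic programming."""
    n = len(start)
    delta = [start[i] + f[0][i] for i in range(n)]
    backpointers = []
    for t in range(1, len(f)):
        new_delta, pointers = [], []
        for j in range(n):
            i_best = max(range(n), key=lambda i: delta[i] + A[i][j])
            new_delta.append(delta[i_best] + A[i_best][j] + f[t][j])
            pointers.append(i_best)
        delta = new_delta
        backpointers.append(pointers)
    path = [max(range(n), key=lambda i: delta[i])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy problem: 3 tokens, 2 tags.
f = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
A = [[0.5, -1.0], [-0.5, 0.2]]
start = [0.0, 0.1]
best_path = viterbi(f, A, start)
```

Both the forward recursion and Viterbi cost time proportional to the sentence length times the square of the number of tags, which is what makes the sentence-level objective tractable.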

Tagging Scheme
The output tags are annotated with BIOES (which stand for Begin, Inside, Outside, End, Single, indicating the position of the token in the entity) as this scheme has been reported to outperform others such as BIO (Ratinov and Roth, 2009).
21 Numbers taken from the original paper (Luo et al., 2015). While the precision, recall, and F1 scores are clearly inconsistent, it is unclear in which way they are incorrect.
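As a brief illustration of the BIOES scheme, the helper below converts entity spans into per-token tags. The `PREFIX-LABEL` tag format (for example `B-PER`) and the span representation with an exclusive end index are assumptions chosen for this example.

```python
def spans_to_bioes(length, spans):
    """Convert (start, end_exclusive, label) entity spans to BIOES tags."""
    tags = ["O"] * length
    for start, end, label in spans:
        if end - start == 1:
            tags[start] = "S-" + label          # single-token entity
        else:
            tags[start] = "B-" + label
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + label
            tags[end - 1] = "E-" + label
    return tags


tokens = ["EU", "rejects", "Jean", "Claude", "Juncker"]
tags = spans_to_bioes(len(tokens), [(0, 1, "ORG"), (2, 5, "PER")])
```

Compared with BIO, the extra E and S prefixes mark entity boundaries explicitly.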

Learning Algorithm
Training is done by mini-batch stochastic gradient descent (SGD) with a fixed learning rate. Each mini-batch consists of multiple sentences with the same number of tokens. We found that applying dropout to the output nodes 22 of each LSTM layer (Pham et al., 2014) was quite effective in reducing overfitting (Section 4.4). We explored other more sophisticated optimization algorithms such as momentum (Nesterov, 1983), AdaDelta (Zeiler, 2012), and RMSProp (Hinton et al., 2012), and in preliminary experiments they did not improve upon plain SGD.

Evaluation
Evaluation was performed on the well-established CoNLL-2003 NER shared task dataset (Tjong Kim Sang and De Meulder, 2003) and the much larger but less-studied OntoNotes 5.0 dataset (Hovy et al., 2006; Pradhan et al., 2013). Table 2 gives an overview of these two different datasets.
For each experiment, we report the average and standard deviation of 10 successful trials.
Table 6: F1 score results of BLSTM and BLSTM-CNN models with various additional features; emb = Collobert word embeddings, char = character type feature, caps = capitalization feature, lex = lexicon features. Note that starred results are repeated for ease of comparison.

Dataset Preprocessing
For all datasets, we performed the following preprocessing:
• All digit sequences are replaced by a single "0".
• Before training, we group sentences by length (number of words) into mini-batches and shuffle them.
In addition, for the OntoNotes dataset, in order to handle the Time, Money, Percent, Quantity, Ordinal, and Cardinal named entity tags, we split tokens before and after every digit.
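The preprocessing steps above can be sketched with the standard library. The order in which digit splitting and digit normalization are applied, the `batch_size` parameter, and the fixed random seed are assumptions of this sketch and are not specified in the text.

```python
import random
import re
from collections import defaultdict


def normalize_digits(token):
    """Replace every run of digits with a single '0'."""
    return re.sub(r"\d+", "0", token)


def split_around_digits(token):
    """Split a token before and after every digit (OntoNotes only)."""
    return [piece for piece in re.split(r"(\d)", token) if piece]


def make_batches(sentences, batch_size, seed=0):
    """Group sentences with the same number of tokens, then shuffle."""
    by_length = defaultdict(list)
    for sentence in sentences:
        by_length[len(sentence)].append(sentence)
    rng = random.Random(seed)
    batches = []
    for group in by_length.values():
        rng.shuffle(group)
        for i in range(0, len(group), batch_size):
            batches.append(group[i:i + batch_size])
    rng.shuffle(batches)
    return batches


corpus = [["a", "b"], ["c"], ["d", "e"], ["f", "g"], ["h"]]
batches = make_batches(corpus, batch_size=2)
```

Grouping by length means every sentence in a mini-batch has the same number of tokens, so no padding is needed at the word level.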

CoNLL 2003 Dataset
The CoNLL-2003 dataset (Tjong Kim Sang and De Meulder, 2003) consists of newswire from the Reuters RCV1 corpus tagged with four types of named entities: location, organization, person, and miscellaneous. As the dataset is small compared to OntoNotes, we trained the model on both the training and development sets after performing hyperparameter optimization on the development set.

OntoNotes 5.0 Dataset
Pradhan et al. (2013) compiled a core portion of the OntoNotes 5.0 dataset for the CoNLL-2012 shared task and described a standard train/dev/test split, which we use for our evaluation. Following Durrett and Klein (2014), we applied our model to the portion of the dataset with gold-standard named entity annotations; the New Testament portion was excluded for lacking gold-standard annotations. This dataset is much larger than CoNLL-2003 and consists of text from a wide variety of sources, such as broadcast conversation, broadcast news, newswire, magazine, telephone conversation, and Web text.

Hyper-parameter Optimization
We performed two rounds of hyper-parameter optimization and selected the best settings based on development set performance. 23 Table 3 shows the final hyper-parameters, and Table 4 shows the dev set performance of the best models in each round.
In the first round, we performed random search and selected the best hyper-parameters over the development set of the CoNLL-2003 data. We evaluated around 500 hyper-parameter settings. Then, we took the same settings and tuned the learning rate and epochs on the OntoNotes development set. 24 For the second round, we performed independent hyper-parameter searches on each dataset using Optunity's implementation of particle swarm (Claesen et al., 2014), as there is some evidence that it is more efficient than random search (Clerc and Kennedy, 2002). We evaluated 500 hyper-parameter settings this round as well. As we later found that training occasionally fails (Section 3.5) and that results vary considerably from run to run, we ran the top 5 settings from each dataset for 10 trials each and selected the best one based on averaged dev set performance.
For CoNLL-2003, we found that particle swarm produced better hyper-parameters than random search. However, surprisingly, for OntoNotes particle swarm was unable to produce better hyper-parameters than those from the ad-hoc approach in round 1. We also tried tuning the CoNLL-2003 hyper-parameters from round 2 for OntoNotes and that was not any better 25 either.
We trained CoNLL-2003 models for a large number of epochs because we observed that the models did not exhibit overtraining and instead continued to slowly improve on the development set long after reaching near 100% on the training set. In contrast, despite OntoNotes being much larger than CoNLL-2003, training for more than about 18 epochs causes performance on the development set to decline steadily due to overfitting.
23 Hyper-parameter optimization was done with the BLSTM-CNN + emb + lex feature set, as it had the best performance.
24 Selected based on dev set performance of a few runs.
25 The result is 84.41 (± 0.33) on the OntoNotes dev set.
Table 7: F1 scores when the Collobert word vectors are replaced. We tried 50- and 300-dimensional random vectors (Random 50d, Random 300d); GloVe's released vectors trained on 6 billion words (GloVe 6B 50d, GloVe 6B 300d); Google's released vectors trained on 100 billion words from Google News (Google 100B 300d); and 50-dimensional GloVe and word2vec skip-gram vectors that we trained on Wikipedia and Reuters RCV-1 (Our GloVe 50d, Our Skip-gram 50d).

Excluding Failed Trials
On the CoNLL-2003 dataset, while BLSTM models completed training without difficulty, the BLSTM-CNN models fail to converge around 5 to 10% of the time depending on feature set. Similarly, on OntoNotes, 1.5% of trials fail. We found that using a lower learning rate reduces the failure rate. We also tried clipping gradients and using AdaDelta, and each of them was effective at eliminating such failures by itself. AdaDelta, however, made training more expensive with no gain in model performance.
In any case, for all experiments we excluded trials where the final F1 score on a subset of training data falls below a certain threshold, and continued to run trials until we obtained 10 successful ones.
For CoNLL-2003, we excluded trials where the final F1 score on the development set was less than 95; there was no ambiguity in selecting the threshold as every trial scored either above 98 or below 90. For OntoNotes, the threshold was an F1 score of 80 on the last 5,000 sentences of the training set; every trial scored either above 80 or below 75.
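The trial-exclusion procedure amounts to a simple loop that keeps running trials until enough of them pass the threshold. In the sketch below, `run_trial` is a hypothetical callable standing in for a full training run; it is assumed to return the F1 score used for the convergence check and the F1 score to be reported. The `max_attempts` safeguard and the use of the sample standard deviation are also assumptions.

```python
import statistics


def collect_successful_trials(run_trial, threshold, n_required=10,
                              max_attempts=1000):
    """Run trials until `n_required` of them pass the convergence check."""
    reported, attempts = [], 0
    while len(reported) < n_required and attempts < max_attempts:
        attempts += 1
        check_f1, report_f1 = run_trial(attempts)  # attempt number as seed
        if check_f1 >= threshold:
            reported.append(report_f1)
    return reported, attempts


def fake_trial(seed):
    """Stand-in for training: every third run 'fails to converge'."""
    if seed % 3 == 0:
        return 50.0, 0.0
    return 98.5, 91.0 + 0.01 * seed


scores, attempts = collect_successful_trials(fake_trial, threshold=95.0)
mean_f1 = statistics.mean(scores)
std_f1 = statistics.stdev(scores)
```

With a bimodal outcome such as the one described above (every trial either well above or well below the threshold), the exact threshold value has no effect on which trials are kept.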

Training and Tagging Speed
On an Intel Xeon E5-2697 processor, training takes about 6 hours while tagging the test set takes about 12 seconds for CoNLL-2003. The times are 10 hours and 60 seconds respectively for OntoNotes.

Results and Discussion
Table 5 shows the results for all datasets. To the best of our knowledge, our best models have surpassed the previous highest reported F1 score for both CoNLL-2003 and OntoNotes. In particular, with no external knowledge other than word embeddings, our model is competitive on the CoNLL-2003 dataset and establishes a new state of the art for OntoNotes, suggesting that given enough data, the neural network automatically learns the relevant features for NER without feature engineering.

Comparison with FFNNs
We re-implemented the FFNN model of Collobert et al. (2011b) as a baseline for comparison. Table 5 shows that while performing reasonably well on CoNLL-2003, FFNNs are clearly inadequate for OntoNotes, which has a larger domain, showing that LSTM models are essential for NER.

Character-level CNNs vs. Character Type and Capitalization Features
Comparison of models in Table 6 shows that on CoNLL-2003, BLSTM-CNN models significantly 26 outperform the BLSTM models when given the same feature set. This effect is much smaller and not statistically significant on OntoNotes when capitalization features are added. Adding character type and capitalization features to the BLSTM-CNN models degrades performance for CoNLL and mostly improves performance on OntoNotes, suggesting character-level CNNs can replace hand-crafted character features in some cases, but systems with weak lexicons may benefit from character features.
26 Wilcoxon rank sum test, p < 0.05 when comparing the four BLSTM models with the corresponding BLSTM-CNN models using the same feature set. The Wilcoxon rank sum test was selected for its robustness against small sample sizes when the distribution is unknown.
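For readers unfamiliar with the test, the following is a minimal pure-Python sketch of a two-sided Wilcoxon rank sum test using the normal approximation, with average ranks for ties and no tie or continuity correction. Whether the reported p-values were computed with this approximation or with an exact method is not stated, so the sketch shows only the general technique; the sample values are made up.

```python
import math


def rank_sum_test(xs, ys):
    """Two-sided Wilcoxon rank sum test (normal approximation).

    Returns (z, p). Ties receive average ranks.
    """
    pooled = sorted((value, group)
                    for group, sample in enumerate((xs, ys))
                    for value in sample)
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2.0      # mean of ranks i+1 .. j
        n_from_x = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_x += average_rank * n_from_x
        i = j
    n, m = len(xs), len(ys)
    expected = n * (n + m + 1) / 2.0
    std = math.sqrt(n * m * (n + m + 1) / 12.0)
    z = (rank_sum_x - expected) / std
    p = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided tail probability
    return z, p


# Ten hypothetical trial scores per model, with no overlap between groups.
model_a = [float(v) for v in range(11, 21)]
model_b = [float(v) for v in range(1, 11)]
z, p = rank_sum_test(model_a, model_b)
```

With ten trials per model and completely separated scores, the approximation already yields a p-value below 0.001.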

Word Embeddings
Table 5 and Table 7 show that we obtain a large, significant 27 improvement when trained word embeddings are used, as opposed to random embeddings, regardless of the additional features used. This is consistent with the results of Collobert et al. (2011b). Table 7 compares the performance of different word embeddings in our best model in Table 5 (BLSTM-CNN + emb + lex). For CoNLL-2003, publicly available GloVe and Google embeddings are about one point behind Collobert's embeddings. For OntoNotes, GloVe embeddings perform close to Collobert embeddings while Google embeddings are again one point behind. In addition, 300-dimensional embeddings present no significant improvement over 50-dimensional embeddings, a result previously reported by Turian et al. (2010).
One possible reason that Collobert embeddings perform better than other publicly available embeddings on CoNLL-2003 is that they are trained on the Reuters RCV-1 corpus, the source of the CoNLL-2003 dataset, whereas the other embeddings are not. 28 On the other hand, we suspect that Google's embeddings perform poorly because of vocabulary mismatch: in particular, Google's embeddings were trained in a case-sensitive manner, and embeddings for many common punctuation marks and symbols were not provided. To test these hypotheses, we performed experiments with new word embeddings trained using GloVe and word2vec, with vocabulary list and corpus similar to Collobert et al. (2011b). As shown in Table 7, our GloVe embeddings improved significantly 29 over publicly available embeddings on CoNLL-2003, and our word2vec skip-gram embeddings improved significantly 30 over Google's embeddings on OntoNotes.
27 Wilcoxon rank sum test, p < 0.001.
28 To make direct comparison to Collobert et al. (2011b), we do not exclude the CoNLL-2003 NER task test data from the word vector training data. While it is possible that this difference could be responsible for the disparate performance of word vectors, the CoNLL-2003 training data comprises only 20k out of 800 million words, or about 0.0025% of the total data; in an unsupervised training scheme, the effects are likely negligible.
Due to time constraints we did not perform new hyper-parameter searches with any of the word embeddings.As word embedding quality depends on hyper-parameter choice during their training (Pennington et al., 2014), and also, in our NER neural network, hyper-parameter choice is likely sensitive to the type of word embeddings used, optimizing them all will likely produce better results and provide a fairer comparison of word embedding quality.

Effect of Dropout
Table 8 compares the results of various dropout values for each dataset. The models are trained using only the training set for each dataset to isolate the effect of dropout on both dev and test sets. All other hyper-parameters and features remain the same as our best model in Table 5. In both datasets and on both dev and test sets, dropout is essential for state of the art performance, and the improvement is statistically significant. 31 Dropout is optimized on the dev set, as described in Section 3.4. Hence, the chosen value may not be the best-performing in Table 8.

Lexicon Features
Table 6 shows that on the CoNLL-2003 dataset, using features from both the SENNA lexicon and our proposed DBpedia lexicon provides a significant 32 improvement and allows our model to clearly surpass the previous state of the art. Unfortunately the difference is minuscule for OntoNotes, most likely because our lexicon does not match DBpedia categories well. Figure 5 shows that on CoNLL-2003, lexicon coverage is reasonable and matches the tag set for everything except the catch-all MISC category. For example, LOC entries in the lexicon mostly match LOC named entities and vice versa. However, on OntoNotes, the matches are noisy and the correspondence between lexicon match and tag category is quite ambiguous. For example, all lexicon categories have spurious matches in unrelated named entities like CARDINAL, and LOC, GPE, and LANGUAGE entities all get a lot of matches from the LOC category in the lexicon. In addition, named entities in categories like NORP, ORG, LAW, PRODUCT receive little coverage. The lower coverage, noise, and ambiguity all contribute to the disappointing performance. This suggests that the DBpedia lexicon construction method needs to be improved. A reasonable place to start would be the DBpedia category to OntoNotes NE tag mappings.
In order to isolate the contribution of each lexicon and matching method, we compare different sources and matching methods on a BLSTM-CNN model with randomly initialized word embeddings and no other features or sources of external knowledge. Table 9 shows the results. In this weakened model, both lexicons contribute significant 33 improvements over the baseline.
32 Wilcoxon rank sum test, p < 0.001.
Compared to the SENNA lexicon, our DBpedia lexicon is noisier but has broader coverage, which explains why when applying it using the same method as Collobert et al. (2011b), it performs worse on CoNLL-2003 but better on OntoNotes, a dataset containing many more obscure named entities. However, we suspect that the method of Collobert et al. (2011b) is not noise resistant and therefore unsuitable for our lexicon because it fails to distinguish exact and partial matches 34 and does not set a minimum length for partial matching. 35 Instead, when we apply our superior partial matching algorithm and BIOES encoding with our DBpedia lexicon, we gain a significant 36 improvement, allowing our lexicon to perform similarly to the SENNA lexicon. Unfortunately, as we could not reliably remove partial entries from the SENNA lexicon, we were unable to investigate whether or not our lexicon matching method would help in that lexicon.
In addition, using both lexicons together as distinct features provides a further improvement 37 on CoNLL-2003, which we suspect is because the lexicons are complementary; the SENNA lexicon is relatively clean and tailored to newswire, whereas the DBpedia lexicon is noisier but has high coverage.

Analysis of OntoNotes Performance
Table 10 shows the per-genre breakdown of OntoNotes results. As expected, our model performs best on clean text like broadcast news (BN) and newswire (NW), and worst on noisy text like telephone conversation (TC) and Web text (WB).
Our model also substantially improves over previous work on all genres except TC, where the small size of the training data likely hinders learning. Finally, the performance characteristics of our model appear to be quite different from those of the previous CRF models (Finkel and Manning, 2009; Durrett and Klein, 2014), likely because we apply a completely different machine learning method.

Related Research
Named entity recognition is a task with a long history.In this section, we summarize the works we compare with and that influenced our approach.

Named Entity Recognition
Most recent approaches to NER have been characterized by the use of CRF, SVM, and perceptron models, where performance is heavily dependent on feature engineering. Ratinov and Roth (2009) used non-local features, a gazetteer extracted from Wikipedia, and Brown-cluster-like word representations, and achieved an F1 score of 90.80 on CoNLL-2003. Lin and Wu (2009) surpassed them without using a gazetteer by instead using phrase features obtained by performing k-means clustering over a private database of search engine query logs. Passos et al. (2014) obtained nearly the same performance using only public data by training phrase vectors in their lexicon-infused skip-gram model. In order to combat the problem of sparse features, Suzuki et al. (2011) employed large-scale unlabelled data to perform feature reduction and achieved an F1 score of 91.02 on CoNLL-2003, which is the current state of the art for systems without external knowledge.
Training an NER system together with related tasks such as entity linking has recently been shown to improve the state of the art. Durrett and Klein (2014) combined coreference resolution, entity linking, and NER into a single CRF model and added cross-task interaction factors. Their system achieved state of the art results on the OntoNotes dataset, but they did not evaluate on the CoNLL-2003 dataset due to lack of coreference annotations. Luo et al. (2015) achieved state of the art results on CoNLL-2003 by training a joint model over the NER and entity linking tasks, the pair of tasks whose interdependencies contributed the most to the work of Durrett and Klein (2014).
38 We downloaded their publicly released software and model to perform the per-genre evaluation.

NER with Neural Networks
While many approaches involve CRF models, there has also been a long history of research involving neural networks.Early attempts were hindered by

Figure 1 :
Figure 1: The (unrolled) BLSTM for tagging named entities. Multiple tables look up word-level feature vectors. The CNN (Figure 2) extracts a fixed length feature vector from character-level features. For each word, these vectors are concatenated and fed to the BLSTM network and then to the output layers (Figure 3).

Figure 2 :
Figure 2: The convolutional neural network extracts character features from each word. The character embedding and (optionally) the character type feature vector are computed through lookup tables. Then, they are concatenated and passed into the CNN.

Figure 3 :
Figure 3: The output layers ("Out" in Figure 1) decode output into a score for each tag category.

Figure 5 :
Figure 5: Fraction of named entities of each tag category matched completely by entries in each lexicon category of the SENNA/DBpedia combined lexicon.White = higher fraction.
obtained nearly the same performance using only public data by training phrase vectors in their lexicon-infused skip-gram model.In order to combat the problem of sparse features, Suzuki et al. (2011) employed large-scale unlabelled data to perform feature reduction and achieved an F1 score of 91.02 on CoNLL-2003, which is the current state of the art for systems without external knowledge.

Table 3 :
Hyper-parameter search space and final values used for all experiments

Table 4 :
Development set F1 score performance of the best hyper-parameter settings in each optimization round.

Table 5 :
Results of our models, with various feature sets, compared to other published results. The three sections are, in order, our models, published neural network models, and published non-neural network models. For the features, emb = Collobert word embeddings, caps = capitalization feature, lex = lexicon features from both SENNA and DBpedia lexicons. For F1 scores, standard deviations are in parentheses.
19 Evaluation on OntoNotes 5.0 done by Pradhan et al. (2013).
20 Not directly comparable as they evaluated on an earlier version of the corpus with a different data split.

Table 8 :
F1 score results with various dropout values. Models were trained using only the training set for each dataset. All other experiments use dropout = 0.68 for CoNLL-2003 and dropout = 0.63 for OntoNotes 5.0.

Table 9 :
Comparison of lexicon and matching/encoding methods over the BLSTM-CNN model employing random embeddings and no other features.When using both lexicons, the best combination of matching and encoding is Exact-BIOES for SENNA and Partial-BIOES for DBpedia.Note that the SENNA lexicon already contains "partial entries" so exact matching in that case is really just a more primitive form of partial matching.