DeLFT NER applications
See here for the list of supported sequence labeling architectures.
In general, the best results are obtained with the `BidLSTM_CRF` architecture combined with ELMo embeddings (ELMo is particularly effective for sequence labeling), or with the `BERT_CRF` architecture using a pretrained transformer model specialized in the domain of the application (e.g. SciBERT for scientific NER, CamemBERT for general NER in French, etc.).
NER models can be trained and applied via the script `delft/applications/nerTagger.py`. This page describes some available models and the results obtained with DeLFT on standard NER tasks.
See NER Datasets for more information on the datasets used in this section.
Overview
We have reimplemented in DeLFT some reference neural architectures for NER from the last four years and performed a reproducibility analysis of these systems with comparable evaluation criteria. Unfortunately, in publications, systems are usually compared directly against results reported in different settings, which can bias scores by more than 1.0 point and completely invalidate both the comparison and the interpretation of results.
You can read more about our reproducibility study of neural NER in this blog article. This effort is similar to the work of (Yang and Zhang, 2018) (see also NCRFpp), but it is also extended to BERT for a fair comparison with RNNs for sequence labeling; it can also be related to the motivations of (Pressel et al., 2018) with MEAD.
All reported scores below are f-scores for the CoNLL-2003 NER dataset. We report first the f-score averaged over 10 training runs, and second the best f-score over these 10 runs. All the trained DeLFT models are included in this repository.
| Architecture | Implementation | Glove only (avg / best) | Glove + valid. set (avg / best) | ELMo + Glove (avg / best) | ELMo + Glove + valid. set (avg / best) |
| --- | --- | --- | --- | --- | --- |
| BidLSTM_CRF | DeLFT | 91.03 / 91.38 | 91.37 / 91.69 | 92.57 / 92.80 | 92.95 / 93.21 |
| | (Lample and al., 2016) | - / 90.94 | | | |
| BidLSTM_CNN_CRF | DeLFT | 90.64 / 91.23 | 90.98 / 91.38 | 92.30 / 92.57 | 92.67 / 93.04 |
| | (Ma & Hovy, 2016) | - / 91.21 | | | |
| | (Peters & al. 2018) | | | 92.22** / - | |
| BidLSTM_CNN | DeLFT | 89.49 / 89.96 | 89.85 / 90.13 | 91.66 / 92.00 | 92.01 / 92.16 |
| | (Chiu & Nichols, 2016) | | 90.88*** / - | | |
| BidGRU_CRF | DeLFT | 90.17 / 90.55 | 91.04 / 91.40 | 92.03 / 92.44 | 92.43 / 92.71 |
| | (Peters & al. 2017) | | | | 91.93* / - |
Results with transformer fine-tuning for the CoNLL-2003 NER dataset, with a final CRF activation layer instead of a softmax. A CRF activation layer improves the f-score on average by around +0.10 for sequence labelling tasks, but increases the runtime by 23%:
Architecture | pretrained model | Implementation | f-score |
---|---|---|---|
BERT | bert-base-cased | DeLFT | 91.19 |
BERT_CRF | bert-base-cased +CRF | DeLFT | 91.25 |
BERT_ChainCRF | bert-base-cased +CRF | DeLFT | 91.22 |
BERT | roberta-base | DeLFT | 91.64 |
Note: DeLFT uses `BERT` as the architecture name for transformers in general, but the transformer model can in principle be any transformer variant available on the HuggingFace Hub. DeLFT supports two implementations of a CRF layer to be combined with RNN and transformer architectures: `CRF`, based on TensorFlow Addons, and `ChainCRF`, a custom implementation. Both should produce similar accuracy results, but `ChainCRF` is significantly faster and robust.
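To make the role of the CRF layer concrete: instead of choosing each token's label independently with a softmax, a linear-chain CRF scores entire label sequences with a learned label-to-label transition matrix and decodes the best sequence with the Viterbi algorithm. The following is a generic, illustrative sketch of Viterbi decoding with made-up scores, not DeLFT's actual `CRF`/`ChainCRF` implementation:

```python
import numpy as np

def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence (list of label indices).

    emissions:   (seq_len, n_labels) per-token label scores from the network
    transitions: (n_labels, n_labels) learned score for moving from label i to label j
    """
    seq_len, n_labels = emissions.shape
    score = emissions[0].copy()                      # best score ending in each label at step 0
    backpointers = np.zeros((seq_len, n_labels), dtype=int)

    for t in range(1, seq_len):
        # candidate[i, j] = best score up to step t-1 ending in label i,
        # plus the transition i -> j, plus the emission score of label j at step t
        candidate = score[:, None] + transitions + emissions[t][None, :]
        backpointers[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0)

    # follow the backpointers from the best final label
    best_last = int(score.argmax())
    path = [best_last]
    for t in range(seq_len - 1, 0, -1):
        path.append(int(backpointers[t, path[-1]]))
    return list(reversed(path))

# Toy example with 3 labels (0: O, 1: B-PER, 2: I-PER). A strongly negative
# transition O -> I-PER lets the CRF rule out invalid tag sequences that a
# per-token softmax could still produce.
emissions = np.array([[2.0, 1.5, 0.1],
                      [0.2, 0.3, 1.8],
                      [1.0, 0.2, 0.4]])
transitions = np.array([[ 0.5,  0.2, -4.0],   # from O
                        [ 0.1,  0.0,  1.0],   # from B-PER
                        [ 0.2,  0.0,  0.5]])  # from I-PER

# Per-token argmax would give the invalid sequence O, I-PER, O;
# the Viterbi decoder returns [1, 2, 0], i.e. B-PER, I-PER, O.
print(viterbi_decode(emissions, transitions))
```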
For reference, the original result reported for the `bert-base-cased` model in (Devlin & al. 2018) is 92.4, using "document context".
For DeLFT, the average is obtained over 10 training runs (see the latest full results), while (Devlin & al. 2018) averaged over 5 runs. As noted here, the original CoNLL-2003 NER results with BERT reported in the Google Research paper are not easily reproducible, and the score obtained by DeLFT is very similar to those obtained by the other systems that have reproduced this experiment in similar conditions (e.g. without "document context").
* reported f-score using Senna word embeddings and not Glove.
** f-score is averaged over 5 training runs.
*** reported f-score with Senna word embeddings (Collobert 50d) averaged over 10 runs, including case features and not including lexical features. DeLFT implementation of the same architecture includes the capitalization features too, but uses the more efficient GloVe 300d embeddings.
Command Line Interface
Different datasets and languages are supported. They can be specified via the command line parameters. The general usage of the CLI is as follows:
usage: nerTagger.py [-h] [--fold-count FOLD_COUNT] [--lang LANG] [--dataset-type DATASET_TYPE]
[--train-with-validation-set] [--architecture ARCHITECTURE] [--data-path DATA_PATH]
[--file-in FILE_IN] [--file-out FILE_OUT] [--embedding EMBEDDING]
[--transformer TRANSFORMER]
action
Neural Named Entity Recognizers based on DeLFT
positional arguments:
action one of [train, train_eval, eval, tag]
optional arguments:
-h, --help show this help message and exit
--fold-count FOLD_COUNT
number of folds or re-runs to be used when training
--lang LANG language of the model as ISO 639-1 code (en, fr, de, etc.)
--dataset-type DATASET_TYPE
dataset to be used for training the model
--train-with-validation-set
Use the validation set for training together with the training set
--architecture ARCHITECTURE
type of model architecture to be used, one of ['BidLSTM_CRF', 'BidLSTM_CNN_CRF',
'BidLSTM_CNN_CRF', 'BidGRU_CRF', 'BidLSTM_CNN', 'BidLSTM_CRF_CASING', 'BERT',
'BERT_CRF', 'BERT_CRF_FEATURES', 'BERT_CRF_CHAR', 'BERT_CRF_CHAR_FEATURES']
--data-path DATA_PATH
path to the corpus of documents for training (only use currently with Ontonotes
corpus in orginal XML format)
--file-in FILE_IN path to a text file to annotate
--file-out FILE_OUT path for outputting the resulting JSON NER anotations
--embedding EMBEDDING
The desired pre-trained word embeddings using their descriptions in the file. For
local loading, use delft/resources-registry.json. Be sure to use here the same
name as in the registry, e.g. ['glove-840B', 'fasttext-crawl', 'word2vec'] and
that the path in the registry to the embedding file is correct on your system.
--transformer TRANSFORMER
The desired pre-trained transformer to be used in the selected architecture. For
local loading use, delft/resources-registry.json, and be sure to use here the
same name as in the registry, e.g. ['bert-base-cased', 'bert-large-cased',
'allenai/scibert_scivocab_cased'] and that the path in the registry to the model
path is correct on your system. HuggingFace transformers hub will be used
otherwise to fetch the model, see https://huggingface.co/models for model names
CoNLL 2003
DeLFT comes with various trained models for the CoNLL-2003 NER dataset.
By default, the `BidLSTM_CRF` architecture is used. Using the `BidLSTM_CRF` model with ELMo embeddings, following [7] with some parameter optimisations and warm-up, improves the f1 score on CoNLL 2003 significantly.
For re-training a model, the CoNLL-2003 NER dataset (`eng.train`, `eng.testa`, `eng.testb`) must be present under `data/sequenceLabelling/CoNLL-2003/` in the IOB2 tagging scheme (see for instance here and here). The CoNLL 2003 dataset (English) is the default dataset and English is the default language, but you can also indicate them explicitly with the parameters `--dataset-type conll2003` and `--lang en`.
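For reference, these files contain one token per line (token, POS tag, chunk tag, NER tag), with a blank line between sentences; in the IOB2 scheme every entity starts with a `B-` tag. A few illustrative lines in this format:

```
EU NNP B-NP B-ORG
rejects VBZ B-VP O
German JJ B-NP B-MISC
call NN I-NP O
to TO B-VP O
boycott VB I-VP O
British JJ B-NP B-MISC
lamb NN I-NP O
. . O O
```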
For training and evaluating following the traditional approach (training with the train set without validation set, and evaluating on test set), use:
> python3 delft/applications/nerTagger.py --dataset-type conll2003 train_eval
Some recent works like (Chiu & Nichols, 2016) and (Peters and al., 2017) also train with the validation set, which obviously leads to better accuracy (they nevertheless compare their scores with scores previously reported for models trained differently, which is arguably a bit unfair; this aspect is mentioned in (Ma & Hovy, 2016)). To train with both the train and validation sets, use the parameter `--train-with-validation-set`:
> python3 delft/applications/nerTagger.py --dataset-type conll2003 --train-with-validation-set train_eval
Note that, by default, the `BidLSTM_CRF` model is used. (Documentation on selecting other models and setting hyperparameters to be included here!)
For evaluating against CoNLL 2003 testb set with the existing model:
> python3 delft/applications/nerTagger.py --dataset-type conll2003 eval
Evaluation on test set:
f1 (micro): 91.35
precision recall f1-score support
ORG 0.8795 0.9007 0.8899 1661
PER 0.9647 0.9623 0.9635 1617
MISC 0.8261 0.8120 0.8190 702
LOC 0.9260 0.9305 0.9282 1668
avg / total 0.9109 0.9161 0.9135 5648
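(As a reminder, the f1-score is the harmonic mean of precision and recall; for the micro average above, 2 × 0.9109 × 0.9161 / (0.9109 + 0.9161) ≈ 0.9135, i.e. the 91.35 reported as micro f1.)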
If the model has also been trained with the validation set (`--train-with-validation-set`), similarly to (Chiu & Nichols, 2016) or (Peters and al., 2017), results are significantly better:
Evaluation on test set:
f1 (micro): 91.60
precision recall f1-score support
LOC 0.9219 0.9418 0.9318 1668
MISC 0.8277 0.8077 0.8176 702
PER 0.9594 0.9635 0.9614 1617
ORG 0.9029 0.8904 0.8966 1661
avg / total 0.9158 0.9163 0.9160 5648
Using ELMo with the best model obtained over 10 training runs (the validation set is not used for training, only for early stopping):
Evaluation on test set:
f1 (micro): 92.80
precision recall f1-score support
LOC 0.9401 0.9412 0.9407 1668
MISC 0.8104 0.8405 0.8252 702
ORG 0.9107 0.9151 0.9129 1661
PER 0.9800 0.9722 0.9761 1617
all (micro avg.) 0.9261 0.9299 0.9280 5648
To use a BERT architecture for sequence labelling (pre-trained transformer with fine-tuning), for instance with the `bert-base-cased` pre-trained model, use:
> python3 delft/applications/nerTagger.py --architecture BERT_CRF --dataset-type conll2003 --fold-count 10 --transformer bert-base-cased train_eval
average over 10 folds
precision recall f1-score support
ORG 0.8804 0.9114 0.8957 1661
MISC 0.7823 0.8189 0.8002 702
PER 0.9633 0.9576 0.9605 1617
LOC 0.9290 0.9316 0.9303 1668
macro f1 = 0.9120
macro precision = 0.9050
macro recall = 0.9191
For training with all the available data:
> python3 delft/applications/nerTagger.py --dataset-type conll2003 train
To take into account the strong impact of the random seed, you need to train multiple times using the n-fold option. The model will be trained n times with different seed values but with the same data splits when the evaluation set is provided. The evaluation then reports the average scores over these n models (against the test set) and the scores of the best model, which is saved. For instance, to train 10 times, use:
> python3 delft/applications/nerTagger.py --dataset-type conll2003 --fold-count 10 train_eval
After training a model, to tag some text, for instance in a file `data/test/test.ner.en.txt`, use the command:
> python3 delft/applications/nerTagger.py --dataset-type conll2003 --file-in data/test/test.ner.en.txt tag
For instance, to tag the text with a specific architecture that has been trained previously:
> python3 delft/applications/nerTagger.py --dataset-type conll2003 --file-in data/test/test.ner.en.txt --architecture BERT_CRF_FEATURES --transformer bert-base-cased tag
Note that, currently, the input text file must contain one sentence per line, so the text must be pre-segmented into sentences. To write the JSON annotations to a file instead of the standard output, use the parameter `--file-out`. Prediction runs at around 7,400 tokens per second for the `BidLSTM_CRF` architecture on a GeForce GTX 1080 Ti.
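If you start from raw running text, it first has to be split into one sentence per line. Below is a minimal sketch using a naive rule-based split; the output path is just an example, and a proper sentence segmenter (e.g. from NLTK or spaCy) is preferable for real text:

```python
import re

def one_sentence_per_line(raw_text: str) -> str:
    # Naive split on sentence-final punctuation followed by whitespace;
    # good enough for a quick test, not for production text.
    sentences = re.split(r"(?<=[.!?])\s+", raw_text.strip())
    return "\n".join(s for s in sentences if s)

raw = ("The University of California has found that 40 percent of its students "
       "suffer food insecurity. At four state universities in Illinois, that "
       "number is 35 percent.")

with open("data/test/my_input.txt", "w", encoding="utf-8") as f:
    f.write(one_sentence_per_line(raw) + "\n")
```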
The `tag` action produces a JSON output with entities, scores and character offsets like this:
{
"runtime": 0.34,
"texts": [
{
"text": "The University of California has found that 40 percent of its students suffer food insecurity. At four state universities in Illinois, that number is 35 percent.",
"entities": [
{
"text": "University of California",
"endOffset": 32,
"score": 1.0,
"class": "ORG",
"beginOffset": 4
},
{
"text": "Illinois",
"endOffset": 134,
"score": 1.0,
"class": "LOC",
"beginOffset": 125
}
]
},
{
"text": "President Obama is not speaking anymore from the White House.",
"entities": [
{
"text": "Obama",
"endOffset": 18,
"score": 1.0,
"class": "PER",
"beginOffset": 10
},
{
"text": "White House",
"endOffset": 61,
"score": 1.0,
"class": "LOC",
"beginOffset": 49
}
]
}
],
"software": "DeLFT",
"date": "2018-05-02T12:24:55.529301",
"model": "ner"
}
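The same tagging workflow can also be scripted. Below is a minimal sketch that calls the CLI from Python and reads back the JSON written with `--file-out`; the file names are placeholders and a trained CoNLL-2003 model is assumed to be available:

```python
import json
import subprocess

# Run the nerTagger CLI on a pre-segmented input file and collect the JSON output.
# "input.txt" and "output.json" are placeholder file names.
subprocess.run(
    [
        "python3", "delft/applications/nerTagger.py",
        "--dataset-type", "conll2003",
        "--file-in", "input.txt",
        "--file-out", "output.json",
        "tag",
    ],
    check=True,
)

with open("output.json") as f:
    result = json.load(f)

# Print one line per recognized entity, using the fields shown in the example above.
for doc in result["texts"]:
    for entity in doc["entities"]:
        print(entity["class"], entity["text"], entity["beginOffset"], entity["endOffset"])
```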
For English NER tagging, when static embeddings are used, the default is GloVe (`glove-840B`). Other static embeddings can be specified with the parameter `--embedding`, for instance:
> python3 delft/applications/nerTagger.py --dataset-type conll2003 --embedding word2vec train_eval
Ontonotes 5.0 CoNLL 2012
DeLFT comes with models pre-trained on the Ontonotes 5.0 CoNLL-2012 NER dataset. As dataset-type identifier, use `conll2012`. All the options valid for the CoNLL-2003 NER dataset can be used with this dataset. Static embeddings for Ontonotes can be set with the parameter `--embedding`.
For re-training, the assembled Ontonotes dataset following CoNLL-2012 must be available and converted into the IOB2 tagging scheme; see here for more details. To train and evaluate following the traditional approach (training with the train set without the validation set, and evaluating on the test set) with the `BidLSTM_CRF` architecture, use:
> python3 delft/applications/nerTagger.py --dataset-type conll2012 train_eval --architecture BidLSTM_CRF --embedding glove-840B
training runtime: 23692.0 seconds
Evaluation on test set:
f1 (micro): 87.01
precision recall f1-score support
DATE 0.8029 0.8695 0.8349 1602
CARDINAL 0.8130 0.8139 0.8135 935
PERSON 0.9061 0.9371 0.9214 1988
GPE 0.9617 0.9411 0.9513 2240
ORG 0.8799 0.8568 0.8682 1795
MONEY 0.8903 0.8790 0.8846 314
NORP 0.9226 0.9501 0.9361 841
ORDINAL 0.7873 0.8923 0.8365 195
TIME 0.5772 0.6698 0.6201 212
WORK_OF_ART 0.6000 0.5060 0.5490 166
LOC 0.7340 0.7709 0.7520 179
EVENT 0.5000 0.5556 0.5263 63
PRODUCT 0.6528 0.6184 0.6351 76
PERCENT 0.8717 0.8567 0.8642 349
QUANTITY 0.7155 0.7905 0.7511 105
FAC 0.7167 0.6370 0.6745 135
LANGUAGE 0.8462 0.5000 0.6286 22
LAW 0.7308 0.4750 0.5758 40
all (micro avg.) 0.8647 0.8755 0.8701 11257
With `bert-base-cased` and the `BERT_CRF` architecture:
> python3 delft/applications/nerTagger.py train_eval --dataset-type conll2012 --architecture BERT_CRF --transformer bert-base-cased
training runtime: 14367.8 seconds
Evaluation on test set:
precision recall f1-score support
CARDINAL 0.8443 0.8064 0.8249 935
DATE 0.8474 0.8770 0.8620 1602
EVENT 0.7460 0.7460 0.7460 63
FAC 0.7163 0.7481 0.7319 135
GPE 0.9657 0.9437 0.9546 2240
LANGUAGE 0.8889 0.7273 0.8000 22
LAW 0.6857 0.6000 0.6400 40
LOC 0.6965 0.7821 0.7368 179
MONEY 0.8882 0.9108 0.8994 314
NORP 0.9350 0.9584 0.9466 841
ORDINAL 0.8199 0.8872 0.8522 195
ORG 0.8908 0.8997 0.8952 1795
PERCENT 0.8917 0.8968 0.8943 349
PERSON 0.9396 0.9472 0.9434 1988
PRODUCT 0.5600 0.7368 0.6364 76
QUANTITY 0.6187 0.8190 0.7049 105
TIME 0.6184 0.6651 0.6409 212
WORK_OF_ART 0.6138 0.6988 0.6535 166
all (micro avg.) 0.8825 0.8951 0.8888 11257
With ELMo embeddings (using the default hyper-parameters, except the batch size which is increased to better learn the less frequent classes):
> python3 delft/applications/nerTagger.py train_eval --dataset-type conll2012 --architecture BidLSTM_CRF --embedding glove-840B --use-ELMo
training runtime: 36812.025 seconds
Evaluation on test set:
precision recall f1-score support
CARDINAL 0.8534 0.8342 0.8437 935
DATE 0.8499 0.8733 0.8615 1602
EVENT 0.7091 0.6190 0.6610 63
FAC 0.7667 0.6815 0.7216 135
GPE 0.9682 0.9527 0.9604 2240
LANGUAGE 0.9286 0.5909 0.7222 22
LAW 0.7000 0.5250 0.6000 40
LOC 0.7759 0.7542 0.7649 179
MONEY 0.9054 0.9140 0.9097 314
NORP 0.9323 0.9501 0.9411 841
ORDINAL 0.8082 0.9077 0.8551 195
ORG 0.8950 0.9019 0.8984 1795
PERCENT 0.9117 0.9169 0.9143 349
PERSON 0.9430 0.9482 0.9456 1988
PRODUCT 0.6410 0.6579 0.6494 76
QUANTITY 0.7890 0.8190 0.8037 105
TIME 0.6683 0.6462 0.6571 212
WORK_OF_ART 0.6301 0.6566 0.6431 166
all (micro avg.) 0.8943 0.8956 0.8949 11257
French model (based on Le Monde corpus)
Note that the Le Monde corpus is subject to copyright and is limited to research usage only; it is usually referred to as "corpus FTB". The corpus file `ftb6_ALL.EN.docs.relinked.xml` must be located under `delft/data/sequenceLabelling/leMonde/`. This is the default French model, so it will be used by simply indicating the language as parameter: `--lang fr`, but you can also indicate the dataset explicitly with `--dataset-type ftb`. The default static embeddings for French language models are `wiki.fr`, which can be changed with the parameter `--embedding`.
As before, for training and evaluating use:
> python3 delft/applications/nerTagger.py --lang fr --dataset-type ftb train_eval
In practice, we need to repeat training and evaluation several times to neutralise random seed effects and to average scores, here ten times:
> python3 delft/applications/nerTagger.py --lang fr --dataset-type ftb --fold-count 10 train_eval
The performance is as follows for the `BidLSTM_CRF` architecture and fastText `wiki.fr` embeddings, averaged over 10 training runs:
----------------------------------------------------------------------
** Worst ** model scores - run 2
precision recall f1-score support
<artifact> 1.0000 0.5000 0.6667 8
<business> 0.8242 0.8772 0.8499 342
<institution> 0.8571 0.7826 0.8182 23
<location> 0.9386 0.9582 0.9483 383
<organisation> 0.8750 0.7292 0.7955 240
<person> 0.9631 0.9457 0.9543 221
all (micro avg.) 0.8964 0.8817 0.8890 1217
** Best ** model scores - run 3
precision recall f1-score support
<artifact> 1.0000 0.7500 0.8571 8
<business> 0.8457 0.8977 0.8709 342
<institution> 0.8182 0.7826 0.8000 23
<location> 0.9367 0.9661 0.9512 383
<organisation> 0.8832 0.7875 0.8326 240
<person> 0.9459 0.9502 0.9481 221
all (micro avg.) 0.9002 0.9039 0.9020 1217
----------------------------------------------------------------------
Average over 10 folds
precision recall f1-score support
<artifact> 1.0000 0.6000 0.7432 8
<business> 0.8391 0.8830 0.8605 342
<institution> 0.8469 0.7652 0.8035 23
<location> 0.9388 0.9645 0.9514 383
<organisation> 0.8644 0.7592 0.8079 240
<person> 0.9463 0.9529 0.9495 221
all (micro avg.) 0.8961 0.8929 0.8945
With frELMo:
----------------------------------------------------------------------
** Worst ** model scores - run 2
precision recall f1-score support
<artifact> 1.0000 0.5000 0.6667 8
<business> 0.8704 0.9035 0.8867 342
<institution> 0.8000 0.6957 0.7442 23
<location> 0.9342 0.9634 0.9486 383
<organisation> 0.8043 0.7875 0.7958 240
<person> 0.9641 0.9729 0.9685 221
all (micro avg.) 0.8945 0.9055 0.9000 1217
** Best ** model scores - run 3
precision recall f1-score support
<artifact> 1.0000 0.7500 0.8571 8
<business> 0.8883 0.9298 0.9086 342
<institution> 0.8500 0.7391 0.7907 23
<location> 0.9514 0.9713 0.9612 383
<organisation> 0.8597 0.7917 0.8243 240
<person> 0.9774 0.9774 0.9774 221
all (micro avg.) 0.9195 0.9195 0.9195 1217
----------------------------------------------------------------------
Average over 10 folds
precision recall f1-score support
<artifact> 0.8833 0.5125 0.6425 8
<business> 0.8803 0.9067 0.8933 342
<institution> 0.7933 0.7391 0.7640 23
<location> 0.9438 0.9679 0.9557 383
<organisation> 0.8359 0.8004 0.8176 240
<person> 0.9699 0.9760 0.9729 221
all (micro avg.) 0.9073 0.9118 0.9096
Using `camembert-base` as the transformer layer in a `BERT_CRF` architecture:
> python3 delft/applications/nerTagger.py --lang fr --dataset-type ftb train_eval --architecture BERT_CRF --transformer camembert-base
precision recall f1-score support
<artifact> 0.0000 0.0000 0.0000 8
<business> 0.8940 0.9123 0.9030 342
<institution> 0.6923 0.7826 0.7347 23
<location> 0.9563 0.9713 0.9637 383
<organisation> 0.8270 0.8167 0.8218 240
<person> 0.9688 0.9819 0.9753 221
all (micro avg.) 0.9102 0.9162 0.9132 1217
For historical reasons, we can also consider a particular split of the FTB corpus into train, dev and test sets with a forced tokenization (like the old CoNLL 2003 NER), which was used in previous work for comparison. Obviously the evaluation depends on this particular split, and n-fold cross-validation is a much better practice and should be preferred (as well as a format that does not force a tokenization). To use the forced-split FTB (with the files `ftb6_dev.conll`, `ftb6_test.conll` and `ftb6_train.conll` located under `delft/data/sequenceLabelling/leMonde/`), use the parameter `--dataset-type ftb_force_split`:
> python3 delft/applications/nerTagger.py --lang fr --dataset-type ftb_force_split --fold-count 10 train_eval
which gives, for the `BidLSTM_CRF` architecture and fastText `wiki.fr` embeddings, averaged over 10 training runs:
----------------------------------------------------------------------
** Worst ** model scores - run 4
precision recall f1-score support
Company 0.7908 0.7690 0.7797 290
FictionCharacter 0.0000 0.0000 0.0000 2
Location 0.9164 0.9164 0.9164 347
Organization 0.7895 0.7235 0.7550 311
Person 0.9000 0.9220 0.9108 205
Product 1.0000 0.3333 0.5000 3
all (micro avg.) 0.8498 0.8256 0.8375 1158
** Best ** model scores - run 0
precision recall f1-score support
Company 0.8026 0.8552 0.8280 290
FictionCharacter 0.0000 0.0000 0.0000 2
Location 0.9326 0.9164 0.9244 347
Organization 0.8244 0.7395 0.7797 311
Person 0.8826 0.9171 0.8995 205
Product 1.0000 1.0000 1.0000 3
all (micro avg.) 0.8620 0.8523 0.8571 1158
----------------------------------------------------------------------
Average over 10 folds
precision recall f1-score support
Company 0.7920 0.8148 0.8030 290
FictionCharacter 0.0000 0.0000 0.0000 2
Location 0.9234 0.9098 0.9165 347
Organization 0.8071 0.7328 0.7681 311
POI 0.0000 0.0000 0.0000 0
Person 0.8974 0.9254 0.9112 205
Product 1.0000 0.9000 0.9300 3
all (micro avg.) 0.8553 0.8396 0.8474
With frELMo:
----------------------------------------------------------------------
** Worst ** model scores - run 3
precision recall f1-score support
Company 0.8215 0.8414 0.8313 290
FictionCharacter 0.0000 0.0000 0.0000 2
Location 0.9020 0.9280 0.9148 347
Organization 0.7833 0.7556 0.7692 311
Person 0.9327 0.9463 0.9395 205
Product 0.0000 0.0000 0.0000 3
all (micro avg.) 0.8563 0.8592 0.8578 1158
** Best ** model scores - run 1
precision recall f1-score support
Company 0.8289 0.8690 0.8485 290
FictionCharacter 0.0000 0.0000 0.0000 2
Location 0.9290 0.9424 0.9356 347
Organization 0.8475 0.7685 0.8061 311
Person 0.9327 0.9463 0.9395 205
Product 0.6667 0.6667 0.6667 3
all (micro avg.) 0.8825 0.8756 0.8791 1158
----------------------------------------------------------------------
Average over 10 folds
precision recall f1-score support
Company 0.8195 0.8503 0.8346 290
FictionCharacter 0.0000 0.0000 0.0000 2
Location 0.9205 0.9363 0.9283 347
Organization 0.8256 0.7595 0.7910 311
POI 0.0000 0.0000 0.0000 0
Person 0.9286 0.9454 0.9369 205
Product 0.7417 0.6667 0.6824 3
all (micro avg.) 0.8718 0.8666 0.8691
For the `ftb_force_split` dataset, as for CoNLL 2003, you can use the `--train-with-validation-set` parameter to add the validation set to the training data. The above results are all obtained without `--train-with-validation-set` (which is the common approach).
Finally, for training with the whole dataset without evaluation (e.g. for production):
> python3 delft/applications/nerTagger.py --lang fr --dataset-type ftb train
and for annotating some examples:
> python3 delft/applications/nerTagger.py --lang fr --dataset-type ftb --file-in data/test/test.ner.fr.txt tag
{
"date": "2018-06-11T21:25:03.321818",
"runtime": 0.511,
"software": "DeLFT",
"model": "ner-fr-lemonde",
"texts": [
{
"entities": [
{
"beginOffset": 5,
"endOffset": 13,
"score": 1.0,
"text": "Allemagne",
"class": "<location>"
},
{
"beginOffset": 57,
"endOffset": 68,
"score": 1.0,
"text": "Donald Trump",
"class": "<person>"
}
],
"text": "Or l’Allemagne pourrait préférer la retenue, de peur que Donald Trump ne surtaxe prochainement les automobiles étrangères."
}
]
}
The above work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
Insult recognition
A small experimental model for recognising insults and threats in texts, based on the Wikipedia comments from the Kaggle Wikipedia Toxic Comments dataset, English only. It uses a small, manually labelled dataset.
usage: insultTagger.py [-h] [--fold-count FOLD_COUNT] [--architecture ARCHITECTURE]
[--embedding EMBEDDING] [--transformer TRANSFORMER]
action
Experimental insult recognizer for the Wikipedia toxic comments dataset
positional arguments:
action
optional arguments:
-h, --help show this help message and exit
--fold-count FOLD_COUNT
--architecture ARCHITECTURE
Type of model architecture to be used, one of ['BidLSTM_CRF', 'BidLSTM_CNN_CRF',
'BidLSTM_CNN_CRF', 'BidGRU_CRF', 'BidLSTM_CNN', 'BidLSTM_CRF_CASING', 'BERT',
'BERT_CRF', 'BERT_CRF_FEATURES', 'BERT_CRF_CHAR', 'BERT_CRF_CHAR_FEATURES']
--embedding EMBEDDING
The desired pre-trained word embeddings using their descriptions in the file. For
local loading, use delft/resources-registry.json. Be sure to use here the same
name as in the registry, e.g. ['glove-840B', 'fasttext-crawl', 'word2vec'] and
that the path in the registry to the embedding file is correct on your system.
--transformer TRANSFORMER
The desired pre-trained transformer to be used in the selected architecture. For
local loading use, delft/resources-registry.json, and be sure to use here the
same name as in the registry, e.g. ['bert-base-cased', 'bert-large-cased',
'allenai/scibert_scivocab_cased'] and that the path in the registry to the model
path is correct on your system. HuggingFace transformers hub will be used
otherwise to fetch the model, see https://huggingface.co/models for model names
For training:
> python3 delft/applications/insultTagger.py train
By default training uses the whole train set.
Example of a small tagging test:
> python3 delft/applications/insultTagger.py tag
will produce (socially offensive language warning!) a result like this:
{
"runtime": 0.969,
"texts": [
{
"entities": [],
"text": "This is a gentle test."
},
{
"entities": [
{
"score": 1.0,
"endOffset": 20,
"class": "<insult>",
"beginOffset": 9,
"text": "moronic wimp"
},
{
"score": 1.0,
"endOffset": 56,
"class": "<threat>",
"beginOffset": 54,
"text": "die"
}
],
"text": "you're a moronic wimp who is too lazy to do research! die in hell !!"
}
],
"software": "DeLFT",
"date": "2018-05-14T17:22:01.804050",
"model": "insult"
}