# Creating GROBID models with DeLFT
GROBID uses a cascade of sequence labeling models to parse complete documents. The particularity of these models is that they use joint text and layout features to identify document structures more accurately. The script `delft/applications/grobidTagger.py` allows the creation of various GROBID models to be used by the GROBID services for parsing structures such as document headers, references, affiliations, authors, dates, etc.
## General command line for training GROBID models in DeLFT
```
usage: grobidTagger.py [-h] [--fold-count FOLD_COUNT] [--architecture ARCHITECTURE] [--use-ELMo]
                       [--embedding EMBEDDING] [--transformer TRANSFORMER] [--output OUTPUT]
                       [--input INPUT] [--feature-indices FEATURE_INDICES]
                       [--max-sequence-length MAX_SEQUENCE_LENGTH] [--batch-size BATCH_SIZE]
                       [--tensorboard]
                       model {train,train_eval,eval,tag}

Trainer for GROBID models using the DeLFT library

positional arguments:
  model                 Name of the model.
  {train,train_eval,eval,tag}

optional arguments:
  -h, --help            show this help message and exit
  --fold-count FOLD_COUNT
                        Number of folds to use when evaluating with n-fold cross validation.
  --architecture ARCHITECTURE
                        Type of model architecture to be used, one of ['BidLSTM', 'BidLSTM_CRF',
                        'BidLSTM_ChainCRF', 'BidLSTM_CNN_CRF', 'BidGRU_CRF', 'BidLSTM_CNN',
                        'BidLSTM_CRF_CASING', 'BidLSTM_CRF_FEATURES', 'BidLSTM_ChainCRF_FEATURES',
                        'BERT', 'BERT_CRF', 'BERT_ChainCRF', 'BERT_CRF_FEATURES', 'BERT_CRF_CHAR',
                        'BERT_CRF_CHAR_FEATURES']
  --use-ELMo            Use ELMo contextual embeddings
  --embedding EMBEDDING
                        The desired pre-trained word embeddings using their descriptions in the
                        file. For local loading, use delft/resources-registry.json. Be sure to
                        use here the same name as in the registry, e.g. ['glove-840B',
                        'fasttext-crawl', 'word2vec'], and check that the path in the registry to
                        the embedding file is correct on your system.
  --transformer TRANSFORMER
                        The desired pre-trained transformer to be used in the selected
                        architecture. For local loading, use delft/resources-registry.json, and
                        be sure to use here the same name as in the registry, e.g. ['bert-base-
                        cased', 'bert-large-cased', 'allenai/scibert_scivocab_cased'], and check
                        that the model path in the registry is correct on your system. Otherwise
                        the HuggingFace transformers hub will be used to fetch the model, see
                        https://huggingface.co/models for model names
  --output OUTPUT       Directory where to save a trained model.
  --input INPUT         Grobid data file to be used for training (train action), for training
                        and evaluation (train_eval action) or just for evaluation (eval action).
  --max-sequence-length MAX_SEQUENCE_LENGTH
                        max-sequence-length parameter to be used.
  --batch-size BATCH_SIZE
                        batch-size parameter to be used.
```
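As a sketch of how the registry lookup works, the snippet below lists the embedding names available for `--embedding` from a registry-shaped JSON document. The field names (`embeddings`, `name`, `path`) and the paths are illustrative placeholders; check your local copy of `delft/resources-registry.json` for the actual structure.

```python
import json

# Illustrative registry snippet — field names and paths are placeholders,
# not guaranteed to match the actual delft/resources-registry.json schema.
registry_json = """
{
  "embeddings": [
    {"name": "glove-840B", "path": "/data/glove.840B.300d.txt"},
    {"name": "fasttext-crawl", "path": "/data/crawl-300d-2M.vec"}
  ]
}
"""

def available_embeddings(registry_text):
    """Return the embedding names that could be passed to --embedding."""
    registry = json.loads(registry_text)
    return [entry["name"] for entry in registry.get("embeddings", [])]

print(available_embeddings(registry_json))  # → ['glove-840B', 'fasttext-crawl']
```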
## GROBID models
DeLFT supports GROBID training data (originally created for CRF) and the GROBID feature matrix to be labelled. The default static embeddings for GROBID models are `glove-840B`, which can be changed with the parameter `--embedding`.
Train a model with all available training data:
```sh
python3 delft/applications/grobidTagger.py *name-of-model* train --architecture *name-of-architecture*
```

where `name-of-model` is one of the GROBID models (`date`, `affiliation-address`, `citation`, `header`, `name-citation`, `name-header`, ...) and `name-of-architecture` is one of `['BidLSTM', 'BidLSTM_CRF', 'BidLSTM_ChainCRF', 'BidLSTM_CNN_CRF', 'BidGRU_CRF', 'BidLSTM_CNN', 'BidLSTM_CRF_CASING', 'BidLSTM_CRF_FEATURES', 'BidLSTM_ChainCRF_FEATURES', 'BERT', 'BERT_CRF', 'BERT_ChainCRF', 'BERT_CRF_FEATURES', 'BERT_CRF_CHAR', 'BERT_CRF_CHAR_FEATURES']`. For instance:

```sh
python3 delft/applications/grobidTagger.py date train --architecture BidLSTM_CRF
```
To segment the training data and evaluate on 10% of it, use the action `train_eval` instead of `train`:
```sh
python3 delft/applications/grobidTagger.py *name-of-model* train_eval --architecture *name-of-architecture*
```
For instance, for the date model:

```sh
python3 delft/applications/grobidTagger.py date train_eval --architecture BidLSTM_CRF
```
Evaluation:

```
f1 (micro): 96.41
             precision    recall  f1-score   support

    <month>     0.9667    0.9831    0.9748        59
     <year>     1.0000    0.9844    0.9921        64
      <day>     0.9091    0.9524    0.9302        42

avg / total     0.9641    0.9758    0.9699       165
```
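The `f1 (micro)` line aggregates true positives, false positives, and false negatives over all classes before computing precision and recall, rather than averaging the per-class scores. A minimal sketch of the computation (the counts below are hypothetical, and this is not the DeLFT evaluation code):

```python
def micro_f1(class_counts):
    """Compute micro-averaged F1 from per-class TP/FP/FN counts."""
    tp = sum(c["tp"] for c in class_counts.values())
    fp = sum(c["fp"] for c in class_counts.values())
    fn = sum(c["fn"] for c in class_counts.values())
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for three label classes:
counts = {
    "<month>": {"tp": 58, "fp": 2, "fn": 1},
    "<year>":  {"tp": 63, "fp": 0, "fn": 1},
    "<day>":   {"tp": 40, "fp": 4, "fn": 2},
}
print(round(micro_f1(counts), 4))  # → 0.9699
```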
To apply a model to some examples:

```sh
python3 delft/applications/grobidTagger.py date tag --architecture BidLSTM_CRF
```
```json
{
    "runtime": 0.509,
    "software": "DeLFT",
    "model": "grobid-date",
    "date": "2018-05-23T14:18:15.833959",
    "texts": [
        {
            "entities": [
                {
                    "score": 1.0,
                    "endOffset": 6,
                    "class": "<month>",
                    "beginOffset": 0,
                    "text": "January"
                },
                {
                    "score": 1.0,
                    "endOffset": 11,
                    "class": "<year>",
                    "beginOffset": 8,
                    "text": "2006"
                }
            ],
            "text": "January 2006"
        },
        {
            "entities": [
                {
                    "score": 1.0,
                    "endOffset": 4,
                    "class": "<month>",
                    "beginOffset": 0,
                    "text": "March"
                },
                {
                    "score": 1.0,
                    "endOffset": 13,
                    "class": "<day>",
                    "beginOffset": 10,
                    "text": "27th"
                },
                {
                    "score": 1.0,
                    "endOffset": 19,
                    "class": "<year>",
                    "beginOffset": 16,
                    "text": "2001"
                }
            ],
            "text": "March the 27th, 2001"
        }
    ]
}
```
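This JSON output is straightforward to post-process; for example, to collect the labelled spans for each input text (the structure below mirrors a trimmed version of the sample output):

```python
import json

# A trimmed version of the tagger output format shown above.
output = json.loads("""
{
  "model": "grobid-date",
  "texts": [
    {"text": "January 2006",
     "entities": [
       {"score": 1.0, "beginOffset": 0, "endOffset": 6, "class": "<month>", "text": "January"},
       {"score": 1.0, "beginOffset": 8, "endOffset": 11, "class": "<year>", "text": "2006"}
     ]}
  ]
}
""")

# Pair each input text with its (label, span) tuples.
for item in output["texts"]:
    spans = [(e["class"], e["text"]) for e in item["entities"]]
    print(item["text"], "->", spans)
# → January 2006 -> [('<month>', 'January'), ('<year>', '2006')]
```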
As usual, depending on the architecture to be used, you can indicate either which embeddings should be used for an RNN model (default is `glove-840B`):
```sh
python3 delft/applications/grobidTagger.py citation train_eval --architecture BidLSTM_CRF_FEATURES --embedding glove-840B
```
or the name of the transformer model you wish to use in an architecture including a transformer layer:
```sh
python3 delft/applications/grobidTagger.py header train --architecture BERT_CRF --transformer allenai/scibert_scivocab_cased
```
With the architectures having a feature channel, the categorical features (as generated by GROBID) will be selected automatically (typically the layout and lexical class features). The models without a feature channel will use only the tokens as input (as the usual Deep Learning models for text).
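In GROBID training data, each line carries a token followed by its feature values and ends with the label, with blank lines separating sequences. A minimal reader could look like this (the feature values in the sample are simplified placeholders, not a real GROBID feature set):

```python
def read_grobid_data(lines):
    """Split GROBID-style lines into (tokens, features, labels) sequences."""
    sequences, tokens, features, labels = [], [], [], []
    for line in lines:
        line = line.strip()
        if not line:  # a blank line ends the current sequence
            if tokens:
                sequences.append((tokens, features, labels))
                tokens, features, labels = [], [], []
            continue
        parts = line.split()
        tokens.append(parts[0])        # first column: the token
        features.append(parts[1:-1])   # middle columns: categorical features
        labels.append(parts[-1])       # last column: the label
    if tokens:
        sequences.append((tokens, features, labels))
    return sequences

# Simplified example — real GROBID files have many more feature columns.
sample = [
    "January LINESTART INITCAP <month>",
    "2006 LINEIN ALLDIGIT <year>",
    "",
]
seqs = read_grobid_data(sample)
print(seqs[0][0], seqs[0][2])  # → ['January', '2006'] ['<month>', '<year>']
```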
Similarly to the NER models, for n-fold training (action `train_eval` only), specify the value of *n* with the parameter `--fold-count`, e.g.:
```sh
python3 delft/applications/grobidTagger.py citation train_eval --architecture BidLSTM_CRF_FEATURES --fold-count=10
```
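For reference, n-fold cross validation partitions the data into n folds and trains n times, each time holding one fold out for evaluation. A sketch of the index bookkeeping (not the DeLFT implementation):

```python
def n_fold_splits(n_samples, n_folds):
    """Yield (train_indices, eval_indices) for each of the n folds."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # Spread any remainder over the first folds so sizes differ by at most 1.
        size = fold_size + (1 if fold < remainder else 0)
        eval_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, eval_idx
        start += size

# With 10 samples and 5 folds, each fold evaluates on 2 held-out samples:
for train_idx, eval_idx in n_fold_splits(10, 5):
    assert len(eval_idx) == 2 and len(train_idx) == 8
```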
By default the Grobid data used are those available under the `data/sequenceLabelling/grobid` subdirectory, but a Grobid data file can be provided with the parameter `--input`:
```sh
python3 delft/applications/grobidTagger.py *name-of-model* train --architecture *name-of-architecture* --input *path-to-the-grobid-data-file-to-be-used-for-training*
```
or
```sh
python3 delft/applications/grobidTagger.py *name-of-model* train_eval --architecture *name-of-architecture* --input *path-to-the-grobid-data-file-to-be-used-for-training_and_eval_with_random_split*
```
The evaluation of a model with a specific Grobid data file can be performed using the `eval` action, specifying the data file with `--input`:
```sh
python3 delft/applications/grobidTagger.py citation eval --architecture *name-of-architecture* --input *path-to-the-grobid-data-file-to-be-used-for-evaluation*
```