Creating GROBID models with DeLFT

GROBID uses a cascade of sequence labeling models to parse complete documents. A distinctive feature of these models is that they combine text and layout features to identify document structures more accurately. The script delft/applications/grobidTagger.py trains the various GROBID models that the GROBID services use to parse structures such as document headers, references, affiliations, authors, and dates.

General command line for training GROBID models in DeLFT

usage: grobidTagger.py [-h] [--fold-count FOLD_COUNT]
                       [--architecture ARCHITECTURE] [--output OUTPUT]
                       [--embedding EMBEDDING] [--transformer TRANSFORMER]
                       [--input INPUT] [--incremental]
                       [--input-model INPUT_MODEL]
                       [--max-sequence-length MAX_SEQUENCE_LENGTH]
                       [--batch-size BATCH_SIZE] [--patience PATIENCE]
                       [--learning-rate LEARNING_RATE] [--max-epoch MAX_EPOCH]
                       [--early-stop EARLY_STOP] [--multi-gpu]
                       [--num-workers NUM_WORKERS] [--wandb]
                       model {train,train_eval,eval,tag}

Trainer for GROBID models using the DeLFT library

positional arguments:
  model                 Name of the model.
  {train,train_eval,eval,tag}

options:
  -h, --help            show this help message and exit
  --fold-count FOLD_COUNT
                        Number of fold to use when evaluating with n-fold cross validation.
  --architecture ARCHITECTURE
                        Type of model architecture to be used, one of
                        ['BidLSTM', 'BidLSTM_CRF', 'BidLSTM_ChainCRF',
                        'BidLSTM_CNN_CRF', 'BidLSTM_CNN_CRF', 'BidGRU_CRF',
                        'BidLSTM_CNN', 'BidLSTM_CRF_CASING',
                        'BidLSTM_CRF_FEATURES', 'BidLSTM_ChainCRF_FEATURES',
                        'BERT', 'BERT_FEATURES', 'BERT_CRF', 'BERT_ChainCRF',
                        'BERT_CRF_FEATURES', 'BERT_ChainCRF_FEATURES',
                        'BERT_CRF_CHAR', 'BERT_CRF_CHAR_FEATURES']
  --output OUTPUT       Directory where to save a trained model.
  --embedding EMBEDDING
                        The desired pre-trained word embeddings, see --help
                        and `delft/resources-registry.json` (e.g. 'glove-840B',
                        'fasttext-crawl', 'word2vec').
  --transformer TRANSFORMER
                        The desired pre-trained transformer to be used in the
                        selected architecture, either a local entry from
                        `delft/resources-registry.json` or a HuggingFace Hub
                        model id (e.g. 'bert-base-cased',
                        'allenai/scibert_scivocab_cased').
  --input INPUT         Grobid data file to be used for training (train action),
                        for training and evaluation (train_eval action) or just
                        for evaluation (eval action).
  --incremental         training is incremental, starting from existing model
                        if present.
  --input-model INPUT_MODEL
                        In case of incremental training, path to an existing
                        model to be used to start the training, instead of the
                        default one.
  --max-sequence-length MAX_SEQUENCE_LENGTH
                        max-sequence-length parameter to be used.
  --batch-size BATCH_SIZE
                        batch-size parameter to be used.
  --patience PATIENCE   patience, number of extra epochs to perform after the
                        best epoch before stopping a training.
  --learning-rate LEARNING_RATE
                        Initial learning rate.
  --max-epoch MAX_EPOCH
                        Maximum number of epochs for training.
  --early-stop EARLY_STOP
                        Force early training termination when metrics scores
                        are not improving after a number of epochs equals to
                        the patience parameter.
  --multi-gpu           Enable distributed computing across multiple GPUs (the
                        batch size needs to be set accordingly using
                        --batch-size).
  --num-workers NUM_WORKERS
                        Number of worker processes for data loading (default:
                        1, use 0 or 1 for no multiprocessing).
  --wandb               Enable the logging of the training using Weights and
                        Biases.

GROBID models

DeLFT supports GROBID training data (originally created for CRF models) and GROBID feature matrices to be labelled. The default static embeddings for GROBID models are glove-840B; use the --embedding parameter to change them.

Train a model with all available training data:

python3 delft/applications/grobidTagger.py *name-of-model* train --architecture *name-of-architecture*

where name-of-model is one of the GROBID models (date, affiliation-address, citation, header, name-citation, name-header, ...) and name-of-architecture is one of ['BidLSTM', 'BidLSTM_CRF', 'BidLSTM_ChainCRF', 'BidLSTM_CNN_CRF', 'BidGRU_CRF', 'BidLSTM_CNN', 'BidLSTM_CRF_CASING', 'BidLSTM_CRF_FEATURES', 'BidLSTM_ChainCRF_FEATURES', 'BERT', 'BERT_FEATURES', 'BERT_CRF', 'BERT_ChainCRF', 'BERT_CRF_FEATURES', 'BERT_ChainCRF_FEATURES', 'BERT_CRF_CHAR', 'BERT_CRF_CHAR_FEATURES']. For instance:

python3 delft/applications/grobidTagger.py date train --architecture BidLSTM_CRF

To split the training data and evaluate on 10% of it, use the action train_eval instead of train:

python3 delft/applications/grobidTagger.py *name-of-model* train_eval --architecture *name-of-architecture*

For instance, for the date model:

python3 delft/applications/grobidTagger.py date train_eval --architecture BidLSTM_CRF
        Evaluation:
        f1 (micro): 96.41
                 precision    recall  f1-score   support

        <month>     0.9667    0.9831    0.9748        59
         <year>     1.0000    0.9844    0.9921        64
          <day>     0.9091    0.9524    0.9302        42

    avg / total     0.9641    0.9758    0.9699       165
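
The f1-score column of such a report is the harmonic mean of the precision and recall columns. As a quick sanity check, the following minimal Python sketch recomputes it from the rounded values printed above. It is not part of DeLFT, and the function name `f1_score` is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the evaluation report above
report = {
    "<month>": (0.9667, 0.9831),
    "<year>": (1.0000, 0.9844),
    "<day>": (0.9091, 0.9524),
}

for label, (p, r) in report.items():
    print(f"{label:>8}  f1 = {f1_score(p, r):.4f}")
```

Because the inputs are already rounded to four decimals, the recomputed values can differ from the report in the last digit.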

To apply a model to some examples:

python3 delft/applications/grobidTagger.py date tag --architecture BidLSTM_CRF
{
    "runtime": 0.509,
    "software": "DeLFT",
    "model": "grobid-date",
    "date": "2018-05-23T14:18:15.833959",
    "texts": [
        {
            "entities": [
                {
                    "score": 1.0,
                    "endOffset": 6,
                    "class": "<month>",
                    "beginOffset": 0,
                    "text": "January"
                },
                {
                    "score": 1.0,
                    "endOffset": 11,
                    "class": "<year>",
                    "beginOffset": 8,
                    "text": "2006"
                }
            ],
            "text": "January 2006"
        },
        {
            "entities": [
                {
                    "score": 1.0,
                    "endOffset": 4,
                    "class": "<month>",
                    "beginOffset": 0,
                    "text": "March"
                },
                {
                    "score": 1.0,
                    "endOffset": 13,
                    "class": "<day>",
                    "beginOffset": 10,
                    "text": "27th"
                },
                {
                    "score": 1.0,
                    "endOffset": 19,
                    "class": "<year>",
                    "beginOffset": 16,
                    "text": "2001"
                }
            ],
            "text": "March the 27th, 2001"
        }
    ]
}
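
The JSON above can be consumed with any JSON parser. The sketch below uses only the Python standard library and a shortened copy of the output shown above; `extract_entities` is a hypothetical helper, not a DeLFT function. It relies on one observation from the sample: endOffset is inclusive (for "March", beginOffset is 0 and endOffset is 4).

```python
import json

# Shortened copy of the JSON printed by the `tag` action above
raw = """
{
    "model": "grobid-date",
    "texts": [
        {
            "text": "March the 27th, 2001",
            "entities": [
                {"text": "March", "class": "<month>", "beginOffset": 0, "endOffset": 4, "score": 1.0},
                {"text": "27th", "class": "<day>", "beginOffset": 10, "endOffset": 13, "score": 1.0},
                {"text": "2001", "class": "<year>", "beginOffset": 16, "endOffset": 19, "score": 1.0}
            ]
        }
    ]
}
"""


def extract_entities(result: dict) -> list:
    """Return (label, surface form) pairs, slicing each span out of its text."""
    pairs = []
    for item in result["texts"]:
        for entity in item["entities"]:
            # endOffset is inclusive in the sample above, hence the +1
            span = item["text"][entity["beginOffset"]:entity["endOffset"] + 1]
            pairs.append((entity["class"], span))
    return pairs


entities = extract_entities(json.loads(raw))
print(entities)
```

Slicing the span from the offsets rather than reading the entity's own "text" field is a simple way to confirm that the offsets line up with the input string.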

As usual, depending on the architecture, you can indicate either which embeddings should be used for an RNN model (default is glove-840B):

python3 delft/applications/grobidTagger.py citation train_eval --architecture BidLSTM_CRF_FEATURES --embedding glove-840B

or the name of the transformer model you wish to use in an architecture that includes a transformer layer:

python3 delft/applications/grobidTagger.py header train --architecture BERT_CRF --transformer allenai/scibert_scivocab_cased
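
When training is launched from a script, the choice between --embedding and --transformer can be made from the architecture name. The following is a minimal sketch under one assumption: architectures whose name starts with "BERT" are the ones with a transformer layer. `build_command` is a hypothetical helper, not part of DeLFT; it only assembles the argument list and uses no flags beyond those in the usage message above.

```python
import shlex

SCRIPT = "delft/applications/grobidTagger.py"


def build_command(model, action, architecture, embedding=None, transformer=None):
    """Assemble the argument list for grobidTagger.py (hypothetical helper)."""
    cmd = ["python3", SCRIPT, model, action, "--architecture", architecture]
    if architecture.startswith("BERT"):
        # Assumption: names starting with "BERT" include a transformer layer
        if transformer:
            cmd += ["--transformer", transformer]
    elif embedding:
        # RNN architectures take static embeddings instead
        cmd += ["--embedding", embedding]
    return cmd


cmd = build_command("header", "train", "BERT_CRF",
                    transformer="allenai/scibert_scivocab_cased")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the root of the DeLFT repository to start the training.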

With architectures that have a feature channel, the categorical features generated by GROBID (typically the layout and lexical class features) are selected automatically. Models without a feature channel use only the tokens as input, like the usual deep learning models for text.

Similarly to the NER models, for n-fold training (action train_eval only), specify the value of n with the parameter --fold-count, e.g.:

python3 delft/applications/grobidTagger.py citation train_eval --architecture BidLSTM_CRF_FEATURES --fold-count=10
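
As a reminder of what n-fold evaluation means in general, the sketch below partitions example indices into n folds so that each example is held out for evaluation exactly once. It is a generic illustration with a hypothetical function name; how DeLFT shuffles and partitions the data internally may differ.

```python
def fold_indices(n_examples: int, n_folds: int):
    """Yield (train_indices, eval_indices) for each of the n folds."""
    indices = list(range(n_examples))
    for k in range(n_folds):
        # Hold out every n-th example, starting at offset k
        eval_idx = indices[k::n_folds]
        held_out = set(eval_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, eval_idx


folds = list(fold_indices(n_examples=25, n_folds=10))
for k, (train_idx, eval_idx) in enumerate(folds):
    print(f"fold {k}: {len(train_idx)} train / {len(eval_idx)} eval")
```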

By default, the GROBID data used are those available under the data/sequenceLabelling/grobid subdirectory, but another GROBID data file can be provided with the parameter --input:

python3 delft/applications/grobidTagger.py *name-of-model* train --architecture *name-of-architecture* --input *path-to-the-grobid-data-file-to-be-used-for-training*

or

python3 delft/applications/grobidTagger.py *name-of-model* train_eval --architecture *name-of-architecture* --input *path-to-the-grobid-data-file-to-be-used-for-training_and_eval_with_random_split*
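
The train_eval action holds out a random part of the data (10%) for evaluation. A random split of this kind can be sketched as follows. This is a generic illustration, not DeLFT's code: the function name, the fixed seed, and the rounding of the evaluation set size are all assumptions.

```python
import random


def random_split(examples, eval_ratio=0.1, seed=42):
    """Shuffle a copy of the examples and hold out eval_ratio of them."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_eval = max(1, int(len(shuffled) * eval_ratio))
    # The first n_eval shuffled examples are held out, the rest are for training
    return shuffled[n_eval:], shuffled[:n_eval]


train_set, eval_set = random_split(range(100))
print(len(train_set), len(eval_set))
```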

The evaluation of a model with a specific Grobid data file can be performed using the eval action and specifying the data file with --input:

python3 delft/applications/grobidTagger.py citation eval --architecture *name-of-architecture* --input *path-to-the-grobid-data-file-to-be-used-for-evaluation*