Snippet classification with DeLFT

As an example, we use DeLFT here to create classifiers for various snippet types: citation contexts, software mention contexts, dataset sentences, etc.

In general, the best results are obtained with a transformer classifier (architecture bert), possibly combined with a costly 10-fold ensemble if speed is not an issue. However, a gru architecture with a 10-fold ensemble and good static embeddings can in some cases match or even exceed the accuracy of a transformer, while being much faster and less memory-hungry. It is therefore advisable to experiment with a gru architecture and a 10-fold ensemble before settling on a transformer classifier.

Toxic comment classification

The dataset of the Kaggle Toxic Comment Classification challenge can be found here: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/data

This is a multi-label regression problem, where a Wikipedia comment (or any similar short text) should be associated with 6 possible types of toxicity (toxic, severe_toxic, obscene, threat, insult, identity_hate).
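Because the problem is multi-label, each of the six toxicity types receives its own score, and a comment can match several types at once or none at all. The following minimal sketch shows one way such scores could be turned into labels. The helper `labels_above`, the 0.5 threshold and the example scores are assumptions made for illustration; they are not part of DeLFT.

```python
# Hypothetical helper, not part of DeLFT: turn per-label scores into labels.
LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]


def labels_above(scores, threshold=0.5):
    """Return the labels whose score reaches the threshold, in LABELS order.

    Each label is tested independently, so the result may contain
    zero, one or several labels (multi-label setting).
    """
    return [label for label in LABELS if scores.get(label, 0.0) >= threshold]


# Made-up scores for a single comment, for illustration only.
example = {
    "toxic": 0.91,
    "severe_toxic": 0.12,
    "obscene": 0.77,
    "threat": 0.03,
    "insult": 0.64,
    "identity_hate": 0.05,
}

print(labels_above(example))
```

The threshold is a design choice: lowering it favours recall on rare types such as threat, at the cost of precision.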

usage: delft/applications/toxicCommentClassifier.py [-h] [--fold-count FOLD_COUNT] [--architecture ARCHITECTURE]
                                 [--embedding EMBEDDING] [--transformer TRANSFORMER]
                                 action

Classification of comments/short texts in toxicity types (toxic, severe_toxic, obscene, threat, insult,
identity_hate) based on DeLFT

positional arguments:
  action

optional arguments:
  -h, --help            show this help message and exit
  --fold-count FOLD_COUNT
  --architecture ARCHITECTURE
                        type of model architecture to be used, one of ['lstm', 'bidLstm_simple', 'cnn',
                        'cnn2', 'cnn3', 'mix1', 'dpcnn', 'conv', 'gru', 'gru_simple', 'lstm_cnn', 'han',
                        'bert']
  --embedding EMBEDDING
                        The desired pre-trained word embeddings using their descriptions in the file. For
                        local loading, use delft/resources-registry.json. Be sure to use here the same
                        name as in the registry, e.g. ['glove-840B', 'fasttext-crawl', 'word2vec'] and
                        that the path in the registry to the embedding file is correct on your system.
  --transformer TRANSFORMER
                        The desired pre-trained transformer to be used in the selected architecture. For
                        local loading use, delft/resources-registry.json, and be sure to use here the
                        same name as in the registry, e.g. ['bert-base-cased', 'bert-large-cased',
                        'allenai/scibert_scivocab_cased'] and that the path in the registry to the model
                        path is correct on your system. HuggingFace transformers hub will be used
                        otherwise to fetch the model, see https://huggingface.co/models for model names

To launch training with the default BiGRU model:

> python3 delft/applications/toxicCommentClassifier.py train

For instance, to use the BERT architecture with bert-base-cased as the pretrained model, splitting the training data for training and evaluation:

> python3 delft/applications/toxicCommentClassifier.py train_eval --architecture bert --transformer bert-base-cased

For training with n-folds and the default BiGRU model, use the parameter --fold-count:

> python3 delft/applications/toxicCommentClassifier.py train --fold-count 10

This will train 10 classifiers, which are then used together as an ensemble classifier.
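The idea of an ensemble over folds can be sketched as follows. This sketch assumes that the ensemble simply averages the per-class scores of the fold models; the way DeLFT actually combines the folds is not described here, and the function name `average_folds` is hypothetical.

```python
# Minimal sketch of ensemble averaging (assumption: plain mean over folds).
def average_folds(fold_predictions):
    """Average per-class scores over all folds.

    fold_predictions[k][i][c] is the score given by fold model k
    to text i for class c. Returns one score list per text.
    """
    n_folds = len(fold_predictions)
    n_texts = len(fold_predictions[0])
    n_classes = len(fold_predictions[0][0])
    return [
        [
            sum(fold[i][c] for fold in fold_predictions) / n_folds
            for c in range(n_classes)
        ]
        for i in range(n_texts)
    ]


# Two made-up fold models scoring one text over two classes.
folds = [
    [[0.2, 0.8]],
    [[0.4, 0.6]],
]
print(average_folds(folds))
```

Averaging tends to smooth out the variance between individual fold models, which is why the ensemble costs more at training and prediction time but is often more stable.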

After training (1 or n-folds), to process the Kaggle test set, use:

> python3 delft/applications/toxicCommentClassifier.py test

To classify a set of comments:

> python3 delft/applications/toxicCommentClassifier.py classify

Citation classification

We use the dataset developed and presented by A. Athar in the following article:

Awais Athar. "Sentiment Analysis of Citations using Sentence Structure-Based Features". Proceedings of the ACL 2011 Student Session, 81-87, 2011. http://www.aclweb.org/anthology/P11-3015

For a given scientific article, the task is to estimate whether the occurrence of a bibliographical citation is positive, neutral or negative given its citation context. Note that the dataset, like the Toxic Comment dataset, is highly unbalanced (86% of the citations are neutral).

In this example, we formulate the problem as a 3-class regression (negative, neutral, positive). The command-line usage is:

usage: delft/applications/citationClassifier.py [-h] [--fold-count FOLD_COUNT] [--architecture ARCHITECTURE]
                             [--embedding EMBEDDING] [--transformer TRANSFORMER]
                             action

Sentiment classification of citation contexts based on DeLFT

positional arguments:
  action

optional arguments:
  -h, --help            show this help message and exit
  --fold-count FOLD_COUNT
  --architecture ARCHITECTURE
                        type of model architecture to be used, one of ['lstm', 'bidLstm_simple', 'cnn',
                        'cnn2', 'cnn3', 'mix1', 'dpcnn', 'conv', 'gru', 'gru_simple', 'lstm_cnn', 'han',
                        'bert']
  --embedding EMBEDDING
                        The desired pre-trained word embeddings using their descriptions in the file. For
                        local loading, use delft/resources-registry.json. Be sure to use here the same
                        name as in the registry, e.g. ['glove-840B', 'fasttext-crawl', 'word2vec'] and
                        that the path in the registry to the embedding file is correct on your system.
  --transformer TRANSFORMER
                        The desired pre-trained transformer to be used in the selected architecture. For
                        local loading use, delft/resources-registry.json, and be sure to use here the
                        same name as in the registry, e.g. ['bert-base-cased', 'bert-large-cased',
                        'allenai/scibert_scivocab_cased'] and that the path in the registry to the model
                        path is correct on your system. HuggingFace transformers hub will be used
                        otherwise to fetch the model, see https://huggingface.co/models for model names

Examples:

> python3 delft/applications/citationClassifier.py train

With n-folds:

> python3 delft/applications/citationClassifier.py train --fold-count 10

Training and evaluation (ratio) with 10-folds:

> python3 delft/applications/citationClassifier.py train_eval --fold-count 10

which should produce the following evaluation, using the default 2-layer Bidirectional GRU model (gru):

Evaluation on 896 instances:
                   precision        recall       f-score       support
      negative        0.1494        0.4483        0.2241            29
       neutral        0.9653        0.8058        0.8784           793
      positive        0.3333        0.6622        0.4434            74

As with the other scripts, use --architecture to specify an alternative DL architecture, for instance SciBERT:

> python3 delft/applications/citationClassifier.py train_eval --architecture bert --transformer allenai/scibert_scivocab_cased

Evaluation on 896 instances:
                   precision        recall       f-score       support
      negative        0.1712        0.6552        0.2714            29
       neutral        0.9740        0.8020        0.8797           793
      positive        0.4015        0.7162        0.5146            74

Using a 10-fold SciBERT ensemble classifier:

> python3 delft/applications/citationClassifier.py train_eval --architecture bert --transformer allenai/scibert_scivocab_cased --fold-count 10

Evaluation on 896 instances:
                   precision        recall       f-score       support
      negative        0.3023        0.4483        0.3611            29
       neutral        0.9651        0.8714        0.9158           793
      positive        0.3869        0.7162        0.5024            74
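Since the classes are highly unbalanced, a single summary figure can help when comparing the three evaluations above. The sketch below computes an unweighted (macro) average of the per-class f-scores reported in the tables. The macro average is not part of the reported output; it is derived here only from the figures shown, and the run names used as dictionary keys are labels chosen for this example.

```python
# Per-class f-scores copied from the three evaluation tables above,
# in the order (negative, neutral, positive).
F_SCORES = {
    "gru 10-fold": [0.2241, 0.8784, 0.4434],
    "scibert": [0.2714, 0.8797, 0.5146],
    "scibert 10-fold": [0.3611, 0.9158, 0.5024],
}


def macro_f(scores):
    """Unweighted mean of the per-class f-scores (macro average)."""
    return sum(scores) / len(scores)


macro = {name: macro_f(scores) for name, scores in F_SCORES.items()}
best = max(macro, key=macro.get)

for name, value in macro.items():
    print(f"{name}: {value:.4f}")
print("best macro f-score:", best)
```

The macro average gives the rare negative and positive classes the same weight as the dominant neutral class, which is usually what matters for this task.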

To classify a set of citation contexts with the default model (2-layer Bidirectional GRU, gru):

> python3 delft/applications/citationClassifier.py classify

which will produce JSON output like this:

{
    "model": "citations",
    "date": "2018-05-13T16:06:12.995944",
    "software": "DeLFT",
    "classifications": [
        {
            "negative": 0.001178970211185515,
            "text": "One successful strategy [15] computes the set-similarity involving (multi-word) keyphrases about the mentions and the entities, collected from the KG.",
            "neutral": 0.187219500541687,
            "positive": 0.8640883564949036
        },
        {
            "negative": 0.4590276777744293,
            "text": "Unfortunately, fewer than half of the OCs in the DAML02 OC catalog (Dias et al. 2002) are suitable for use with the isochrone-fitting method because of the lack of a prominent main sequence, in addition to an absence of radial velocity and proper-motion data.",
            "neutral": 0.3570767939090729,
            "positive": 0.18021513521671295
        },
        {
            "negative": 0.0726129561662674,
            "text": "However, we found that the pairwise approach LambdaMART [41] achieved the best performance on our datasets among most learning to rank algorithms.",
            "neutral": 0.12469841539859772,
            "positive": 0.8224021196365356
        }
    ],
    "runtime": 1.202
}
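As a minimal sketch of how this output could be consumed, the snippet below parses JSON of the shape shown above and keeps the highest-scoring class for each text. The function name `top_labels` is hypothetical, and the sample reuses the scores of the first two classifications with the texts shortened for readability.

```python
import json

CLASSES = ("negative", "neutral", "positive")


def top_labels(output_json):
    """Return (label, text) pairs, keeping the highest-scoring class per text."""
    result = json.loads(output_json)
    return [
        (max(CLASSES, key=lambda c: item[c]), item["text"])
        for item in result["classifications"]
    ]


# Sample in the same shape as the output above (texts shortened).
sample = """
{
    "model": "citations",
    "software": "DeLFT",
    "classifications": [
        {
            "negative": 0.001178970211185515,
            "text": "One successful strategy [15] computes the set-similarity",
            "neutral": 0.187219500541687,
            "positive": 0.8640883564949036
        },
        {
            "negative": 0.4590276777744293,
            "text": "Unfortunately, fewer than half of the OCs",
            "neutral": 0.3570767939090729,
            "positive": 0.18021513521671295
        }
    ]
}
"""

for label, text in top_labels(sample):
    print(label, "|", text)
```

Note that the three scores of a classification do not necessarily sum to 1, so they are best compared with each other rather than read as a probability distribution.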