Loading transformer models
DeLFT integrates HuggingFace transformers as Keras layers. A transformer can be loaded from three different sources, in order of preference (each falls back to the next if the resource is not found):
- Local directory: a self-contained HuggingFace-format model directory (config, tokenizer, weights), declared via model_dir in delft/resources-registry.json.
- Plain (legacy) checkpoint: separate config, weights and vocabulary files, declared via path-config, path-weights and path-vocab entries. Used to load older BERT-style checkpoints downloaded directly from the original GitHub releases.
- HuggingFace Hub: fetched online by name when no registry entry matches. Requires network access; private repositories require HF_ACCESS_TOKEN in the environment.
The selection logic is implemented in delft/utilities/Transformer.py (configure_from_registry / init_preprocessor / instantiate_layer).
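The fallback order described above can be sketched as a small standalone function. This is an illustrative approximation, not the code in Transformer.py: the function name resolve_loading_method, the string values of the constants, and the choice to check files on disk are all assumptions made for the example. Only the constant names and the registry keys come from this page.

```python
import os

# Constant names follow this page; the string values are placeholders.
LOADING_METHOD_LOCAL_MODEL_DIR = "local_model_dir"
LOADING_METHOD_PLAIN_MODEL = "plain_model"
LOADING_METHOD_HUGGINGFACE_NAME = "huggingface_name"

PLAIN_KEYS = ("path-config", "path-weights", "path-vocab")


def resolve_loading_method(name, entries):
    """Pick a loading method for `name` from a list of registry entries.

    Hypothetical helper: `entries` is a list of dicts shaped like the
    registry examples on this page.
    """
    entry = next((e for e in entries if e.get("name") == name), None)
    if entry is not None:
        # 1. Self-contained HuggingFace-format directory.
        model_dir = entry.get("model_dir")
        if model_dir and os.path.isdir(model_dir):
            return LOADING_METHOD_LOCAL_MODEL_DIR

        # 2. Legacy checkpoint: all three paths must be declared.
        #    Only config and vocabulary are checked on disk, because the
        #    weights path may be a TensorFlow checkpoint prefix, not a file.
        config, weights, vocab = (entry.get(k) for k in PLAIN_KEYS)
        if config and weights and vocab:
            if os.path.isfile(config) and os.path.isfile(vocab):
                return LOADING_METHOD_PLAIN_MODEL

    # 3. Nothing usable locally: treat the name as a Hub model id.
    return LOADING_METHOD_HUGGINGFACE_NAME


# With an empty registry, any name resolves to the Hub.
print(resolve_loading_method("allenai/scibert_scivocab_cased", []))
# prints: huggingface_name
```

An entry whose model_dir points to a missing directory falls through to the next source rather than failing, which mirrors the "falls back to the next if the resource is not found" behaviour described above.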
Loading sources
| Transformer | Description | Registry entry? | Hugging Face Hub? | Loader path (Transformer.py) |
|---|---|---|---|---|
| allenai/scibert_scivocab_cased | SciBERT pulled from the Hub by name | no | yes | LOADING_METHOD_HUGGINGFACE_NAME: AutoModel.from_pretrained(name) |
| portiz/matbert | RoBERTa-based model saved locally as a HF directory | yes (only model_dir is needed) | no | LOADING_METHOD_LOCAL_MODEL_DIR: AutoModel.from_pretrained(model_dir) |
| scibert (legacy GitHub release) | Original SciBERT TF1 checkpoint with separate files | yes (path-config, path-weights, path-vocab required) | no | LOADING_METHOD_PLAIN_MODEL: BertTokenizer.from_pretrained(path-vocab) + manual config/weights load |
Configuration examples
Local HuggingFace-format directory (preferred for local models):
{
"name": "portiz/matbert",
"model_dir": "/Users/lfoppiano/development/projects/embeddings/pre-trained-embeddings/matbert",
"lang": "en"
}
Plain (legacy) checkpoint with separate config / weights / vocabulary files:
{
"name": "dmis-lab/biobert-base-cased-v1.2",
"path-config": "/media/lopez/T5/embeddings/biobert_v1.2_pubmed/bert_config.json",
"path-weights": "/media/lopez/T5/embeddings/biobert_v1.2_pubmed/model.ckpt-1000000",
"path-vocab": "/media/lopez/T5/embeddings/biobert_v1.2_pubmed/vocab.txt",
"lang": "en"
}
Hub-only (no registry entry needed) — just pass the Hub model id on the command line:
python -m delft.applications.grobidTagger header train --architecture BERT_CRF --transformer allenai/scibert_scivocab_cased
Notes
- For models whose Hub name contains cased or uncased, DeLFT forces do_lower_case accordingly to work around tokenizers shipped without an explicit casing configuration (see Transformer.py:122-147 and issue #144).
- When DeLFT saves a fine-tuned model, the transformer layer is serialized inside the model directory using a fourth (internal) loading method, LOADING_METHOD_DELFT_MODEL, with the file name transformer-config.json. This is automatic and not user-configurable.
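The casing rule in the first note can be illustrated with a minimal sketch. The helper name infer_do_lower_case is hypothetical and the exact matching done in Transformer.py may differ; the example only shows the idea of deriving do_lower_case from the model name. The uncased model id is added for illustration and does not appear elsewhere on this page.

```python
from typing import Optional


def infer_do_lower_case(model_name: str) -> Optional[bool]:
    """Guess the tokenizer's do_lower_case flag from a model name.

    Returns None when the name carries no casing hint.
    """
    lowered = model_name.lower()
    # "uncased" contains "cased", so it has to be tested first.
    if "uncased" in lowered:
        return True
    if "cased" in lowered:
        return False
    return None


for name in (
    "allenai/scibert_scivocab_cased",
    "allenai/scibert_scivocab_uncased",
    "portiz/matbert",
):
    print(name, infer_do_lower_case(name))
# prints:
# allenai/scibert_scivocab_cased False
# allenai/scibert_scivocab_uncased True
# portiz/matbert None
```

Testing for uncased before cased matters because the second string is a substring of the first; reversing the checks would mark every uncased model as cased.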