Building iob2labels: A Python Package for NER Label Conversion
Intro
This work grew out of recent experience supporting the training of a custom transformer-based Named Entity Recognition (NER) model. The project used Prodigy to annotate a novel, domain-specific dataset, which in turn fed a model training and inference pipeline. What started as a few utility functions soon required rigorous testing and was being reused by others, so I decided to package the capabilities and publish them to PyPI.
Background
Prodigy uses the IOB2 format for capturing entity spans in the source text. Other open-source examples of this format include this news-headlines dataset (referenced by Prodigy) and the BioMed-NER dataset.
This format provides a rich JSON object which preserves all information from the annotation task. Converting it into a tensor format for token classification, however, is a little tricky.
Below is an example annotation in NER/IOB2 format, taken from the MITMovie dataset:
# example annotation and labels
annotation = {
    "text": "Did Dame Judy Dench star in a British film about Queen Elizabeth?",
    "spans": [
        {"label": "actor", "start": 4, "end": 19},
        {"label": "plot", "start": 30, "end": 37},
        {"label": "character", "start": 49, "end": 64},
    ],
}
To train an NER model (as a token-classification task), we need to represent the target output of the above example as follows:
[0, 1, 2, 2, 2, 0, 0, 0, 5, 0, 0, 3, 4, 0]
In which the target labels correspond to the following classes:
0 -> outside (i.e., no labels)
1,2 -> actor (beginning and inside)
3,4 -> character (beginning and inside)
5 -> plot (beginning only; the entity spans a single token)
The Package
Install iob2labels from PyPI with pip install iob2labels.
The package exposes an IOB2Encoder that handles the full round-trip: annotation spans to integer labels for training, and integer labels back to spans at inference time. You give it your entity labels and a tokenizer checkpoint, and it takes care of the rest.
from iob2labels import IOB2Encoder

encoder = IOB2Encoder(
    labels=["actor", "character", "plot"],
    tokenizer="bert-base-uncased",
)

# encode: annotation spans → integer labels
iob_labels = encoder(
    text="Did Dame Judy Dench star in a British film about Queen Elizabeth?",
    spans=[
        {"label": "actor", "start": 4, "end": 19},
        {"label": "plot", "start": 30, "end": 37},
        {"label": "character", "start": 49, "end": 64},
    ],
)
iob_labels
>>> [-100, 0, 1, 2, 2, 2, 0, 0, 0, 5, 0, 0, 3, 4, 0, -100]
# decode: integer labels → annotation spans
encoder.decode_text(iob_labels, "Did Dame Judy Dench star in a British film about Queen Elizabeth?")
>>> [
    {"start": 4, "end": 19, "label": "actor"},
    {"start": 30, "end": 37, "label": "plot"},
    {"start": 49, "end": 64, "label": "character"},
]
The only runtime dependencies are tokenizers (the Rust-backed Hugging Face tokenizer library) and pydantic. No torch or transformers required — just pass a checkpoint name and the encoder handles tokenizer initialization directly. The output is a list[int], ready to convert to a tensor or array as needed.
See here for the full documentation.
Under the Hood
The rest of this post walks through the key decisions and challenges involved in building the conversion.
Preprocessing
One of the first challenges is managing the variety of annotation formats, field names, and ways to label an annotated entity. Since transformer models are coupled to a specific tokenizer, an NER schema that attaches labels to tokens (e.g., words) introduces complexity: the labeled tokens must be re-aligned for every different tokenizer. Assigning entity labels to character indices in the string is more generic, decoupled from any specific tokenizer, and easier to check for errors in the data or in any subsequent processing.
The encoder validates annotations through Pydantic on every call — checking for negative offsets, spans past the text boundary, overlapping entities, and so on. If your data uses non-standard field names (like the BioMed-NER dataset, which uses "entities" and "class" instead of "spans" and "label"), those are configurable on the encoder constructor.
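As a rough sketch of the kind of validation involved — the model and field names here are illustrative, not the package's actual internals:

```python
from pydantic import BaseModel, model_validator


class Span(BaseModel):
    label: str
    start: int
    end: int


class Annotation(BaseModel):
    text: str
    spans: list[Span]

    @model_validator(mode="after")
    def check_spans(self):
        for span in self.spans:
            if span.start < 0 or span.end > len(self.text):
                raise ValueError(f"span {span} falls outside the text")
            if span.start >= span.end:
                raise ValueError(f"span {span} is empty or reversed")
        # reject overlapping entities
        ordered = sorted(self.spans, key=lambda s: s.start)
        for a, b in zip(ordered, ordered[1:]):
            if b.start < a.end:
                raise ValueError(f"spans {a} and {b} overlap")
        return self
```

Because validation runs through Pydantic, a bad span fails loudly at encoding time rather than silently producing shifted labels.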
The Label Map
The IOB2 format distinguishes between the beginning and inside of an entity, so each entity class generates 2 labels following the B-LABEL / I-LABEL convention. Combined with the outside class (O) for non-entity tokens, the total label count is always (n * 2) + 1:
encoder.label_map
>>> {
'O': 0,
'B-ACTOR': 1, 'I-ACTOR': 2,
'B-CHARACTER': 3, 'I-CHARACTER': 4,
'B-PLOT': 5, 'I-PLOT': 6
}
Encoding
The encoder uses the tokenizer’s char_to_token() mapping to align character-level span boundaries to token positions, then fills in the B/I labels accordingly. Special tokens ([CLS], [SEP], etc.) receive -100, which PyTorch’s CrossEntropyLoss skips by default. There is a built-in conversion check (on by default) which verifies the result is correct by recovering the entity text from the produced labels and comparing it to the original annotation.
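To illustrate the alignment step without pulling in a real tokenizer, here is a sketch that uses whitespace splitting as a stand-in for subword tokenization; the char_to_token helper mimics what the tokenizers library's Encoding.char_to_token() provides. The real encoder works on subword tokens and additionally assigns -100 to special tokens:

```python
import re


def align_spans_to_labels(text, spans, label_map):
    """Assign B-/I- label ids to each (whitespace) token covered by a span."""
    # stand-in tokenization: one token per whitespace-delimited word,
    # recorded as (start, end) character offsets
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

    def char_to_token(char_idx):
        # mimic Encoding.char_to_token(): index of the token covering a char
        for i, (start, end) in enumerate(tokens):
            if start <= char_idx < end:
                return i
        return None

    labels = [label_map["O"]] * len(tokens)
    for span in spans:
        first = char_to_token(span["start"])
        last = char_to_token(span["end"] - 1)
        labels[first] = label_map[f"B-{span['label'].upper()}"]
        for i in range(first + 1, last + 1):
            labels[i] = label_map[f"I-{span['label'].upper()}"]
    return labels
```

The label counts differ from the package's output above only because a real subword tokenizer splits some words into multiple tokens (and adds special tokens at the boundaries).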
Decoding
The inverse operation maps token positions back to character offsets using token_to_chars(). This sounds straightforward, but different tokenizer families have different ideas about whitespace. SentencePiece tokenizers (ALBERT, XLNet, T5) absorb leading spaces into the token — so ▁Queen maps to character range (48, 54) even though the entity starts at 49. The decoder handles this by trimming whitespace from the recovered character boundaries.
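The trimming itself is a small fix; a sketch of the idea (illustrative, not the package's exact code):

```python
def trim_span(text: str, start: int, end: int) -> tuple[int, int]:
    """Shrink a recovered character range so it excludes absorbed whitespace."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end
```

With the example sentence, trim_span(text, 48, 54) recovers (49, 54), i.e. exactly "Queen" rather than " Queen".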
Tests
The test suite includes over 300 tests: unit tests for label map construction, annotation validation, and the conversion checker, plus a parametrized matrix of 18 tokenizer checkpoints (WordPiece, BPE, SentencePiece) across multiple edge cases. The decoder is verified via round-trip tests — encode, then decode, then assert the recovered spans match the original — across all supported tokenizers.