Iob2tensor: IOB Label Conversion for NER

2024-04-06

I have done some work recently around NER and needed to write functionality to convert IOB2 annotations into a tensor format for PyTorch training. This article contains some notes on that work and a link to the related repo.

This work contains simple functions for converting IOB2-format NER annotation data into tensor formats for Transformer-based NER tasks. Open-source examples of this format include the news-headlines dataset referenced by Prodigy and the BioMed-NER dataset.

A few initial notes

IOB2 Example

Below is an example of an NER/IOB2 format annotation:

# example annotation and labels
annotation = {
    "text": "Did Dame Judy Dench star in a British film about Queen Elizabeth?",
    "spans": [
        {"label": "actor", "start": 4, "end": 19},
        {"label": "plot", "start": 30, "end": 37},
        {"label": "character", "start": 49, "end": 64}
    ]
}

*example pulled from MITMovie dataset

In order to train an NER model (i.e., a token-classification-style task), we can represent the target output of the above example as follows:

[0, 1, 2, 2, 2, 0, 0, 0, 5, 0, 0, 3, 4, 0]

In which the target labels correspond to the following classes:

0 -> outside (i.e., no labels)
1,2 -> actor (beginning and inside)
3,4 -> character (beginning and inside)
5 -> plot (beginning only; the entity covers a single token)
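
To make the mapping concrete, here is a small sketch (not part of the library) that decodes the numeric targets back into IOB2 tags using the scheme above:

# decode the numeric targets back into IOB2 tags (same scheme as above)
id2tag = {
    0: "O",
    1: "B-ACTOR", 2: "I-ACTOR",
    3: "B-CHARACTER", 4: "I-CHARACTER",
    5: "B-PLOT", 6: "I-PLOT",
}
targets = [0, 1, 2, 2, 2, 0, 0, 0, 5, 0, 0, 3, 4, 0]

[id2tag[i] for i in targets]
>>> ['O', 'B-ACTOR', 'I-ACTOR', 'I-ACTOR', 'I-ACTOR', 'O', 'O', 'O', 'B-PLOT', 'O', 'O', 'B-CHARACTER', 'I-CHARACTER', 'O']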

The following contains instructions for producing this conversion.

Usage

Preprocessing and Schema Validation

One of the first challenges in preprocessing data annotated for an NER task is managing the complexity of nested annotations, different field names, and the various ways to label an annotated entity. Since Transformer models are coupled to a tokenizer, an NER schema that attaches labels to tokens (e.g., words) introduces complexity: the labeled tokens have to be converted for every different tokenizer, and any nuances of the tokenizer used during annotation end up mixed into the data.

For these reasons, assigning entity labels to string indices is more generic, decoupled from any specific tokenizer, and more easily checked for errors in the data or in any subsequent processing.

The example below is pulled from the MITMovie annotated dataset.

# example annotation and labels
annotation = {
    "text": "Did Dame Judy Dench star in a British film about Queen Elizabeth?",
    "spans": [
        {"label": "actor", "start": 4, "end": 19},
        {"label": "plot", "start": 30, "end": 37},
        {"label": "character", "start": 49, "end": 64}
    ]
}
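
One nice property of character offsets is that they are straightforward to sanity-check: slicing the text with each span's start/end indices should recover the entity string. A quick illustrative check (not part of the library):

for span in annotation["spans"]:
    # slicing with the character offsets should recover the entity text
    print(span["label"], "->", annotation["text"][span["start"]:span["end"]])
>>> actor -> Dame Judy Dench
    plot -> British
    character -> Queen Elizabeth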

Due to the complex structure of NER spans and the associated text field, we perform a preprocessing and validation step to ensure everything is in good order. This is handled by Pydantic as an intermediate step, but the outputs are still typed dictionaries to keep things simple for the user.

from iob2tensor import preprocess

text = "Did Dame Judy Dench star in a British film about Queen Elizabeth?"

spans = [
    {"label": "actor", "start": 4, "end": 19},
    {"label": "plot", "start": 30, "end": 37},
    {"label": "character", "start": 49, "end": 64}
]
# validate input annotations
annotation = preprocess(text, spans)

The default or expected fields for input annotations are as follows:

from typing import TypedDict

class Span(TypedDict):
    start: int
    end: int
    label: str

class Annotation(TypedDict):
    text: str
    spans: list[Span]
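
Because the output of preprocess is an ordinary (typed) dictionary, it can be inspected like any other dict. A small hedged example, assuming the returned annotation follows the Annotation shape above:

# hedged assumption: preprocess returns a dict matching the Annotation schema
print(annotation["text"])
for span in annotation["spans"]:
    print(span["label"], span["start"], span["end"])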

If your annotated data uses different fields, specify those fields as function arguments. For instance, the BioMed-NER dataset follows the standard NER spans schema but uses different field names.

annotation = {
    "text": "Weed seed inactivation in soil mesocosms via biosolarization...",
    "entities": [
        {"start": 0, "end": 4, "class": "ORGANISM"},
        {"start": 5, "end": 9, "class": "ORGANISM"},
        {"start": 26, "end": 30, "class": "CHEMICALS"},
        ...
    ]
}

annotation = preprocess(
    **annotation,
    spans_field="entities",
    label_field="class"
)

Create Label Map

Next, create the IOB label map with your dataset’s entity labels. The default label in the IOB2 format represents all tokens which are not entities and is thus referred to as the outside class; the convention is to assign all tokens of this class label=0. Additionally, the IOB2 format distinguishes between the beginning and the inside of an entity, so each entity class generates 2 distinct labels, B-<LABEL> and I-<LABEL>.

This means the label set and mapping will always have a size of (n * 2) + 1, where n equals the number of distinct entity labels (e.g., “location”, “organization”, “person”, etc.) and the +1 comes from the outside (non-entity) class.

Use the following function to create the initial label map for your dataset’s labels.

from iob2tensor import create_label_map

labels = ["actor", "character", "plot"]

label_map = create_label_map(labels)
label_map
>>> {
    'O': 0,
    'B-ACTOR': 1, 'I-ACTOR': 2,
    'B-CHARACTER': 3, 'I-CHARACTER': 4,
    'B-PLOT': 5, 'I-PLOT': 6
}
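
As a quick check on the sizing rule above, the map should always contain (n * 2) + 1 entries:

# 3 entity labels -> (3 * 2) + 1 = 7 entries, including the outside class
assert len(label_map) == (len(labels) * 2) + 1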

Create Target Output

Now we select and initialize a tokenizer - which must be involved in the IOB label conversion, since labels are assigned to the tokenizer's tokens - and convert our NER annotation into a label array.

There is a built-in conversion check (on by default) which ensures the conversion is correct. It is guaranteed to work for the supported tokenizers, but it can also be turned off to reduce computation.

from transformers import AutoTokenizer

from iob2tensor import to_iob_tensor

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

iob_labels = to_iob_tensor(annotation, label_map, tokenizer)
iob_labels
>>> [-100, 0, 1, 2, 2, 2, 0, 0, 0, 5, 0, 0, 3, 4, 0, -100]

Now just one step away from a tensor!

import torch

x = torch.tensor(iob_labels)
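
The -100 entries correspond to the special tokens added by the tokenizer (e.g., [CLS] and [SEP] for bert-base-uncased) and follow the common PyTorch/Transformers convention of marking positions that should not contribute to the loss. As a hedged sketch (the logits below are random stand-ins, not output from a real model), the label tensor can be used directly as the target of a cross-entropy loss, whose default ignore_index is -100:

import torch.nn as nn

# stand-in logits with shape (sequence_length, num_labels);
# a Transformer token-classification head would normally produce these
num_labels = len(label_map)  # 7 for the movie example
logits = torch.randn(len(iob_labels), num_labels)

# positions labeled -100 (the special tokens) are ignored by the loss
loss = nn.CrossEntropyLoss()(logits, x)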

Tests

There is a built-in check (which can optionally be turned off) within the main to_iob_tensor() function that attempts to confirm the IOB2 conversion is correct. Additionally, there is a series of unit and end-to-end tests in the tests directory. Finally, the tokenizers.py file contains the specific tokenizer checkpoints which I have tested.

Thoughts on an Interface

The next step for the above is to work on a more polished interface; perhaps something like scikit-learn preprocessors.