NER in Czech Documents with XLM-RoBERTa using 🤗 Accelerate
by Bohumir Buso, November 2024


🤗 Accelerate
Having started at a time when wrappers were less common, I became accustomed to writing my own training loops, which I find easier to debug – an approach that 🤗 Accelerate supports effectively. It proved beneficial in this project: I wasn’t entirely certain of the required data and label formats or shapes, and my data didn’t match the well-organized examples often shown in tutorials, but having full access to intermediate computations during the training loop allowed me to iterate quickly.

Context Length
Most tutorials suggest using each sentence as a single training example. However, in this case I decided a longer context would be more suitable: documents typically contain references to multiple entities, many of which are irrelevant (e.g. lawyers, other creditors, case numbers), and the broader context enables the model to better identify the relevant client. I used 512 tokens from each document as one training example. This is a common maximum input length for these models, yet it comfortably accommodates all entities in most of my documents.
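
As a quick sanity check (a minimal sketch; documents is a hypothetical list of documents pre-split into words, and the tokenizer path matches the one used later in the post), one can verify how many tokens the documents actually produce:

from transformers import AutoTokenizer

# Hypothetical check: how many XLM-RoBERTa tokens does each document produce,
# and what share of documents fits into a 512-token window?
tokenizer = AutoTokenizer.from_pretrained("../model/xlm-roberta-large")
lengths = [len(tokenizer(doc, is_split_into_words=True)["input_ids"])
           for doc in documents]
print(max(lengths), sum(l <= 512 for l in lengths) / len(lengths))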

Labelling of Subtokens
In the 🤗 token classification tutorial [1], the recommended approach is:

Only labeling the first token of a given word. Assign -100 to other subtokens from the same word.

However, I found that the following method suggested in the token classification tutorial in their NLP course [2] works much better:

Each token gets the same label as the token that started the word it’s inside, since they are part of the same entity. For tokens inside a word but not at the beginning, we replace the B- with I-

The label -100 is a special label that is ignored by the loss function. Hence, I used their functions with minor changes:

def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
            # If the label is B-XXX we change it to I-XXX
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
    return new_labels


from transformers import AutoTokenizer

def tokenize_and_align_labels(examples):
    tokenizer = AutoTokenizer.from_pretrained("../model/xlm-roberta-large")
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True,
        padding="max_length", max_length=512)
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs
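
To make the alignment concrete, here is a tiny hypothetical example. The label ids follow the label_list defined later in the post (7 = B-jmeno, 5 = B-prijmeni, 6 = I-prijmeni):

# Hypothetical example: "Jan Novák", where the tokenizer splits "Novák" into two
# subtokens; word_ids() returns None for special tokens such as <s> and </s>.
labels = [7, 5]                     # B-jmeno for "Jan", B-prijmeni for "Novák"
word_ids = [None, 0, 1, 1, None]
print(align_labels_with_tokens(labels, word_ids))
# [-100, 7, 5, 6, -100] -> the second subtoken of "Novák" becomes I-prijmeni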

I also used their postprocess() function:

To simplify its evaluation part, we define this postprocess() function that takes predictions and labels and converts them to lists of strings.

def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()
    true_labels = [[id2label[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    return true_predictions, true_labels
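
For intuition, here is a toy call (assuming torch is imported and id2label is the mapping defined further below); positions labelled -100, i.e. special tokens, are dropped from both lists:

# Toy example: one sequence of five positions, two of them special tokens.
preds = torch.tensor([[0, 7, 5, 6, 0]])
labels = torch.tensor([[-100, 7, 5, 6, -100]])
true_predictions, true_labels = postprocess(preds, labels)
# true_predictions == [['B-jmeno', 'B-prijmeni', 'I-prijmeni']]
# true_labels      == [['B-jmeno', 'B-prijmeni', 'I-prijmeni']]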

Class Weights
Incorporating class weights into the loss function significantly improved model performance.
While this adjustment may seem straightforward — without it, the model overemphasized the majority “O” class — it’s surprisingly absent from most tutorials. I implemented a custom compute_weights() function to address this imbalance:

def compute_weights(trainset, num_labels):
    c = Counter()
    for t in trainset:
        c += Counter(t['labels'].tolist())
    weights = [sum(c.values()) / (c[i] + 1) for i in range(num_labels)]
    return weights
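
A toy run shows the effect (a sketch, assuming torch and Counter are imported as in the training script below): the dominant “O” class (id 0) gets a small weight, while rare entity classes get much larger ones. Note that the -100 padding label inflates the total count but receives no weight of its own, and the +1 in the denominator keeps weights finite for classes absent from the training set.

# Toy trainset: labels dominated by "O" (id 0), with a few entity labels.
toy_trainset = [{'labels': torch.tensor([0, 0, 0, 7, -100])},
                {'labels': torch.tensor([0, 0, 5, -100, -100])}]
print(compute_weights(toy_trainset, num_labels=11))
# class 0 ("O") gets ~1.7, classes 5 and 7 get 5.0, unseen classes get 10.0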

Training Loop
I defined two additional functions: create_dataloaders(), which wraps the PyTorch DataLoader() to manage batch processing, and main(), which sets up the objects for distributed training and executes the training loop.

from accelerate import Accelerator, notebook_launcher
from collections import Counter
from datasets import Dataset
from datetime import datetime
import torch
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torch.nn import CrossEntropyLoss
from torch.utils.data import DataLoader
from transformers import AutoTokenizer
from transformers import AutoModelForTokenClassification
from transformers import XLMRobertaConfig, XLMRobertaForTokenClassification
from seqeval.metrics import classification_report, f1_score


def create_dataloaders(trainset, evalset, batch_size, num_workers):
    train_dataloader = DataLoader(trainset, shuffle=True,
                                  batch_size=batch_size, num_workers=num_workers)
    eval_dataloader = DataLoader(evalset, shuffle=False,
                                 batch_size=batch_size, num_workers=num_workers)
    return train_dataloader, eval_dataloader


def main(batch_size, num_workers, epochs, model_path, dataset_tr, dataset_ev,
         training_type, model_params, dt):
    accelerator = Accelerator(split_batches=True)
    num_labels = model_params['num_labels']

    # Prepare data #
    train_ds = Dataset.from_dict(
        {"tokens": [d[2][:512] for d in dataset_tr],
         "ner_tags": [d[1][:512] for d in dataset_tr]})
    eval_ds = Dataset.from_dict(
        {"tokens": [d[2][:512] for d in dataset_ev],
         "ner_tags": [d[1][:512] for d in dataset_ev]})
    trainset = train_ds.map(tokenize_and_align_labels, batched=True,
                            remove_columns=["tokens", "ner_tags"])
    evalset = eval_ds.map(tokenize_and_align_labels, batched=True,
                          remove_columns=["tokens", "ner_tags"])
    trainset.set_format("torch")
    evalset.set_format("torch")
    train_dataloader, eval_dataloader = create_dataloaders(trainset, evalset,
                                                           batch_size, num_workers)

    # Type of training #
    if training_type == 'from_scratch':
        # random initialization, only the architecture is reused
        config = XLMRobertaConfig.from_pretrained(model_path, **model_params)
        model = XLMRobertaForTokenClassification(config)
    elif training_type == 'transfer_learning':
        # freeze the body, train only the classification head
        model = AutoModelForTokenClassification.from_pretrained(
            model_path, ignore_mismatched_sizes=True, **model_params)
        for param in model.parameters():
            param.requires_grad = False
        for param in model.classifier.parameters():
            param.requires_grad = True
    elif training_type == 'fine_tuning':
        # update all weights
        model = AutoModelForTokenClassification.from_pretrained(
            model_path, **model_params)
        for param in model.parameters():
            param.requires_grad = True
        for param in model.classifier.parameters():
            param.requires_grad = True

    # Instantiate the optimizer #
    optimizer = torch.optim.AdamW(params=model.parameters(), lr=2e-5)

    # Instantiate the learning rate scheduler #
    lr_scheduler = ReduceLROnPlateau(optimizer, patience=5)

    # Define loss function #
    weights = compute_weights(trainset, num_labels)
    loss_fct = CrossEntropyLoss(weight=torch.tensor(weights))

    # Prepare objects for distributed training #
    loss_fct, train_dataloader, model, optimizer, eval_dataloader, lr_scheduler = accelerator.prepare(
        loss_fct, train_dataloader, model, optimizer, eval_dataloader, lr_scheduler)

    # Training loop #
    max_f1 = 0      # for early stopping
    best_epoch = 0  # epoch of the best eval F1 so far
    for t in range(epochs):
        # training
        accelerator.print(f"\n\nEpoch {t+1}\n-------------------------------")
        model.train()
        tr_loss = 0
        preds = list()
        labs = list()
        for batch in train_dataloader:
            outputs = model(input_ids=batch['input_ids'],
                            attention_mask=batch['attention_mask'])
            labels = batch["labels"]
            loss = loss_fct(outputs.logits.view(-1, num_labels), labels.view(-1))
            accelerator.backward(loss)
            optimizer.step()
            optimizer.zero_grad()
            tr_loss += loss
            predictions = outputs.logits.argmax(dim=-1)
            predictions_gathered = accelerator.gather(predictions)
            labels_gathered = accelerator.gather(labels)
            true_predictions, true_labels = postprocess(predictions_gathered, labels_gathered)
            preds.extend(true_predictions)
            labs.extend(true_labels)

        lr_scheduler.step(tr_loss)

        accelerator.print(f"Train loss: {tr_loss/len(train_dataloader):>8f} \n")
        accelerator.print(classification_report(labs, preds))

        # evaluation
        model.eval()
        ev_loss = 0
        preds = list()
        labs = list()
        for batch in eval_dataloader:
            with torch.no_grad():
                outputs = model(input_ids=batch['input_ids'],
                                attention_mask=batch['attention_mask'])
                labels = batch["labels"]
                loss = loss_fct(outputs.logits.view(-1, num_labels), labels.view(-1))

            ev_loss += loss
            predictions = outputs.logits.argmax(dim=-1)
            predictions_gathered = accelerator.gather(predictions)
            labels_gathered = accelerator.gather(labels)
            true_predictions, true_labels = postprocess(predictions_gathered, labels_gathered)
            preds.extend(true_predictions)
            labs.extend(true_labels)

        accelerator.print(f"Eval loss: {ev_loss/len(eval_dataloader):>8f} \n")
        accelerator.print(classification_report(labs, preds))

        accelerator.print(f"Current Learning Rate: {optimizer.param_groups[0]['lr']}")

        # checkpoint best model
        if f1_score(labs, preds) > max_f1:
            accelerator.wait_for_everyone()
            unwrapped_model = accelerator.unwrap_model(model)
            unwrapped_model.save_pretrained(f"../model/xlml_ner/{dt}/",
                                            is_main_process=accelerator.is_main_process,
                                            save_function=accelerator.save)
            accelerator.print(f"Model saved during {t+1}. epoch.")
            max_f1 = f1_score(labs, preds)
            best_epoch = t

        # early stopping
        if (t - best_epoch) > 10:
            accelerator.print(f"Early stopping after {t+1}. epoch.")
            break

    accelerator.print("Done!")


With everything prepared, the model is ready for training. I just need to initiate the process:

import os

label_list = [
    "O",
    "B-evcu", "I-evcu",          # variable symbol of creditor
    "B-rc", "I-rc",              # birth ID
    "B-prijmeni", "I-prijmeni",  # surname
    "B-jmeno", "I-jmeno",        # given name
    "B-datum", "I-datum",        # birth date
]
id2label = {a: b for a, b in enumerate(label_list)}
label2id = {b: a for a, b in enumerate(label_list)}
num_workers = 6                  # number of GPUs
batch_size = num_workers * 2
epochs = 100
model_path = "../model/xlm-roberta-large"
training_type = "fine_tuning"    # from_scratch / transfer_learning / fine_tuning
model_params = {"id2label": id2label, "label2id": label2id, "num_labels": 11}
dt = datetime.now().strftime("%Y%m%d_%H%M%S")
os.mkdir(f"../model/xlml_ner/{dt}")

notebook_launcher(main, args=(batch_size, num_workers, epochs, model_path,
                              dataset_tr, dataset_ev, training_type, model_params, dt),
                  num_processes=num_workers, mixed_precision="fp16", use_port="29502")

I find using notebook_launcher() convenient, as it allows me to run training in the console and easily work with results afterward.

XLM-RoBERTa base vs large vs Small-E-Czech
I experimented with fine-tuning three models. The XLM-RoBERTa base model [3] delivered satisfactory performance, but the server capacity also allowed me to try the XLM-RoBERTa large model [3], which has roughly twice as many parameters.

XLM-RoBERTa is a multilingual version of RoBERTa. It is pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.

The large model showed a slight improvement in results, so I ultimately deployed it. I also tested Small-E-Czech [4], an Electra-small model pre-trained on Czech web data, but its performance was poor.
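
For reference, these are the three checkpoints I compared; a sketch assuming the Hugging Face Hub ids (in the code above I load xlm-roberta-large from a local path instead) and the label mappings defined earlier:

# Assumed Hub ids for the three checkpoints compared in this section.
for checkpoint in ["xlm-roberta-base", "xlm-roberta-large", "Seznam/small-e-czech"]:
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint, num_labels=11, id2label=id2label, label2id=label2id)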

Fine-tuning vs Transfer learning vs Training from scratch
In addition to fine-tuning (updating all model weights), I tested transfer learning, as it is sometimes suggested that training only the final (classification) layer may suffice. However, the performance difference was significant, favoring full fine-tuning. I also attempted training from scratch by importing only the architecture of the model, initializing the weights randomly, and then training, but as expected, this approach was ineffective.
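
A quick way to see what each setting actually trains (a sketch; model here is the model built inside main() for the chosen training_type) is to count trainable parameters:

# Count trainable vs. total parameters for the chosen training_type.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / total: {total:,}")
# transfer_learning leaves only the classification head trainable,
# while fine_tuning and from_scratch update all weights.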

RoBERTa vs LLM (Claude 3.5 Sonnet)
I briefly explored zero-shot LLMs, though with minimal prompt engineering (so 🥱). The model struggled even with basic requests, such as (I used Czech in the actual prompt):

Find variable symbol of creditor. This number has exactly 9 consecutive digits 0–9 without letters or other special characters. It is usually preceded by one of the following abbreviations: ‘ev.č.’, ‘zn. opr’, ‘VS. O’, ‘evid. č. opr.’. On the contrary, I’m not interested in a transaction number with the abbreviation ‘č.j.’. This number does not appear often in documents, it may happen that you will not be able to find it, then write ‘cannot find’. If you’re not sure, write ‘not sure’.

The model sometimes failed to output the 9-digit format accurately. Post-processing would filter out shorter numbers, but there were many false-positive 9-digit numbers.
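
The filter itself is simple; a minimal sketch of the kind of post-processing I mean (the helper name is mine, not part of the deployed pipeline):

import re

def extract_nine_digit_candidates(text):
    # Keep only runs of exactly 9 digits that are not part of a longer number.
    return re.findall(r"(?<!\d)\d{9}(?!\d)", text)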

Occasionally the model inferred incorrect birth IDs based solely on birth dates (even with temperature set to 0). On the other hand, it excelled at extracting names, surnames, and birth dates.

Overall, even in my previous experiments, I found that LLMs (at the time of writing) perform better with general tasks but lack accuracy and reliability for specific or unconventional tasks. The performance in identifying the client was fairly similar for both approaches. For internal reasons, the RoBERTa model was deployed.


