A transformer reading a book, generated by pixray-vqgan

TL;DR: If you want to skip the details of how this works, the finished project is available here: Frame-Semantic-Transformer. A live demo is available as well.

If you want to get meaningful semantic information from a sentence that can be used by algorithms, you first need to parse it. One powerful framework for this is the idea of Frame Semantics. Frame semantics breaks a sentence apart into concepts called “frames”, where each frame contains attributes called “frame elements” which describe what’s going on in the frame, and has a “trigger” in the sentence which evokes the frame.

For instance, consider the sentence below:

Sergey dodged the flower pot that Larry threw in disgust.

A frame that might be present is the idea of “dodging”:

frame: Dodging
trigger: "dodged"
elements:
    Dodger="Sergey"
    Bad_entity="the flower pot that Larry threw in disgust"

There can be multiple frames present in a sentence at a time, and frames can relate to and inherit from other frames as well.

The gold standard for Frame Semantics is a project called FrameNet, which contains an open database of thousands of frames and annotated example texts.
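
As a preview of where this post ends up, here’s roughly what extracting a frame looks like with the finished Frame-Semantic-Transformer library. This is a rough usage sketch, not exact API documentation - the method and field names here are illustrative, and the project README has the real details.

from frame_semantic_transformer import FrameSemanticTransformer

parser = FrameSemanticTransformer()

# Parse the sentence; the result contains one entry per detected frame.
result = parser.detect_frames(
    "Sergey dodged the flower pot that Larry threw in disgust."
)

for frame in result.frames:
    print(frame.name)  # e.g. "Dodging"
    for element in frame.frame_elements:
        print(f"  {element.name} = {element.text}")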

What about existing frame semantic parsers?

I’m certainly not the first person to attempt to build a frame semantic parser (this task is sometimes also called automatic semantic role labeling). The two state-of-the-art projects I found are Open-Sesame and the paper Open-Domain Frame Semantic Parsing Using Transformers.

Open-Sesame is the best performing open-source frame semantic parser, but has a number of problems that make it difficult to work with as an end-user:

  • Models trained with Open-Sesame only run correctly on the computer they were trained on; on any other machine they perform poorly. This includes the provided pre-trained models.
  • Installation is difficult: it requires manually installing dependencies and relies on obscure ML libraries that are hard to get working properly.
  • The pretrained models don’t work correctly on current systems.

The paper Open-Domain Frame Semantic Parsing Using Transformers looks great - it uses Google’s T5 Transformer and claims to achieve even better performance than Open-Sesame. However, it’s not open-source, so there’s no actual code to run or a library to work with.

Building an open-source frame semantic parser

I decided to combine the best of Open-Sesame and “Open-Domain Frame Semantic Parsing Using Transformers” to build an easy-to-use open-source frame semantic parser on modern technology. I used the data splitting, task definitions, and evaluation criteria from Open-Sesame, while using a T5 transformer as a base model as in the open-domain parsing paper.

My goal is to create a frame-semantic parser which meets the following criteria:

  1. Match or exceed Open-Sesame’s performance.
  2. Installable and usable with a single pip install.
  3. Built on modern tech like PyTorch and HuggingFace.

The Tasks

Semantic parsing of a sentence as performed by Open-Sesame requires 3 steps:

  1. Trigger Identification: Find the frame trigger locations in the text.
  2. Frame Classification: For each trigger, determine which frame it corresponds to.
  3. Argument Extraction: After determining the frame, find the corresponding frame elements in the text.

For example, consider the following sentence:

It was no use trying the lift.

For the first step, trigger identification, we would identify the following 2 trigger locations, indicated by *’s in the sentence below:

It was no use *trying the *lift.

Next, we need to identify which frame corresponds with each trigger location:

It was no use *trying the lift.
    -> Attempt_means

It was no use trying the *lift.
    -> Connecting_architecture

Finally, for each trigger and frame, we need to find the frame elements in the sentence:

It was no use *trying the lift. :: Attempt_means
    -> Means="the lift"

It was no use trying the *lift. :: Connecting_architecture
    -> Part="lift"

FrameNet contains tens of thousands of annotated sentences like this, marking the triggers, frames, and frame elements, which we can use to train our model.

T5 Transformer

Transformer architectures have revolutionized the field of natural language processing (NLP) since their introduction in 2017. The typical idea is to start with a transformer model that’s already pre-trained on a massive quantity of text from the internet, and just “fine-tune” it on the actual task you care about.

In this case, we use the T5 transformer provided by HuggingFace. T5 uses the idea of having a single model perform multiple tasks, with each task simply being indicated by adding a keyword to the input text.

For example, for the sentence we discussed above, we could break apart the tasks as follows:

First, trigger identification

input:  "TRIGGER: It was no use trying the lift."
output: "It was no use *trying the *lift."

Next, frame classification

input:  "FRAME: It was no use *trying the lift."
output: "Attempt_means"

input:  "FRAME: It was no use trying the *lift."
output: "Connecting_architecture"

Finally, argument extraction:

input:  "ARGS Attempt_means: It was no use *trying the lift."
output: "Means=the lift"

input:  "ARGS Connecting_architecture: It was no use trying the *lift."
output: "Parts=lift"

Notice above how all the tasks follow the same input/output format: each task takes a string as input and returns a string as output. Each task is specified by putting a keyword at the start of the input followed by a colon, for example FRAME: for frame classification and TRIGGER: for trigger identification. For argument extraction, we also put the name of the frame as part of the task definition: ARGS <frame_name>:.
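
Since every task is plain string-in / string-out, a full parse is just these three tasks chained together. Here’s a rough sketch, assuming a fine-tuned model whose predict() takes an input string and returns a list of output strings (as SimpleT5’s does, discussed below):

def parse_sentence(model, sentence: str) -> list:
    # 1. Trigger identification: the model re-emits the sentence with * before triggers.
    marked = model.predict(f"TRIGGER: {sentence}")[0]
    words = marked.split()
    results = []
    for i, word in enumerate(words):
        if not word.startswith("*"):
            continue
        # Re-mark the sentence so only this one trigger is starred.
        single = " ".join(w if j == i else w.lstrip("*") for j, w in enumerate(words))
        # 2. Frame classification for this trigger.
        frame = model.predict(f"FRAME: {single}")[0]
        # 3. Argument extraction for this trigger and frame.
        elements = model.predict(f"ARGS {frame}: {single}")[0]
        results.append({"trigger": word.lstrip("*"), "frame": frame, "elements": elements})
    return results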

I based the T5 training on SimpleT5, which uses PyTorch Lightning and HuggingFace under the hood.
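
For reference, fine-tuning with SimpleT5 looks roughly like the sketch below. The DataFrame rows are just the examples from above rather than the real FrameNet training set, and the hyperparameters shown are illustrative.

import pandas as pd
from simplet5 import SimpleT5

# SimpleT5 expects "source_text" / "target_text" columns; each row is one task sample.
train_df = pd.DataFrame([
    {"source_text": "TRIGGER: It was no use trying the lift.",
     "target_text": "It was no use *trying the *lift."},
    {"source_text": "FRAME: It was no use *trying the lift.",
     "target_text": "Attempt_means"},
    {"source_text": "ARGS Attempt_means: It was no use *trying the lift.",
     "target_text": "Means=the lift"},
])

model = SimpleT5()
model.from_pretrained("t5", "t5-base")
model.train(
    train_df=train_df,
    eval_df=train_df,  # in practice, a held-out dev split
    max_epochs=3,
    batch_size=8,
)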

…and that’s all it really takes to get a working frame semantic parser using T5!

Providing hints to T5

That’s not the end of the story, unfortunately. This approach performs well already - it actually beats Open-Sesame at argument extraction even without any extra tweaks! However, it doesn’t perform as well at frame classification, and we can do even better at argument extraction.

The key insight is that for the frame classification and argument extraction tasks, we can give T5 some extra hints to help it choose the best results. For frame classification, FrameNet includes a list of “lexical units” which are likely triggers for each frame. We can use this list to find candidate frames for each trigger word.

For instance, for the word try, the following lexical units appear in FrameNet:

  • try.v : Attempt
  • try.v : Try_defendant
  • try.v : Attempt_means
  • try.v : Tasting

With the sentence It was no use *trying the lift., we can extract the labeled trigger word trying, stem it with NLTK to get try, and then check the lexical unit list in FrameNet to get a short list of plausible frames. Then, we can pass these into T5 in the task definition, like below:

input: "FRAME Attempt Try_defendant Attempt_means Tasting: It was no use *trying the lift."
output: "Attempt_means"

By checking the lexical units list in FrameNet, we’re able to provide 4 possible frames to T5 in the task definition, which makes it much easier for the model to simply pick one of those 4 frames rather than needing to guess the frame out of thin air! In the Frame-Semantic-Transformer project, we take this even further and also check bigrams of words involving the trigger when searching for matching lexical units.
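
Here’s a sketch of that hint-building step, using NLTK’s PorterStemmer. The LEXICAL_UNITS table below is a tiny stand-in for FrameNet’s real lexical unit database, and both the trigger word and the lexical unit lemmas are normalized with the same stemmer so the lookup stays consistent.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Tiny stand-in for FrameNet's lexical unit database, keyed by stemmed lemma.
LEXICAL_UNITS = {
    stemmer.stem("try"): ["Attempt", "Try_defendant", "Attempt_means", "Tasting"],
    stemmer.stem("lift"): ["Connecting_architecture"],
}

def build_frame_task(sentence_with_trigger: str, trigger_word: str) -> str:
    # Stem the trigger word and look up candidate frames to use as hints.
    candidates = LEXICAL_UNITS.get(stemmer.stem(trigger_word.lower()), [])
    return f"FRAME {' '.join(candidates)}: {sentence_with_trigger}"

print(build_frame_task("It was no use *trying the lift.", "trying"))
# -> "FRAME Attempt Try_defendant Attempt_means Tasting: It was no use *trying the lift."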

For the argument extraction task, we can similarly help T5 by pre-emptively pulling out a list of all the possible frame elements for the frame in question. For instance, the Attempt_means frame has the following possible elements, abbreviated for clarity:

  • Agent
  • Means
  • Goal
  • Circumstances

We can similarly provide this list to T5 as part of the task header:

input: "ARGS Attempt_means | Agent Means Goal Circumstances: It was no use *trying the lift."
output: "Attempt_means"

Augmenting the data

After first trying this out, it became immediately apparent that the model was overfit to the data as it appears in FrameNet. Specifically, in FrameNet, all sentences end in proper punctuation. If you ask for the frames of a sentence that doesn’t have a period at the end, the model often freaks out and starts repeating itself over and over, outputting nonsense. Since it never encountered an input sentence without a period at the end during training, it didn’t know what to do!

To alleviate this, Frame-Semantic-Transformer adds some extra data augmentations to the training samples, like occasionally dropping the period at the end of the sentence, changing “can’t” into “cannot”, or making everything lowercase. These tweaks won’t improve the model’s score on the test data, but they should help it work better on unseen data.
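
In code, these augmentations amount to a few random string tweaks applied when building training samples. A simplified sketch (the probabilities here are illustrative):

import random

def augment(sentence: str) -> str:
    # Occasionally drop the trailing period so the model sees unpunctuated input.
    if sentence.endswith(".") and random.random() < 0.3:
        sentence = sentence[:-1]
    # Occasionally expand a contraction.
    if random.random() < 0.3:
        sentence = sentence.replace("can't", "cannot")
    # Occasionally lowercase the whole sentence.
    if random.random() < 0.2:
        sentence = sentence.lower()
    return sentence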

Evaluating performance

So of course the question is: how does this T5-based approach compare to Open-Sesame? I trained the model on the same breakdown of train/dev/test documents from FrameNet as Open-Sesame, and used the same metrics so the results would be a fair apples-to-apples comparison. I also trained 2 variants of the T5 model: one using t5-base, which is about 850MB, and another using t5-small, which is about 230MB.

The results are as follows on the Open-Sesame test set:

Task                     Sesame F1   Small Model F1   Base Model F1
Trigger identification   0.73        0.70             0.72
Frame classification     0.87        0.81             0.87
Argument extraction      0.61        0.70             0.72

The base model performs similarly to Open-Sesame at trigger identification and frame classification, but performs significantly better at argument extraction. The small model performs a bit worse than the base model, under-performing Open-Sesame on trigger identification and frame classification, but is still significantly better than Open-Sesame at argument extraction.

Next steps

I expect there are still improvements that can be made to help Frame-Semantic-Transformer perform even better than it does now:

  • I didn’t tune hyperparams much, so I’m sure there are a few more F1 points to be squeezed out of that.
  • I worry that the model is overfit on FrameNet training data and won’t perform as well on real-world tasks; this will need more investigation and testing.
  • The Open-Domain Frame Parsing paper talks about using numbers to mark spans in argument extraction, which might improve performance on that task.
  • The paper also tries using custom decoders per task instead of the standard T5 text decoder.
  • There are even larger T5 models to try using, like t5-large and t5-3b which could perform even better.
  • FrameNet lexical units don’t cover everything, so it’s probably possible to use some sort of clustering of lexical units to give better hints to T5 for frame classification.

Longer term, it would be great to expand this to bigger / better datasets than FrameNet (ex. Multilingual FrameNet) that can be used to train the model. It would also be awesome to try generating more frames / lexical units for FrameNet automatically, using a technique like what was done to generate Atomic10x in the paper Symbolic Knowledge Distillation: from General Language Models to Commonsense Models.

Any contributions to the project or thoughts/feedback are welcome!