Microsoft Research Unveils New BERT&#x2d;Based Biomedical NLP AI Model

Nefi Alarcon

To help accelerate natural language processing in biomedicine, Microsoft Research developed a BERT-based AI model that outperforms previous biomedicine NLP…

NVIDIA

•

Nefi Alarcon

•2 min read•intermediate•

--

•View Original

BERTNatural Language ProcessingTensorFlow

Overview

Microsoft Research has developed a BERT-based AI model named PubMedBERT that significantly enhances natural language processing in biomedicine, outperforming previous methods. This model is designed to assist researchers in advancing their work more rapidly by classifying documents and extracting medical information.

What You'll Learn

1

How to utilize PubMedBERT for biomedical NLP tasks

2

Why domain-specific pretraining is essential for biomedical NLP

3

When to apply the PubMed dataset for training NLP models

Prerequisites & Requirements

Understanding of natural language processing concepts
Familiarity with NVIDIA DGX-2 and TensorFlow(optional)

Key Questions Answered

What is PubMedBERT and how does it improve biomedical NLP?

PubMedBERT is a BERT-based AI model developed by Microsoft Research that enhances biomedical natural language processing by outperforming previous methods. It is trained on a vast dataset from PubMed, which includes over 14 million abstracts, enabling it to classify documents and extract medical information effectively.

What datasets were used to train PubMedBERT?

The model was trained using a comprehensive biomedical NLP benchmark sourced from publicly-available datasets, specifically utilizing vocabulary from the PubMed dataset, which contains over 14 million abstracts and 3.2 billion words.

What hardware was used for training the PubMedBERT model?

The training of the PubMedBERT model was conducted on an NVIDIA DGX-2 system equipped with 16 NVIDIA V100 GPUs, utilizing NVIDIA’s NGC TensorFlow implementation to optimize performance.

What are the results of using domain-specific pretraining in PubMedBERT?

The researchers found that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across various benchmarks, as stated in their paper.

Key Statistics & Figures

Number of abstracts in PubMed dataset

over 14 million

This extensive dataset is crucial for training the PubMedBERT model.

Total words in PubMed dataset

3.2 billion

The large vocabulary from this dataset enhances the model's performance in biomedical NLP tasks.

Technologies & Tools

Hardware

Nvidia Dgx-2

Used for training the PubMedBERT model with high computational power.

Hardware

Nvidia V100 Gpus

16 GPUs were utilized to accelerate the training process of the model.

Software

Nvidia’s Ngc Tensorflow

Implementation used for training the PubMedBERT model.

Key Actionable Insights

1
Leverage PubMedBERT for document classification tasks in biomedical research to enhance efficiency.
Using PubMedBERT can significantly reduce the time spent on manual document classification, allowing researchers to focus on more critical analysis and interpretation of data.

2
Incorporate domain-specific pretraining in your NLP projects to achieve better performance.
Domain-specific pretraining can lead to improved accuracy and relevance in NLP tasks, especially in specialized fields like biomedicine.

3
Utilize the BLURB benchmark for evaluating your own biomedical NLP models.
The BLURB benchmark provides a comprehensive framework for assessing model performance, which can help in identifying areas for improvement in your NLP solutions.