Discovering Themes in News Articles: A Step-by-Step Guide to Supervised BERTopic

Ananya Joshi
10 min read · Mar 6, 2023

In natural language processing, topic modeling is a powerful technique for uncovering the hidden themes, or topics, in a large corpus of unstructured text, based on the words that occur in it.

Let me introduce you to BERTopic:

BERTopic is a recently introduced topic modeling technique that has gained popularity because of its ability to work with pre-trained language models. It uses transformer-based architectures to represent the text, which has shown promising results compared to other techniques.

BERTopic often outperforms classical statistical techniques like LDA, NMF, and LSI because it represents the text with contextual embeddings, which capture the complex relationships between words and phrases in the data. It can also build on pre-trained language models, making it much more efficient than techniques that require a large number of iterations to achieve convergence. The typical unsupervised algorithm involves the following 6 steps (refer to the BERTopic paper for an in-depth explanation):

1. Embed the documents with a sentence-transformer model.
2. Reduce the dimensionality of the embeddings with UMAP.
3. Cluster the reduced embeddings with HDBSCAN.
4. Tokenize and count the words in each cluster (CountVectorizer).
5. Weight the words per cluster with c-TF-IDF to build the topic representations.
6. Optionally fine-tune the topic representations.
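To make these steps concrete, here is a sketch that wires up each default component explicitly (the exact default parameters may differ between BERTopic versions):

from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from sklearn.feature_extraction.text import CountVectorizer

topic_model = BERTopic(
    embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),  # 1. embed documents
    umap_model=UMAP(n_neighbors=15, n_components=5),          # 2. reduce dimensionality
    hdbscan_model=HDBSCAN(min_cluster_size=10),               # 3. cluster embeddings
    vectorizer_model=CountVectorizer(stop_words="english"),   # 4. tokenize per cluster
    ctfidf_model=ClassTfidfTransformer(),                     # 5. weight with c-TF-IDF
)
# 6. (optional) fine-tune the topic representations afterwards

Every one of these components can be swapped out, which is exactly what the supervised setup below takes advantage of.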

Topic modeling methods:

In unsupervised topic modeling, the algorithm is given an unlabeled dataset and tries to discover the hidden topics by itself. In contrast, in supervised topic modeling, the algorithm is trained on labeled data, making it more effective at identifying the most relevant topics within the dataset. BERTopic lets us perform topic modeling in a supervised, unsupervised, or semi-supervised way, as the sketch below shows.
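At the API level, the three modes differ only in what is passed as y (docs, labels, and partial_labels are placeholder names here; topic_model is any BERTopic instance; the -1 convention for unlabeled documents comes from the BERTopic documentation):

# Unsupervised: discover topics from unlabeled documents
topic_model.fit_transform(docs)

# Supervised: every document carries an integer label
topic_model.fit_transform(docs, y=labels)

# Semi-supervised: label what you can, use -1 for the rest
topic_model.fit_transform(docs, y=partial_labels)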

In this article, we will perform supervised topic modeling using the News Category dataset. Using the pre-defined labels to group similar topics together makes it easier to interpret the results and draw actionable insights from the data.

The News Category dataset is a collection of news articles, along with their corresponding categories, that has been curated for use in machine learning applications. It consists of over 200,000 articles from HuffPost and covers a wide range of categories such as sports, politics, technology, and entertainment.

Let's dive into the code!

Install the BERTopic library using the following command:

!pip install bertopic

Import BERTopic and the other required libraries:

from bertopic import BERTopic
import pandas as pd

1. Loading and pre-processing the dataset:

In this step, we load the dataset as a data frame in Python. The data frame has 200,000+ rows and 6 columns. We convert the category labels into integer values and keep only the 'category' and 'short_description' columns, plus a new column named 'category_idx', removing the rest of the columns from the data frame.

data = pd.read_json('/content/drive/MyDrive/Topic Identification/News_Category_Dataset_v3.json', lines=True)

# Remove the columns we don't need
data.drop(['link', 'authors', 'date', 'headline'], inplace=True, axis=1)

# Map each category name to an integer index
categories = data.category.unique().tolist()

def map_category_to_idx(cat_name):
    return categories.index(cat_name)

data['category_idx'] = [map_category_to_idx(i) for i in data.category]

# Keep the documents and their labels as plain lists for BERTopic
docs = data['short_description'].tolist()
y = data['category_idx'].tolist()

The data frame now holds the 'category' and 'short_description' columns plus the new 'category_idx' column.
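A quick sanity check (the exact row count depends on the dataset version):

print(data.columns.tolist())  # ['category', 'short_description', 'category_idx']
print(len(data))              # 200,000+ rows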

2. Creating & configuring the model:

Unlike many other topic modeling techniques, BERTopic allows us to choose any embedding model like word2vec, BERT or sentence-BERT. This means that we can easily customize the topic modeling approach to fit our specific needs and data characteristics. Similarly, any clustering or classification algorithm can be easily plugged into the BERTopic framework.

For this article, we will use the default parameters. The default embedding model is 'all-MiniLM-L6-v2'.

In the supervised approach, we skip dimensionality reduction and replace the cluster model with a classifier; here we use scikit-learn's Logistic Regression.

from bertopic.vectorizers import ClassTfidfTransformer
from bertopic.dimensionality import BaseDimensionalityReduction
from sklearn.linear_model import LogisticRegression

# Skip dimensionality reduction and swap the clustering step for a classifier
empty_dimensionality_model = BaseDimensionalityReduction()
clf = LogisticRegression()
ctfidf_model = ClassTfidfTransformer(reduce_frequent_words=True)

# Create a fully supervised BERTopic instance
topic_model = BERTopic(
    umap_model=empty_dimensionality_model,
    hdbscan_model=clf,
    ctfidf_model=ctfidf_model,
    language="english",
    top_n_words=5,
    calculate_probabilities=True
)

# Fit the model on the documents and their category labels
topics, probabilities = topic_model.fit_transform(docs, y=y)

The model learns 42 different topics. Let's view them using the following command:

topic_model.get_topic_info()
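get_topic_info() returns a data frame with (at least) the topic id, the number of documents assigned to it, and an auto-generated name:

info = topic_model.get_topic_info()
print(info[['Topic', 'Count', 'Name']].head())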

We can see the list of the top 5 words from each topic:

# Print the top words of each of the 42 topics
for i in range(42):
    print(f"Topic {i}: {[t[0] for t in topic_model.get_topic(i)]}")

Topics (1, 6) and (5, 14) are quite similar. In such cases, similar topics can be merged together using the following code:

topic_model.merge_topics(docs, [[1, 6], [5, 14]])

Note: The topic '-1' denotes the outliers. If a document is assigned topic -1, it means the document could not be assigned to any of the known topics.
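Our classifier assigns every document a label, but in the unsupervised setting, recent BERTopic releases can re-assign outliers after training. A sketch, assuming BERTopic >= 0.13 and the default reduction strategy:

# Re-assign outlier documents to their closest topics
new_topics = topic_model.reduce_outliers(docs, topics)
topic_model.update_topics(docs, topics=new_topics)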

We can also provide custom label names for some or all of the topics:

# Define custom labels for some or all topics
my_custom_labels = {-1: "Other", 0: "Politics", 1: "Health", 2: "Movie",
                    3: "Family", 4: "Travel", 5: "Trend", 6: "Terrorism",
                    7: "LGBTQ community", 8: "Food", 9: "Business",
                    10: "Entertainment", 11: "Sports", 12: "Racism", 13: "Home",
                    14: "Art", 15: "Wedding", 16: "Women Empowerment"}

topic_model.set_topic_labels(my_custom_labels)
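The labels are stored on the model as topic_model.custom_labels_, and most of BERTopic's visualizations can display them (custom_labels is a standard flag on the visualization methods):

# Plot the top words per topic under their custom label names
topic_model.visualize_barchart(custom_labels=True)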

3. Evaluation of the model:

In topic modeling, a topic is typically represented by the N words with the highest probability of belonging to that topic. To evaluate the quality of the resulting topics, coherence scores are commonly used. Topic coherence is calculated by examining the top words in each topic and measuring the degree of semantic similarity among them.

To achieve broader coverage of the analyzed corpus, it is essential to obtain a greater variety of topics, rather than solely prioritizing coherence among them. This is because greater diversity among the resulting topics leads to a more comprehensive exploration of the various aspects within the corpus. Hence, topic diversity becomes another important metric to evaluate the topic models. The diversity score quantifies how well the topics capture the breadth and variety of information contained in the corpus.
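Topic diversity is simple enough to compute by hand. A common definition is the fraction of unique words among the top-N words of all topics; a minimal sketch:

def topic_diversity(model, top_n=5):
    # Collect the top-N words of every topic, skipping the outlier topic -1
    all_words = []
    for topic_id in model.get_topics():
        if topic_id == -1:
            continue
        all_words += [word for word, _ in model.get_topic(topic_id)][:top_n]
    # 1.0 means no word is shared between topics
    return len(set(all_words)) / len(all_words)

print("Topic Diversity =", topic_diversity(topic_model))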

In this article, we will focus on finding the coherence score of the model.

from gensim import corpora
from gensim.models import CoherenceModel

def calc_coherence(model, texts):
    # Extract the vectorizer and tokenizer from BERTopic
    vectorizer = model.vectorizer_model
    tokenizer = vectorizer.build_tokenizer()

    # Tokenize the documents and build the gensim dictionary and corpus
    tokens = [tokenizer(doc) for doc in texts]
    dictionary = corpora.Dictionary(tokens)
    corpus = [dictionary.doc2bow(token) for token in tokens]

    # Collect the top words of every topic, excluding the outlier topic -1
    topic_words = [[word for word, _ in model.get_topic(topic)]
                   for topic in model.get_topics() if topic != -1]

    # Compute the c_v coherence score
    coherence_model_bertopic = CoherenceModel(topics=topic_words, texts=tokens,
                                              corpus=corpus, dictionary=dictionary,
                                              coherence='c_v')
    return coherence_model_bertopic.get_coherence()

print("Coherence Score =", calc_coherence(topic_model, docs))

This model gives a coherence score of about 0.65.

It’s important to note that there can be a tradeoff between coherence and diversity. For example, increasing the number of topics can increase diversity but decrease coherence, while decreasing the number of topics can increase coherence but decrease diversity.

Therefore, the optimal balance between coherence and diversity will depend on the specific goals and context of the topic modeling project. In general, a good topic model should strike a balance between coherence and diversity that is appropriate for the intended application.

4. Testing the model:

It's time for the most exciting part! Let's test our model on news passages from various domains:

a) Science & technology:


newdoc="After journeying this summer through a narrow, sand-lined pass, NASA’s Curiosity Mars rover recently arrived in the 'sulfate-bearing unit'. a long-sought region of Mount Sharp enriched with salty minerals. Scientists hypothesize that billions of years ago, streams, and ponds left behind the minerals as the water dried up. Assuming the hypothesis is correct, these minerals offer tantalizing clues as to how – and why – the Red Planet’s climate changed from being more Earth-like to the frozen desert it is today. The minerals were spotted by NASA’s Mars Reconnaissance Orbiter years before Curiosity landed in 2012, so scientists have been waiting a long time to see this terrain up close. Soon after arriving, the rover discovered a diverse array of rock types and signs of past water."

# Find the top 3 most relevant topics
similar_topics, similarity = topic_model.find_topics(newdoc, top_n=3)
print(f'Top 3 topics: {[topic_model.custom_labels_[t] for t in similar_topics]}')

Output: Top 3 topics: [‘Science’, ‘Environment’, ‘Nature’]

b) Education


newdoc="Indians prefer to study abroad. The number of Indians enrolled in foreign universities has gone up from 4.44 lakh in 2021 to 7.5 lakh in 2022, as per the Centre. However, some countries either provide free or discounted education to international students. While European countries do not demand tuition fees from international students, they might charge a relatively smaller amount under the administrative fee. Among European nations, Germany remains the most popular nation as according to the data released by the Government of India, a total of 34,864 Indian students were present in Germany in 2022. Interestingly, Germany abolished the concept of tuition fees in 2014, so higher education degrees remain free of cost for domestic and international students."

# Find the top 3 most relevant topics
similar_topics, similarity = topic_model.find_topics(newdoc, top_n=3)
print(f'Top 3 topics: {[topic_model.custom_labels_[t] for t in similar_topics]}')

Output: Top 3 topics: [‘Education’, ‘Countries’, ‘Ethnicity’]

c) Crime/ Terrorism


newdoc="Abe was shot at twice while he was giving a speech on a street in the city of Nara on Friday morning. Security officials at the scene tackled the gunman and a 41-year-old suspect is now in police custody. Several other handmade weapons, similar to those used in the attack, had been confiscated after a search of the suspect's house, police officers told a news conference. Explosives were also found at the home and police said they had advised residents to evacuate the area. The suspected shooter told officers he had a grudge against a specific group he believed Abe was connected to, police said, adding that they were investigating why the former PM was targeted out of other people related to the group. Prime Minister Fumio Kishida condemned the attack, saying: 'It is barbaric and malicious and it cannot be tolerated."

# Find the top 3 most relevant topics
similar_topics, similarity = topic_model.find_topics(newdoc, top_n=3)
print(f'Top 3 topics: {[topic_model.custom_labels_[t] for t in similar_topics]}')

Output: Top 3 topics: [‘Crime’, ‘Police’, ‘Terrorism’]

Thus, we find that the model works pretty well across various domains! 👏
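Note that find_topics ranks topics by embedding similarity to the query text. Because our model was trained with a classifier, new documents can also be labeled directly with transform (a sketch using the standard BERTopic API):

# Predict topics for a batch of new documents with the fitted pipeline
new_topics, new_probs = topic_model.transform([newdoc])
print(topic_model.custom_labels_[new_topics[0]])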

5. Saving the model:

Finally, we can save the model in a cloud drive or local file system in the following way:

# Save the topic model
# syntax: model.save("path_of_folder/model_name")
topic_model.save("/drive/MyDrive/topicmodel_36NewsCategories")

We can load a previously saved model using the following line of code:

# Load the topic model
# syntax: BERTopic.load("path_of_folder/model_name")
my_topic_model = BERTopic.load("/drive/MyDrive/topicmodel_36NewsCategories")
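If the saved file is too large, the underlying sentence-transformer can be excluded from it and supplied again at load time (save_embedding_model and embedding_model are standard parameters of BERTopic's save/load, though defaults vary across versions):

# Save without the (large) embedding model ...
topic_model.save("/drive/MyDrive/topicmodel_36NewsCategories",
                 save_embedding_model=False)

# ... and pass the embedding model back in when loading
my_topic_model = BERTopic.load("/drive/MyDrive/topicmodel_36NewsCategories",
                               embedding_model="all-MiniLM-L6-v2")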

In conclusion, supervised topic modeling with BERTopic is a powerful tool for exploring complex and varied text data. By following the steps outlined in this article, you can harness the full power of BERT and create high-quality, interpretable topics that accurately reflect the themes and ideas present in your data. With the flexibility and versatility of BERTopic, you can easily adapt the algorithm to suit your specific needs. Whether you’re a data scientist, researcher, or analyst, supervised topic modeling with BERTopic will surely unlock new insights and perspectives into the data you work with. So why not give it a try today and see what exciting discoveries await? 😉

If you found this article insightful, hit the clap button and follow me for more exciting articles and tutorials! 😄
