Unlocking the Power of Sentence Similarity in Marathi

7 min read · Feb 12, 2023

A beginner’s guide to the mahaNLP Python library

Written by Ananya Joshi, Janhavi Gadre & Raviraj Joshi

Have you ever wondered how search engines are able to understand the meaning behind your queries and provide relevant results? One of the key technologies behind this is Sentence Similarity.

Sentence similarity is a measure of the degree to which two sentences convey the same meaning. It is a crucial component in natural language processing and is used in a wide range of applications, from information retrieval and summarization to machine translation and dialogue systems.

Image src: https://peltarion.com/blog/applied-ai/a-business-view-on-semantic-similarity

Computing sentence similarity is a complex task, as it requires understanding the meaning of words, phrases, and entire sentences. It is usually achieved through a combination of techniques such as semantic analysis, syntactic analysis, and machine learning.

One popular approach is to use word embeddings, which are numerical representations of words that capture their meaning. By comparing the word embeddings of the words in two sentences, one can determine the semantic similarity between the sentences.
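As a rough illustration of this idea, the sketch below averages the word vectors in each sentence and compares the averages with cosine similarity. The tiny 3-dimensional vectors and the helper names (`TOY_VECTORS`, `sentence_vector`, `cosine_similarity`) are invented for this example; real embeddings are learned from data and have far more dimensions.

```python
import math

# Toy 3-dimensional word vectors, invented purely for illustration.
# Real embeddings (word2vec, fastText, ...) have hundreds of dimensions.
TOY_VECTORS = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.2, 0.1],
    "sleeps": [0.1, 0.9, 0.0],
    "naps": [0.2, 0.8, 0.1],
    "stock": [0.0, 0.1, 0.9],
    "falls": [0.1, 0.0, 0.8],
}


def sentence_vector(sentence):
    """Average the vectors of the known words in a sentence."""
    vectors = [TOY_VECTORS[w] for w in sentence.lower().split() if w in TOY_VECTORS]
    if not vectors:
        raise ValueError("no known words in sentence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


close = cosine_similarity(sentence_vector("cat sleeps"), sentence_vector("kitten naps"))
far = cosine_similarity(sentence_vector("cat sleeps"), sentence_vector("stock falls"))
print(f"similar pair: {close:.3f}, unrelated pair: {far:.3f}")
```

Averaging ignores word order, which is one reason sentence-level models tend to work better for this task.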

Image src: https://www.tensorflow.org/hub/tutorials/text_cookbook

Another approach is to use a pre-trained language model, such as BERT, to extract the meaning of the sentences and compare them using cosine similarity or other similarity measures.

Sentence-BERT (SBERT) is a variation of the popular BERT model that is fine-tuned on sentence pairs using a siamese network, so that it produces sentence embeddings that can be compared directly. This makes SBERT better suited than plain BERT to sentence-level tasks, particularly those that depend on the fine-grained meaning of sentences and the relationships between them.

Thus, it is more effective than BERT in tasks like semantic textual similarity, paraphrase detection, and natural language inference.
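To make the idea of a sentence embedding more concrete: SBERT-style models commonly turn BERT's per-token outputs into one fixed-size vector by pooling, often by averaging the token vectors while skipping padding. The sketch below shows that mean-pooling step on made-up numbers. It is an illustration of the general technique only, not the exact configuration of the L3Cube models, and `mean_pool` is a hypothetical helper name.

```python
def mean_pool(token_vectors, attention_mask):
    """Average token vectors, ignoring padding positions (mask == 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention mask excludes every token")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Made-up 2-dimensional token vectors for a 3-token sentence padded to length 4.
token_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]  # the last position is padding and is skipped

sentence_embedding = mean_pool(token_vectors, attention_mask)
print(sentence_embedding)
```

Because every sentence ends up as a vector of the same size, any two sentences can then be compared with a measure such as cosine similarity.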

In this article, we take a deep dive into the world of Sentence Similarity and explore how the mahaNLP library can be used to compute the similarity of Marathi sentences. 🚀

⭐️ MahaNLP is a Python-based natural language processing library that focuses on the Indian language of Marathi. Developed by L3Cube-Pune, this library aims to bring Marathi to the forefront of IndicNLP and promote its use in AI for Maharashtra. ⭐️

Alongside submodules for a variety of other text-processing tasks, mahaNLP provides a Similarity submodule, which we explore in this article.

MahaNLP enables the computation of sentence similarity using two powerful SBERT models: the marathi-sentence-bert-nli and the marathi-sentence-similarity-sbert models developed recently by L3Cube.

The marathi-sentence-bert-nli model is a Sentence-BERT trained on a large dataset of sentence pairs. This model can understand the relationship between two sentences, making it an ideal choice for sentence similarity analysis.
Its advanced version, marathi-sentence-similarity-sbert, is trained using a two-step process that ends with fine-tuning on the Marathi Semantic Textual Similarity dataset, which tunes it specifically for scoring how close two Marathi sentences are in meaning.

Ready to discover the similarities between two Marathi sentences? Follow along with this easy-to-understand, step-by-step guide and unlock the secrets of sentence comparison in the Marathi language! 😄

SETUP:

To use the sentence similarity model in mahaNLP, you will first need to install the library. You can install mahaNLP using pip by running the following command:

pip install mahaNLP==0.6

Whether you’re a seasoned ML practitioner or a beginner programmer, the mahaNLP library has got you covered! The library’s basic mode is designed with simplicity in mind, offering an easy-to-use approach for novice programmers. On the other hand, its advanced mode is tailored for experienced ML professionals, offering more flexibility and options to choose from when it comes to selecting the best model for the job.

➡️ The basic mode:

Import SimilarityAnalyzer from the similarity module into your Python script and create an object of the SimilarityAnalyzer class. It uses the default model, ‘marathi-sentence-similarity-sbert’. This model delivers reliable results for a wide range of use cases, so you can use it without any further configuration or setup.

from mahaNLP.similarity import SimilarityAnalyzer

# Loads the default model: marathi-sentence-similarity-sbert
similarity_model = SimilarityAnalyzer()

➡️ The advanced mode:

Import SimilarityModel from the model_repo module into your Python script and create an object of the SimilarityModel class.

from mahaNLP.model_repo import SimilarityModel

# Create an object; with no argument this loads the default model,
# marathi-sentence-similarity-sbert
similarity_model2 = SimilarityModel()

Here, mahaNLP provides added functionality to see a list of available similarity models and configure the most appropriate model for your specific use case. The ‘list_models’ method allows the user to see a list of available models in the mahaNLP library. The user can then select and use a specific model by passing its name as an argument.

similarity_model2.list_models()
#Output:
#similarity models: 
#marathi-sentence-similarity-sbert:l3cube-pune/marathi-sentence-similarity-sbert
#marathi-sentence-bert-nli :  l3cube-pune/marathi-sentence-bert-nli

# Specify the new similarity model
similarity_model2 = SimilarityModel('marathi-sentence-bert-nli')

It is important to note that depending on the model you choose, the performance and accuracy may vary. It is recommended to test and compare the results using different models to find the one that best suits your needs.
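One common way to compare similarity models is to check how well each model's scores agree with human similarity ratings for the same sentence pairs, for example using Spearman rank correlation. The sketch below assumes you have already collected the scores from each model; the ratings, the scores, and the names `model_a` and `model_b` are all hypothetical and are not real results for the L3Cube models.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: human ratings (0-5) for five sentence pairs, and the
# scores two models assigned to the same pairs. None of these are real results.
human_ratings = [4.8, 0.5, 3.0, 1.2, 4.0]
model_scores = {
    "model_a": [0.91, 0.10, 0.62, 0.35, 0.80],
    "model_b": [0.70, 0.40, 0.30, 0.55, 0.90],
}

correlations = {name: spearman(human_ratings, s) for name, s in model_scores.items()}
best = max(correlations, key=correlations.get)
print(correlations, "-> best:", best)
```

A higher correlation means the model orders sentence pairs more like the human annotators do; in practice you would want many more than five pairs.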

The Similarity model provides us with two functionalities:

get_similarity_score: Returns the similarity score of a string with respect to another string
embed_sentences: Returns the sentence embedding values in an array

Let us see examples of each of these functionalities:

1. With mahaNLP, you can compare two sentences and get a similarity score. The score ranges from 0 to 1, where 1 means the sentences are identical in meaning and 0 means they are completely dissimilar.

sentence1 = "तो जीवनात यशस्वी झाला की नाही?"
sentence2 = "तो आयुष्यात यशस्वी झाला की नाही?"

similarity_score = similarity_model.get_similarity_score(sentence1, sentence2)
print(similarity_score)

You can also compare a target sentence with a collection of sentences to find the most similar sentence from the collection:

target =  'भारतात नवीन तंत्रज्ञान विकसित करणं हे आमचे उद्दिष्ट आहे' 
 
textsentences = ['भारतात तंत्रज्ञानाच्या क्षेत्रात विकास करणं हे आमचे उद्दिष्ट आहे.',
                'एक माणूस बासरी वाजवत आहे.',
                'तंत्रज्ञान हा उद्योग आणि उपजीविकेचा अविभाज्य भाग बनला आहे.',
                'नाशिक शहरात रविवारी रात्रीपासूनच पाऊस होत आहे.',
                'भारतातील उत्पादन वाढवणे हे आमचे उद्दिष्ट आहे.',]
 
print(similarity_model.get_similarity_score(target,textsentences))

The first sentence receives the highest similarity score, 0.92, which shows that it is semantically closest to our target sentence.

There’s an optional boolean parameter, as_dict (default: False), which controls the return type. Setting it to True returns a dictionary with sentences as keys and their corresponding similarity scores as values.
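A dictionary of this shape is convenient for ranking. The sketch below uses a hand-written stand-in for what a call such as `get_similarity_score(target, textsentences, as_dict=True)` might return, since the exact output is not shown here. Only the 0.92 figure comes from this article; every other score is invented for illustration.

```python
# Stand-in for a sentence -> score dictionary of the kind described above.
# Only 0.92 comes from the article; the remaining scores are invented.
scores = {
    "भारतात तंत्रज्ञानाच्या क्षेत्रात विकास करणं हे आमचे उद्दिष्ट आहे.": 0.92,
    "एक माणूस बासरी वाजवत आहे.": 0.05,
    "तंत्रज्ञान हा उद्योग आणि उपजीविकेचा अविभाज्य भाग बनला आहे.": 0.48,
    "नाशिक शहरात रविवारी रात्रीपासूनच पाऊस होत आहे.": 0.08,
    "भारतातील उत्पादन वाढवणे हे आमचे उद्दिष्ट आहे.": 0.66,
}

# Sort the (sentence, score) pairs from most to least similar.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best_sentence, best_score = ranked[0]

for sentence, score in ranked:
    print(f"{score:.2f}  {sentence}")
```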

2. You can use the ‘embed_sentences’ function to get the embedding of a sentence. Sentence embeddings are a way of representing sentences as vectors, which can be used for various NLP tasks such as text classification, clustering or similarity.

sent = "नेहमी आनंदी राहण्याचा प्रयत्न करा"
sentence_embedding = similarity_model.embed_sentences(sent)
print("Sentence Embedding: \n", sentence_embedding)
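Once sentences are represented as vectors, tasks such as semantic search reduce to simple vector arithmetic. The sketch below finds the stored vectors closest to a query vector by cosine similarity. The 3-dimensional vectors are invented stand-ins for real sentence embeddings, and `normalize` and `top_k` are hypothetical helper names, not part of mahaNLP.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in vector))
    return [x / length for x in vector]


def top_k(query, corpus, k=2):
    """Return the k (index, score) pairs with the highest cosine similarity."""
    q = normalize(query)
    scored = []
    for index, vector in enumerate(corpus):
        v = normalize(vector)
        # The dot product of unit vectors equals their cosine similarity.
        scored.append((index, sum(a * b for a, b in zip(q, v))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Invented 3-dimensional vectors standing in for real sentence embeddings.
corpus_embeddings = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.8, 0.3, 0.1]]
query_embedding = [1.0, 0.2, 0.0]

results = top_k(query_embedding, corpus_embeddings, k=2)
for index, score in results:
    print(f"corpus sentence {index}: {score:.3f}")
```

With real embeddings, the corpus vectors would be computed once and reused for every query.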

MahaNLP is a game-changer for working with text data in the Marathi language. Apart from similarity, it also offers a variety of other functions such as tokenization, preprocessing, autocompletion, tagging, etc. With this powerful library, preprocessing Marathi text has never been simpler!

What sets mahaNLP apart is that its models are fine-tuned specifically for Marathi, so they capture the nuances and idiomatic expressions of the language better than a general-purpose NLP library would. However, because the models are trained on Marathi text, they can only determine the similarity of Marathi sentences.

If you need to compare sentences in any other language, you will need to use models or libraries trained on that language.

Handling large datasets is no problem with mahaNLP. It is built to scale and can handle the demands of large-scale text analysis projects. This makes it a great choice for applications such as sentiment analysis, named entity recognition, and hate speech detection.

In conclusion, mahaNLP is a valuable addition to the field of Indian Language Processing and brings the Marathi language to the forefront. It’s a great step forward in the field of AI for Maharashtra, and the developer L3Cube’s vision of making Marathi a resource-rich language is getting closer to reality. ✨

We hope this article has given you a glimpse into one of the capabilities of mahaNLP and the Marathi language. We look forward to seeing the innovative ways that researchers and developers will use this library to further our understanding of natural language!

Unlock the full potential of the Marathi language with mahaNLP as your toolbox! 🧰

We can’t wait to hear about the unique and creative ways you’re using mahaNLP. Leave a comment and share your story with us, and don’t forget to give the article a clap to show your support! 👏

Check out the mahaNLP GitHub page to learn more about the latest features and updates!
