How I Built an NLP-Based Recommendation System

Manikanta P
May 1, 2023


TL;DR:
I built an NLP-based recommendation system that recommends ML papers and explains the reasoning behind each recommendation using OpenAI’s GPT API.

In this article, I will share my experience of building an NLP-based recommendation system from scratch in just 24 hours, using a variety of tools and techniques: Next.js, OpenAI’s GPT API, BERT, SentenceTransformers, SPECTER, FastAPI, and metadata scraped from arXiv covering the last six months of ML papers.

Tech stack of the system

Whenever I read a paper, I find myself constantly seeking recommendations (blame TikTok and YouTube 🤔). This inspired me to create a recommendation system for academic papers.

Initially, I searched for a dataset specifically related to machine learning papers. While I found some datasets, they didn’t suit my needs, so I began scraping metadata from arXiv instead. I thought this might be helpful to others, so I deployed the scraper here, and the code is here.

After obtaining the dataset containing metadata, I explored various methods for computing embeddings for each paper:

1. TF-IDF

TF-IDF (term frequency-inverse document frequency) is a numerical statistic that measures how important a word is to a document in a corpus.
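
Here is a minimal sketch of this baseline, assuming scikit-learn and that vectors are computed over paper abstracts (the exact fields used aren’t stated above):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Abstracts from the scraped arXiv metadata (illustrative placeholders)
abstracts = [
    "We propose a new attention mechanism for efficient transformers ...",
    "A survey of contrastive methods for self-supervised learning ...",
]

# Each paper becomes a sparse vector of TF-IDF weights
vectorizer = TfidfVectorizer(stop_words="english", max_features=20_000)
tfidf_embeddings = vectorizer.fit_transform(abstracts)
print(tfidf_embeddings.shape)  # (number of papers, vocabulary size)
```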

2. Sentence-Transformers

SentenceTransformers are a set of pre-trained models that can be used for various NLP tasks such as semantic similarity, sentence classification, and clustering.
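
A sketch of this approach with the sentence-transformers library; the model name is my assumption, since no specific checkpoint is mentioned above:

```python
from sentence_transformers import SentenceTransformer

# Same abstracts list as in the TF-IDF sketch above
abstracts = [
    "We propose a new attention mechanism for efficient transformers ...",
    "A survey of contrastive methods for self-supervised learning ...",
]

# all-MiniLM-L6-v2 is a common general-purpose choice (assumed here);
# any pre-trained SentenceTransformers checkpoint is used the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

# Encode each abstract into a dense vector (384 dimensions for this model)
embeddings = model.encode(abstracts, show_progress_bar=True)
```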

3. SPECTER from Allen AI (BERT-based)

SPECTER is a pre-trained neural network developed by the Allen Institute for AI that generates high-quality document embeddings for natural language processing tasks such as information retrieval and text classification.
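
SPECTER is also available as a sentence-transformers checkpoint; a sketch assuming that route, following the model’s documented convention of joining title and abstract with a [SEP] token:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("allenai-specter")

# SPECTER expects "title[SEP]abstract" as input (placeholder papers)
papers = [
    {"title": "Attention Is All You Need", "abstract": "..."},
    {"title": "Language Models are Few-Shot Learners", "abstract": "..."},
]
texts = [p["title"] + "[SEP]" + p["abstract"] for p in papers]
specter_embeddings = model.encode(texts)  # one dense vector per paper
```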

Of these, SPECTER gave the best accuracy, since it was trained on similar academic data. For the similarity measure, I used simple cosine similarity. Then a question occurred to me: whenever users see a recommendation, they’ll be curious to know the reason behind that suggestion. So I integrated OpenAI’s GPT API to explain the reasoning behind each suggestion.
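
A sketch of both steps, assuming the SPECTER embeddings live in a NumPy array and using the pre-1.0 openai Python client that was current at the time of writing; the prompt wording is my own illustration, not the app’s actual prompt:

```python
import numpy as np
import openai
from sklearn.metrics.pairwise import cosine_similarity

def recommend(query_idx: int, embeddings: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k most similar papers by cosine similarity."""
    sims = cosine_similarity(embeddings[query_idx : query_idx + 1], embeddings)[0]
    sims[query_idx] = -np.inf  # never recommend the paper to itself
    return np.argsort(sims)[::-1][:k]

def explain(query_title: str, rec_title: str) -> str:
    """Ask GPT why the recommended paper relates to the one being read."""
    prompt = (
        f"A reader of the paper '{query_title}' was recommended "
        f"'{rec_title}'. In two sentences, explain why these papers are related."
    )
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["choices"][0]["message"]["content"]
```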

I deployed the backend app using FastAPI and Deta Space and built a simple user interface with Next.js to retrieve data from the backend server. The frontend is currently deployed on Vercel.
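
A minimal sketch of what such a FastAPI endpoint could look like, reusing the helpers from the sketches above; the route and response shape are assumptions, not the deployed API:

```python
from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.get("/recommendations/{paper_id}")
def get_recommendations(paper_id: int, k: int = 5):
    """Return the top-k recommendations for a paper, each with a GPT explanation."""
    # papers, specter_embeddings, recommend, and explain come from the sketches above
    if paper_id < 0 or paper_id >= len(papers):
        raise HTTPException(status_code=404, detail="Paper not found")
    indices = recommend(paper_id, specter_embeddings, k)
    return [
        {
            "title": papers[i]["title"],
            "reason": explain(papers[paper_id]["title"], papers[i]["title"]),
        }
        for i in indices
    ]
```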

App deployed on Vercel

In the future, I plan to integrate a monitoring dashboard to track performance and implement a continuous training pipeline for daily data scraping. Additionally, I’m considering adding an interactive chat feature for discussing papers.

Building this app was both challenging and rewarding, and I welcome any ideas or feedback here.

References:

https://fastapi.tiangolo.com/
