How I Built an NLP-Based Recommendation System
TL;DR
I built an NLP-based recommendation system that recommends ML papers and explains the reasoning behind each recommendation using OpenAI’s GPT API.
In this article, I will share my experience of building an NLP-based recommendation system from scratch in just 24 hours using a variety of tools and techniques, including Next.js, OpenAI’s GPT API, BERT, SentenceTransformers, SPECTER, FastAPI, and metadata scraped from ArXiv covering the last six months of ML papers.
Whenever I read a paper, I found myself constantly seeking recommendations (blame TikTok and YouTube 🤔). This inspired me to create a recommendation system for academic papers.
Initially, I searched for a dataset specifically related to machine learning papers. While I found some datasets, they didn’t suit my needs, so I began scraping metadata from ArXiv instead. Thinking this might be helpful to others, I deployed the scraper here, and the code is here.
After obtaining the dataset containing metadata, I explored various methods for computing embeddings for each paper:
1. TF-IDF
TF-IDF is a numerical statistic that measures how important a word is to a document within a corpus.
2. SentenceTransformers
SentenceTransformers are a set of pre-trained models that can be used for various NLP tasks such as semantic similarity, sentence classification, and clustering.
3. Specter from Allen AI (BERT)
SPECTER is a pre-trained neural network developed by the Allen Institute for AI that generates high-quality document embeddings for natural language processing tasks such as information retrieval and text classification.
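To make the first option concrete, TF-IDF weighting plus cosine similarity can be sketched in plain Python. This is a minimal illustration, not the app’s actual pipeline, and the toy abstracts below are placeholders rather than real ArXiv data:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute a sparse TF-IDF vector (dict of term -> weight) per tokenized doc."""
    n = len(docs)
    # Document frequency: in how many docs each term appears
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Term frequency scaled by inverse document frequency
        vectors.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors represented as dicts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy "abstracts" standing in for scraped ArXiv metadata
papers = [
    "attention is all you need transformers".split(),
    "bert pretraining of deep bidirectional transformers".split(),
    "random forests for tabular data".split(),
]
vecs = tfidf_vectors(papers)
# The two transformer papers score higher with each other than with the third
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # → True
```

The same cosine-similarity step applies unchanged when the sparse TF-IDF vectors are swapped for dense SentenceTransformers or SPECTER embeddings.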
Among these, SPECTER gave the best accuracy, as it was trained on similar academic data. For the similarity measure, I used simple cosine similarity. Then a question occurred to me: whenever users see a recommendation, they’ll be curious about the reason behind it. So I integrated OpenAI’s GPT API to explain the reasoning behind each suggestion.
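Ranking by cosine similarity and then asking GPT for a justification can be sketched as follows. The helper names and prompt wording here are my own illustrative assumptions, not the app’s actual code:

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def top_k(query_vec, paper_vecs, k=3):
    """Indices of the k papers whose embeddings are most similar to the query."""
    order = sorted(range(len(paper_vecs)),
                   key=lambda i: cosine(query_vec, paper_vecs[i]),
                   reverse=True)
    return order[:k]

def explanation_prompt(read_title, rec_title):
    """Prompt asking GPT to justify a recommendation (wording is illustrative)."""
    return (
        f"A user just read the paper '{read_title}'. "
        f"In two sentences, explain why '{rec_title}' is a relevant recommendation."
    )

# The prompt string would then be sent to OpenAI's chat completion API.
```

Generating explanations on demand this way keeps the embedding index static; only the short prompt and response go through the GPT API per recommendation.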
I deployed the backend app with FastAPI on Deta Space and developed a simple user interface with Next.js, deployed on Vercel, to retrieve data from the backend server.
In the future, I plan to integrate a monitoring dashboard to track performance and implement a continuous training pipeline for daily data scraping. Additionally, I’m considering adding an interactive chat feature for discussing papers.
Building this app was both challenging and rewarding, and I welcome any ideas or feedback here.