Harshwardhan Joshi

About Me

I am a Data Scientist and Machine Learning Engineer with expertise in AI, NLP, Deep Learning, and Generative AI. My work focuses on developing scalable AI solutions, automating workflows, and deriving insights from large datasets. With hands-on experience in machine learning, deep learning, and cloud-based AI, I have built retrieval-augmented generation (RAG) pipelines, image processing systems, and predictive models for real-world applications.

Currently, I am working on intelligent automation systems and AI-powered applications, leveraging technologies like PyTorch, TensorFlow, Scikit-learn, and Hugging Face for model development, and using Docker, AWS, and GCP for scalable deployments. My recent projects and work experiences include designing an image classification system using Vision LLMs (GPT-4o), optimizing NLP models for syllabus analysis, and creating an AI-driven personalized email outreach system.

With a strong foundation in Python, Go, SQL, and R, I combine data science and software engineering to develop impactful AI-driven solutions. Passionate about problem-solving, I continuously explore new frontiers in AI, cloud computing, and MLOps to build innovative and efficient systems.

Education

Master of Science: Computer Science

North Carolina State University
Raleigh, North Carolina, USA | 2022 - 2024

Focused on AI/ML, NLP, cloud computing, and big data analytics.

Bachelor of Technology: Computer Science

Kalinga Institute of Industrial Technology
Bhubaneswar, Odisha, India | 2018 - 2022

Specialized in software development, machine learning, and data structures.

Work Experience

Software Engineer Intern - Backend

Taral AI
Remote, USA | September 2024 - Present

Designed and implemented a Docker-based image processing system using Go, MongoDB, NATS, and Vision LLMs that classifies images as ‘dirty’ or ‘clean’ to automate cleaning task scheduling in schools. Improved database query performance by 40% and built an asynchronous messaging system handling 500+ concurrent connections.

Tech Stack: Docker, Go, MongoDB, NATS, Vision LLMs, Asynchronous Messaging


NLP and AI Researcher

North Carolina State University
Raleigh, North Carolina, USA / Remote | April 2024 - Present

Leading a study to improve curriculum development by analyzing past syllabus data in the field of biochemistry. Applying transformer models such as BioBERT and RoBERTa to analyze 50,000+ university syllabi, designing a Retrieval-Augmented Generation (RAG) pipeline, and uncovering statistical insights into course content distribution.

Tech Stack: BioBERT, RoBERTa, Transformer Models, RAG Pipeline, Python, Pandas


Machine Learning Intern

Verzeo
Bhubaneswar, Odisha, India | June 2020 - August 2020

Led a 4-member team in developing an XGBoost regression model for car price prediction, achieving 92% accuracy and a 0.85 R² score. Improved model performance by 15% through feature engineering and selection using LASSO regression. Developed interactive Tableau dashboards for data-driven decision-making.

Tech Stack: XGBoost, LASSO Regression, Tableau, Python, Scikit-learn

Projects

Here are some of the projects I have worked on. These projects showcase my skills in data science, machine learning, image processing, and cloud-based solutions.

AI-Powered Personalized Email Outreach System

Built a cold email generator using Python, LangChain, and Llama 3.1 to connect with hiring managers, achieving 95% relevance accuracy by using an embedding-based retrieval system with ChromaDB.

Tech Stack:

  • Python
  • LangChain
  • Llama 3.1
  • ChromaDB
  • Streamlit
  • JSON
  • Web Scraping

Key Achievements:

  • Achieved 95% relevance accuracy by building an embedding-based retrieval system.
  • Applied prompt engineering to reduce hallucination rates and ensure accurate email generation.
  • Developed a web scraper to convert job descriptions into structured JSON format with 90% accuracy.
  • Reduced manual review time by 75% through a Streamlit-based user interface for output verification.
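
The embedding-based retrieval step can be illustrated with a minimal sketch. It is not the project's code: the project uses ChromaDB and a real embedding model, whereas the three-dimensional vectors, the document names, and the `top_k` helper below are hypothetical stand-ins chosen so the example runs with the standard library alone.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, store, k=2):
    """Return the k document ids whose embeddings are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id) for doc_id, vec in store.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]

# Hypothetical embeddings; a real system would obtain these from an embedding model.
portfolio_store = {
    "nlp_project": [0.9, 0.1, 0.0],
    "vision_project": [0.1, 0.9, 0.1],
    "backend_project": [0.0, 0.2, 0.9],
}
job_description_vec = [0.8, 0.3, 0.1]

matches = top_k(job_description_vec, portfolio_store, k=2)
print(matches)
```

In a setup like this, the retrieved items would be passed to the language model as context for the generated email; a vector database plays the role of `portfolio_store` and `top_k` at scale.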

Deep Learning-based Crop Health Analyzer

Developed a CNN-based solution for classifying crop diseases with 95% detection accuracy. The solution leverages Python, TensorFlow, and Keras to analyze crop leaf images and classify diseases across multiple categories.

Tech Stack:

  • Python
  • TensorFlow
  • Keras
  • FastAPI
  • ReactJS
  • Google Cloud Platform (GCP)

Key Achievements:

  • Achieved 95% detection accuracy in classifying crop diseases.
  • Reduced inference time by 45% using model quantization (32-bit to 8-bit precision).
  • Engineered a FastAPI backend for image uploading and ReactJS frontend for real-time results.
  • Deployed the solution on GCP with automatic load balancing, handling 100+ synthetic test requests per minute.
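
To show what reducing 32-bit weights to 8-bit precision involves, here is a small sketch of affine (scale and zero-point) quantization. The source does not say which tool or scheme performed the quantization, so the functions `quantize_int8` and `dequantize` and the sample weights are illustrative assumptions, not the project's implementation.

```python
def quantize_int8(weights):
    """Map floats to signed 8-bit integers using a scale and a zero point."""
    w_min, w_max = min(weights), max(weights)
    value_range = w_max - w_min
    # Avoid dividing by zero when every weight has the same value.
    scale = value_range / 255 if value_range > 0 else 1.0
    zero_point = round(-128 - w_min / scale)
    quantized = [
        max(-128, min(127, round(w / scale) + zero_point))  # clamp to int8 range
        for w in weights
    ]
    return quantized, scale, zero_point

def dequantize(quantized, scale, zero_point):
    """Recover approximate float values from the 8-bit representation."""
    return [(q - zero_point) * scale for q in quantized]

weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # hypothetical layer weights
quantized, scale, zero_point = quantize_int8(weights)
restored = dequantize(quantized, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, max_error)
```

Each weight is stored in one byte instead of four, and the reconstruction error stays within about one quantization step (`scale`), which is the trade-off behind faster inference at slightly lower precision.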

Climate Forecasting System - Multivariate Time Series Prediction

Built a deep learning-based system for multivariate time series prediction to forecast climate variables like temperature, pressure, and humidity. Utilized LSTM, GRU, and CNN models to accurately predict climate patterns based on historical data from multiple locations.

Tech Stack:

  • Python
  • TensorFlow
  • Keras
  • Scikit-learn
  • Pandas
  • Matplotlib
  • Jupyter Notebooks

Key Achievements:

  • Implemented deep learning models (LSTM, GRU, CNN) to predict temperature, pressure, and humidity from large-scale time series climate data.
  • Engineered temporal features (sine and cosine transformations) for daily and yearly cycles, enhancing model performance and accuracy.
  • Achieved a Root Mean Squared Error (RMSE) of 1.25°C for temperature predictions and 1.8 mbar for pressure predictions on the test set, significantly outperforming baseline models.
  • Optimized the models for both accuracy and computational efficiency, reducing training time while maintaining high prediction quality.
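
The sine and cosine transformations mentioned above can be sketched as follows. This is a minimal illustration under assumptions: the function name `cyclical_time_features`, the feature names, and the 365.2425-day year length are choices made for the example and are not taken from the project.

```python
import math
from datetime import datetime

SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365.2425  # assumed average year length

def cyclical_time_features(ts):
    """Encode a timestamp as sine/cosine pairs for the daily and yearly cycles."""
    seconds_into_day = ts.hour * 3600 + ts.minute * 60 + ts.second
    day_fraction = seconds_into_day / SECONDS_PER_DAY
    # tm_yday is 1-based, so subtract 1 to start the year at angle zero.
    year_fraction = (ts.timetuple().tm_yday - 1 + day_fraction) / DAYS_PER_YEAR
    day_angle = 2 * math.pi * day_fraction
    year_angle = 2 * math.pi * year_fraction
    return {
        "day_sin": math.sin(day_angle),
        "day_cos": math.cos(day_angle),
        "year_sin": math.sin(year_angle),
        "year_cos": math.cos(year_angle),
    }

morning = cyclical_time_features(datetime(2020, 1, 1, 6, 0))
late = cyclical_time_features(datetime(2020, 1, 1, 23, 59))
midnight = cyclical_time_features(datetime(2020, 1, 2, 0, 0))
print(morning)
```

The point of the encoding is that 23:59 and 00:00 end up close together in feature space, whereas a raw hour value would place them at opposite ends of the scale.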

Book Recommendation System

Engineered a large-scale book recommendation system designed to provide personalized reading suggestions to over 1 million users. Using Apache Spark, MLlib, and PySpark, I processed large datasets of user ratings and book metadata to generate tailored recommendations.

Tech Stack:

  • Apache Spark
  • MLlib
  • PySpark
  • Python
  • SQL
  • Jupyter Notebooks
  • Matplotlib

Key Achievements:

  • Developed a collaborative filtering model using the Alternating Least Squares (ALS) method to generate recommendations for a large user base.
  • Achieved a Root Mean Squared Error (RMSE) of 0.82 after hyperparameter tuning and cross-validation.
  • Optimized data processing workflows with Spark SQL for scalable data manipulation, reducing model training time.
  • Improved recommendation quality through a grid search over key model parameters.
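
The idea behind ALS and the RMSE metric can be shown with a deliberately tiny sketch. It is not the project's code, which relies on Spark MLlib: this version uses a single latent factor per user and item (rank 1) so that each least-squares update has a closed form in plain Python, and the ratings, `reg` value, and function names are hypothetical.

```python
import math

def als_rank1(ratings, n_users, n_items, reg=0.01, iterations=20):
    """Alternating least squares with one latent factor per user and per item.

    ratings: list of (user, item, rating) triples for observed entries only.
    """
    user_f = [1.0] * n_users
    item_f = [1.0] * n_items
    for _ in range(iterations):
        # Hold item factors fixed and solve the regularized problem for each user.
        for user in range(n_users):
            rated = [(i, r) for (u, i, r) in ratings if u == user]
            num = sum(r * item_f[i] for i, r in rated)
            den = reg + sum(item_f[i] ** 2 for i, _ in rated)
            user_f[user] = num / den
        # Hold user factors fixed and solve for each item.
        for item in range(n_items):
            rated = [(u, r) for (u, i, r) in ratings if i == item]
            num = sum(r * user_f[u] for u, r in rated)
            den = reg + sum(user_f[u] ** 2 for u, _ in rated)
            item_f[item] = num / den
    return user_f, item_f

def rmse(ratings, user_f, item_f):
    """Root mean squared error of predicted versus observed ratings."""
    squared = [(r - user_f[u] * item_f[i]) ** 2 for u, i, r in ratings]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical observed ratings; user 1 has not rated item 2.
observed = [(0, 0, 1.0), (0, 1, 2.0), (0, 2, 3.0), (1, 0, 2.0), (1, 1, 4.0)]
user_f, item_f = als_rank1(observed, n_users=2, n_items=3)
train_rmse = rmse(observed, user_f, item_f)
predicted_missing = user_f[1] * item_f[2]
print(train_rmse, predicted_missing)
```

A grid search, in this setting, would repeat the fit for several values of `reg` (and, in a full implementation, the rank) and keep the combination with the lowest RMSE on held-out ratings.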

Bookipedia - Online Book Store

Bookipedia is a full-fledged online book store where customers can explore and purchase their favorite books with ease. Built with a robust frontend and backend, this platform ensures seamless browsing, secure transactions, and an intuitive user experience.

Tech Stack:

  • Angular
  • Node.js
  • JavaScript
  • Jasmine
  • Cucumber
  • Protractor
  • Karma

Key Achievements:

  • Designed an intuitive and responsive user interface using Angular, ensuring a smooth book-browsing experience.
  • Developed a robust backend with Node.js, implementing secure authentication, dynamic book catalog management, and order processing.
  • Implemented a comprehensive testing framework using Jasmine, Cucumber, Protractor, and Karma, ensuring high code reliability and reducing potential bugs.
  • Collaborated with a team of six members using an Agile workflow, effectively managing tasks through a Kanban board.

PlagDetector - AI vs Human Text Classifier

PlagDetector is an advanced NLP-based project that identifies whether a given text is generated by AI or written by a human. The system leverages transformer models, text summarization techniques, and semantic similarity analysis to detect AI-generated content with high accuracy.

Tech Stack:

  • NLP Frameworks: TensorFlow, PyTorch, Hugging Face Transformers
  • Text Analysis: BERT, RoBERTa, TF-IDF, Cosine Similarity, Sentence-BERT
  • Summarization Models: Sequence-to-sequence architectures
  • Classification: Logistic Regression, Random Forest, Deep Neural Networks
  • Data Processing: Pandas, NumPy, NLTK, SpaCy
  • Evaluation Metrics: ROC-AUC, Precision-Recall, Confusion Matrix

Key Achievements:

  • Implemented a hybrid approach combining linguistic features and deep learning models to classify AI vs. human text.
  • Developed a text summarization pipeline to extract key information and compare semantic meaning.
  • Achieved high classification accuracy by fine-tuning transformer-based models on AI-generated datasets.
  • Designed a scalable and efficient framework, optimizing inference time for real-world applications.
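
As an illustration of the TF-IDF and cosine similarity components listed in the tech stack, the sketch below compares short texts using sparse term-weight vectors. It is an assumption-laden toy: whitespace tokenization, the smoothed IDF formula, the sample sentences, and the function names are all chosen for the example and do not reflect the project's actual pipeline.

```python
import math
from collections import Counter

def tfidf_vectors(documents):
    """Build one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)  # smoothed IDF
            for term, count in counts.items()
        })
    return vectors

def sparse_cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical sample texts.
texts = [
    "the model writes fluent text",
    "the model writes fluent prose",
    "rain fell on the old harbor",
]
vectors = tfidf_vectors(texts)
similar = sparse_cosine(vectors[0], vectors[1])
different = sparse_cosine(vectors[0], vectors[2])
print(similar, different)
```

TF-IDF similarity captures word overlap only; the Sentence-BERT and transformer components named above are what would address similarity in meaning.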