Automated PDF Content Extraction and Query Resolution via Social Media Integration

Concept

Originally prototyped as RICO (RobotIzed Campaign Organiser) and later refined into Document Sage, this project is an AI-powered assistant that transforms static documents into interactive, queryable tools.

Using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs), the system integrates with Discord to let users query PDFs and FAQs in real time. It reduces staff strain, enhances accessibility, and improves engagement by delivering clear, context-aware answers instantly.

Description

Developed over a 12-week capstone, the project tackled a common challenge faced by SMEs and digital communities: low engagement with long FAQs and PDF documents, and the resulting overload on support staff. Our solution integrated OCR, embeddings, and RAG pipelines into a Discord bot with escalation protocols.

Key features included document-to-markdown conversion, synthetic FAQ dataset creation, profanity filtering, and human fallback mechanisms.
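The retrieve-then-escalate idea described above can be illustrated with a minimal sketch. It is not the project's code: a word-count vector stands in for the MiniLM embeddings, a plain Python list stands in for the vector index, and the names embed, retrieve, answer_or_escalate and the ESCALATION_THRESHOLD value are all hypothetical. In a real pipeline the returned context would be passed to the LLM, and the "escalate" action would open a ticket for a human.

```python
import math
import re
from collections import Counter

# Hypothetical cut-off: below this similarity the bot hands over to a human.
ESCALATION_THRESHOLD = 0.3


def embed(text):
    """Toy stand-in for a sentence embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, chunks, k=2):
    """Return the k document chunks most similar to the query, with scores."""
    q = embed(query)
    scored = [(cosine(q, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def answer_or_escalate(query, chunks):
    """Pick context for the LLM, or escalate when nothing relevant is found."""
    hits = retrieve(query, chunks)
    if not hits or hits[0][0] < ESCALATION_THRESHOLD:
        return {"action": "escalate", "context": []}
    return {"action": "answer", "context": [c for score, c in hits if score > 0]}


if __name__ == "__main__":
    # Made-up FAQ chunks, for illustration only.
    faq_chunks = [
        "Refunds are processed within 5 business days.",
        "Opening hours are 9am to 5pm on weekdays.",
    ]
    print(answer_or_escalate("When are refunds processed?", faq_chunks))
    print(answer_or_escalate("How do I reset my password?", faq_chunks))
```

Keeping the escalation decision in one function, separate from retrieval, means the threshold can be tuned during testing without touching the rest of the pipeline.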

Type

Group project — AI/Analytics Capstone (Final-year university project).

Role

  • Conducted research into AI/LLM applications for SMEs and their business impact.
  • Contributed to report content formation, including data discussion, risk analysis, and recommendations.
  • Assisted in structuring CRISP-DM phases across business understanding, data preparation, modelling, and evaluation.
  • Collaborated on the creation of the project Gantt chart, supporting effective planning and progress tracking.
  • Performed testing of the Discord-integrated bot, including side-by-side evaluation and creating tickets to simulate real escalation flows.

Timeline

Autumn Session 2025 — UTS (12 weeks)

Impact

  • Delivered a working Discord-integrated demo for document-aware Q&A.
  • Achieved 87.2% accuracy using Gemini 2.0 Flash with MiniLM embeddings.
  • Improved OCR performance by switching from PyTesseract to EasyOCR, boosting BLEU/ROUGE evaluation scores.
  • Demonstrated practical value for SMEs by automating repetitive queries and lowering operational bottlenecks.
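For readers unfamiliar with the overlap metrics mentioned above, here is a minimal sketch of a ROUGE-1 style F1 score, which is one way to compare OCR output against a reference transcription. It is a simplified illustration and not the evaluation code used in the project: tokenisation is a plain whitespace split, rouge1_f1 is a hypothetical helper name, and the sample strings are made up.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference text and a candidate text."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per word (clipped overlap).
    overlap = sum((ref & cand).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up strings: a clean transcription and an OCR output with two misreads.
    reference = "refunds are processed within five days"
    noisy_ocr = "refunds are pr0cessed within flve days"
    print(round(rouge1_f1(reference, reference), 3))
    print(round(rouge1_f1(reference, noisy_ocr), 3))
```

A cleaner OCR engine produces more words that match the reference exactly, which raises both precision and recall and therefore the F1 score.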

Learnings

  • Built collaboration skills in research, structured documentation, project planning (including Gantt chart creation), and hands-on system testing.
  • Gained deep experience with RAG, LangChain, FAISS, OCR, embeddings, and LLM evaluation.
  • Learned to scrape and generate synthetic FAQ datasets for realistic testing.
  • Strengthened ability to connect technical design with business impact, applying CRISP-DM alongside Agile iterations.

Project Details

Using Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs), the system integrates with Discord to let users query PDFs and FAQs in real time.