Aurora
AI Content Creation Tool with RAG
An end-to-end content creation tool that combines document processing with retrieval-augmented generation (RAG) for document analysis. It converts multiple input formats (PDF, images via OCR, URLs, DOCX) into semantic embeddings, so that AI-generated content is grounded in the uploaded source material.
Multi-format extraction (PDF, images via OCR, URLs, DOCX)
384-dim semantic embeddings via Sentence-Transformers
Source tracking with UUID for traceability
Key Features
Multi-format document extraction: PDF (pdfplumber), images with OCR (Tesseract + PaddleOCR), URLs, DOCX, plain text
Semantic embedding pipeline using Sentence-Transformers (all-MiniLM-L6-v2, 384-dim vectors)
Text chunking into 500-token segments, with UUID-based source tracking so retrieved text can be traced back to its origin
URL content extraction with readability filtering (BeautifulSoup4 + readability-lxml)
Google Gemini API integration for AI-powered content generation
Semantic search across all uploaded documents for RAG retrieval
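The chunking and retrieval steps listed above can be sketched roughly as follows. This is a minimal illustration, not Aurora's actual code: the names Chunk, chunk_text, cosine, and top_k are hypothetical, tokens are approximated by whitespace-separated words, and the search is a brute-force cosine-similarity scan over vectors that are assumed to have been computed already (in the real pipeline these would be the 384-dim all-MiniLM-L6-v2 embeddings).

```python
import math
import uuid
from dataclasses import dataclass


@dataclass
class Chunk:
    chunk_id: str   # unique ID for this chunk (hypothetical field)
    source_id: str  # UUID of the document the chunk was extracted from
    text: str


def chunk_text(text: str, source_id: str, max_tokens: int = 500) -> list[Chunk]:
    """Split text into chunks of at most max_tokens words.

    Assumption: a "token" is a whitespace-separated word; the real tool
    may count tokens with a model tokenizer instead.
    """
    words = text.split()
    chunks = []
    for start in range(0, len(words), max_tokens):
        piece = " ".join(words[start:start + max_tokens])
        chunks.append(Chunk(str(uuid.uuid4()), source_id, piece))
    return chunks


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], chunk_vecs: list[list[float]], k: int = 3):
    """Return (index, score) pairs for the k vectors most similar to the query."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Example: one source document, chunked and tagged with its UUID.
source_id = str(uuid.uuid4())
chunks = chunk_text("word " * 1200, source_id)
print(len(chunks), all(c.source_id == source_id for c in chunks))
```

Because every chunk carries its source_id, the indices returned by top_k can be mapped back to the originating document before the retrieved text is passed to the generation step.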