---
title: RAG-Ready Content Scraper
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
license: mit
short_description: Scrape web/GitHub for RAG-ready datasets.
---
# RAG-Ready Content Scraper

RAG-Ready Content Scraper is a Python tool with a Gradio interface and Docker support for efficiently scraping web content and GitHub repositories. It is tailored for Retrieval-Augmented Generation (RAG) systems, extracting and preprocessing text into structured, RAG-ready formats such as Markdown, JSON, and CSV.

This version is designed to be deployed as a HuggingFace Space using Docker, enabling full functionality, including GitHub repository processing via RepoMix.
## Features

- **Dual Scraping Modes**:
  - **Webpage Scraping**: Scrapes web content and converts it to Markdown. Supports recursive depth control to follow internal links.
  - **GitHub Repository Processing**: Processes GitHub repositories using **RepoMix** to create AI-friendly outputs.
- **Multiple Output Formats**: Generate datasets in Markdown, JSON, or CSV.
- **Interactive Gradio Interface**: Easy-to-use web UI with clear input sections, configuration options, progress display, content preview, and downloadable results.
- **HuggingFace Spaces Ready (Docker)**: Deployable as a Dockerized HuggingFace Space, ensuring all features are available.
- **Pre-configured Examples**: Includes example inputs for quick testing.
- **In-UI Documentation**: A "How It Works" section provides guidance.
## Requirements for Local Development (Optional)

- Python 3.10+
- Node.js and npm (for RepoMix GitHub repository processing)
- RepoMix (can be installed globally with `npm install -g repomix`, or used via `npx repomix` if `app.py` is adjusted)
- Project dependencies: `pip install -r requirements.txt`
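For a quick local sanity check, a small script along these lines can confirm that the prerequisites are on your PATH (a minimal sketch; the fallback to `npx` is an assumption, not part of the project's code):

```python
# Minimal prerequisite check for local development (illustrative sketch).
import shutil


def check_prerequisites() -> None:
    """Report whether the local-development prerequisites are available on PATH."""
    # Node.js and npm are required for RepoMix-based GitHub repository processing.
    for tool in ("node", "npm"):
        print(f"{tool}: {'found' if shutil.which(tool) else 'MISSING'}")

    # RepoMix may be installed globally, or invoked through npx (assumption).
    if shutil.which("repomix"):
        print("repomix: found (global install)")
    elif shutil.which("npx"):
        print("repomix: not installed globally; `npx repomix` should also work")
    else:
        print("repomix: MISSING (install with `npm install -g repomix`)")


if __name__ == "__main__":
    check_prerequisites()
```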
## HuggingFace Space Deployment

This application is intended to be deployed as a HuggingFace Space using the **Docker SDK**.

1. **Create a new HuggingFace Space.**
2. Choose **"Docker"** as the Space SDK.
3. Select **"Use an existing Dockerfile"**.
4. Push this repository (including the `Dockerfile`, `app.py`, `requirements.txt`, and the `rag_scraper` directory) to the HuggingFace Space repository.
5. The Space will build the Docker image and launch the application. All features, including GitHub repository processing with RepoMix, will be available.
## Using the Interface

1. **Enter URL or GitHub Repository ID**:
   * For websites: enter a complete URL (e.g., `https://example.com`).
   * For GitHub repositories: enter a URL (e.g., `https://github.com/username/repo`) or shorthand ID (e.g., `username/repo`).
2. **Select Source Type**:
   * Choose "Webpage" or "GitHub Repository".
3. **Set Scraping Depth** (for webpages only):
   * 0: only scrape the main page.
   * 1-3: follow internal links recursively to the specified depth (ignored for GitHub repositories).
4. **Select Output Format**:
   * Choose "Markdown", "JSON", or "CSV".
5. **Click "Process Content"**.
6. **View Status and Preview**: Monitor progress and see a preview of the extracted content.
7. **Download File**: Download the generated dataset in your chosen format.
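The deployed Space can also be called programmatically with `gradio_client`. The sketch below assumes a single prediction endpoint whose inputs mirror the UI fields above; the endpoint name and argument order are assumptions, not a documented API:

```python
# Hypothetical programmatic access to the Space via gradio_client.
# The api_name and the argument order are assumptions based on the UI fields.
from gradio_client import Client

client = Client("CultriX/RAG-Scraper")  # the HuggingFace Space ID

result = client.predict(
    "https://example.com",   # URL or GitHub repository ID
    "Webpage",               # source type: "Webpage" or "GitHub Repository"
    1,                       # scraping depth (webpages only)
    "Markdown",              # output format: "Markdown", "JSON", or "CSV"
    api_name="/predict",     # hypothetical endpoint name
)
print(result)
```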
## How It Works

### Webpage Scraping

1. Fetches HTML content from the provided URL.
2. Converts the HTML to clean Markdown.
3. If the scraping depth is greater than 0, extracts internal links and recursively repeats the process for each valid link.
4. Converts the aggregated Markdown to the selected output format (JSON, CSV, or keeps it as Markdown).
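The following is a simplified illustration of that flow, not the project's actual `rag_scraper` implementation; the library choices (requests, BeautifulSoup, html2text) are assumptions made for the sketch:

```python
# Simplified illustration of the webpage-scraping flow described above.
from urllib.parse import urljoin, urlparse

import html2text
import requests
from bs4 import BeautifulSoup


def scrape_to_markdown(url: str, depth: int = 0, seen: set[str] | None = None) -> str:
    """Fetch a page, convert it to Markdown, and optionally recurse into internal links."""
    seen = seen if seen is not None else set()
    if url in seen:
        return ""
    seen.add(url)

    html = requests.get(url, timeout=15).text
    markdown = html2text.html2text(html)  # HTML -> Markdown
    parts = [f"# Source: {url}\n\n{markdown}"]

    if depth > 0:
        soup = BeautifulSoup(html, "html.parser")
        base_host = urlparse(url).netloc
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == base_host:  # follow internal links only
                parts.append(scrape_to_markdown(link, depth - 1, seen))

    return "\n\n".join(p for p in parts if p)


if __name__ == "__main__":
    print(scrape_to_markdown("https://example.com", depth=1)[:500])
```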
### GitHub Repository Processing

1. Uses **RepoMix** (a Node.js tool) to fetch and process the specified GitHub repository.
2. RepoMix analyzes the repository structure and content, generating a consolidated Markdown output.
3. This Markdown is then converted to the selected output format (JSON, CSV, or kept as Markdown).
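Conceptually, this amounts to shelling out to the RepoMix CLI and reading back its consolidated output. The sketch below is illustrative only and does not come from this project's code; the flag names (`--remote`, `--style`, `-o`) may differ between RepoMix versions:

```python
# Illustrative only: invokes the RepoMix CLI via subprocess and reads its output.
import subprocess
import tempfile
from pathlib import Path


def repo_to_markdown(repo: str) -> str:
    """Run RepoMix on a remote repository (e.g. 'username/repo') and return Markdown."""
    with tempfile.TemporaryDirectory() as tmp:
        out_file = Path(tmp) / "repomix-output.md"
        subprocess.run(
            ["npx", "repomix", "--remote", repo, "--style", "markdown", "-o", str(out_file)],
            check=True,
        )
        return out_file.read_text(encoding="utf-8")


if __name__ == "__main__":
    markdown = repo_to_markdown("username/repo")  # shorthand ID, as accepted by the UI
    print(markdown[:500])
```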
## Source Code

The source code for this project is available on HuggingFace Spaces:
[https://huggingface.co/spaces/CultriX/RAG-Scraper](https://huggingface.co/spaces/CultriX/RAG-Scraper)
## License

This project is licensed under the MIT License.