---
title: RAG-Ready Content Scraper
emoji: π
colorFrom: blue
colorTo: green
sdk: docker
app_file: app.py
pinned: false
license: mit
short_description: Scrape web/GitHub for RAG-ready datasets.
---
# RAG-Ready Content Scraper

RAG-Ready Content Scraper is a Python tool with a Gradio interface and Docker support for efficiently scraping web content and GitHub repositories. It is tailored for Retrieval-Augmented Generation (RAG) systems, extracting and preprocessing text into structured, RAG-ready formats such as Markdown, JSON, and CSV.

This version is designed to be deployed as a HuggingFace Space using Docker, enabling full functionality, including GitHub repository processing via RepoMix.
## Features

- **Dual Scraping Modes**:
  - **Webpage Scraping**: Scrapes web content and converts it to Markdown. Supports recursive depth control to follow internal links.
  - **GitHub Repository Processing**: Processes GitHub repositories using **RepoMix** to create AI-friendly outputs.
- **Multiple Output Formats**: Generate datasets in Markdown, JSON, or CSV.
- **Interactive Gradio Interface**: Easy-to-use web UI with clear input sections, configuration options, progress display, content preview, and downloadable results.
- **HuggingFace Spaces Ready (Docker)**: Deployable as a Dockerized HuggingFace Space, ensuring all features are available.
- **Pre-configured Examples**: Includes example inputs for quick testing.
- **In-UI Documentation**: A "How It Works" section provides guidance.
## Requirements for Local Development (Optional)

- Python 3.10+
- Node.js and npm (for RepoMix GitHub repository processing)
- RepoMix (can be installed globally with `npm install -g repomix`, or used via `npx repomix` if `app.py` is adjusted)
- Project dependencies: `pip install -r requirements.txt`
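For a quick local sanity check, a small script along these lines can confirm that the prerequisites are on your PATH (a minimal sketch; the fallback to `npx` is an assumption, not part of the project's code):

```python
# Minimal prerequisite check for local development (illustrative sketch).
import shutil


def check_prerequisites() -> None:
    """Report whether the local-development prerequisites are available on PATH."""
    # Node.js and npm are required for RepoMix-based GitHub repository processing.
    for tool in ("node", "npm"):
        print(f"{tool}: {'found' if shutil.which(tool) else 'MISSING'}")

    # RepoMix may be installed globally, or invoked through npx (assumption).
    if shutil.which("repomix"):
        print("repomix: found (global install)")
    elif shutil.which("npx"):
        print("repomix: not installed globally; `npx repomix` should also work")
    else:
        print("repomix: MISSING (install with `npm install -g repomix`)")


if __name__ == "__main__":
    check_prerequisites()
```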
## HuggingFace Space Deployment

This application is intended to be deployed as a HuggingFace Space using the **Docker SDK**.

1. **Create a new HuggingFace Space.**
2. Choose **"Docker"** as the Space SDK.
3. Select **"Use an existing Dockerfile"**.
4. Push this repository (including the `Dockerfile`, `app.py`, `requirements.txt`, and the `rag_scraper` directory) to the HuggingFace Space repository.
5. The Space will build the Docker image and launch the application. All features, including GitHub repository processing with RepoMix, will be available.
## Using the Interface

1. **Enter URL or GitHub Repository ID**:
   * For websites: enter a complete URL (e.g., `https://example.com`).
   * For GitHub repositories: enter a URL (e.g., `https://github.com/username/repo`) or shorthand ID (e.g., `username/repo`).
2. **Select Source Type**:
   * Choose "Webpage" or "GitHub Repository".
3. **Set Scraping Depth** (for webpages only):
   * 0: only scrape the main page.
   * 1-3: follow internal links recursively to the specified depth (ignored for GitHub repositories).
4. **Select Output Format**:
   * Choose "Markdown", "JSON", or "CSV".
5. **Click "Process Content"**.
6. **View Status and Preview**: Monitor progress and see a preview of the extracted content.
7. **Download File**: Download the generated dataset in your chosen format.
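The deployed Space can also be called programmatically with `gradio_client`. The sketch below assumes a single prediction endpoint whose inputs mirror the UI fields above; the endpoint name and argument order are assumptions, not a documented API:

```python
# Hypothetical programmatic access to the Space via gradio_client.
# The api_name and the argument order are assumptions based on the UI fields.
from gradio_client import Client

client = Client("CultriX/RAG-Scraper")  # the HuggingFace Space ID

result = client.predict(
    "https://example.com",   # URL or GitHub repository ID
    "Webpage",               # source type: "Webpage" or "GitHub Repository"
    1,                       # scraping depth (webpages only)
    "Markdown",              # output format: "Markdown", "JSON", or "CSV"
    api_name="/predict",     # hypothetical endpoint name
)
print(result)
```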
## How It Works

### Webpage Scraping

1. Fetches HTML content from the provided URL.
2. Converts the HTML to clean Markdown.
3. If the scraping depth is greater than 0, extracts internal links and recursively repeats the process for each valid link.
4. Converts the aggregated Markdown to the selected output format (JSON, CSV, or keeps it as Markdown).
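The following is a simplified illustration of that flow, not the project's actual `rag_scraper` implementation; the library choices (requests, BeautifulSoup, html2text) are assumptions made for the sketch:

```python
# Simplified illustration of the webpage-scraping flow described above.
from urllib.parse import urljoin, urlparse

import html2text
import requests
from bs4 import BeautifulSoup


def scrape_to_markdown(url: str, depth: int = 0, seen: set[str] | None = None) -> str:
    """Fetch a page, convert it to Markdown, and optionally recurse into internal links."""
    seen = seen if seen is not None else set()
    if url in seen:
        return ""
    seen.add(url)

    html = requests.get(url, timeout=15).text
    markdown = html2text.html2text(html)  # HTML -> Markdown
    parts = [f"# Source: {url}\n\n{markdown}"]

    if depth > 0:
        soup = BeautifulSoup(html, "html.parser")
        base_host = urlparse(url).netloc
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == base_host:  # follow internal links only
                parts.append(scrape_to_markdown(link, depth - 1, seen))

    return "\n\n".join(p for p in parts if p)


if __name__ == "__main__":
    print(scrape_to_markdown("https://example.com", depth=1)[:500])
```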
### GitHub Repository Processing

1. Uses **RepoMix** (a Node.js tool) to fetch and process the specified GitHub repository.
2. RepoMix analyzes the repository structure and content, generating a consolidated Markdown output.
3. This Markdown is then converted to the selected output format (JSON, CSV, or kept as Markdown).
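Conceptually, this amounts to shelling out to the RepoMix CLI and reading back its consolidated output. The sketch below is illustrative only and does not come from this project's code; the flag names (`--remote`, `--style`, `-o`) may differ between RepoMix versions:

```python
# Illustrative only: invokes the RepoMix CLI via subprocess and reads its output.
import subprocess
import tempfile
from pathlib import Path


def repo_to_markdown(repo: str) -> str:
    """Run RepoMix on a remote repository (e.g. 'username/repo') and return Markdown."""
    with tempfile.TemporaryDirectory() as tmp:
        out_file = Path(tmp) / "repomix-output.md"
        subprocess.run(
            ["npx", "repomix", "--remote", repo, "--style", "markdown", "-o", str(out_file)],
            check=True,
        )
        return out_file.read_text(encoding="utf-8")


if __name__ == "__main__":
    markdown = repo_to_markdown("username/repo")  # shorthand ID, as accepted by the UI
    print(markdown[:500])
```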
## Source Code

The source code for this project is available on HuggingFace Spaces:
[https://huggingface.co/spaces/CultriX/RAG-Scraper](https://huggingface.co/spaces/CultriX/RAG-Scraper)
## License

This project is licensed under the MIT License.