Add robust UTF-8 encoding detection, token-based chunking, and second-pass summarization
- Implemented file encoding detection to support various UTF encodings and prevent decode errors
- Changed chunking to be token-based instead of character-based for better handling of large texts
- Added second-pass summarization for improved global compression of summaries
- Updated UI to use dropdown for prompt selection (Bread, Butter, or Both)
- Improved error handling and user feedback for file uploads
- Updated README and requirements for new dependencies and usage instructions
- README.md +46 -107
- app.py +33 -33
- requirements.txt +1 -0
- summarizer.py +70 -51
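The commit's encoding detection relies on `chardet`. As a rough dependency-free sketch of the same fallback idea (the function name `read_text_robust` and the encoding list are illustrative, not from this commit):

```python
def read_text_robust(path):
    """Try common UTF encodings, falling back to replacement decoding."""
    with open(path, "rb") as f:
        raw = f.read()
    # utf-8-sig also accepts plain UTF-8; a UTF-16 BOM fails it and
    # falls through to the utf-16 codec.
    for encoding in ("utf-8-sig", "utf-16", "latin-1"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")
```

`chardet.detect` (used in the actual `summarizer.py` below) guesses the encoding statistically instead of trying a fixed list, which handles more exotic inputs at the cost of an extra dependency.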
README.md
CHANGED
@@ -17,147 +17,86 @@ tags:
-  gradio
 ---
-#
 ---
-##
-* ✅ Supports `.txt` file uploads up to 3 MB (or more)
-* 📌 Prompt options: `Bread`, `Butter`, or `Bread and Butter`
-* 🔁 Multi-iteration summarization support
-* 🧠 Model: `facebook/bart-large-cnn`
-* 💾 Auto checkpointing: progress won't be lost on timeout
-* 🧰 Output is saved for download post-processing
-* 🌐 Clean Gradio UI – easy to run in browser
 ---
-1. **Upload** a `.txt` file (max ~3MB recommended)
-2. **Select** a summarization style from dropdown:
-   * `Bread only`
-   * `Butter only`
-   * `Bread and Butter`
-3. Choose:
-   * `Iterations`: how many times the prompts apply
-   * `Max Length`: max summary tokens per chunk
-   * `Min Length`: min summary tokens per chunk
-4. Click **Summarize**
-5. Get your **condensed output** in the results box
-
----
-| ----------------- | -------------------------------- |
-| **Frontend**      | [Gradio](https://www.gradio.app) |
-| **Backend**       | Hugging Face `transformers`      |
-| **Model**         | `facebook/bart-large-cnn`        |
-| **Checkpointing** | JSON-based resume system         |
-| **Language**      | Python 3.10+                     |
-
-```
-.
-├── summarizer.py      # Backend summarization logic
-├── requirements.txt   # Dependencies
-├── inputs/            # Uploaded input files
-├── outputs/           # Final summarized outputs
-└── checkpoints/       # Intermediate checkpointing
-```
 ---
-Clone this repo and run it locally:
-pip install -r requirements.txt
-python app.py
-```
 ---
-##
-Here's how to fill out the **Hugging Face Space creation form**:
-| **License**         | Choose: `MIT`, `Apache 2.0`, or `Other`     |
-| **Space SDK**       | ✅ Gradio                                   |
-| **Gradio Template** | Start from Scratch or Blank                 |
-| **Hardware**        | ✅ Free (sufficient for your use case)      |
-| **Visibility**      | Choose: `Public` (recommended) or `Private` |
-| **Dev Mode**        | (Optional) Available to PRO subscribers     |
 ---
-Once upon a time, in a quiet village nestled between two mountains...
-```
-### 📤 Example Output (Bread only)
-```txt
-A peaceful mountain village faces hidden turmoil, gradually unveiling conflicts beneath its quiet surface.
-```
 ---
-Recommend using:
-```
-MIT License
-Copyright (c) 2025 psyrishi
-Permission is hereby granted, free of charge, to any person obtaining a copy...
-```
 ---
-Feel free to fork the repo, create pull requests, or open issues if you'd like to contribute or improve the tool.
+# Narrative Summarizer
+
+A Gradio-based app to summarize large narrative `.txt` files using transformer models with advanced chunking and multi-pass summarization.
+
+---
+
+## Features
+
+- **Robust UTF-8 file handling** with encoding detection for smooth uploads.
+- **Token-based chunking** to handle large files efficiently.
+- **Multiple prompt styles** via dropdown:
+  - Bread Only
+  - Butter Only
+  - Bread and Butter
+- **Iterative summarization passes** for better global compression.
+- **Second-pass summarization** to refine and compress summaries further.
+- Built with **Hugging Face Transformers** and **Gradio**.
+
+---
+
+## Setup
+
+1. Clone this repo:
+   ```bash
+   git clone https://huggingface.co/spaces/psyrishi/narrative-summarizer
+   cd narrative-summarizer
+   ```
+2. Install dependencies:
+   ```bash
+   pip install -r requirements.txt
+   ```
+3. Run the app locally:
+   ```bash
+   python app.py
+   ```
+
+---
+
+## Usage
+
+* Upload a `.txt` file (UTF-8 or similar encodings supported).
+* Choose a prompt style from the dropdown.
+* Select the number of summarization iterations (≥ 1).
+* Click **Summarize** to get the output.
+
+---
+
+## How It Works
+
+* Reads the input file with encoding detection to avoid decode errors.
+* Splits text into token-based chunks (~200 tokens each).
+* Applies a custom prompt and summarizes each chunk.
+* Optionally performs multiple iterative passes to refine the summary.
+* Combines chunk summaries and performs a second-pass summarization for global compression.
+
+---
+
+## Notes
+
+* Model used: `facebook/bart-large-cnn` (can be customized in `summarizer.py`).
+* GPU acceleration can speed up summarization if available.
+* For very large files, increase iterations cautiously to avoid long runtimes.
+
+---
+
+## License
+
+This project is licensed under the MIT License.
+
+---
+
+## Author
+
+Created by [psyrishi](https://huggingface.co/psyrishi)
+
+---
+
+Feel free to contribute or raise issues!
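The chunk-then-recombine flow described in the README can be sketched independently of the model. Here `summarize` is a stand-in callable for the BART pipeline call, and `two_pass_summary` is an illustrative name, not a function from this Space:

```python
def two_pass_summary(chunks, summarize, iterations=1):
    """Summarize each chunk for `iterations` passes, then compress the join."""
    condensed = []
    for chunk in chunks:
        for _ in range(iterations):
            chunk = summarize(chunk)
        condensed.append(chunk)
    combined = " ".join(condensed)
    return summarize(combined)  # second pass for global compression

# Toy "summarizer" that keeps the first three words, just to show the flow:
head = lambda text: " ".join(text.split()[:3])
print(two_pass_summary(["a b c d e", "f g h i j"], head))  # → a b c
```

Swapping `head` for the real summarization pipeline gives the map-then-reduce behavior the commit message calls "second-pass summarization".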
app.py
CHANGED
@@ -1,44 +1,44 @@
 import gradio as gr
-from summarizer import
-    "google/pegasus-xsum",
-    "allenai/led-base-16384",
-    "psyrishi/llama2-7b-summary"
-]
-def
     try:
 with gr.Blocks() as demo:
-    gr.Markdown("
-    gr.Markdown("Summarize large `.txt` files using advanced transformers like Longformer, LLaMA2, and Pegasus.")
     with gr.Row():
-        file_input = gr.File(label="Upload .txt
-            choices=[
-                "Low (50% compression)"
-            ],
-            value="Medium (70% compression)",
-            label="Compression Level"
         )
-    summarize_btn = gr.Button("Summarize")
-    output_text = gr.Textbox(label="
-demo.launch(
 import gradio as gr
+from summarizer import NarrativeSummarizer
+
+# Initialize summarizer instance (model can be configured here)
+summarizer = NarrativeSummarizer()
+
+def run_summarization(file, prompt_type, iterations):
+    if not file:
+        return "❌ Error: No file uploaded."
     try:
+        iterations = int(iterations)
+        if iterations < 1:
+            return "❌ Error: Iterations must be >= 1."
+    except ValueError:
+        return "❌ Error: Iterations must be an integer."
+
+    try:
+        # Run summarization
+        summary = summarizer.process_file(file.name, prompt_type, iterations)
+        return summary
+    except Exception as e:
+        return f"❌ Error: {str(e)}"
 
 with gr.Blocks() as demo:
+    gr.Markdown("# Narrative Summarizer")
     with gr.Row():
+        file_input = gr.File(label="Upload your .txt file")
+        prompt_dropdown = gr.Dropdown(
+            choices=["Bread Only", "Butter Only", "Bread and Butter"],
+            value="Bread Only",
+            label="Select Prompt Type"
         )
+        iterations_input = gr.Number(value=1, label="Iterations", precision=0, minimum=1)
 
+    output_text = gr.Textbox(label="Summary Output", lines=15)
+
+    run_button = gr.Button("Summarize")
+    run_button.click(
+        fn=run_summarization,
+        inputs=[file_input, prompt_dropdown, iterations_input],
+        outputs=output_text
+    )
 
+demo.launch()
requirements.txt
CHANGED
@@ -1,4 +1,5 @@
 transformers>=4.40.0
 gradio>=4.44.1
 torch
+chardet>=5.0
 hf-xet
summarizer.py
CHANGED
@@ -1,55 +1,74 @@
-import
-        except Exception as e:
-            summary = f"[Error summarizing chunk: {e}]"
-        summaries.append(summary.strip())
-    token_count = len(tokenizer.encode(combined_summary))
-    min_len, max_len = get_summary_lengths(token_count, compression_level)
     try:
     except Exception as e:
+import os
+from transformers import pipeline
+import chardet
+
+class NarrativeSummarizer:
+    def __init__(self, model_name="facebook/bart-large-cnn", chunk_size=1000):
+        self.model_name = model_name
+        self.chunk_size = chunk_size
+        self.summarizer = pipeline("summarization", model=self.model_name)
+
+    def chunk_text_token_based(self, text):
+        # Approximate token-based chunking via whitespace split
+        # (could be improved with the model's tokenizer)
+        words = text.split()
+        chunks = []
+        current_chunk = []
+        current_len = 0
+        max_tokens = 200  # approximate token limit per chunk (adjust as needed)
+        for word in words:
+            current_chunk.append(word)
+            current_len += 1
+            if current_len >= max_tokens:
+                chunks.append(" ".join(current_chunk))
+                current_chunk = []
+                current_len = 0
+        if current_chunk:
+            chunks.append(" ".join(current_chunk))
+        return chunks
+
+    def apply_custom_prompt(self, chunk, prompt_type):
+        if prompt_type == "Bread Only":
+            prompt = f"Transform the provided fictional narrative into a maximally compressed yet losslessly decompressible format optimized for LLM reconstruction. {chunk}"
+        elif prompt_type == "Butter Only":
+            prompt = f"Solid foundation, but let's refine the granularity. Your 4-subpoint structure creates artificial symmetry where organic complexity should flourish. {chunk}"
+        elif prompt_type == "Bread and Butter":
+            prompt = f"Transform the provided fictional narrative into a maximally compressed format. Then refine granularity for organic complexity. {chunk}"
+        else:
+            prompt = chunk
+        return prompt
+
+    def summarize_chunk(self, chunk, prompt_type):
+        prompt = self.apply_custom_prompt(chunk, prompt_type)
+        summary = self.summarizer(prompt, max_length=150, min_length=50, do_sample=False)
+        return summary[0]['summary_text']
+
+    def process_file(self, file_path, prompt_type, iterations=1):
+        # Read file robustly with encoding detection
         try:
+            with open(file_path, 'rb') as f:
+                raw_data = f.read()
+            detected = chardet.detect(raw_data)
+            encoding = detected['encoding'] or 'utf-8'
+            text = raw_data.decode(encoding, errors='replace')
         except Exception as e:
+            raise RuntimeError(f"Unable to read the file: {str(e)}")
+
+        # Chunk the text token-wise
+        chunks = self.chunk_text_token_based(text)
+        condensed_chunks = []
+
+        for chunk in chunks:
+            temp_chunk = chunk
+            for _ in range(iterations):
+                # summarize_chunk applies the prompt internally
+                temp_chunk = self.summarize_chunk(temp_chunk, prompt_type)
+            condensed_chunks.append(temp_chunk)
+
+        # Second-pass summarization for global compression
+        combined = " ".join(condensed_chunks)
+        if len(condensed_chunks) > 1:
+            final_summary = self.summarize_chunk(combined, prompt_type)
+        else:
+            final_summary = combined
+
+        return final_summary
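The whitespace chunker in `summarizer.py` can be exercised without loading the model. A compact standalone equivalent (illustrative only; `chunk_by_words` is not part of the commit):

```python
def chunk_by_words(text, max_tokens=200):
    """Split text into runs of at most max_tokens whitespace-delimited words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

# 450 words split at 200 words per chunk:
sizes = [len(c.split()) for c in chunk_by_words("word " * 450)]
print(sizes)  # → [200, 200, 50]
```

Word counts only approximate BART's subword token counts, so 200 words stays comfortably under the model's 1024-token input limit; for exact budgeting, count with the model's tokenizer instead.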