Add robust UTF-8 encoding detection, token-based chunking, and second-pass summarization
- Implemented file encoding detection to support various UTF encodings and prevent decode errors
- Changed chunking to be token-based instead of character-based for better handling of large texts
- Added second-pass summarization for improved global compression of summaries
- Updated UI to use dropdown for prompt selection (Bread, Butter, or Both)
- Improved error handling and user feedback for file uploads
- Updated README and requirements for new dependencies and usage instructions
- README.md +46 -107
- app.py +33 -33
- requirements.txt +1 -0
- summarizer.py +70 -51
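The commit's encoding detection relies on `chardet`. As a rough dependency-free sketch of the same fallback idea (the function name `read_text_robust` and the encoding list are illustrative, not from this commit):

```python
def read_text_robust(path):
    """Try common UTF encodings, falling back to replacement decoding."""
    with open(path, "rb") as f:
        raw = f.read()
    # utf-8-sig also accepts plain UTF-8; a UTF-16 BOM fails it and
    # falls through to the utf-16 codec.
    for encoding in ("utf-8-sig", "utf-16", "latin-1"):
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", errors="replace")
```

`chardet.detect` (used in the actual `summarizer.py` below) guesses the encoding statistically instead of trying a fixed list, which handles more exotic inputs at the cost of an extra dependency.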
README.md
CHANGED
@@ -17,147 +17,86 @@ tags:
-  gradio
 ---
-#
 ---
-##
-* ✅ Supports `.txt` file uploads up to 3 MB (or more)
-* 📌 Prompt options: `Bread`, `Butter`, or `Bread and Butter`
-* 🔁 Multi-iteration summarization support
-* 🧠 Model: `facebook/bart-large-cnn`
-* 💾 Auto checkpointing: progress won't be lost on timeout
-* 🧰 Output is saved for download post-processing
-* 🌐 Clean Gradio UI – easy to run in browser
 ---
-1. **Upload** a `.txt` file (max ~3MB recommended)
-2. **Select** a summarization style from dropdown:
-   * `Bread only`
-   * `Butter only`
-   * `Bread and Butter`
-3. Choose:
-   * `Iterations`: how many times the prompts apply
-   * `Max Length`: max summary tokens per chunk
-   * `Min Length`: min summary tokens per chunk
-4. Click **Summarize**
-5. Get your **condensed output** in the results box
-
----
-| ----------------- | -------------------------------- |
-| **Frontend**      | [Gradio](https://www.gradio.app) |
-| **Backend**       | Hugging Face `transformers`      |
-| **Model**         | `facebook/bart-large-cnn`        |
-| **Checkpointing** | JSON-based resume system         |
-| **Language**      | Python 3.10+                     |
-
-```
-.
-├── summarizer.py      # Backend summarization logic
-├── requirements.txt   # Dependencies
-├── inputs/            # Uploaded input files
-├── outputs/           # Final summarized outputs
-└── checkpoints/       # Intermediate checkpointing
-```
 ---
-Clone this repo and run it locally:
-pip install -r requirements.txt
-python app.py
-```
 ---
-##
-Here's how to fill out the **Hugging Face Space creation form**:
-| **License**         | Choose: `MIT`, `Apache 2.0`, or `Other`     |
-| **Space SDK**       | ✅ Gradio                                   |
-| **Gradio Template** | Start from Scratch or Blank                 |
-| **Hardware**        | ✅ Free (sufficient for your use case)      |
-| **Visibility**      | Choose: `Public` (recommended) or `Private` |
-| **Dev Mode**        | (Optional) Available to PRO subscribers     |
 ---
-Once upon a time, in a quiet village nestled between two mountains...
-```
-### 📤 Example Output (Bread only)
-```txt
-A peaceful mountain village faces hidden turmoil, gradually unveiling conflicts beneath its quiet surface.
-```
 ---
-Recommend using:
-```
-MIT License
-Copyright (c) 2025 psyrishi
-Permission is hereby granted, free of charge, to any person obtaining a copy...
-```
 ---
-Feel free to fork the repo, create pull requests, or open issues if you'd like to contribute or improve the tool.
+# Narrative Summarizer
+
+A Gradio-based app to summarize large narrative `.txt` files using transformer models with advanced chunking and multi-pass summarization.
+
+---
+
+## Features
+
+- **Robust UTF-8 file handling** with encoding detection for smooth uploads.
+- **Token-based chunking** to handle large files efficiently.
+- **Multiple prompt styles** via dropdown:
+  - Bread Only
+  - Butter Only
+  - Bread and Butter
+- **Iterative summarization passes** for better global compression.
+- **Second-pass summarization** to refine and compress summaries further.
+- Built with **Hugging Face Transformers** and **Gradio**.
+
+---
+
+## Setup
+
+1. Clone this repo:
+   ```bash
+   git clone https://huggingface.co/spaces/psyrishi/narrative-summarizer
+   cd narrative-summarizer
+   ```
+2. Install dependencies:
+   ```bash
+   pip install -r requirements.txt
+   ```
+3. Run the app locally:
+   ```bash
+   python app.py
+   ```
+
+---
+
+## Usage
+
+* Upload a `.txt` file (UTF-8 or similar encodings supported).
+* Choose a prompt style from the dropdown.
+* Select the number of summarization iterations (≥ 1).
+* Click **Summarize** to get the output.
+
+---
+
+## How It Works
+
+* Reads the input file with encoding detection to avoid decode errors.
+* Splits text into token-based chunks (~200 tokens each).
+* Applies a custom prompt and summarizes each chunk.
+* Optionally performs multiple iterative passes to refine the summary.
+* Combines chunk summaries and performs a second-pass summarization for global compression.
+
+---
+
+## Notes
+
+* Model used: `facebook/bart-large-cnn` (can be customized in `summarizer.py`).
+* GPU acceleration can speed up summarization if available.
+* For very large files, increase iterations cautiously to avoid long runtimes.
+
+---
+
+## License
+
+This project is licensed under the MIT License.
+
+---
+
+## Author
+
+Created by [psyrishi](https://huggingface.co/psyrishi)
+
+---
+
+Feel free to contribute or raise issues!
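The chunk-then-recombine flow described in the README can be sketched independently of the model. Here `summarize` is a stand-in callable for the BART pipeline call, and `two_pass_summary` is an illustrative name, not a function from this Space:

```python
def two_pass_summary(chunks, summarize, iterations=1):
    """Summarize each chunk for `iterations` passes, then compress the join."""
    condensed = []
    for chunk in chunks:
        for _ in range(iterations):
            chunk = summarize(chunk)
        condensed.append(chunk)
    combined = " ".join(condensed)
    return summarize(combined)  # second pass for global compression

# Toy "summarizer" that keeps the first three words, just to show the flow:
head = lambda text: " ".join(text.split()[:3])
print(two_pass_summary(["a b c d e", "f g h i j"], head))  # → a b c
```

Swapping `head` for the real summarization pipeline gives the map-then-reduce behavior the commit message calls "second-pass summarization".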
app.py
CHANGED
@@ -1,44 +1,44 @@
 import gradio as gr
-from summarizer import
-    "google/pegasus-xsum",
-    "allenai/led-base-16384",
-    "psyrishi/llama2-7b-summary"
-]
-def
     try:
 with gr.Blocks() as demo:
-    gr.Markdown("
-    gr.Markdown("Summarize large `.txt` files using advanced transformers like Longformer, LLaMA2, and Pegasus.")
     with gr.Row():
-        file_input = gr.File(label="Upload .txt
-            choices=[
-                "Low (50% compression)"
-            ],
-            value="Medium (70% compression)",
-            label="Compression Level"
         )
-    summarize_btn = gr.Button("Summarize")
-    output_text = gr.Textbox(label="
-demo.launch(
 import gradio as gr
+from summarizer import NarrativeSummarizer
+
+# Initialize summarizer instance (model can be configured here)
+summarizer = NarrativeSummarizer()
+
+def run_summarization(file, prompt_type, iterations):
+    if not file:
+        return "❌ Error: No file uploaded."
     try:
+        iterations = int(iterations)
+        if iterations < 1:
+            return "❌ Error: Iterations must be >= 1."
+    except ValueError:
+        return "❌ Error: Iterations must be an integer."
+
+    try:
+        # Run summarization
+        summary = summarizer.process_file(file.name, prompt_type, iterations)
+        return summary
+    except Exception as e:
+        return f"❌ Error: {str(e)}"
 
 with gr.Blocks() as demo:
+    gr.Markdown("# Narrative Summarizer")
     with gr.Row():
+        file_input = gr.File(label="Upload your .txt file")
+        prompt_dropdown = gr.Dropdown(
+            choices=["Bread Only", "Butter Only", "Bread and Butter"],
+            value="Bread Only",
+            label="Select Prompt Type"
         )
+        iterations_input = gr.Number(value=1, label="Iterations", precision=0, minimum=1)
 
+    output_text = gr.Textbox(label="Summary Output", lines=15)
+
+    run_button = gr.Button("Summarize")
+    run_button.click(
+        fn=run_summarization,
+        inputs=[file_input, prompt_dropdown, iterations_input],
+        outputs=output_text
+    )
 
+demo.launch()
requirements.txt
CHANGED
@@ -1,4 +1,5 @@
 transformers>=4.40.0
 gradio>=4.44.1
 torch
+chardet>=5.0
 hf-xet
summarizer.py
CHANGED
@@ -1,55 +1,74 @@
-import
-        except Exception as e:
-            summary = f"[Error summarizing chunk: {e}]"
-        summaries.append(summary.strip())
-    token_count = len(tokenizer.encode(combined_summary))
-    min_len, max_len = get_summary_lengths(token_count, compression_level)
     try:
     except Exception as e:
+import os
+from transformers import pipeline
+import chardet
+
+class NarrativeSummarizer:
+    def __init__(self, model_name="facebook/bart-large-cnn", chunk_size=1000):
+        self.model_name = model_name
+        self.chunk_size = chunk_size
+        self.summarizer = pipeline("summarization", model=self.model_name)
+
+    def chunk_text_token_based(self, text):
+        # Approximate token-based chunking via whitespace split
+        # (could be improved with the model's tokenizer)
+        words = text.split()
+        chunks = []
+        current_chunk = []
+        current_len = 0
+        max_tokens = 200  # approximate token limit per chunk (adjust as needed)
+        for word in words:
+            current_chunk.append(word)
+            current_len += 1
+            if current_len >= max_tokens:
+                chunks.append(" ".join(current_chunk))
+                current_chunk = []
+                current_len = 0
+        if current_chunk:
+            chunks.append(" ".join(current_chunk))
+        return chunks
+
+    def apply_custom_prompt(self, chunk, prompt_type):
+        if prompt_type == "Bread Only":
+            prompt = f"Transform the provided fictional narrative into a maximally compressed yet losslessly decompressible format optimized for LLM reconstruction. {chunk}"
+        elif prompt_type == "Butter Only":
+            prompt = f"Solid foundation, but let's refine the granularity. Your 4-subpoint structure creates artificial symmetry where organic complexity should flourish. {chunk}"
+        elif prompt_type == "Bread and Butter":
+            prompt = f"Transform the provided fictional narrative into a maximally compressed format. Then refine granularity for organic complexity. {chunk}"
+        else:
+            prompt = chunk
+        return prompt
+
+    def summarize_chunk(self, chunk, prompt_type):
+        prompt = self.apply_custom_prompt(chunk, prompt_type)
+        summary = self.summarizer(prompt, max_length=150, min_length=50, do_sample=False)
+        return summary[0]['summary_text']
+
+    def process_file(self, file_path, prompt_type, iterations=1):
+        # Read file robustly with encoding detection
         try:
+            with open(file_path, 'rb') as f:
+                raw_data = f.read()
+            detected = chardet.detect(raw_data)
+            encoding = detected['encoding'] or 'utf-8'
+            text = raw_data.decode(encoding, errors='replace')
         except Exception as e:
+            raise RuntimeError(f"Unable to read the file: {str(e)}")
+
+        # Chunk the text token-wise
+        chunks = self.chunk_text_token_based(text)
+        condensed_chunks = []
+
+        for chunk in chunks:
+            temp_chunk = chunk
+            for _ in range(iterations):
+                # summarize_chunk applies the prompt internally
+                temp_chunk = self.summarize_chunk(temp_chunk, prompt_type)
+            condensed_chunks.append(temp_chunk)
+
+        # Second-pass summarization for global compression
+        combined = " ".join(condensed_chunks)
+        if len(condensed_chunks) > 1:
+            final_summary = self.summarize_chunk(combined, prompt_type)
+        else:
+            final_summary = combined
+
+        return final_summary
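The whitespace chunker in `summarizer.py` can be exercised without loading the model. A compact standalone equivalent (illustrative only; `chunk_by_words` is not part of the commit):

```python
def chunk_by_words(text, max_tokens=200):
    """Split text into runs of at most max_tokens whitespace-delimited words."""
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]

# 450 words split at 200 words per chunk:
sizes = [len(c.split()) for c in chunk_by_words("word " * 450)]
print(sizes)  # → [200, 200, 50]
```

Word counts only approximate BART's subword token counts, so 200 words stays comfortably under the model's 1024-token input limit; for exact budgeting, count with the model's tokenizer instead.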