psyrishi committed
Commit 16ea1b0 · 1 Parent(s): 22ddfaa

Add robust UTF-8 encoding detection, token-based chunking, and second-pass summarization


- Implemented file encoding detection to support various UTF encodings and prevent decode errors
- Changed chunking to be token-based instead of character-based for better handling of large texts
- Added second-pass summarization for improved global compression of summaries
- Updated UI to use dropdown for prompt selection (Bread, Butter, or Both)
- Improved error handling and user feedback for file uploads
- Updated README and requirements for new dependencies and usage instructions
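The encoding-detection step described in the first bullet can be sketched as follows. This is a stdlib-only illustration that sniffs UTF byte-order marks and falls back to UTF-8 with replacement characters; the committed code uses the `chardet` library instead, and the helper name `read_text_robust` is purely illustrative.

```python
import codecs

# BOMs checked longest-first so UTF-32-LE (b"\xff\xfe\x00\x00") wins
# over UTF-16-LE (b"\xff\xfe"), whose BOM is a prefix of it.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def read_text_robust(raw: bytes) -> str:
    """Decode bytes using a detected BOM, else UTF-8 with replacement."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            # Strip the BOM before decoding so it doesn't leak into the text
            return raw[len(bom):].decode(name)
    return raw.decode("utf-8", errors="replace")
```

This avoids the `UnicodeDecodeError` the commit message mentions for the common UTF variants, at the cost of guessing UTF-8 for everything without a BOM.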

Files changed (4)
  1. README.md +46 -107
  2. app.py +33 -33
  3. requirements.txt +1 -0
  4. summarizer.py +70 -51
README.md CHANGED
@@ -17,147 +17,86 @@ tags:
  - gradio
  ---
 
- # 📚 Narrative Summarizer
 
- Summarize long `.txt` narrative files into compressed, LLM-optimized summaries using BART. Choose between `Bread`, `Butter`, or both prompt styles for custom compression behavior. Upload a `.txt` file, select your preferences, and receive a clean, compressed summary in seconds.
 
  ---
 
- ## 📚 Narrative Summarizer — Hugging Face Space
 
- **`psyrishi/narrative-summarizer`**
-
- A user-friendly summarization tool for `.txt` files, powered by Hugging Face Transformers and built with Gradio.
-
- This app transforms long-form narratives into compressed, LLM-friendly summaries using the **"Bread"**, **"Butter"**, or a **combination of both** prompt styles. It supports checkpointing to avoid data loss on interruptions and ensures large text files are processed reliably.
-
- ---
-
- ### Features
-
- * ✅ Supports `.txt` file uploads up to 3 MB (or more)
- * 📌 Prompt options: `Bread`, `Butter`, or `Bread and Butter`
- * 🔁 Multi-iteration summarization support
- * 🧠 Model: `facebook/bart-large-cnn`
- * 💾 Auto checkpointing: progress won't be lost on timeout
- * 🧰 Output is saved for download after processing
- * 🌐 Clean Gradio UI – easy to run in the browser
 
  ---
 
- ### 📥 How to Use
-
- 1. **Upload** a `.txt` file (max ~3 MB recommended)
- 2. **Select** a summarization style from the dropdown:
-    * `Bread only`
-    * `Butter only`
-    * `Bread and Butter`
- 3. Choose:
-    * `Iterations`: how many times the prompts apply
-    * `Max Length`: max summary tokens per chunk
-    * `Min Length`: min summary tokens per chunk
- 4. Click **Summarize**
- 5. Get your **condensed output** in the results box
-
- ---
 
- ### ⚙️ Tech Stack
 
- | Component         | Details                          |
- | ----------------- | -------------------------------- |
- | **Frontend**      | [Gradio](https://www.gradio.app) |
- | **Backend**       | Hugging Face `transformers`      |
- | **Model**         | `facebook/bart-large-cnn`        |
- | **Checkpointing** | JSON-based resume system         |
- | **Language**      | Python 3.10+                     |
 
- ---
 
- ### 📂 Folder Structure
 
- ```
- .
- ├── app.py             # Gradio frontend app
- ├── summarizer.py      # Backend summarization logic
- ├── requirements.txt   # Dependencies
- ├── inputs/            # Uploaded input files
- ├── outputs/           # Final summarized outputs
- └── checkpoints/       # Intermediate checkpointing
- ```
 
  ---
 
- ### 🛠️ Setup (Local)
-
- Clone this repo and run it locally:
 
- ```bash
- git clone https://huggingface.co/spaces/psyrishi/narrative-summarizer
- cd narrative-summarizer
-
- pip install -r requirements.txt
- python app.py
- ```
 
  ---
 
- ## 🚀 Space Configuration
-
- Here's how to fill out the **Hugging Face Space creation form**:
 
- | Field                 | Value                                       |
- | --------------------- | ------------------------------------------- |
- | **Owner**             | `psyrishi`                                  |
- | **Space Name**        | `narrative-summarizer`                      |
- | **Short Description** | Summarizer for `.txt` files                 |
- | **License**           | Choose: `MIT`, `Apache 2.0`, or `Other`     |
- | **Space SDK**         | ✅ Gradio                                    |
- | **Gradio Template**   | Start from Scratch or Blank                 |
- | **Hardware**          | ✅ Free (sufficient for this use case)       |
- | **Visibility**        | Choose: `Public` (recommended) or `Private` |
- | **Dev Mode**          | (Optional) Available to PRO subscribers     |
 
  ---
 
- ### 🧪 Prompt Styles Explained
 
- * 🥖 **Bread**: Focuses on compression for efficient LLM parsing
- * 🧈 **Butter**: Enhances nuance and detail while summarizing
- * 🥪 **Bread + Butter**: Applies both sequentially for balance
 
  ---
 
- ### 📌 Example Input
 
- ```txt
- Once upon a time, in a quiet village nestled between two mountains...
- ```
-
- ### 📤 Example Output (Bread only)
-
- ```txt
- A peaceful mountain village faces hidden turmoil, gradually unveiling conflicts beneath its quiet surface.
- ```
 
  ---
 
- ### 🔐 License
-
- Recommend using:
-
- ```
- MIT License
-
- Copyright (c) 2025 psyrishi
- Permission is hereby granted, free of charge, to any person obtaining a copy...
- ```
 
- Or [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0).
 
  ---
 
- ### 👋 Feedback & Contributions
 
- Feel free to fork the repo, create pull requests, or open issues if you'd like to contribute or improve the tool.
+ # Narrative Summarizer
 
+ A Gradio-based app to summarize large narrative `.txt` files using transformer models with advanced chunking and multi-pass summarization.
 
  ---
 
+ ## Features
 
+ - **Robust UTF-8 file handling** with encoding detection for smooth uploads.
+ - **Token-based chunking** to handle large files efficiently.
+ - **Multiple prompt styles** via dropdown:
+   - Bread Only
+   - Butter Only
+   - Bread and Butter
+ - **Iterative summarization passes** for better global compression.
+ - **Second-pass summarization** to refine and compress summaries further.
+ - Built with **Hugging Face Transformers** and **Gradio**.
 
  ---
 
+ ## Setup
 
+ 1. Clone this repo:
+ ```bash
+ git clone https://huggingface.co/spaces/psyrishi/narrative-summarizer
+ cd narrative-summarizer
+ ```
 
+ 2. Install dependencies:
+ ```bash
+ pip install -r requirements.txt
+ ```
 
+ 3. Run the app locally:
+ ```bash
+ python app.py
+ ```
 
  ---
 
+ ## Usage
 
+ * Upload a `.txt` file (UTF-8 or similar encodings supported).
+ * Choose a prompt style from the dropdown.
+ * Select the number of summarization iterations (≥ 1).
+ * Click **Summarize** to get the output.
 
  ---
 
+ ## How It Works
 
+ * Reads the input file with encoding detection to avoid decode errors.
+ * Splits the text into token-based chunks (~200 tokens each).
+ * Applies custom prompts and summarizes each chunk.
+ * Optionally performs multiple iterative passes to refine the summary.
+ * Combines the chunk summaries and performs a second-pass summarization for global compression.
 
  ---
 
+ ## Notes
 
+ * Model used: `facebook/bart-large-cnn` (can be customized in `summarizer.py`).
+ * GPU acceleration can speed up summarization if available.
+ * For very large files, increase iterations cautiously to avoid long runtimes.
 
  ---
 
+ ## License
 
+ This project is licensed under the MIT License.
 
  ---
 
+ ## Author
 
+ Created by [psyrishi](https://huggingface.co/psyrishi)
 
  ---
 
+ Feel free to contribute or raise issues!
 
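The ~200-token chunking described in the README above is approximated in the committed `summarizer.py` by a whitespace split rather than a real tokenizer. A standalone sketch of that approximation (the function name `chunk_text` is illustrative):

```python
def chunk_text(text: str, max_tokens: int = 200) -> list[str]:
    """Split text into chunks of at most max_tokens whitespace-delimited words.

    Word count is a rough stand-in for token count; an exact version
    would use a tokenizer from `transformers` instead.
    """
    words = text.split()
    # Slice the word list in fixed strides and rejoin each slice
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]
```

Every chunk except the last is exactly `max_tokens` words, and joining the chunks back together reproduces the original word sequence, so nothing is dropped at chunk boundaries.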
 
app.py CHANGED
@@ -1,44 +1,44 @@
  import gradio as gr
- from summarizer import load_model, summarize_chunks
 
- model_options = [
-     "facebook/bart-large-cnn",
-     "google/pegasus-xsum",
-     "allenai/led-base-16384",
-     "psyrishi/llama2-7b-summary"
- ]
 
- def summarize_file(file_obj, compression_level, model_name):
      try:
-         text = file_obj.read().decode("utf-8")
-     except:
-         return "❌ Error: Unable to read the file. Please upload a valid UTF-8 text file."
-
-     summarizer, tokenizer = load_model(model_name)
-     result = summarize_chunks(text, summarizer, tokenizer, compression_level=compression_level, second_pass=True)
-     return result
 
  with gr.Blocks() as demo:
-     gr.Markdown("## 📚 Advanced Narrative Summarizer")
-     gr.Markdown("Summarize large `.txt` files using advanced transformers like Longformer, LLaMA2, and Pegasus.")
-
      with gr.Row():
-         file_input = gr.File(label="Upload .txt File", file_types=[".txt"])
-         compression_dropdown = gr.Dropdown(
-             choices=[
-                 "High (90% compression)",
-                 "Medium (70% compression)",
-                 "Low (50% compression)"
-             ],
-             value="Medium (70% compression)",
-             label="Compression Level"
          )
-         model_dropdown = gr.Dropdown(choices=model_options, value=model_options[0], label="Model")
-
-     summarize_btn = gr.Button("Summarize")
 
-     output_text = gr.Textbox(label="📄 Summarized Output", lines=20, interactive=False)
 
-     summarize_btn.click(fn=summarize_file, inputs=[file_input, compression_dropdown, model_dropdown], outputs=output_text)
 
- demo.launch(share=True)
  import gradio as gr
+ from summarizer import NarrativeSummarizer
 
+ # Initialize the summarizer instance (model name etc. can be specified here)
+ summarizer = NarrativeSummarizer()
 
+ def run_summarization(file, prompt_type, iterations):
+     if not file:
+         return "❌ Error: No file uploaded."
      try:
+         iterations = int(iterations)
+         if iterations < 1:
+             return "❌ Error: Iterations must be >= 1."
+     except ValueError:
+         return "❌ Error: Iterations must be an integer."
+
+     try:
+         # Run summarization
+         summary = summarizer.process_file(file.name, prompt_type, iterations)
+         return summary
+     except Exception as e:
+         return f"❌ Error: {str(e)}"
 
  with gr.Blocks() as demo:
+     gr.Markdown("# Narrative Summarizer")
      with gr.Row():
+         file_input = gr.File(label="Upload your .txt file")
+         prompt_dropdown = gr.Dropdown(
+             choices=["Bread Only", "Butter Only", "Bread and Butter"],
+             value="Bread Only",
+             label="Select Prompt Type"
          )
+     iterations_input = gr.Number(value=1, label="Iterations", precision=0, minimum=1)
 
+     output_text = gr.Textbox(label="Summary Output", lines=15)
 
+     run_button = gr.Button("Summarize")
+     run_button.click(
+         fn=run_summarization,
+         inputs=[file_input, prompt_dropdown, iterations_input],
+         outputs=output_text
+     )
 
+ demo.launch()
requirements.txt CHANGED
@@ -1,4 +1,5 @@
  transformers>=4.40.0
  gradio>=4.44.1
  torch
+ chardet>=5.0
  hf-xet
summarizer.py CHANGED
@@ -1,55 +1,74 @@
- from transformers import pipeline, AutoTokenizer
- import math
-
- def load_model(model_name="facebook/bart-large-cnn"):
-     summarizer = pipeline("summarization", model=model_name)
-     tokenizer = AutoTokenizer.from_pretrained(model_name)
-     return summarizer, tokenizer
-
- def chunk_text_by_tokens(text, tokenizer, max_tokens=1024):
-     tokens = tokenizer.encode(text)
-     chunks = []
-     for i in range(0, len(tokens), max_tokens):
-         chunk_tokens = tokens[i:i + max_tokens]
-         chunk_text = tokenizer.decode(chunk_tokens, skip_special_tokens=True)
-         chunks.append(chunk_text)
-     return chunks
-
- def get_summary_lengths(token_count, compression_level):
-     if compression_level == "High (90% compression)":
-         factor = 0.1
-     elif compression_level == "Medium (70% compression)":
-         factor = 0.3
-     else:
-         factor = 0.5
-
-     max_len = max(30, math.ceil(token_count * factor))
-     min_len = max(10, math.ceil(token_count * (factor / 2)))
-     return min_len, max_len
-
- def summarize_chunks(text, summarizer, tokenizer, compression_level="Medium (70% compression)", second_pass=True):
-     chunks = chunk_text_by_tokens(text, tokenizer, max_tokens=1024)
-     summaries = []
-
-     for chunk in chunks:
-         token_count = len(tokenizer.encode(chunk))
-         min_len, max_len = get_summary_lengths(token_count, compression_level)
-         try:
-             summary = summarizer(chunk, max_length=max_len, min_length=min_len, do_sample=False)[0]['summary_text']
-         except Exception as e:
-             summary = f"[Error summarizing chunk: {e}]"
-         summaries.append(summary.strip())
 
-     combined_summary = "\n\n".join(summaries)
 
-     # Second-pass summarization (global)
-     if second_pass and len(summaries) > 1:
-         token_count = len(tokenizer.encode(combined_summary))
-         min_len, max_len = get_summary_lengths(token_count, compression_level)
          try:
-             final_summary = summarizer(combined_summary, max_length=max_len, min_length=min_len, do_sample=False)[0]['summary_text']
-             return final_summary
          except Exception as e:
-             return combined_summary + f"\n\n[Second pass error: {e}]"
-     else:
-         return combined_summary
+ import os
+ import chardet
+ from transformers import pipeline
+
+ class NarrativeSummarizer:
+     def __init__(self, model_name="facebook/bart-large-cnn", chunk_size=1000):
+         self.model_name = model_name
+         self.chunk_size = chunk_size
+         self.summarizer = pipeline("summarization", model=self.model_name)
+
+     def chunk_text_token_based(self, text):
+         # Token-based chunking approximated by a whitespace split
+         # (could be made exact with a tokenizer from `transformers`)
+         words = text.split()
+         chunks = []
+         current_chunk = []
+         current_len = 0
+         max_tokens = 200  # approximate token limit per chunk (adjust as needed)
+         for word in words:
+             current_chunk.append(word)
+             current_len += 1
+             if current_len >= max_tokens:
+                 chunks.append(" ".join(current_chunk))
+                 current_chunk = []
+                 current_len = 0
+         if current_chunk:
+             chunks.append(" ".join(current_chunk))
+         return chunks
+
+     def apply_custom_prompt(self, chunk, prompt_type):
+         if prompt_type == "Bread Only":
+             prompt = f"Transform the provided fictional narrative into a maximally compressed yet losslessly decompressible format optimized for LLM reconstruction. {chunk}"
+         elif prompt_type == "Butter Only":
+             prompt = f"Solid foundation, but let's refine the granularity. Your 4-subpoint structure creates artificial symmetry where organic complexity should flourish. {chunk}"
+         elif prompt_type == "Bread and Butter":
+             prompt = f"Transform the provided fictional narrative into a maximally compressed format. Then refine granularity for organic complexity. {chunk}"
+         else:
+             prompt = chunk
+         return prompt
+
+     def summarize_chunk(self, chunk, prompt_type):
+         prompt = self.apply_custom_prompt(chunk, prompt_type)
+         summary = self.summarizer(prompt, max_length=150, min_length=50, do_sample=False)
+         return summary[0]['summary_text']
+
+     def process_file(self, file_path, prompt_type, iterations=1):
+         # Read the file robustly with encoding detection
          try:
+             with open(file_path, 'rb') as f:
+                 raw_data = f.read()
+             detected = chardet.detect(raw_data)
+             encoding = detected['encoding'] or 'utf-8'
+             text = raw_data.decode(encoding, errors='replace')
          except Exception as e:
+             raise RuntimeError(f"Unable to read the file: {str(e)}")
+
+         # Chunk the text token-wise
+         chunks = self.chunk_text_token_based(text)
+         condensed_chunks = []
+
+         for chunk in chunks:
+             temp_chunk = chunk
+             for _ in range(iterations):
+                 # summarize_chunk applies the prompt internally, so don't
+                 # apply it again here or the prompt text is duplicated
+                 temp_chunk = self.summarize_chunk(temp_chunk, prompt_type)
+             condensed_chunks.append(temp_chunk)
+
+         # Second-pass summarization for global compression, needed only
+         # when there is more than one chunk summary to merge
+         combined = " ".join(condensed_chunks)
+         if len(condensed_chunks) > 1:
+             final_summary = self.summarize_chunk(combined, prompt_type)
+         else:
+             final_summary = combined
+
+         return final_summary
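Taken together, the chunk-then-recombine flow in `process_file` is a map-reduce pattern: summarize each chunk (map), then summarize the joined chunk summaries for global compression (reduce). A minimal sketch of the control flow with a stub in place of the BART pipeline (the `summarize_stub` function just truncates text and exists only to make the flow runnable; it is not the committed code):

```python
def summarize_stub(text: str, max_words: int = 20) -> str:
    """Stand-in for the summarization pipeline: keep the first max_words words."""
    return " ".join(text.split()[:max_words])

def map_reduce_summary(chunks: list[str], iterations: int = 1) -> str:
    # Map: condense each chunk, optionally through several passes
    condensed = []
    for chunk in chunks:
        for _ in range(max(1, iterations)):
            chunk = summarize_stub(chunk)
        condensed.append(chunk)
    # Reduce: second pass over the joined chunk summaries, only needed
    # when there is more than one chunk to merge
    combined = " ".join(condensed)
    return summarize_stub(combined) if len(condensed) > 1 else combined
```

With a real model, each map step costs one pipeline call per chunk per iteration plus one final call, which is why the README warns against raising iterations on very large files.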