ulmentflam committed on
Commit
2ebd05e
·
verified ·
1 Parent(s): 1e2779e

Upload folder using huggingface_hub

Files changed (2)
  1. LICENSE +72 -0
  2. README.md +176 -0
LICENSE ADDED
@@ -0,0 +1,72 @@
+ QWERKY AI DISTILLED MODEL LICENSE AGREEMENT
+
+ This model is a distilled version created by QWERKY AI, Inc. and is subject to dual attribution requirements.
+
+ ================================================================================
+ ATTRIBUTION REQUIREMENTS
+ ================================================================================
+
+ This model is:
+ 1. Derived from Meta's Llama 3.1 model and subject to the Llama 3.1 Community License Agreement
+ 2. Distilled and optimized by QWERKY AI, Inc.
+
+ When using or redistributing this model, you must provide attribution to BOTH:
+ - Meta Platforms, Inc. for the original Llama 3.1 model
+ - QWERKY AI, Inc. for the distillation and optimization
+
+ Suggested attribution format:
+ "This model is based on Meta's Llama 3.1, distilled and optimized by QWERKY AI, Inc."
+
+ ================================================================================
+ ORIGINAL LLAMA 3.1 LICENSE TERMS
+ ================================================================================
+
+ This model inherits all terms and conditions from the Llama 3.1 Community License Agreement dated July 23, 2024, including but not limited to:
+
+ 1. USAGE RESTRICTIONS: If you have more than 700 million monthly active users, you must request a license from Meta.
+
+ 2. PROHIBITED USES: You may not use this model to:
+    - Violate laws or regulations
+    - Engage in harmful, abusive, or discriminatory activities
+    - Generate misinformation or harmful content
+
+ 3. DISTRIBUTION: Any redistribution must include:
+    - This complete license
+    - Attribution to both Meta and QWERKY AI
+    - The same use restrictions
+
+ The full Llama 3.1 Community License Agreement is incorporated by reference and available at: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE
+
+ ================================================================================
+ QWERKY AI ADDITIONAL TERMS
+ ================================================================================
+
+ In addition to the Llama 3.1 license terms, users must:
+
+ 1. ATTRIBUTION: Include clear attribution to QWERKY AI, Inc. in any:
+    - Academic papers or research
+    - Commercial products or services
+    - Public demonstrations or benchmarks
+    - Derivative works or fine-tuned versions
+
+ 2. QWERKY BRANDING: Do not imply endorsement by QWERKY AI without written permission.
+
+ 3. PERFORMANCE CLAIMS: When citing performance metrics, clearly indicate:
+    - That this is a distilled version
+    - That any benchmarks are specific to this distilled model
+    - That QWERKY AI's optimization techniques were applied
+
+ ================================================================================
+ WARRANTY DISCLAIMER
+ ================================================================================
+
+ THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
+ NEITHER META PLATFORMS, INC. NOR QWERKY AI, INC. MAKES ANY WARRANTIES REGARDING
+ THE MODEL'S PERFORMANCE, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE.
+
+ ================================================================================
+
+ By using this model, you agree to all terms above.
+
+ Copyright (c) Meta Platforms, Inc. (Original Llama 3.1 Model)
+ Copyright (c) QWERKY AI, Inc. (Distillation and Optimization)
README.md ADDED
@@ -0,0 +1,176 @@
+ ---
+ license: other
+ tags:
+ - qwerky
+ - mamba
+ - llama
+ - hybrid
+ - causal-lm
+ - text-generation
+ language:
+ - en
+ library_name: transformers
+ pipeline_tag: text-generation
+ ---
+
+ # QwerkyLlamaMambaHybrid
+
+ This is a hybrid Mamba-Transformer model based on the Llama 3.2 architecture, distilled from Llama 3.1 8B into a 3B-parameter model using Qwerky's proprietary distillation method. The model interleaves Mamba layers with attention layers for efficient sequence modeling. The result is a 3B-parameter model comparable in quality to Llama 3.2 3B while running as fast as or faster than Llama 3.2 1B.
+
+ **Model Developer**: Qwerky AI
+
+ ## ⚠️ Important Requirements
+
+ **CUDA is required to run this model.** This model requires a CUDA-compatible GPU and cannot be run on CPU-only systems. Make sure you have:
+ - A CUDA-compatible NVIDIA GPU
+ - The CUDA toolkit installed
+ - PyTorch built with CUDA support
+
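+ A quick sanity check before loading the model (a minimal sketch; the error message is illustrative):
+
+ ```python
+ import torch
+
+ # Fail fast with a clear message if no CUDA device is visible.
+ if not torch.cuda.is_available():
+     raise RuntimeError("No CUDA device found; this model cannot run on CPU-only systems.")
+ print(f"Using GPU: {torch.cuda.get_device_name(0)}")
+ ```
+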
+ ## Model Details
+
+ - **Model Type:** QwerkyLlamaMambaHybrid (Hybrid Mamba-Transformer)
+ - **Architecture:** QwerkyLlamaMambaHybridForCausalLM
+ - **Base Model:** Llama-3.1-8B
+ - **Mamba Type:** MAMBA
+
+ ### Model Configuration
+
+ - **Vocabulary Size:** 128256
+ - **Hidden Size:** 4096
+ - **Number of Layers:** 32
+ - **Number of Attention Heads:** 32
+ - **Intermediate Size:** 14336
+
+ ## How to Use
+
+ This model can be loaded using HuggingFace Transformers with `AutoTokenizer` and `AutoModelForCausalLM`. The model uses custom configuration and modeling files that are automatically loaded via the `auto_map` in `config.json`.
+
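+ To confirm that the custom classes resolve before downloading the full weights, you can load just the configuration (a minimal sketch; the printed class name depends on the `auto_map` entries in `config.json`):
+
+ ```python
+ from transformers import AutoConfig
+
+ # trust_remote_code=True lets transformers resolve the auto_map entries in
+ # config.json to the custom configuration class shipped with this repo.
+ config = AutoConfig.from_pretrained(
+     "QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-8B-Instruct",
+     trust_remote_code=True,
+ )
+ print(type(config).__name__)
+ ```
+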
+ ### Installation
+
+ First, install the required dependencies (note the quotes around the version specifier, so the shell does not treat `>` as a redirect):
+
+ ```bash
+ pip install transformers torch safetensors
+ pip install flash-attn --no-build-isolation
+ pip install mamba-ssm --no-build-isolation
+ pip install "causal-conv1d>=1.2.0" --no-build-isolation
+ ```
+
+ **Note:** `flash-attn` compilation can take 10-30 minutes and may use significant system resources. To avoid overwhelming your system, you can limit parallel compilation jobs:
+
+ ```bash
+ MAX_JOBS=1 pip install flash-attn --no-build-isolation
+ ```
+
+ Or set it as an environment variable:
+
+ ```bash
+ export MAX_JOBS=1
+ pip install flash-attn --no-build-isolation
+ ```
+
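+ After installation, a quick import check confirms the CUDA-kernel packages are usable (a sketch; a failed import simply means the package is not installed or failed to build):
+
+ ```python
+ # Try importing each dependency by its module name; flash-attn and mamba-ssm
+ # install as flash_attn and mamba_ssm respectively.
+ for module in ("torch", "transformers", "mamba_ssm", "flash_attn"):
+     try:
+         __import__(module)
+         print(f"{module}: OK")
+     except ImportError as exc:
+         print(f"{module}: MISSING ({exc})")
+ ```
+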
+ ### Loading the Model
+
+ #### From HuggingFace Hub
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ # Load tokenizer and model; device_map="auto" already places the weights
+ # on the GPU, so no explicit .to("cuda") call is needed.
+ tokenizer = AutoTokenizer.from_pretrained("QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-8B-Instruct")
+ model = AutoModelForCausalLM.from_pretrained(
+     "QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-8B-Instruct",
+     torch_dtype=torch.bfloat16,  # or torch.float16
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ ```
+
+ #### From Local Directory
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ # Load tokenizer and model from a local directory
+ tokenizer = AutoTokenizer.from_pretrained("./path/to/model")
+ model = AutoModelForCausalLM.from_pretrained(
+     "./path/to/model",
+     torch_dtype=torch.bfloat16,  # or torch.float16
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ ```
+
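+ If you prefer to fetch the repository yourself, `huggingface_hub`'s `snapshot_download` returns a local path that can be passed to `from_pretrained()` (a minimal sketch):
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Download the full repository once and reuse the cached local path.
+ local_dir = snapshot_download("QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-8B-Instruct")
+ print(local_dir)  # pass this path to from_pretrained()
+ ```
+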
+ ### Generating Text
+
+ ```python
+ messages = [
+     {"role": "user", "content": "Hello, how are you?"}
+ ]
+
+ # Apply the chat template
+ prompt = tokenizer.apply_chat_template(
+     messages,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ # Tokenize and move the inputs to CUDA; the model was already loaded in
+ # bfloat16, which satisfies FlashAttention's fp16/bf16 requirement.
+ inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
+
+ # Generate a response; do_sample=True is required for temperature to take
+ # effect, and max_new_tokens bounds the generated tokens rather than the
+ # total sequence length.
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=100,
+     do_sample=True,
+     temperature=0.7,
+ )
+
+ # Decode the output
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(response)
+ ```
+
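+ For interactive use, you can stream tokens as they are produced with transformers' standard `TextStreamer` (a sketch reusing the `model`, `tokenizer`, and `inputs` from above):
+
+ ```python
+ from transformers import TextStreamer
+
+ # Print tokens to stdout as they are generated instead of waiting
+ # for the full sequence to finish.
+ streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
+ model.generate(
+     **inputs,
+     max_new_tokens=100,
+     do_sample=True,
+     temperature=0.7,
+     streamer=streamer,
+ )
+ ```
+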
+ ## Model Files
+
+ This model repository contains:
+
+ - `config.json` - Model configuration with `auto_map` for the custom classes
+ - `modeling_qwerky_llama_mamba_hybrid.py` - Custom modeling class
+ - `configuration_qwerky_llama_mamba_hybrid.py` - Custom configuration class
+ - `model.safetensors` or `model-*.safetensors` - Model weights (sharded if >5GB)
+ - `model.safetensors.index.json` - Index file for sharded weights (if applicable)
+ - `tokenizer.json`, `tokenizer_config.json` - Tokenizer files
+ - `README.md` - This file
+
+ ## Requirements
+
+ - Python 3.8+
+ - PyTorch 2.0+
+ - Transformers 4.30+
+ - safetensors
+ - mamba-ssm (for the Mamba layers)
+ - causal-conv1d>=1.2.0 (for the Mamba layers)
+ - flash-attn (for optimized attention)
+
+ ## Citation
+
+ If you use this model, please cite:
+
+ ```bibtex
+ @misc{qwerky_llama_mamba_hybrid,
+   title={QwerkyLlamaMambaHybrid},
+   author={Qwerky AI, Inc.},
+   year={2025},
+   publisher={HuggingFace}
+ }
+ ```
+
+ ## License
+
+ This model is licensed under the Qwerky AI Distilled Model License Agreement, which incorporates the Llama 3.1 Community License Agreement by reference. See the [LICENSE](LICENSE) file for details.