Upload folder using huggingface_hub
LICENSE
ADDED
QWERKY AI DISTILLED MODEL LICENSE AGREEMENT

This model is a distilled version created by QWERKY AI, Inc. and is subject to dual attribution requirements.

================================================================================
ATTRIBUTION REQUIREMENTS
================================================================================

This model is:
1. Derived from Meta's Llama 3.1 model and subject to the Llama 3.1 Community License Agreement
2. Distilled and optimized by QWERKY AI, Inc.

When using or redistributing this model, you must provide attribution to BOTH:
- Meta Platforms, Inc. for the original Llama 3.1 model
- QWERKY AI, Inc. for the distillation and optimization

Suggested attribution format:
"This model is based on Meta's Llama 3.1, distilled and optimized by QWERKY AI, Inc."

================================================================================
ORIGINAL LLAMA 3.1 LICENSE TERMS
================================================================================

This model inherits all terms and conditions from the Llama 3.1 Community License Agreement dated July 23, 2024, including but not limited to:

1. USAGE RESTRICTIONS: If you have more than 700 million monthly active users, you must request a license from Meta.

2. PROHIBITED USES: You may not use this model to:
- Violate laws or regulations
- Engage in harmful, abusive, or discriminatory activities
- Generate misinformation or harmful content

3. DISTRIBUTION: Any redistribution must include:
- This complete license
- Attribution to both Meta and QWERKY AI
- The same use restrictions

The full Llama 3.1 Community License Agreement is incorporated by reference and available at: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE

================================================================================
QWERKY AI ADDITIONAL TERMS
================================================================================

In addition to the Llama 3.1 license terms, users must:

1. ATTRIBUTION: Include clear attribution to QWERKY AI, Inc. in any:
- Academic papers or research
- Commercial products or services
- Public demonstrations or benchmarks
- Derivative works or fine-tuned versions

2. QWERKY BRANDING: Do not imply endorsement by QWERKY AI without written permission.

3. PERFORMANCE CLAIMS: When citing performance metrics, clearly indicate:
- That this is a distilled version
- That any benchmarks are specific to this distilled model
- That QWERKY AI's optimization techniques were applied

================================================================================
WARRANTY DISCLAIMER
================================================================================

THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
NEITHER META PLATFORMS, INC. NOR QWERKY AI, INC. MAKES ANY WARRANTIES REGARDING
THE MODEL'S PERFORMANCE, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE.

================================================================================

By using this model, you agree to all terms above.

Copyright (c) Meta Platforms, Inc. (Original Llama 3.1 Model)
Copyright (c) QWERKY AI, Inc. (Distillation and Optimization)
README.md
ADDED
---
license: other
tags:
- qwerky
- mamba
- llama
- hybrid
- causal-lm
- text-generation
language:
- en
library_name: transformers
pipeline_tag: text-generation
---

# QwerkyLlamaMambaHybrid

This is a hybrid Mamba-Transformer model based on the Llama 3.2 architecture, distilled from Llama 3.1 8B into a 3B-parameter model using Qwerky's proprietary distillation method. The model uses Mamba layers interleaved with attention layers for efficient sequence modeling. The result is a 3B-parameter model comparable in quality to Llama 3.2 3B while running as fast as or faster than Llama 3.2 1B.

**Model Developer**: Qwerky AI

## ⚠️ Important Requirements

**CUDA is required to run this model.** This model requires a CUDA-compatible GPU and cannot be run on CPU-only systems. Make sure you have (a quick check follows the list):
- A CUDA-compatible GPU (NVIDIA GPU with CUDA support)
- CUDA toolkit installed
- PyTorch with CUDA support
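A minimal sanity check before loading the model is to confirm that PyTorch can see a CUDA device; this sketch uses only PyTorch's public API and is not part of the model itself:

```python
import torch

# Fail fast if no CUDA device is visible -- this model cannot run on CPU.
assert torch.cuda.is_available(), "No CUDA-capable GPU detected."
print(f"Using GPU: {torch.cuda.get_device_name(0)}")
```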
## Model Details

- **Model Type:** QwerkyLlamaMambaHybrid (Hybrid Mamba-Transformer)
- **Architecture:** QwerkyLlamaMambaHybridForCausalLM
- **Base Model:** Llama-3.1-8B
- **Mamba Type:** MAMBA

### Model Configuration

- **Vocabulary Size:** 128256
- **Hidden Size:** 4096
- **Number of Layers:** 32
- **Number of Attention Heads:** 32
- **Intermediate Size:** 14336
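These values can be confirmed locally by loading the configuration; a minimal sketch, assuming the custom configuration class exposes the standard Llama field names (`vocab_size`, `hidden_size`, `num_hidden_layers`):

```python
from transformers import AutoConfig

# trust_remote_code is required because the config class is custom (via auto_map).
config = AutoConfig.from_pretrained(
    "QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-8B-Instruct",
    trust_remote_code=True,
)
print(config.vocab_size, config.hidden_size, config.num_hidden_layers)
```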
## How to Use

This model can be loaded using HuggingFace Transformers with `AutoTokenizer` and `AutoModelForCausalLM`. The model uses custom configuration and modeling files that are loaded automatically via the `auto_map` in `config.json`.

### Installation

First, install the required dependencies:

```bash
pip install transformers torch safetensors
pip install flash-attn --no-build-isolation
pip install mamba-ssm --no-build-isolation
pip install "causal-conv1d>=1.2.0" --no-build-isolation
```

**Note:** `flash-attn` compilation can take 10-30 minutes and may use significant system resources. To avoid overwhelming your system, you can limit parallel compilation jobs:

```bash
MAX_JOBS=1 pip install flash-attn --no-build-isolation
```

Or set it as an environment variable:

```bash
export MAX_JOBS=1
pip install flash-attn --no-build-isolation
```

### Loading the Model

#### From HuggingFace Hub

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-8B-Instruct",
    torch_dtype=torch.bfloat16,  # or torch.float16
    device_map="auto",  # places the weights on the GPU; no extra .to("cuda") needed
    trust_remote_code=True,
)
```

#### From Local Directory

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model from a local directory
tokenizer = AutoTokenizer.from_pretrained("./path/to/model")
model = AutoModelForCausalLM.from_pretrained(
    "./path/to/model",
    torch_dtype=torch.bfloat16,  # or torch.float16
    device_map="auto",  # places the weights on the GPU; no extra .to("cuda") needed
    trust_remote_code=True,
)
```

### Generating Text

```python
messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

# Apply chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Tokenize and move to CUDA; the model was loaded in bfloat16,
# which FlashAttention requires (float16 also works)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate response (do_sample=True so temperature takes effect)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)

# Decode output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
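For interactive use, tokens can be streamed to stdout as they are generated; a minimal sketch using Transformers' built-in `TextStreamer`, reusing the `model`, `tokenizer`, and `inputs` from the example above:

```python
from transformers import TextStreamer

# Prints tokens as they are generated, skipping the echoed prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    streamer=streamer,
)
```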
## Model Files

This model repository contains:

- `config.json` - Model configuration with `auto_map` for custom classes
- `modeling_qwerky_llama_mamba_hybrid.py` - Custom modeling class
- `configuration_qwerky_llama_mamba_hybrid.py` - Custom configuration class
- `model.safetensors` or `model-*.safetensors` - Model weights (sharded if >5GB)
- `model.safetensors.index.json` - Index file for sharded weights (if applicable)
- `tokenizer.json`, `tokenizer_config.json` - Tokenizer files
- `README.md` - This file
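To get a local copy of these files (e.g., for the "From Local Directory" workflow above), the repository can be downloaded with `huggingface_hub`; a minimal sketch, where `./qwerky-model` is an arbitrary target path:

```python
from huggingface_hub import snapshot_download

# Downloads every file in the repo and returns the local directory path.
local_dir = snapshot_download(
    "QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-8B-Instruct",
    local_dir="./qwerky-model",
)
print(local_dir)
```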
## Requirements

- Python 3.8+
- PyTorch 2.0+
- Transformers 4.30+
- safetensors
- mamba-ssm (for MAMBA models)
- causal-conv1d>=1.2.0 (for MAMBA models)
- flash-attn (for optimized attention)
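A quick way to confirm these dependencies are present is to query the installed package metadata; a minimal sketch, assuming the listed names match the PyPI distribution names:

```python
from importlib.metadata import version, PackageNotFoundError

# Report installed versions of the key dependencies.
for pkg in ["torch", "transformers", "safetensors", "mamba-ssm", "causal-conv1d", "flash-attn"]:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```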
## Citation

If you use this model, please cite:

```bibtex
@misc{qwerky_llama_mamba_hybrid,
  title={QwerkyLlamaMambaHybrid},
  author={Qwerky AI, Inc.},
  year={2025},
  publisher={HuggingFace}
}
```

## License

This model is licensed under the Qwerky AI Distilled Model License Agreement. See the [LICENSE](LICENSE) file for more details.