Upload folder using huggingface_hub
LICENSE
ADDED
QWERKY AI DISTILLED MODEL LICENSE AGREEMENT

This model is a distilled version created by QWERKY AI, Inc. and is subject to dual attribution requirements.

================================================================================
ATTRIBUTION REQUIREMENTS
================================================================================

This model is:
1. Derived from Meta's Llama 3.1 model and subject to the Llama 3.1 Community License Agreement
2. Distilled and optimized by QWERKY AI, Inc.

When using or redistributing this model, you must provide attribution to BOTH:
- Meta Platforms, Inc. for the original Llama 3.1 model
- QWERKY AI, Inc. for the distillation and optimization

Suggested attribution format:
"This model is based on Meta's Llama 3.1, distilled and optimized by QWERKY AI, Inc."

================================================================================
ORIGINAL LLAMA 3.1 LICENSE TERMS
================================================================================

This model inherits all terms and conditions from the Llama 3.1 Community License Agreement dated July 23, 2024, including but not limited to:

1. USAGE RESTRICTIONS: If you have more than 700 million monthly active users, you must request a license from Meta.

2. PROHIBITED USES: You may not use this model to:
- Violate laws or regulations
- Engage in harmful, abusive, or discriminatory activities
- Generate misinformation or harmful content

3. DISTRIBUTION: Any redistribution must include:
- This complete license
- Attribution to both Meta and QWERKY AI
- The same use restrictions

The full Llama 3.1 Community License Agreement is incorporated by reference and available at: https://github.com/meta-llama/llama-models/blob/main/models/llama3_1/LICENSE

================================================================================
QWERKY AI ADDITIONAL TERMS
================================================================================

In addition to the Llama 3.1 license terms, users must:

1. ATTRIBUTION: Include clear attribution to QWERKY AI, Inc. in any:
- Academic papers or research
- Commercial products or services
- Public demonstrations or benchmarks
- Derivative works or fine-tuned versions

2. QWERKY BRANDING: Do not imply endorsement by QWERKY AI without written permission.

3. PERFORMANCE CLAIMS: When citing performance metrics, clearly indicate:
- That this is a distilled version
- That any benchmarks are specific to this distilled model
- That QWERKY AI's optimization techniques were applied

================================================================================
WARRANTY DISCLAIMER
================================================================================

THIS MODEL IS PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.
NEITHER META PLATFORMS, INC. NOR QWERKY AI, INC. MAKES ANY WARRANTIES REGARDING
THE MODEL'S PERFORMANCE, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE.

================================================================================

By using this model, you agree to all terms above.

Copyright (c) Meta Platforms, Inc. (Original Llama 3.1 Model)
Copyright (c) QWERKY AI, Inc. (Distillation and Optimization)
README.md
ADDED
---
license: other
tags:
- qwerky
- mamba
- llama
- hybrid
- causal-lm
- text-generation
language:
- en
library_name: transformers
pipeline_tag: text-generation
---

# QwerkyLlamaMambaHybrid

This is a hybrid Mamba-Transformer model based on the Llama 3.2 architecture, distilled from Llama 3.1 8B into a 3B-parameter model using Qwerky's proprietary distillation method. The model uses Mamba layers interleaved with attention layers for efficient sequence modeling. The result is a 3B-parameter model comparable in quality to Llama 3.2 3B while running as fast as or faster than Llama 3.2 1B.

**Model Developer**: Qwerky AI

## ⚠️ Important Requirements

**CUDA is required to run this model.** This model requires a CUDA-compatible GPU and cannot be run on CPU-only systems. Make sure you have (a quick check follows the list):
- A CUDA-compatible GPU (NVIDIA GPU with CUDA support)
- CUDA toolkit installed
- PyTorch with CUDA support
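A minimal sanity check before loading the model is to confirm that PyTorch can see a CUDA device; this sketch uses only PyTorch's public API and is not part of the model itself:

```python
import torch

# Fail fast if no CUDA device is visible -- this model cannot run on CPU.
assert torch.cuda.is_available(), "No CUDA-capable GPU detected."
print(f"Using GPU: {torch.cuda.get_device_name(0)}")
```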
## Model Details

- **Model Type:** QwerkyLlamaMambaHybrid (Hybrid Mamba-Transformer)
- **Architecture:** QwerkyLlamaMambaHybridForCausalLM
- **Base Model:** Llama-3.1-8B
- **Mamba Type:** MAMBA

### Model Configuration

- **Vocabulary Size:** 128256
- **Hidden Size:** 4096
- **Number of Layers:** 32
- **Number of Attention Heads:** 32
- **Intermediate Size:** 14336
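These values can be confirmed locally by loading the configuration; a minimal sketch, assuming the custom configuration class exposes the standard Llama field names (`vocab_size`, `hidden_size`, `num_hidden_layers`):

```python
from transformers import AutoConfig

# trust_remote_code is required because the config class is custom (via auto_map).
config = AutoConfig.from_pretrained(
    "QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-8B-Instruct",
    trust_remote_code=True,
)
print(config.vocab_size, config.hidden_size, config.num_hidden_layers)
```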
## How to Use

This model can be loaded using HuggingFace Transformers with `AutoTokenizer` and `AutoModelForCausalLM`. The model uses custom configuration and modeling files that are loaded automatically via the `auto_map` in `config.json`.

### Installation

First, install the required dependencies:

```bash
pip install transformers torch safetensors
pip install flash-attn --no-build-isolation
pip install mamba-ssm --no-build-isolation
pip install "causal-conv1d>=1.2.0" --no-build-isolation
```

**Note:** `flash-attn` compilation can take 10-30 minutes and may use significant system resources. To avoid overwhelming your system, you can limit parallel compilation jobs:

```bash
MAX_JOBS=1 pip install flash-attn --no-build-isolation
```

Or set it as an environment variable:

```bash
export MAX_JOBS=1
pip install flash-attn --no-build-isolation
```

### Loading the Model

#### From HuggingFace Hub

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained(
    "QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-8B-Instruct",
    torch_dtype=torch.bfloat16,  # or torch.float16
    device_map="auto",  # places the weights on the GPU; no extra .to("cuda") needed
    trust_remote_code=True,
)
```

#### From Local Directory

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load tokenizer and model from a local directory
tokenizer = AutoTokenizer.from_pretrained("./path/to/model")
model = AutoModelForCausalLM.from_pretrained(
    "./path/to/model",
    torch_dtype=torch.bfloat16,  # or torch.float16
    device_map="auto",  # places the weights on the GPU; no extra .to("cuda") needed
    trust_remote_code=True,
)
```

### Generating Text

```python
messages = [
    {"role": "user", "content": "Hello, how are you?"}
]

# Apply chat template
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Tokenize and move to CUDA; the model was loaded in bfloat16,
# which FlashAttention requires (float16 also works)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

# Generate response (do_sample=True so temperature takes effect)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
)

# Decode output
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
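For interactive use, tokens can be streamed to stdout as they are generated; a minimal sketch using Transformers' built-in `TextStreamer`, reusing the `model`, `tokenizer`, and `inputs` from the example above:

```python
from transformers import TextStreamer

# Prints tokens as they are generated, skipping the echoed prompt.
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    streamer=streamer,
)
```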
## Model Files

This model repository contains:

- `config.json` - Model configuration with `auto_map` for custom classes
- `modeling_qwerky_llama_mamba_hybrid.py` - Custom modeling class
- `configuration_qwerky_llama_mamba_hybrid.py` - Custom configuration class
- `model.safetensors` or `model-*.safetensors` - Model weights (sharded if >5GB)
- `model.safetensors.index.json` - Index file for sharded weights (if applicable)
- `tokenizer.json`, `tokenizer_config.json` - Tokenizer files
- `README.md` - This file
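To get a local copy of these files (e.g., for the "From Local Directory" workflow above), the repository can be downloaded with `huggingface_hub`; a minimal sketch, where `./qwerky-model` is an arbitrary target path:

```python
from huggingface_hub import snapshot_download

# Downloads every file in the repo and returns the local directory path.
local_dir = snapshot_download(
    "QwerkyAI/Qwerky-Optimized-Llama3.2-Mamba-0.2-8B-Instruct",
    local_dir="./qwerky-model",
)
print(local_dir)
```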
## Requirements

- Python 3.8+
- PyTorch 2.0+
- Transformers 4.30+
- safetensors
- mamba-ssm (for MAMBA models)
- causal-conv1d>=1.2.0 (for MAMBA models)
- flash-attn (for optimized attention)
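A quick way to confirm these dependencies are present is to query the installed package metadata; a minimal sketch, assuming the listed names match the PyPI distribution names:

```python
from importlib.metadata import version, PackageNotFoundError

# Report installed versions of the key dependencies.
for pkg in ["torch", "transformers", "safetensors", "mamba-ssm", "causal-conv1d", "flash-attn"]:
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```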
## Citation

If you use this model, please cite:

```bibtex
@misc{qwerky_llama_mamba_hybrid,
  title={QwerkyLlamaMambaHybrid},
  author={Qwerky AI, Inc.},
  year={2025},
  publisher={HuggingFace}
}
```

## License

This model is licensed under the Qwerky AI Distilled Model License Agreement. See the [LICENSE](LICENSE) file for more details.