Arnav Chavan

updated eval email

bbd04e3 12 months ago


	LOGO = '<img src="https://nyunai.com/assets/images/logo.png">'

	LOGO2 = '<img src="https://huggingface.co/front/assets/huggingface_logo-noborder.svg">'

	TITLE = """<h1 align="left" id="space-title"> Edge LLM Leaderboard </h1>"""

	ABOUT = """
	## 📝 About
	The Edge LLM Leaderboard gauges the practical performance and quality of edge LLMs.
	It benchmarks the performance (throughput and memory)
	of Large Language Models (LLMs) on edge hardware, starting with a Raspberry Pi 5 (8GB) based on the ARM Cortex-A76 CPU.

	Anyone from the community can request a new base model or edge hardware/backend/optimization
	configuration for automated benchmarking:

	- Model evaluation requests will be made live soon. In the meantime, feel free to email requests to edge-llm-evaluation[@]nyunai[dot]com

	## ✍️ Details

	- To avoid multi-thread discrepancies, all 4 threads are used on the Pi 5.
	- LLMs are run with a batch size of 1, a prompt size of 512 tokens, and 128 generated tokens.

	All of our throughput benchmarks are run with a single tool,
	[llama-bench](https://github.com/ggerganov/llama.cpp/tree/master/examples/llama-bench),
	built on [llama.cpp](https://github.com/ggerganov/llama.cpp), to keep results reproducible and consistent.
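
The settings above can be sketched as a llama-bench invocation. This is a minimal illustration and not necessarily the exact command used for the leaderboard: the helper name `build_llama_bench_command` and the model path are hypothetical, while `-m`, `-t`, `-p`, `-n`, and `-o` are llama-bench options for the model file, thread count, prompt length, generated tokens, and output format.

```python
import shlex


def build_llama_bench_command(model_path, threads=4, prompt_tokens=512, gen_tokens=128):
    # Build the argument list for one llama-bench run.
    # Defaults mirror the settings listed above: 4 threads, 512 prompt tokens, 128 generated tokens.
    return [
        "llama-bench",
        "-m", str(model_path),     # GGUF model file (hypothetical path supplied by the caller)
        "-t", str(threads),        # number of CPU threads
        "-p", str(prompt_tokens),  # prompt (prefill) length in tokens
        "-n", str(gen_tokens),     # number of tokens to generate
        "-o", "json",              # machine-readable output
    ]


# Hypothetical model path, used only for illustration.
cmd = build_llama_bench_command("models/example-q4_0.gguf")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, capture_output=True, text=True)` on a machine where llama-bench is installed and on the `PATH`.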

	## 🏆 Ranking Models

	We evaluate model quality with MMLU (zero-shot) via [llama-perplexity](https://github.com/ggerganov/llama.cpp/tree/master/examples/perplexity), and focus on the following key metrics relevant to edge applications:

	1. Prefill Latency (Time to First Token - TTFT): Measures the time to generate the first token. Low TTFT ensures a smooth user experience, especially for real-time interactions in edge use cases.

	2. Decode Latency (Generation Speed): Indicates the speed of generating subsequent tokens, critical for real-time tasks like transcription or extended dialogue sessions.

	3. Model Size: Smaller models are better suited to edge devices, which have limited secondary storage compared to cloud or GPU systems, and are therefore easier to deploy efficiently.

	These metrics collectively address the unique challenges of deploying LLMs on edge devices, balancing performance, responsiveness, and memory constraints.
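
As a rough illustration of how the two latency metrics relate to throughput figures in tokens per second, the sketch below converts one into the other. It assumes TTFT can be approximated as prompt length divided by prefill throughput, which may differ from how the leaderboard computes it. The function names are hypothetical and the numbers are illustrative, not measured results.

```python
def ttft_seconds(prompt_tokens, prefill_tokens_per_second):
    # Approximate time to first token: time needed to process the whole prompt.
    return prompt_tokens / prefill_tokens_per_second


def decode_ms_per_token(decode_tokens_per_second):
    # Average time per generated token, in milliseconds.
    return 1000.0 / decode_tokens_per_second


# Illustrative throughput values only, not leaderboard measurements.
print(ttft_seconds(512, 64.0))    # 8.0
print(decode_ms_per_token(8.0))   # 125.0
```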

	"""


	CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results."
	CITATION_BUTTON = r"""@misc{edge-llm-leaderboard,
	author = {Arnav Chavan and Deepak Gupta and Ishan Pandey and {The HuggingFace team}},
	title = {Edge LLM Leaderboard},
	year = {2024},
	publisher = {Hugging Face},
	howpublished = "\url{https://huggingface.co/spaces/nyunai/edge-llm-leaderboard}",
	}
	"""