---
license: cc-by-4.0
datasets:
- allenai/c4
language:
- en
metrics:
- accuracy
base_model:
- deepseek-ai/deepseek-llm-67b-chat
pipeline_tag: text-generation
tags:
- biology
- chemistry
- finance
- legal
- climate
- medical
---

# Overview
This document presents the evaluation results of `DeepSeek-LLM-67B-Chat`, an **8-bit model quantized with GPTQ**, evaluated with the **Language Model Evaluation Harness** on the **ARC, GPQA**, and **IFEval** benchmarks.
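
The numbers below can in principle be reproduced with the harness's Python API. The following is a minimal sketch: the repo id is a placeholder for wherever the quantized checkpoint is hosted, and the exact GPQA/IFEval task names vary between harness versions, so check `lm_eval --tasks list` against your install.

```python
# Minimal sketch: running the Language Model Evaluation Harness on the
# quantized checkpoint. The repo id is a placeholder, and the task names
# shown here may differ in your harness version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # standard Hugging Face loader; picks up the GPTQ config
    model_args="pretrained=<your-org>/deepseek-llm-67b-chat-gptq-8bit,dtype=float16",
    tasks=["arc_challenge", "gpqa_main_zeroshot", "ifeval"],
    batch_size=1,
)

# Per-task metric dictionaries, e.g. {"acc,none": 0.5811, ...}
for task, metrics in results["results"].items():
    print(task, metrics)
```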

---

## 📊 Evaluation Summary

| **Metric** | **Value** | **Description** |
|------------|-----------|-----------------|
| **ARC-Challenge** | `58.11%` | Raw accuracy (`acc,none`) |
| **GPQA Overall** | `25.44%` | Averaged across GPQA-Diamond, GPQA-Extended, GPQA-Main (n-shot, zero-shot, CoT, generative) |
| **GPQA (n-shot acc)** | `33.04%` | Averaged over GPQA-Diamond, GPQA-Extended, GPQA-Main (`acc,none`) |
| **GPQA (zero-shot acc)** | `32.51%` | Averaged over GPQA-Diamond, GPQA-Extended, GPQA-Main (`acc,none`) |
| **GPQA (CoT n-shot)** | `17.21%` | Averaged over GPQA-Diamond, GPQA-Extended, GPQA-Main (`exact_match`, flexible-extract) |
| **GPQA (CoT zero-shot)** | `17.52%` | Averaged over GPQA-Diamond, GPQA-Extended, GPQA-Main (`exact_match`, flexible-extract) |
| **GPQA (Generative n-shot)** | `26.49%` | Averaged over GPQA-Diamond, GPQA-Extended, GPQA-Main (`exact_match`, flexible-extract) |
| **IFEval Overall** | `43.16%` | Averaged across prompt-level strict, prompt-level loose, inst-level strict, inst-level loose |
| **IFEval (Prompt-level Strict)** | `36.23%` | Prompt-level strict accuracy |
| **IFEval (Prompt-level Loose)** | `38.45%` | Prompt-level loose accuracy |
| **IFEval (Inst-level Strict)** | `47.84%` | Inst-level strict accuracy |
| **IFEval (Inst-level Loose)** | `50.12%` | Inst-level loose accuracy |
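
The aggregate rows are averages of per-task scores; the **IFEval Overall** figure, for instance, is the plain mean of the four prompt/inst-level numbers in the table and is easy to verify:

```python
# Verify the IFEval Overall row: it is the plain mean of the four
# sub-metrics reported in the table above.
ifeval = {
    "prompt_level_strict": 36.23,
    "prompt_level_loose": 38.45,
    "inst_level_strict": 47.84,
    "inst_level_loose": 50.12,
}
overall = sum(ifeval.values()) / len(ifeval)
print(f"IFEval Overall: {overall:.2f}%")  # IFEval Overall: 43.16%
```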

---

## ⚙️ Model Configuration

- **Model:** `DeepSeek-LLM-67B-Chat`
- **Parameters:** `67 billion`
- **Quantization:** `8-bit GPTQ`
- **Source:** Hugging Face (`hf`)
- **Precision:** `torch.float16`
- **Hardware:** `NVIDIA A100 80GB PCIe`
- **CUDA Version:** `12.4`
- **PyTorch Version:** `2.6.0+cu124`
- **Batch Size:** `1`
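
For reference, this configuration corresponds to a standard `transformers` load. Below is a minimal sketch with a placeholder repo id, assuming the checkpoint ships with its GPTQ quantization config and the GPTQ runtime dependencies (e.g. `optimum` plus a GPTQ kernel package) are installed:

```python
# Minimal sketch of loading the 8-bit GPTQ checkpoint as configured above.
# The repo id is a placeholder; the GPTQ quantization config stored in the
# repo is applied automatically by transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<your-org>/deepseek-llm-67b-chat-gptq-8bit"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # matches the evaluation precision
    device_map="auto",          # spread layers across available GPUs
)
```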

📌 **Interpretation:**
- The evaluation was performed on a **high-performance GPU (A100 80GB PCIe)**.
- **GPTQ 8-bit quantization** roughly halves the memory footprint relative to the full float16 weights, making the 67B model much easier to fit on a single accelerator.
- A **batch size of 1** was used, which may slow evaluation but keeps memory usage predictable.

---

## 📈 Performance Insights

- All reported metrics are flagged `"higher_is_better"` in the harness output, so **higher accuracy is better** throughout.
- **Quantization Impact:** **8-bit GPTQ quantization** reduces memory usage but may cost a small amount of accuracy relative to the full-precision model.
- **Zero-shot Limitation:** zero-shot scores could likely improve with **few-shot prompting** (providing worked examples before each test question), as sketched below.
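
A few-shot re-run changes only one argument in the harness call. A hypothetical sketch, using the same placeholder repo id as above:

```python
# Hypothetical sketch: re-running ARC-Challenge with 5-shot prompting by
# setting num_fewshot; everything else matches the original configuration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=<your-org>/deepseek-llm-67b-chat-gptq-8bit,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=5,  # prepend 5 worked examples to every prompt
    batch_size=1,
)
print(results["results"]["arc_challenge"])
```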

---

📌 Let us know if you need further analysis or model tuning! 🚀