nielsr (HF Staff) committed on
Commit ffa699f · verified · 1 Parent(s): defea43

Add comprehensive model card for Logics-Parsing


This PR adds a comprehensive model card for the Logics-Parsing model. It includes:

- Linking the model to its paper: [Logics-Parsing: End-to-End LVLM for Document Parsing](https://huggingface.co/papers/2509.19760).
- Retaining the existing `apache-2.0` license.
- Setting `library_name: transformers` to enable the automated "how to use" widget, justified by the model's architecture files (`Qwen2_5_VLForConditionalGeneration`, `Qwen2_5_VLProcessor`, `Qwen2Tokenizer`) and acknowledgments to `Qwen2.5-VL`.
- Specifying `pipeline_tag: image-text-to-text` for better discoverability, as it's a Vision-Language Model for document parsing.
- Adding relevant `tags`: `document-parsing`, `vlm`, `qwen`.
- Including links to the GitHub repository and ModelScope demo.
- Incorporating an introduction, key features, benchmark results, and a shell command snippet for quick inference, all sourced directly from the GitHub README.
- Formatting the citation block for the paper.

Please review and merge if everything looks good.

Files changed (1)
  1. README.md +505 -3
README.md CHANGED
@@ -1,3 +1,505 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ library_name: transformers
+ pipeline_tag: image-text-to-text
+ tags:
+ - document-parsing
+ - vlm
+ - qwen
+ ---
+
+ <div align="center">
+ <img src="https://github.com/alibaba/Logics-Parsing/raw/main/imgs/logo.jpg" width="80%">
+ </div>
+
+ <p align="center">
+ 🤗 <a href="https://huggingface.co/Logics-MLLM/Logics-Parsing">Model</a>&nbsp;&nbsp; | &nbsp;&nbsp;🤖 <a href="https://www.modelscope.cn/studios/Alibaba-DT/Logics-Parsing/summary">Demo</a>&nbsp;&nbsp; | &nbsp;&nbsp;📑 <a href="https://huggingface.co/papers/2509.19760">Technical Report</a>
+ </p>
+
+ # Logics-Parsing: End-to-End LVLM for Document Parsing
+
+ Logics-Parsing is a powerful, end-to-end document parsing model built upon a general Vision-Language Model (VLM) through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). It excels at accurately analyzing and structuring highly complex documents.
+
+ More details can be found in the paper [Logics-Parsing: End-to-End LVLM for Document Parsing](https://huggingface.co/papers/2509.19760).
+
+ Code: [https://github.com/alibaba/Logics-Parsing](https://github.com/alibaba/Logics-Parsing)
+
+ ## Introduction
+ <div align="center">
+ <img src="https://github.com/alibaba/Logics-Parsing/raw/main/imgs/overview.png" alt="LogicsDocBench overview" style="width: 800px; height: 250px;">
+ </div>
+
+ <div align="center">
+ <table style="width: 800px;">
+ <tr>
+ <td align="center">
+ <img src="https://github.com/alibaba/Logics-Parsing/raw/main/imgs/report.gif" alt="research report example">
+ </td>
+ <td align="center">
+ <img src="https://github.com/alibaba/Logics-Parsing/raw/main/imgs/chemistry.gif" alt="chemical structure example">
+ </td>
+ <td align="center">
+ <img src="https://github.com/alibaba/Logics-Parsing/raw/main/imgs/paper.gif" alt="academic paper example">
+ </td>
+ <td align="center">
+ <img src="https://github.com/alibaba/Logics-Parsing/raw/main/imgs/handwritten.gif" alt="handwriting example">
+ </td>
+ </tr>
+ <tr>
+ <td align="center"><b>report</b></td>
+ <td align="center"><b>chemistry</b></td>
+ <td align="center"><b>paper</b></td>
+ <td align="center"><b>handwritten</b></td>
+ </tr>
+ </table>
+ </div>
+
+ ## Key Features
+
+ * **Effortless End-to-End Processing**
+   * Our single-model architecture eliminates the need for complex, multi-stage pipelines. Deployment and inference are straightforward, going directly from a document image to structured output.
+   * It demonstrates exceptional performance on documents with challenging layouts.
+
+ * **Advanced Content Recognition**
+   * It accurately recognizes and structures difficult content, including intricate scientific formulas.
+   * Chemical structures are intelligently identified and can be represented in the standard **SMILES** format.
+
+ * **Rich, Structured HTML Output**
+   * The model generates a clean HTML representation of the document, preserving its logical structure.
+   * Each content block (e.g., paragraph, table, figure, formula) is tagged with its **category**, **bounding box coordinates**, and **OCR text** (a hypothetical post-processing sketch follows this list).
+   * It automatically identifies and filters out irrelevant elements like headers and footers, focusing only on the core content.
+
+ * **State-of-the-Art Performance**
+   * Logics-Parsing achieves the best performance on our in-house benchmark, which is specifically designed to comprehensively evaluate a model’s parsing capability on complex-layout documents and STEM content.
+
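+ The block-level output described above can be handled with ordinary HTML tooling. The following is a minimal, purely illustrative sketch using only the Python standard library; the tag and attribute names (`data-bbox`, `class`) are hypothetical placeholders rather than the model's actual schema, which is defined by `inference.py` and the examples in the GitHub repository.
+
+ ```python
+ from html.parser import HTMLParser
+
+ class BlockCollector(HTMLParser):
+     """Collect elements that carry a (hypothetical) bounding-box attribute."""
+     def __init__(self):
+         super().__init__()
+         self.blocks = []
+
+     def handle_starttag(self, tag, attrs):
+         attrs = dict(attrs)
+         if "data-bbox" in attrs:  # placeholder name for the coordinate attribute
+             self.blocks.append({
+                 "tag": tag,
+                 "category": attrs.get("class"),  # placeholder for the block category
+                 "bbox": attrs["data-bbox"],
+             })
+
+ parser = BlockCollector()
+ with open("output.html", encoding="utf-8") as f:  # an HTML file produced by the model
+     parser.feed(f.read())
+ for block in parser.blocks:
+     print(block)
+ ```
+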
+ ## Benchmark
+
+ Existing document-parsing benchmarks often provide limited coverage of complex layouts and STEM content. To address this, we constructed an in-house benchmark comprising 1,078 page-level images across nine major categories and over twenty sub-categories. Our model achieves the best performance on this benchmark.
+ <div align="center">
+ <img src="https://github.com/alibaba/Logics-Parsing/raw/main/imgs/BenchCls.png">
+ </div>
+ <table>
+ <tr><td rowspan="2">Model Type</td><td rowspan="2">Methods</td><td colspan="2">Overall<sup>Edit</sup> ↓</td><td colspan="2">Text<sup>Edit</sup> ↓</td><td colspan="2">Formula<sup>Edit</sup> ↓</td><td colspan="2">Table<sup>TEDS</sup> ↑</td><td colspan="2">Table<sup>Edit</sup> ↓</td><td colspan="2">ReadOrder<sup>Edit</sup> ↓</td><td rowspan="1">Chemistry<sup>Edit</sup> ↓</td><td rowspan="1">HandWriting<sup>Edit</sup> ↓</td></tr>
+ <tr><td>EN</td><td>ZH</td><td>EN</td><td>ZH</td><td>EN</td><td>ZH</td><td>EN</td><td>ZH</td><td>EN</td><td>ZH</td><td>EN</td><td>ZH</td><td>ALL</td><td>ALL</td></tr>
+ <tr><td rowspan="7">Pipeline Tools</td><td>doc2x</td><td>0.209</td><td>0.188</td><td>0.128</td><td>0.194</td><td>0.377</td><td>0.321</td><td>81.1</td><td>85.3</td><td><ins>0.148</ins></td><td><ins>0.115</ins></td><td>0.146</td><td>0.122</td><td>1.0</td><td>0.307</td></tr>
+ <tr><td>Textin</td><td>0.153</td><td>0.158</td><td>0.132</td><td>0.190</td><td>0.185</td><td>0.223</td><td>76.7</td><td><ins>86.3</ins></td><td>0.176</td><td><b>0.113</b></td><td><b>0.118</b></td><td><b>0.104</b></td><td>1.0</td><td>0.344</td></tr>
+ <tr><td>mathpix<sup>*</sup></td><td><ins>0.128</ins></td><td><ins>0.146</ins></td><td>0.128</td><td><ins>0.152</ins></td><td><b>0.06</b></td><td><b>0.142</b></td><td><b>86.2</b></td><td><b>86.6</b></td><td><b>0.120</b></td><td>0.127</td><td>0.204</td><td>0.164</td><td>0.552</td><td>0.263</td></tr>
+ <tr><td>PP_StructureV3</td><td>0.220</td><td>0.226</td><td>0.172</td><td>0.29</td><td>0.272</td><td>0.276</td><td>66</td><td>71.5</td><td>0.237</td><td>0.193</td><td>0.201</td><td>0.143</td><td>1.0</td><td>0.382</td></tr>
+ <tr><td>Mineru2</td><td>0.212</td><td>0.245</td><td>0.134</td><td>0.195</td><td>0.280</td><td>0.407</td><td>67.5</td><td>71.8</td><td>0.228</td><td>0.203</td><td>0.205</td><td>0.177</td><td>1.0</td><td>0.387</td></tr>
+ <tr><td>Marker</td><td>0.324</td><td>0.409</td><td>0.188</td><td>0.289</td><td>0.285</td><td>0.383</td><td>65.5</td><td>50.4</td><td>0.593</td><td>0.702</td><td>0.23</td><td>0.262</td><td>1.0</td><td>0.50</td></tr>
+ <tr><td>Pix2text</td><td>0.447</td><td>0.547</td><td>0.485</td><td>0.577</td><td>0.312</td><td>0.465</td><td>64.7</td><td>63.0</td><td>0.566</td><td>0.613</td><td>0.424</td><td>0.534</td><td>1.0</td><td>0.95</td></tr>
+ <tr><td rowspan="8">Expert VLMs</td><td>Dolphin</td><td>0.208</td><td>0.256</td><td>0.149</td><td>0.189</td><td>0.334</td><td>0.346</td><td>72.9</td><td>60.1</td><td>0.192</td><td>0.35</td><td>0.160</td><td>0.139</td><td>0.984</td><td>0.433</td></tr>
+ <tr><td>dots.ocr</td><td>0.186</td><td>0.198</td><td><ins>0.115</ins></td><td>0.169</td><td>0.291</td><td>0.358</td><td>79.5</td><td>82.5</td><td>0.172</td><td>0.141</td><td>0.165</td><td>0.123</td><td>1.0</td><td><ins>0.255</ins></td></tr>
+ <tr><td>MonkeyOcr</td><td>0.193</td><td>0.259</td><td>0.127</td><td>0.236</td><td>0.262</td><td>0.325</td><td>78.4</td><td>74.7</td><td>0.186</td><td>0.294</td><td>0.197</td><td>0.180</td><td>1.0</td><td>0.623</td></tr>
+ <tr><td>OCRFlux</td><td>0.252</td><td>0.254</td><td>0.134</td><td>0.195</td><td>0.326</td><td>0.405</td><td>58.3</td><td>70.2</td><td>0.358</td><td>0.260</td><td>0.191</td><td>0.156</td><td>1.0</td><td>0.284</td></tr>
+ <tr><td>Gotocr</td><td>0.247</td><td>0.249</td><td>0.181</td><td>0.213</td><td>0.231</td><td>0.318</td><td>59.5</td><td>74.7</td><td>0.38</td><td>0.299</td><td>0.195</td><td>0.164</td><td>0.969</td><td>0.446</td></tr>
+ <tr><td>Olmocr</td><td>0.341</td><td>0.382</td><td>0.125</td><td>0.205</td><td>0.719</td><td>0.766</td><td>57.1</td><td>56.6</td><td>0.327</td><td>0.389</td><td>0.191</td><td>0.169</td><td>1.0</td><td>0.294</td></tr>
+ <tr><td>SmolDocling</td><td>0.657</td><td>0.895</td><td>0.486</td><td>0.932</td><td>0.859</td><td>0.972</td><td>18.5</td><td>1.5</td><td>0.86</td><td>0.98</td><td>0.413</td><td>0.695</td><td>1.0</td><td>0.927</td></tr>
+ <tr><td><b>Logics-Parsing</b></td><td><b>0.124</b></td><td><b>0.145</b></td><td><b>0.089</b></td><td><b>0.139</b></td><td><ins>0.106</ins></td><td><ins>0.165</ins></td><td>76.6</td><td>79.5</td><td>0.165</td><td>0.166</td><td><ins>0.136</ins></td><td><ins>0.113</ins></td><td><b>0.519</b></td><td><b>0.252</b></td></tr>
+ <tr><td rowspan="5">General VLMs</td><td>Qwen2VL-72B</td><td>0.298</td><td>0.342</td><td>0.142</td><td>0.244</td><td>0.431</td><td>0.363</td><td>64.2</td><td>55.5</td><td>0.425</td><td>0.581</td><td>0.193</td><td>0.182</td><td>0.792</td><td>0.359</td></tr>
+ <tr><td>Qwen2.5VL-72B</td><td>0.233</td><td>0.263</td><td>0.162</td><td>0.24</td><td>0.251</td><td>0.257</td><td>69.6</td><td>67</td><td>0.313</td><td>0.353</td><td>0.205</td><td>0.204</td><td>0.597</td><td>0.349</td></tr>
+ <tr><td>Doubao-1.6</td><td>0.188</td><td>0.248</td><td>0.129</td><td>0.219</td><td>0.273</td><td>0.336</td><td>74.9</td><td>69.7</td><td>0.180</td><td>0.288</td><td>0.171</td><td>0.148</td><td>0.601</td><td>0.317</td></tr>
+ <tr><td>GPT-5</td><td>0.242</td><td>0.373</td><td>0.119</td><td>0.36</td><td>0.398</td><td>0.456</td><td>67.9</td><td>55.8</td><td>0.26</td><td>0.397</td><td>0.191</td><td>0.28</td><td>0.88</td><td>0.46</td></tr>
+ <tr><td>Gemini2.5 pro</td><td>0.185</td><td>0.20</td><td><ins>0.115</ins></td><td>0.155</td><td>0.288</td><td>0.326</td><td><ins>82.6</ins></td><td>80.3</td><td>0.154</td><td>0.182</td><td>0.181</td><td>0.136</td><td><ins>0.535</ins></td><td>0.26</td></tr>
+ </table>
+ <!-- footnote -->
+ <p><sup>*</sup> Tested on the v3/PDF Conversion API (August 2025 deployment).</p>
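+
+ A note on the metrics above: the Edit columns report a normalized edit distance between predicted and reference content (lower is better), and TEDS is a tree-edit-distance-based similarity for table structure (higher is better). The snippet below is only an illustrative sketch of a normalized edit distance; the benchmark's exact preprocessing and normalization may differ.
+
+ ```python
+ def normalized_edit_distance(pred: str, gt: str) -> float:
+     """Levenshtein distance divided by the longer string's length (0 = identical)."""
+     m, n = len(pred), len(gt)
+     if max(m, n) == 0:
+         return 0.0
+     dp = list(range(n + 1))  # one row of the classic DP table
+     for i in range(1, m + 1):
+         prev, dp[0] = dp[0], i
+         for j in range(1, n + 1):
+             cur = dp[j]
+             dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (pred[i - 1] != gt[j - 1]))
+             prev = cur
+     return dp[n] / max(m, n)
+
+ print(normalized_edit_distance("E = mc^2", "E=mc^2"))  # small distance for near-identical strings
+ ```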
+
+ ## Quick Start
+ ### 1. Installation
+ ```shell
+ conda create -n logics-parsing python=3.10
+ conda activate logics-parsing
+
+ pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
+ ```
+ ### 2. Download Model Weights
+
+ ```shell
+ # Download our model from ModelScope.
+ pip install modelscope
+ python download_model.py -t modelscope
+
+ # Download our model from Hugging Face.
+ pip install huggingface_hub
+ python download_model.py -t huggingface
+ ```
+
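+ As an alternative to the `download_model.py` helper, the weights can also be fetched programmatically with `huggingface_hub`; a minimal sketch (the local directory below is just an example):
+
+ ```python
+ from huggingface_hub import snapshot_download
+
+ # Download the full model snapshot from the Hugging Face Hub.
+ local_dir = snapshot_download(
+     repo_id="Logics-MLLM/Logics-Parsing",
+     local_dir="./Logics-Parsing",  # example path; any writable directory works
+ )
+ print(f"Model files downloaded to: {local_dir}")
+ ```
+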
+ ### 3. Inference
+ ```shell
+ python3 inference.py --image_path PATH_TO_INPUT_IMG --output_path PATH_TO_OUTPUT --model_path PATH_TO_MODEL
+ ```
+
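+ Because the card sets `library_name: transformers` and the checkpoint follows the Qwen2.5-VL architecture, it should also be loadable directly with 🤗 Transformers. The snippet below is an unofficial, minimal sketch: the parsing prompt is a placeholder, and the authoritative prompt and post-processing live in the repository's `inference.py`.
+
+ ```python
+ import torch
+ from PIL import Image
+ from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration  # needs a recent transformers with Qwen2.5-VL support
+
+ model_id = "Logics-MLLM/Logics-Parsing"
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     model_id, torch_dtype=torch.bfloat16, device_map="auto"  # device_map requires accelerate
+ )
+ processor = AutoProcessor.from_pretrained(model_id)
+
+ image = Image.open("page.png")  # a document page image
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image"},
+             {"type": "text", "text": "Parse this document page into structured HTML."},  # placeholder prompt; see inference.py for the prompt actually used
+         ],
+     }
+ ]
+ prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
+
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=2048)
+ generated = output_ids[:, inputs["input_ids"].shape[1]:]
+ print(processor.batch_decode(generated, skip_special_tokens=True)[0])
+ ```
+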
+ ## Acknowledgments
+ We would like to acknowledge the following open-source projects that provided inspiration and reference for this work:
+ - [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL)
+ - [OmniDocBench](https://github.com/opendatalab/OmniDocBench)
+ - [Mathpix](https://mathpix.com/)
+
+ ## Citation
+ If you find our work helpful or inspiring, please feel free to cite it.
+ ```bibtex
+ @article{wu2024logics-parsing,
+   title={Logics-Parsing: End-to-End LVLM for Document Parsing},
+   author={Junyu Luo and Xiao Luo and Xiusi Chen and Zhiping Xiao and Wei Ju and Ming Zhang},
+   journal={arXiv preprint arXiv:2509.19760},
+   year={2025},
+ }
+ ```