Blanca committed (verified)
Commit 2a3a057 · Parent(s): 0061f92

Update content.py

Files changed (1): content.py (+18, -17)
content.py CHANGED
@@ -1,34 +1,37 @@
- TITLE = """<h1 align="center" id="space-title">Critical Questions Leaderboard</h1>"""
+ TITLE = """<h1 align="center" id="space-title">Critical Questions Generation Leaderboard</h1>"""

  INTRODUCTION_TEXT = """
- The Critical Questions Leaderboard is a benchmark which aims at evaluating the capacity of language technology systems to generate critical questions. (See our [paper](https://arxiv.org/abs/2505.11341) for more details.)
+ <p style='font-size:20px;'>Critical Questions Generation is the task of automatically generating questions that can unmask the assumptions held by the premises of an argumentative text.

- The task of Critical Questions Generation consists of generating useful critical questions when given an argumentative text. For this purpose, a dataset of real debate interventions with associated critical questions has been released.
+ This leaderboard aims at benchmarking the capacity of language technology systems to create Critical Questions (CQs), that is, the questions that should be asked in order to judge whether an argument is acceptable or fallacious.

- Critical Questions are the set of inquiries that should be asked in order to judge if an argument is acceptable or fallacious. Therefore, these questions are designed to unmask the assumptions held by the premises of the argument and attack its inference.
+ The task consists of generating 3 Useful Critical Questions per argumentative text.

- In the dataset, the argumentative texts are interventions of real debates, which have been annotated with Argumentation Schemes and later associated with a set of critical questions. For every intervention, the speaker, the set of Argumentation Schemes, and the critical questions are provided. These questions have been annotated according to their usefulness for challenging the arguments in each text. The labels are either Useful, Unhelpful, or Invalid. The goal of the task is to generate 3 critical questions that are Useful.
+ All details on the task, the dataset, and the evaluation can be found in the paper [Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models](https://arxiv.org/abs/2505.11341).
+ </p>"""

- Each of these 3 critical questions will be evaluated separately and then the scores will be aggregated.
+ DATA_TEXT = """
+ ## Data

- ## Data
- The Critical Questions Dataset is made of 220 interventions associated with ~5k gold standard questions. These questions are in turn annotated as Useful, Unhelpful or Invalid, and serve as a reference for the evaluation model.
+ <p style='font-size:20px;'> The [CQs-Gen dataset](https://huggingface.co/datasets/HiTZ/CQs-Gen) gathers 220 interventions from real debates and contains:

- The data can be found in [this dataset](https://huggingface.co/datasets/Blanca/CQs-Gen). The test set is contained in `test.jsonl` and contains 34 of the interventions, the validation set contains the remaining 186, and the reference questions of this set are public.
+ - `validation`: contains 186 interventions and can be used for training or validation, as it includes ~25 reference questions per intervention, already evaluated according to their usefulness (either Useful, Unhelpful or Invalid).
+ - `test`: contains 34 interventions. The reference questions of this set (~70) are kept private to avoid data contamination. The questions generated for the test set are what should be submitted to this leaderboard.
+ </p>
+
+ ## Evaluation
+ <p style='font-size:20px;'> Evaluation is done by comparing each of the 3 newly generated questions to the reference questions of the test set using Semantic Text Similarity, and inheriting the label of the most similar reference, given a threshold of 0.65. Questions for which no similar enough reference is found are considered Invalid. See the evaluation function [here](https://huggingface.co/spaces/HiTZ/Critical_Questions_Leaderboard/blob/main/app.py#L141), or find more details in the [paper](https://arxiv.org/abs/2505.11341). </p>

  ## Leaderboard
- Submissions made by our team are labelled as "CQs-Gen authors".

- See below for submissions.
  """

  SUBMISSION_TEXT = """
  ## Submissions
- Results can be submitted for the test set only. Scores are expressed as the percentage of correct answers for a given split.
+ <p style='font-size:20px;'> Results can be submitted for the test set only.

- Evaluation is done by comparing the newly generated question to the reference questions using Semantic Text Similarity, and inheriting the label of the most similar reference. Questions where no reference is found are considered Invalid. See the evaluation function [here](https://huggingface.co/spaces/HiTZ/Critical_Questions_Leaderboard/blob/main/app.py#L141), or find more details in the [paper](https://arxiv.org/abs/2505.11341).
+ We expect submissions to be JSON files with the following format: </p>

- We expect submissions to be JSON files with the following format.
  ```json
  {
  "CLINTON_1_1": {
@@ -65,7 +68,7 @@ CITATION_BUTTON_TEXT = r"""@misc{figueras2025benchmarkingcriticalquestionsgenera
  title={Benchmarking Critical Questions Generation: A Challenging Reasoning Task for Large Language Models},
  author={Calvo Figueras, Blanca and Rodrigo Agerri},
  year={2025},
- booktitle={2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)},
+ booktitle={2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025)},
  organization={Association for Computational Linguistics (ACL)},
  url={https://arxiv.org/abs/2505.11341},
  }"""
@@ -82,5 +85,3 @@ def format_log(msg):

  def model_hyperlink(link, model_name):
  return f'<a target="_blank" href="{link}" style="color: var(--link-text-color); text-decoration: underline;text-decoration-style: dotted;">{model_name}</a>'
-
-
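
For context, the updated `DATA_TEXT` points to the [HiTZ/CQs-Gen](https://huggingface.co/datasets/HiTZ/CQs-Gen) dataset with a `validation` split (186 interventions, ~25 evaluated reference questions each) and a `test` split (34 interventions, private references). A minimal sketch of loading it with the Hugging Face `datasets` library follows; it assumes the default dataset configuration loads directly and does not rely on any particular field names, since those are not spelled out in the text above.

```python
# Minimal sketch (assumption: the default config of HiTZ/CQs-Gen loads directly).
from datasets import load_dataset

ds = load_dataset("HiTZ/CQs-Gen")

# Per the description above: 186 validation interventions, 34 test interventions.
print({split: len(ds[split]) for split in ds})

# Inspect one intervention and its reference questions (field names not specified here).
print(ds["validation"][0])
```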
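The new `## Evaluation` text describes label inheritance via Semantic Text Similarity with a 0.65 threshold. The snippet below is an illustrative approximation of that logic, not the actual function linked at `app.py#L141`; the sentence-embedding model and the layout of the reference questions are assumptions made for this example.

```python
# Illustrative sketch of the label-inheritance evaluation described above.
# NOT the exact function from app.py#L141: the embedding model and the
# structure of `references` are assumptions for this example.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
SIMILARITY_THRESHOLD = 0.65  # threshold stated in the leaderboard text


def label_generated_question(generated: str, references: list[dict]) -> str:
    """Return the label of the most similar reference question,
    or "Invalid" if no reference reaches the similarity threshold."""
    gen_emb = model.encode(generated, convert_to_tensor=True)
    ref_embs = model.encode([r["question"] for r in references], convert_to_tensor=True)
    sims = util.cos_sim(gen_emb, ref_embs)[0]
    best = int(sims.argmax())
    if float(sims[best]) >= SIMILARITY_THRESHOLD:
        return references[best]["label"]  # "Useful", "Unhelpful" or "Invalid"
    return "Invalid"
```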