Challenge Description:

Tens of thousands of research papers on the SARS-CoV-2 virus and the COVID-19 illness have flooded journals and preprint servers, with many more added every day. This publication rate is leaving researchers unable to keep up with new findings and insights, even as they rely on these to inform their own work in understanding and combating the disease. In particular, research groups investigating novel drug candidates with in-silico drug discovery efforts are in essence searching for a proverbial needle in a haystack, and so rely on scientific insights from the literature to guide their efforts toward the most fruitful directions. Thus, they are left to navigate the current publication deluge for their primary guidance. In order to keep pace with the vast amount of relevant literature that continues to grow, researchers need more capable tools to locate and filter the information they seek in these publications. To meet this pressing need, it is desirable to develop and apply the latest Natural Language Processing (NLP) techniques for efficiently locating and filtering relevant information from a growing dataset of COVID-related literature, to help accelerate drug discovery research. The proposed NLP challenge task is framed towards designing and building a Question-Answering (QA) system to find answers to COVID-related questions in the scientific literature.


A practical use of NLP models on COVID-related papers is automated information
extraction from the literature to facilitate drug discovery efforts. One of the crucial elements that
can inform these efforts is knowledge about viral proteins. The goal of this data challenge is to
build an NLP model that identifies answers to protein-related questions in scientific papers.


Researchers at the Brookhaven National Laboratory (BNL), Computational Science Initiative (CSI),
and the Oak Ridge National Laboratory (ORNL), Biophysics group, have collected a question-answer dataset annotated by a biomedical domain expert to evaluate the system on specialized
information acquisition for protein-related questions. Due to the limited size of the collected QA
data for training a full-fledged NLP model, we recommend that participants leverage external
resources to pre-train and fine-tune their models. The most significant of these is the COVID-19
Open Research Dataset (CORD-19) [1], provided by AllenAI’s Semantic Scholar team. Other
potential resources include publicly-available QA datasets and Natural Language Inference (NLI)
datasets (NLI tasks determine semantic relations between two sentences, a premise and a
hypothesis).

Question answering (QA) datasets:

  1. The Stanford Question Answering Dataset (SQuAD) [2]: a large collection of question-answer
    pairs created by crowd workers on Wikipedia articles. SQuAD 2.0 contains more than 150k
    question-answer pairs.
  2. COVID-QA [3]: a SQuAD-like dataset consisting of 2,019 COVID-related questions and
    answers, intended for building a COVID-specific QA system.
  3. SearchQA [4]: more than 140k general question-answer pairs from the popular television
    show Jeopardy.
  4. BioASQ [5]: domain-specific data consisting of 1,504 question-answer pairs created by
    biomedical domain experts.

Natural Language Inference (NLI) datasets:

  1. Stanford NLI (SNLI) [6]
  2. Multi-Genre NLI (MultiNLI) [7]
  3. Medical NLI (MedNLI) [8]
  4. Science text entailment (SciTail) [9]

The BNL-curated QA validation/test datasets that will be provided to the participants will include
113 question/answer pairs for the following four questions:

  1. What are the oligomeric states of coronavirus structural proteins?
  2. What are the oligomeric states of non-structural coronavirus proteins?
  3. What are the catalytic domains (active sites) of coronavirus proteins?
  4. Are there antivirals that target structural viral proteins?

The answers are sentences from COVID-related papers, each labeled with one of three
categories (relevant, partially relevant, or irrelevant). The QA datasets listed above may not
be directly applicable to the BNL-curated datasets because their formats differ. The QA datasets
above pair questions with passages (contexts), and the task is to find an answer
(usually very short, e.g., a single word) within the passage. The BNL-curated
datasets, by contrast, pair questions with sentences, and the task is to determine how
relevant each sentence is to its question. The QA datasets can still be used to impart general
QA knowledge to a model, and the NLI datasets are useful for analyzing semantic relations
between sentences. The list is not exhaustive; participants may use any other resources for
model training.
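
One possible way to bridge the format gap described above is to turn span-style QA examples into sentence-level pairs. The sketch below is only an illustration under stated assumptions: the class names, the field names, and the naive regex sentence splitter are hypothetical and are not part of the BNL-curated data or of any listed dataset's tooling. It marks the sentence that contains the answer span as "relevant" and every other sentence as "irrelevant"; it has no way to produce the "partially relevant" label, so that class would still have to come from the curated validation data.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class SpanExample:
    """SQuAD-style example: the answer is a short span inside a passage."""
    question: str
    context: str
    answer_text: str
    answer_start: int  # character offset of the answer within `context`


@dataclass
class SentencePair:
    """Sentence-level example: a question, one sentence, and a relevance label."""
    question: str
    sentence: str
    label: str


def to_sentence_pairs(example: SpanExample) -> List[SentencePair]:
    """Label the sentence holding the answer span 'relevant', all others 'irrelevant'."""
    pairs = []
    # Naive splitter (assumption): a sentence ends at '.', '!' or '?'.
    # Abbreviations such as "e.g." will be split incorrectly.
    for match in re.finditer(r"[^.!?]+[.!?]*\s*", example.context):
        sentence = match.group().strip()
        if not sentence:
            continue
        contains_answer = match.start() <= example.answer_start < match.end()
        label = "relevant" if contains_answer else "irrelevant"
        pairs.append(SentencePair(example.question, sentence, label))
    return pairs


if __name__ == "__main__":
    context = "The spike protein forms a trimer. It binds ACE2. The genome is RNA."
    example = SpanExample(
        question="What is the oligomeric state of the spike protein?",
        context=context,
        answer_text="trimer",
        answer_start=context.index("trimer"),
    )
    for pair in to_sentence_pairs(example):
        print(pair.label, "|", pair.sentence)
```

A real pipeline would likely replace the regex with a proper sentence segmenter, since scientific text is full of abbreviations and decimal numbers.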

Test references: The list of DOIs of the articles used to generate the QA dataset, as well as
pre-processed versions of these articles (in a JSON format that is more structured and
convenient to use than the raw PDF documents), will be provided.

QA data: Question-answer pairs will be provided for validating the developed NLP model
(by the participant) and for assessing its performance (by the organizer).

  1. Validation set: the validation set consists of 54 queries and 54 sentences identified from
    the test references mentioned above that may provide answers to the queries. Labels
    (relevant, partially relevant, irrelevant) are provided for all 54 pairs. This validation set
    can be used by the participant to validate and optimize the performance of their NLP
    model during development.
  2. Test set: the test set consists of 59 queries and 59 sentences identified from the test
    references that may (or may not) provide answers to the queries. Labels are not provided
    for this test set, and the participants are expected to predict the labels for all query-sentence pairs.
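
As a minimal sketch of how the labeled validation pairs might be used during development, the code below scores predicted labels against gold labels. The choice of accuracy and macro-averaged F1 is an assumption made for illustration; this description does not state which metric the organizers use for label prediction. The function names and the example labels are hypothetical.

```python
from typing import Sequence

LABELS = ("relevant", "partially relevant", "irrelevant")


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of pairs whose predicted label matches the gold label."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def macro_f1(gold: Sequence[str], predicted: Sequence[str],
             labels: Sequence[str] = LABELS) -> float:
    """Unweighted mean of per-label F1, so rare labels count as much as common ones."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, predicted))
        fp = sum(g != label and p == label for g, p in zip(gold, predicted))
        fn = sum(g == label and p != label for g, p in zip(gold, predicted))
        denom = 2 * tp + fp + fn
        # A label that never occurs in gold or predictions contributes 0.0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    gold = ["relevant", "relevant", "partially relevant", "irrelevant"]
    pred = ["relevant", "irrelevant", "partially relevant", "irrelevant"]
    print(f"accuracy = {accuracy(gold, pred):.3f}")
    print(f"macro F1 = {macro_f1(gold, pred):.3f}")
```

With only 54 validation pairs, a single misclassified pair moves accuracy by almost two percentage points, so small differences between model variants should be read with caution.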

The test references and the BNL-curated QA datasets can be obtained from the following GitHub

Evaluation criteria:

Evaluation criteria for this task fall into two parts: (1) language modeling of the CORD-19 dataset,
and (2) QA performance on the curated dataset.

First part (language modeling): due to the small dataset size, effective modeling of the language
found in COVID-related scientific articles is essential to performing well on the QA task.
Participants may choose to use a pretrained base language model (e.g., BERT [10] or BioBERT
[11]), in which case they will receive a baseline score on the language modeling task. However,
they may instead choose to domain-tune their language model of choice on the CORD-19 dataset,
which, if done well, will give them an edge both by increasing their language modeling score and
potentially by improving the quality of their QA model. Language modeling quality will be
determined using the perplexity metric on the “Test references” described above. All participants
should report the perplexity metric measured on each of the test references with a detailed
description of the evaluation procedure.
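
For reference, perplexity is commonly computed as the exponential of the negative mean log-probability that the model assigns to each token. The sketch below shows only that arithmetic, reported per document as requested above. It assumes the per-token natural-log probabilities have already been obtained from the language model; how they are obtained is outside this sketch and should be part of the reported evaluation procedure. For a masked model such as BERT, one common approach is to mask each token in turn and score it, which yields what is often called a pseudo-perplexity. The function names and the document identifiers are hypothetical.

```python
import math
from typing import Dict, Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity = exp(-(1/N) * sum of natural-log token probabilities)."""
    if not token_log_probs:
        raise ValueError("at least one token log-probability is required")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def per_document_perplexity(
    log_probs_by_doc: Dict[str, Sequence[float]]
) -> Dict[str, float]:
    """Compute one perplexity value per test reference."""
    return {doc_id: perplexity(lps) for doc_id, lps in log_probs_by_doc.items()}


if __name__ == "__main__":
    # Hypothetical log-probabilities; real values would come from the model.
    example = {
        "doc-A": [math.log(0.25)] * 8,        # every token predicted with p = 0.25
        "doc-B": [math.log(0.5), math.log(0.125)],
    }
    for doc_id, ppl in per_document_perplexity(example).items():
        print(doc_id, round(ppl, 3))
```

Perplexity values are comparable only between models that share a tokenizer and scoring procedure, which is one reason the detailed description of the procedure matters.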

Second part (QA performance): an ideal QA system first ranks and retrieves articles related to a
given question from the literature database, and then extracts answers from the selected articles.
To evaluate the participant's system in this test environment, the participant may build an
end-to-end model (document retrieval and answer extraction) and check whether the list of answers
includes the answers in the test sets. However, because the test data was not generated from the
entire collection of COVID-19 articles, the articles retrieved by the participant's QA system
may not contain the test articles. We therefore ask the participants to perform the following
two tasks:

  1. Relevant sentence prediction: from the Test references, identify the top 3 sentences that
    may contain answers to the queries. Provide the top 3 sentences for each of the 59 queries in
    the test set (and indicate the article from which each sentence was extracted).
  2. QA pair label prediction: for each of the 59 question-sentence pairs in the test set, label
    each pair as “relevant”, “partially relevant”, or “irrelevant”. For this task, the participant
    will need to build a classifier that identifies whether a given sentence is relevant to a given
    query or not.
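
As a point of comparison for the first task above, a purely lexical baseline can rank candidate sentences by TF-IDF cosine similarity to the query and return the top 3. The sketch below is a hand-rolled, standard-library-only illustration, not the organizers' method; the tokenizer, the smoothed IDF formula, and the example sentences are all assumptions. A competitive system would more likely score pairs with a fine-tuned language model, and the same scores could feed the label classifier in the second task.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    """Lowercase and keep alphanumeric runs (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_k_sentences(query: str, sentences: List[str], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k sentences most similar to the query, with their cosine scores."""
    docs = [Counter(tokenize(s)) for s in sentences]
    n = len(docs)
    # Document frequency: iterating a Counter yields each distinct term once.
    df = Counter(term for doc in docs for term in doc)

    def idf(term: str) -> float:
        # Smoothed IDF so that unseen query terms do not divide by zero.
        return math.log((1 + n) / (1 + df[term])) + 1.0

    def vector(counts: Counter) -> Dict[str, float]:
        return {term: count * idf(term) for term, count in counts.items()}

    def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    q_vec = vector(Counter(tokenize(query)))
    scored = [(cosine(q_vec, vector(doc)), i) for i, doc in enumerate(docs)]
    scored.sort(key=lambda item: (-item[0], item[1]))  # best score first, stable on ties
    return [(sentences[i], score) for score, i in scored[:k]]


if __name__ == "__main__":
    candidates = [
        "The spike protein forms a homotrimer on the virion surface.",
        "Patients were enrolled at three hospitals.",
        "The nucleocapsid protein forms dimers in solution.",
        "Funding was provided by the agency.",
    ]
    query = "What is the oligomeric state of the spike protein?"
    for sentence, score in top_k_sentences(query, candidates):
        print(f"{score:.3f}  {sentence}")
```

Lexical overlap misses paraphrases such as "trimer" versus "oligomeric state", which is exactly the gap that domain-tuned language models are expected to close.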

For participants who intend to build an “end-to-end model”, the TREC (Text REtrieval Conference)
workshops can provide useful resources for finding the most relevant documents to a given query.
TREC aims to support research in information retrieval and provide materials for large-scale
evaluation of text retrieval methodologies. Recently, researchers, clinicians, and policy makers
have hosted a TREC workshop for COVID-19. Please refer to TREC-COVID [9].



[1] Wang, L. L. et al. (2020). Cord-19: The covid-19 open research dataset. arXiv preprint
[2] Rajpurkar, P. et al. (2016). Squad: 100,000+ questions for machine comprehension of text.
arXiv preprint arXiv:1606.05250.
[3] Möller, T. et al. (2020). Covid-qa: A question & answering dataset for covid-19.
[4] Dunn, M. et al. (2017). Searchqa: A new q&a dataset augmented with context from a search
engine. arXiv preprint arXiv:1704.05179.
[5] Tsatsaronis, G. et al. (2012). Bioasq: A challenge on large-scale biomedical semantic
indexing and question answering. In AAAI fall symposium: Information retrieval and
knowledge discovery in biomedical text. Citeseer.
[6] Bowman, S. R. et al. (2015). A large annotated corpus for learning natural language
inference. arXiv preprint arXiv:1508.05326.
[7] Williams, A. et al. (2017). A broad-coverage challenge corpus for sentence understanding
through inference. arXiv preprint arXiv:1704.05426.
[8] Shivade, C. (2019). Mednli – a natural language inference dataset for the clinical domain
(version 1.0.0). PhysioNet.
[9] Khot, T. et al. (2018). Scitail: A textual entailment dataset from science question answering.
In AAAI, volume 17, pages 41–42.
[10] Devlin, J., et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language
Understanding.” Proceedings of NAACL-HLT. 2019.
[11] Lee, J., et al. “BioBERT: a pre-trained biomedical language representation model for
biomedical text mining.” Bioinformatics 36.4 (2020): 1234-1240.

Other relevant resources:

[1] Soto, C., Park, G., Chen, Y.C., Sedova, A., Pouchard, L. and Yoo, S., 2020. Applying Natural
Language Processing (NLP) techniques on the scientific literature to accelerate drug
discovery for COVID-19. F1000Research, 9.
[2] Acharya, Atanu, Rupesh Agarwal, Matthew B. Baker, Jerome Baudry, Debsindhu Bhowmik,
Swen Boehm, Kendall G. Byler et al. “Supercomputer-based ensemble docking drug
discovery pipeline with application to COVID-19.” Journal of chemical information and
modeling 60, no. 12 (2020): 5832-5852.