ChEMU - Cheminformatics Elsevier Melbourne University lab

The ChEMU lab series provides a unique opportunity for the development of information extraction tools over chemical patents. As the first running of the ChEMU lab, ChEMU 2020 focuses on extracting chemical reactions from chemical patents. It provides two key tasks to achieve this goal: named entity recognition (NER) and event extraction (EE).

Overview of ChEMU 2020

Brought to you by the University of Melbourne natural language processing group in the School of Computing and Information Systems, the Elsevier Content Transformations, Life Science team, and RMIT University, the ChEMU lab series provides an opportunity for the development of information extraction models over chemical patents.

ChEMU2020, part of the Conference and Labs of the Evaluation Forum 2020 (CLEF 2020), is the first running of our ChEMU lab series. In ChEMU2020, we invited teams from academic and industrial communities to work on the task of information extraction over chemical reactions from chemical patents. The challenge consisted of two sub-tasks. The first sub-task, Named Entity Recognition (NER), focuses on identifying chemical compounds as well as their types in context, i.e., assigning the label of a chemical compound according to the role which the compound plays within a chemical reaction. The second sub-task, Event Extraction (EE), involves identification of chemical reactions, represented in terms of relations between compounds and reaction steps. Teams were invited to build models for the NER and EE sub-tasks, and also to build end-to-end systems that address both sub-tasks simultaneously, reflecting a complete system that supports extraction of reaction events from chemical patents.

The shared task was held from 10 April 2020 to 4 June 2020. It attracted 36 registrants from 13 countries, including Portugal, Switzerland, Germany, India, Japan, the United States, China, and the United Kingdom. We received 26 submissions from 11 teams for the NER sub-task, 10 submissions from 5 teams for the EE sub-task, and 10 submissions from 4 teams for end-to-end systems. Submissions from a team from the company Melax Technologies (Houston, TX, USA) ranked first in all three tracks.

ChEMU 2021

ChEMU 2021 will be held as part of CLEF again in 2021. Two new tasks are defined, focusing on different problems in information extraction over chemical patents.

Task 1 - chemical reaction reference resolution: Given a chemical reaction snippet, the task aims to find similar chemical reactions and general conditions that it refers to.

Task 2 - anaphora resolution: The task requires identification of references between expressions in chemical patents.

Access to ChEMU corpus

As of September 2020, the data is also available for use outside of the scope of the CLEF2020 ChEMU lab. Researchers interested in accessing the data may register (click on “Sign Up” above) to accept the data license agreement and download the data. Please note that the gold standard annotations over the test set will remain unavailable until the end of September 2021; in the meantime results over the test data can be submitted to this website for automatic evaluation. The results will be displayed in the Leaderboard. If you utilise this data, please cite:

@incollection{he2020overview,
    author = {He, Jiayuan and Nguyen, Dat Quoc and Akhondi, Saber A. and Druckenbrodt, Christian and Thorne, Camilo and Hoessel, Ralph and Afzal, Zubair and Zhai, Zenan and Fang, Biaoyan and Yoshikawa, Hiyori and Albahem, Ameer and Cavedon, Lawrence and Cohn, Trevor and Baldwin, Timothy and Verspoor, Karin},
    title = {Overview of ChEMU 2020: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents},
    booktitle = {Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Eleventh International Conference of the CLEF Association (CLEF 2020)},
    publisher = {Lecture Notes in Computer Science},
    year = 2020,
    volume = 12260,
    doi = "https://doi.org/10.1007/978-3-030-58219-7_18"
}
@incollection{he2020extended,
    author = {He, Jiayuan and Nguyen, Dat Quoc and Akhondi, Saber A. and Druckenbrodt, Christian and Thorne, Camilo and Hoessel, Ralph and Afzal, Zubair and Zhai, Zenan and Fang, Biaoyan and Yoshikawa, Hiyori and Albahem, Ameer and Wang, Jingqi and Ren, Yuankai and Zhang, Zhi and Zhang, Yaoyun
              and Dao, Mai Hoang and Ruas, Pedro and Lamurias, Andre and Couto, Francisco M. and Copara, Jenny and Naderi, Nona and Knafou, Julien and Ruch, Patrick and Teodoro, Douglas and Lowe, Daniel and Mayfield, John and Köksal, Abdullatif and Dönmez, Hilal and Özkırımlı, Elif and Özgür, Arzucan
              and Mahendran, Darshini and Gurdin, Gabrielle and Lewinski, Nastassja and Tang, Christina and McInnes, Bridget T. and C.S., Malarkodi and Rk Rao., Pattabhi and Lalitha Devi, Sobha and Cavedon, Lawrence and Cohn, Trevor and Baldwin, Timothy and Verspoor, Karin},
    title = {An Extended Overview of the CLEF 2020 ChEMU Lab: Information Extraction of Chemical Reactions from Patents},
    booktitle = {Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum},
    year = 2020
}
@misc{chemu2020dataset,
    author = {Verspoor, Karin and Nguyen, Dat Quoc and Akhondi, Saber A. and Druckenbrodt, Christian and Thorne, Camilo and Hoessel, Ralph and He, Jiayuan and Zhai, Zenan},
    title = {ChEMU dataset for information extraction from chemical patents},
    doi = {10.17632/wy6745bjfj.1},
    publisher = {Mendeley Data}
}

ChEMU lab 2020 is part of CLEF 2020, which will be held from 22 September to 25 September 2020. The schedule is given below. All times are in CET.

Date | Time | Team | Title
Wed 23 Sep | 11:00 - 12:30 | ChEMU | Overview of ChEMU2020
Wed 23 Sep | 15:00 - 15:15 | - | Welcome & Intro
Wed 23 Sep | 15:15 - 15:30 | LasigBioTM | LasigeBioTM team at CLEF2020 ChEMU evaluation lab: Named Entity Recognition and Event extraction from chemical reactions described in patents using BioBERT NER and RE
Wed 23 Sep | 15:30 - 15:45 | NextMove Software/Minesoft | Extraction of reactions from patents using grammars
Wed 23 Sep | 15:45 - 16:00 | BOUN-REX | BOUN-REX at CLEF-2020 ChEMU Task 2: Evaluating Pretrained Transformers for Event Extraction
Wed 23 Sep | 16:00 - 16:15 | NLP@VCU | NLPatVCU CLEF 2020 ChEMU Shared Task System Description
Wed 23 Sep | 16:15 - 16:30 | Melaxtech | Melaxtech: A report for CLEF 2020 – ChEMU Task of Chemical Reaction Extraction from Patent
Thu 24 Sep | 11:00 - 11:05 | - | Welcome
Thu 24 Sep | 11:05 - 11:20 | VinAI | VinAI at ChEMU 2020: An accurate system for named entity recognition in chemical reactions from patents
Thu 24 Sep | 11:20 - 11:35 | AU-KBC | CLRG ChemNER: A Chemical Named Entity Recognizer @ ChEMU CLEF 2020
Thu 24 Sep | 11:35 - 11:50 | BiTeM | Named entity recognition in chemical patents using ensemble of contextual language models
Thu 24 Sep | 11:50 - 12:10 | Elsevier | Background and future of ChEMU shared task
Thu 24 Sep | 12:10 - 12:30 | - | Discussion

ChEMU2020 provides two information extraction tasks over chemical reactions from patent documents: named entity recognition (Task 1) and event extraction (Task 2). We also host a third track for end-to-end systems, which aims to address the two tasks simultaneously.

Task 1: Named Entity Recognition

In general, a chemical reaction is a process leading to the transformation of one set of chemical substances to another. Task 1 aims to identify chemical compounds and their specific types, i.e. to assign the label of a chemical compound according to the role which it plays within a chemical reaction. In addition to chemical compounds, this task also requires identification of the temperatures and reaction times at which the chemical reaction is carried out, as well as yields obtained for the final chemical product and the label of the reaction.

In particular, we define 10 different entity type labels as shown in the following table.

Entity Type | Definition
EXAMPLE_LABEL | A label associated with a reaction specification.
REACTION_PRODUCT | A product is a substance that is formed during a chemical reaction.
STARTING_MATERIAL | A substance that is consumed in the course of a chemical reaction providing atoms to products is considered as starting material.
REAGENT_CATALYST | A reagent is a compound added to a system to cause or help with a chemical reaction. Compounds like catalysts, bases to remove protons or acids to add protons must be also annotated with this tag.
SOLVENT | A solvent is a chemical entity that dissolves a solute resulting in a solution.
TIME | The reaction time of the reaction.
TEMPERATURE | The temperature of the reaction.
YIELD_PERCENT | Yields given in percent values.
YIELD_OTHER | Yields provided in other units than %.
OTHER_COMPOUND | Other chemical compounds that are not the products, starting materials, reagents, catalysts and solvents.
Task 1 aims to identify the entities with the above 10 class labels. It also requires you to predict the boundaries of those entities.

Task 2: Event Extraction

A chemical reaction leading to an end product often consists of a sequence of individual event steps. Task 2 is to identify those steps which involve chemical entities recognized from Task 1. Task 2 requires identification of event trigger words (e.g. "added" and "stirred") which all have the same type of "EVENT_TRIGGER", and then determination of the chemical entity arguments of these events.

When predicting event arguments, we adapt semantic argument role labels Arg1 and ArgM from the Proposition Bank to label the relations between the trigger words and the chemical entities: Arg1 is used to label the relation between an event trigger word and a chemical compound. Here, Arg1 represents argument roles of being causally affected by another participant in the event. ArgM represents adjunct roles with respect to an event, used to label the relation between a trigger word and a temperature, time or yield entity.
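The role assignment described above can be sketched as a small lookup. This is only an illustration: the grouping of the Task 1 entity types into "compound" and "adjunct" sets is our reading of the paragraph, the function name `argument_role` is hypothetical, and the text does not say which role (if any) applies to EXAMPLE_LABEL, so the sketch returns `None` for it.

```python
from typing import Optional

# Assumption: these five Task 1 types are the "chemical compound" entities.
COMPOUND_TYPES = {
    "REACTION_PRODUCT",
    "STARTING_MATERIAL",
    "REAGENT_CATALYST",
    "SOLVENT",
    "OTHER_COMPOUND",
}
# Temperature, time and yield entities take the adjunct role.
ADJUNCT_TYPES = {"TEMPERATURE", "TIME", "YIELD_PERCENT", "YIELD_OTHER"}


def argument_role(entity_type: str) -> Optional[str]:
    """Return the role linking an EVENT_TRIGGER to an entity of this type."""
    if entity_type in COMPOUND_TYPES:
        return "Arg1"
    if entity_type in ADJUNCT_TYPES:
        return "ArgM"
    # EXAMPLE_LABEL and unknown types: no role is stated in the text.
    return None


print(argument_role("SOLVENT"))      # Arg1
print(argument_role("TEMPERATURE"))  # ArgM
```

Note that the sample annotation files may spell the relation type differently (for example in upper case), so check the data before relying on the exact strings.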

  • 15 Sep 2020 - Overview paper on ChEMU2020 shared task is available online. The paper presents an overview of the activities in ChEMU2020 including tasks, resources, evaluation framework and results. Check out the paper: Overview paper.
  • 10 Sep 2020 - New data license is issued! A new data license is issued to grant use of data for academic purpose. Please head over to http://chemu.eng.unimelb.edu.au/data to access our data.
  • 05 Jun 2020 - End of evaluation stage. Congratulations to MelaxTech team for their top rankings in all three tracks.
  • 10 Apr 2020 - Training data is released. Please register at our website: http://chemu.eng.unimelb.edu.au/ to access the data and formally participate.
  • 07 Apr 2020 - Latest version of sample data is updated. On 7 April, we removed the labeled trigger words from the annotation files in "ner", since those words are not the target output of Task 1. This version is available at: chemu_sample.v3.zip
  • 18 Mar 2020 - Second version of sample data is updated. On 18 March, we created the second version of the sample dataset, correcting some inconsistencies in how character entities were handled. This version is available at: chemu_sample.v2.zip

    Note that the file numbers in this version of the sample differ from those in the first version.

  • 09 Mar 2020 - First version of sample data is released. Please find the first version of sample dataset here: chemu_sample.zip

Making a submission

You can choose to make a submission against the development or test dataset by toggling the "data split" in the submission panel. You will be provided with the evaluation results right after your submission is uploaded successfully. A ranking of all your submissions is provided in your private leaderboard. You may also click "publish" to make the performance of a submission visible to all teams; it will then appear in the public leaderboard.

Submission format

For each patent snippet (e.g., 0000.txt), Task 1 requires the identification of the entities in the patent snippet.

In Task 2, you are given the patent snippets and the ground-truth entities from Task 1. Task 2 requires the identification of the trigger words of reaction events and the relations between trigger words and entities from Task 1.

In the third track, end-to-end systems, you are given the patent snippets only. This track requires the identification of all entities and trigger words, and the relations between them.

A valid submission is a compressed folder (e.g., submission.zip) consisting of your predicted annotation files (.ann files). Each annotation file included in the submission should pair with one patent snippet. For example, 0000.ann should contain your prediction for the patent snippet 0000.txt. A submission will be rejected if the prediction for a mandatory patent snippet is missing. For example, if a submission is made for development data split and 0000.txt file exists in development split, the submission will be rejected if 0000.ann file is not found in the submission.
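Before uploading, it can help to check locally that every snippet has a matching annotation file. The following is a minimal sketch under stated assumptions: the helper names `build_submission` and `missing_annotations` are hypothetical, the archive is written flat (no top-level folder), and whether the website accepts that exact layout is not confirmed by this page.

```python
import zipfile
from pathlib import Path
from typing import List


def build_submission(prediction_dir: str, submission_zip: str) -> None:
    """Zip every .ann file found in prediction_dir into submission_zip."""
    with zipfile.ZipFile(submission_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for ann in sorted(Path(prediction_dir).glob("*.ann")):
            zf.write(ann, arcname=ann.name)


def missing_annotations(snippet_dir: str, submission_zip: str) -> List[str]:
    """List .ann names expected for the .txt snippets but absent from the zip."""
    expected = {p.with_suffix(".ann").name for p in Path(snippet_dir).glob("*.txt")}
    with zipfile.ZipFile(submission_zip) as zf:
        # Compare base names only, so files inside a folder still count.
        present = {Path(name).name for name in zf.namelist()}
    return sorted(expected - present)
```

An empty list from `missing_annotations` means each snippet such as 0000.txt has its 0000.ann counterpart in the archive.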

The format of each submitted annotation file should be consistent with the BRAT standoff format. More information about the BRAT standoff format can be found on this website: BRAT standoff format.

An example of an annotation file submitted for Task 1 is as follows:

T0    OTHER_COMPOUND 417 421    DMSO
T1    TIME 305 309    16 h

etc...

An example of an annotation file submitted for Task 2/end-to-end systems is as follows:

T1    OTHER_COMPOUND 417 421    DMSO
T2    TIME 305 309    16 h
...
R0    ARGM Arg1:T3 Arg2:T14
R1    ARGM Arg1:T3 Arg2:T4

etc...
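Annotation lines like those in the examples can be read with a short parser. The sketch below is illustrative only: the `Entity` and `Relation` types and the `parse_ann` function are hypothetical names, the patterns cover just the text-bound (T) and relation (R) lines shown above, and discontinuous spans or other BRAT line types are simply skipped.

```python
import re
from typing import Dict, List, NamedTuple, Tuple


class Entity(NamedTuple):
    id: str
    type: str
    start: int
    end: int
    text: str


class Relation(NamedTuple):
    id: str
    type: str
    arg1: str
    arg2: str


# BRAT separates fields with tabs; \s+ also accepts the spaces shown above.
ENTITY_RE = re.compile(r"^(T\d+)\s+(\S+) (\d+) (\d+)\s+(.*)$")
RELATION_RE = re.compile(r"^(R\d+)\s+(\S+) Arg1:(\S+) Arg2:(\S+)\s*$")


def parse_ann(content: str) -> Tuple[Dict[str, Entity], List[Relation]]:
    """Parse T and R lines; any other line is ignored."""
    entities: Dict[str, Entity] = {}
    relations: List[Relation] = []
    for line in content.splitlines():
        m = ENTITY_RE.match(line)
        if m:
            eid, etype, start, end, text = m.groups()
            entities[eid] = Entity(eid, etype, int(start), int(end), text)
            continue
        m = RELATION_RE.match(line)
        if m:
            relations.append(Relation(*m.groups()))
    return entities, relations
```

Keeping entities in a dictionary keyed by ID makes it easy to resolve the `Arg1:` and `Arg2:` references of each relation afterwards.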

Evaluation metrics

Three metrics, namely precision, recall and F1 score, are used for evaluation, under both exact and relaxed span matching conditions. Note that the F1 score under exact span matching is the main metric when ranking all participating teams in the leaderboard. For more information about how your model is evaluated, please check out the paper: Annotating the biomedical literature for the human variome or the repository of the evaluation algorithms: BRATEVAL.
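To make the two matching conditions concrete, here is a rough sketch for entity spans. It is not the BRATEVAL implementation: we assume that "exact" means identical type and offsets, and that "relaxed" means identical type with overlapping character ranges; the official tool may define relaxed matching differently. The function names are hypothetical.

```python
from typing import Callable, List, Tuple

Span = Tuple[str, int, int]  # (entity type, start offset, end offset)


def exact_match(gold: Span, pred: Span) -> bool:
    return gold == pred


def relaxed_match(gold: Span, pred: Span) -> bool:
    # Assumption: same type and at least one shared character.
    return gold[0] == pred[0] and gold[1] < pred[2] and pred[1] < gold[2]


def prf1(gold: List[Span], pred: List[Span],
         match: Callable[[Span, Span], bool]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under the given matching function."""
    matched_pred = sum(any(match(g, p) for g in gold) for p in pred)
    matched_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [("TIME", 305, 309), ("OTHER_COMPOUND", 417, 421)]
pred = [("TIME", 305, 309), ("OTHER_COMPOUND", 415, 421)]
print(prf1(gold, pred, exact_match))    # (0.5, 0.5, 0.5)
print(prf1(gold, pred, relaxed_match))  # (1.0, 1.0, 1.0)
```

In this toy case the second prediction starts two characters early, so it counts only under the relaxed condition.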

  • Registration opens: 20 November 2019 Registration Form
  • Sample set release: 9 March chemu_sample.v3.zip
  • Training set release: 10 April Access to Training Set
  • Registration closes: 26 April 2020
  • Evaluation period of Task 1: 23:59 22 May 2020 - 23:59 28 May 2020
  • Evaluation period of Task 2: 23:59 29 May 2020 - 23:59 3 June 2020
  • End of evaluation cycle and feedback for participants: 23:59 5 June 2020
  • Submission of participant papers [CEUR-WS]: 23:59 17 July 2020
  • Review process of participant papers: 23:59 17 July - 23:59 14 August 2020
  • Notification of acceptance of participant papers [CEUR-WS]: 23:59 14 August 2020
  • Camera-ready copy of participant papers and extended lab overviews [CEUR-WS]: 23:59 28 August 2020
  • Evaluation Lab meeting @CLEF 2020, Thessaloniki, Greece: September 22-25 2020

All times above are in the AoE (Anywhere on Earth) time zone.

Annotation Guidelines

To know how the datasets are annotated and gain further insight into the task, please see the annotation guidelines:

Sample Dataset

To access the latest version of the sample dataset, please click chemu_sample.v3.zip to download.

The data for this task is released in BRAT format. This is a standoff format, with the text in one plain text file (*.txt), and the annotations in a different file (*.ann).
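Because the annotations live apart from the text, each entity's offsets index into the matching .txt file. In BRAT standoff the start offset is inclusive and the end offset is exclusive, so a quick consistency check can be sketched as follows; the function name `misaligned` and the tuple layout are illustrative assumptions, not part of the released data or tools.

```python
from typing import List, Tuple

# (entity id, start offset, end offset, mention text as written in the .ann file)
EntityRow = Tuple[str, int, int, str]


def misaligned(text: str, entities: List[EntityRow]) -> List[str]:
    """Return IDs of entities whose offsets do not select their mention in text."""
    return [eid for eid, start, end, mention in entities
            if text[start:end] != mention]


text = "The mixture was stirred for 16 h in DMSO."
start = text.index("DMSO")
rows = [("T0", start, start + 4, "DMSO"), ("T1", 0, 4, "16 h")]
print(misaligned(text, rows))  # ['T1']
```

In practice `text` would be the content of a snippet such as 0000.txt, read as UTF-8, and the rows would come from the paired 0000.ann file.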

The configuration files required for BRAT are included in each of the two subdirectories, "ner" for Task 1 and "ee" for Task 2.

A visualization of the latest sample dataset is provided here: Visualization of Sample Dataset.

Pre-trained ChemPatent Word Embeddings

In related work, we have released a set of new word embeddings, named ChemPatent Word Embeddings, trained on a collection of 84,076 full patent documents (1B tokens) across 7 patent offices. We have also released an ELMo model pre-trained on the same corpus, which provides contextualized word representations. We have demonstrated that ChemPatent Word Embeddings produce better performance than word embeddings pre-trained on biomedical literature corpora.

To access and use the released ChemPatent Word Embeddings and the pre-trained ELMo model, please click Github Repository for ChemPatent Embeddings.

For detailed information about the embeddings, please see the original paper at https://www.aclweb.org/anthology/W19-5035.pdf.

Relevant Background:
  1. Nguyen DQ, Zhai Z, Yoshikawa H, Fang B, Druckenbrodt C, Thorne C, Hoessel R, Akhondi SA, Cohn T, Baldwin T and Verspoor K. (2020) ChEMU: Named Entity Recognition and Event Extraction of Chemical Reactions from Patents. In ECIR 2020. PDF.
  2. Zhai Z, Nguyen DQ, Akhondi S, Thorne C, Druckenbrodt C, Cohn T, Gregory M and Verspoor K. (2019) Improving Chemical Named Entity Recognition in Patents with Contextualized Word Embeddings. Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP) at ACL 2019. https://www.aclweb.org/anthology/W19-5035.pdf
  3. Yoshikawa H, Verspoor K, Baldwin T, Nguyen DQ, Zhai Z, Akhondi S, Thorne C, Druckenbrodt C. (2019) Detecting Chemical Reaction Schemes in Patents. Australian Language Technology Association Workshop (ALTA 2019). Sydney, Australia, December 2019. https://www.aclweb.org/anthology/U19-1014.pdf

  1. Can I log in using the credentials from my CLEF registration?

    To provide a more secure environment on our submission website, we use a registration system that is independent of CLEF. To log into our submission website for the first time, you will need to sign up by providing some simple information including your username, email, password, and your institution. We apologize for any inconvenience.