WebNLG Challenge 2020


The challenge data (including test sets) was published online: meet the WebNLG 3.0 version!

Three years after the first edition, the second WebNLG challenge took place in 2020.

WebNLG goes bilingual (English, Russian) and bidirectional (generation and parsing)!


The challenge comprises two main tasks:

  1. RDF-to-text generation, as in WebNLG 2017 but with new data and two target languages;
  2. Text-to-RDF semantic parsing: converting a text into the corresponding set of RDF triples.

For Task 1, given the four RDF triples shown in (a), the aim is to generate a text such as (b) or (c). For Task 2, the direction is reversed: the aim is to produce the triples in (a) from a text such as (b) or (c).


(a) Set of RDF triples

<entry category="Company" eid="Id21" shape="(X (X) (X) (X) (X))" shape_type="sibling" size="4">
        <mtriple>Trane | foundingDate | 1913-01-01</mtriple>
        <mtriple>Trane | location | Ireland</mtriple>
        <mtriple>Trane | foundationPlace | La_Crosse,_Wisconsin</mtriple>
        <mtriple>Trane | numberOfEmployees | 29000</mtriple>
</entry>

(b) English text

Trane, which was founded on January 1st 1913 in La Crosse, Wisconsin, is based in Ireland. It has 29,000 employees.

(c) Russian text

Компания "Тране", основанная 1 января 1913 года в Ла-Кроссе в штате Висконсин, находится в Ирландии. В компании работают 29 тысяч человек.
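To make the relation between (a) and the two tasks concrete, the following minimal sketch reads the triples out of an entry like the one in (a) using only the Python standard library. It is an illustration, not the official data reader: the function name `read_triples` is hypothetical, and the entry is reproduced with a closing `</entry>` tag so that it parses as XML.

```python
import xml.etree.ElementTree as ET

# The entry from example (a), closed with </entry> so that it is well-formed.
ENTRY_XML = """
<entry category="Company" eid="Id21" shape="(X (X) (X) (X) (X))" shape_type="sibling" size="4">
    <mtriple>Trane | foundingDate | 1913-01-01</mtriple>
    <mtriple>Trane | location | Ireland</mtriple>
    <mtriple>Trane | foundationPlace | La_Crosse,_Wisconsin</mtriple>
    <mtriple>Trane | numberOfEmployees | 29000</mtriple>
</entry>
"""


def read_triples(entry_xml):
    """Return the (subject, predicate, object) tuples found in one entry."""
    entry = ET.fromstring(entry_xml.strip())
    triples = []
    # iter() finds <mtriple> elements at any depth below the entry.
    for node in entry.iter("mtriple"):
        subject, predicate, obj = (part.strip() for part in node.text.split(" | "))
        triples.append((subject, predicate, obj))
    return triples


triples = read_triples(ENTRY_XML)
for triple in triples:
    print(triple)
```

In Task 1 a list like `triples` is the system input and a text such as (b) or (c) is the output; in Task 2 the roles are swapped.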


The English WebNLG 2020 dataset for training comprises data-text pairs for 16 distinct DBpedia categories:

  • The 10 seen categories used in 2017: Airport, Astronaut, Building, City, ComicsCharacter, Food, Monument, SportsTeam, University, and WrittenWork.

  • The 5 unseen categories of 2017, which are now part of the seen data: Athlete, Artist, CelestialBody, MeanOfTransportation, Politician.

  • 1 new category: Company.

Data improvements compared to the 2017 data:

  • ~5,600 texts were cleaned of misspellings, and missing triple verbalisations were added to some texts.

  • Information about tree shapes and shape types was added to each entry. See documentation.

The new Russian dataset comprises around 8,000 data inputs and 20,800 data-text pairs for 9 distinct categories:

  • Airport, Astronaut, Building, CelestialBody, ComicsCharacter, Food, Monument, SportsTeam, and University.

For every input triple set, several references in each language (English, Russian) are provided. New test sets are released for all categories seen in the training data (see above), and for several new unseen categories (categories not included in the training data). See test data.

See the corpus documentation for the WebNLG format, and the WebNLG 2017 challenge report to learn more about seen/unseen categories.


The WebNLG data was originally created to promote the development of RDF verbalisers able to generate short texts and to handle micro-planning (i.e., sentence segmentation and ordering, referring expression generation, aggregation); the data for the first challenge included a total of 15 DBpedia categories. The 2020 challenge aims first of all at enlarging the datasets (and hence the coverage of the verbalisers) by covering more categories and an additional language. The other main objective of the 2020 edition is to promote the development of knowledge extraction tools, with a task that mirrors the verbalisation task.

RDF Verbalisers

The RDF language, in which DBpedia is encoded, is widely used within the Linked Data framework. Many large-scale datasets are encoded in this language (e.g., MusicBrainz, FOAF, LinkedGeoData), and official institutions increasingly publish their data in this format. Being able to generate good-quality text from RDF data would open the way to many new applications, such as making linked data more accessible to lay users, enriching existing text with information drawn from knowledge bases, or describing, comparing and relating entities present in these knowledge bases.


By providing a bilingual corpus (English and Russian), we aim to promote the development of tools for languages other than English and to allow for experimentation with pre-training and transfer approaches (do the English verbalisations of RDF triples help in better verbalising the triples in Russian?).

Knowledge extraction

The new semantic parsing task opens up new lines of research in several directions. Can it be used to bootstrap entity linkers? How does RDF-based semantic parsing relate to other semantic parsing tasks where the output semantic representations are lambda terms or KB queries? Can semantic parsing be used to improve generation in ways similar to the back translation approaches proposed in machine translation?

Important Dates

  • 15 April 2020: Release of training and development data
  • 30 April 2020: Release of some simple preliminary evaluation scripts to support development
  • 30 May 2020: Release of the final evaluation scripts
  • 13 September 2020: Release of test data
  • 27 September 2020: Entry submission deadline (no extension)
  • 5-9 October 2020: Automatic evaluation results are released to participants
  • 15 October 2020: Participants submit a description of their systems
  • October-December 2020: Human evaluation of submissions
  • 18 December 2020: Results of automatic and human evaluations and system presentations at the WebNLG workshop at INLG 2020



Organising Committee

  • Thiago Castro Ferreira, Federal University of Minas Gerais, Brazil
  • Claire Gardent, CNRS/LORIA, Nancy, France
  • Nikolai Ilinykh, University of Gothenburg, Sweden
  • Chris van der Lee, Tilburg University, The Netherlands
  • Simon Mille, Universitat Pompeu Fabra, Barcelona, Spain
  • Diego Moussallem, Paderborn University, Germany
  • Anastasia Shimorina, Université de Lorraine/LORIA, Nancy, France


Participation in the challenge

Registration and data access

Registration is now closed. Data used in the challenge are available on GitLab (WebNLG dataset 3.0 version).

The XML WebNLG data reader in Python is available here.


System outputs are assessed with automatic and human evaluation. Please note that human evaluation is the primary evaluation method.

Automatic Evaluation

Evaluation scripts to support development:

  • RDF-to-text generation

    Generation is evaluated with automatic metrics: BLEU, METEOR, chrF++, TER, and BERT-Score.

  • Text-to-RDF semantic parsing

    Semantic parsing is evaluated with F-score, Precision, and Recall, based on full triple match, as well as four kinds of partial matching on an element-level:

    • Strict: for each element of the triple, exact match of the candidate string with the reference is required, and the element type (subject, predicate, object) should match with the reference.
    • Exact: for each element of the triple, exact match of the candidate string with the reference is required, and the element type (subject, predicate, object) is irrelevant.
    • Partial: for each element of the triple, the candidate string should match at least partially with the reference string, and the element type (subject, predicate, object) is irrelevant.
    • Type: for each element of the triple, the candidate string should match at least partially with the reference string, and the element type (subject, predicate, object) should match with the reference.
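As a rough illustration of these definitions, the sketch below computes Precision, Recall and F-score on full triple match and shows how the four element-level modes differ for a single candidate/reference pair. It is not the official evaluation script: the function names are hypothetical, and "partial" matching is approximated here by shared lower-cased tokens, which is an assumption made only for this example.

```python
import re


def prf(candidate, reference):
    """Precision, recall and F-score over two collections of (s, p, o) triples."""
    candidate, reference = set(candidate), set(reference)
    matched = len(candidate & reference)  # full triple match only
    precision = matched / len(candidate) if candidate else 0.0
    recall = matched / len(reference) if reference else 0.0
    f_score = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f_score


def tokens(text):
    """Lower-cased tokens, split on underscores, whitespace and commas."""
    return {tok for tok in re.split(r"[_\s,]+", text.lower()) if tok}


def element_match(cand_text, cand_role, ref_text, ref_role, mode):
    """Compare one candidate element with one reference element.

    Roles are "subject", "predicate" or "object". Token overlap stands in
    for partial string matching (an assumption of this sketch).
    """
    exact = cand_text == ref_text
    overlap = bool(tokens(cand_text) & tokens(ref_text))
    same_role = cand_role == ref_role
    if mode == "strict":
        return exact and same_role
    if mode == "exact":
        return exact
    if mode == "partial":
        return overlap
    if mode == "type":
        return overlap and same_role
    raise ValueError(f"unknown mode: {mode}")


reference = [("Trane", "location", "Ireland"), ("Trane", "numberOfEmployees", "29000")]
candidate = [("Trane", "location", "Ireland"), ("Trane", "foundationPlace", "Ireland")]
print(prf(candidate, reference))
```

With these definitions, a candidate element `La_Crosse` compared with the reference `La_Crosse,_Wisconsin` in the same role counts under Partial and Type but not under Strict or Exact.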


Outputs of systems on the development sets can be submitted to GERBIL-NLG leaderboards. The outputs are evaluated with the official automatic evaluation scripts.

The development set is distributed across several folders and files. For leaderboard submission, your outputs should be merged into a single file. The reference data is a concatenation of all files, ordered from the 1triples folder to the 7triples folder, with files merged in alphabetical order. An example of the reference file is here.

GERBIL-NLG follows the FAIR principles and uses Uniform Resource Identifiers (URIs) for maintaining findable links. Each experiment therefore has a unique URI, so anybody can access the experiment later and/or include it in papers.
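The ordering described above could be reproduced along the following lines. This is a sketch under assumptions: the helper name `ordered_reference_files` is hypothetical, folders are assumed to be named exactly `1triples` to `7triples` directly under the development-set root, and "alphabetical" is taken to mean Python's default (case-sensitive) ordering of file names.

```python
import re
from pathlib import Path


def ordered_reference_files(dev_root):
    """List files under folders named like '1triples' ... '7triples'.

    Folders are visited in numeric order; the files inside each folder
    are sorted by name.
    """
    folders = []
    for folder in Path(dev_root).iterdir():
        match = re.fullmatch(r"(\d+)triples", folder.name)
        if folder.is_dir() and match:
            folders.append((int(match.group(1)), folder))

    files = []
    for _, folder in sorted(folders, key=lambda item: item[0]):
        files.extend(sorted(p for p in folder.iterdir() if p.is_file()))
    return files
```

System outputs merged in the order returned by such a function should line up with a reference built the same way; comparing the first few lines against the example reference file is a sensible check before submitting.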

Human Evaluation

For RDF-to-text generation, system outputs are assessed according to criteria such as grammaticality/correctness, appropriateness/adequacy, fluency/naturalness, etc., by native speakers recruited on crowdsourcing platforms.

Test Data

Test sets for both tasks include three types of data:

  1. RDF triples/texts based on the entities and categories seen in the training data (e.g., Alan Bean in the category Astronaut)
  2. RDF triples/texts based on the categories seen in the training data, but not entities (e.g., Yuri Gagarin in the category Astronaut)
  3. RDF triples/texts based on the categories not present in the training data (surprise domains).

The above applies to both tasks of the challenge for English. For Russian, only data of type 1 is present.

See the WebNLG 2017 challenge report to learn more about seen/unseen categories.
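The three types of test data can be read as a simple decision rule. The sketch below is only an illustration: the function name is hypothetical, and treating an entry as type 2 as soon as any of its entities is absent from the training data is an assumption, not a definition taken from the challenge.

```python
def classify_test_entry(category, entities, train_categories, train_entities):
    """Return 1, 2 or 3, following the numbered list of test data types."""
    if category not in train_categories:
        return 3  # category not present in the training data
    if not set(entities) <= set(train_entities):
        return 2  # seen category, but at least one unseen entity (assumption)
    return 1  # seen category and seen entities


# Toy training data built from the examples given in the list.
train_categories = {"Astronaut"}
train_entities = {"Alan_Bean"}

print(classify_test_entry("Astronaut", ["Alan_Bean"], train_categories, train_entities))
print(classify_test_entry("Astronaut", ["Yuri_Gagarin"], train_categories, train_entities))
print(classify_test_entry("HypotheticalCategory", ["Some_Entity"], train_categories, train_entities))
```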


Submission link: https://gerbil-nlg.dice-research.org/gerbil/submission

You can submit multiple outputs stemming from different systems. Please name them accordingly and specify which system should be considered as the primary system.

RDF-to-text generation

Your submission file must be a .txt file (UTF-8 encoding) where each text is true-cased and detokenised. Example for English.

Each line should correspond to a verbalisation of a DBpedia triple set. Line 1 should contain the verbalisation of the DBpedia triple set with ID=1, line 2 that of the triple set with ID=2, and so on.
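A file in this shape could be produced as sketched below. The helper name `write_submission` and the dictionary input are assumptions made for illustration; the only requirements taken from the description above are the UTF-8 encoding and the rule that line i holds the text for ID=i.

```python
def write_submission(texts_by_id, path):
    """Write one verbalisation per line; line i holds the text for ID=i."""
    expected_ids = list(range(1, len(texts_by_id) + 1))
    if sorted(texts_by_id) != expected_ids:
        raise ValueError("IDs must run from 1 to N without gaps")
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        for entry_id in expected_ids:
            # Collapse internal whitespace so that each text stays on one line.
            handle.write(" ".join(texts_by_id[entry_id].split()) + "\n")
```

For example, `write_submission({1: "First text.", 2: "Second text."}, "submission.txt")` writes a two-line file. Raising an error on missing IDs is a design choice of this sketch: a silently skipped ID would shift every following line against the references.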

Text-to-RDF semantic parsing

Your submission file must be an .xml file formatted following this example.

Each entry corresponds to a set of RDF triples extracted for a single text.

Participant FAQ

Which resources are allowed?

There are no restrictions for any task. For example, you may use a pre-trained language model, external corpora, etc.

Can I submit multiple outputs?

Yes, provided that they stem from substantially different systems. However, for human assessment we may ask you to designate a primary system that will be evaluated.

Can I participate in one task / for one language only?

Yes. You can participate only in, say, semantic parsing for Russian, or RDF-to-text generation for English.

Can I download the data without participating in the challenge?

Yes. Data used in the challenge are available on GitLab (WebNLG dataset 3.0 version).

Will it be possible to withdraw my results if my team's performance is unsatisfactory?

Yes. We will first announce the results to participants anonymously, and you will have an opportunity to withdraw your results.

Is it obligatory to submit a workshop paper in order to present a system?

No. However, we expect you to provide a description of your system. It will be reviewed by other participants and/or organisers, and will be made available on the challenge website. You are also strongly encouraged to submit your description (as a paper) to the WebNLG workshop at INLG 2020, but it is not mandatory.

Challenge Results

Automatic Evaluation Results

https://gerbil-nlg.dice-research.org/gerbil/webnlg2020results (anonymous results)

Human Evaluation Results

https://beng.dice-research.org/gerbil/webnlg2020resultshumaneval (anonymous results)