WebNLG Corpus Documentation
This documentation describes the WebNLG corpus, which maps sets of RDF-triples to English text. The RDF-triples were extracted from DBpedia, and the texts were collected through crowdsourcing.
Everything is wrapped in the root tag
The main unit of the benchmark is <entry>. All the entries are wrapped in the tag <entries>. Each entry has five attributes: a DBpedia category, entry ID, shape, shape type, and triple set size.
<entry category="Food" eid="Id65" shape="(X (X) (X))" shape_type="sibling" size="2">
Each entry consists of three sections:
Original tripleset represents a set of triples as they were extracted from DBpedia. Each original triple is wrapped with the tag <otriple>.
Modified tripleset represents a set of triples as they were presented to crowdworkers (for more details on modifications, see below). The order of triples in the benchmark is the same as the order in which triples were presented to the crowd. Each modified triple is wrapped with the tag <mtriple>.
Lex (short for lexicalisation) represents a natural language text corresponding to the triples. Each lexicalisation has two attributes: a comment and a lexicalisation ID. By default, comments have the value good, except in rare cases where they were manually marked as toFix. This was done during corpus creation, whenever a lexicalisation was found not to match its triple set exactly.
The subject-predicate-object structure of each triple is linearised with vertical bars as separators. For instance,
Arròs_negre | country | Spain
where Arròs_negre is the subject, country is the predicate, and Spain is the object of the RDF-triple.
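As a minimal sketch of reading this linearised form, the snippet below splits a triple string on the vertical-bar separators. The names Triple and parse_triple are hypothetical helpers introduced for illustration; they are not part of any WebNLG tooling.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Hypothetical container for one linearised RDF-triple."""
    subject: str
    predicate: str
    obj: str


def parse_triple(text: str) -> Triple:
    """Split 'subject | predicate | object' on the ' | ' separators."""
    parts = [part.strip() for part in text.split(" | ")]
    if len(parts) != 3:
        raise ValueError(f"expected 3 fields, got {len(parts)}: {text!r}")
    return Triple(*parts)


triple = parse_triple("Arròs_negre | country | Spain")
print(triple.subject, triple.predicate, triple.obj)
```

Splitting on the bar together with its surrounding spaces, rather than on the bare character, is an assumption made here so that a bar inside a value is less likely to be mistaken for a separator.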
<entry category="Food" eid="Id65" shape="(X (X) (X))" shape_type="sibling" size="2">
  <originaltripleset>
    <otriple>Arròs_negre | country | Spain</otriple>
    <otriple>Arròs_negre | ingredient | White_rice</otriple>
  </originaltripleset>
  <modifiedtripleset>
    <mtriple>Arròs_negre | country | Spain</mtriple>
    <mtriple>Arròs_negre | ingredient | White_rice</mtriple>
  </modifiedtripleset>
  <lex comment="good" lid="1">White rice is an ingredient of Arros negre which is a traditional dish from Spain.</lex>
  <lex comment="good" lid="2">White rice is used in Arros negre which is from Spain.</lex>
  <lex comment="good" lid="3">Arros negre contains white rice as an ingredient and it comes from Spain.</lex>
</entry>
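One way to read such an entry is with Python's standard xml.etree.ElementTree module. The sketch below parses the example entry from a string and collects its attributes, modified triples, and the lexicalisations whose comment is good. The function name read_entry and the keys of the returned dictionary are illustrative choices, not a prescribed interface.

```python
import xml.etree.ElementTree as ET

ENTRY_XML = """
<entry category="Food" eid="Id65" shape="(X (X) (X))" shape_type="sibling" size="2">
  <originaltripleset>
    <otriple>Arròs_negre | country | Spain</otriple>
    <otriple>Arròs_negre | ingredient | White_rice</otriple>
  </originaltripleset>
  <modifiedtripleset>
    <mtriple>Arròs_negre | country | Spain</mtriple>
    <mtriple>Arròs_negre | ingredient | White_rice</mtriple>
  </modifiedtripleset>
  <lex comment="good" lid="1">White rice is an ingredient of Arros negre which is a traditional dish from Spain.</lex>
  <lex comment="good" lid="2">White rice is used in Arros negre which is from Spain.</lex>
</entry>
""".strip()


def read_entry(entry: ET.Element) -> dict:
    """Collect the attributes, modified triples and 'good' texts of one <entry>."""
    return {
        "category": entry.get("category"),
        "eid": entry.get("eid"),
        "shape_type": entry.get("shape_type"),
        "size": int(entry.get("size")),
        "mtriples": [m.text for m in entry.iterfind("modifiedtripleset/mtriple")],
        # Keep only lexicalisations that were not marked as toFix.
        "texts": [lex.text for lex in entry.iterfind("lex")
                  if lex.get("comment") == "good"],
    }


record = read_entry(ET.fromstring(ENTRY_XML))
print(record["category"], record["size"], len(record["texts"]))
```

For a whole corpus file, the same function could be applied to every element returned by iterating over the entry elements of the parsed document.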
Each set of RDF-triples is a tree, which is characterised by its shape and shape type:
<entry shape="(X (X) (X))"> is a string representation of the tree with nested parentheses, where X is a node (see Newick tree format);
<entry shape_type="sibling"> is the type of the tree shape. We identify three types of tree shapes:
chain (the object of one triple is the subject of the other);
sibling (triples with a shared subject);
Entities, which served as roots, are listed in this file.
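The nested-parentheses shape string can be read with a small recursive parser. The sketch below turns a shape into nested lists, where each list holds the children of one node, and counts the nodes. The names parse_shape and count_nodes are hypothetical, and the check that the number of triples equals the number of nodes minus one assumes that every triple corresponds to exactly one edge of the tree.

```python
def parse_shape(shape: str) -> list:
    """Parse a shape such as '(X (X) (X))' into nested lists of children."""
    tokens = shape.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node() -> list:
        nonlocal pos
        if pos >= len(tokens) or tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1
        if pos >= len(tokens) or tokens[pos] != "X":
            raise ValueError(f"expected 'X' at token {pos}")
        pos += 1
        children = []
        # Every nested '(' opens a child subtree of the current node.
        while pos < len(tokens) and tokens[pos] == "(":
            children.append(node())
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError(f"expected ')' at token {pos}")
        pos += 1
        return children

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the root node")
    return tree


def count_nodes(tree: list) -> int:
    """Count the node itself plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in tree)


tree = parse_shape("(X (X) (X))")
print(tree, count_nodes(tree))
```

For the sibling example, the root has two leaf children, so the tree has three nodes and two edges, which agrees with size="2" on the entry.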
Initial triples extracted from DBpedia were modified in several ways. We describe below the most frequent changes that have been made. Full mapping information can be found here.
Unclear properties were renamed.
<otriple>Karnataka | west | Arabian_Sea</otriple>
<mtriple>Karnataka | has to its west | Arabian_Sea</mtriple>
Properties with the same semantics were merged into a single property to avoid redundancy in the data.
<otriple>Stuart_Parker_(footballer) | club | Chesterfield_F.C.</otriple>
<otriple>Stuart_Parker_(footballer) | team | Chesterfield_F.C.</otriple>
<mtriple>Stuart_Parker_(footballer) | club | Chesterfield_F.C.</mtriple>
Inexact subjects and objects were clarified.
<otriple>1_Decembrie_1918_University,_Alba_Iulia | nickname | Uab</otriple>
<mtriple>1_Decembrie_1918_University | nickname | Uab</mtriple>
In this example, the modified subject keeps only the name of the university (1_Decembrie_1918_University) and drops its location (Alba_Iulia).
Objects were replaced for the following reasons:
incorrect DBpedia data (quite often stemming from the bad parsing of infoboxes);
<otriple>Ab_Klink | almaMater | Law</otriple>
<mtriple>Ab_Klink | almaMater | Leiden_University</mtriple>
The incorrect original triple arose because Ab Klink studied Law at Leiden University: the field of study was extracted in place of the institution.
same data, but in different measurement units (e.g., feet/metres, Celsius/Fahrenheit);
<otriple>320_South_Boston_Building | height | 400.0 (feet)</otriple>
<otriple>320_South_Boston_Building | height | 121.92 (metres)</otriple>
<mtriple>320_South_Boston_Building | height | 121.92 (metres)</mtriple>
same data, but in different formats (e.g., using double quotes, datatypes);
<otriple>Elliot_See | deathDate | "1966-02-28"^^xsd:date</otriple>
<otriple>Elliot_See | deathDate | 1966-02-28</otriple>
<mtriple>Elliot_See | deathDate | 1966-02-28</mtriple>
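To illustrate why such duplicates carry the same information, the sketch below checks the two examples above: it strips the quotes and datatype suffix from a typed literal and converts feet to metres. This is not the procedure used to build the corpus; strip_datatype and feet_to_metres are hypothetical helpers, and the regular expression only covers literals written exactly in the quoted-value-plus-xsd form shown above.

```python
import re

FEET_TO_METRES = 0.3048  # the international foot is defined as exactly 0.3048 m

# Matches a quoted value followed by an xsd datatype suffix, e.g. "..."^^xsd:date
TYPED_LITERAL = re.compile(r'^"(?P<value>.*)"\^\^xsd:\w+$')


def strip_datatype(obj: str) -> str:
    """Return the bare value of a typed literal; other objects pass through."""
    match = TYPED_LITERAL.match(obj)
    return match.group("value") if match else obj


def feet_to_metres(feet: float) -> float:
    """Convert feet to metres, rounded to two decimal places."""
    return round(feet * FEET_TO_METRES, 2)


print(strip_datatype('"1966-02-28"^^xsd:date'))
print(strip_datatype("1966-02-28"))
print(feet_to_metres(400.0))
```

Both deathDate objects reduce to the same string, and 400.0 feet corresponds to the 121.92 metres kept in the modified triple.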
The changes were sometimes quite drastic, especially in the case of incorrect DBpedia data, so a modified triple may differ substantially from its original.
An original tripleset and a modified tripleset usually represent a one-to-one mapping. However, there are many-to-one cases in which several original triplesets are mapped to one modified tripleset.
<originaltripleset>
  <otriple>Jens_Härtel | team | 1._FC_Magdeburg</otriple>
</originaltripleset>
<originaltripleset>
  <otriple>Jens_Härtel | managerClub | 1._FC_Magdeburg</otriple>
</originaltripleset>
<modifiedtripleset>
  <mtriple>Jens_Härtel | club | 1._FC_Magdeburg</mtriple>
</modifiedtripleset>
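When reading entries, it is therefore safer to expect a list of original triplesets per entry, not a single one. The sketch below parses the fragment above with Python's standard xml.etree.ElementTree module. The bare <entry> wrapper, without attributes, is added here only so that the fragment is well-formed XML.

```python
import xml.etree.ElementTree as ET

# The bare <entry> wrapper is an assumption added for well-formedness.
FRAGMENT = """
<entry>
  <originaltripleset>
    <otriple>Jens_Härtel | team | 1._FC_Magdeburg</otriple>
  </originaltripleset>
  <originaltripleset>
    <otriple>Jens_Härtel | managerClub | 1._FC_Magdeburg</otriple>
  </originaltripleset>
  <modifiedtripleset>
    <mtriple>Jens_Härtel | club | 1._FC_Magdeburg</mtriple>
  </modifiedtripleset>
</entry>
""".strip()

entry = ET.fromstring(FRAGMENT)

# One inner list per <originaltripleset>; an entry may contain several.
original_sets = [
    [t.text for t in tripleset.iterfind("otriple")]
    for tripleset in entry.iterfind("originaltripleset")
]
modified = [t.text for t in entry.iterfind("modifiedtripleset/mtriple")]

print(len(original_sets), "original triplesets ->", modified)
```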
Original and modified triples serve different purposes: the original triples link the data to a knowledge base (DBpedia), whereas the modified triples ensure consistency and homogeneity throughout the data. To train models, use the modified triples.
Note on entries from 1_triple files
We built the corpus so that the 1_triple files for a given category include every triple that occurs anywhere in the corpus for that category. Hence the name allSolutions in the file names.
For example, in the 1_triple file of the Food category, one can find the triple United_States | leader | Barack_Obama. This means that somewhere in the 2, 3, 4 or 5 triples files of the Food category there is a tripleset (about Food) that includes a triple about the leader of the United States.
In theory, this hierarchical corpus construction makes it possible to produce texts expressing 5 triples using only 1_triple entries.
Note on typos
WebNLG was crowdsourced, so there are spelling mistakes in some lexicalisations. Those were mostly corrected in the cleaned WebNLG version 2.1. If you use earlier versions of WebNLG (< 2.1), this script may help you to detect some typos.
However, if you find mistakes in realising semantic content (e.g., a triple realisation is missing), do not hesitate to drop us a line at firstname.lastname@example.org; we will fix it.