gedcom55XML.rng to Validate a GEDCOM 5.5 XML Document
The Church of Jesus Christ of Latter-day Saints' GEDCOM 5.5 Standard provides a way to format, digitally store, and transfer genealogical data in a standardized, human-readable text file. Over the years, new standards using the Extensible Markup Language (XML) have been proposed. The web page here provides a list of these proposals.
None of these proposals has gained widespread acceptance, and GEDCOM 5.5 remains the format that nearly all genealogical programs export and that users still rely on to exchange their genealogical data.
XML proponents can console themselves with the fact that the GEDCOM 5.5 standard bears some resemblance to an XML markup language. Despite its non-standard character set (ANSEL), it is human-readable text. Instead of XML elements, it delimits genealogical data between a text tag and a carriage return, a line feed, or both; the tag precedes the data. Finally, GEDCOM 5.5 tags can be nested under other tags, resembling the parent and child node trees of XML elements.
Given these similarities, it is easy to envision a one-to-one translation of GEDCOM 5.5 into an XML language. GEDCOM 5.5 tags could be XML elements; XML's less-than '<' and greater-than '>' characters could surround the tags, converting them into elements. Opening and closing tags could delimit genealogical data. To differentiate it from other proposals, such an XML markup language could be called “GEDCOM 5.5 XML.”
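The translation envisioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the converter described later in this document: the function name gedcom_to_xml is hypothetical, the root element is assumed to be GED, and storing a record's cross-reference identifier (such as @I1@) in an ID attribute is a choice made for this sketch rather than something the GEDCOM 5.5 standard or the schema prescribes.

```python
import xml.etree.ElementTree as ET


def gedcom_to_xml(lines, root_tag="GED"):
    """Turn GEDCOM 5.5 lines into nested XML elements named after the tags."""
    root = ET.Element(root_tag)
    # stack[n] is the element that a line at level n is attached to.
    stack = [root]
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        level_text, _, rest = line.partition(" ")
        level = int(level_text)
        xref = None
        # A record line may carry a cross-reference id before the tag.
        if rest.startswith("@"):
            xref, _, rest = rest.partition(" ")
        tag, _, value = rest.partition(" ")
        # Forget elements that are deeper than the current line's parent.
        del stack[level + 1:]
        element = ET.SubElement(stack[level], tag)
        if xref:
            # Assumption of this sketch: keep the id as an attribute.
            element.set("ID", xref.strip("@"))
        if value:
            element.text = value
        stack.append(element)
    return root


if __name__ == "__main__":
    sample = [
        "0 HEAD",
        "1 CHAR ANSEL",
        "0 @I1@ INDI",
        "1 NAME Joseph /Smith/",
        "2 SURN Smith",
        "0 TRLR",
    ]
    print(ET.tostring(gedcom_to_xml(sample), encoding="unicode"))
```

Note that the NAME element in the sample ends up holding both character data and a child SURN element. The sketch assumes well-formed input whose levels never skip a step; it does no error handling.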
Two XML markup languages have come close to describing GEDCOM 5.5 XML: GedML and GeniML.
In 1998, Michael H. Kay proposed GedML. An XML schema, in the form of a DTD, was released for this proposal. In GedML's DTD, most of the GEDCOM 5.5 tags have been translated into XML elements with the same name; e.g., INDI tags are simply <INDI> elements. Other tags, though, have been renamed; e.g., SOUR tags are <Source> elements. GedML is also missing the <TRLR> element, which would replicate the GEDCOM 5.5 TRLR tag. This tag indicates the end of a GEDCOM 5.5 file. It is absent from GedML because the closing tag of the document root element, </GED>, performs the same function, rendering a <TRLR/> element superfluous. Considering these differences, as far as the DTD is concerned, GedML is not a one-to-one translation of GEDCOM 5.5 into XML.
Kay, however, also released a Java program that translates GEDCOM 5.5 files into XML documents. The markup language of these documents does not have an XML schema (DTD, W3C XML Schema, or RELAX NG). However, a review of the program's output reveals that all GEDCOM 5.5 tags have been translated to XML elements with the same name, just as in GEDCOM 5.5 XML. There is, though, one missing element: the one corresponding to the TRLR tag. Given this flaw, this version of GedML also fails to be a one-to-one translation of GEDCOM 5.5 into XML.
Jerry Fitzpatrick of Software Renovation Corporation developed GeniML. Like Kay, he released a converter, in his case a Windows program, that translates GEDCOM 5.5 to XML. No schema for this markup language has been released. However, sample output included with the converter suggests that GeniML is close to a one-to-one translation of GEDCOM 5.5 into XML. GeniML even includes the <TRLR> element. There are, however, some differences; for example, GeniML delimits surname data using a <SURNAME> element, whereas GEDCOM 5.5 uses a SURN tag for the same purpose. Considering this difference and many others, GeniML, like GedML, is not a one-to-one translation of GEDCOM 5.5 into XML markup.
GEDCOM 5.5 XML differentiates itself from GedML and GeniML because it attempts to replicate the LDS's GEDCOM 5.5 standard using XML markup. Without exception, all GEDCOM 5.5 tags should correspond to XML elements with the same name; all tags should be preserved; the parent-child relationships between the tags and elements should parallel one another; and all data delimited by the elements should fall within the strict guidelines of the standard.
To define GEDCOM 5.5 XML, a schema is needed. I am releasing one written in RELAX NG/Schematron. The RELAX NG XML version of the schema is contained in the file called “
gedcom55XML.rng”. The RELAX NG compact syntax version is in the file called “
Like any XML schema, it can be used to validate, in the XML sense, a GEDCOM 5.5 file that has been converted to a GEDCOM 5.5 XML document; i.e., it can be used to determine if the document is both “well-formed” and “valid.” This means that it can be used to test if the GEDCOM 5.5 tags (when converted to XML elements) are correctly nested within each other as prescribed by the LDS's GEDCOM 5.5 standard. It can also be used to test if the data, delimited by the elements, follows the strict guidelines prescribed by the standard.
http://www.neomantic.com/gedcom55XML is the namespace for the root element of a GEDCOM 5.5 XML document.
The schema described in this document is version 0.1. It was originally written in RELAX NG's XML markup.
The XML version is located at
A RELAX NG compact syntax version, produced using the
translator, is also being released. It is located at
. (The compact syntax version has not been tested.)
Both the XML and compact syntax 0.1 schema files can also be found bundled with this documentation at http://www.neomantic.com/gedcom55XML/0.1/gedcom55XML-0.1.tar.gz.
For verification purposes, I have signed this gzipped tar archive with my GnuPG key; the corresponding public key is located here. The signature of gedcom55XML-0.1.tar.gz is located at http://www.neomantic.com/gedcom55XML/0.1/gedcom55XML-0.1.tar.gz.sign.
The source code for both
gedcom55XML.rnc is released under the GNU General Public License Version 2 (GPL). The full text of this license can be found in a file called “
Hyperlinks to the most up-to-date version of the schema are located at http://www.neomantic.com/gedcom55XML.
Ideally, the GEDCOM 5.5 XML RELAX NG/Schematron schema in gedcom55XML.rng would completely mirror the GEDCOM 5.5 standard. There are, however, nine places where it does not fully replicate the standard. This is why the version number of gedcom55XML.rng/c is “0.1”. When the schema captures 100 percent of the standard, the version number will be incremented to “1.0”.
The schema does not describe the valid content of a GEDCOM 5.5 DATE_VALUE. The possible combinations of DATE_VALUEs are simply too numerous to handle using Schematron alone. The schema only prescribes that the length of the character data of a DATE_VALUE be between 1 and 35 characters.
The schema does not describe all instances of
EXACT_DATE data. It fails to describe the
EXACT_DATE values of
CHANGE_DATE. When the Schematron rule is able to describe the
EXACT_DATE value, it only uses the English
MONTH abbreviations; i.e.,
Normally, the regions in the
PLAC tag are separated by commas (e.g., Kansas City, Jackson County, Missouri, USA) or, alternatively, the delimiter specified in the
gedcom55XML.rng ignores the delimiter of
The given name and surname of an individual are usually delimited in the
NAME tag by forward slashes (e.g., Joseph/Smith/). The Schematron rules do not prescribe this format.
GEDCOM 5.5 allows for user-defined tags, even though it discourages their usage. The schema does not allow user-defined tags.
The schema does not describe the content of
EVENT_TYPE_CITED_FROM, even though the GEDCOM 5.5 standard specifies a restricted set of values.
It also does not describe the content of
EVENTS_RECORDED, even though again the GEDCOM 5.5 standard specifies a restricted set of values.
AGE tag permits values such as
STILLBORN, along with '
d' which respectively signify year, month, and day. Following the specification strictly, these values should be case insensitive. In
gedcom55XML.rng, they are case sensitive, defaulting to the lowercase values.
MONTH values of
DATE_EXACT should also be case-insensitive, but in the current schema they are not. The defaults are the uppercase values.
Most of these shortcomings are due to difficulties in describing what would be considered, in XML terms, mixed-content elements.
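Because the schema leaves some of these points unchecked, a small pre-processing step outside the schema could fill part of the gap. The following Python sketch is an assumption-laden illustration, not part of gedcom55XML.rng: the function names are hypothetical, the exact-date pattern assumes a one- or two-digit day, a three-letter English month abbreviation, and a three- or four-digit year, and the name check only tests that a surname is enclosed in a single pair of forward slashes.

```python
import re

# English month abbreviations used in GEDCOM 5.5 dates.
MONTHS = ("JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC")

# Assumed shape for this sketch: day, month abbreviation, year.
_EXACT_DATE = re.compile(r"^(\d{1,2}) ([A-Za-z]{3}) (\d{3,4})$")

# Exactly one slash-delimited, non-empty surname somewhere in the name.
_SLASHED_SURNAME = re.compile(r"^[^/]*/[^/]+/[^/]*$")


def normalize_exact_date(value):
    """Return the date with an uppercase month, or None if it does not fit."""
    match = _EXACT_DATE.match(value.strip())
    if not match:
        return None
    day, month, year = match.group(1), match.group(2).upper(), match.group(3)
    if month not in MONTHS or not 1 <= int(day) <= 31:
        return None
    return f"{day} {month} {year}"


def has_slashed_surname(name):
    """Report whether the name contains a surname delimited by slashes."""
    return bool(_SLASHED_SURNAME.match(name))
```

Uppercasing the month before validation sidesteps the case-sensitivity limitation for exact dates, and the name check can flag NAME values that lack the slash-delimited surname. Neither function checks calendar validity (for example, 31 FEB is accepted).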
If a GEDCOM 5.5 file is converted into a GEDCOM 5.5 XML document, the RELAX NG/Schematron schema contained in gedcom55XML.rng provides a way to test the validity, as described above, of the XML document.
The instructions below describe how to do so using several external programs available on the internet. The instructions follow several conventions:
family.ged represents a GEDCOM 5.5 file.
family.xml represents the
family.ged file translated into a GEDCOM 5.5 XML document.
Text sandwiched between brackets [ ] indicates variables that depend upon your computer's environment.
In rough outline, the process has two steps:
Convert the GEDCOM 5.5 file to a GEDCOM 5.5 XML document.
Validate the GEDCOM 5.5 XML document using
Currently there is no quick and easy way to convert a GEDCOM 5.5 (or .ged) file into a GEDCOM 5.5 XML (or .xml) file. I know of only one way to accomplish it at present, and it requires some programming skills: a Java program released by Michael Kay, used together with his well-known
saxon XML parser.
Download Kay's source code and unzip it in a location of your choosing. Remember the path to this location. You will need to know this path in a later step, and it will be referred to using the variable
Find the files called “
GedcomParser.java” and “
GedcomToXml.xsl” in Kay's source code.
Compile the file
GedcomParser.java using your favorite Java distribution's compiler -
javac. This will produce a class file called “
GedcomParser.class”. The command is as follows:
parser, install it, and remember its location, which will be referred to below using the
[path-to-saxon] variable. (It may already be installed on your system; on my Debian GNU/Linux system it was located at
family.xml by issuing the following command in your terminal:
java -cp [path-to-saxon]/saxon.jar:[path-to-gedml-classes] com.icl.saxon.StyleSheet -x GedcomParser -o family.xml family.ged [path-to-gedml-classes]/GedcomToXml.xsl
The output of this command,
family.xml, will be an XML version of
family.xml will be a near-perfect translation of a GEDCOM 5.5 file into GEDCOM 5.5 XML. The only problem with the output is that it lacks the <TRLR/> element before the </GED> closing tag at the end of the document. Kay's GedcomParser removes the <TRLR/> element because the parser was intended for a version of GedML.
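One way to restore the missing element is to append it after the conversion. The Python sketch below is a hypothetical helper, not part of Kay's code or of this release: the function names are invented for illustration, and it assumes that the trailer belongs as the last child of the root element and should share the root's namespace, if the root has one.

```python
import xml.etree.ElementTree as ET


def ensure_trailer(root):
    """Append a TRLR element as the last child of root if it is missing.

    Returns True when an element was added, False when one was already there.
    """
    # Reuse the root's namespace (if any) so the new element matches it.
    namespace = ""
    if root.tag.startswith("{"):
        namespace = root.tag[:root.tag.index("}") + 1]
    trailer_tag = namespace + "TRLR"
    if len(root) and root[-1].tag == trailer_tag:
        return False
    ET.SubElement(root, trailer_tag)
    return True


def add_trailer_to_file(path):
    """Rewrite the XML file at path in place when the trailer is missing."""
    tree = ET.parse(path)
    if ensure_trailer(tree.getroot()):
        # ElementTree may emit generated namespace prefixes (ns0, ...);
        # the document stays equivalent, but its text will look different.
        tree.write(path, encoding="utf-8", xml_declaration=True)
```

Calling add_trailer_to_file("family.xml") would rewrite the file as UTF-8, so keep a copy of the original if its encoding or layout matters.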
Validating an XML file with a RELAX NG/Schematron schema is a complicated process. The complication arises because, to fully validate your GEDCOM 5.5 XML file using gedcom55XML.rng, both a RELAX NG validator and a Schematron validator must be used. (The entire process is described here.) Two methods for validation are described below.
One way of validating a GEDCOM 5.5 XML file with gedcom55XML.rng is by using Topologi's Open Source Schematron Java classes. Their zip file here contains all the binaries and scripts needed to validate an XML file with Schematron rules embedded in RELAX NG patterns. To do so, type the following command in the directory called “Schematron” that is produced after unzipping the downloaded file above:
java -cp ./Saxon/saxon.jar:./Java/Schematron.jar:./Jing/jing.jar com.topologi.schematron.EmbRNGValidator family.xml gedcom55XML.rng
When this command detects errors in
family.xml using the RELAX NG schema, it stops at the first error and indicates the line it is on. You will need to correct the XML markup of
family.xml (and the corresponding tags in
family.ged), and then run the command above again to find the remaining errors using the RELAX NG schema.
After the command has finished checking the document against the RELAX NG schema, it proceeds to check the Schematron rules. When it finds an error, it does not tell you which line has the problem; it shows only the text of the problematic line.
In either the case of RELAX NG errors or Schematron errors, the command does not tell you what exactly is wrong with the line. It is up to you to figure out why the line is incorrect by referring to the LDS's GEDCOM 5.5 standard.
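The command above joins the classpath entries with a colon, which is the Unix convention; Java on Windows expects a semicolon instead. The following Python sketch is a hypothetical wrapper that assembles the same command with the platform's separator. The function names and the TOPOLOGI_JARS constant are invented for illustration; the jar paths and class name are copied from the command above, and nothing is assumed about the validator's exit codes or output format.

```python
import os
import subprocess

# Jar locations relative to the unzipped "Schematron" directory,
# copied from the command line shown above.
TOPOLOGI_JARS = ("./Saxon/saxon.jar", "./Java/Schematron.jar", "./Jing/jing.jar")


def build_validator_command(xml_file, schema_file,
                            main_class="com.topologi.schematron.EmbRNGValidator",
                            pathsep=os.pathsep):
    """Return the java command as an argument list for the current platform."""
    classpath = pathsep.join(TOPOLOGI_JARS)
    return ["java", "-cp", classpath, main_class, xml_file, schema_file]


def run_validator(xml_file, schema_file, **options):
    """Run the validator and return its exit status and combined output."""
    command = build_validator_command(xml_file, schema_file, **options)
    completed = subprocess.run(command, capture_output=True, text=True)
    return completed.returncode, completed.stdout + completed.stderr
```

Run from the “Schematron” directory, run_validator("family.xml", "gedcom55XML.rng") would issue the same command as above on a Unix system. Passing a different main_class lets the same helper start another class from the same jars.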
Validate the GEDCOM 5.5 XML file using the RELAX NG portion of the schema.
RELAX NG validation can be performed by an open source Java program called “
jing”. This command-line application can be downloaded here.
supposedly can validate an XML file against the Schematron rules embedded in the RELAX NG schema. However, I have been unable to get
jing to recognize the Schematron patterns in
To perform the RELAX NG validation, type the following command line:
java -jar [path-to-jing]/jing.jar gedcom55XML.rng family.xml
The results are written to the terminal's standard output. jing will report all the errors that it encounters. It will not report exactly what is wrong, but it will report which lines have errors in them. If jing reports no errors, your GEDCOM 5.5 XML document is valid, both in terms of its structure and most of its content.
Perform the Schematron validation.
Step one does not test the mixed-content inherent in the GEDCOM 5.5 (and GEDCOM 5.5 XML) object model. To test the mixed-content elements, the XML document must be tested against the Schematron rules in
gedcom55XML.rng. It takes two steps to do so.
The Schematron rules must first be extracted from the RELAX NG schema using the style sheet,
RNG2Schtrn.xsl, available here. For the sake of convenience, I have extracted the Schematron rules from
gedcom55XML.rng. They are in the file called “
gedcom55XML.sch”. I am distributing this source code under the GNU General Public License Version 2.
family.xml file must be passed through the Schematron rules. To do so, we will use Topologi's Schematron Java classes again, the same ones used in Method 1 above. Go to the “Schematron” directory again and type the following command:
java -cp ./Saxon/saxon.jar:./Java/Schematron.jar:./Jing/jing.jar com.topologi.schematron.SchtrnValidator family.xml gedcom55XML.sch
If this command reveals any errors in
family.xml, it will not tell you the line it occurs on, but it will indicate the data that fails to pass the Schematron rules.
This document is released under the GNU Free Documentation License Version 1.2. The full text of this license is found in the file called “
fdl.txt” released with
gedcom55XML-0.1.tar.gz. It can also be located at
Please direct questions or requests for more information to
<email@example.com>. Corrections, suggestions, bug reports, and patches are welcome as well.