RELAX NG/Schematron Schema for GEDCOM 5.5 XML (version 0.1)

Chad Albers


Table of Contents

Background
GEDCOM 5.5 XML
GedML
GeniML
RELAX NG/Schematron Schema for GEDCOM 5.5 XML
GEDCOM 5.5 XML URI
Schema Versions and Download Locations
Schema License
Schema Updates
Schema Limitations
Using gedcom55XML.rng to Validate a GEDCOM 5.5 XML Document
Convert the GEDCOM 5.5 File to a GEDCOM 5.5 XML Document
Validate the GEDCOM 5.5 XML Document
Documentation License
Contact

Background

The Church of the Latter-day Saints' GEDCOM 5.5 Standard provides a way to format, digitally store, and transfer genealogical data in a standardized, human readable text file. Over the years, new standards, using the Extensible Markup Language (XML), have been proposed. The web page here provides a list of these proposals.

None of these proposals have gained widespread acceptance, and GEDCOM 5.5 remains the format that nearly all genealogical programs export to and that users still exchange their genealogical data using.

GEDCOM 5.5 XML

XML proponents can console themselves with the fact that the GEDCOM 5.5 standard bares some resemblance to a XML markup language. Despite its non-standard character set (Ansel), it is human readable text. Instead of XML elements, it delimits genealogical data between a text tag and character return and line feed or both; the tag precedes the data. Finally, GEDCOM 5.5 tags can be nested under other tags, resembling the parent and child node trees of XML elements.

Given these similarities, it is easy to envision an one-to-one translation of GEDCOM 5.5 into a XML language. GEDCOM 5.5 tags could be XML elements; XML's greater than ’>’ and less than ’<’ characters could surround the tags, converting them into elements. Open and closed elements could delimit genealogical data. To differentiate it from other proposals, such a XML markup language could be called “GEDCOM 5.5 XML.”

Two XML markup languages have come close to describing GEDCOM 5.5 XML: GedML and GeniML.

GedML

In 1998, Michael H. Kay proposed GedML. A XML Schema, a DTD, for this proposal was released. In GedML's DTD, most of the GEDCOM 5.5 tags have been translated into XML elements with the same name; e.g., INDI tags are simply <INDI> elements. Other tags, though, have been translated; e.g., SOUR tags are <Source> elements. GedML is also missing the <TRLR> element which replicates the GEDCOM 5.5 TRLR tag. This tag indicates the end of a GEDCOM 5.5 file. It is absent from GedML because the closed document root element, </GED>, performs the same function, rendering a <TRLR/> element superfluous. Considering these differences, as far as the DTD is concerned, GedML is not an one-to-one translation of GEDCOM 5.5 into XML.

Kay, however, also released a Java language program that translates GEDCOM 5.5 files into XML documents. The markup language of these documents does not have a XML schema (DTD, W3C XML Schema, RELAX NG). However, a review of the output of his program reveals that all GEDCOM 5.5 tags have been translated to XML elements with the same name, just like GEDCOM 5.5 XML. There is, though, one missing element: the one corresponding to the TRLR tag. Given this flaw, this version of GedML also fails to be an one-to-one translation of GEDCOM 5.5 into XML.

GeniML

Jerry Fitzpatrick of Software Renovation Corporation developed GeniML. Like Kay, he released a Windows program to convert GEDCOM 5.5 to XML. No schema for this markup language has been released. However, sample output included with the converter shows that GeniML appears to be an one-to-one translation of GEDCOM 5.5 into XML. GeniML even includes the <TRLR> element. There are, however, some differences; for example, GeniML delimits surname data using a <SURNAME> element, when GEDCOM 5.5 uses a SURN tag for the same purpose. Considering this difference and many others, GeniML, like GedML, is not an one-to-one translation of GEDCOM 5.5 into XML markup.

RELAX NG/Schematron Schema for GEDCOM 5.5 XML

GEDCOM 5.5 XML differentiates itself from GedML and GeniML because it attempts to replicate the LDS's GEDCOM 5.5 standard using XML markup. Without exception, all GEDCOM 5.5 tags should correspond to XML elements with the same name; all tags should be preserved; the parent-child relationships between the tags and elements should parallel one another; and all data delimited by the elements should fall within the strict guidelines of the standard.

To define GEDCOM 5.5 XML, a schema is needed. I am releasing one written in RELAX NG/Schematron. The RELAX NG XML version of the schema is contained in the file called “gedcom55XML.rng”. The RELAX NG compact syntax version is in the file called “gedcom55XML.rnc”.

Like any XML schema, it can be used to validate, in the XML sense, a GEDCOM 5.5 file that has been converted to a GEDCOM 5.5 XML document; i.e., it can be used to determine if the document is both “well-formed” and “valid.” This means that it can be used to test if the GEDCOM 5.5 tags (when converted to XML elements) are correctly nested within each other as prescribed by the LDS's GEDCOM 5.5 standard. It can also be used to test if the data, delimited by the elements, follows the strict guidelines prescribed by the standard.

GEDCOM 5.5 XML URI

http://www.neomantic.com/gedcom55XML is the namespace for the root element of a GEDCOM 5.5 XML document.

Schema Versions and Download Locations

The schema described in this document is version 0.1. It was originally written in RELAX NG's XML markup.

Schema License

The source code for both gedcom55XML.rng and gedcom55XML.rnc is released under the GNU General Public License Version 2 (GPL). The full text of this license can be found in a file called “gpl-2.0” in gedcom55XML-0.1.tar.gz.

Schema Updates

Hyperlinks to the most up-to-date version of the schema are located at http://www.neomantic.com/gedcom55XML.

Schema Limitations

Ideally, the GEDCOM 5.5 XML RELAX NG/Schematron schema in gedcom55XML.rng would completely mirror the GEDCOM 5.5 standard. There are, however, nine places where it does not fully replicate the standard. This is why the version number of gedcom55XML.rng/c is “0.1”. When the schema captures 100 percent of the standard, the version number will be incremented to “1.0”.

  1. The schema does not describe valid content of a GEDCOM 5.5 DATE_VALUE. The possible combinations of DATE_VALUEs are simply too numerous to handle using Schematron alone. The schema only prescribes that the length of character data of a DATE_VALUE be between 1 and 35 characters.

  2. The schema does not describe all instances of EXACT_DATE data. It fails to describe the EXACT_DATE values of TRANSMISSION_DATE and CHANGE_DATE. When the Schematron rule is able to describe the EXACT_DATE value, it only uses the English MONTH abbreviations; i.e., JAN, FEB, MAR, etc..

  3. Normally, the regions in the PLAC tag are separated by commas (e.g., Kansas City, Jackson County, Missouri, USA) or, alternatively, the delimiter specified in the HEAD.PLAC.FORM tag. gedcom55XML.rng ignores the delimiter of PLAC data.

  4. The given name and surname of an individual are usually delimited in the NAME tag by forward slashes (e.g., Joseph/Smith/). The Schematron rules do not prescribe this format.

  5. GEDCOM 5.5 allows for user-defined tags, even though it discourages their usage. The schema does not allow user-defined tags.

  6. The schema does not describe the content of SOURCE_CITATION's EVEN tag's EVENT_TYPE_CITED_FROM, even though the GEDCOM 5.5 standard specifies a restricted set of values.

  7. It also does not describe the content of SOURCE_RECORD's EVEN tag's EVENTS_RECORDED, even though again the GEDCOM 5.5 standard specifies a restricted set of values.

  8. The AGE tag permits values such as CHILD, INFANT, STILLBORN, along with 'y','m','d' which respectively signify year, month, and day. Following the specification strictly, these values should be case insensitive. In gedcom55XML.rng, they are case sensitive, defaulting to the lowercase values.

  9. The MONTH values of DATE_EXACT should also be case insensitive, but in the current schema they are not. The defaults are the upper-case values.

Most of these shortcoming are due to difficulties in describing what would be considered, in XML terms, mixed-content elements.

Using gedcom55XML.rng to Validate a GEDCOM 5.5 XML Document

If a GEDCOM 5.5 file is converted into a GEDCOM 5.5 XML document, the RELAX NG/Schematron schema contained in gedcom55XML.rng provides a way to test the validity, as describe above, of the XML document.

The instructions below describe how to do so using several external programs available on the internet. The instructions follow several conventions:

  • family.ged represents a GEDCOM 5.5 file.

  • family.xml represents the family.ged file translated into a GEDCOM 5.5 XML document.

  • Text sandwiched between brackets [ ] indicates variables that depend upon your computer's environment.

In rough outline, the process has two steps:

  1. Convert the GEDCOM 5.5 file to a GEDCOM 5.5 XML document.

  2. Validate the GEDCOM 5.5 XML document using gedcom55XML.rng.

Convert the GEDCOM 5.5 File to a GEDCOM 5.5 XML Document

Currently there is no quick and easy way to convert a GEDCOM 5.5 (or a .ged) file into a GEDCOM 5.5 XML (or .xml) file. I know of only one way to accomplish it now, and it requires some programming skills. To perform the conversion, you will need to use a Java program released by Michael Kay which, used with his famous saxon XML parser, will perform the conversion.

  1. Download Kay's source code and unzip it in a location of your choosing. Remember the path to this location. You will need to know this path in a later step, and it will be referred to using the variable [path-to-gedml-classes].

  2. Find the files called “GedcomParser.java” and “GedcomToXml.xsl” in Kay's source code.

  3. Compile the file GedcomParser.java using your favorite Java distribution's compiler - javac. This will produce a class file called “GedcomParser.class”. The command is as follows:

    javac GedcomParser.java
    
  4. Download Kay's saxon parser, install it, and remember its location, which will be referred to below using the [path-to-saxon] variable. (It may already be installed on your system; on my Debian GNU/Linux system it was located at /usr/share/java/saxon.jar.)

  5. Convert family.ged to family.xml by issuing the following command in your terminal:

    java -cp [path-to-saxon]/saxon.jar:[path-to-gedml-classes] com.icl.saxon.StyleSheet -x GedcomParser -o family.xml family.ged [path-to-gedml-classes]/GedcomToXml.xsl
    
  6. The output of this command, family.xml, will be a XML version of family.ged. family.xml will be a near perfect reproduction of a GEDCOM 5.5 into GEDCOM 5.5 XML. The only problem with the output is that it fails to add the <TRLR/> element before the </GED> element at the end of the document. Kay's GedcomParser removes the <TRLR/> element, because the parser was intended for a version of GedML.

Validate the GEDCOM 5.5 XML Document

Validating a XML file with a RELAX NG/Schematron schema is a complicated process. The complication arises because, to fully validate your GEDCOM 5.5 XML file using gedcom55XML.rng, a RELAX NG validator and a Schematron validator must be used. (The entire process is described here). Two methods for validation are described below.

Method 1 - Simultaneously Validation

One way of validating a GEDCOM 5.5 XML file with gedcom55XML.rng is by using Topologi's Open Source Schematron Java classes. Their zip file here contains all the binaries and scripts needed to validate a XML file with Schematron rules embedded in RELAX NG patterns. To do so, type the following command in the directory called “Schematron” that is produced after unzipping the downloaded file above:

java -cp ./Saxon/saxon.jar:./Java/Schematron.jar:./Jing/jing.jar com.topologi.schematron.EmbRNGValidator family.xml gedcom55XML.rng

When this command detects errors in family.xml using the RELAX NG schema, it stops at the first error and indicates the line it is on. You will need to correct the XML markup of family.xml (and the corresponding tags in family.ged), and then run the command above again to find the remaining errors using the RELAX NG schema.

After the command above reveals the error checked against the RELAX NG schema, it then proceeds to check the Schematron rules. When it finds an error, it will not tell you which line has a problem, but only the text of the problematic line.

In either the case of RELAX NG errors or Schematron errors, the command does not tell you what exactly is wrong with the line. It is up to you to figure out why the line is incorrect by referring to the LDS's GEDCOM 5.5 standard.

Method 2 - Sequentially Validation

  1. Validate the GEDCOM 5.5 XML file using the RELAX NG portion of the schema.

    RELAX NG validation can be performed by a Open Source Java program called “jing”. This command line application can be downloaded here. jing supposedly can validate a XML file against the Schematron rules embedded in the RELAX NG schema. However, I have been unable to get jing to recognized the Schematron patterns in gedcom55XML.rng.

    To perform the RELAX NG validation, type the following command line:

    java -jar [path-to-jing]/jing.jar gedcom55XML.rng family.xml
    

    The results will pipe to the terminal's standard output. jing will report all the errors that it encounters. It will not report exactly what is wrong, but it will report what lines have errors in them. If jing reports no errors, your GEDCOM 5.5 XML document is valid, both in terms of its structure and most of its content.

  2. Perform the Schematron validation.

    Step one does not test the mixed-content inherent in the GEDCOM 5.5 (and GEDCOM 5.5 XML) object model. To test the mixed-content elements, the XML document must be tested against the Schematron rules in gedcom55XML.rng. It takes two steps to do so.

    1. The Schematron rules must first be extracted from the RELAX NG schema using the style sheet, RNG2Schtrn.xsl, available here. For the sake of convenience, I have extracted the Schematron rules from gedcom55XML.rng. They are in the file called “gedcom55XML.sch”. I am distributing this source code under the GNU General Public License Version 2.

    2. The family.xml file must be passed through the Schematron rules. To do so, we will use Topologoi's Schematron Java classes again, the same ones used in Method 1 above. Go to the “Schematron” directory again and type the following command:

      java -cp ./Saxon/saxon.jar:./Java/Schematron.jar:./Jing/jing.jar com.topologi.schematron.SchtrnValidator family.xml gedcom55XML.sch
      

    If this command reveals any errors in family.xml, it will not tell you the line it occurs on, but it will indicate the data that fails to pass the Schematron rules.

Documentation License

This document is released under the GNU Free Documentation License Version 1.2. The full text of this license is found in the file called “fdl.txt” released with gedcom55XML-0.1.tar.gz. It can also be located at http://www.neomantic.com/gedcom55XML/0.1/README.html .

Contact

Please direct questions or requests for more information to . Corrections, suggestions, bug reports, and patches are welcome as well.

5/1/2008