GEDCOM 5.5 XML

The genealogical markup language that I call "GEDCOM 5.5 XML", and the XML schema that I wrote to describe it, are the result of a personal project. I had a GEDCOM file full of my family's history. It was produced by the Open Source program Lifelines, which requires users to enter genealogical data manually, line by GEDCOM line, and which performs no error checking. When my GEDCOM file had grown to 20,000 lines, I decided to clean it up and correct all its syntax errors.

I knew that if I could convert my file into an XML document, I could validate it, in the XML sense, and find all the syntax errors. I simply needed to run the XML file through a validating parser, testing it against a DTD description of the markup language.

I thought I had found my solution in GedML, a genealogical XML markup language proposed by Michael H. Kay of Saxon fame. He released a DTD that describes this markup language. He also released a Java program that converts a GEDCOM file into an XML document. The DTD, however, did not describe the markup language produced by the converter, so it couldn't be used to validate a converted GEDCOM file.

The markup language produced by his converter, though, is quite simple: all GEDCOM tags are translated into XML elements; opening and closing tags delimit the data; and the elements are nested in the same way prescribed by the GEDCOM specification. The markup diverges from the specification with respect to the TRLR tag. Kay's converter eliminates this tag, which indicates the end of a GEDCOM file. Kay probably considered this element unnecessary because the closing tag of the XML document's root element serves the same purpose.
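
The mapping described above can be illustrated with a short Python sketch. This is not Kay's Java converter, only a minimal approximation of the same idea. The function name gedcom_to_xml, the root element name "GED", the "ID" attribute used for cross-reference identifiers, and the keep_trailer switch are all assumptions made for the illustration. Each GEDCOM line becomes an element named after its tag, and the level number at the start of the line decides where the element is nested.

```python
import xml.etree.ElementTree as ET


def gedcom_to_xml(lines, keep_trailer=True):
    """Nest GEDCOM lines into an XML tree according to their level numbers."""
    root = ET.Element("GED")  # hypothetical root element name
    stack = [root]            # stack[n] is the parent of a level-n line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        parts = line.split(" ", 2)
        level = int(parts[0])
        if parts[1].startswith("@"):
            # Record line such as "0 @I1@ INDI": the identifier precedes the tag.
            xref = parts[1].strip("@")
            rest = parts[2].split(" ", 1)
            tag = rest[0]
            value = rest[1] if len(rest) > 1 else ""
        else:
            xref = None
            tag = parts[1]
            value = parts[2] if len(parts) > 2 else ""
        if tag == "TRLR" and not keep_trailer:
            continue  # drop the trailer, as the text says Kay's converter does
        del stack[level + 1:]  # close any elements deeper than this level
        elem = ET.SubElement(stack[level], tag)
        if xref:
            elem.set("ID", xref)  # assumed attribute name, for illustration only
        if value:
            elem.text = value
        stack.append(elem)
    return root


sample = [
    "0 HEAD",
    "0 @I1@ INDI",
    "1 NAME John /Smith/",
    "1 BIRT",
    "2 DATE 1 JAN 1900",
    "0 TRLR",
]
print(ET.tostring(gedcom_to_xml(sample), encoding="unicode"))
```

Calling gedcom_to_xml(sample, keep_trailer=False) leaves out the TRLR element. The sketch assumes well-formed input: a line whose level skips ahead of its parent raises an IndexError rather than being repaired.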

All I needed, then, was an XML schema that described the markup produced by the converter. I ruled out a DTD; it couldn't capture all of the restrictions prescribed by the GEDCOM specification. I wrote, instead, an XML schema in RELAX NG and embedded Schematron rules in it.

The resulting schema describes almost all of the GEDCOM 5.5 specification in XML markup. The few places where it falls short are noted in the documentation below. Because of these shortcomings, I have designated it version "0.1". When the schema describes 100 percent of the GEDCOM 5.5 specification, the version number will be incremented to "1.0".

I call the XML markup described by the schema "GEDCOM 5.5 XML". This name differentiates it from Kay's GedML, with its limited DTD and with its implementation that leaves out the TRLR tag. GEDCOM 5.5 XML attempts to be a 100 percent one-to-one translation of GEDCOM 5.5 into XML; it even includes the superfluous (and empty) <TRLR/> element.

The source code for the schema and its documentation can be found using the hyperlinks below. All source code is released under the GNU General Public License Version 2.

GEDCOM 5.5 XML version 0.1