In order to parse XML, you need to know the structure of your documents. Usually this is given in the form of a DTD or schema. The Sports Network does provide DTDs for their XML, but they're wrong!

So, what can we do? The easiest option would be to guess and pray. But we need to construct a database into which to insert the XML. How do we know whether a given element should be a column, or whether it should have its own table? We need to know how many times it can appear. So we need some form of specification, and reading all of the XML files one at a time to count element occurrences is impractical. Instead, we would like to generate the DTDs automatically. The process goes something like this:

1. Generate a DTD from the first foo.xml file we see. Call it foo.dtd.

2. Validate future foo documents against foo.dtd. If they all validate, great. If one fails, add it to the corpus and update foo.dtd so that both the original samples and the new foo.xml validate.

3. Repeat until no more failures occur.

This can never be perfect: tomorrow we could get a foo.xml that's wildly different from anything we've seen in the past. But it's the best we can hope for under the circumstances.

Enter XML-Schema-learner. This tool can infer a DTD from a set of sample XML files. The top-level "schemagen" folder of this project contains a number of subfolders -- one for each type of document that we want to parse. Each subfolder contains the XML samples for that document type. These were hand-picked one at a time according to the procedure above, and the complete set is what we use to generate the DTDs used by htsn-import.

To generate them, run `make schema` at the project root. XML-Schema-learner will be invoked on each subfolder of "schemagen" and will output the corresponding DTDs to the "schemagen" folder. Most of the production schemas are generated this way; however, a few needed manual tweaking. Any hand-modified schemas can be found in the "schema" folder in the project root.
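The validate-or-extend loop above can be sketched at the command line. This is a minimal illustration, not the project's actual tooling: it assumes `xmllint` (from libxml2) is available, and the file names and tiny DTD are invented for the demo. The real corpus lives under "schemagen" and the regeneration step is `make schema`.

```shell
# Hypothetical sample DTD and document (illustrative names only).
cat > foo.dtd <<'EOF'
<!ELEMENT foo (bar)>
<!ELEMENT bar (#PCDATA)>
EOF

cat > foo.xml <<'EOF'
<foo><bar>hello</bar></foo>
EOF

# Step 2 of the procedure: validate a foo document against foo.dtd.
# On failure, the document would be added to the schemagen corpus and
# the DTD regenerated (in this project, via `make schema`).
if xmllint --noout --dtdvalid foo.dtd foo.xml; then
  echo "valid"
else
  echo "invalid -- add foo.xml to the corpus and regenerate foo.dtd"
fi
```

In the real workflow the "regenerate" branch is what grows the sample set: each failing document is kept, so the inferred DTD only ever becomes more general.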