In order to parse XML, you need to know the structure of your documents. Usually this is given in the form of a DTD or schema. The Sports Network does provide DTDs for their XML, but they're wrong!

So, what can we do? The easiest option would be to guess and pray. But we need to construct a database into which to insert the XML. How do we know whether a given element should be a column, or whether it should have its own table? We need to know how many times it can appear. So we need some form of specification, and reading all of the XML files one at a time to count element occurrences is impractical. Instead, we would like to generate the DTDs automatically. The process goes something like this:

1. Generate a DTD from the first foo.xml file we see. Call it foo.dtd.

2. Validate future foo documents against foo.dtd. If they all validate, great. If one fails, add it to the corpus and update foo.dtd so that both the original samples and the new foo.xml validate.

3. Repeat until no more failures occur.

This can never be perfect: tomorrow we could get a foo.xml that's wildly different from anything we've seen in the past. But it's the best we can hope for under the circumstances.

Enter XML-Schema-learner. This tool can infer a DTD from a set of sample XML files. The top-level "schemagen" folder of this project contains a number of subfolders -- one for each type of document that we want to parse. Each subfolder contains the XML samples for that document type. These were hand-picked one at a time according to the procedure above, and the complete set is what we use to generate the DTDs used by htsn-import.

To generate them, run `make schema` at the project root. XML-Schema-learner will be invoked on each subfolder of "schemagen" and will output the corresponding DTDs to the "schemagen" folder. Most of the production schemas are generated this way; however, a few needed manual tweaking. Any hand-modified schemas can be found in the "schema" folder in the project root.
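The validate-or-extend loop above can be sketched at the command line. This is a minimal illustration, not the project's actual tooling: it assumes `xmllint` (from libxml2) is available, and the file names and tiny DTD are invented for the demo. The real corpus lives under "schemagen" and the regeneration step is `make schema`.

```shell
# Hypothetical sample DTD and document (illustrative names only).
cat > foo.dtd <<'EOF'
<!ELEMENT foo (bar)>
<!ELEMENT bar (#PCDATA)>
EOF

cat > foo.xml <<'EOF'
<foo><bar>hello</bar></foo>
EOF

# Step 2 of the procedure: validate a foo document against foo.dtd.
# On failure, the document would be added to the schemagen corpus and
# the DTD regenerated (in this project, via `make schema`).
if xmllint --noout --dtdvalid foo.dtd foo.xml; then
  echo "valid"
else
  echo "invalid -- add foo.xml to the corpus and regenerate foo.dtd"
fi
```

In the real workflow the "regenerate" branch is what grows the sample set: each failing document is kept, so the inferred DTD only ever becomes more general.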