In order to parse XML, you need to know the structure of your
documents. Usually this is given in the form of a DTD or schema. The
Sports Network does provide DTDs for their XML, but they're wrong! So,
what can we do?

The easiest option would be to guess and pray. But we need to
construct a database into which to insert the XML. How do we know if
<game> should be a column, or if it should have its own table? We need
to know how many times it can appear. So we need some form of
specification. And reading all of the XML files one at a time to count
the number of <game>s is impractical. So, we would like to generate
the DTDs automatically.

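To make the column-versus-table question concrete, here is a rough
Python sketch (not part of htsn-import) that counts how many times
each child element appears under a single parent across a handful of
samples. Only <game> comes from the text above; the <message> and
<league> elements and the sample documents are made up for
illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def max_child_counts(xml_documents):
    """Map (parent_tag, child_tag) to the largest number of times the
    child appeared under one parent element in any sample."""
    maxima = {}
    for text in xml_documents:
        root = ET.fromstring(text)
        for parent in root.iter():
            counts = Counter(child.tag for child in parent)
            for child_tag, n in counts.items():
                key = (parent.tag, child_tag)
                maxima[key] = max(maxima.get(key, 0), n)
    return maxima


# Hypothetical samples; real feed documents are much larger.
samples = [
    "<message><league>NFL</league><game/><game/></message>",
    "<message><league>NBA</league><game/></message>",
]

counts = max_child_counts(samples)

# A child seen at most once per parent could be a column; a child
# seen more than once needs its own table.
for (parent, child), n in sorted(counts.items()):
    print(parent, child, "column" if n == 1 else "table")
```

This only answers the question for the samples it has seen, which is
exactly the limitation that the rest of this document works around.
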
The process should go something like this:

  1. Generate a DTD from the first foo.xml file we see. Call it
     foo.dtd.

  2. Validate future foo documents against foo.dtd. If they all
     validate, great. If one fails, add it to the corpus and update
     foo.dtd so that the new foo.xml validates along with everything
     already in the corpus.

  3. Repeat until no more failures occur. This can never be perfect:
     tomorrow we could get a foo.xml that's wildly different from what
     we've seen in the past. But it's the best we can hope for under
     the circumstances.

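The loop in steps 1-3 can be sketched in Python as follows. This is
only an illustration under loud assumptions: the "schema" here is a
toy (the set of parent/child tag pairs seen so far), and the infer()
and validates() functions are hypothetical stand-ins for real DTD
inference and DTD validation. The foo documents are invented.

```python
import xml.etree.ElementTree as ET


def structure(text):
    """The set of (parent, child) tag pairs occurring in a document."""
    root = ET.fromstring(text)
    return {(p.tag, c.tag) for p in root.iter() for c in p}


def infer(corpus):
    """Toy stand-in for DTD inference: allow every pair seen so far."""
    schema = set()
    for text in corpus:
        schema |= structure(text)
    return schema


def validates(text, schema):
    """Toy stand-in for DTD validation."""
    return structure(text) <= schema


def grow_corpus(incoming):
    """Keep only the documents that forced the schema to change."""
    corpus, schema = [], None
    for text in incoming:
        # Step 1 (no schema yet) or step 2 (validation failure).
        if schema is None or not validates(text, schema):
            corpus.append(text)
            schema = infer(corpus)
    return corpus, schema


# Hypothetical foo documents arriving over time.
incoming = [
    "<foo><a/></foo>",
    "<foo><a/><a/></foo>",  # same pairs as the first, so it validates
    "<foo><a/><b/></foo>",  # new child <b>, so it joins the corpus
]

corpus, schema = grow_corpus(incoming)
```

Unlike a DTD, the toy schema ignores how many times a child may
appear, so the second document passes even though it has two <a>
elements. A real DTD would distinguish those cases.
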
Enter XML-Schema-learner. This tool can infer a DTD from a set of
sample XML files. The top-level "schemagen" folder (in this project)
contains a number of subfolders -- one for each type of document that
we want to parse. Contained therein are XML samples for that
particular document type. These were hand-picked one at a time
according to the procedure above, and the complete set of XML is what
we use to generate the DTDs used by htsn-import.

To generate them, run `make schema` at the project root.
XML-Schema-learner will be invoked on each subfolder of "schemagen"
and will output the corresponding DTDs to the "schemagen" folder.

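As a rough picture of the layout that `make schema` works over, the
Python sketch below pairs each document-type subfolder with the DTD
that would be generated from it. It does not invoke
XML-Schema-learner. The output naming (schemagen/<type>/ producing
schemagen/<type>.dtd) is an assumption, and the sample file name is
hypothetical.

```python
import tempfile
from pathlib import Path


def planned_outputs(schemagen_dir):
    """Pair each document-type subfolder with its samples and the
    DTD path we assume would be generated for it."""
    root = Path(schemagen_dir)
    plan = {}
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        samples = sorted(sub.glob("*.xml"))
        # Assumed naming: schemagen/<type>/ -> schemagen/<type>.dtd
        plan[root / (sub.name + ".dtd")] = samples
    return plan


# Build a throwaway layout so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "schemagen"
    (root / "weatherxml").mkdir(parents=True)
    (root / "weatherxml" / "sample1.xml").write_text("<message/>")

    plan = planned_outputs(root)
    summary = {dtd.name: [s.name for s in xs] for dtd, xs in plan.items()}

print(summary)
```
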
Most of the production schemas are generated this way; however, a few
needed manual tweaking. The final, believed-to-be-correct schemas for
all supported document types can be found in the "schema" folder in
the project root. Because those "correct" DTDs ship with the project,
you don't need XML-Schema-learner to install htsn-import.

As explained in the man page, there is a second type of weatherxml
document that we don't parse at the moment. An example is provided as
schemagen/weatherxml/20143655.xml.