X-Git-Url: http://gitweb.michael.orlitzky.com/?a=blobdiff_plain;f=doc%2Fman1%2Fhtsn-import.1;h=79febf08bdee9ab1e35232b2c1dc935a57579729;hb=bde3a534d56efa328b36b004a1d578726d89ff78;hp=66f7ae5d1e769b23357f9fcdea9aaf1beb69cc83;hpb=16d86e7a3c1eda08b91752f92510a1de0b952a17;p=dead%2Fhtsn-import.git

diff --git a/doc/man1/htsn-import.1 b/doc/man1/htsn-import.1
index 66f7ae5..79febf0 100644
--- a/doc/man1/htsn-import.1
+++ b/doc/man1/htsn-import.1
@@ -106,6 +106,26 @@ type are provided with the \fBhtsn-import\fR documentation, in the
 should be considered a bug if they are incorrect. The diagrams are
 created using the pgModeler tool.
 
+.SH DATABASE SCHEMA COMPROMISES
+
+There are a few places that the database schema isn't exactly how we'd
+like it to be:
+
+.IP \[bu] 2
+\fIearlylineXML.dtd\fR
+
+The database representations for earlylineXML.dtd and
+MLB_earlylineXML.dtd are the same; that is, they share the same
+tables. The two document types represent team names in different
+ways. In order to accommodate both types with one parser, we had to
+make both ways optional, and then merge the two together before
+converting to the database representation.
+
+Unfortunately, when we merge two optional things together, we get
+another optional thing back. There's no way to say that \(dqat least
+one is not optional.\(dq So the team names in the database schema are
+optional as well, even though they should always be present.
+
 .SH NULL POLICY
 .P
 Normally in a database one makes a distinction between fields that
@@ -204,8 +224,8 @@ Successfully imported schemagen/Odds_XML/19996433.xml.
 Processed 1 document(s) total.
 .fi
 .P
-At this point, the database schema matches the old documents, i.e. the
-ones without \fIAStarter\fR and \fIHStarter\fR. If we use a new
+At this point, the database schema matches the old documents, that is,
+the ones without \fIAStarter\fR and \fIHStarter\fR. If we use a new
 version of \fBhtsn-import\fR, supporting the new fields, the migration
 is handled gracefully:
 .P
@@ -284,6 +304,33 @@ We don't parse this case at the moment, but we do recognize it and
 report it as unsupported so that offending documents can be removed. An
 example is provided as test/xml/newsxml-multiple-sms.xml.
 
+.IP \[bu]
+\fIMLB_earlylineXML.dtd\fR
+
+Unlike earlylineXML.dtd, this document type has more than one
+associated with each . Moreover, each has a bunch of
+ children that are supposed to be associated with the s,
+but the document structure indicates no explicit relationship. For
+example,
+
+.nf
+
+ ...
+ ...
+ ...
+ ...
+ ...
+
+.fi
+
+Here the first is inferred to apply to the two s that
+follow it, and the second applies to the single that
+follows it. But this is very fragile to parse. Instead, we use a hack
+to facilitate (un)pickling, and then drop the notes entirely during
+the database conversion.
+
+A similar workaround is implemented for Odds_XML.dtd.
+
 .IP \[bu]
 \fIOdds_XML.dtd\fR
 
@@ -333,10 +380,12 @@ in every message. A typical timestamp looks like,
 The \(dqtime zone\(dq is given as \(dqET\(dq, but unfortunately
 \(dqET\(dq is not a valid time zone. It stands for \(dqEastern
 Time\(dq, which can belong to either of two time zones, EST or EDT,
-based on the time of the year (i.e. whether or not daylight savings
-time is in effect). Since we can't tell from the timestamp, we always
-parse these as EST which is UTC-5. When daylight savings is in effect,
-they will be off by an hour.
+based on the time of the year (that is, whether or not daylight
+savings time is in effect) and one's location (for example, Arizona
+doesn't observe daylight savings time). It's not much more useful to
+be off by one hour than it is to be off by five hours, and since we
+can't determine the true offset from the timestamp, we always parse
+and store these as UTC.
 
 Here's a list of the ones that may cause surprises:
 
@@ -390,6 +439,10 @@ don't have any other information upon which to base a guess.
 Even if one ignores the UTC time zone, the time can possibly be off by
 12 hours (due to the a.m./p.m. issue).
 
+The game