X-Git-Url: http://gitweb.michael.orlitzky.com/?a=blobdiff_plain;f=doc%2Fman1%2Fhtsn-import.1;h=79febf08bdee9ab1e35232b2c1dc935a57579729;hb=bde3a534d56efa328b36b004a1d578726d89ff78;hp=07e8c32240a8710bdabf5c60bb2f1223bccde95e;hpb=7f1806d4303f413a434c7f53cc9533e30a2ffa2d;p=dead%2Fhtsn-import.git

diff --git a/doc/man1/htsn-import.1 b/doc/man1/htsn-import.1
index 07e8c32..79febf0 100644
--- a/doc/man1/htsn-import.1
+++ b/doc/man1/htsn-import.1
@@ -106,6 +106,26 @@ type are provided with the \fBhtsn-import\fR documentation, in the
 should be considered a bug if they are incorrect. The diagrams are
 created using the pgModeler <http://www.pgmodeler.com.br/> tool.
 
+.SH DATABASE SCHEMA COMPROMISES
+
+There are a few places that the database schema isn't exactly how we'd
+like it to be:
+
+.IP \[bu] 2
+\fIearlylineXML.dtd\fR
+
+The database representations for earlylineXML.dtd and
+MLB_earlylineXML.dtd are the same; that is, they share the same
+tables. The two document types represent team names in different
+ways. In order to accomodate both types with one parser, we had to
+make both ways optional, and then merge the two together before
+converting to the database representation.
+
+Unfortunately, when we merge two optional things together, we get
+another optional thing back. There's no way to say that \(dqat least
+one is not optional.\(dq So the team names in the database schema are
+optional as well, even though they should always be present.
+
 .SH NULL POLICY
 .P
 Normally in a database one makes a distinction between fields that
@@ -204,8 +224,8 @@ Successfully imported schemagen/Odds_XML/19996433.xml.
 Processed 1 document(s) total.
 .fi
 .P
-At this point, the database schema matches the old documents, i.e. the
-ones without \fIAStarter\fR and \fIHStarter\fR. If we use a new
+At this point, the database schema matches the old documents, that is,
+the ones without \fIAStarter\fR and \fIHStarter\fR. If we use a new
 version of \fBhtsn-import\fR, supporting the new fields, the migration
 is handled gracefully:
 .P
@@ -280,7 +300,36 @@ an empty <SMS> followed a non-empty one:
 <SMS></SMS>
 .nf
 
-We don't parse this case at the moment.
+We don't parse this case at the moment, but we do recognize it and report
+it as unsupported so that offending documents can be removed. An example
+is provided as test/xml/newsxml-multiple-sms.xml.
+
+.IP \[bu]
+\fIMLB_earlylineXML.dtd\fR
+
+Unlike earlylineXML.dtd, this document type has more than one <game>
+associated with each <date>. Moreover, each <date> has a bunch of
+<note> children that are supposed to be associated with the <game>s,
+but the document structure indicates no explicit relationship. For
+example,
+
+.nf
+<date>
+  <note>...</note>
+  <game>...</game>
+  <game>...</game>
+  <note>...</note>
+  <game>...</game>
+</date>
+.fi
+
+Here the first <note> is inferred to apply to the two <game>s that
+follow it, and the second <note> applies to the single <game> that
+follows it. But this is very fragile to parse. Instead, we use a hack
+to facilitate (un)pickling, and then drop the notes entirely during
+the database conversion.
+
+A similar workaround is implemented for Odds_XML.dtd.
 
 .IP \[bu]
 \fIOdds_XML.dtd\fR
@@ -314,6 +363,148 @@ report these as unsupported and then \(dqsucceed\(dq so that the
 offending document can be removed if desired. An example is provided
 as test/xml/weatherxml-backwards-teams.xml.
 
+.SH DATE/TIME ISSUES
+
+Dates and times appear in a number of places on the feed. The date
+portions are usually, fine, but the times often lack important
+information such as the time zone, or whether \(dq8 o'clock\(dq means
+a.m. or p.m.
+
+The most pervasive issue occurs with the timestamps that are included
+in every message. A typical timestamp looks like,
+
+.nf
+<time_stamp> May 24, 2014, at 04:18 PM ET </time_stamp>
+.fi
+
+The \(dqtime zone\(dq is given as \(dqET\(dq, but unfortunately
+\(dqET\(dq is not a valid time zone. It stands for \(dqEastern
+Time\(dq, which can belong to either of two time zones, EST or EDT,
+based on the time of the year (that is, whether or not daylight
+savings time is in effect) and one's location (for example, Arizona
+doesn't observe daylight savings time). It's not much more useful to
+be off by one hour than it is to be off by five hours, and since we
+can't determine the true offset from the timestamp, we always parse
+and store these as UTC.
+
+Here's a list of the ones that may cause surprises:
+
+.IP \[bu] 2
+\fIAutoRacingResultsXML.dtd\fR
+
+The <RaceDate> elements contain a full date and time, but no time zone
+information:
+
+.nf
+<RaceDate>5/24/2014 2:45:00 PM</RaceDate>
+.fi
+
+We parse them as UTC, which will be wrong when stored,
+but \(dqcorrect\(dq if the new UTC time zone is ignored.
+
+.IP \[bu]
+\fIAuto_Racing_Schedule_XML.dtd\fR
+
+The <Race_Date> and <Race_Time> elements are combined into on field in
+the database, but no time zone information is given. For example,
+
+.nf
+<Race_Date>02/16/2013</Race_Date>
+<Race_Time>08:10 PM</Race_Time>
+.fi
+
+As a result, we parse and store the times as UTC. The race times are
+not always present in the database, but when they are missing, they
+are presented as \(dqTBA\(dq (to be announced):
+
+.nf
+<Race_Time>TBA</Race_Time>
+.fi
+
+Since the dates do not appear to be optional, we store only the race
+date in that case.
+
+.IP \[bu]
+\fIearlylineXML.dtd\fR
+
+The <time> elements in the early lines contain neither a time zone nor
+an am/pm identifier:
+
+.nf
+<time>8:30</time>
+.fi
+
+The times are parsed and stored as UTC, since we
+don't have any other information upon which to base a guess. Even if
+one ignores the UTC time zone, the time can possibly be off by 12
+hours (due to the a.m./p.m. issue).
+
+The game <time> elements can also be empty. Since we store the
+combined game date/time in one field, these games will appear to begin
+at midnight on the day they occur.
+
+.IP \[bu]
+\fIjfilexml.dtd\fR
+
+The <Game_Date> and <Game_Time> elements are combined into on field in
+the database, but no time zone information is given. For example,
+
+.nf
+<Game_Date>06/15/2014</Game_Date>
+<Game_Time>08:00 PM</Game_Time>
+.fi
+
+As a result, we parse and store the times as UTC.
+
+The <CurrentTimestamp> elements suffer a similar problem, sans the
+date:
+
+.nf
+<CurrentTimeStamp>11:30 A.M.</CurrentTimeStamp>
+.fi
+
+They are also stored as UTC.
+
+.IP \[bu]
+\fIMLB_earlylineXML.dtd\fR
+
+See earlylineXML.dtd.
+
+.IP \[bu]
+\fIOdds_XML.dtd\fR
+
+The <Game_Date> and <Game_Time> elements are combined into on field in
+the database, but no time zone information is given. For example,
+
+.nf
+<Game_Date>01/04/2014</Game_Date>
+<Game_Time>04:35 PM</Game_Time>
+.fi
+
+As a result, we parse and store the times as UTC.
+
+.IP \[bu]
+\fISchedule_Changes_XML.dtd\fR
+
+The <Game_Date> and <Game_Time> elements are combined into on field in
+the database, but no time zone information is given. For example,
+
+.nf
+<Game_Date>06/06/2014</Game_Date>
+<Game_Time>04:00 PM</Game_Time>
+.fi
+
+As a result, we parse and store the times as UTC. The game times are
+not always present in the database, but when they are missing, they
+are presented as \(dqTBA\(dq (to be announced):
+
+.nf
+<Game_Time>TBA</Game_Time>
+.fi
+
+Since the dates do not appear to be optional, we store only the game
+date in that case.
+
 .SH DEPLOYMENT
 .P
 When deploying for the first time, the target database will most
@@ -446,6 +637,8 @@ AutoRacingResultsXML.dtd
 .IP \[bu]
 Auto_Racing_Schedule_XML.dtd
 .IP \[bu]
+earlylineXML.dtd
+.IP \[bu]
 Heartbeat.dtd
 .IP \[bu]
 Injuries_Detail_XML.dtd
@@ -454,6 +647,8 @@ injuriesxml.dtd
 .IP \[bu]
 jfilexml.dtd
 .IP \[bu]
+MLB_earlylineXML.dtd
+.IP \[bu]
 newsxml.dtd
 .IP \[bu]
 Odds_XML.dtd
@@ -713,30 +908,50 @@ NFLFumbleLeaderXML.dtd
 .IP \[bu]
 NFLGiveTakeXML.dtd
 .IP \[bu]
+NFLGrassTurfDomeOutsideXML.dtd
+.IP \[bu]
 NFLInside20XML.dtd
 .IP \[bu]
+NFLInterceptionLeadersXML.dtd
+.IP \[bu]
 NFLKickoffsXML.dtd
 .IP \[bu]
 NFLMondayNightXML.dtd
 .IP \[bu]
+NFLPassingLeadersXML.dtd
+.IP \[bu]
 NFLPassLeadXML.dtd
 .IP \[bu]
 NFLQBStartsXML.dtd
 .IP \[bu]
+NFLReceivingLeadersXML.dtd
+.IP \[bu]
+NFLRushingLeadersXML.dtd
+.IP \[bu]
 NFLSackLeadersXML.dtd
 .IP \[bu]
 nflstandxml.dtd
 .IP \[bu]
+NFLTackleFFLeadersXML.dtd
+.IP \[bu]
 NFLTeamRankingsXML.dtd
 .IP \[bu]
+NFLTopKickoffReturnXML.dtd
+.IP \[bu]
 NFLTopPerformanceXML.dtd
 .IP \[bu]
+NFLTopPuntReturnXML.dtd
+.IP \[bu]
 NFLTotalYardageXML.dtd
 .IP \[bu]
+NFLYardsXML.dtd
+.IP \[bu]
 NFL_KickingLeaders_XML.dtd
 .IP \[bu]
 NFL_NBA_Draft_XML.dtd
 .IP \[bu]
+NFL_PuntingLeaders_XML.dtd
+.IP \[bu]
 NFL_Roster_XML.dtd
 .IP \[bu]
 NFL_Team_Stats_XML.dtd