X-Git-Url: http://gitweb.michael.orlitzky.com/?a=blobdiff_plain;f=doc%2Fman1%2Fhtsn-import.1;h=92a7dbc48d29754f76b7b0ccaded2d472d7fdd2a;hb=6b8d64cbe8094ea450cddc45362987ab36d4d82a;hp=433ecf09ae09d406748ec3bd947212033cfb1c13;hpb=adb901a81d69d4abf19bc52a1b39a8a1868be368;p=dead%2Fhtsn-import.git

diff --git a/doc/man1/htsn-import.1 b/doc/man1/htsn-import.1
index 433ecf0..92a7dbc 100644
--- a/doc/man1/htsn-import.1
+++ b/doc/man1/htsn-import.1
@@ -23,9 +23,10 @@ database.
 .P
 First, we must parse the XML. Each supported document type (see below)
 has a full pickle/unpickle implementation (\(dqpickle\(dq is simply a
-synonym for serialize here). That means that we parse the entire
-document into a data structure, and if we pickle (serialize) that data
-structure, we get the exact same XML document tha we started with.
+synonym for \(dqserialize\(dq here). That means that we parse the
+entire document into a data structure, and if we pickle (serialize)
+that data structure, we get the exact same XML document that we started
+with.
 .P
 This is important for two reasons. First, it serves as a second level
 of validation. The first validation is performed by the XML parser,
@@ -42,8 +43,592 @@
 exist. We don't want the schema to change out from under us without
 warning, so it's important that no XML be parsed that would result in
 a different schema than we had previously. Since we can
 pickle/unpickle everything already, this should be impossible.
+.P
+A list of supported document types is given in the appendix.
+.P
+The GameInfo and SportInfo types do not have their own top-level
+tables in the database. Instead, their raw XML is stored in either the
+\(dqgame_info\(dq or \(dqsport_info\(dq table respectively.
+
+.SH DATABASE SCHEMA
+.P
+At the top level (with two notable exceptions), we have one table for
+each of the XML document types that we import. For example, the
+documents corresponding to \fInewsxml.dtd\fR will have a table called
+\(dqnews\(dq. All top-level tables contain two important fields,
+\(dqxml_file_id\(dq and \(dqtime_stamp\(dq. The former is unique and
+prevents us from inserting the same data twice. The time stamp, on the
+other hand, lets us know when the data is old and can be removed. The
+database schema makes it possible to delete only the outdated top-level
+records; all transient children should be removed by triggers.
+.P
+These top-level tables will often have children. For example, each
+news item has zero or more locations associated with it. The child
+table will be named <parent>_<child>, which in this case
+corresponds to \(dqnews_locations\(dq.
+.P
+To relate the two, a third table may exist with name
+<parent>__<child>. Note the two underscores. This prevents ambiguity
+when the child table itself contains underscores. The table joining
+\(dqnews\(dq with \(dqnews_locations\(dq is thus called
+\(dqnews__news_locations\(dq. This is necessary when the child table
+has a unique constraint; we don't want to blindly insert duplicate
+records keyed to the parent. Instead we'd like to use the third table
+to map an existing child to the new parent.
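+.P
+To make the convention concrete, here is a simplified sketch of the
+three tables relating \(dqnews\(dq to its locations. The column lists
+are illustrative only; the authoritative tables are created by the
+migrations that \fBhtsn-import\fR runs itself.
+.P
+.nf
+-- Top-level table: "xml_file_id" prevents duplicate imports,
+-- and "time_stamp" lets outdated records be pruned.
+CREATE TABLE news (
+  id INTEGER PRIMARY KEY,
+  xml_file_id INTEGER NOT NULL UNIQUE,
+  time_stamp TIMESTAMP NOT NULL
+);
+
+-- Child table, kept unique to avoid pointless duplication.
+CREATE TABLE news_locations (
+  id INTEGER PRIMARY KEY,
+  city VARCHAR,
+  state VARCHAR,
+  country VARCHAR NOT NULL,
+  UNIQUE (city, state, country)
+);
+
+-- Join table (note the two underscores) mapping existing
+-- children to new parents. Deleting an outdated "news" row
+-- also deletes its join rows.
+CREATE TABLE news__news_locations (
+  news_id INTEGER NOT NULL
+    REFERENCES news (id) ON DELETE CASCADE,
+  news_locations_id INTEGER NOT NULL
+    REFERENCES news_locations (id) ON DELETE CASCADE
+);
+.fi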
+.P
+Where it makes sense, children are kept unique to prevent pointless
+duplication. This slows down inserts and speeds up reads (which are
+much more frequent). There is a tradeoff to be made, however. For a
+table with a small, fixed upper bound on the number of rows (like
+\(dqodds_casinos\(dq), there is great benefit to de-duplication. The
+total number of rows stays small, so inserts are still quick, and many
+duplicate rows are eliminated.
+.P
+But with a table like \(dqodds_games\(dq, the number of games grows
+quickly and without bound. It is therefore more beneficial to be able
+to delete the old games (through an ON DELETE CASCADE, tied to
+\(dqodds\(dq) than it is to eliminate duplication. A table like
+\(dqnews_locations\(dq is somewhere in between. It is hoped that the
+unique constraint on the top-level table's \(dqxml_file_id\(dq will
+prevent duplication in this case anyway.
+.P
+The aforementioned exceptions are the \(dqgame_info\(dq and
+\(dqsport_info\(dq tables. These tables contain the raw XML for a
+number of DTDs that are not handled individually. This is partially
+for backwards-compatibility with a legacy implementation, but is
+mostly a stopgap due to a lack of resources at the moment. These two
+tables (game_info and sport_info) still possess timestamps that allow
+us to prune old data.
+.P
+UML diagrams of the resulting database schema for each XML document
+type are provided with the \fBhtsn-import\fR documentation, in the
+\fIdoc/dbschema\fR directory. These are not authoritative, but it
+should be considered a bug if they are incorrect. The diagrams are
+created using the pgModeler tool.
+
+.SH DATABASE SCHEMA COMPROMISES
+
+There are a few places where the database schema isn't exactly how
+we'd like it to be:
+
+.IP \[bu] 2
+\fIearlylineXML.dtd\fR
+
+The database representations for earlylineXML.dtd and
+MLB_earlylineXML.dtd are the same; that is, they share the same
+tables. The two document types represent team names in different
+ways. In order to accommodate both types with one parser, we had to
+make both ways optional, and then merge the two together before
+converting to the database representation.
+
+Unfortunately, when we merge two optional things together, we get
+another optional thing back. There's no way to say that \(dqat least
+one is not optional.\(dq So the team names in the database schema are
+optional as well, even though they should always be present.
+
+.SH NULL POLICY
+.P
+Normally in a database one makes a distinction between fields that
+simply don't exist, and those fields that are
+\(dqempty\(dq. Translating from XML, there is a natural way to
+determine which one should be used: if an element is present in the
+XML document but its contents are empty, then an empty string should
+be inserted into the corresponding field. If on the other hand the
+element is missing entirely, the corresponding database entry should
+be NULL to indicate that fact.
+.P
+This is all well and good, but the XML must be consistent for the
+database consumer to make any sense of what he sees. The feed XML uses
+optional and blank elements interchangeably, and without any
+discernible pattern. To propagate this inconsistency into the database
+would only cause confusion.
+.P
+As a result, a policy was adopted: both optional elements and elements
+whose contents can be empty will be considered nullable in the
+database. If the element is missing, the corresponding field is
+NULL. Likewise if the content is simply empty. That means there
+should never be a (completely) empty string in a database column.
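+.P
+For example, under this policy both of the following (hypothetical)
+location fragments produce a NULL \(dqcity\(dq column; neither one
+results in an empty string:
+.P
+.nf
+<location><city></city><state>NY</state></location>
+<location><state>NY</state></location>
+.fi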
+.SH XML SCHEMA GENERATION
+.P
+In order to parse XML, you need to know the structure of your
+documents. Usually this is given in the form of a DTD or schema. The
+Sports Network does provide DTDs for their XML, but unfortunately many
+of them do not match the XML found on the feed.
+.P
+We need to construct a database into which to insert the XML. How do
+we know if an element should be a column, or if it should have its own
+table? We need to know how many times it can appear in the
+document. So we need some form of specification. Since the supplied
+DTDs are incorrect, we would like to generate them automatically.
+.P
+The process should go something like this:
+.IP 1.
+Generate a DTD from the first foo.xml file we see. Call it foo.dtd.
+.IP 2.
+Validate future foo documents against foo.dtd (see the example after
+this list). If they all validate, great. If one fails, add it to the
+corpus and update foo.dtd so that both the original and the new
+foo.xml validate.
+.IP 3.
+Repeat until no more failures occur. This can never be perfect:
+tomorrow we could get a foo.xml that's wildly different from what
+we've seen in the past. But it's the best we can hope for under
+the circumstances.
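+.P
+The validation in step 2 can be done with standard tools; for
+example, with xmllint from libxml2 (the file names here are
+hypothetical):
+.P
+.nf
+.I $ xmllint --noout --dtdvalid schemagen/foo.dtd new-foo.xml
+.I $ echo $?
+0
+.fi
+.P
+A nonzero exit status means that new-foo.xml does not validate and
+should be added to the corpus before regenerating foo.dtd.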
+.P
+Enter XML-Schema-learner. This tool can infer a
+DTD from a set of sample XML files. The top-level \(dqschemagen\(dq
+folder (in this project) contains a number of subfolders\(emone for
+each type of document that we want to parse. Contained therein are XML
+samples for that particular document type. These were hand-picked one
+at a time according to the procedure above, and the complete set of
+XML is what we use to generate the DTDs used by \fBhtsn-import\fR.
+.P
+To generate them, run `make schema` at the project
+root. XML-Schema-learner will be invoked on each subfolder of
+\(dqschemagen\(dq and will output the corresponding DTDs to the
+\(dqschemagen\(dq folder.
+.P
+Most of the production schemas are generated this way; however, a few
+needed manual tweaking. The final, believed-to-be-correct schemas for
+all supported document types can be found in the \(dqschema\(dq folder in
+the project root. Having the correct DTDs on hand means you don't
+need XML-Schema-learner in order to install \fBhtsn-import\fR.
+
+.SH XML SCHEMA UPDATES
+.P
+If a new tag is added to an XML document type, \fBhtsn-import\fR will
+most likely refuse to parse it, since the new documents no longer
+match the existing DTD.
+.P
+The first thing to do in that case is add the unparseable document to
+the \(dqschemagen\(dq directory, and generate a new DTD that matches
+both the old and new samples. Once a new, correct DTD has been
+generated, it should be added to the \(dqschema\(dq directory. Then,
+the parser can be updated and \fBhtsn-import\fR rebuilt.
+.P
+At this point, \fBhtsn-import\fR should be capable of importing the
+new document. But the addition of the new tag will most likely require
+new fields in the database. Fortunately, easy migrations like this are
+handled automatically. As an example, at one point, \fIOdds_XML.dtd\fR
+did not contain the \(dqHStarter\(dq and \(dqAStarter\(dq elements
+associated with its games. Suppose we parse one of the old documents
+(without \(dqHStarter\(dq and \(dqAStarter\(dq) using an old version
+of \fBhtsn-import\fR:
+.P
+.nf
+.I $ htsn-import --connection-string='foo.sqlite3' \\\\
+.I " schemagen/Odds_XML/19996433.xml"
+Migration: CREATE TABLE \(dqodds\(dq ...
+Successfully imported schemagen/Odds_XML/19996433.xml.
+Processed 1 document(s) total.
+.fi
+.P
+At this point, the database schema matches the old documents, that is,
+the ones without \fIAStarter\fR and \fIHStarter\fR. If we use a new
+version of \fBhtsn-import\fR, supporting the new fields, the migration
+is handled gracefully:
+.P
+.nf
+.I $ htsn-import --connection-string='foo.sqlite3' \\\\
+.I " schemagen/Odds_XML/21315768.xml"
+Migration: ALTER TABLE \(dqodds_games\(dq
+ ADD COLUMN \(dqaway_team_starter_id\(dq INTEGER;
+Migration: ALTER TABLE \(dqodds_games\(dq
+ ADD COLUMN \(dqaway_team_starter_name\(dq VARCHAR;
+Migration: ALTER TABLE \(dqodds_games\(dq
+ ADD COLUMN \(dqhome_team_starter_id\(dq INTEGER;
+Migration: ALTER TABLE \(dqodds_games\(dq
+ ADD COLUMN \(dqhome_team_starter_name\(dq VARCHAR;
+Successfully imported schemagen/Odds_XML/21315768.xml.
+Processed 1 document(s) total.
+.fi
+.P
+If fields are removed from the schema, then manual intervention may be
+necessary:
+.P
+.nf
+.I $ htsn-import -b Postgres -c 'dbname=htsn user=postgres' \\\\
+.I " schemagen/Odds_XML/19996433.xml"
+ERROR: Database migration: manual intervention required.
+The following actions are considered unsafe:
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqaway_team_starter_id\(dq
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqaway_team_starter_name\(dq
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqhome_team_starter_id\(dq
+ALTER TABLE \(dqodds_games\(dq DROP COLUMN \(dqhome_team_starter_name\(dq
+
+ERROR: Failed to import file schemagen/Odds_XML/19996433.xml.
+Processed 0 document(s) total.
+.fi
+.P
+To fix these errors, manually invoke the SQL commands that were
+considered unsafe:
+.P
+.nf
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'ALTER TABLE odds_games DROP COLUMN away_team_starter_id;'"
+ALTER TABLE
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'ALTER TABLE odds_games DROP COLUMN away_team_starter_name;'"
+ALTER TABLE
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'ALTER TABLE odds_games DROP COLUMN home_team_starter_id;'"
+ALTER TABLE
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'ALTER TABLE odds_games DROP COLUMN home_team_starter_name;'"
+ALTER TABLE
+.fi
+.P
+After manually adjusting the schema, the import should succeed.
+
+.SH XML SCHEMA ODDITIES
+.P
+There are a number of problems with the XML on the wire. Even if we
+construct the DTDs ourselves, the results are sometimes
+inconsistent. Here we document a few of them.
+
+.IP \[bu] 2
+\fInewsxml.dtd\fR
+
+The TSN DTD for news (and almost all XML on the wire) suggests that
+there is exactly one (possibly-empty) <SMS> element present in each
+message. However, we have seen an example (XML_File_ID 21232353) where
+an empty <SMS> followed a non-empty one:
+
+.nf
+<SMS>Odd Man Rush: Snow under pressure to improve Isles quickly</SMS>
+<SMS></SMS>
+.fi
+
+We don't parse this case at the moment, but we do recognize it and report
+it as unsupported so that offending documents can be removed. An example
+is provided as test/xml/newsxml-multiple-sms.xml.
+
+.IP \[bu]
+\fIMLB_earlylineXML.dtd\fR
+
+Unlike earlylineXML.dtd, this document type has more than one <game>
+associated with each <date>. Moreover, each <date> has a bunch of
+<note> children that are supposed to be associated with the <game>s,
+but the document structure indicates no explicit relationship. For
+example,
+
+.nf
+<date>
+  <note>...</note>
+  <game>...</game>
+  <game>...</game>
+  <note>...</note>
+  <game>...</game>
+</date>
+.fi
+
+Here the first <note> is inferred to apply to the two <game>s that
+follow it, and the second <note> applies to the single <game> that
+follows it. But this is very fragile to parse. Instead, we use a hack
+to facilitate (un)pickling, and then drop the notes entirely during
+the database conversion.
+
+A similar workaround is implemented for Odds_XML.dtd.
+
+.IP \[bu]
+\fIOdds_XML.dtd\fR
+
+The <Notes> elements here are supposed to be associated with a set of
+<Game> elements, but since the pair
+(<Notes>...</Notes><Game>...</Game>) can appear zero or more times,
+this leads to ambiguity in parsing. We therefore ignore the notes
+entirely (although a hack is employed to facilitate parsing). The same
+approach is taken with a newer element that was later added to the
+document type.
+
+.IP \[bu]
+\fIweatherxml.dtd\fR
+
+There appear to be two types of weather documents; the first has
+<listing> contained within <forecast> and the second has <forecast>
+contained within <listing>. While it would be possible to parse both,
+it would greatly complicate things. The first form is more common, so
+that's all we support for now. An example of the second form is
+provided as test/xml/weatherxml-type2.xml.
+
+We are however able to identify the second type. When one is
+encountered, an informational message (that it is unsupported) will be
+printed. If the \fI\-\-remove\fR flag is used, the file will be
+deleted. This prevents documents that we know we can't import from
+building up.
+
+Another problem that comes up occasionally is that the home and away
+team elements appear in the reverse order. As in the other case, we
+report these as unsupported and then \(dqsucceed\(dq so that the
+offending document can be removed if desired. An example is provided
+as test/xml/weatherxml-backwards-teams.xml.
+
+.SH DATE/TIME ISSUES
+
+Dates and times appear in a number of places on the feed. The date
+portions are usually fine, but the times often lack important
+information such as the time zone, or whether \(dq8 o'clock\(dq means
+a.m. or p.m.
+
+The most pervasive issue occurs with the timestamps that are included
+in every message. A typical timestamp looks like this:
+
+.nf
+ May 24, 2014, at 04:18 PM ET
+.fi
+
+The \(dqtime zone\(dq is given as \(dqET\(dq, but unfortunately
+\(dqET\(dq is not a valid time zone. It stands for \(dqEastern
+Time\(dq, which can belong to either of two time zones, EST or EDT,
+based on the time of the year (that is, whether or not daylight
+saving time is in effect) and one's location (for example, Arizona
+doesn't observe daylight saving time). It's not much more useful to
+be off by one hour than it is to be off by five hours, and since we
+can't determine the true offset from the timestamp, we always parse
+and store these as UTC.
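+
+For example, a message stamped \(dqMay 24, 2014, at 04:18 PM ET\(dq
+is stored with that wall-clock time relabeled as UTC. Querying the
+\(dqnews\(dq table (any top-level table behaves the same way; the
+exact output below is illustrative):
+
+.nf
+.I $ psql -U postgres -d htsn \\\\
+.I " -c 'SELECT time_stamp FROM news LIMIT 1;'"
+     time_stamp
+---------------------
+ 2014-05-24 16:18:00
+(1 row)
+.fi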
+
+Here's a list of the ones that may cause surprises:
+
+.IP \[bu] 2
+\fIAutoRacingResultsXML.dtd\fR
+
+The <RaceDate> elements contain a full date and time, but no time zone
+information:
+
+.nf
+<RaceDate>5/24/2014 2:45:00 PM</RaceDate>
+.fi
+
+We parse them as UTC; the stored instant will be wrong, but the
+wall-clock value is \(dqcorrect\(dq if the new UTC time zone is
+ignored.
+
+.IP \[bu]
+\fIAuto_Racing_Schedule_XML.dtd\fR
+
+The <Race_Date> and <Race_Time> elements are combined into one field
+in the database, but no time zone information is given. For example,
+
+.nf
+<Race_Date>02/16/2013</Race_Date>
+<Race_Time>08:10 PM</Race_Time>
+.fi
+
+As a result, we parse and store the times as UTC. The race times are
+not always present in the feed; when they are missing, they are
+presented as \(dqTBA\(dq (to be announced):
+
+.nf
+<Race_Time>TBA</Race_Time>
+.fi
+
+Since the dates do not appear to be optional, we store only the race
+date in that case.
+
+.IP \[bu]
+\fIearlylineXML.dtd\fR
+
+The