== Overview ==

The "main" function accepts a list of XML files on the command line,
and goes through them one at a time. A minimal parse attempt is made
to determine the DTD of the file, and then one big case statement
decides what to do with it based on that DTD name. Each DTD name has
an associated module in src/TSN/XML which can do a few things:

  1. List the DTDs for which it's responsible

  2. Parse a top-level <message> element into an XmlTree

  3. In rare cases (Weather, News) detect specific malformed documents
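
The dispatch described above can be sketched roughly as follows. This
is only an illustration: the Handler type, the function names, and the
DTD file names are all hypothetical, and a lookup over each handler's
DTD list stands in for the real case statement.

```haskell
module Main where

-- | Hypothetical stand-in for the per-DTD modules in src/TSN/XML.
data Handler = AutoRacingResults | News | Weather
  deriving (Bounded, Enum, Eq, Show)

-- | Each handler lists the DTDs for which it is responsible.
--   These DTD file names are illustrative guesses, not the real ones.
dtds :: Handler -> [String]
dtds AutoRacingResults = ["AutoRacingResultsXML.dtd"]
dtds News              = ["newsxml.dtd"]
dtds Weather           = ["weatherxml.dtd"]

-- | Find the first handler that claims the given DTD name, if any.
handler_for :: String -> Maybe Handler
handler_for dtd =
  case [ h | h <- [minBound .. maxBound], dtd `elem` dtds h ] of
    (h:_) -> Just h
    []    -> Nothing

main :: IO ()
main = mapM_ (print . handler_for) ["newsxml.dtd", "unknown.dtd"]
```

A DTD that no handler claims yields Nothing, which is where a real
program would report the file as unsupported.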

Most of the XML modules are similar. The big idea is that every object
(for example, a <team>) has both a database type and an XML type. When
those two types differ, we need to be able to convert between
them. So, for example, if the XML representation of a team differs
from the database representation, we might define,

> data Team = ...
> data TeamXml = ...

But if you're lucky, the database/XML representations will be the
same, and you'd only need to define "Team"!

The most common situation where the representations differ is when
there exists a parent/child relationship. In the XML representation,
you will have e.g. the Team contained within a Game:

> data GameXml = GameXml { xml_game_id :: Int, xml_team :: TeamXml }

But in the database representation--which looks a lot like a schema
specification--there's no mention of the team at all.

> data Game = Game { game_id :: Int }

That's because the database representation of the Team will have a
foreign key to a Game instead:

> data Team = Team { games_id :: DefaultKey Game, ... }
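
To make the parent/child conversion concrete, here is a minimal
sketch that needs no database library. The Key type is a hypothetical
stand-in for DefaultKey, the team_name fields are invented because
the real fields are elided above, and the two conversion functions are
assumptions about how such a conversion might look.

```haskell
module Main where

-- | Hypothetical stand-in for DefaultKey, so the sketch has no
--   database dependency.
newtype Key a = Key Int deriving (Eq, Show)

-- XML representations: the team is nested inside the game.
data TeamXml = TeamXml { xml_team_name :: String }
  deriving (Eq, Show)
data GameXml = GameXml { xml_game_id :: Int, xml_team :: TeamXml }
  deriving (Eq, Show)

-- Database representations: the team points back at its game.
data Game = Game { game_id :: Int }
  deriving (Eq, Show)
data Team = Team { games_id :: Key Game, team_name :: String }
  deriving (Eq, Show)

-- | The database Game simply drops the nested team.
game_from_xml :: GameXml -> Game
game_from_xml gx = Game { game_id = xml_game_id gx }

-- | The database Team needs the key of its parent Game, which is
--   only known once the Game has been inserted.
team_from_xml :: Key Game -> TeamXml -> Team
team_from_xml fk tx = Team { games_id = fk, team_name = xml_team_name tx }

main :: IO ()
main = do
  let gx   = GameXml 7 (TeamXml "Example Team")
      game = game_from_xml gx
      key  = Key 1  -- pretend the database assigned this key on insert
      team = team_from_xml key (xml_team gx)
  print game
  print team
```

The order matters: the parent has to be converted and inserted first,
because the child cannot be built without the parent's key.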

Most of the XML modules are devoted to converting back and forth
between these two types. The XML modules are also responsible for
"unpickling" the XML document, which essentially parses it into a
bunch of Haskell data types (the FooXml representations).

Furthermore, each top-level message element in the XML modules knows
how to insert itself into the database. The "Message" type is always a
member of the "DbImport" class, and that class defines two methods:
dbmigrate, to run the migrations, and dbimport, which actually says
how to import the thing.

Each XML file is handed off to the appropriate XML module which then
runs its migrations and tries to import the XML into the database. The
results are reported and collected into a list so that later the
processed files may be removed.
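
The class and the import loop described above might be sketched like
this. The method names dbmigrate and dbimport come from the text;
everything else is an assumption. In particular the signatures are
simplified to plain IO, the ImportResult and Message types are
hypothetical, and treating only successful imports as removable is a
guess.

```haskell
module Main where

-- | Hypothetical outcome of importing one file.
data ImportResult = ImportSucceeded | ImportFailed String
  deriving (Eq, Show)

-- | Simplified version of the class: the real methods presumably run
--   in a database monad rather than in plain IO.
class DbImport a where
  dbmigrate :: a -> IO ()
  dbimport  :: a -> IO ImportResult

-- | Hypothetical stand-in for a module's top-level Message type.
newtype Message = Message { file_id :: Int }

instance DbImport Message where
  dbmigrate _ = putStrLn "running migrations"
  dbimport m
    | file_id m > 0 = return ImportSucceeded
    | otherwise     = return (ImportFailed "bad file id")

-- | Migrate, import, and pair the file path with the result.
import_one :: DbImport a => (FilePath, a) -> IO (FilePath, ImportResult)
import_one (path, msg) = do
  dbmigrate msg
  result <- dbimport msg
  return (path, result)

-- | Keep only the paths whose import succeeded (an assumption about
--   which files count as "processed").
removable :: [(FilePath, ImportResult)] -> [FilePath]
removable results = [ path | (path, ImportSucceeded) <- results ]

main :: IO ()
main = do
  results <- mapM import_one [("1.xml", Message 1), ("0.xml", Message 0)]
  print (removable results)
```

Collecting the results first and deleting afterwards keeps a failed
import from causing its source file to be lost.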


== Pickle Failures ==

Our schemas are "best guesses" based on what we've seen on the
wire. From time to time they'll be wrong, and thus the (un)pickler
implementation will fail to unpickle some XML document. The easiest
way to test a fix for this is interactively: it's quick, and error
messages are written to the console. Here's an example of such a
session (wrapped for readability):

  $ ghci
  htsn-import> runX $ xunpickleDocument
                 TSN.XML.AutoRacingResults.pickle_message
                 parse_opts
                 "schemagen/AutoRacingResultsXML/21241892.xml"
  [Message {xml_xml_file_id = 21241892... stamp = 2014-06-08 04:05:00 UTC}]

If there's an error, you'll see something like the following:

  $ ghci
  htsn-import> runX $ xunpickleDocument
                 TSN.XML.AutoRacingResults.pickle_message
                 parse_opts
                 "schemagen/AutoRacingResultsXML/21241892-bad.xml"
  fatal error: document unpickling failed
  xpElem: got element name "RaceDate", but expected "RaceID"
  context: element "message"
  contents: <Title>IRL - Firestone 600 - Final Results</Title><Track_Location>
  Texas Motor Sp...
  []
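
As a rough illustration of what a fix can look like, the sketch below
reproduces the same kind of failure with two toy picklers and then
relaxes one of them. It assumes the hxt package is installed. The
picklers are hypothetical and much smaller than the real
pickle_message, and whether <RaceID> is actually optional in the feed
is not something this example establishes.

```haskell
module Main where

import Text.XML.HXT.Core

-- | Toy strict pickler: <RaceID> must come before <RaceDate>.
strict :: PU (Int, String)
strict =
  xpElem "message" $
    xpPair (xpElem "RaceID" xpInt) (xpElem "RaceDate" xpText)

-- | Toy relaxed pickler: <RaceID> may be missing (hypothetical fix).
relaxed :: PU (Maybe Int, String)
relaxed =
  xpElem "message" $
    xpPair (xpOption (xpElem "RaceID" xpInt)) (xpElem "RaceDate" xpText)

-- | A document without <RaceID>, built by pickling with the relaxed
--   pickler instead of reading a file.
example_doc :: XmlTree
example_doc = pickleDoc relaxed (Nothing, "2014-06-08")

main :: IO ()
main = do
  -- The strict pickler meets "RaceDate" where it wants "RaceID".
  print (unpickleDoc strict example_doc)
  -- The relaxed pickler accepts the same document.
  print (unpickleDoc relaxed example_doc)
```

Wrapping the offending element in xpOption changes the XML type to a
Maybe, so the conversion to the database type has to decide what a
missing value means.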