X-Git-Url: http://gitweb.michael.orlitzky.com/?a=blobdiff_plain;f=doc%2FREADME.development;h=61a5e599c0dac4c22ed23f7086203e02563f3a09;hb=1cecab83b93656aa08ef5128b4e3bd3b6385ac8d;hp=b1df69861650a579959fa32b576590d00e978627;hpb=43d1e1035dc7d5d5cea5576b6b5cf0d9415438fa;p=dead%2Fhtsn-import.git

diff --git a/doc/README.development b/doc/README.development
index b1df698..61a5e59 100644
--- a/doc/README.development
+++ b/doc/README.development
@@ -1,3 +1,62 @@
+== Overview ==
+
+The "main" function accepts a list of XML files on the command line,
+and goes through them one at a time. A minimal parse attempt is made
+to determine the DTD of the file, and then one big case statement
+decides what to do with it based on that DTD name. Each DTD name has
+an associated module in src/TSN/XML which can do a few things:
+
+ 1. List the DTDs for which it's responsible
+
+ 2. Parse a top-level element into an XmlTree
+
+ 3. In rare cases (Weather, News) detect specific malformed documents
+
+Most of the XML modules are similar. The big idea is that every object
+(for example, a ) has both a database type and an XML type. When
+those two types differ, we need to be able to convert between
+them. So, for example, if the XML representation of a team differs
+from the database representation, we might define,
+
+> data Team = ...
+> data TeamXml = ...
+
+But if you're lucky, the database/XML representations will be the
+same, and you'd only need to define "Team"!
+
+The most common situation where the representations differ is when
+there exists a parent/child relationship. In the XML representation,
+you will have e.g. the Team contained within a Game:
+
+> data GameXml = GameXml { xml_game_id :: Int, xml_team :: TeamXml }
+
+But in the database representation--which looks a lot like a schema
+specification--there's no mention of the team at all.
+
+> data Game = Game { game_id :: Int }
+
+That's because the database representation of the Team will have a
+foreign key to a Game instead:
+
+> data Team = Team { games_id :: DefaultKey Game, ... }
+
+Most of the XML modules are devoted to converting back and forth
+between these two types. The XML modules are also responsible for
+"unpickling" the XML document, which essentially parses it into a
+bunch of Haskell data types (the FooXml representations).
+
+Furthermore, each top-level message element in the XML modules knows
+how to insert itself into the database. The "Message" type is always a
+member of the "DbImport" class, and that class defines two methods:
+dbmigrate, to run the migrations, and dbimport, which actually says
+how to import the thing.
+
+Each XML file is handed off to the appropriate XML module which then
+runs its migrations and tries to import the XML into the database. The
+results are reported and collected into a list so that later the
+processed files may be removed.
+
+
 == Pickle Failures ==
 
 Our schemas are "best guesses" based on what we've seen on the
@@ -27,23 +86,3 @@ If there's an error, you'll see something like the following:
 contents: IRL - Firestone 600 - Final Results Texas Motor Sp... []
-
-
-== Creating the Database Schema (Deployment) ==
-
-When deploying for the first time, the target database will most
-likely be empty. The schema will be migrated when a new document type
-is seen, but this has a downside: it can be months before every
-supported document type has been seen once. This can make it difficult
-to test the database permissions.
-
-Since all of the test XML documents have old timestamps, one easy
-workaround is the following: simply import all of the test XML
-documents, and then delete them. This will force the migration of the
-schema, after which you can set and test the database permissions.
-
-Something as simple as,
-
- $ find ./test/xml -iname '*.xml' | xargs htsn-import -c foo.sqlite
-
-should do it.