.TH htsn-import 1

.SH NAME
htsn-import \- Import XML files from The Sports Network into an RDBMS.

.SH SYNOPSIS

\fBhtsn-import\fR [OPTIONS] [FILES]

.SH DESCRIPTION
.P
The Sports Network <http://www.sportsnetwork.com/> offers an XML feed
containing various sports news and statistics. Our sister program
\fBhtsn\fR is capable of retrieving the feed and saving the individual
XML documents contained therein. But what to do with them?
.P
The purpose of \fBhtsn-import\fR is to take these XML documents and
get them into something we can use: a relational database management
system (RDBMS), loosely known as a SQL database. The structure of a
relational database is, well, relational, and the feed XML is not. So
there is some work to do before the data can be inserted.
.P
First, we must parse the XML. Each supported document type (see below)
has a full pickle/unpickle implementation (\(dqpickle\(dq is simply a
synonym for serialize here). That means that we parse the entire
document into a data structure, and if we pickle (serialize) that data
structure, we get the exact same XML document that we started with.
.P
This is important for two reasons. First, it serves as a second level
of validation. The first validation is performed by the XML parser,
but if that succeeds and unpickling fails, we know that something is
fishy. Second, we don't ever want to be surprised by some new element
or attribute showing up in the XML. The fact that we can unpickle the
whole thing now means that we won't be surprised in the future.
.P
The aforementioned feature is especially important because we
automatically migrate the database schema every time we import a
document. If you attempt to import a \(dqnewsxml.dtd\(dq document, all
database objects relating to the news will be created if they do not
exist. We don't want the schema to change out from under us without
warning, so it's important that no XML be parsed that would result in
a different schema than we had previously. Since we can
pickle/unpickle everything already, this should be impossible.
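.P
For example, importing a \(dqnewsxml.dtd\(dq document into a Sqlite
database that does not yet contain the news tables will simply create
them during the import. The database name \(dqfresh.sqlite3\(dq below
is only an illustration; any connection string works the same way:
.nf
.I $ htsn-import --connection-string='fresh.sqlite3' \\\\
.I "  test/xml/newsxml.xml"
Successfully imported test/xml/newsxml.xml.
Imported 1 document(s) total.
.fi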

.SH SUPPORTED DOCUMENT TYPES
.P
The XML document types obtained from the feed are uniquely identified
by their DTDs. We currently support documents with the following DTDs:
.IP \[bu] 2
Heartbeat.dtd
.IP \[bu]
newsxml.dtd
.IP \[bu]
Injuries_Detail_XML.dtd
.IP \[bu]
injuriesxml.dtd
.IP \[bu]
Odds_XML.dtd
.IP \[bu]
weatherxml.dtd

.SH DATABASE SCHEMA
.P
At the top level, we have one table for each of the XML document types
that we import. For example, the documents corresponding to
\fInewsxml.dtd\fR will have a table called \(dqnews\(dq.
.P
These top-level tables will often have children. For example, each
news item has zero or more locations associated with it. The child
table will be named <parent>_<children>, which in this case
corresponds to \(dqnews_locations\(dq.
.P
To relate the two, a third table exists with the name <parent
table>__<child table>. Note the two underscores. This prevents
ambiguity when the child table itself contains underscores. As long as
we never go more than one level down, this system should suffice. The
table joining \(dqnews\(dq with \(dqnews_locations\(dq is thus called
\(dqnews__news_locations\(dq.
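.P
For instance, the following query pulls each news item together with
its associated locations by going through the join table. It is only a
sketch: the \(dqid\(dq, \(dqnews_id\(dq, and \(dqnews_locations_id\(dq
column names are assumptions made for illustration, and the real
column names appear in the UML diagrams mentioned below. The
\(dqfoo.sqlite3\(dq database is the one used in the EXAMPLES section.
.nf
.I $ sqlite3 foo.sqlite3 \\\\
.I "  'SELECT * FROM news"
.I "     JOIN news__news_locations"
.I "       ON news__news_locations.news_id = news.id"
.I "     JOIN news_locations"
.I "       ON news_locations.id = news__news_locations.news_locations_id;'"
.fi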
.P
Wherever possible, children are kept unique to prevent pointless
duplication. This slows down inserts and speeds up reads (which we
assume are much more frequent). The current rate at which the feed
transmits XML is much too slow for the inserts to become a problem.
.P
UML diagrams of the resulting database schema for each XML document
type are provided with the \fBhtsn-import\fR documentation.
.P
In some cases the top-level table for a document type has been
omitted. For example, all of the information in the
\(dqinjuriesxml\(dq documents is contained in \(dqlisting\(dq
elements. We therefore omit the \(dqinjuries\(dq table and create only
\(dqinjuries_listings\(dq.
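.P
One way to check exactly which tables were created is to list them in
the backend; with the Sqlite backend, for example, you can use the
\(dq.tables\(dq command of \fBsqlite3\fR on the database you imported
into:
.nf
.I $ sqlite3 foo.sqlite3 '.tables'
.fi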

.SH OPTIONS

.IP \fB\-\-backend\fR,\ \fB\-b\fR
The RDBMS backend to use. Valid choices are \fISqlite\fR and
\fIPostgres\fR. Capitalization is important, sorry.

Default: Sqlite

.IP \fB\-\-connection-string\fR,\ \fB\-c\fR
The connection string used for connecting to the database backend
given by the \fB\-\-backend\fR option. The default is appropriate for
the \fISqlite\fR backend.

Default: \(dq:memory:\(dq

.IP \fB\-\-log-file\fR
If you specify a file here, logs will be written to it (possibly in
addition to syslog). Can be either a relative or absolute path. It
will not be auto-rotated; use something like logrotate for that.

Default: none

.IP \fB\-\-log-level\fR
How verbose should the logs be? We log notifications at three levels:
INFO, WARN, and ERROR. Specify the \(dqmost boring\(dq level of
notifications you would like to receive (in all-caps); more
interesting notifications will be logged as well.

Default: INFO

.IP \fB\-\-remove\fR,\ \fB\-r\fR
Remove successfully processed files. If you enable this, you can see
at a glance which XML files are not being processed, because they're
all that should be left.

Default: disabled

.IP \fB\-\-syslog\fR,\ \fB\-s\fR
Enable logging to syslog. On Windows this will attempt to communicate
(over UDP) with a syslog daemon on localhost, which will most likely
not work.

Default: disabled

.IP \fB\-\-username\fR,\ \fB\-u\fR
Your TSN username. A username is required, so you must supply one
either on the command line or in a configuration file.

Default: none

.SH CONFIGURATION FILE
.P
Any of the command-line options mentioned above can be specified in a
configuration file instead. We first look for \(dqhtsn-importrc\(dq in
the system configuration directory. We then look for a file named
\(dq.htsn-importrc\(dq in the user's home directory. The latter will
override the former.
.P
The user's home directory is simply $HOME on Unix; on Windows it's
wherever %APPDATA% points. The system configuration directory is
determined by Cabal; the \(dqsysconfdir\(dq parameter during the
\(dqconfigure\(dq step is used.
.P
The file's syntax is given by examples in the htsn-importrc.example file
(included with \fBhtsn-import\fR).
.P
Options specified on the command-line override those in either
configuration file.
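.P
As a purely hypothetical sketch (the authoritative syntax is whatever
htsn-importrc.example shows), a configuration file for the Postgres
example below might pair the long option names with their values,
along these lines:
.nf
# hypothetical ~/.htsn-importrc; see htsn-importrc.example for the real syntax
backend = "Postgres"
connection-string = "dbname=htsn user=postgres"
username = "my_tsn_username"
.fi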

.SH EXAMPLES
.IP \[bu] 2
Import newsxml.xml into a preexisting sqlite database named \(dqfoo.sqlite3\(dq:

.nf
.I $ htsn-import --connection-string='foo.sqlite3' \\\\
.I " test/xml/newsxml.xml"
Successfully imported test/xml/newsxml.xml.
Imported 1 document(s) total.
.fi
.IP \[bu]
Repeat the previous example, but delete newsxml.xml afterwards:

.nf
.I $ htsn-import --connection-string='foo.sqlite3' \\\\
.I " --remove test/xml/newsxml.xml"
Successfully imported test/xml/newsxml.xml.
Imported 1 document(s) total.
Removed processed file test/xml/newsxml.xml.
.fi
.IP \[bu]
Use a Postgres database instead of the default Sqlite. This assumes
that you have a database named \(dqhtsn\(dq accessible to user
\(dqpostgres\(dq locally:

.nf
.I $ htsn-import --connection-string='dbname=htsn user=postgres' \\\\
.I " --backend=Postgres test/xml/newsxml.xml"
Successfully imported test/xml/newsxml.xml.
Imported 1 document(s) total.
.fi

.SH BUGS

.P
Send bugs to michael@orlitzky.com.