.TH htsn-import 1

.SH NAME
htsn-import \- Import XML files from The Sports Network into an RDBMS.

.SH SYNOPSIS

\fBhtsn-import\fR [OPTIONS] [FILES]

.SH DESCRIPTION
.P
The Sports Network <http://www.sportsnetwork.com/> offers an XML feed
containing various sports news and statistics. Our sister program
\fBhtsn\fR is capable of retrieving the feed and saving the individual
XML documents contained therein. But what to do with them?
.P
The purpose of \fBhtsn-import\fR is to take these XML documents and
get them into something we can use: a relational database management
system (RDBMS), loosely known as a SQL database. The structure of a
relational database is, well, relational, and the feed XML is not. So
there is some work to do before the data can be inserted.
.P
First, we must parse the XML. Each supported document type (see below)
has a full pickle/unpickle implementation (\(dqpickle\(dq is simply a
synonym for serialize here). That means that we parse the entire
document into a data structure, and if we pickle (serialize) that data
structure, we get the exact same XML document that we started with.
.P
This is important for two reasons. First, it serves as a second level
of validation. The first validation is performed by the XML parser,
but if that succeeds and unpickling fails, we know that something is
fishy. Second, we don't ever want to be surprised by some new element
or attribute showing up in the XML. The fact that we can unpickle the
whole thing now means that we won't be surprised in the future.
.P
The aforementioned feature is especially important because we
automatically migrate the database schema every time we import a
document. If you attempt to import a \(dqnewsxml.dtd\(dq document, all
database objects relating to the news will be created if they do not
exist. We don't want the schema to change out from under us without
warning, so it's important that no XML be parsed that would result in
a different schema than we had previously. Since we can
pickle/unpickle everything already, this should be impossible.
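.P
For example, importing a \(dqnewsxml.dtd\(dq document into a Sqlite
database that does not yet contain the news tables will simply create
them during the import. The database name \(dqfresh.sqlite3\(dq below
is only an illustration; any connection string works the same way:
.nf
.I $ htsn-import --connection-string='fresh.sqlite3' \\\\
.I "  test/xml/newsxml.xml"
Successfully imported test/xml/newsxml.xml.
Imported 1 document(s) total.
.fi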

.SH SUPPORTED DOCUMENT TYPES
.P
The XML document types obtained from the feed are uniquely identified
by their DTDs. We currently support documents with the following DTDs:
.IP \[bu] 2
Heartbeat.dtd
.IP \[bu]
newsxml.dtd
.IP \[bu]
Injuries_Detail_XML.dtd
.IP \[bu]
injuriesxml.dtd
.IP \[bu]
Odds_XML.dtd
.IP \[bu]
weatherxml.dtd

.SH DATABASE SCHEMA
.P
At the top level, we have one table for each of the XML document types
that we import. For example, the documents corresponding to
\fInewsxml.dtd\fR will have a table called \(dqnews\(dq.
.P
These top-level tables will often have children. For example, each
news item has zero or more locations associated with it. The child
table will be named <parent>_<children>, which in this case
corresponds to \(dqnews_locations\(dq.
.P
To relate the two, a third table exists with the name <parent
table>__<child table>. Note the two underscores. This prevents
ambiguity when the child table itself contains underscores. As long as
we never go more than one level down, this system should suffice. The
table joining \(dqnews\(dq with \(dqnews_locations\(dq is thus called
\(dqnews__news_locations\(dq.
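.P
For instance, the following query pulls each news item together with
its associated locations by going through the join table. It is only a
sketch: the \(dqid\(dq, \(dqnews_id\(dq, and \(dqnews_locations_id\(dq
column names are assumptions made for illustration, and the real
column names appear in the UML diagrams mentioned below. The
\(dqfoo.sqlite3\(dq database is the one used in the EXAMPLES section.
.nf
.I $ sqlite3 foo.sqlite3 \\\\
.I "  'SELECT * FROM news"
.I "     JOIN news__news_locations"
.I "       ON news__news_locations.news_id = news.id"
.I "     JOIN news_locations"
.I "       ON news_locations.id = news__news_locations.news_locations_id;'"
.fi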
.P
Wherever possible, children are kept unique to prevent pointless
duplication. This slows down inserts and speeds up reads (which we
assume are much more frequent). The current rate at which the feed
transmits XML is much too slow for the inserts to become a problem.
.P
UML diagrams of the resulting database schema for each XML document
type are provided with the \fBhtsn-import\fR documentation.
.P
In some cases the top-level table for a document type has been
omitted. For example, all of the information in the
\(dqinjuriesxml\(dq documents is contained in \(dqlisting\(dq
elements. We therefore omit the \(dqinjuries\(dq table and create only
\(dqinjuries_listings\(dq.
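.P
One way to check exactly which tables were created is to list them in
the backend; with the Sqlite backend, for example, you can use the
\(dq.tables\(dq command of \fBsqlite3\fR on the database you imported
into:
.nf
.I $ sqlite3 foo.sqlite3 '.tables'
.fi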

.SH OPTIONS

.IP \fB\-\-backend\fR,\ \fB\-b\fR
The RDBMS backend to use. Valid choices are \fISqlite\fR and
\fIPostgres\fR. Capitalization is important, sorry.

Default: Sqlite

.IP \fB\-\-connection-string\fR,\ \fB\-c\fR
The connection string used for connecting to the database backend
given by the \fB\-\-backend\fR option. The default is appropriate for
the \fISqlite\fR backend.

Default: \(dq:memory:\(dq

.IP \fB\-\-log-file\fR
If you specify a file here, logs will be written to it (possibly in
addition to syslog). Can be either a relative or absolute path. It
will not be auto-rotated; use something like logrotate for that.

Default: none

.IP \fB\-\-log-level\fR
How verbose should the logs be? We log notifications at three levels:
INFO, WARN, and ERROR. Specify the \(dqmost boring\(dq level of
notifications you would like to receive (in all-caps); more
interesting notifications will be logged as well.

Default: INFO

.IP \fB\-\-remove\fR,\ \fB\-r\fR
Remove successfully processed files. If you enable this, you can see
at a glance which XML files are not being processed, because they're
all that should be left.

Default: disabled

.IP \fB\-\-syslog\fR,\ \fB\-s\fR
Enable logging to syslog. On Windows this will attempt to communicate
(over UDP) with a syslog daemon on localhost, which will most likely
not work.

Default: disabled

.IP \fB\-\-username\fR,\ \fB\-u\fR
Your TSN username. A username is required, so you must supply one
either on the command line or in a configuration file.

Default: none

.SH CONFIGURATION FILE
.P
Any of the command-line options mentioned above can be specified in a
configuration file instead. We first look for \(dqhtsn-importrc\(dq in
the system configuration directory. We then look for a file named
\(dq.htsn-importrc\(dq in the user's home directory. The latter will
override the former.
.P
The user's home directory is simply $HOME on Unix; on Windows it's
wherever %APPDATA% points. The system configuration directory is
determined by Cabal; the \(dqsysconfdir\(dq parameter during the
\(dqconfigure\(dq step is used.
.P
The file's syntax is given by examples in the htsn-importrc.example file
(included with \fBhtsn-import\fR).
.P
Options specified on the command-line override those in either
configuration file.
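.P
As a purely hypothetical sketch (the authoritative syntax is whatever
htsn-importrc.example shows), a configuration file for the Postgres
example below might pair the long option names with their values,
along these lines:
.nf
# hypothetical ~/.htsn-importrc; see htsn-importrc.example for the real syntax
backend = "Postgres"
connection-string = "dbname=htsn user=postgres"
username = "my_tsn_username"
.fi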

.SH EXAMPLES
.IP \[bu] 2
Import newsxml.xml into a preexisting sqlite database named \(dqfoo.sqlite3\(dq:

.nf
.I $ htsn-import --connection-string='foo.sqlite3' \\\\
.I " test/xml/newsxml.xml"
Successfully imported test/xml/newsxml.xml.
Imported 1 document(s) total.
.fi
.IP \[bu]
Repeat the previous example, but delete newsxml.xml afterwards:

.nf
.I $ htsn-import --connection-string='foo.sqlite3' \\\\
.I " --remove test/xml/newsxml.xml"
Successfully imported test/xml/newsxml.xml.
Imported 1 document(s) total.
Removed processed file test/xml/newsxml.xml.
.fi
.IP \[bu]
Use a Postgres database instead of the default Sqlite. This assumes
that you have a database named \(dqhtsn\(dq accessible to user
\(dqpostgres\(dq locally:

.nf
.I $ htsn-import --connection-string='dbname=htsn user=postgres' \\\\
.I " --backend=Postgres test/xml/newsxml.xml"
Successfully imported test/xml/newsxml.xml.
Imported 1 document(s) total.
.fi

.SH BUGS

.P
Send bugs to michael@orlitzky.com.