]> gitweb.michael.orlitzky.com - dead/htsn-import.git/blob - doc/man1/htsn-import.1
b0b4f9c3050ed17d0f009bfdc25469421d4430e9
[dead/htsn-import.git] / doc / man1 / htsn-import.1
1 .TH htsn-import 1
2
3 .SH NAME
4 htsn-import \- Import XML files from The Sports Network into an RDBMS.
5
6 .SH SYNOPSIS
7
8 \fBhtsn-import\fR [OPTIONS] [FILES]
9
10 .SH DESCRIPTION
11 .P
12 The Sports Network <http://www.sportsnetwork.com/> offers an XML feed
13 containing various sports news and statistics. Our sister program
14 \fBhtsn\fR is capable of retrieving the feed and saving the individual
15 XML documents contained therein. But what to do with them?
16 .P
17 The purpose of \fBhtsn-import\fR is to take these XML documents and
18 get them into something we can use, a relational database management
19 system (RDBMS), loosely known as a SQL database. The structure of
20 relational database, is, well, relational, and the feed XML is not. So
21 there is some work to do before the data can be inserted.
22 .P
23 First, we must parse the XML. Each supported document type (see below)
24 has a full pickle/unpickle implementation (\(dqpickle\(dq is simply a
25 synonym for serialize here). That means that we parse the entire
26 document into a data structure, and if we pickle (serialize) that data
27 structure, we get the exact same XML document tha we started with.
28 .P
29 This is important for two reasons. First, it serves as a second level
30 of validation. The first validation is performed by the XML parser,
31 but if that succeeds and unpicking fails, we know that something is
32 fishy. Second, we don't ever want to be surprised by some new element
33 or attribute showing up in the XML. The fact that we can unpickle the
34 whole thing now means that we won't be surprised in the future.
35 .P
36 The aforementioned feature is especially important because we
37 automatically migrate the database schema every time we import a
38 document. If you attempt to import a \(dqnewsxml.dtd\(dq document, all
39 database objects relating to the news will be created if they do not
40 exist. We don't want the schema to change out from under us without
41 warning, so it's important that no XML be parsed that would result in
42 a different schema than we had previously. Since we can
43 pickle/unpickle everything already, this should be impossible.
44
45 .SH SUPPORTED DOCUMENT TYPES
46 .P
47 The XML document types obtained from the feed are uniquely identified
48 by their DTDs. We currently support documents with the following DTDs:
49 .IP \[bu] 2
50 Heartbeat.dtd
51 .IP \[bu]
52 newsxml.dtd
53 .IP \[bu]
54 Injuries_Detail_XML.dtd
55 .IP \[bu]
56 injuriesxml.dtd
57 .IP \[bu]
58 Odds_XML.dtd
59 .IP \[bu]
60 weatherxml.dtd
61
62 .SH DATABASE SCHEMA
63 .P
64 At the top level, we have one table for each of the XML document types
65 that we import. For example, the documents corresponding to
66 \fInewsxml.dtd\fR will have a table called \(dqnews\(dq.
67 .P
68 These top-level tables will often have children. For example, each
69 news item has zero or more locations associated with it. The child
70 table will be named <parent>_<children>, which in this case
71 corresponds to \(dqnews_locations\(dq.
72 .P
73 To relate the two, a third table may exist with name <parent
74 table>__<child table>. Note the two underscores. This prevents
75 ambiguity when the child table itself contains underscores. The table
76 joining \(dqnews\(dq with \(dqnews_locations\(dq is thus called
77 \(dqnews__news_locations\(dq.
78 .P
79 Where it makes sense, children are kept unique to prevent pointless
80 duplication. This slows down inserts, and speeds up reads (which are
81 much more frequent). There is a tradeoff to be made, however. For a
82 table with a small, fixed upper bound on the number of rows (like
83 \(dqodds_casinos\(dq), there is great benefit to de-duplication. The
84 total number of rows stays small, so inserts are still quick, and many
85 duplicate rows are eliminated.
86 .P
87 But, with a table like \(dqodds_games\(dq, the number of games grows
88 quickly and without bound. It is therefore more beneficial to be able
89 to delete the old games (though an ON DELETE CASCADE, tied to
90 \(dqodds\(dq) than it is to eliminate duplication. A table like
91 \(dqnews_locations\(dq is somewhere in-between.
92 .P
93 UML diagrams of the resulting database schema for each XML document
94 type are provided with the \fBhtsn-import\fR documentation.
95 .P
96 In some cases the top-level table for a document type has been
97 omitted. For example, all of the information in the the
98 \(dqinjuriesxml\(dq documents is contained in \(dqlisting\(dq
99 elements. We therefore omit the \(dqinjuries\(dq table and create only
100 \(dqinjuries_listings\(dq.
101
102 .SH OPTIONS
103
104 .IP \fB\-\-backend\fR,\ \fB\-b\fR
105 The RDBMS backend to use. Valid choices are \fISqlite\fR and
106 \fIPostgres\fR. Capitalization is important, sorry.
107
108 Default: Sqlite
109
110 .IP \fB\-\-connection-string\fR,\ \fB\-c\fR
111 The connection string used for connecting to the database backend
112 given by the \fB\-\-backend\fR option. The default is appropriate for
113 the \fISqlite\fR backend.
114
115 Default: \(dq:memory:\(dq
116
117 .IP \fB\-\-log-file\fR
118 If you specify a file here, logs will be written to it (possibly in
119 addition to syslog). Can be either a relative or absolute path. It
120 will not be auto-rotated; use something like logrotate for that.
121
122 Default: none
123
124 .IP \fB\-\-log-level\fR
125 How verbose should the logs be? We log notifications at three levels:
126 INFO, WARN, and ERROR. Specify the \(dqmost boring\(dq level of
127 notifications you would like to receive (in all-caps); more
128 interesting notifications will be logged as well.
129
130 Default: INFO
131
132 .IP \fB\-\-remove\fR,\ \fB\-r\fR
133 Remove successfully processed files. If you enable this, you can see
134 at a glance which XML files are not being processed, because they're
135 all that should be left.
136
137 Default: disabled
138
139 .IP \fB\-\-syslog\fR,\ \fB\-s\fR
140 Enable logging to syslog. On Windows this will attempt to communicate
141 (over UDP) with a syslog daemon on localhost, which will most likely
142 not work.
143
144 Default: disabled
145
146 .SH CONFIGURATION FILE
147 .P
148 Any of the command-line options mentioned above can be specified in a
149 configuration file instead. We first look for \(dqhtsn-importrc\(dq in
150 the system configuration directory. We then look for a file named
151 \(dq.htsn-importrc\(dq in the user's home directory. The latter will
152 override the former.
153 .P
154 The user's home directory is simply $HOME on Unix; on Windows it's
155 wherever %APPDATA% points. The system configuration directory is
156 determined by Cabal; the \(dqsysconfdir\(dq parameter during the
157 \(dqconfigure\(dq step is used.
158 .P
159 The file's syntax is given by examples in the htsn-importrc.example file
160 (included with \fBhtsn-import\fR).
161 .P
162 Options specified on the command-line override those in either
163 configuration file.
164
165 .SH EXAMPLES
166 .IP \[bu] 2
167 Import newsxml.xml into a preexisting sqlite database named \(dqfoo.sqlite3\(dq:
168
169 .nf
170 .I $ htsn-import --connection-string='foo.sqlite3' \\\\
171 .I " test/xml/newsxml.xml"
172 Successfully imported test/xml/newsxml.xml.
173 Imported 1 document(s) total.
174 .fi
175 .IP \[bu]
176 Repeat the previous example, but delete newsxml.xml afterwards:
177
178 .nf
179 .I $ htsn-import --connection-string='foo.sqlite3' \\\\
180 .I " --remove test/xml/newsxml.xml"
181 Successfully imported test/xml/newsxml.xml.
182 Imported 1 document(s) total.
183 Removed processed file test/xml/newsxml.xml.
184 .fi
185 .IP \[bu]
186 Use a Postgres database instead of the default Sqlite. This assumes
187 that you have a database named \(dqhtsn\(dq accessible to user
188 \(dqpostgres\(dq locally:
189
190 .nf
191 .I $ htsn-import --connection-string='dbname=htsn user=postgres' \\\\
192 .I " --backend=Postgres test/xml/newsxml.xml"
193 Successfully imported test/xml/newsxml.xml.
194 Imported 1 document(s) total.
195 .fi
196
197 .SH BUGS
198
199 .P
200 Send bugs to michael@orlitzky.com.