X-Git-Url: http://gitweb.michael.orlitzky.com/?a=blobdiff_plain;ds=sidebyside;f=doc%2Fman1%2Fhtsn-import.1;h=aebfb062bd0c4030eb047e8791bc82e17e286189;hb=09fb9f7ddc30d003224da3fc45142ce2e37c4cbf;hp=352eb2b6ab66048f400f97521b580f082c75794e;hpb=a9a8667246a544705d85698ec967437e4770be2c;p=dead%2Fhtsn-import.git
diff --git a/doc/man1/htsn-import.1 b/doc/man1/htsn-import.1
index 352eb2b..aebfb06 100644
--- a/doc/man1/htsn-import.1
+++ b/doc/man1/htsn-import.1
@@ -106,6 +106,29 @@ type are provided with the \fBhtsn-import\fR documentation, in the
should be considered a bug if they are incorrect. The diagrams are
created using the pgModeler tool.
+.SH NULL POLICY
+.P
+Normally in a database one makes a distinction between fields that
+simply don't exist, and those fields that are
+\(dqempty\(dq. Translating from XML, there is a natural way to
+determine which one should be used: if an element is present in the
+XML document but its contents are empty, then an empty string should
+be inserted into the corresponding field. If on the other hand the
+element is missing entirely, the corresponding database entry should
+be NULL to indicate that fact.
+.P
+This is all well and good, but the XML must be consistent for the
+database consumer to make any sense of what he sees. The feed XML uses
+optional and blank elements interchangeably, and without any
+discernible pattern. To propagate this inconsistency into the database
+would only cause confusion.
+.P
+As a result, a policy was adopted: both optional elements and elements
+whose contents can be empty will be considered nullable in the
+database. If the element is missing, the corresponding field is
+NULL. The same is true if the element is present but empty. That means
+there should never be a (completely) empty string in a database column.
+
.SH XML SCHEMA GENERATION
.P
In order to parse XML, you need to know the structure of your
@@ -245,6 +268,21 @@ construct the DTDs ourselves, the results are sometimes
inconsistent. Here we document a few of them.
.IP \[bu] 2
+\fInewsxml.dtd\fR
+
+The TSN DTD for news (and almost all XML on the wire) suggests that
+there is exactly one (possibly-empty) element present in each
+message. However, we have seen an example (XML_File_ID 21232353) where
+an empty element followed a non-empty one:
+
+.nf
+Odd Man Rush: Snow under pressure to improve Isles quickly
+
+.fi
+
+We don't parse this case at the moment.
+
+.IP \[bu]
\fIOdds_XML.dtd\fR
The elements here are supposed to be associated with a set of
@@ -264,6 +302,28 @@ it would greatly complicate things. The first form is more common, so
that's all we support for now. An example is provided as
schemagen/weatherxml/20143655.xml.
+.SH DEPLOYMENT
+.P
+When deploying for the first time, the target database will most
+likely be empty. The schema will be migrated when a new document type
+is seen, but this has a downside: it can be months before every
+supported document type has been seen once. This can make it difficult
+to test the database permissions.
+.P
+Since all of the test XML documents have old timestamps, one easy
+workaround is the following: simply import all of the test XML
+documents, and then delete them using whatever script is used to prune
+old entries. This will force the migration of the schema, after which
+you can set and test the database permissions.
+.P
+Something as simple as,
+.P
+.nf
+.I $ find ./test/xml -iname '*.xml' | xargs htsn-import -c foo.sqlite
+.fi
+.P
+should do it.
+
.SH OPTIONS
.IP \fB\-\-backend\fR,\ \fB\-b\fR