X-Git-Url: http://gitweb.michael.orlitzky.com/?a=blobdiff_plain;f=doc%2Fman1%2Fhtsn-import.1;h=aebfb062bd0c4030eb047e8791bc82e17e286189;hb=449d86461d8afd7839de750ec48339a4c0f735d0;hp=352eb2b6ab66048f400f97521b580f082c75794e;hpb=a9a8667246a544705d85698ec967437e4770be2c;p=dead%2Fhtsn-import.git

diff --git a/doc/man1/htsn-import.1 b/doc/man1/htsn-import.1
index 352eb2b..aebfb06 100644
--- a/doc/man1/htsn-import.1
+++ b/doc/man1/htsn-import.1
@@ -106,6 +106,29 @@ type are provided with the \fBhtsn-import\fR documentation, in the
 should be considered a bug if they are incorrect. The diagrams are
 created using the pgModeler tool.
 
+.SH NULL POLICY
+.P
+Normally in a database one makes a distinction between fields that
+simply don't exist, and those fields that are
+\(dqempty\(dq. Translating from XML, there is a natural way to
+determine which one should be used: if an element is present in the
+XML document but its contents are empty, then an empty string should
+be inserted into the corresponding field. If on the other hand the
+element is missing entirely, the corresponding database entry should
+be NULL to indicate that fact.
+.P
+This sounds well and good, but the XML must be consistent for the
+database consumer to make any sense of what he sees. The feed XML uses
+optional and blank elements interchangeably, and without any
+discernable pattern. To propagate this pattern into the database would
+only cause confusion.
+.P
+As a result, a policy was adopted: both optional elements and elements
+whose contents can be empty will be considered nullable in the
+database. If the element is missing, the corresponding field is
+NULL. Likewise if the content is simply missing. That means there
+should never be a (completely) empty string in a database column.
+
 .SH XML SCHEMA GENERATION
 .P
 In order to parse XML, you need to know the structure of your
@@ -245,6 +268,21 @@ construct the DTDs ourselves, the results are sometimes inconsistent.
 Here we document a few of them.
 
 .IP \[bu] 2
+\fInewsxml.dtd\fR
+
+The TSN DTD for news (and almost all XML on the wire) suggests that
+there is exactly one (possibly-empty) element present in each
+message. However, we have seen an example (XML_File_ID 21232353) where
+an empty element followed a non-empty one:
+
+.nf
+Odd Man Rush: Snow under pressure to improve Isles quickly
+
+.fi
+
+We don't parse this case at the moment.
+
+.IP \[bu]
 \fIOdds_XML.dtd\fR
 
 The elements here are supposed to be associated with a set of
@@ -264,6 +302,28 @@ it would greatly complicate things. The first form is more common, so
 that's all we support for now. An example is provided as
 schemagen/weatherxml/20143655.xml.
 
+.SH DEPLOYMENT
+.P
+When deploying for the first time, the target database will most
+likely be empty. The schema will be migrated when a new document type
+is seen, but this has a downside: it can be months before every
+supported document type has been seen once. This can make it difficult
+to test the database permissions.
+.P
+Since all of the test XML documents have old timestamps, one easy
+workaround is the following: simply import all of the test XML
+documents, and then delete them using whatever script is used to prune
+old entries. This will force the migration of the schema, after which
+you can set and test the database permissions.
+.P
+Something as simple as,
+.P
+.nf
+.I $ find ./test/xml -iname '*.xml' | xargs htsn-import -c foo.sqlite
+.fi
+.P
+should do it.
+
 .SH OPTIONS
 .IP \fB\-\-backend\fR,\ \fB\-b\fR
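
Note on the NULL POLICY hunk: the rule it describes (a missing element and an element with empty contents both become NULL) can be illustrated with a small sketch. This is not the htsn-import implementation; the element names (message, title, byline, location), the table name (news), and the helper nullable_text are all hypothetical, and the sketch assumes that only a completely empty string counts as "empty", as the policy text suggests.

```python
import sqlite3
import xml.etree.ElementTree as ET


def nullable_text(parent, tag):
    """Return the text of child `tag`, or None if the element is
    missing or its contents are completely empty (hypothetical helper)."""
    child = parent.find(tag)
    if child is None:
        # Optional element absent from the document -> NULL.
        return None
    if child.text is None or child.text == "":
        # Element present but with empty contents -> also NULL.
        return None
    return child.text


# Hypothetical message: <byline> is present but empty, <location> is absent.
doc = ET.fromstring(
    "<message><title>Odd Man Rush</title><byline></byline></message>"
)
row = tuple(nullable_text(doc, tag) for tag in ("title", "byline", "location"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE news (title TEXT, byline TEXT, location TEXT)")
conn.execute("INSERT INTO news VALUES (?, ?, ?)", row)
stored = conn.execute("SELECT title, byline, location FROM news").fetchone()
```

With this normalization a consumer only ever has to test a column against NULL, and never has to tell an empty string apart from a missing value.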