From: Michael Orlitzky Date: Fri, 25 Jul 2014 03:23:03 +0000 (-0400) Subject: Add a new module, TSN.XML.MLBEarlyLine supporting MLB_earlylineXML.dtd. X-Git-Tag: 0.0.9~6 X-Git-Url: https://gitweb.michael.orlitzky.com/?a=commitdiff_plain;h=b63146a890d17e73e5943b427d51fd9311365bb8;p=dead%2Fhtsn-import.git Add a new module, TSN.XML.MLBEarlyLine supporting MLB_earlylineXML.dtd. Add TSN.XML.MLBEarlyLine to the .ghci and cabal files. Mention all TSN.XML.MLBEarlyLine-related weirdness in the man page. Add test cases for TSN.XML.MLBEarlyLine to the test suite. Enable import of MLB_earlylineXML.dtd documents in Main. Bump the file counts in import-duplicates.test. --- diff --git a/.ghci b/.ghci index 0df75f1..be99942 100644 --- a/.ghci +++ b/.ghci @@ -25,6 +25,7 @@ src/TSN/XML/Injuries.hs src/TSN/XML/InjuriesDetail.hs src/TSN/XML/JFile.hs + src/TSN/XML/MLBEarlyLine.hs src/TSN/XML/News.hs src/TSN/XML/Odds.hs src/TSN/XML/ScheduleChanges.hs @@ -56,6 +57,7 @@ import TSN.XML.Heartbeat import TSN.XML.Injuries import TSN.XML.InjuriesDetail import TSN.XML.JFile +import TSN.XML.MLBEarlyLine import TSN.XML.News import TSN.XML.Odds import TSN.XML.ScheduleChanges diff --git a/doc/TODO b/doc/TODO index 01caafd..9ab4e8e 100644 --- a/doc/TODO +++ b/doc/TODO @@ -39,7 +39,6 @@ * Minor_Baseball_TeamScheduleXML * MinorLeagueHockeyTeamScheduleXML * MLB_Boxscore_XML - * MLB_earlylineXML * MLB_IndividualStats_XML * MLB_Probable_Pitchers_XML * MLB_Roster_XML @@ -64,3 +63,6 @@ 5. Consolidate all of the make_game_time functions which take a date/time and produce a combined time. + +6. Factor out test code where possible; a lot of them differ only in + the filename. diff --git a/doc/man1/htsn-import.1 b/doc/man1/htsn-import.1 index 66f7ae5..34a8bbe 100644 --- a/doc/man1/htsn-import.1 +++ b/doc/man1/htsn-import.1 @@ -106,6 +106,26 @@ type are provided with the \fBhtsn-import\fR documentation, in the should be considered a bug if they are incorrect.
The diagrams are created using the pgModeler tool. +.SH DATABASE SCHEMA COMPROMISES + +There are a few places that the database schema isn't exactly how we'd +like it to be: + +.IP \[bu] 2 +\fIearlylineXML.dtd\fR + +The database representations for earlylineXML.dtd and +MLB_earlylineXML.dtd are the same; that is, they share the same +tables. The two document types represent team names in different +ways. In order to accommodate both types with one parser, we had to +make both ways optional, and then merge the two together before +converting to the database representation. + +Unfortunately, when we merge two optional things together, we get +another optional thing back. There's no way to say that \(dqat least +one is not optional.\(dq So the team names in the database schema are +optional as well, even though they should always be present. + .SH NULL POLICY .P Normally in a database one makes a distinction between fields that @@ -204,8 +224,8 @@ Successfully imported schemagen/Odds_XML/19996433.xml. Processed 1 document(s) total. .fi .P -At this point, the database schema matches the old documents, i.e. the -ones without \fIAStarter\fR and \fIHStarter\fR. If we use a new +At this point, the database schema matches the old documents, that is, +the ones without \fIAStarter\fR and \fIHStarter\fR. If we use a new version of \fBhtsn-import\fR, supporting the new fields, the migration is handled gracefully: .P @@ -284,6 +304,33 @@ We don't parse this case at the moment, but we do recognize it and report it as unsupported so that offending documents can be removed. An example is provided as test/xml/newsxml-multiple-sms.xml. +.IP \[bu] +\fIMLB_earlylineXML.dtd\fR + +Unlike earlylineXML.dtd, this document type has more than one +<game> associated with each <date>. Moreover, each <date> has a bunch of +<note> children that are supposed to be associated with the <game>s, +but the document structure indicates no explicit relationship. For +example, + +.nf +<date> + <note>...</note> + <game>...</game> + <game>...</game> + <note>...</note> + <game>...</game>
+</date> +.fi + +Here the first <note> is inferred to apply to the two <game>s that +follow it, and the second <note> applies to the single <game> that +follows it. But this is very fragile to parse. Instead, we use a hack +to facilitate (un)pickling, and then drop the notes entirely during +the database conversion. + +A similar workaround is implemented for Odds_XML.dtd. + .IP \[bu] \fIOdds_XML.dtd\fR @@ -333,10 +380,10 @@ in every message. A typical timestamp looks like, The \(dqtime zone\(dq is given as \(dqET\(dq, but unfortunately \(dqET\(dq is not a valid time zone. It stands for \(dqEastern Time\(dq, which can belong to either of two time zones, EST or EDT, -based on the time of the year (i.e. whether or not daylight savings -time is in effect). Since we can't tell from the timestamp, we always -parse these as EST which is UTC-5. When daylight savings is in effect, -they will be off by an hour. +based on the time of the year (that is, whether or not daylight +savings time is in effect). Since we can't tell from the timestamp, we +always parse these as EST which is UTC-5. When daylight savings is in +effect, they will be off by an hour. Here's a list of the ones that may cause surprises: @@ -412,6 +459,11 @@ date: They are also stored as UTC. +.IP \[bu] +\fIMLB_earlylineXML.dtd\fR + +See earlylineXML.dtd.
+ .IP \[bu] \fIOdds_XML.dtd\fR @@ -589,6 +641,8 @@ injuriesxml.dtd .IP \[bu] jfilexml.dtd .IP \[bu] +MLB_earlylineXML.dtd +.IP \[bu] newsxml.dtd .IP \[bu] Odds_XML.dtd diff --git a/htsn-import.cabal b/htsn-import.cabal index e7d2940..0f04f6e 100644 --- a/htsn-import.cabal +++ b/htsn-import.cabal @@ -76,6 +76,7 @@ extra-source-files: schemagen/MLB_Matchup_XML/*.xml schemagen/mlbonbasepctxml/*.xml schemagen/MLBOPSXML/*.xml + schemagen/MLB_earlylineXML/*.xml schemagen/MLB_Pitching_Appearances_Leaders/*.xml schemagen/MLB_Pitching_Balks_Leaders/*.xml schemagen/MLB_Pitching_CG_Leaders/*.xml @@ -280,6 +281,7 @@ executable htsn-import TSN.XML.Injuries TSN.XML.InjuriesDetail TSN.XML.JFile + TSN.XML.MLBEarlyLine TSN.XML.News TSN.XML.Odds TSN.XML.ScheduleChanges diff --git a/src/Main.hs b/src/Main.hs index 63bde65..59da419 100644 --- a/src/Main.hs +++ b/src/Main.hs @@ -60,6 +60,9 @@ import qualified TSN.XML.Injuries as Injuries ( dtd, pickle_message ) import qualified TSN.XML.InjuriesDetail as InjuriesDetail ( dtd, pickle_message ) +import qualified TSN.XML.MLBEarlyLine as MLBEarlyLine ( + dtd, + pickle_message ) import qualified TSN.XML.JFile as JFile ( dtd, pickle_message ) import qualified TSN.XML.News as News ( dtd, @@ -207,6 +210,9 @@ import_file cfg path = do | dtd == JFile.dtd = go JFile.pickle_message + | dtd == MLBEarlyLine.dtd = + go MLBEarlyLine.pickle_message + | dtd == News.dtd = -- Some of the newsxml docs are busted in predictable ways. -- We want them to "succeed" so that they're deleted. diff --git a/src/TSN/XML/MLBEarlyLine.hs b/src/TSN/XML/MLBEarlyLine.hs new file mode 100644 index 0000000..fc90483 --- /dev/null +++ b/src/TSN/XML/MLBEarlyLine.hs @@ -0,0 +1,136 @@ +-- | Parse TSN XML for the DTD \"MLB_earlylineXML.dtd\". This module +-- is unique (so far) in that it is almost entirely a subclass of +-- another module, "TSN.XML.EarlyLine". 
The database representations +-- should be almost identical, and the XML schema /could/ be +-- similar, but instead, welcome to the jungle baby. Here are the +-- differences: +-- +-- * In earlylineXML.dtd, each \ element contains exactly one +-- game. In MLB_earlylineXML.dtd, they contain multiple games. +-- +-- * As a result of the previous difference, the \s are no +-- longer in one-to-one correspondence with the games. The +-- \ elements are thrown in beside the \s, and we're +-- supposed to figure out to which \s they correspond +-- ourselves. This is the same sort of nonsense going on with +-- 'TSN.XML.Odds.OddsGameWithNotes'. +-- +-- * The \ element can be empty in +-- MLB_earlylineXML.dtd (it can't in earlylineXML.dtd). +-- +-- * Each home/away team in MLB_earlylineXML.dtd has a \ +-- that isn't present in the regular earlylineXML.dtd. +-- +-- * In earlylineXML.dtd, the home/away team lines are given as +-- attributes on the \ and \ elements +-- respectively. In MLB_earlylineXML.dtd, the lines can be found +-- in \ elements that are children of the \ and +-- \ elements. +-- +-- * In earlylineXML.dtd, the team names are given as text within +-- the \ and \ elements. In MLB_earlylineXML.dtd, +-- they are instead given as attributes on those respective +-- elements. +-- +-- Most of these difficulties have been worked around in +-- "TSN.XML.EarlyLine", so this module could be kept somewhat boring. +-- +module TSN.XML.MLBEarlyLine ( + dtd, + mlb_early_line_tests, + module TSN.XML.EarlyLine -- This re-exports the EarlyLine and EarlyLineGame + -- constructors unnecessarily. Whatever. + ) +where + +-- System imports (needed only for tests) +import Database.Groundhog ( + countAll, + deleteAll, + migrate, + runMigration, + silentMigrationLogger ) +import Database.Groundhog.Generic ( runDbConn ) +import Database.Groundhog.Sqlite ( withSqliteConn ) +import Test.Tasty ( TestTree, testGroup ) +import Test.Tasty.HUnit ( (@?=), testCase ) + + +-- Local imports. 
+import TSN.DbImport ( DbImport( dbimport ) ) +import TSN.XML.EarlyLine ( EarlyLine, EarlyLineGame, pickle_message ) +import Xml ( + pickle_unpickle, + unpickleable, + unsafe_unpickle ) + + +-- | The DTD to which this module corresponds. Used to invoke dbimport. +-- +dtd :: String +dtd = "MLB_earlylineXML.dtd" + + + +-- +-- * Tasty Tests +-- + +-- | A list of all tests for this module. +-- +mlb_early_line_tests :: TestTree +mlb_early_line_tests = + testGroup + "MLBEarlyLine tests" + [ test_on_delete_cascade, + test_pickle_of_unpickle_is_identity, + test_unpickle_succeeds ] + +-- | If we unpickle something and then pickle it, we should wind up +-- with the same thing we started with. WARNING: success of this +-- test does not mean that unpickling succeeded. +-- +test_pickle_of_unpickle_is_identity :: TestTree +test_pickle_of_unpickle_is_identity = + testCase "pickle composed with unpickle is the identity" $ do + let path = "test/xml/MLB_earlylineXML.xml" + (expected, actual) <- pickle_unpickle pickle_message path + actual @?= expected + + + +-- | Make sure we can actually unpickle these things. +-- +test_unpickle_succeeds :: TestTree +test_unpickle_succeeds = + testCase "unpickling succeeds" $ do + let path = "test/xml/MLB_earlylineXML.xml" + actual <- unpickleable path pickle_message + + let expected = True + actual @?= expected + + + +-- | Make sure everything gets deleted when we delete the top-level +-- record. 
+-- +test_on_delete_cascade :: TestTree +test_on_delete_cascade = + testCase "deleting (MLB) early_lines deletes its children" $ do + let path = "test/xml/MLB_earlylineXML.xml" + results <- unsafe_unpickle path pickle_message + let a = undefined :: EarlyLine + let b = undefined :: EarlyLineGame + + actual <- withSqliteConn ":memory:" $ runDbConn $ do + runMigration silentMigrationLogger $ do + migrate a + migrate b + _ <- dbimport results + deleteAll a + count_a <- countAll a + count_b <- countAll b + return $ sum [count_a, count_b] + let expected = 0 + actual @?= expected diff --git a/test/TestSuite.hs b/test/TestSuite.hs index ee38148..e82c2a0 100644 --- a/test/TestSuite.hs +++ b/test/TestSuite.hs @@ -9,6 +9,7 @@ import TSN.XML.Heartbeat ( heartbeat_tests ) import TSN.XML.Injuries ( injuries_tests ) import TSN.XML.InjuriesDetail ( injuries_detail_tests ) import TSN.XML.JFile ( jfile_tests ) +import TSN.XML.MLBEarlyLine ( mlb_early_line_tests ) import TSN.XML.News ( news_tests ) import TSN.XML.Odds ( odds_tests ) import TSN.XML.ScheduleChanges ( schedule_changes_tests ) @@ -27,6 +28,7 @@ tests = testGroup injuries_tests, injuries_detail_tests, jfile_tests, + mlb_early_line_tests, news_tests, odds_tests, pickler_tests, diff --git a/test/shell/import-duplicates.test b/test/shell/import-duplicates.test index 2e2e27c..94034c8 100644 --- a/test/shell/import-duplicates.test +++ b/test/shell/import-duplicates.test @@ -16,15 +16,15 @@ rm -f shelltest.sqlite3 # and a newsxml that aren't really supposed to import. find ./test/xml -maxdepth 1 -name '*.xml' | wc -l >>> -28 +29 >>>= 0 # Run the imports again; we should get complaints about the duplicate -# xml_file_ids. There are 2 errors for each violation, so we expect 2*24 +# xml_file_ids. There are 2 errors for each violation, so we expect 2*25 # occurrences of the string 'ERROR'. 
./dist/build/htsn-import/htsn-import -c 'shelltest.sqlite3' test/xml/*.xml 2>&1 | grep ERROR | wc -l >>> -48 +50 >>>= 0 # Finally, clean up after ourselves. diff --git a/test/xml/MLB_earlylineXML.dtd b/test/xml/MLB_earlylineXML.dtd new file mode 100644 index 0000000..e3c242f --- /dev/null +++ b/test/xml/MLB_earlylineXML.dtd @@ -0,0 +1,22 @@ + + + + + + + + + + + + + + + + + + + + + + diff --git a/test/xml/MLB_earlylineXML.xml b/test/xml/MLB_earlylineXML.xml new file mode 100644 index 0000000..0b2d120 --- /dev/null +++ b/test/xml/MLB_earlylineXML.xml @@ -0,0 +1 @@ + 21161927 AAO;MLB-EARLY-LINE Odds MLB Major League Baseball Overnight Line National League J.Nelson R.Wolf -105 8o Game one of doubleheader B.Arroyo R.Montero -105 7.5u J.Beckett -110 AJ.Burnett D.Fister F.Liriano -115 7p J.Hammel I.Kennedy -125 6.5p F.Morales J.Teheran -160 7.5p A.Wainwright -140 M.Leake 6.5o American League C.Lewis J.Verlander -180 8.5u D.Pomeranz -125 J.Happ 9o T.Bauer M.Gonzalez -120 9p B.Workman J.Odorizzi -120 8p M.Tanaka -165 A.Rienzo 7.5o J.Vargas G.Richards -155 8u D.Keuchel H.Iwakuma -165 6.5o Inter League R.Nolasco M.Bumgarner -175 7.5p Write-in game - Game two of doubleheader Z.Spruill D.Matsuzaka off off May 24, 2014, at 03:05 PM ET \ No newline at end of file
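
The man page text added by this patch describes two compromises: merging two optional team-name representations yields another optional value, and the notes that sit beside the games are dropped during the database conversion. The sketch below illustrates both ideas in isolation. It is not the code from TSN.XML.EarlyLine; the types TeamXml and DateChild and the functions merged_name and drop_notes are hypothetical names invented for this illustration.

```haskell
import Control.Applicative ( (<|>) )

-- | Hypothetical stand-in for a team as a shared parser might see it.
--   One document type gives the name as an attribute, the other as
--   element text, so both fields have to be optional.
data TeamXml =
  TeamXml {
    name_attribute :: Maybe String, -- MLB_earlylineXML.dtd style
    name_text      :: Maybe String  -- earlylineXML.dtd style
  }
  deriving (Eq, Show)

-- | Merge the two optional names, preferring the attribute. The
--   result is still a Maybe: the type cannot express "at least one
--   of the two is present", which is why the database column ends
--   up optional as well.
merged_name :: TeamXml -> Maybe String
merged_name t = name_attribute t <|> name_text t

-- | Hypothetical stand-in for the children of one date: notes and
--   games interleaved with no explicit relationship between them.
data DateChild = Note String | Game String
  deriving (Eq, Show)

-- | Keep only the games, discarding the notes, rather than guessing
--   which games each note was meant to apply to.
drop_notes :: [DateChild] -> [String]
drop_notes children = [ g | Game g <- children ]
```

Loaded into GHCi, `merged_name (TeamXml Nothing Nothing)` evaluates to `Nothing`, which is the case the schema has to allow for even though real documents should never produce it.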