Add TSN.XML.MLBEarlyLine to the .ghci and cabal files.
Mention all TSN.XML.MLBEarlyLine-related weirdness in the man page.
Add test cases for TSN.XML.MLBEarlyLine to the test suite.
Enable import of MLB_earlylineXML.dtd documents in Main.
Bump the file counts in import-duplicates.test.
src/TSN/XML/Injuries.hs
src/TSN/XML/InjuriesDetail.hs
src/TSN/XML/JFile.hs
+ src/TSN/XML/MLBEarlyLine.hs
src/TSN/XML/News.hs
src/TSN/XML/Odds.hs
src/TSN/XML/ScheduleChanges.hs
import TSN.XML.Injuries
import TSN.XML.InjuriesDetail
import TSN.XML.JFile
+import TSN.XML.MLBEarlyLine
import TSN.XML.News
import TSN.XML.Odds
import TSN.XML.ScheduleChanges
* Minor_Baseball_TeamScheduleXML
* MinorLeagueHockeyTeamScheduleXML
* MLB_Boxscore_XML
- * MLB_earlylineXML
* MLB_IndividualStats_XML
* MLB_Probable_Pitchers_XML
* MLB_Roster_XML
5. Consolidate all of the make_game_time functions which take a
date/time and produce a combined time.
+
+6. Factor out test code where possible; many of the tests differ only
+ in the filename.
should be considered a bug if they are incorrect. The diagrams are
created using the pgModeler <http://www.pgmodeler.com.br/> tool.
+.SH DATABASE SCHEMA COMPROMISES
+
+There are a few places where the database schema isn't exactly how
+we'd like it to be:
+
+.IP \[bu] 2
+\fIearlylineXML.dtd\fR
+
+The database representations for earlylineXML.dtd and
+MLB_earlylineXML.dtd are the same; that is, they share the same
+tables. The two document types represent team names in different
+ways. In order to accommodate both types with one parser, we had to
+make both ways optional, and then merge the two together before
+converting to the database representation.
+
+Unfortunately, when we merge two optional things together, we get
+another optional thing back. There's no way to say that \(dqat least
+one is not optional.\(dq So the team names in the database schema are
+optional as well, even though they should always be present.
+
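The point about merging two optional values can be illustrated with a minimal Haskell sketch. The function name mergeTeamName and the use of plain Maybe String values are assumptions made for illustration; they are not taken from TSN.XML.EarlyLine. The sketch only shows why the merged result is still optional: the type cannot express "at least one of the two is present."

```haskell
import Control.Applicative ( (<|>) )

-- Hypothetical helper, for illustration only. The first argument
-- stands for a team name read from element text (earlylineXML.dtd
-- style), the second for a name read from an attribute
-- (MLB_earlylineXML.dtd style).
mergeTeamName :: Maybe String -> Maybe String -> Maybe String
mergeTeamName fromText fromAttribute = fromText <|> fromAttribute
-- The result type is still Maybe String: if both inputs are
-- Nothing, so is the output, and the type system cannot rule
-- that case out.
```

Because the result is a Maybe, a database column derived from it would have to be nullable, which matches the compromise described above.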
.SH NULL POLICY
.P
Normally in a database one makes a distinction between fields that
Processed 1 document(s) total.
.fi
.P
-At this point, the database schema matches the old documents, i.e. the
-ones without \fIAStarter\fR and \fIHStarter\fR. If we use a new
+At this point, the database schema matches the old documents, that is,
+the ones without \fIAStarter\fR and \fIHStarter\fR. If we use a new
version of \fBhtsn-import\fR, supporting the new fields, the migration
is handled gracefully:
.P
it as unsupported so that offending documents can be removed. An example
is provided as test/xml/newsxml-multiple-sms.xml.
+.IP \[bu]
+\fIMLB_earlylineXML.dtd\fR
+
+Unlike earlylineXML.dtd, this document type has more than one <game>
+associated with each <date>. Moreover, each <date> has a bunch of
+<note> children that are supposed to be associated with the <game>s,
+but the document structure indicates no explicit relationship. For
+example,
+
+.nf
+<date>
+ <note>...</note>
+ <game>...</game>
+ <game>...</game>
+ <note>...</note>
+ <game>...</game>
+</date>
+.fi
+
+Here the first <note> is inferred to apply to the two <game>s that
+follow it, and the second <note> applies to the single <game> that
+follows it. But this is very fragile to parse. Instead, we use a hack
+to facilitate (un)pickling, and then drop the notes entirely during
+the database conversion.
+
+A similar workaround is implemented for Odds_XML.dtd.
+
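For comparison, the inference that is being avoided (each note applies to the games that follow it, until the next note) could be sketched roughly as below. The Child type and attachNotes function are hypothetical and exist only for this illustration; the importer does not do this, it drops the notes instead.

```haskell
-- Hypothetical model of the children of a <date> element.
data Child = Note String | Game String
  deriving (Eq, Show)

-- | Pair each game with the most recent note that precedes it,
--   or Nothing if no note has been seen yet.
attachNotes :: [Child] -> [(Maybe String, String)]
attachNotes = go Nothing
  where
    go _ [] = []
    -- A new note replaces whichever note was current.
    go _ (Note n : rest) = go (Just n) rest
    -- A game inherits the current note.
    go current (Game g : rest) = (current, g) : go current rest
```

Under this rule a note meant for a single game would also be attached to every later game up to the next note, which is one way the inference can go wrong.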
.IP \[bu]
\fIOdds_XML.dtd\fR
The \(dqtime zone\(dq is given as \(dqET\(dq, but unfortunately
\(dqET\(dq is not a valid time zone. It stands for \(dqEastern
Time\(dq, which can belong to either of two time zones, EST or EDT,
-based on the time of the year (i.e. whether or not daylight savings
-time is in effect). Since we can't tell from the timestamp, we always
-parse these as EST which is UTC-5. When daylight savings is in effect,
-they will be off by an hour.
+based on the time of the year (that is, whether or not daylight
+saving time is in effect). Since we can't tell from the timestamp, we
+always parse these as EST, which is UTC-5. When daylight saving time
+is in effect, they will be off by an hour.
Here's a list of the ones that may cause surprises:
They are also stored as UTC.
+.IP \[bu]
+\fIMLB_earlylineXML.dtd\fR
+
+See earlylineXML.dtd.
+
.IP \[bu]
\fIOdds_XML.dtd\fR
.IP \[bu]
jfilexml.dtd
.IP \[bu]
+MLB_earlylineXML.dtd
+.IP \[bu]
newsxml.dtd
.IP \[bu]
Odds_XML.dtd
schemagen/MLB_Matchup_XML/*.xml
schemagen/mlbonbasepctxml/*.xml
schemagen/MLBOPSXML/*.xml
+ schemagen/MLB_earlylineXML/*.xml
schemagen/MLB_Pitching_Appearances_Leaders/*.xml
schemagen/MLB_Pitching_Balks_Leaders/*.xml
schemagen/MLB_Pitching_CG_Leaders/*.xml
TSN.XML.Injuries
TSN.XML.InjuriesDetail
TSN.XML.JFile
+ TSN.XML.MLBEarlyLine
TSN.XML.News
TSN.XML.Odds
TSN.XML.ScheduleChanges
import qualified TSN.XML.InjuriesDetail as InjuriesDetail (
dtd,
pickle_message )
+import qualified TSN.XML.MLBEarlyLine as MLBEarlyLine (
+ dtd,
+ pickle_message )
import qualified TSN.XML.JFile as JFile ( dtd, pickle_message )
import qualified TSN.XML.News as News (
dtd,
| dtd == JFile.dtd = go JFile.pickle_message
+ | dtd == MLBEarlyLine.dtd =
+ go MLBEarlyLine.pickle_message
+
| dtd == News.dtd =
-- Some of the newsxml docs are busted in predictable ways.
-- We want them to "succeed" so that they're deleted.
--- /dev/null
+-- | Parse TSN XML for the DTD \"MLB_earlylineXML.dtd\". This module
+-- is unique (so far) in that it is almost entirely a subclass of
+-- another module, "TSN.XML.EarlyLine". The database representations
+-- should be almost identical, and the XML schema /could/ be
+-- similar, but instead, welcome to the jungle baby. Here are the
+-- differences:
+--
+-- * In earlylineXML.dtd, each \<date\> element contains exactly one
+-- game. In MLB_earlylineXML.dtd, they contain multiple games.
+--
+-- * As a result of the previous difference, the \<note\>s are no
+-- longer in one-to-one correspondence with the games. The
+-- \<note\> elements are thrown in beside the \<game\>s, and we're
+-- supposed to figure out to which \<game\>s they correspond
+-- ourselves. This is the same sort of nonsense going on with
+-- 'TSN.XML.Odds.OddsGameWithNotes'.
+--
+-- * The \<over_under\> element can be empty in
+-- MLB_earlylineXML.dtd (it can't in earlylineXML.dtd).
+--
+-- * Each home/away team in MLB_earlylineXML.dtd has a \<pitcher\>
+-- that isn't present in the regular earlylineXML.dtd.
+--
+-- * In earlylineXML.dtd, the home/away team lines are given as
+-- attributes on the \<teamH\> and \<teamA\> elements
+-- respectively. In MLB_earlylineXML.dtd, the lines can be found
+-- in \<line\> elements that are children of the \<teamH\> and
+-- \<teamA\> elements.
+--
+-- * In earlylineXML.dtd, the team names are given as text within
+-- the \<teamA\> and \<teamH\> elements. In MLB_earlylineXML.dtd,
+-- they are instead given as attributes on those respective
+-- elements.
+--
+-- Most of these difficulties have been worked around in
+-- "TSN.XML.EarlyLine", so this module could be kept somewhat boring.
+--
+module TSN.XML.MLBEarlyLine (
+ dtd,
+ mlb_early_line_tests,
+ module TSN.XML.EarlyLine -- This re-exports the EarlyLine and EarlyLineGame
+ -- constructors unnecessarily. Whatever.
+ )
+where
+
+-- System imports (needed only for tests)
+import Database.Groundhog (
+ countAll,
+ deleteAll,
+ migrate,
+ runMigration,
+ silentMigrationLogger )
+import Database.Groundhog.Generic ( runDbConn )
+import Database.Groundhog.Sqlite ( withSqliteConn )
+import Test.Tasty ( TestTree, testGroup )
+import Test.Tasty.HUnit ( (@?=), testCase )
+
+
+-- Local imports.
+import TSN.DbImport ( DbImport( dbimport ) )
+import TSN.XML.EarlyLine ( EarlyLine, EarlyLineGame, pickle_message )
+import Xml (
+ pickle_unpickle,
+ unpickleable,
+ unsafe_unpickle )
+
+
+-- | The DTD to which this module corresponds. Used to invoke dbimport.
+--
+dtd :: String
+dtd = "MLB_earlylineXML.dtd"
+
+
+
+--
+-- * Tasty Tests
+--
+
+-- | A list of all tests for this module.
+--
+mlb_early_line_tests :: TestTree
+mlb_early_line_tests =
+ testGroup
+ "MLBEarlyLine tests"
+ [ test_on_delete_cascade,
+ test_pickle_of_unpickle_is_identity,
+ test_unpickle_succeeds ]
+
+-- | If we unpickle something and then pickle it, we should wind up
+-- with the same thing we started with. WARNING: success of this
+-- test does not mean that unpickling succeeded.
+--
+test_pickle_of_unpickle_is_identity :: TestTree
+test_pickle_of_unpickle_is_identity =
+ testCase "pickle composed with unpickle is the identity" $ do
+ let path = "test/xml/MLB_earlylineXML.xml"
+ (expected, actual) <- pickle_unpickle pickle_message path
+ actual @?= expected
+
+
+
+-- | Make sure we can actually unpickle these things.
+--
+test_unpickle_succeeds :: TestTree
+test_unpickle_succeeds =
+ testCase "unpickling succeeds" $ do
+ let path = "test/xml/MLB_earlylineXML.xml"
+ actual <- unpickleable path pickle_message
+
+ let expected = True
+ actual @?= expected
+
+
+
+-- | Make sure everything gets deleted when we delete the top-level
+-- record.
+--
+test_on_delete_cascade :: TestTree
+test_on_delete_cascade =
+ testCase "deleting (MLB) early_lines deletes its children" $ do
+ let path = "test/xml/MLB_earlylineXML.xml"
+ results <- unsafe_unpickle path pickle_message
+ let a = undefined :: EarlyLine
+ let b = undefined :: EarlyLineGame
+
+ actual <- withSqliteConn ":memory:" $ runDbConn $ do
+ runMigration silentMigrationLogger $ do
+ migrate a
+ migrate b
+ _ <- dbimport results
+ deleteAll a
+ count_a <- countAll a
+ count_b <- countAll b
+ return $ sum [count_a, count_b]
+ let expected = 0
+ actual @?= expected
import TSN.XML.Injuries ( injuries_tests )
import TSN.XML.InjuriesDetail ( injuries_detail_tests )
import TSN.XML.JFile ( jfile_tests )
+import TSN.XML.MLBEarlyLine ( mlb_early_line_tests )
import TSN.XML.News ( news_tests )
import TSN.XML.Odds ( odds_tests )
import TSN.XML.ScheduleChanges ( schedule_changes_tests )
injuries_tests,
injuries_detail_tests,
jfile_tests,
+ mlb_early_line_tests,
news_tests,
odds_tests,
pickler_tests,
# and a newsxml that aren't really supposed to import.
find ./test/xml -maxdepth 1 -name '*.xml' | wc -l
>>>
-28
+29
>>>= 0
# Run the imports again; we should get complaints about the duplicate
-# xml_file_ids. There are 2 errors for each violation, so we expect 2*24
+# xml_file_ids. There are 2 errors for each violation, so we expect 2*25
# occurrences of the string 'ERROR'.
./dist/build/htsn-import/htsn-import -c 'shelltest.sqlite3' test/xml/*.xml 2>&1 | grep ERROR | wc -l
>>>
-48
+50
>>>= 0
# Finally, clean up after ourselves.
--- /dev/null
+<!ELEMENT XML_File_ID (#PCDATA)>
+<!ELEMENT heading (#PCDATA)>
+<!ELEMENT category (#PCDATA)>
+<!ELEMENT sport (#PCDATA)>
+<!ELEMENT title (#PCDATA)>
+<!ELEMENT note (#PCDATA)>
+<!ELEMENT time (#PCDATA)>
+<!ELEMENT pitcher (#PCDATA)>
+<!ELEMENT line (#PCDATA)>
+<!ELEMENT teamA ( ( pitcher, line ) )>
+<!ELEMENT teamH ( ( pitcher, line ) )>
+<!ELEMENT over_under (#PCDATA)>
+<!ELEMENT game ( ( time, teamA, teamH, over_under ) )>
+<!ELEMENT date ( ( note | game )+ )>
+<!ELEMENT time_stamp (#PCDATA)>
+<!ELEMENT message ( ( XML_File_ID, heading, category, sport, title, date, time_stamp ) )>
+
+<!ATTLIST teamA rotation CDATA #REQUIRED>
+<!ATTLIST teamA name CDATA #REQUIRED>
+<!ATTLIST teamH rotation CDATA #REQUIRED>
+<!ATTLIST teamH name CDATA #REQUIRED>
+<!ATTLIST date value CDATA #REQUIRED>
--- /dev/null
+<?xml version="1.0" standalone="no" ?>\r<!DOCTYPE message PUBLIC "-//TSN//DTD Odds 1.0/EN" "MLB_earlylineXML.dtd">\r<message>\r<XML_File_ID>21161927</XML_File_ID>\r<heading>AAO;MLB-EARLY-LINE</heading>\r<category>Odds</category>\r<sport>MLB</sport>\r<title>Major League Baseball Overnight Line</title>\r<date value="SUNDAY, MAY 25TH (05/25/2014)">\r<note>National League</note>\r<game>\r<time>1:10</time>\r<teamA rotation="951" name="MIL">\r<pitcher>J.Nelson</pitcher>\r<line></line>\r</teamA>\r<teamH rotation="952" name="MIA">\r<pitcher>R.Wolf</pitcher>\r<line>-105</line>\r</teamH>\r<over_under>8o</over_under>\r</game>\r<note>Game one of doubleheader</note>\r<game>\r<time>1:10</time>\r<teamA rotation="953" name="ARI">\r<!-- Manually removed pitcher -->\r<pitcher>B.Arroyo</pitcher>\r<line></line>\r</teamA>\r<teamH rotation="954" name="NYM">\r<pitcher>R.Montero</pitcher>\r<line>-105</line>\r</teamH>\r<over_under>7.5u</over_under>\r</game>\r<game>\r<time>1:35</time>\r<teamA rotation="955" name="LOS">\r<pitcher>J.Beckett</pitcher>\r<line>-110</line>\r</teamA>\r<teamH rotation="956" name="PHI">\r<pitcher>AJ.Burnett</pitcher>\r<line></line>\r</teamH>\r<!-- Manually removed over_under -->\r<over_under></over_under>\r</game>\r<game>\r<time>1:35</time>\r<teamA rotation="957" name="WAS">\r<pitcher>D.Fister</pitcher>\r<line></line>\r</teamA>\r<teamH rotation="958" name="PIT">\r<pitcher>F.Liriano</pitcher>\r<line>-115</line>\r</teamH>\r<over_under>7p</over_under>\r</game>\r<game>\r<time>4:10</time>\r<teamA rotation="959" name="CHC">\r<pitcher>J.Hammel</pitcher>\r<line></line>\r</teamA>\r<teamH rotation="960" name="SDP">\r<pitcher>I.Kennedy</pitcher>\r<line>-125</line>\r</teamH>\r<over_under>6.5p</over_under>\r</game>\r<game>\r<time>5:10</time>\r<teamA rotation="961" name="COL">\r<pitcher>F.Morales</pitcher>\r<line></line>\r</teamA>\r<teamH rotation="962" 
name="ATL">\r<pitcher>J.Teheran</pitcher>\r<line>-160</line>\r</teamH>\r<over_under>7.5p</over_under>\r</game>\r<game>\r<time>8:05</time>\r<teamA rotation="963" name="STL">\r<pitcher>A.Wainwright</pitcher>\r<line>-140</line>\r</teamA>\r<teamH rotation="964" name="CIN">\r<pitcher>M.Leake</pitcher>\r<line></line>\r</teamH>\r<over_under>6.5o</over_under>\r</game>\r<note>American League</note>\r<game>\r<time>1:05</time>\r<teamA rotation="965" name="TEX">\r<pitcher>C.Lewis</pitcher>\r<line></line>\r</teamA>\r<teamH rotation="966" name="DET">\r<pitcher>J.Verlander</pitcher>\r<line>-180</line>\r</teamH>\r<over_under>8.5u</over_under>\r</game>\r<game>\r<time>1:05</time>\r<teamA rotation="967" name="OAK">\r<pitcher>D.Pomeranz</pitcher>\r<line>-125</line>\r</teamA>\r<teamH rotation="968" name="TOR">\r<pitcher>J.Happ</pitcher>\r<line></line>\r</teamH>\r<over_under>9o</over_under>\r</game>\r<game>\r<time>1:35</time>\r<teamA rotation="969" name="CLE">\r<pitcher>T.Bauer</pitcher>\r<line></line>\r</teamA>\r<teamH rotation="970" name="BAL">\r<pitcher>M.Gonzalez</pitcher>\r<line>-120</line>\r</teamH>\r<over_under>9p</over_under>\r</game>\r<game>\r<time>1:40</time>\r<teamA rotation="971" name="BOS">\r<pitcher>B.Workman</pitcher>\r<line></line>\r</teamA>\r<teamH rotation="972" name="TAM">\r<pitcher>J.Odorizzi</pitcher>\r<line>-120</line>\r</teamH>\r<over_under>8p</over_under>\r</game>\r<game>\r<time>2:10</time>\r<teamA rotation="973" name="NYY">\r<pitcher>M.Tanaka</pitcher>\r<line>-165</line>\r</teamA>\r<teamH rotation="974" name="CWS">\r<pitcher>A.Rienzo</pitcher>\r<line></line>\r</teamH>\r<over_under>7.5o</over_under>\r</game>\r<game>\r<time>3:35</time>\r<teamA rotation="975" name="KAN">\r<pitcher>J.Vargas</pitcher>\r<line></line>\r</teamA>\r<teamH rotation="976" name="ANA">\r<pitcher>G.Richards</pitcher>\r<line>-155</line>\r</teamH>\r<over_under>8u</over_under>\r</game>\r<game>\r<time>4:10</time>\r<teamA rotation="977" 
name="HOU">\r<pitcher>D.Keuchel</pitcher>\r<line></line>\r</teamA>\r<teamH rotation="978" name="SEA">\r<pitcher>H.Iwakuma</pitcher>\r<line>-165</line>\r</teamH>\r<over_under>6.5o</over_under>\r</game>\r<note>Inter League</note>\r<game>\r<time>4:05</time>\r<teamA rotation="979" name="MIN">\r<pitcher>R.Nolasco</pitcher>\r<line></line>\r</teamA>\r<teamH rotation="980" name="SFG">\r<pitcher>M.Bumgarner</pitcher>\r<line>-175</line>\r</teamH>\r<over_under>7.5p</over_under>\r</game>\r<note>Write-in game - Game two of doubleheader</note>\r<game>\r<time>4:40</time>\r<teamA rotation="981" name="ARI">\r<pitcher>Z.Spruill</pitcher>\r<line></line>\r</teamA>\r<teamH rotation="982" name="NYM">\r<pitcher>D.Matsuzaka</pitcher>\r<line>off</line>\r</teamH>\r<over_under>off</over_under>\r</game>\r</date>\r<time_stamp> May 24, 2014, at 03:05 PM ET </time_stamp>\r</message>\r
\ No newline at end of file