From: Michael Orlitzky Date: Mon, 7 Jul 2014 04:41:39 +0000 (-0400) Subject: Add a has_only_single_sms function to TSN.XML.News. X-Git-Tag: 0.0.6~5 X-Git-Url: https://gitweb.michael.orlitzky.com/?a=commitdiff_plain;h=9b28d5ed0e1570f0890eae1ee01721688c20e266;p=dead%2Fhtsn-import.git Add a has_only_single_sms function to TSN.XML.News. Report multiple SMS in newsxml as unsupported in Main. Add tests for the multiple SMS elements in newsxml. Update the man page mention of the double SMS. --- diff --git a/doc/man1/htsn-import.1 b/doc/man1/htsn-import.1 index 07e8c32..616f7e8 100644 --- a/doc/man1/htsn-import.1 +++ b/doc/man1/htsn-import.1 @@ -280,7 +280,9 @@ an empty followed a non-empty one: .nf -We don't parse this case at the moment. +We don't parse this case at the moment, but we do recognize it and report +it as unsupported so that offending documents can be removed. An example +is provided as test/xml/newsxml-multiple-sms.xml. .IP \[bu] \fIOdds_XML.dtd\fR diff --git a/src/Main.hs b/src/Main.hs index cb4bdff..492f0ce 100644 --- a/src/Main.hs +++ b/src/Main.hs @@ -58,7 +58,10 @@ import qualified TSN.XML.InjuriesDetail as InjuriesDetail ( dtd, pickle_message ) import qualified TSN.XML.JFile as JFile ( dtd, pickle_message ) -import qualified TSN.XML.News as News ( dtd, pickle_message ) +import qualified TSN.XML.News as News ( + dtd, + has_only_single_sms, + pickle_message ) import qualified TSN.XML.Odds as Odds ( dtd, pickle_message ) import qualified TSN.XML.ScheduleChanges as ScheduleChanges ( dtd, @@ -198,8 +201,16 @@ import_file cfg path = do | dtd == JFile.dtd = go JFile.pickle_message - | dtd == News.dtd = go News.pickle_message - + | dtd == News.dtd = + -- Some of the newsxml docs are busted in predictable ways. + -- We want them to "succeed" so that they're deleted. + -- We already know we can't parse them. + if News.has_only_single_sms xml + then go News.pickle_message + else do + let msg = "Unsupported newsxml.dtd with multiple SMS " ++ + "(" ++ path ++ ")" + return $ ImportUnsupported msg | dtd == Odds.dtd = go Odds.pickle_message | dtd == ScheduleChanges.dtd = go ScheduleChanges.pickle_message diff --git a/src/TSN/XML/News.hs b/src/TSN/XML/News.hs index 3f9fef5..10c5195 100644 --- a/src/TSN/XML/News.hs +++ b/src/TSN/XML/News.hs @@ -11,6 +11,7 @@ -- module TSN.XML.News ( dtd, + has_only_single_sms, pickle_message, -- * Tests news_tests, @@ -45,6 +46,16 @@ import Test.Tasty ( TestTree, testGroup ) import Test.Tasty.HUnit ( (@?=), testCase ) import Text.XML.HXT.Core ( PU, + XmlTree, + (/>), + (>>>), + addNav, + descendantAxis, + filterAxis, + followingSiblingAxis, + hasName, + remNav, + runLA, xp13Tuple, xpAttr, xpElem, @@ -69,6 +80,7 @@ import Xml ( ToDb(..), pickle_unpickle, unpickleable, + unsafe_read_document, unsafe_unpickle ) @@ -202,8 +214,37 @@ data News_Location = News_Location + +-- | Some newsxml documents contain two \ elements in a row, +-- violating the DTD. The second one has always been empty, but it's +-- irrelevant: we can't parse these, and would like to detect them +-- in order to report the fact that the busted document is +-- unsupported. +-- +-- This function detects whether two \ elements appear in a +-- row, as siblings. +-- +has_only_single_sms :: XmlTree -> Bool +has_only_single_sms xmltree = + case elements of + [] -> True + _ -> False + where + parse :: XmlTree -> [XmlTree] + parse = runLA $ hasName "/" + /> hasName "message" + >>> addNav + >>> descendantAxis + >>> filterAxis (hasName "SMS") + >>> followingSiblingAxis + >>> remNav + >>> hasName "SMS" + + elements = parse xmltree + + -- --- Database code +-- * Database code -- -- | Define 'dbmigrate' and 'dbimport' for 'Message's. The import is @@ -407,7 +448,8 @@ news_tests = [ test_news_fields_have_correct_names, test_on_delete_cascade, test_pickle_of_unpickle_is_identity, - test_unpickle_succeeds ] + test_unpickle_succeeds, + test_sms_detected_correctly ] -- | Make sure our codegen is producing the correct database names. @@ -506,3 +548,22 @@ test_on_delete_cascade = testGroup "cascading delete tests" count_e <- countAll e return $ count_a + count_b + count_c + count_d + count_e actual @?= expected + + +-- | We want to make sure the single-SMS documents and the multi-SMS +-- documents are identified correctly. +-- +test_sms_detected_correctly :: TestTree +test_sms_detected_correctly = + testGroup "newsxml SMS count determined correctly" + [ check "test/xml/newsxml.xml" + "single SMS detected correctly" + True, + check "test/xml/newsxml-multiple-sms.xml" + "multiple SMS detected correctly" + False ] + where + check path desc expected = testCase desc $ do + xmltree <- unsafe_read_document path + let actual = has_only_single_sms xmltree + actual @?= expected diff --git a/test/shell/import-duplicates.test b/test/shell/import-duplicates.test index 99b2070..be3dcf5 100644 --- a/test/shell/import-duplicates.test +++ b/test/shell/import-duplicates.test @@ -12,11 +12,11 @@ rm -f shelltest.sqlite3 >>>= 0 # We note the number of XML files that we have. There's one extra -# Heartbeat.xml that doesn't really count, and 2 weatherxml that -# aren't really supposed to import. +# Heartbeat.xml that doesn't really count. There are also 2 weatherxml +# and a newsxml that aren't really supposed to import. find ./test/xml -maxdepth 1 -name '*.xml' | wc -l >>> -25 +26 >>>= 0 # Run the imports again; we should get complaints about the duplicate diff --git a/test/shell/news-double-sms-unsupported.test b/test/shell/news-double-sms-unsupported.test new file mode 100644 index 0000000..936338c --- /dev/null +++ b/test/shell/news-double-sms-unsupported.test @@ -0,0 +1,11 @@ +# +# More than one in a newsxml document is unsupported, but we +# don't want to consider it an error when we encounter one, because +# we want to delete it. +# +# The real output contains escape characters so we use a regexp to get +# a pretty good idea of what it says. + +./dist/build/htsn-import/htsn-import test/xml/newsxml-multiple-sms.xml +>>> /Unsupported newsxml\.dtd with multiple SMS \(test\/xml\/newsxml-multiple-sms.xml\)/ +>>>= 0 diff --git a/test/xml/newsxml-multiple-sms.xml b/test/xml/newsxml-multiple-sms.xml new file mode 100644 index 0000000..fe9d1c6 --- /dev/null +++ b/test/xml/newsxml-multiple-sms.xml @@ -0,0 +1 @@ + 21232353 ADN;RUSHTON-COLUMN-NYI 4699713 News NHL /nhl/news/ADN4699713.htm NYI New York NY USA Odd Man Rush: Snow under pressure to improve Isles quickly By Michael Rushton, NHL Contributing Editor Philadelphia, PA (SportsNetwork.com) - The moves that New York Islanders general manager Garth Snow has made this offseason are just as much about 2015 as they are about this upcoming season.

More than a few thought the Islanders made a mistake in opting to keep their 2014 draft pick and instead send their selection in next year's draft to the Buffalo Sabres to complete October's trade for Thomas Vanek.

Vanek, by the way, recently wrapped his season as a member of the playoff- participating Montreal Canadiens after the Islanders were forced to deal the winger to try to recoup some of the loss when it became apparent they were not going to the postseason.

In keeping his upcoming selection, Snow knows he will be picking fifth overall and will most likely be getting a talented player who can help in the future. Had he chosen to keep his 2015 pick, Snow would not have had any idea where he was picking in what is expected to be a deep draft headlined by a pair of top selections in Connor McDavid and Jack Eichel.

Obviously, Snow has to think his team will improve into a later pick in 2015 or else he isn't doing his job.

Snow can take this gamble because of the pending return of captain John Tavares, who had notched 24 goals and 42 points in 59 games before suffering a season-ending left knee injury in February while playing for Canada in the 2014 Winter Olympics in Sochi, Russia.

Tavares helped the Islanders snap a five-season playoff drought in 2013, but New York was last in the newly-formed Metropolitan Division heading into the Olympic break and already 12 points out of a wild card spot.

In other words, Tavares needs help and Long Island isn't the most attractive option for free agents right now. That could change when the club relocates to Brooklyn for the 2015-16 season, but Snow has had to get creative.

He has done so through trades designed to get a jump on free agency, acquiring the rights of goaltender Jaroslav Halak from the Washington Capitals back on May 1 and then doing the same for veteran defenseman Dan Boyle from the San Jose Sharks on Thursday.

Halak has already forgone free agency, inking a four-year deal with the Islanders on May 22 to stabilize that important position, while New York now has exclusive negotiating rights with Boyle up until free agency on July 1.

Should they sign the 37-year-old, it will cost the Islanders a fourth-round pick heading the Sharks' way. If they do not sign him, a fifth-round selection is the cost of the gamble.

"We don't know what the future brings with Dan, but we wanted to get that exclusive negotiating window with him," Snow told Newsday. "He's a defenseman we think can help our team."

Several New York-based newspapers cited Boyle's agent as saying his client is looking for a two-year deal and according to capgeek.com, Snow has nearly $28 million of cap space to play with, so signing Boyle shouldn't be a problem.

And it is a move Snow needs to make. The Islanders are deep with defensive prospects and having a veteran like Boyle around for two years -- heck, he might even be worth three at the right price -- would be invaluable for the team's future.

That's the path Snow has decided to take in getting both Halak and Boyle while opting to part ways with a future draft pick. Signing Boyle would be sticking that plan.

June 6, 2014, at 03:08 PM ET
\ No newline at end of file