I will be grateful for any help.
I am trying to scrape the show title from the text in the body of the webpage. The Webgrab log has ...
scrub strings:
type & arguments : single(exclude="<a href=[*]>" debug.4)
blockstart (bs): <header>
elementstart (es): <h1 itemprop="name">
elementend (ee): </a>
blockend (be): </h1>
Separated html block(s), number of blocks = 1
----------begin--block----------
<h1 itemprop="name"><a href="http://213.126.50.203/programme/ypp/sons-of-anarchy">Sons of Anarchy</a></h1>
----------end----block----------
Separated Element(s) (es) applied
----------begin--element----------
<a href="http://213.126.50.203/programme/ypp/sons-of-anarchy">Sons of Anarchy</a></h1>
----------end----element----------
Separated Element(s) (es) and (ee) applied of block 0
----------begin--element----------
<a href="http://213.126.50.203/programme/ypp/sons-of-anarchy">Sons of Anarchy
----------end----element----------
Argument -exclude- , string value = "<a href=[*]>" debug.4
Separated Element(s) arguments include and exclude applied of block 0
----------begin--element----------
<a href="http://213.126.50.203/programme/ypp/sons-of-anarchy">Sons of Anarchy
----------end----element----------
Elements , type single applied
----------begin--element----------
<a href="http://213.126.50.203/programme/ypp/sons-of-anarchy">Sons of Anarchy
----------end----element----------
It appears to ignore the "exclude". I have tried the "exclude" without the wildcard ... single(exclude="<a href="http://213.126.50.203/ debug.4) ... and the "exclude" is still ignored.
What do I need to do to get "exclude" to work?
Please show me an example for the wildcard in the "exclude"?
Many thanks.
Graham
Never mind.
I am getting the result that I need with ...
title.scrub {single|<header>|<h1 itemprop="name">|</a>|</h1>}
and
title.modify {remove(type=regex)|"(<.*>)"}
Thanks
Graham
FYI:
The regex for removing html tags, is
Just a little bit safer, because your regex will also remove all of <a ....>the title</a>.
But maybe in your own case, this is not an issue.
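To illustrate the point about greedy matching, here is a minimal Python sketch. It applies the "(<.*>)" pattern from the title.modify line to the element shown in the Webgrab log, and to a hypothetical variant that still has its closing </a>. The tag-only pattern `<[^>]*>` is an assumption on my part (a common way to match one tag at a time), not necessarily the regex suggested above. Python's `re` module is also not the engine WebGrab+Plus uses, so treat this only as a demonstration of greedy versus tag-bounded matching.

```python
import re

# Element exactly as it appears in the Webgrab log (it ends before </a>).
scraped = '<a href="http://213.126.50.203/programme/ypp/sons-of-anarchy">Sons of Anarchy'

# Hypothetical variant, for illustration only: the closing tag is still present.
with_close = scraped + "</a>"

GREEDY = re.compile(r"(<.*>)")     # pattern from the title.modify line
TAG_ONLY = re.compile(r"<[^>]*>")  # assumed alternative: matches one tag at a time


def strip_with(pattern, text):
    """Remove every match of pattern from text."""
    return pattern.sub("", text)


# Only one tag in the text, so the greedy pattern leaves the title intact.
print(repr(strip_with(GREEDY, scraped)))

# Greedy match runs from the first '<' to the last '>', taking the title with it.
print(repr(strip_with(GREEDY, with_close)))

# Tag-bounded match removes each tag separately and keeps the title.
print(repr(strip_with(TAG_ONLY, with_close)))
```

With the scrub ending at </a>, the greedy pattern happens to be safe, which is why the workaround gives the right title here. It would stop being safe as soon as a second tag followed the title text.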
Thanks for the regex. I can see why yours is better than mine.
For anyone who stumbles upon this post while trying to use regex, I found a couple of helpful debugging sites at ...
http://www.regexr.com/
https://regex101.com/
I have been looking at this because I see a couple of issues with the stock radiotimes.com.ini.
This morning, the stock radiotimes.com.ini ( * @Revision 9 - [03/12/2013] ) produced ...
<title lang="en">Eddie Stobart: Trucks and Trailers 26 May 2015 Spike!??! Series 2 - Episode 8</title>
and
<sub-title lang="en">. A Horse Walks into a Bar</sub-title>
The leading dot space in sub-title was discussed at
http://www.webgrabplus.com/comment/1627#comment-1627
but may not have found its way into the .ini.
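For reference, a rough Python sketch of the kind of pattern that would strip the leading dot and space from that sub-title. The pattern `^\.\s*` and the helper name `clean_subtitle` are my own assumptions for illustration; this is not the fix from the linked comment, and it is not siteini syntax.

```python
import re

# Assumed pattern: a literal dot at the very start, followed by any whitespace.
LEADING_DOT = re.compile(r"^\.\s*")


def clean_subtitle(value):
    """Strip a leading '. ' from a sub-title; other values pass through unchanged."""
    return LEADING_DOT.sub("", value)


print(clean_subtitle(". A Horse Walks into a Bar"))
```

Because the pattern is anchored at the start, a dot elsewhere in the sub-title is left alone.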
My effort in the posts above was a workaround for the ugly values in <title> from the index page.
Thanks for your help.