Hello again :)
I've been trying to scrub part of the following line to use as the productiondate of the show.
The line is:
<p class="bubble-programme-description description">...Of Reason: Renée Zellweger is back, but still torn between steady Colin Firth and slimy Hugh Grant. Will it be wedding cake or comfort ice cream for bumbling Bridget? (2004)(104 mins)</p>
I want to scrub the 2004 part. This is my regex, which is not working:
temp_1.scrub {regex(debug)||<p class=\"bubble-programme-description description\">\s*?\d{4}\s||}
productiondate.modify {addstart|'temp_1'}
The log says "not match found".
Could anyone give me some guidance, guys?
Thanks.
2 answers:
1. http://regexpal.com/
2. <p class=\"bubble-programme-description description\">[^>]*?\([12]\d{3}\)
Things to know:
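\s matches whitespace characters only, so \s*? cannot skip over the description text that sits between the tag and the year (so yours could never work)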
[^>] means all values except >
*? means, take 0 or more values (the smallest amount possible)
[12]\d{3} means, any 4 digit number starting with 1 or 2
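To make the difference concrete, here is a small sketch that runs both patterns against the sample line from the question. It uses Python's re module purely for illustration; WebGrab+Plus uses its own regex engine, so treat this as an approximation, although these particular constructs should behave the same way. The variable names are made up for the example, and the siteini escaping of the quotes (\") is dropped because it is not needed here.

```python
import re

# Sample line copied from the question.
html = ('<p class="bubble-programme-description description">...Of Reason: '
        'Renée Zellweger is back, but still torn between steady Colin Firth '
        'and slimy Hugh Grant. Will it be wedding cake or comfort ice cream '
        'for bumbling Bridget? (2004)(104 mins)</p>')

# Original attempt: only whitespace is allowed between the tag and the digits.
original = r'<p class="bubble-programme-description description">\s*?\d{4}\s'

# Suggested pattern: any run of non-'>' characters, then a year in parentheses.
suggested = r'<p class="bubble-programme-description description">[^>]*?\([12]\d{3}\)'

original_match = re.search(original, html)
suggested_match = re.search(suggested, html)

print(original_match)                  # None: the text after the tag is not whitespace
print(suggested_match.group(0)[-6:])   # (2004)
```

Note that the suggested pattern has no capture group yet, so the match is the whole span from the opening tag up to and including "(2004)".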
Thanks again, Francis, for always helping and giving your time.
1. I use the mentioned site, but some things slip past my limited knowledge :) I'm learning, though.
2. The expression you gave me doesn't get a match. Here's an example from the log file:
[ Debug ] No Production date found in:
[ Debug ] Debugging information SiteIni
[ Debug ] Element: TEMP_3
[ Debug ] html source written to : C:\ProgramData\ServerCare\WebGrab\html.source.htm
[ Debug ] scrub strings:
[ Debug ] type & arguments : regex(debug)
[ Debug ] regex_expression : <p class=\"bubble-programme-description description\">[^>]*?\([12]\d{3}\)
[ Debug ] !! No match group definition () in :<p class=\"bubble-programme-description description\">[^>]*?\([12]\d{3}\)
[ Debug ] Found 1 top level un-grouped match(es):
[ Debug ] <p class="bubble-programme-description description">Paroled US Army ranger Nicolas Cage becomes trapped on a hijacked prison plane. OTT action with Steve Buscemi and John Malkovich among the cons. (1997)
[ Debug ] Element Value(s) :
[ Debug ] ----------begin--element----------
<p class="bubble-programme-description description">Paroled US Army ranger Nicolas Cage becomes trapped on a hijacked prison plane. OTT action with Steve Buscemi and John Malkovich among the cons. (1997)
[ Debug ] ----------end----element----------
It seems to stop once the 4 digits are found rather than scrubbing just the 4 digits (the original has about 5 more words after these 4 digits).
4. I tried this regex and it worked:
productiondate.scrub{regex(debug)||<p class=\"bubble-programme-description description\">.+?(\d{4})||}
What do you think? Is it robust enough?
Thanks again.
2. For me it seems to work correctly. Because you did not define a group in your regex, it grabs the whole match. So just define a group around the year, and WG++ will return only the content of that group. You can tell because WG++ warns you about it with:
!! No match group definition () in
So just change
4. Well, I don't know if there are other blocks after the description <p>. If so, it is risky because it could grab something like
goto(2100)
that occurs after the description <p>
Also yours will catch "this show is about the 2000 people ..."
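Both points can be sketched in a few lines. This again uses Python's re module only as a stand-in for the WebGrab+Plus engine, so it is an illustration under that assumption rather than a siteini recipe. The grouped pattern is one way to read "define a group around the year", and the second description is a hypothetical line built from the "2000 people" example; all names are invented for the sketch.

```python
import re

TAG = '<p class="bubble-programme-description description">'

# Year must sit inside literal parentheses; the inner () is the capture group.
strict = re.compile(re.escape(TAG) + r'[^>]*?\(([12]\d{3})\)')

# Looser pattern from the thread: first run of four digits after the tag.
loose = re.compile(re.escape(TAG) + r'.+?(\d{4})')

# Description as it appears in the debug log above.
real = TAG + ('Paroled US Army ranger Nicolas Cage becomes trapped on a hijacked '
              'prison plane. OTT action with Steve Buscemi and John Malkovich '
              'among the cons. (1997)')

# Hypothetical description with a stray four-digit number before the year.
tricky = TAG + 'This show is about the 2000 people who moved away. (2011)</p>'

strict_real = strict.search(real).group(1)
loose_real = loose.search(real).group(1)
strict_tricky = strict.search(tricky).group(1)
loose_tricky = loose.search(tricky).group(1)

print(strict_real, loose_real)      # 1997 1997
print(strict_tricky, loose_tricky)  # 2011 2000
```

With a group in place only the year is returned, and requiring the parentheses keeps a number in the running text from being mistaken for the production date.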
Thanks again, Francis, as usual giving a hand to everybody here :)
Your modified regex is of course better. At least it will not catch wrong numbers like in the example you gave.
There are no blocks after </p>, but I am going to use your regex anyway; as usual, it is better than what I came up with :)
Thanks again.