Freesat.co.uk has had a major site change and I'm looking into creating a siteini for the changes. This will be the first one I've tried to create and I'm already a bit stuck -_-
When looking into the site, the API starts off like this https://www.freesat.co.uk/tv-guide/api/1/?channel=21000 and goes from 1 to 15 (https://www.freesat.co.uk/tv-guide/api/15/?channel=21000), which lists all the basic information for that channel. From looking at other siteinis and the manual I understand how to change the channel part of the URL, but I'm unsure how I can increment the number after api to go from 1 all the way up to 15.
Also, from that API request I need to grab the "evtId": 11823, which then allows access to the details page, which has a URL of this format: https://www.freesat.co.uk/whats/showcase/api/channel/21000/episode/11823
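To make the two URL shapes concrete, here is a minimal sketch in Python (not siteini syntax) that builds the index URLs for counters 1 to 15 and the details URL from a channel and an evtId. The function names are hypothetical and only the URL formats come from the post above.

```python
BASE = "https://www.freesat.co.uk"


def index_url(counter: int, channel: int) -> str:
    """Index page: basic listing for one channel and one counter value."""
    return f"{BASE}/tv-guide/api/{counter}/?channel={channel}"


def detail_url(channel: int, evt_id: int) -> str:
    """Details page for a single show, addressed by channel and evtId."""
    return f"{BASE}/whats/showcase/api/channel/{channel}/episode/{evt_id}"


# One index URL per counter value, 1 through 15 as described in the post.
index_urls = [index_url(counter, 21000) for counter in range(1, 16)]

if __name__ == "__main__":
    print(index_urls[0])
    print(index_urls[-1])
    print(detail_url(21000, 11823))
```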
Thank you,
I couldn't get the last bit to work for urlshow; the server returns a 500 error. I've tried adding debug to urlshow but the debug is not returning anything.
I then tried a different way to get the parameter for the evtId but I get the same error, and the debug again does not show what the URL looks like. Am I using the debug correctly?
This is what I have so far:
site {url=freesat.co.uk|timezone=GMT|maxdays=7|cultureinfo=en-GB|charset=UTF-8|titlematchfactor=90}
site {episodesystem=onscreen}
url_index{url(debug)|https://www.freesat.co.uk/tv-guide/api/|urldate|/?channel=|channel|}
urldate.format {daycounter|0}
index_variable_element.modify {set|'config_site_id'}
index_temp_2.scrub {regex||"evtId": \d+||}
index_temp_2.modify {remove|('"evtId": ')}
index_urlshow {url(debug)|https://www.freesat.co.uk/whats/showcase/api/channel/'index_variable_element'/episode/'index_temp_2'}
The first URL works fine for finding the channel with the basic details, but I can't get the index_urlshow to work.
[ Info ] ( 1/1 ) FREESAT.CO.UK -- chan. (xmltv_id=BBC 1) -- mode Force
[ Debug ] debugging information siteini; urlindex builder
[ Debug ] siteini entry :
[ Debug ] urldate format type: daycounter, value: |0
[ Debug ] https://www.freesat.co.uk/tv-guide/api/|urldate|/?channel=|channel
[ Debug ] url_index created:
[ Debug ] https://www.freesat.co.uk/tv-guide/api/0/?channel=505
[ Info ]
[ Info ] Summary for update of BBC 1
[ Info ] no changes, no update necessary !
[ Info ] unchanged shows inspected 0
[ Info ] total after update 0
[ Debug ]
[ Debug ] 0 shows in 1 channels
[ Debug ] 0 updated shows
[ Debug ] 0 new shows added
[ Info ]
[ Info ]
[ ] Job finished at 23/04/2018 21:37:22 done in 1s
[ Debug ] statistics upload error: The remote server returned an error: (500) Internal Server Error.
Thank you, that's worked now. I added index_showsplit.scrub {regex||\"hasTstv".+?\}||} before the index_urlshow.
I appreciate your help these last few days
Hi, I did, but I was reading it in stages as I got onto each part. I've spent a lot more time reading the manual now and have got much further with the siteini. I have it mostly working but have come across one issue which I'm unsure how to resolve.
I’m having issues with the urlshow. It works most of the time, but in the url_index response the elements change position when refreshed. So where I scrub "evtId": in the JSON, sometimes it appears at the front, meaning it is delimited with a comma, but sometimes it appears at the end, meaning it does not have a comma. So depending on the position of "evtId": when the JSON is loaded, it sometimes works and other times does not.
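A minimal sketch in Python of why capturing only the digits sidesteps the comma problem: the pattern "evtId":\s*(\d+) never looks at what follows the number, so it matches whether the key comes first or last. The two sample strings are made up for illustration; only the key name and the pattern come from this thread.

```python
import re

# Capture just the digits after the key; whatever follows (comma or brace) is ignored.
EVT_ID = re.compile(r'"evtId":\s*(\d+)')


def scrub_evt_id(show_json):
    """Return the evtId as an int, or None when the key is absent."""
    match = EVT_ID.search(show_json)
    return int(match.group(1)) if match else None


# Hypothetical show fragments with the key in different positions.
key_first = '{"evtId": 11823, "name": "Example", "duration": 3600}'
key_last = '{"name": "Example", "duration": 3600, "evtId": 11823}'

if __name__ == "__main__":
    print(scrub_evt_id(key_first), scrub_evt_id(key_last))
```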
Thank you for explaining that, the urlshow is creating the correct url now.
I’m trying to debug the urlshow. When I had the debug on the URL, that was perfectly fine and I was able to load the page in Chrome,
but I’m unable to get the details page to come out in the html file; I only get the basic details from the url_index.
This is what my code looks like for getting the details page
index_variable_element.modify {set|'config_site_id'}
index_temp_2.scrub {regex(debug)||"evtId":\s*(\d+)||}
index_urlshow.modify {addstart('index_temp_2' not "")|https://www.freesat.co.uk/whats/showcase/api/channel/'index_variable_element'/episode/'index_temp_2'}
temp_1.scrub {single(debug)||||}
When reading the manual, it stated to add temp_1.scrub {single(debug)||||} to output all the details within the details page to the html file, and to change the timespan in the config file to a set time, so I’ve gone ahead and done that as well: <timespan>0 23:30</timespan>
But the html contains only the basic details and not the details from the details page. Have I understood that correctly, or can you see something that I’m doing wrong?
[ Debug ] Debugging information SiteIni
[ Debug ] Element: INDEX_URLSHOW
[ Debug ] Modify
[ Debug ] command & arguments : addstart('index_temp_2' not "")(debug)
[ Debug ] Expression-1 : https://www.freesat.co.uk/whats/showcase/api/channel/'index_variable_element'/episode/'index_temp_2'
[ Debug ] evaluation condition:
[ Debug ] 64262
[ Debug ] = (equals)
[ Debug ] Result = False
[ Debug ] 'not' condition : Result = True
[ Debug ] Element value before operation:
[ Debug ] empty element before the operation!
[ Debug ] String composer result for Expression-1 :
[ Debug ] Expression-1 expanded : https://www.freesat.co.uk/whats/showcase/api/channel/505/episode/64262
[ Debug ] Element value after operation:
[ Debug ] https://www.freesat.co.uk/whats/showcase/api/channel/505/episode/64262
[ Debug ] skipped show without a title at
[ Info ]
[ Info ] Summary for update of BBC 1
[ Info ] no changes, no update necessary !
[ Info ] unchanged shows inspected 0
[ Info ] total after update 0
I’ve given it another look and also removed the debugs which I didn’t need. I’m still unable to scrub the urlshow to get the data from the details page; I’m only able to get data from the url_index.
site {url=freesat.co.uk|timezone=GMT|maxdays=7|cultureinfo=en-GB|charset=UTF-8|titlematchfactor=90}
site {episodesystem=onscreen}
url_index{url(debug)|https://www.freesat.co.uk/tv-guide/api/|urldate|/?channel=|channel|}
urldate.format {daycounter|0}
index_showsplit.scrub {regex||(?!.*event){".+?\}||}
index_showsplit.modify {cleanup(removeduplicates=equal,100)}
index_variable_element.modify {set|'config_site_id'}
index_temp_2.scrub {regex||"evtId":\s*(\d+)||}
index_urlshow.modify {addstart('index_temp_2' not "")(debug)|https://www.freesat.co.uk/whats/showcase/api/channel/'index_variable_element'/episode/'index_temp_2'}
temp_1.scrub {single(debug)||||}
I’ve done some more reading of the manual and also looked into how other siteinis are scrubbing the details page and I can’t really see why this isn’t working.
I’ve added the log as an attachment
In the config file I’ve also got the timespan set like this, <timespan>0 19:30</timespan>, to test the scrub.
They must be doing something; for https://www.freesat.co.uk/ I get:
403. That’s an error.
Access is forbidden. That’s all we know.
Are you in the UK? Maybe it's blocked if you aren't. The links generated in the logs from the urlshow are all working links; if you copy them into the browser, it loads up the JSON for that show.
I'm able to write shows and the information from the details page to the XML file now. When looking at the XML, I came across shows with multiple elements like the name, start time and duration, which relate to otherEpisodes and not the main show. I’ve read about removeduplicates, which you can add a match factor to, but I’m not sure this is the right way around the issue: as you can see, the name on some of them changes a fair bit, and even if it does work it will only work for the name and not the start time, since there is nothing really similar to match. Looking in the manual I’ve not yet come across another possible solution. Is there a way I could make WebGrab ignore otherEpisodes?
{
  "duration":3600,
  "showcase":false,
  "hd":false,
  "evtId":64588,
  "dolby":false,
  "hasTstv":false,
  "channel":{
    "channelname":"BBC 1",
    "tstv":true,
    "channeldescription":"Aims to speak to everyone in the UK through programming that celebrates the richness and diversity of life in new and surprising ways.",
    "white_logourl":"https://fdp-sv15-image-v1-0.gcprod1.freetime-platform.net/cache//ms/img/...",
    "logourl":"https://fdp-sv15-image-v1-0.gcprod1.freetime-platform.net/cache/ms/img/c...",
    "channelid":505,
    "lcn":101
  },
  "otherEpisodes":{
    "event":[
      {
        "startTime":1524214800,
        "duration":3600,
        "svcId":505,
        "name":"Homes Under the Hammer",
        "evtId":58337
      },
      {
        "startTime":1524214800,
        "duration":3600,
        "svcId":555,
        "name":"Homes Under the Hammer",
        "evtId":58337
      },
      {
        "startTime":1524256200,
        "duration":1800,
        "svcId":505,
        "name":"Home from Home",
        "evtId":63595
      }
    ]
  },
  "series":true,
  "seriesNo":29,
  "hdSimulcast":{
    "svcId":555,
    "evtId":64588
  },
  "hasTrailer":false,
  "description":"A two-bed flat in Woodford Green in London, a three-bed terrace in the Welsh valleys and a one-bed flat in Middleton all go under the hammer. Also in HD. [AD,S]",
  "sub":true,
  "threeD":false,
  "longDescription":"Property renovation series. A two-bed flat in Woodford Green, London, a three-bed terrace in the Welsh valleys and a one-bed flat in Middleton, Greater Manchester, all go under the hammer. Martin, Lucy and Dion catch up with the new owners to find out their renovation plans. [AD,S]",
  "startTime":1524646800,
  "genre":"Entertainment",
  "ad":false,
  "image":"/ms/img/epg/rb/6414-6321034.l.png",
  "guidance":false,
  "svcId":505,
  "sl":false,
  "name":"Homes Under the Hammer",
  "showingAgain":{
    "event":[
    ]
  }
}
Yeah, I've worked out what I need from the details page. Currently I'm finding it hard to grab the name and startTime from the details page because those are also used within "otherEpisodes" : { }, which I ideally want to remove completely from the details page using a regex, since its location moves in the JSON depending on the loading of the page. Is there a way I can modify everything within the JSON that is returned from urlshow, so I can remove "otherEpisodes" before I begin scrubbing the elements within the JSON?
From the details page, I'm using genre, longDescription and hd, so I can display a more detailed description in the TV guide, set the genre for colour coding, and use hd to set SDTV/HDTV in video quality.
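To illustrate the idea of dropping "otherEpisodes" before reading name and startTime, here is a minimal sketch in Python that works outside WebGrab. It parses the JSON instead of using a regex, so key order does not matter. The function name is hypothetical, and the sample is a cut-down copy of the details JSON posted above; it is not a siteini solution.

```python
import json


def strip_nested_events(detail_text, keys=("otherEpisodes", "showingAgain")):
    """Parse a details page and drop the blocks that repeat name/startTime."""
    detail = json.loads(detail_text)
    for key in keys:
        detail.pop(key, None)  # ignore keys that are not present
    return detail


# Cut-down version of the details JSON shown earlier in the thread.
SAMPLE = """{
  "evtId": 64588,
  "otherEpisodes": {"event": [
    {"startTime": 1524256200, "duration": 1800, "svcId": 505,
     "name": "Home from Home", "evtId": 63595}
  ]},
  "startTime": 1524646800,
  "genre": "Entertainment",
  "name": "Homes Under the Hammer",
  "showingAgain": {"event": []}
}"""

if __name__ == "__main__":
    main_show = strip_nested_events(SAMPLE)
    print(main_show["name"], main_show["startTime"])
```

After the nested blocks are gone, the only name and startTime left belong to the main show.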
Yeah, I think this one turned out a lot harder than I thought it was going to be. When I first looked at the JSON, I thought it wouldn't be that bad since it followed similar sites, but then finding out the elements changed position made it a bit more challenging. That was overcome with regex, and then it seemed to go more smoothly until I started discovering extra titles from the details page D: which has kind of screwed it up for me. It's a really detailed guide which covers all UK free-to-air channels, so it's a shame to give up on it. What I think I might do is carry on with it and not worry about the extra titles, since they're contained within the same programme. Until I can find a solution in WebGrab, I can get away with removing those extra elements using a Python script once the XML file is created, and then spend more time learning how to use WebGrab, which will hopefully one day reveal a solution to this issue. But thanks again for all your help, you've really helped me a lot and helped me develop a better understanding of WebGrab.
I’m getting an issue with the date format, which I have become stuck on: it works for a while, then stops working and gives me a date error. I really can’t find out why this is happening, since the timestamp is in the correct format and it works from time to time.
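A rough sketch of the kind of Python clean-up script mentioned above, assuming the extra elements are additional title elements after the first one inside each programme. That assumption, the function name and the sample XML are all made up for illustration; the real output may need different rules.

```python
import xml.etree.ElementTree as ET


def keep_first_title(xmltv_text):
    """Remove every <title> after the first one in each <programme>."""
    root = ET.fromstring(xmltv_text)
    # findall returns a list, so removing children below is safe.
    for programme in root.findall("programme"):
        for extra in programme.findall("title")[1:]:
            programme.remove(extra)
    return ET.tostring(root, encoding="unicode")


# Hypothetical XMLTV fragment with one unwanted extra title.
SAMPLE = """<tv>
  <programme start="20180425090000 +0000" channel="BBC 1">
    <title lang="en">Homes Under the Hammer</title>
    <title lang="en">Home from Home</title>
  </programme>
</tv>"""

if __name__ == "__main__":
    print(keep_first_title(SAMPLE))
```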
Here is what I have so far:
site {url=freesat.co.uk|timezone=GMT|maxdays=7|cultureinfo=en-GB|charset=UTF-8|titlematchfactor=90}
site {episodesystem=onscreen}
url_index{url(debug)|https://www.freesat.co.uk/tv-guide/api/|urldate|/?channel=|channel|}
url_index.headers {customheader=Accept-Encoding=gzip,deflate} * to speedup the downloading of the index pages
urldate.format {daycounter|0}
index_showsplit.scrub {regex||(?!.*event){".+?\}||}
index_showsplit.modify {cleanup(removeduplicates=equal,100)}
index_variable_element.modify {set|'config_site_id'}
index_temp_2.scrub {regex||"evtId":\s*(\d+)||}
index_title.scrub {regex||"name":\s*"(.+?)"||}
index_start.scrub {regex(debug)||"startTime":\s*\d*||}
index_start.modify {remove(debug)|"startTime": }
index_duration.scrub {regex(debug)||"duration":\s*\d*||}
index_duration.modify {remove(debug)|"duration": }
index_duration.modify {calculate(debug)(format=F0)|60 /}
index_urlshow.modify {addstart('index_temp_2' not "")|https://www.freesat.co.uk/whats/showcase/api/channel/'index_variable_element'/episode/'index_temp_2'}
index_urlshow.headers {customheader=Accept-Encoding=gzip,deflate} * to speedup the downloading of the detail pages
scope.range{(showdetails)|end}
title.modify {addstart|'index_title'}
description.scrub {regex||"longDescription":\s*"(.+?)"||}
category.scrub {regex||"genre":\s*"(.+?)"||}
videoquality.scrub {regex||"hd":\s*\w*||}
videoquality.modify {replace|false|SDTV}
videoquality.modify {replace|true|HDTV}
videoquality.modify {remove|"hd": }
*temp_1.scrub {regex||"episodeNo":\s*(\d+)||}
*temp_1.modify {remove(debug)|"episodeNo": }
*temp_1.modify {addstart('temp_1' not "")|E}
*temp_2.scrub {regex||"seriesNo":\s*(\d+)||}
*temp_2.modify {remove(debug)|"seriesNo": }
*temp_2.modify {addstart('temp_1' not "")|E}
*episode.modify {addstart('temp_1' not "")|'temp_1'}
*episode.modify {addstart('temp_2' not "")|'temp_2'}
end_scope
Sometimes it runs fine, and other times I get this error message:
update requested for - 1 - out of - 1 - channels for 1 day(s)
( 1/1 ) FREESAT.CO.UK -- chan. (xmltv_id=BBC 1) -- mode Force
innnnnnnnnnnnnnnnnnnnnnnn
Unable to update channel BBC 1
Generic syntax exception:
message:
Current culture: en-GB
time parsing error : String was not recognized as a valid DateTime.
nextstartdatetime time scrubbed :
computer date/time format: 27/04/2018 20:24:48
Existing guide data restored!
I have also attached the log. Here is a link to the JSON which it’s scrubbing from; I’ve had to use regex to grab the values because they change position in the JSON: https://www.freesat.co.uk/tv-guide/api/0/?channel=506
Thank you for that info about the capture. Ah, so the showsplit might not be splitting the shows in the right place, which could cause one show to have more than one start time, and that is what throws this error?
Using (?!.*event) lets the regex ignore the leading {"channelid": "506", "offset": 0, "event": [ because the show information all comes after that, which allowed me to grab everything between {xxxx}, which I used to split up the shows. I’ve just discovered why it’s not working: if a show is part of a series it contains {xxxx}xxxx}, whereas normally it would be just {xxxx}, which is when the regex works fine. Here is a link to the regex being used to split the shows: https://regex101.com/r/mUX5kL/1/
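For reference, a small Python sketch of what the scrubbed values stand for: startTime is a Unix timestamp in seconds and duration is in seconds. The empty-value guard mirrors the log above, where the scrubbed start time is blank. The function names are hypothetical and this runs outside WebGrab.

```python
from datetime import datetime, timezone


def parse_start(value):
    """Convert a scrubbed epoch string to a UTC datetime, or None if blank."""
    if not value or not value.strip().isdigit():
        return None  # an empty scrub is what triggers the DateTime error
    return datetime.fromtimestamp(int(value.strip()), tz=timezone.utc)


def duration_minutes(seconds):
    """Duration in whole minutes, as the calculate step in the siteini intends."""
    return seconds // 60


if __name__ == "__main__":
    print(parse_start("1524646800"))
    print(parse_start(""))
    print(duration_minutes(3600))
```

If the start value is sometimes blank, that points at the show split producing a chunk with no startTime in it, not at the timestamp format itself.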
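The nested-brace problem can be reproduced outside WebGrab. This Python sketch runs the same lazy pattern on a made-up index response and compares it with parsing the JSON properly. The sample data, including the nested "series" object, is hypothetical; only the wrapper shape and the pattern come from the post above.

```python
import json
import re

# Hypothetical index response: the second show has a nested object.
SAMPLE = (
    '{"channelid": "506", "offset": 0, "event": ['
    '{"name": "A", "evtId": 1}, '
    '{"name": "B", "series": {"seriesNo": 29}, "evtId": 2}'
    ']}'
)

# Same idea as the showsplit regex: lazy match up to the first closing brace.
naive_split = re.findall(r'(?!.*event)\{.+?\}', SAMPLE)

# Parsing the JSON keeps each show whole regardless of nesting.
parsed_shows = json.loads(SAMPLE)["event"]

if __name__ == "__main__":
    for chunk in naive_split:
        print(chunk)
    print([show["evtId"] for show in parsed_shows])
```

The lazy match stops at the first closing brace, so the second chunk ends inside the nested object and loses everything after it.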
Thanks, I have it creating an XML file perfectly now, but I don’t really get the part about converting the episode number into xmltv_ns. I have xmltv_ns set in site and the format of my scrub is S1E3, S1 or E3. I read that a pattern needs to be set to convert it into xmltv_ns, but I don’t really get how the pattern works.
Currently this is how I’m grabbing the series and episode:
temp_1.scrub {regex||"episodeNo":\s*(\d+)||}
temp_1.modify {remove|"episodeNo": }
temp_1.modify {addstart('temp_1' not "")|E}
temp_2.scrub {regex||"seriesNo":\s*(\d+)||}
temp_2.modify {remove|"seriesNo": }
temp_2.modify {addstart('temp_2' not "")|S}
episode.modify {addstart('temp_2' not "")|'temp_2'}
episode.modify {addstart('temp_1' not "")|'temp_1'}
But I’m unsure how I would add a pattern to this to help convert it from onscreen to xmltv_ns.
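To show what the conversion has to produce, here is a minimal Python sketch of the onscreen to xmltv_ns mapping. xmltv_ns is "season.episode.part" with zero-based numbers, so S1E3 becomes 0.2. with the part left empty. This is not WebGrab's pattern syntax, and the function name is hypothetical; it only illustrates the target format for the three shapes mentioned (S1E3, S1, E3).

```python
import re

# Optional S<number> followed by optional E<number>, nothing else.
ONSCREEN = re.compile(r'^(?:S(\d+))?(?:E(\d+))?$')


def onscreen_to_xmltv_ns(onscreen):
    """Convert 'S1E3', 'S1' or 'E3' to xmltv_ns; None if nothing usable."""
    match = ONSCREEN.match(onscreen.strip())
    if not match or not (match.group(1) or match.group(2)):
        return None
    # xmltv_ns counts from zero, so subtract one from each number present.
    season = str(int(match.group(1)) - 1) if match.group(1) else ""
    episode = str(int(match.group(2)) - 1) if match.group(2) else ""
    return f"{season}.{episode}."


if __name__ == "__main__":
    for text in ("S1E3", "S1", "E3"):
        print(text, "->", onscreen_to_xmltv_ns(text))
```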