Hi guys,
I like this project and would like to learn it so that I can contribute.
I am trying to grab aljazeera.net by myself. I know a siteini for it is already there, but it is encrypted.
I found this request in devtools: https://www.aljazeera.net/graphql?wp-site=aja&operationName=SchedulePage...
When I try to grab it, it shows:
no index page data received from aljazeera
unable to update channel, try again later
Existing guide data restored!
I tried with a cookie, but unfortunately I get the same error.
Can you advise what I am missing?
I appreciate your support.
Post your WebGrab log please, you are getting closer ;)
PS:
Ask the authors to enable debug on your license. The new version does things that you cannot do with 2.1.
Here it is :)
Wait, let me see if I can enable debug...
Do you want me to add debug and send you another log?
OK, try.
Here it is.
OK, close your browser and clear its cookies, then open the browser again and paste that link (do not open any other site, or aljazeera itself, first). The answer is there; tell me if you get data or find the solution ;)
It says the site header is not specified,
but I already set a header:
url_index.headers {customheader=Accept-Encoding=gzip,deflate}
Then we can discuss showsplit.
Good! But that is not what you need. Now, which "site" header does it want? What do you think?
Suggestion: the same address can give you English or Arabic. Find the difference and see what changes.
It needs the same request headers as in devtools.
But how can I add those? Can you provide an example?
Ask Jan to enable debug for you (send a mail via the forum contact). Compared to older versions, WG 3.1 does a lot more, and more easily.
Only one header is needed, so you write a single url_index.headers {customheader=
Compare the link in your picture with this link:
https://www.aljazeera.net/graphql?wp-site=aje&operationName=SchedulePage...
The difference tells the aljazeera site which schedule you want, Arabic or English.
I don't understand what you are asking. I read your post five times, but I am not getting your point. What debug can I get?
https://www.aljazeera.net/graphql?wp-site=aja&operationName=SchedulePage...
https://www.aljazeera.net/graphql?wp-site=aje&operationName=SchedulePage...
What changes?
Oh my god, wp-site=aje!
Thank you so much.
But the question now is: where did you get the aje value from?
You are awesome, thank you so much for such kind help.
Much appreciated.
url_index.headers {customheader=wp-site=aje} * Specifies the site header: the site language, which you also need to indicate in the URL.
aje = Al Jazeera English
aja = Al Jazeera Arabic
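For anyone following along, the "find the difference" exercise can also be done mechanically. This is only a small Python sketch, not part of WebGrab+Plus: it compares the query parameters of the two links exactly as they were posted (both are truncated with "..." in the thread, and that is left untouched). The function name query_differences is made up for this illustration.

```python
from urllib.parse import urlsplit, parse_qs

# The two links exactly as posted in the thread; both are truncated ("...") there.
URL_ARABIC = "https://www.aljazeera.net/graphql?wp-site=aja&operationName=SchedulePage..."
URL_ENGLISH = "https://www.aljazeera.net/graphql?wp-site=aje&operationName=SchedulePage..."


def query_differences(url_a: str, url_b: str) -> dict:
    """Return {parameter: (values_in_a, values_in_b)} for query parameters that differ."""
    params_a = parse_qs(urlsplit(url_a).query)
    params_b = parse_qs(urlsplit(url_b).query)
    differences = {}
    for name in sorted(set(params_a) | set(params_b)):
        if params_a.get(name) != params_b.get(name):
            differences[name] = (params_a.get(name), params_b.get(name))
    return differences


if __name__ == "__main__":
    # Only wp-site differs between the two posted links.
    print(query_differences(URL_ARABIC, URL_ENGLISH))
```

Only the wp-site parameter shows up in the result, which is the value the site header has to carry.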
Now I want to see your ini ;)
It generates the guide now, but because the data comes from JSON rather than HTML, the guide starts on Sunday instead of today.
"schedule":[
{
"showDay":"Sunday",
"showTimeslot":"00:00",
"showName":"نشرة الأخبار",
"showDescription":"نشرة تقدم الأخبار السياسية العربية والعالمية.",
"duration":"01:26:0",
"startDate":"1619308800",
"__typename":"Schedule"
},
{
"showDay":"Sunday",
"showTimeslot":"01:26",
"showName":"النشرة الجويـة",
"showDescription":"التنبؤات بأوضاع الطقس ومتغيراته، ودرجات الحرارة والرطوبة والمنخفضات الجوية المتوقعة.",
"duration":"00:34:0",
"startDate":"1619308800",
"__typename":"Schedule"
},
I am stuck on how to figure out the date part.
Can you give me a clue?
Thank you so much, Mat.
In the data link you get shows with a "startDate":"1620000000" (a Unix date, = 3 May) or a "showDay":"Monday" (weekday name), so you have 2 possibilities.
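To see what the two fields carry, here is a minimal Python sketch (outside WebGrab+Plus, purely as a check) that turns a startDate value into a calendar date and a weekday name. It assumes the stamp is meant as UTC; the function name describe_start_date is hypothetical.

```python
from datetime import datetime, timezone

# Fixed English names so the result does not depend on the system locale.
WEEKDAYS = ("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday")


def describe_start_date(start_date: str) -> tuple:
    """Convert a Unix stamp (as text, like in the JSON) to (yyyy-mm-dd, weekday name).

    Assumption: the stamp is interpreted as UTC.
    """
    moment = datetime.fromtimestamp(int(start_date), tz=timezone.utc)
    return moment.strftime("%Y-%m-%d"), WEEKDAYS[moment.weekday()]


if __name__ == "__main__":
    print(describe_start_date("1620000000"))  # the value quoted in the post above
    print(describe_start_date("1619308800"))  # the value from the posted JSON
```

Under that UTC assumption, 1620000000 is Monday 3 May 2021 and 1619308800 is Sunday 25 April 2021, so startDate and showDay agree with each other.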
Yes, I know, but how do I filter on those values?
I am stuck here.
Could you advise, please?
By the way, I contacted Jan and he enabled debug mode for me.
I have been trying for two hours to grab the data based on startDate (Unix date) or showDay, and I could not find a way to do it.
Can you advise?
Thank you so much.
index_start is wrong, it should be "showTimeslot":"||",|",
Yes, sorry, my mistake: I had changed it for a test and sent the old file. But the problem is that the guide does not start from today. How can I set the start date to match startDate?
You did not read post 22... scrub index_date.
Also change the revision to v3.1 and remove the cookie.
I read that, but I could not find a way to do it. Can you edit my ini and show me the correct way, please?
Here it is.
Pay attention to what I write: multi is for when, within the same block start (BS) and block end (BE), there are multiple repeated elements, each with an element start (ES) and element end (EE). That is not the case here, because everything is single.
See documentation section 4.2.1.1: http://webgrabplus.com/sites/default/files/download/documentation/Manual...
Oh god, I did not notice index_date.
Much appreciated, thank you Mat :)
Tomorrow I will do another siteini.
Before you start, read the documentation above... I still read it after 5 years.
Now I got this.
I guess it is working fine now; I just added these two lines:
index_date.scrub {single|startDate":"||",|",}
index_date.modify {calculate(format=utctime)}
Thank you so much
Your solution is not correct. Do it the easy way: remove index_date and set firstday=1 in the site line so that it starts on Monday.
Hi Mat,
You are right, there was an error that I did not see yesterday. When I used firstday=1 it grabbed the data, but the date is still not correct.
I wonder how firstday=1 works. I checked the documentation, but it does not show how to map firstday to the property I am looking at, which in my example looks like "startDate":"1619308800".
The generated file works, but the time is not correct. How can I set the date to be startDate?
Please advise.
I mistyped: it should be firstday=0123456, and then urldate.format {daycounter|1} to indicate that the first day of the index should be bypassed. In the log you should see an indication of skipped shows, i.e. shows that happened before 'today'.
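To picture what "skipped: shows that happened before 'today'" means for this weekly data, here is a rough Python sketch. It is not how WebGrab+Plus implements the check; it only illustrates dropping start stamps that fall before midnight of a given day, assuming UTC. The function name drop_past_shows is made up.

```python
from datetime import date, datetime, timezone


def drop_past_shows(start_stamps, today: date):
    """Keep only Unix start stamps at or after 00:00 UTC on `today`.

    Assumption: stamps and 'today' are both interpreted in UTC.
    """
    midnight = datetime(today.year, today.month, today.day, tzinfo=timezone.utc)
    cutoff = int(midnight.timestamp())
    return [stamp for stamp in start_stamps if stamp >= cutoff]


if __name__ == "__main__":
    # Stamps quoted in this thread; 3 May 2021 is used as a fixed 'today'.
    stamps = [1619308800, 1619568000, 1620000000, 1620086400]
    print(drop_past_shows(stamps, date(2021, 5, 3)))
```

With 3 May 2021 as 'today', the two April stamps are dropped and the two May stamps remain.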
Why do I need urldate.format {daycounter|1}?
The data from this URL is not generated for a specified date:
https://www.aljazeera.net/graphql?wp-site=aja&operationName=SchedulePage...
Check the attached .json.
The data covers almost a week, and what I am looking for is to combine "showTimeslot" + "startDate" into the start index.
I tried multiple times, but it still does not show the correct result.
"schedule":[
{
"showDay":"Sunday",
"showTimeslot":"00:00",
"showName":"نشرة الأخبار",
"showDescription":"نشرة تقدم الأخبار السياسية العربية والعالمية.",
"duration":"01:26:0",
"startDate":"1619308800",
"__typename":"Schedule"
},
{
"showDay":"Sunday",
"showTimeslot":"01:26",
"showName":"النشرة الجويـة",
"showDescription":"التنبؤات بأوضاع الطقس ومتغيراته، ودرجات الحرارة والرطوبة والمنخفضات الجوية المتوقعة.",
"duration":"00:34:0",
"startDate":"1619308800",
"__typename":"Schedule"
},
Can you check how you did it in your original version, "aljazeera.com.ini"?
I spent 3 days on this, trying to learn and to figure out how to do it.
Please check my ini as well.
Thanks Mat :)
Fix firstday properly, see page 19.
Can we do a Zoom meeting for 5 minutes?
Yes, I added those already, but the problem with this JSON is that startDate is a Unix date only, not a date-time: converted, it shows Tue May 04 2021 00:00:00 GMT+0000. So to get the correct index_start I need to calculate startDate + showTimeslot, and this is where I am stuck.
{
"showDay":"Sunday",
"showTimeslot":"01:26",
"showName":"النشرة الجويـة",
"showDescription":"التنبؤات بأوضاع الطقس ومتغيراته، ودرجات الحرارة والرطوبة والمنخفضات الجوية المتوقعة.",
"duration":"00:34:0",
"startDate":"1619308800",
"__typename":"Schedule"
},
And if you check the JSON I sent in post 39, the data is not sorted by date, it is sorted by weekday: Tuesday = 1620086400, but the following Wednesday = 1619568000, which is last week, not the next day.
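The ordering problem described here can be reproduced with a few lines of Python. This is only an illustration with the three startDate values quoted in the thread; the "00:00" timeslots for Tuesday and Wednesday are assumed for the example, and sort_chronologically is a made-up helper name.

```python
# Entries in the weekday order of the feed. The startDate values are the ones
# quoted in this thread; the Tuesday/Wednesday timeslots are assumed.
ENTRIES = [
    {"showDay": "Sunday", "showTimeslot": "00:00", "startDate": "1619308800"},
    {"showDay": "Tuesday", "showTimeslot": "00:00", "startDate": "1620086400"},
    {"showDay": "Wednesday", "showTimeslot": "00:00", "startDate": "1619568000"},
]


def sort_chronologically(entries):
    """Order entries by real date first, then by the zero-padded HH:MM timeslot."""
    return sorted(entries, key=lambda e: (int(e["startDate"]), e["showTimeslot"]))


if __name__ == "__main__":
    for entry in sort_chronologically(ENTRIES):
        print(entry["showDay"], entry["startDate"])
```

Sorted by date, Wednesday (1619568000) comes before Tuesday (1620086400), which confirms that the feed mixes days from two different weeks.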
Can you just modify the file for me?
Thanks.
Use only showTimeslot":" as the index_start... no index_date.
This is what I have been doing since yesterday. The question is: when I use only showTimeslot, how does the grabber know the date from the JSON? Either I don't understand you, or I don't understand the whole idea of WG++.
I have been stuck on this for three days, trying to accomplish it. I thought it would be fun, but I don't think it is for me; I wasted a lot of time. I will do it the stupid way, with the HTML page, and grab every 8 hours.
Thank you Mat.
I think I now understand what you want to do... wait (I am at work now).
Please
So do this. Change lang=en, then:
1. Scrub startDate into a temp_1, then modify it with calculate format=yyyy/MM/dd.
2. Then scrub showTimeslot into another temp, temp_2.
3. addstart 'index_temp_1' 'index_temp_2' to index_start.
4. Modify index_start with calculate format=date,unix.
5. With modify, addend (lang=ar) to the title and description.
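The arithmetic behind steps 1 to 4 can be sketched in Python. This is not siteini syntax and not a WebGrab+Plus API, just the same calculation written out so the intermediate values are visible. It assumes startDate is midnight UTC of the show's day and that showTimeslot is in the same time zone; the function name index_start is borrowed from the siteini element purely for readability.

```python
from datetime import datetime, timezone


def index_start(start_date: str, show_timeslot: str) -> int:
    """Combine a Unix day stamp and an HH:MM timeslot into one Unix start time.

    Assumption: both values are in UTC.
    """
    # Step 1: Unix day stamp -> yyyy/MM/dd text (the role of temp_1).
    day_text = datetime.fromtimestamp(int(start_date), tz=timezone.utc).strftime("%Y/%m/%d")
    # Steps 2 and 3: put the date and the timeslot (temp_2) together.
    combined = f"{day_text} {show_timeslot}"
    # Step 4: parse the combined text and convert it back to a Unix stamp.
    moment = datetime.strptime(combined, "%Y/%m/%d %H:%M").replace(tzinfo=timezone.utc)
    return int(moment.timestamp())


if __name__ == "__main__":
    # Second entry of the posted JSON: startDate 1619308800, showTimeslot 01:26.
    print(index_start("1619308800", "01:26"))
```

For the second posted entry this gives 1619313960, i.e. the day stamp plus 1 hour 26 minutes, which also matches the first entry's start plus its "01:26:0" duration.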
I already did all of these steps. Did you check my ini?
I see that in your latest siteini the max 7 should be 7.1, or whatever the number of days is, as it is a .1 siteini.
Yes, you are right.
Can you modify my ini and send it back to me?
It needs your touch.
Thanks.
If you keep asking without trying and understanding, you will never learn, so this is the first and last time I do it for you. From now on, only suggestions :)