This is my first time using Python and Scrapy.
To reach the information I want, I have to log in, and the flow is:
Login page -> Login verification URL1 -> Login verification URL2 -> Desired page
The URL changes several times like this before the login completes.
The spider code is as follows.
from scrapy.spiders.init import InitSpider
from scrapy.http import Request, FormRequest


class gradeSpider(InitSpider):
    name = "grade"
    allowed_domains = ["example.com"]
    login_page = "https://www.example.com/login/myweb.jsp?RSP=www.example.com&RelayState=index_SSO.jsp"
    start_urls = "https://www.example.com/main/dataList"

    def init_request(self):
        # Open the login page before anything else.
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Fill in the login form found on the page and submit it.
        return FormRequest.from_response(response, formdata={'userID': 'myid', 'userPW': 'mypw'}, callback=self.check_login_response)

    def check_login_response(self, response):
        print("+" * 50)
        print("current URL: " + response.url)
        print("+" * 50)
        # check login success (the actual test is omitted in this post; `success` is a placeholder)
        if success:
            return self.initialized()
        self.logger.error("Login failed")

    def initialized(self):
        # Logged in: request the page to crawl.
        return Request(url=self.start_urls, callback=self.parse_item)

    def parse_item(self, response):
        # doing parse
        print("Success login ready to parse.")
This is the test code so far. It prints the current URL so I can check whether the login succeeded, and then goes to start_urls to crawl.
Running the code above gives:
2016-09-20 08:59:22 [scrapy] INFO: Spider opened
2016-09-20 08:59:22 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-09-20 08:59:22 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-09-20 08:59:22 [scrapy] DEBUG: Crawled (404) GET https://www.example.com/robots.txt (referer: None)
2016-09-20 08:59:22 [scrapy] DEBUG: Crawled (200) GET https://www.example.com/login/myweb.jsp?RSP=www.example.com&RelayState=index_SSO.jsp (referer: None)
2016-09-20 08:59:23 [scrapy] DEBUG: Crawled (200) POST https://www.example.com/login/loginCheck.jsp (referer: https://www.example.com/login/myweb.jsp?RSP=www.example.com&RelayState=index_SSO.jsp)
+++++++++++++++++++++++++++++++++++++++++++++++++
current URL: https://www.example.com/login/loginCheck.jsp
// I want the current URL to be start_url... (Currently stopped at login verification URL 1)
+++++++++++++++++++++++++++++++++++++++++++++++++
2016-09-20 08:59:23 [scrapy] INFO: Closing spider (finished)
That is where it stops. I want the current URL to be the start_urls page I intend to crawl, as noted in the comment in the output, but it is not.
What should I do? Thank you for reading.
scrapy python crawling
If you don't have to use Scrapy, you can solve this easily with Selenium.
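A minimal sketch of the Selenium approach, assuming Python 3, Selenium 4, and a local Chrome install. The field names userID and userPW come from the formdata in the question. The names same_page and fetch_after_login, the timeout, and the guess that every login verification URL lives under /login/ are hypothetical and would need adjusting for the real site. Because a real browser drives the login, it follows the intermediate verification URLs on its own.

```python
from urllib.parse import urlsplit

LOGIN_PAGE = "https://www.example.com/login/myweb.jsp?RSP=www.example.com&RelayState=index_SSO.jsp"
TARGET_PAGE = "https://www.example.com/main/dataList"


def same_page(current_url, target_url):
    """Return True when both URLs share host and path (query string ignored)."""
    current, target = urlsplit(current_url), urlsplit(target_url)
    return (current.netloc, current.path) == (target.netloc, target.path)


def fetch_after_login(user_id, password, timeout=15):
    """Log in through a real browser and return the HTML of TARGET_PAGE."""
    # Imported here so same_page() stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()  # assumption: Chrome is installed locally
    try:
        driver.get(LOGIN_PAGE)
        # Field names taken from the formdata in the question.
        driver.find_element(By.NAME, "userID").send_keys(user_id)
        password_field = driver.find_element(By.NAME, "userPW")
        password_field.send_keys(password)
        password_field.submit()
        # Assumption: all login verification URLs sit under /login/,
        # so leaving that path means the login chain has finished.
        WebDriverWait(driver, timeout).until(
            lambda d: "/login/" not in urlsplit(d.current_url).path
        )
        # The session is now logged in; open the page to scrape if needed.
        if not same_page(driver.current_url, TARGET_PAGE):
            driver.get(TARGET_PAGE)
        return driver.page_source
    finally:
        driver.quit()
```

Calling fetch_after_login("myid", "mypw") would return the page HTML, which can then be parsed with any HTML parser. If the wait times out, Selenium raises a TimeoutException, which usually means the login did not go through or the /login/ assumption does not hold.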