How can I get a logged-in session with Scrapy when the URL changes several times during login?

Asked 1 year ago, updated 1 year ago, 65 views

This is my first time using Python and Scrapy.

When I log in to get the information I want, the site passes through several URLs:

Login page -> Login verification URL1 -> Login verification URL2 -> Desired page

So the URL changes several times before the login is complete.

The spider code is as follows.

from scrapy.http import FormRequest, Request
from scrapy.spiders.init import InitSpider  # scrapy.contrib.spiders.init in old releases


class gradeSpider(InitSpider):
    name = "grade"
    allowed_domains = ["example.com"]
    login_page = "https://www.example.com/login/myweb.jsp?RSP=www.example.com&RelayState=index_SSO.jsp"
    start_urls = ["https://www.example.com/main/dataList"]  # Scrapy expects a list

    def init_request(self):
        # Fetch the login page before anything else.
        return Request(url=self.login_page, callback=self.login)

    def login(self, response):
        # Fill in and submit the login form found on the page.
        return FormRequest.from_response(
            response,
            formdata={'userID': 'myid', 'userPW': 'mypw'},
            callback=self.check_login_response,
        )

    def check_login_response(self, response):
        print("+" * 50)
        print("current URL: " + response.url)
        print("+" * 50)

        # Check login success. Placeholder condition: the real check
        # has not been written yet.
        success = response.url.startswith(self.start_urls[0])
        if success:
            return self.initialized()
        else:
            self.logger.error("Login failed")

    def initialized(self):
        return Request(url=self.start_urls[0], callback=self.parse_item)

    def parse_item(self, response):
        # Parsing goes here.
        print("Success login ready to parse.")

This is the test code so far. It prints the current URL so I can check whether the login succeeded, and it should then go on to start_urls to crawl.

Running the code above produces this log:

2016-09-20 08:59:22 [scrapy] INFO: Spider opened
2016-09-20 08:59:22 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-09-20 08:59:22 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-09-20 08:59:22 [scrapy] DEBUG: Crawled (404) GET https://www.example.com/robots.txt (referer: None)
2016-09-20 08:59:22 [scrapy] DEBUG: Crawled (200)  GET https://www.example.com/login/myweb.jsp?RSP=www.example.com&RelayState=index_SSO.jsp (referer: None)
2016-09-20 08:59:23 [scrapy] DEBUG: Crawled (200)  POST https://www.example.com/login/loginCheck.jsp (referer: https://www.example.com/login/myweb.jsp?RSP=www.example.com&RelayState=index_SSO.jsp)
+++++++++++++++++++++++++++++++++++++++++++++++++
current URL: https://www.example.com/login/loginCheck.jsp
// I want the current URL to be start_url... (Currently stopped at login verification URL 1)
+++++++++++++++++++++++++++++++++++++++++++++++++
2016-09-20 08:59:23 [scrapy] INFO: Closing spider (finished)

As the comment in the log says, I expected the current URL to be the start_url I want to crawl, but the spider stops at the first login verification URL.

What should I do? Thank you for reading.

scrapy python crawling

2022-09-21 15:49

1 Answer

If you don't have to use Scrapy, you can solve this easily with Selenium.
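A minimal sketch of what that could look like, under some assumptions. A real browser follows every verification page on its own, so the code only submits the form and waits for the final URL. The field names userID and userPW are taken from the question's formdata. The function name login_with_browser and the done_url_prefix parameter are made up for illustration, and driver is assumed to be a Selenium WebDriver such as selenium.webdriver.Chrome().

```python
import time


def login_with_browser(driver, login_url, done_url_prefix,
                       user_id, password, timeout=15.0):
    """Log in through a browser and return the session cookies as a dict.

    driver          -- a Selenium WebDriver instance (assumed)
    done_url_prefix -- hypothetical: the URL prefix the browser ends up on
                       once every login verification page has been passed
    """
    driver.get(login_url)

    # "name" is the locator strategy string behind selenium's By.NAME.
    driver.find_element("name", "userID").send_keys(user_id)
    password_field = driver.find_element("name", "userPW")
    password_field.send_keys(password)
    password_field.submit()  # submits the form that contains this field

    # Poll until the browser has left the verification pages.
    deadline = time.monotonic() + timeout
    while not driver.current_url.startswith(done_url_prefix):
        if time.monotonic() >= deadline:
            raise TimeoutError("still at " + driver.current_url)
        time.sleep(0.2)

    # get_cookies() returns a list of dicts with "name" and "value" keys.
    return {c["name"]: c["value"] for c in driver.get_cookies()}
```

If you still want to parse with Scrapy afterwards, the returned dict can be passed as the cookies argument of a scrapy Request, so the spider reuses the session the browser obtained.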


2022-09-21 15:49



© 2024 OneMinuteCode. All rights reserved.