How to log in when crawling with a Java program

Asked 2 years ago, Updated 2 years ago, 80 views

Because the client is a program rather than a browser, the basic information it sends when accessing a site is different, and various extra information is needed to connect.

Is there a specific way to log in from a Java program?

I'm trying to log in to sites like Job Korea and Saramin.

With Jsoup, submitting just the ID and password isn't enough.

I made requests using header and cookie information, but I think they fail because I don't know exactly which headers and cookies to send.

Is there no way to access the site the same way a web browser does?

java crawling login

2022-09-20 19:13

1 Answer

So you're building a crawler. I have no hands-on experience with this, but from what I've heard, many sites apply measures to block crawlers. For example, a site may detect that the client cannot run JavaScript and stop serving it. I mention this only in passing.

Let's assume there is no such defense.

First, web logins are usually implemented with session information stored on the server and cookies stored on the client.

For example, if you connect to Saramin in a cookie-free state (using a private/incognito browser window) and analyze the traffic, the response headers contain values like these:

These are headers that tell the client to create the PHPSESSID and PCID cookies. Of these, PHPSESSID is a random value generated by the server to identify the client.

The browser reads the Set-Cookie headers and creates the cookies. They are then sent along with every subsequent request (but only when the path and domain conditions are met). The server inspects each request and treats it as part of the same session if the PHPSESSID value in the cookie matches the one from the earlier request. "Same session" simply means that the requests were made by the same client.
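To make the Set-Cookie round trip concrete, here is a minimal sketch using the JDK's own java.net.CookieManager, which plays the role of the browser's cookie jar. The host www.example.com, the paths, and the cookie values are made up for illustration; a real site will send different names and attributes.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.util.List;
import java.util.Map;

public class CookieJarDemo {
    /** Feeds Set-Cookie response header values into a fresh jar, as a browser would. */
    static CookieManager jarAfterResponse(URI uri, List<String> setCookieValues) throws Exception {
        CookieManager jar = new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER);
        jar.put(uri, Map.of("Set-Cookie", setCookieValues));
        return jar;
    }

    /** Returns the cookies the jar would attach to a request for the given URI. */
    static List<String> cookiesFor(CookieManager jar, URI uri) throws Exception {
        return jar.get(uri, Map.of()).getOrDefault("Cookie", List.of());
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical first response from the server (values are invented).
        CookieManager jar = jarAfterResponse(
                URI.create("https://www.example.com/login"),
                List.of("PHPSESSID=abc123; Path=/; HttpOnly", "PCID=xyz789; Path=/"));

        // Same host: both cookies are sent back.
        System.out.println(cookiesFor(jar, URI.create("https://www.example.com/mypage")));
        // Different host: the domain condition is not met, so nothing is sent.
        System.out.println(cookiesFor(jar, URI.create("https://other.example.org/")));
    }
}
```

If you use Jsoup instead, the comparable calls are Connection.Response.cookies() to read the cookies from a response and Connection.cookies(Map) to attach them to the next request.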

If you log in within a session created this way, the server in effect says, "You're logged in, so here is a token I can recognize you by." The token can be stored in a cookie, in session storage, or in local storage, but it is usually stored in a cookie, which the browser handles automatically.

In some cases no separate token is issued. That works too: the server can identify the session by PHPSESSID alone and keep the login state only on the server side.

Either way, after logging in you must send PHPSESSID, plus the login token if there is one, with every request. The server then checks whether the session and login are valid, and whether you really are a member, before it responds.
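The flow described so far could be sketched roughly as follows with the JDK's java.net.http.HttpClient (Java 11+), which resends stored cookies on every request once a CookieManager is attached. This is only a sketch under assumptions: the URLs and the form field names id and pw are hypothetical, and a real site may require extra hidden fields or headers that you have to find in the browser's network tab.

```java
import java.net.CookieManager;
import java.net.CookiePolicy;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class LoginSketch {
    // Hypothetical endpoints; replace with the ones the real site uses.
    static final String LOGIN_URL = "https://www.example.com/login";
    static final String MEMBER_URL = "https://www.example.com/mypage";

    /** One client = one "browser": it keeps cookies and follows redirects. */
    static HttpClient newSessionClient() {
        return HttpClient.newBuilder()
                .cookieHandler(new CookieManager(null, CookiePolicy.ACCEPT_ORIGINAL_SERVER))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
    }

    /** Encodes fields as an application/x-www-form-urlencoded body. */
    static String formEncode(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    /** Needs network access and real URLs; shown only to illustrate the order of requests. */
    static String fetchMemberPage(String id, String pw) throws Exception {
        HttpClient client = newSessionClient();

        // 1. Visit the login page first so the server can issue a session cookie.
        client.send(HttpRequest.newBuilder(URI.create(LOGIN_URL)).GET().build(),
                HttpResponse.BodyHandlers.discarding());

        // 2. Post the credentials; the session cookie is attached automatically.
        Map<String, String> form = new LinkedHashMap<>();
        form.put("id", id); // assumed field name
        form.put("pw", pw); // assumed field name
        HttpRequest login = HttpRequest.newBuilder(URI.create(LOGIN_URL))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString(formEncode(form)))
                .build();
        client.send(login, HttpResponse.BodyHandlers.discarding());

        // 3. Later requests carry the session cookie and any login cookie.
        return client.send(HttpRequest.newBuilder(URI.create(MEMBER_URL)).GET().build(),
                HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) {
        // Offline part only: show what the login request body would look like.
        Map<String, String> form = new LinkedHashMap<>();
        form.put("id", "user");
        form.put("pw", "p@ss word");
        System.out.println(formEncode(form));
    }
}
```

Reusing the same client for every request is the important design choice here: a new client would start with an empty cookie jar and therefore a new session.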

That's a rough description of how login works. In practice each site implements it in its own way, and the details are of course not published for security reasons, so you have to record and analyze the browser's traffic and storage at every step. It's not easy. 😥

Still, if you have to implement it somehow, you need to analyze the responses to build the cookies, send those cookies with your requests, and, when you get a response code that indicates page navigation (= a redirect), follow it.
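If you handle redirects yourself rather than letting the HTTP client follow them, the "act on it" step comes down to checking the status code and resolving the Location header against the URL you just requested. A minimal sketch, where the class name, method name, and example URLs are my own and purely illustrative:

```java
import java.net.URI;
import java.util.Optional;

public class RedirectSketch {
    /**
     * Returns the URL to request next if the response is a redirect,
     * or an empty Optional if it is not (or has no Location header).
     */
    static Optional<URI> nextLocation(URI requestUri, int status, Optional<String> locationHeader) {
        boolean isRedirect = status == 301 || status == 302 || status == 303
                || status == 307 || status == 308;
        if (!isRedirect) {
            return Optional.empty();
        }
        // Location may be relative ("/mypage"), so resolve it against the request URL.
        return locationHeader.map(location -> requestUri.resolve(location));
    }

    public static void main(String[] args) {
        URI login = URI.create("https://www.example.com/login");
        System.out.println(nextLocation(login, 302, Optional.of("/mypage")));
        System.out.println(nextLocation(login, 200, Optional.empty()));
    }
}
```

With java.net.http.HttpResponse, the inputs would come from response.statusCode() and response.headers().firstValue("Location"). Remember to send the cookies again on the follow-up request.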


2022-09-20 19:13



© 2024 OneMinuteCode. All rights reserved.