Instagram crawling using Python 3, Selenium, and Beautiful Soup

Asked 2 years ago, Updated 2 years ago, 162 views

For those in a hurry: you only need to look at the code below the `# [ SEARCH COMPLETE ] #` comment.

I have a question about crawling Instagram with Python 3, Selenium, and Beautiful Soup. I managed to log in and run a search automatically through Selenium.

However, the problem is extracting the information I want from the list of results shown after the search completes.

![Image][1]

After the search completes, viewing the page source shows the following:

![Image][2]

As the red box shows, the source contains the hashtags from each caption. I originally fetched that source for the data with `soup = urllib.request.urlopen("https://www.instagram.com/explore/tags/"+hashTag[1:])`, `bssource = soup.read()`, and `print(bssource)`.
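If the captions in that raw source sit inside a JSON object embedded in a `<script>` tag, one way to reach them is to cut the JSON out and walk it. The following is only a sketch under that assumption: the marker name `window._sharedData`, the `caption` key, and the sample markup are hypothetical and not confirmed by the screenshot, and the helper names are made up for illustration.

```python
import json
import re


def extract_embedded_json(html, marker="window._sharedData"):
    """Return the JSON object assigned to `marker` in a <script> tag, or None.

    The marker name is an assumption about the page layout.
    """
    pattern = re.escape(marker) + r"\s*=\s*(\{.*?\})\s*;\s*</script>"
    match = re.search(pattern, html, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))


def collect_values(node, key):
    """Recursively collect every value stored under `key` in nested dicts/lists."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(collect_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Made-up sample markup standing in for the fetched page source
sample_html = (
    '<html><script>window._sharedData = '
    '{"media": [{"caption": "hi #a"}, {"caption": "yo #b #c"}]};'
    '</script></html>'
)
data = extract_embedded_json(sample_html)
print(collect_values(data, "caption"))
```

Searching by key instead of by a fixed path avoids hard-coding a nesting layout that may differ from what the real page uses.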

I want to extract the hashtags in each caption, the number of likes, the ID, and so on, but I don't know how to approach it.

![Image][3]

Originally, I tried to fetch them with find_element_by_xpath or find_element_by_class_name, but that doesn't seem to work either. It keeps returning an empty value.
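One possible workaround, sketched under assumptions: instead of locating elements by generated class names, the rendered HTML (for example the string Selenium exposes as `driver.page_source`) could be handed to a parser that picks out the post links. The sketch below uses only the standard library's `html.parser`. The `/p/` link prefix and the sample markup are assumptions, and `PostLinkParser` and `find_post_links` are hypothetical names.

```python
from html.parser import HTMLParser


class PostLinkParser(HTMLParser):
    """Collect hrefs of <a> tags that look like post links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: post permalinks start with "/p/"
        if href and href.startswith("/p/") and href not in self.links:
            self.links.append(href)


def find_post_links(rendered_html):
    """Return unique post links in document order."""
    parser = PostLinkParser()
    parser.feed(rendered_html)
    parser.close()
    return parser.links


# Made-up markup standing in for driver.page_source
sample = '<div><a href="/p/abc/"><img alt="x"></a><a href="/explore/">e</a></div>'
print(find_post_links(sample))
```

Each collected link could then be visited in turn to read that post's caption and like count.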

![Image][4]

[I'm attaching the source.]

```python
# -*- coding: UTF-8 -*-
# [ IMPORT ] #
import requests
from bs4 import BeautifulSoup
import pymysql
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
import getpass
import urllib.request

username = input("Input ID : ")  # User ID
password = getpass.getpass("Input PWD : ")  # User PWD
hashTag = input("Input HashTag # : ")  # Search term

# [ HASHTAG USING CHECK ] #
# Prepend '#' if the user typed the tag without it
checkTag = hashTag.find('#')

if checkTag == -1:
    hashTag = '#' + hashTag

driver = webdriver.Chrome("C:/Users/LEEJIYONG/Desktop/crawling/chromedriver.exe")  # Chromedriver PATH
driver.get("https://www.instagram.com/accounts/login/")

# [ LOGIN ] #

element_id = driver.find_element_by_name("username")
element_id.send_keys(username)
element_password = driver.find_element_by_name("password")
element_password.send_keys(password)

password = 0  # Reset the password variable

driver.find_element_by_xpath("""//*[@id="react-root"]/section/main/div/article/div/div[1]/div/form/div[1]/div/input""").submit()
driver.implicitly_wait(5)

# [ LOGIN COMPLETE and SEARCH ] #

driver.find_element_by_xpath("""//*[@id="react-root"]/section/nav/div[2]/div/div/div[2]/input""").send_keys(hashTag)
driver.find_element_by_xpath("""//*[@id="react-root"]/section/nav/div[2]/div/div/div[2]/div[2]/div[2]/div/a[1]""").click()

searchTotalCount = driver.find_element_by_xpath("""//*[@id="react-root"]/section/main/article/header/span/span""").text
print('Total search results: ' + searchTotalCount)

elem = driver.find_element_by_tag_name("body")

# [ AUTO PAGE DOWN ] #
# Automatically scroll down twice
no_of_pagedowns = 2

while no_of_pagedowns:
    elem.send_keys(Keys.PAGE_DOWN)
    time.sleep(0.5)
    no_of_pagedowns -= 1

# [ SEARCH COMPLETE ] #
# Create a list with one empty slot per picture currently shown on the page
resultCnt = len(driver.find_elements_by_class_name("_2di5p"))
resultValues = [''] * resultCnt

print(resultCnt)
print(resultValues)
# Finished creating the empty list, one entry per picture

# Get the information I want
# Note: this request is separate from the logged-in Selenium session
soup = urllib.request.urlopen("https://www.instagram.com/explore/tags/" + hashTag[1:])
bssource = soup.read()

searchvalues = BeautifulSoup(bssource, 'lxml')
print(searchvalues)
prodList = searchvalues.find_all('div')  # I've tried this, but it doesn't give me the data I want
print(prodList)
```





To put it briefly: for example, if a hashtag search returns 21 posts, I want to store information such as the hashtags and the number of likes for each of those 21 posts in the array, in order.
Any help would be appreciated.
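As a rough illustration of the target structure, one record per post could be built once the caption text and like count of each post are available. This is a minimal sketch: the captions and like counts are made-up sample data rather than Instagram output, and `extract_hashtags` and `build_results` are hypothetical helper names.

```python
import re

# "#" followed by word characters; in Python 3 this also covers non-ASCII tags
HASHTAG_PATTERN = re.compile(r"#\w+")


def extract_hashtags(caption):
    """Return every hashtag found in a caption, in order of appearance."""
    return HASHTAG_PATTERN.findall(caption)


def build_results(posts):
    """Turn (caption, like_count) pairs into one dict per post."""
    results = []
    for caption, like_count in posts:
        results.append({
            "hashtags": extract_hashtags(caption),
            "likes": like_count,
        })
    return results


# Made-up sample data standing in for the scraped posts
sample_posts = [("Morning coffee #cafe #seoul", 12), ("No tags here", 3)]
for row in build_results(sample_posts):
    print(row)
```

A list of dicts keeps the hashtags and like count of each post together, which is easier to extend later than a pre-sized list of empty strings.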


  [1]: https://res.cloudinary.com/eightcruz/image/upload/v1512639450/ef2mtyx9656n2m2c9n7o.jpg
  [2]: https://res.cloudinary.com/eightcruz/image/upload/v1512639473/inyiupcqfqqz0lpskvmq.jpg
  [3]: https://res.cloudinary.com/eightcruz/image/upload/v1512639489/n16541f47tpwk92magic.jpg
  [4]: https://res.cloudinary.com/eightcruz/image/upload/v1512639504/qsfeg9i3dckxoakdrhmo.jpg

python3 selenium beautifulsoup instagram crawling

2022-09-21 14:54

1 Answer

For those who can't see the pictures, you can refer to them here:

https://www.facebook.com/photo.php?fbid=1342381782557288&set=pcb.1545783272171495&type=3&theater&ifg=1


2022-09-21 14:54

© 2024 OneMinuteCode. All rights reserved.