Data goes missing when crawling with multiprocessing and multithreading combined. How can I solve this?

Asked 1 year ago, Updated 1 year ago, 69 views

Crawling with multiprocessing, threading, BeautifulSoup, and requests results in a "list index out of range" error or a "'NoneType' object has no attribute 'text'" error.

import requests
from bs4 import BeautifulSoup as BS
from multiprocessing import Process
from queue import Queue
from functools import partial
from threading import Lock, Thread, local
from glob import glob

def crawl(urls, q, fn):
    # For each URL: parse the sponsor and date from the bill page, then the cosponsors
    # from its /cosponsors page, put the result on the queue, and save it
    lock = Lock()
    for url in urls:
        # # sleep(0.25)
        lock.acquire()
        spon = []
        date = []
        cosp = []

        req = requests.get(url)
        html = req.text
        soup = BS(html, "html.parser")

        table = soup.find("table", {"class" : "standard01"})
        tr = table.find_all("tr")

        for t in tr:
            if "Sen." in t.find("td").text:
                sd = t.find("td").text
                break
        # # sp = soup.select('body table.standard01 > tr td')
        # # sd = sp[0].text

        spon.append(sd[:sd.index("(")-1])
        date.append(sd[sd.index("/")-2:sd.index("/")+8])


        url = url.replace("?","/cosponsors?")
        req = requests.get(url)
        html = req.text
        soup = BS(html, "html.parser")

        cosps = soup.select('#main > ol > li.expanded > table > tbody > tr')
        temp = []
        for c in cosps:
            text = c.text.replace("\n","").replace("*","")
            text = text[:text.index("]")+1]
            temp.append(text)
        cosp.append(temp)

        q.put((spon, date, cosp))
        save(fn, q)
        # # lock.release()

def save(fn, q):
    # Take one result off the queue and append it to ./raw/<fn>.txt
    spon, date, cosp = q.get()
    f = open("./raw/"+fn + ".txt","a", encoding = "utf-8")
    for s in spon:
        f.write(s)
        f.write("\t")
        f.write(date[spon.index(s)])
        f.write("\t")
        for c in cosp[spon.index(s)]:
            f.write(c)
            f.write(", ")
        f.write("\n")
    f.close()

def do_thr(urls, func, fn):
    # Split this chunk of URLs into up to 5 sub-chunks and crawl each in its own thread
    q = Queue()

    urls = [urls[i:i+len(urls)//5] for i in range(0,len(urls),len(urls)//5)]

    if len(urls) > 5:
        urls[4].extend(urls[5])
        del urls[5]

    th = []
    for i in range(len(urls)):
        t = Thread(target = crawl, args = (urls[i], q, fn, ))
        th.append(t)
        t.start()

    for t in th:
        t.join()


if __name__ == "__main__":

    files = glob("./links/*.txt")
    q = Queue()

    for file in files:
        print(file)
        fn = file[file.index("links\\")+6:file.index("_address.txt")]

        urls = open(file, "r", encoding = "utf-8").read().splitlines()
        urls = [urls[i:i+len(urls)//5] for i in range(0,len(urls), len(urls)//5)]

        if len(urls) > 5:
            urls[4].extend(urls[5])
            del urls[5]

        proc = []

        for i in range(len(urls)):
            p = Process(target = do_thr , args = (urls[i], crawl, fn, ))
            proc.append(p)
            p.start()

        for p in proc:
            p.join()

At first I used BeautifulSoup's select, but because of the list index out of range error I thought find might work better, so I switched to find.
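To illustrate, a stripped-down version of that parsing step with an explicit None check (the parse_sponsor function name here is just for this snippet, not something in my actual code) looks roughly like this:

import requests
from bs4 import BeautifulSoup as BS

def parse_sponsor(url):
    # Stripped-down sponsor parsing with explicit None checks
    req = requests.get(url)
    soup = BS(req.text, "html.parser")
    table = soup.find("table", {"class": "standard01"})
    if table is None:
        # The response has no sponsor table (e.g. an error or blocked page)
        print("no sponsor table for", url)
        return None
    for t in table.find_all("tr"):
        td = t.find("td")
        if td is not None and "Sen." in td.text:
            return td.text
    return None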

Not every URL produces an error; it runs fine for a while and then an error appears partway through. If I use only multithreading or only multiprocessing there are no errors, and running it as a single process also produces no errors.

Using only multiprocessing or only multithreading is too slow, so I want to combine the two. How can I solve this? After switching to find, the error occurs at the table.find_all("tr") line. With select, the list index out of range error occurred at the sp[0].text line that is now commented out.

If I wrap it in try/except and inspect the contents directly, select returns an empty list and find returns None.

It feels like the value comes back empty because of some collision between the multithreading and the multiprocessing, but I don't know how to fix it. Below is part of the list of links that need to be crawled. The total is about 3000*27 links, and the speed is too slow, so I think I need to use both together to make it faster.

https://www.congress.gov/bill/98th-congress/senate-resolution/148?s=1&r=1
https://www.congress.gov/bill/98th-congress/senate-bill/2766?s=1&r=2
https://www.congress.gov/bill/98th-congress/senate-bill/2413?s=1&r=3
https://www.congress.gov/bill/98th-congress/senate-bill/137?s=1&r=4
https://www.congress.gov/bill/98th-congress/senate-resolution/132?s=1&r=5
https://www.congress.gov/bill/98th-congress/senate-concurrent-resolution/74?s=1&r=6

python multiprocessing multithreading crawling

2022-09-20 18:04

1 Answer

As you may know, most sites treat excessive requests from a single IP as a traffic attack and ban that IP.

Once the site bans your IP, it naturally stops sending the HTML source, so whether you select a specific tag or find it, the element cannot be found: select returns an empty list and find returns None.

And calling .text on None, or on an object that contains nothing, raises an error.

That is most likely why it runs fine for a while and then an error suddenly pops up.

I don't know how many cores you are using, but it seems the site recognized it as a traffic attack because at least four requests were being made continuously.

When the error occurs, try visiting the site in a browser to check whether you can still access it.
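For example, a quick programmatic version of that check (just a sketch, using the first link from your question and the standard01 table class as a crude marker of a normal page):

import requests

url = "https://www.congress.gov/bill/98th-congress/senate-resolution/148?s=1&r=1"
resp = requests.get(url)
# A blocked or throttled response usually has a non-200 status code
# or comes back without the expected sponsor table in the HTML
print(resp.status_code, "standard01" in resp.text)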

How do you solve this? There is no real way around it other than to sleep for a while between requests, or to make one request at a time without multiprocessing.
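Something along these lines, for example (a rough sketch; the fetch helper, the retry count, and the delay values are arbitrary choices for illustration):

import time
import requests

def fetch(url, retries=3, delay=2.0):
    # Sketch: wait between attempts and retry a few times before giving up
    for attempt in range(retries):
        resp = requests.get(url)
        # Treat a non-200 status or a page without the expected table as a failure
        if resp.status_code == 200 and "standard01" in resp.text:
            return resp.text
        time.sleep(delay * (attempt + 1))  # back off a little longer each time
    return None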

Also, as far as I know, the multiprocessing module is not used the way you have it.

You must first specify a process (core) and then issue a command that assigns the work to each individual core, but that step does not appear to be there.

As far as I know, it should work as follows, so please check it:

        proc = []

        for i in range(len(urls)):
            p = Process(target = do_thr, args = (urls[i], crawl, fn, ))  # create a process that runs do_thr
            proc.append(p)
            p.start()  # start the process

        for p in proc:
            p.join()
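Alternatively, if the point is simply to cap how many requests run at the same time, multiprocessing.Pool is a simpler way to do it. A rough sketch (the crawl_one helper and the pool size of 4 are assumptions for illustration; the parsing and saving are omitted):

from glob import glob
from multiprocessing import Pool
import requests

def crawl_one(url):
    # Hypothetical helper: fetch a single bill page (parsing and saving omitted)
    return requests.get(url).text

if __name__ == "__main__":
    for file in glob("./links/*.txt"):
        urls = open(file, "r", encoding="utf-8").read().splitlines()
        with Pool(processes=4) as pool:        # at most 4 worker processes at a time
            pages = pool.map(crawl_one, urls)  # each worker handles one URL at a time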


2022-09-20 18:04
