Python CSV output does not work.

Asked 2 years ago, Updated 2 years ago, 82 views

I tried to write scraping results to a CSV file, using code I adapted from an example.
However, the output only looks like the attached image.
No errors are reported, but I would appreciate it if you could tell me where and how to fix it.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys
from urllib import request
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import datetime
import time
import requests
import csv
import pandas as pd

START_DT_STR = '2021-12-01'
SEARCH_WORD = 'python'
PRTIMES_URL='https://prtimes.jp/'

start_dt = datetime.datetime.strptime(START_DT_STR, '%Y-%m-%d')

options=Options()
options.add_argument("--headless")
driver=webdriver.Chrome("/Users/tanaka.maru/Desktop/Python/chromedriver", options=options)
driver.get("https://www.google.com/")

driver=webdriver.Chrome('chromedriver', options=options)

# Open the PR TIMES front page
target_url='https://prtimes.jp/'   
driver.get(target_url)

driver.find_element("xpath", '/html/body/header/div/div[2]/div/input').click()

kensaku=driver.find_element("xpath", '/html/body/header/div/div[2]/div/input')
kensaku.send_keys(SEARCH_WORD)
kensaku.send_keys(Keys.ENTER)

cnt = 0
while True:
    try:
        driver.find_element_by_xpath("/html/body/main/section/section/div/a").click()
    except: 
        break
    html = driver.page_source
    soup = BeautifulSoup(html, "html.parser")
    # Obtain article URL (40 each)
    articles = soup.find_all(class_='list-article_link')[cnt*40:]
    
    # an array containing article information
    
    # Get information by article
    for article in articles:
        article_time=article.find(class_='list-article_time')
                
        #csv-related
        eof_flag=False
        csv_date = datetime.datetime.today().strftime("%Y%m%d")
        csv_file_name = 'prtimes_' + csv_date + '.csv'
        f=open(csv_file_name, 'w', encoding='cp932', errors="ignore")
        writer=csv.writer(f, lineterminator='\n')
        csv_header=["title", "sub_title", "company", "published", "category1" ]
        writer.writerow(csv_header)

        try:
            str_to_dt = datetime.datetime.strptime(article_time.get('datetime'), '%Y-%m-%dT%H:%M:%S%z')
        except:
            try:
                article_time_cvt = article_time.get('datetime').replace('+09:00', '+0900')
                str_to_dt = datetime.datetime.strptime(article_time_cvt, '%Y-%m-%dT%H:%M:%S%z')
            except:
                str_to_dt = datetime.datetime.strptime(article_time.text, '%Y年%m月%d日 %H時%M分')

        article_time_dt = datetime.datetime(str_to_dt.year, str_to_dt.month, str_to_dt.day, str_to_dt.hour, str_to_dt.minute)
        
        if article_time_dt < start_dt:
            eof_flag = True 
            break

        relative_href = article["href"]
        url=urljoin(target_url, relative_href)

        r=requests.get(url)
        html=r.text
        soup = BeautifulSoup(html, "html.parser")
        
        records = [ ]

        # Article Title
        title = soup.select_one("#main>div.content>article>div>header>h1").text
        
        sub_title_elem = soup.select_one("#main>div.content>article>div>header>h2")
        
        # Subtitle

        if sub_title_elem:
            sub_title = sub_title_elem.text
        else:
            sub_title = ""
            
        company = soup.select_one('#main>div.content>article>div>header>div.release--info_wrapper>div.information-release>div').text
        
        published = soup.select_one('#main>div.content>article>div>header>div.release--info_wrapper>div.information-release>time').text
        
        category1 = soup.select_one('#main>div.content>article>dl>dd:nth-child(4)>a:nth-child(1)').text
                
        records.append({'title':title, 'sub_title':sub_title, 'company':company, 'published':published, 'category1':category1})
        
        writer.writerow(records)

    if records:
        pass

    if eof_flag:
        break

    time.sleep(2)  
    cnt+=1

    f.close()

[Attached image: the resulting CSV output]

python csv selenium

2022-09-30 19:33

1 Answer

Incidentally, in my environment the click() on driver.find_element_by_xpath() immediately after while True: raised an exception, so the script ended without processing anything.
Moving the try: ... except: break to the end of the while True: loop made it work. For your information.
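
As a rough illustration of that reordering (a sketch only, not the actual Selenium code): the loop below first handles whatever is currently loaded and only then tries to load more, stopping when that attempt fails. The names scrape_all_pages, FakePager, get_items and click_more are hypothetical stand-ins so that the example runs without a browser; in the real script, click_more would correspond to the click() on the element found by XPath.

```python
def scrape_all_pages(get_items, click_more):
    """Process the currently loaded page first, then try to load the next one."""
    collected = []
    page = 0
    while True:
        # 1) Handle the articles that are already on screen.
        collected.extend(get_items(page))

        # 2) Only now try to load more; stop when that is no longer possible.
        try:
            click_more()
        except Exception:
            break
        page += 1
    return collected


class FakePager:
    """Hypothetical stand-in for the browser: clicking fails after N clicks."""

    def __init__(self, clicks_available):
        self.clicks_left = clicks_available

    def click_more(self):
        if self.clicks_left == 0:
            raise RuntimeError("nothing left to click")
        self.clicks_left -= 1


pager = FakePager(clicks_available=2)
result = scrape_all_pages(lambda page: [f"article-{page}"], pager.click_more)
print(result)
```

With the try/except at the top of the loop instead, a failing first click would exit before any article is handled, which matches what I saw.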

Other major fixes are as follows:

  • The main cause is that the csv file is opened anew in 'w' mode inside the per-article loop, which deletes and recreates any existing file with the same name each time.
    As a result, the file ends up containing only the data for the last article.
    Open the csv file before the per-article loop for article in articles:, and open it in append mode 'a' rather than creating it anew.

  • The csv write object is a plain csv.writer(), which expects a simple sequence of values, but the per-article write passes dictionary data, so the write is not correct.
    Use csv.DictWriter() instead.
    In that case, write the header with DictWriter.writeheader().
    However, if the file is opened in append mode 'a' and already contains data, the header would be written again in the middle of the data rows. Check the current position with f.tell() right after opening the file, before writing anything, and write the header only when the position is 0 (that is, the file is empty).

  • A minor point: the fourth header item name, "published", is misspelled in the question's code. The correct spelling is "published".

  • In the per-article loop for article in articles:, records is re-initialized as a list on every iteration, a single dictionary extracted from the article is added with records.append({...}), and the list is immediately written as one csv row.
    If the intent is to write to csv and also keep the data in a list for later processing, it makes no sense to initialize the list on every iteration. And if the whole list of articles were written to csv on every iteration, the same data would be written multiple times. (Multiple rows can be written in one call with writerows(), the function whose name ends in s.)
    Instead of appending the extracted data to a list and writing the whole list for each article, create one dictionary record per article and write it to csv with writer.writerow(record).

  • The if records: after the per-article loop for article in articles:, and the pass executed when it is true, do nothing meaningful. Perhaps it is a leftover from processing that was removed as irrelevant when the question was written.

  • The csv file is opened with open() inside the per-article loop, but f.close() sits at the end of the outer loop, so the nesting levels do not match.
    Once open() is moved to before the per-article loop for article in articles:, as described in the first item, the nesting levels match. However, the file is still not closed when the loop exits through if eof_flag: and break after the last article has been processed.
    So move f.close() to before if eof_flag:.
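
The points above about opening the file once, append mode, DictWriter and the header can be sketched roughly as follows. This is a minimal sketch under assumptions, not a drop-in fix: append_records, the file name and the sample records are hypothetical, and utf-8 is used here in place of the cp932 in the question so that the sample runs anywhere.

```python
import csv
import os
import tempfile

FIELDNAMES = ["title", "sub_title", "company", "published", "category1"]


def append_records(csv_path, records):
    """Append dictionary records to a csv file, writing the header only once."""
    # Open ONCE per batch, outside the per-record loop, in append mode.
    # newline="" is what the csv module documentation recommends.
    with open(csv_path, "a", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
        # Position 0 right after opening in 'a' mode means the file is empty.
        if f.tell() == 0:
            writer.writeheader()
        for record in records:
            writer.writerow(record)  # one dictionary per row
    # Leaving the with block closes the file, even on an early exit.


# Hypothetical sample data standing in for two scraped pages.
path = os.path.join(tempfile.mkdtemp(), "prtimes_sample.csv")
append_records(path, [
    {"title": "first title", "sub_title": "", "company": "Example Inc.",
     "published": "2021-12-02 10:00", "category1": "IT"},
])
append_records(path, [
    {"title": "second title", "sub_title": "sub", "company": "Sample Ltd.",
     "published": "2021-12-01 09:00", "category1": "Web"},
])

with open(path, encoding="utf-8", newline="") as f:
    rows = list(csv.reader(f))
print(rows)
```

Using a with block is one way to make sure the file is closed whichever way the loop exits; moving f.close() as described above achieves the same thing in the original structure.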


2022-09-30 19:33



© 2024 OneMinuteCode. All rights reserved.