Problem description
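So I've been messing around with BeautifulSoup. With your help here, I wrote some code. Now I have the following question: is it possible to use multithreading or multiprocessing to speed it up? I'm sure this code is far from ideal :) Should Pool be used for cases like this?
P.S. I used this website as an example.
Thanks in advance.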
import requests
from bs4 import BeautifulSoup
import csv
import sys
# Python 2 only: force UTF-8 as the default string encoding
reload(sys)
sys.setdefaultencoding('utf-8')
pages = [str(i) for i in range(100,2000)]
for page in pages:
    # Download one member page per ID, sequentially
    html = requests.get('https://statesassembly.gov.je/Pages/Members.aspx?MemberID='+page).text
    def get_page_data():
        # Parse the page that was just downloaded
        soup = BeautifulSoup(html, 'lxml')
        name = soup.find('h1').text
        title = soup.find(class_='gel-layout__item gel-2/3@m gel-1/1@s').find('h2').text
        data = {'name': name,
                'title': title,
                }
        return (data)
    data = get_page_data()
    # Append one row per member to the CSV file
    with open('Members.csv','a') as output_file:
        writer = csv.writer(output_file, delimiter=';')
        writer.writerow((data['name'],
                         data['title'],
                         ))
Answer 1
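Brute-forcing a government website may be illegal in some countries. Make sure you read the copyright laws of your own country and of the country you are fetching data from.
First, split the list of pages into several parts, then create a list of threads that process those parts in parallel.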
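The split-then-thread idea might look roughly like the sketch below. The helper names (split_into_chunks, process_chunk, run_in_threads) and the handle_page callback are assumptions made for illustration; in the question's case handle_page would wrap the requests.get call and the BeautifulSoup parsing.

```python
import threading

def split_into_chunks(items, n_chunks):
    """Split items into n_chunks consecutive slices of nearly equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks get one additional item
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

def process_chunk(chunk, handle_page, results, lock):
    """Run handle_page on every item of one chunk and store the results."""
    for page in chunk:
        value = handle_page(page)
        # The lock keeps appends from different threads from interleaving
        with lock:
            results.append((page, value))

def run_in_threads(items, handle_page, n_threads=4):
    """Process items with one thread per chunk; return sorted (item, result) pairs."""
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=process_chunk,
                         args=(chunk, handle_page, results, lock))
        for chunk in split_into_chunks(items, n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)

if __name__ == "__main__":
    pages = [str(i) for i in range(100, 120)]
    # Stand-in for the real download-and-parse step (hypothetical)
    print(run_in_threads(pages, lambda page: len(page)))
```

Threads finish in an unpredictable order, which is why the sketch sorts the collected results before returning them.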
A Python program illustrating the concept of threading:
import threading
import os
def task1():
    print("Task 1 assigned to thread: {}".format(threading.current_thread().name))
    print("ID of process running task 1: {}".format(os.getpid()))
def task2():
    print("Task 2 assigned to thread: {}".format(threading.current_thread().name))
    print("ID of process running task 2: {}".format(os.getpid()))
if __name__ == "__main__":
    # print ID of current process
    print("ID of process running main program: {}".format(os.getpid()))
    # print name of main thread
    print("Main thread name: {}".format(threading.main_thread().name))
    # creating threads
    t1 = threading.Thread(target=task1, name='t1')
    t2 = threading.Thread(target=task2, name='t2')
    # starting threads
    t1.start()
    t2.start()
    # wait until all threads finish
    t1.join()
    t2.join()
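As for the Pool part of the question, a thread pool is one possible fit, since downloading pages is mostly waiting on the network. The following is only a sketch using concurrent.futures.ThreadPoolExecutor from the standard library. The names scrape_all, write_rows, and fake_scrape are hypothetical; a real scrape_one would download and parse one page the way the question's loop body does and return a (name, title) tuple.

```python
import csv
from concurrent.futures import ThreadPoolExecutor

def scrape_all(pages, scrape_one, max_workers=8):
    """Call scrape_one(page) for each page on a pool of worker threads.

    executor.map returns results in the same order as `pages`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(scrape_one, pages))

def write_rows(rows, path):
    """Write all rows in one go from the calling thread."""
    with open(path, 'w', newline='', encoding='utf-8') as output_file:
        writer = csv.writer(output_file, delimiter=';')
        writer.writerows(rows)

def fake_scrape(page):
    """Hypothetical stand-in for the real download-and-parse step."""
    return ('Member ' + page, 'Title ' + page)

if __name__ == "__main__":
    pages = [str(i) for i in range(100, 110)]
    rows = scrape_all(pages, fake_scrape, max_workers=4)
    write_rows(rows, 'Members.csv')
```

Collecting the rows first and writing the CSV once from the main thread avoids having several threads append to the same file. Keeping max_workers small also limits the load placed on the site.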