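People say a crawler that doesn't scrape pictures of girls isn't a good crawler. Being able to bulk-download images from the web does matter at times; it just eats a lot of disk space.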

Installing the package

To use the PIL package, install it with the following command:

pip install pillow

os.path.basename

Given a full URL such as http://www.win4000.com/1234.jpg, os.path.basename returns the file name 1234.jpg, which can then be used as the name of the file saved on the local disk.
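A small sketch of this step, using only the standard library. The helper filename_from_url is a made-up name for illustration, not part of the original code; it covers the case where the URL carries a query string, which os.path.basename alone would keep in the file name.

```python
import os
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Hypothetical helper: drop the query string, then take the last path segment."""
    path = urlparse(url).path
    return os.path.basename(path)


# Plain URL: basename alone is enough.
plain = os.path.basename("http://www.win4000.com/1234.jpg")

# URL with a query string: basename keeps the "?v=2" suffix.
with_query = os.path.basename("http://www.win4000.com/1234.jpg?v=2")

# Parsing the URL first gives a clean file name.
cleaned = filename_from_url("http://www.win4000.com/1234.jpg?v=2")

print(plain, with_query, cleaned)
```

Whether the extra parsing is needed depends on the site; for URLs that end in the bare file name, os.path.basename is sufficient.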

Image.open

Once you have the URL of an online photo, Image.open(BytesIO(requests.get(url).content)) fetches the photo and stores it in an image object.

Finally, image.save(filename) writes it to disk.

The difference between requests text and content

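page = requests.get(url) returns a response object, page. page.text returns the body decoded as Unicode, i.e. text mode. If the body of page is binary data such as an image, use page.content instead, which returns the raw bytes.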
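To make the bytes-versus-text distinction concrete without a network call, here is a rough sketch that works on byte strings directly. The names IMAGE_SIGNATURES, sniff_image_type and can_decode_as_text are invented for this illustration; the sample byte strings stand in for what page.content might hold.

```python
# Leading bytes ("magic numbers") of the three formats the crawler keeps.
IMAGE_SIGNATURES = {
    "png": b"\x89PNG\r\n\x1a\n",
    "jpg": b"\xff\xd8\xff",
    "gif": b"GIF8",
}


def sniff_image_type(data: bytes):
    """Return 'png', 'jpg' or 'gif' if data starts like that format, else None."""
    for name, signature in IMAGE_SIGNATURES.items():
        if data.startswith(signature):
            return name
    return None


def can_decode_as_text(data: bytes, encoding: str = "utf-8") -> bool:
    """Report whether the bytes form valid text in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Stand-ins for page.content: an HTML snippet and the start of a PNG file.
html_bytes = "<p>圖片</p>".encode("utf-8")
png_bytes = b"\x89PNG\r\n\x1a\n" + b"\x00" * 8

print(sniff_image_type(html_bytes), can_decode_as_text(html_bytes))
print(sniff_image_type(png_bytes), can_decode_as_text(png_bytes))
```

The HTML bytes decode cleanly, so treating them as text is safe. The PNG bytes are not valid UTF-8, which is why image data should be taken from page.content and handed to BytesIO untouched.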

Simple image download

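The code below uses requests to fetch the image data as page, wraps page.content in BytesIO so that Image.open can read it as an Image, and then either saves the Image to a file or displays it with plt.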

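import os
from io import BytesIO

import matplotlib.pyplot as plt
import requests
from PIL import Image

# NOTE: the scheme and host are missing from this URL as given here;
# requests.get() needs the full address (http://...) to work.
url = '/wp-content/uploads/2016/10/img_6279.jpg'
page = requests.get(url)
file = os.path.basename(url)  # img_6279.jpg
print(file)
image = Image.open(BytesIO(page.content))  # decode the raw bytes into an Image
image.save(f"d:/{file}")
plt.imshow(image)
plt.show()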

xpath

Selenium's find_element can use XPath to access nodes in a web page.

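/ : starts from the outermost <html> element, i.e. an absolute path
// : matches nodes at any depth below the current node (anywhere in the document when it begins the expression)
div[@class='w1180 clearfix'] : finds a div tag whose class attribute is exactly 'w1180 clearfix'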
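The same kind of expression can be tried offline with the standard library's xml.etree.ElementTree, which supports a limited subset of XPath. This is only a sketch: the markup below is invented for the example (it is not the real page), and ElementTree needs a leading "." before "//" when searching from an element.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup that mimics the structure the crawler looks for.
html = """<html><body>
<div class="w1180 clearfix"><div><div>
<a href="/a.html">A</a><a href="/b.html">B</a>
</div></div></div>
<div class="other"><div><div><a href="/c.html">C</a></div></div></div>
</body></html>"""

root = ET.fromstring(html)

# ".//" searches at any depth below root; the predicate matches the exact class string.
container = root.find(".//div[@class='w1180 clearfix']/div/div")

# Collect the href of every <a> directly inside the matched node.
hrefs = [a.get("href") for a in container.findall("a")]
print(hrefs)
```

With Selenium the equivalent lookup would go through find_element(By.XPATH, ...) on a live page, which also handles real-world HTML that is not well-formed XML.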

Complete code

import os
import re
import time
from io import BytesIO

import requests
from PIL import Image
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

path = "d:/pic_tmp"
os.makedirs(path, exist_ok=True)
# Accept only file names that look like jpg/gif/png images
p1 = re.compile(r'([-\w]+\.(?:jpg|gif|png))')


def saveimage(imgurl):
    filename = os.path.basename(imgurl)
    if p1.match(filename):
        response = requests.get(imgurl)
        image = Image.open(BytesIO(response.content))
        # Skip small images such as thumbnails and icons
        if image.width > 100 and image.height > 100:
            print(filename)
            image.save(os.path.join(path, filename))


options = Options()  # Chrome options object
options.add_argument('--headless')  # run in headless mode
options.add_argument('--disable-gpu')
ua = "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0"
options.add_argument("user-agent={}".format(ua))
web = webdriver.Chrome(options=options)
url = 'http://www.win4000.com/meinvtag752_1.html'
web.get(url)
# print(web.page_source)
# Selenium 4 removed the find_element_by_* helpers; use find_element(By.<strategy>, value)
clearfix = web.find_element(By.XPATH, "//div[@class='w1180 clearfix']/div/div")
tags = clearfix.find_elements(By.TAG_NAME, 'a')
links = []
for tag in tags:
    links.append(tag.get_attribute('href'))
link = links[0]
for i in range(1000):
    web.get(link)
    img = web.find_element(By.CLASS_NAME, 'pic-large')
    imgurl = img.get_attribute('url')
    saveimage(imgurl)
    # Follow the "next picture" link to the following page
    link = web.find_element(By.CLASS_NAME, 'pic-next-img').find_element(By.TAG_NAME, 'a').get_attribute('href')
    time.sleep(2)  # pause between pages to go easy on the server
web.quit()
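The file-name filter in saveimage can be checked on its own. The sketch below reuses the pattern p1 from the code above on a handful of made-up names. It also shows fullmatch as a possible stricter variant, which is an assumption about what one might want rather than what the original code does.

```python
import re

# Same pattern as p1 in the complete code.
p1 = re.compile(r'([-\w]+\.(?:jpg|gif|png))')

# Made-up file names for illustration.
candidates = [
    "1234.jpg",
    "img_6279.jpg",
    "photo.jpeg",
    "1234.jpg?v=2",
    "my photo.png",
    "readme.txt",
]

# match() anchors only at the start, so trailing characters are tolerated.
accepted = [name for name in candidates if p1.match(name)]

# fullmatch() requires the whole name to fit the pattern.
strict = [name for name in candidates if p1.fullmatch(name)]

print(accepted)
print(strict)
```

Because match does not anchor the end of the string, a name such as 1234.jpg?v=2 passes the original filter and would be used as the local file name as-is.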

Scraping images with Selenium