Collecting search-engine suggestion words with Python, using the 360 Search and Baidu Search homepages as examples
Taking the 360 Search homepage as the example for analysis:
Open www.so.com.
Right-click and choose Inspect, type a keyword, and the suggestions appear; switch to the Network tab.
There you can see the response of the suggestion request,
as well as its request headers.
So we can use the modules that Python 3 provides to fetch that response, then pull out the data we want with a regular expression.
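Before the full scripts, the two key building blocks are URL-encoding the (non-ASCII) keyword and attaching browser-like headers to the request. A minimal sketch of just those two steps, with no network I/O (the shortened URL and the header value are placeholders, not the real request):

```python
import urllib.parse
import urllib.request

# URL-encode the non-ASCII keyword before putting it in the query string
keyword = urllib.parse.quote("编程")
print(keyword)  # %E7%BC%96%E7%A8%8B

# Build a request that carries a browser-like header; nothing is sent yet
req = urllib.request.Request(
    "https://sug.so.360.cn/suggest?word=" + keyword,
    headers={"User-Agent": "Mozilla/5.0"},
)
# urllib stores header names capitalized, e.g. "User-agent"
print(req.get_header("User-agent"))  # Mozilla/5.0
```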
## Code Implementation
360 Search homepage:
```python
import re
import urllib.parse
import urllib.request
from random import choice

# Rotate through a few HTTP proxies (replace with proxies that work for you)
iplist = ['110.73.142.124', '119.188.94.145', '119.48.180.182']

keywords = ["编程", "前端"]
for keyword in keywords:
    ip = choice(iplist)
    gjc = urllib.parse.quote(keyword)  # URL-encode the keyword
    url = ("https://sug.so.360.cn/suggest?callback=suggest_so&encodein=utf-8"
           "&encodeout=utf-8&format=json&fields=word,obdata&word=" + gjc)
    headers = {
        "Host": "sug.so.360.cn",
        "Referer": "https://www.so.com/?src=so.com",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/44.0.2403.157 "
                      "UBrowser/5.5.10106.5 Safari/537.36",
    }
    # Route the request through the chosen proxy
    proxy_support = urllib.request.ProxyHandler({'http': 'http://' + ip})
    opener = urllib.request.build_opener(proxy_support)
    urllib.request.install_opener(opener)
    req = urllib.request.Request(url, headers=headers)
    html = urllib.request.urlopen(req).read().decode()
    # Pull every "word":"..." value out of the JSONP response
    for word in re.findall(r'"word":"(.*?)"', html):
        print(word)
```
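The regex works, but the response is JSONP wrapping real JSON, so it can also be parsed structurally: strip the callback wrapper, then hand the body to the `json` module. A sketch against a sample payload whose shape is an assumption based on the `fields=word,obdata` request, not a captured response:

```python
import json
import re

def parse_jsonp(payload: str) -> dict:
    """Strip a callback wrapper like suggest_so(...) and parse the JSON body."""
    m = re.search(r'^[\w.]+\((.*)\)\s*;?\s*$', payload, re.S)
    if not m:
        raise ValueError("not a JSONP payload")
    return json.loads(m.group(1))

# Sample payload; the "result"/"word" field names are assumptions
sample = 'suggest_so({"query":"编程","result":[{"word":"编程入门"},{"word":"编程语言"}]});'
data = parse_jsonp(sample)
words = [item["word"] for item in data["result"]]
print(words)  # ['编程入门', '编程语言']
```

This is sturdier than a regex if a suggestion ever contains an escaped quote, since the JSON parser handles escaping for you.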
Baidu Search homepage:
```python
import re
import urllib.parse
import urllib.request
from random import choice

# Rotate through a few HTTP proxies (replace with proxies that work for you)
iplist = ['110.73.142.124', '119.188.94.145', '119.48.180.182']

keywords = ["编程", "前端"]
for keyword in keywords:
    ip = choice(iplist)
    gjc = urllib.parse.quote(keyword)  # URL-encode the keyword
    url = ("https://sp0.baidu.com/5a1Fazu8AA54nxGko9WTAnF6hhy/su?wd=" + gjc +
           "&json=1&p=3&sid=18285_1460_18283_12826_18560_17001_17073_15526"
           "_11784_18081_18017&csor=2"
           "&cb=jQuery110205117536815814674_1454926196994&_=1454926197002")
    headers = {
        "Host": "sp0.baidu.com",
        "Referer": "https://www.baidu.com/",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/44.0.2403.157 "
                      "UBrowser/5.5.10106.5 Safari/537.36",
    }
    # Route the request through the chosen proxy
    proxy_support = urllib.request.ProxyHandler({'http': 'http://' + ip})
    opener = urllib.request.build_opener(proxy_support)
    urllib.request.install_opener(opener)
    req = urllib.request.Request(url, headers=headers)
    # Baidu returns GBK-encoded text ('gbk' is a superset of gb2312)
    html = urllib.request.urlopen(req).read().decode('gbk')
    # The suggestions sit in a "s":[...] array inside the JSONP payload
    for block in re.findall(r'"s":\[(.*?)\]', html):
        for word in re.findall(r'"(.*?)"', block):
            print(word)
```
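The Baidu endpoint serves GBK-encoded bytes, and `'gbk'` is a superset of `gb2312`, so decoding with it is the safer choice. A sketch of the decode-and-extract step against a sample payload (the payload shape is an assumption modeled on the `"s":[...]` pattern the regex targets, not a captured response):

```python
import re

# Sample response body, GBK-encoded like the real endpoint's output
raw = 'jQuery123({"q":"编程","p":false,"s":["编程猫","编程入门"]});'.encode('gbk')

# 'gbk' decodes everything gb2312 can, plus extra characters
html = raw.decode('gbk')

# Grab the "s":[...] array, then each quoted suggestion inside it
words = []
for block in re.findall(r'"s":\[(.*?)\]', html):
    words.extend(re.findall(r'"(.*?)"', block))
print(words)  # ['编程猫', '编程入门']
```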
## Results

Running both scripts shows that 360 Search and Baidu Search return the same keyword suggestions.
Free to reproduce and quote, but please credit the author and note the original source.