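For sites with SSL enabled, Baidu's crawler still prefers to crawl over HTTP rather than HTTPS.
For mixed setups where HTTP is 301-redirected to SSL, the problem seems fairly serious for Baidu's crawler.
Looking at the logs from the past week, they all look like the first entry in the log below: it does not fetch any content, it sees the 301 and simply stops.
And this happens a lot.
Cases where it follows the 301 and crawls the next link are rare.
So if you depend on Baidu for your traffic, think it over carefully before moving to SSL.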
# Like this
180.76.15.34 - - [13/Aug/2016:17:33:55 +0800] "GET / HTTP/1.1" 301 436 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
180.76.15.9 - - [13/Aug/2016:17:34:45 +0800] "GET / HTTP/1.1" 301 436 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
42.156.138.38 - - [13/Aug/2016:17:44:52 +0800] "GET /ipv4/42.156.128.186 HTTP/1.1" 200 8548 "-" "YisouSpider"
42.120.160.38 - - [13/Aug/2016:17:45:02 +0800] "GET /ipv4/42.156.128.186 HTTP/1.1" 200 42504 "-" "YisouSpider"
220.181.108.162 - - [13/Aug/2016:17:52:32 +0800] "GET /asn/AS13006 HTTP/1.1" 301 458 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
123.125.71.71 - - [13/Aug/2016:17:52:33 +0800] "GET /asn/AS13006 HTTP/1.1" 200 15128 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
103.231.166.27 - - [13/Aug/2016:18:03:39 +0800] "OPTIONS / HTTP/1.0" 404 1643 "-" "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.0 Safari/537.36"
180.76.15.153 - - [13/Aug/2016:18:27:09 +0800] "GET / HTTP/1.1" 301 436 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
180.76.15.14 - - [13/Aug/2016:18:27:43 +0800] "GET / HTTP/1.1" 301 436 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
45.55.52.227 - - [13/Aug/2016:18:43:08 +0800] "GET / HTTP/1.1" 200 9514 "-" "Netcraft SSL Server Survey - contact [email protected]"
180.76.15.161 - - [13/Aug/2016:18:52:36 +0800] "GET /asn/AS4609 HTTP/1.1" 301 456 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
180.76.15.140 - - [13/Aug/2016:18:52:37 +0800] "GET /asn/AS4609 HTTP/1.1" 200 28073 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
I checked 4 independent sites, each on a different server.
They all show the same behavior.
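If you want to check your own logs for the same pattern, here is a rough Python sketch. It is not the method used for the numbers above, just one way to count it. It assumes the access log is in the combined format shown in the sample, and that HTTP and HTTPS requests end up in the same log file. The function names and the "followed" heuristic (a Baiduspider 301 counts as followed when the very next Baiduspider request is a 200 for the same path) are my own assumptions for illustration.

```python
import re

# Combined log format, as in the sample:
# ip - - [time] "METHOD path PROTO" status size "referer" "user-agent"
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) "[^"]*" "(?P<agent>[^"]*)"$'
)


def parse_baiduspider_hits(lines):
    """Return (path, status) for every request whose user agent mentions Baiduspider."""
    hits = []
    for line in lines:
        m = LOG_RE.match(line.strip())
        if m and "Baiduspider" in m.group("agent"):
            hits.append((m.group("path"), m.group("status")))
    return hits


def summarize(lines):
    """Count Baiduspider requests, 301 responses, and 301s that look followed.

    Heuristic (an assumption, not Baidu's documented behavior): a 301 is
    "followed" when the next Baiduspider request is a 200 for the same path.
    """
    hits = parse_baiduspider_hits(lines)
    redirects = sum(1 for _, status in hits if status == "301")
    followed = sum(
        1
        for (path1, status1), (path2, status2) in zip(hits, hits[1:])
        if status1 == "301" and status2 == "200" and path1 == path2
    )
    return {"hits": len(hits), "redirects": redirects, "followed": followed}
```

To run it on a real file, something like `with open("access.log") as f: print(summarize(f))` should do, where `access.log` is a placeholder for wherever your server writes its log. The heuristic will miss a follow-up request that arrives after other Baiduspider traffic, so treat the "followed" count as a lower bound.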