antispider's Issues

autohome.py has no effect when run

The spans are not replaced:

&nbsp;&nbsp;&nbsp;&nbsp;自去年12月12日提车之后<span class='hs_kw0_mainmx'></span>基本<span class='hs_kw1_mainmx'></span>就没驾驶博越去<span class='hs_kw2_mainmx'></span>点<span class='hs_kw3_mainass='hs_kw4_mainmx'></span>方<span class='hs_kw5_mainmx'></span>这次<span class='hs_kw6_mainmx'></span>朋友商量之后<span class='hs_kw0_mainmx'></span>决定自驾去天津<span class='hs_kw0_mainmx'></s='hs_kw7_mainmx'></span>可以带<span class='hs_kw8_mainmx'></span><span class='hs_kw9_mainmx'></span>越越<span class='hs_kw0_mainmx'></span>去欣赏<span class='hs_kw10_mainmx'></span><span class='hainmx'></span>他乡<span class='hs_kw3_mainmx'></span>风光<span class='hs_kw5_mainmx'></span>由于<span class='hs_kw12_mainmx'></span>第<span class='hs_kw10_mainmx'></span>次去天津<span class='hs_k</span>所以道路<span class='hs_kw1_mainmx'></span><span class='hs_kw13_mainmx'></span>太熟悉<span class='hs_kw0_mainmx'></span>还<span class='hs_kw7_mainmx'></span>博越为我提供<span class='hs_kw1x'></span>精准<span class='hs_kw3_mainmx'></span>导航系统<span class='hs_kw0_mainmx'></span>跟随<span class='hs_kw8_mainmx'></span>博野<span class='hs_kw3_mainmx'></span>脚步<span class='hs_kw0_mn>踏<span class='hs_kw1_mainmx'></span>前往天津<span class='hs_kw3_mainmx'></span>征程<span class='hs_kw5_mainmx'></span><br />&nbsp;&nbsp;&nbsp;全程<span class='hs_kw15_mainmx'></span>速<span cl0_mainmx'></span>由保定北<span class='hs_kw1_mainmx'></span>京港澳<span class='hs_kw15_mainmx'></span>速北京方向<span class='hs_kw0_mainmx'></span>再转入荣乌<span class='hs_kw15_mainmx'></span>速'hs_kw5_mainmx'></span><span class='hs_kw10_mainmx'></span>路由朋友担当摄影<span class='hs_kw0_mainmx'></span>拍<span class='hs_kw3_mainmx'></span>照片都<span class='hs_kw12_mainmx'></span>路<spaw1_mainmx'></span><span class='hs_kw3_mainmx'></span>风景<span class='hs_kw5_mainmx'></span><span class='hs_kw15_mainmx'></span>速途中<span class='hs_kw10_mainmx'></span>路驾驶博越<span class='hsx'></span>给我<span class='hs_kw3_mainmx'></span>感觉非常稳重<span class='hs_kw0_mainmx'></span>方向指向精准<span class='hs_kw5_mainmx'></span><span class='hs_kw13_mainmx'></span><span 
class='hs_/span><span class='hs_kw13_mainmx'></span>说<span class='hs_kw0_mainmx'></span>吉利真<span class='hs_kw3_mainmx'></span><span class='hs_kw12_mainmx'></span>在用心造车<span class='hs_kw0_mainmx'><越<span class='hs_kw0_mainmx'></span>已经成为同级别车型中<span class='hs_kw3_mainmx'></span>标杆产品<span class='hs_kw5_mainmx'></span><br />&nbsp;&nbsp;<span class='hs_kw11_mainmx'></span>面<spamainmx'></span>就为<span class='hs_kw17_mainmx'></span>家奉<span class='hs_kw1_mainmx'></span>精美<span class='hs_kw17_mainmx'></span>图

a bug in antispider/autohome.py

Line 270 of antispider/autohome.py:

# Get all variables
var_regex = "var\s+(\w+)=(.*?);\s"

should be:

# Get all variables
var_regex = "var\s+(\w+)\s*=\s*([\'\"].*?[\'\"]);\s*"

because cases like the following exist, where the string value itself contains a semicolon:
var bs_=';12';
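
To make the difference between the two patterns concrete, here is a minimal sketch. Only `var bs_=';12';` comes from this report; the second declaration (`b_`) and the assumption that several declarations sit on one line without whitespace between them are made up for illustration.

```python
import re

# Current and proposed patterns, written as raw strings.
OLD_VAR_REGEX = r"var\s+(\w+)=(.*?);\s"
NEW_VAR_REGEX = r"var\s+(\w+)\s*=\s*(['\"].*?['\"]);\s*"

# Hypothetical input: two declarations packed together, whitespace only at the end.
sample = "var bs_=';12';var b_='x'; "

# The old pattern needs ';' followed by whitespace, so it runs past the first value.
old_matches = re.findall(OLD_VAR_REGEX, sample)

# The new pattern stops at the closing quote followed by ';'.
new_matches = re.findall(NEW_VAR_REGEX, sample)

print(old_matches)
print(new_matches)
```

On this input the old pattern returns a single match whose value swallows the second declaration, while the proposed pattern returns both variables separately.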

Thanks for your code. :)

Exception on autohome SUV [FIXED]

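When crawling SUV models on Autohome, the program fails with an "index out of range" error.
Investigation shows that SUV is one of the obfuscated keywords, but because it is an English keyword it is not URL-escaped. The regex therefore fails to capture it, the dictionary ends up 3 entries short, and an index runs past the end of the dictionary during execution, which causes the error.

For example: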

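import json
import requests

# get_params is the project's parsing function (presumably from antispider/autohome.py)
res = requests.get("http://car.autohome.com.cn/config/spec/1646.html")
res.encoding = 'gb18030'
item = get_params(res.text)
print(json.dumps(item, ensure_ascii=False, indent=4))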

The deobfuscated JS is shown below. SUV, the first three characters, is not captured because it is not written in %xx form.

SUV%E4%B8%87%E4%B8%AD%E4%BA%AC%E4%BB%B7%E4%BC%98%E4%BD%93%E4%BE%9B%E4%BF%9D%E5%85%83%E5%85%A8%E5%87%86%E5%87%91%E5%88%97%E5%88%B6%E5%89%8D%E5%8A%9B%E5%8A%9F%E5%8A%A8%E5%8A%A9%E5%8C%97%E5%8D%8E%E5%8E%8B%E5%8F%B7%E5%90%88%E5%90%8D%E5%90%8E%E5%90%B8%E5%95%86%E5%96%B7%E5%99%A8%E5%9C%B0%E5%9E%8B%E5%A4%87%E5%A4%9A%E5%A4%A7%E5%A4%AE%E5%AD%90%E5%AE%9A%E5%AE%9E%E5%AE%B9%E5%AE%BD%E5%AF%B8%E5%AF%BC%E5%B0%BA%E5%B7%AE%E5%B9%B4%E5%BA%A6%E5%BC%8F%E5%BC%B9%E5%BE%84%E5%BE%B7%E6%82%AC%E6%88%96%E6%89%AD%E6%89%BF%E6%8C%87%E6%8E%92%E6%95%B0%E6%95%B4%E6%9C%80%E6%9C%BA%E6%9D%86%E6%9E%84%E6%9E%B6%E6%A0%87%E6%A0%BC%E6%A2%B0%E6%AC%A7%E6%AF%94%E6%B0%94%E6%B2%B9%E6%B5%8B%E6%B6%B2%E7%82%B9%E7%84%B6%E7%87%83%E7%8B%AC%E7%8E%87%E7%8E%AF%E7%94%B5%E7%9B%96%E7%9B%98%E7%9F%A9%E7%A6%BB%E7%A7%AF%E7%A7%B0%E7%A8%8B%E7%A8%B3%E7%AB%8B%E7%AE%B1%E7%B0%A7%E7%B4%A7%E7%BB%BC%E7%BC%A9%E7%BC%B8%E7%BD%AE%E8%80%97%E8%83%8E%E8%87%AA%E8%93%9D%E8%A1%8C%E8%A7%84%E8%B1%AA%E8%B4%A8%E8%B7%9D%E8%BD%A6%E8%BD%AC%E8%BD%AE%E8%BD%B4%E8%BD%BD%E8%BF%9B%E8%BF%9E%E9%80%9A%E9%80%9F%E9%85%8D%E9%87%8F%E9%93%81%E9%93%9D%E9%95%BF%E9%97%A8%E9%97%B4%E9%9A%99%E9%9B%85%E9%A3%8E%E9%A9%B1%E9%A9%BB%E9%AB%98%E9%BC%93C%

I suspect any other English letters in the string will cause the same problem. I suggest fixing this by changing the regex.
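
One possible way to make the keyword table tolerate unescaped ASCII is to decode the whole string instead of matching only `%xx` groups. The sketch below uses a shortened prefix of the string above; the `%xx`-only pattern is an assumption standing in for the project's regex, which is not quoted in this report.

```python
import re
from urllib.parse import unquote

# Shortened prefix of the deobfuscated string quoted above.
encoded = "SUV%E4%B8%87%E4%B8%AD%E4%BA%AC"

# Assumed stand-in for a pattern that only accepts %xx escapes.
pct_only = "".join(re.findall(r"(?:%[0-9A-Fa-f]{2})+", encoded))
chars_pct_only = list(unquote(pct_only))

# Decoding the whole string keeps the literal ASCII characters as well.
chars_all = list(unquote(encoded))

# The %xx-only table is shorter by exactly the three letters of "SUV".
print(len(chars_pct_only), len(chars_all))
```

`unquote` leaves characters that are not part of a `%xx` escape unchanged, so literal letters keep their position in the table.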

Runtime error:

UnicodeDecodeError: 'gbk' codec can't decode byte 0x9a in position 918: illegal multibyte sequence

Question about character order

A question:
This is a piece of the JS I got after applying the regex.
(%E3%80%82%E4%B8%80%E4%B8%89%E4%B8%8A%E4%B8%8B%E4%B8%8D%E4%BA%86%E4%BA%8C%E4%BD%8E%E5%92%8C%E5%9C%B0%E5%A4%9A%E5%A4%A7%E5%A5%BD%E5%B0%91%E5%BE%88%E5%BE%97%E6%98%AF%E7%9A%84%E7%9D%80%E8%BF%91%E9%AB%98%EF%BC%81%EF%BC%8CNx_());=IK_((Nx_()1;11;18;23;13;17;3;0;6;8;22;9;5;19;20;15;12;7;10;4;2;21;16;14),hZ_(;))
The decoded text is (。一三上下不了二低和地多大好少很得是的着近高!,). So my question is: does 1;11;18;23;13;17;3;0;6;8;22;9;5;19;20;15;12;7;10;4;2;21;16;14 represent indices into that text? Does each one correspond to hs_kw(index)_mainBf?
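
The guess in the question can at least be checked mechanically. This sketch decodes the character table and treats the semicolon-separated numbers as positions in it. Mapping the k-th number to the span class hs_kw{k} is only the assumption raised in the question, not something confirmed here.

```python
from urllib.parse import unquote

# Character table and index list copied from the JS fragment above.
table = unquote(
    "%E3%80%82%E4%B8%80%E4%B8%89%E4%B8%8A%E4%B8%8B%E4%B8%8D%E4%BA%86%E4%BA%8C"
    "%E4%BD%8E%E5%92%8C%E5%9C%B0%E5%A4%9A%E5%A4%A7%E5%A5%BD%E5%B0%91%E5%BE%88"
    "%E5%BE%97%E6%98%AF%E7%9A%84%E7%9D%80%E8%BF%91%E9%AB%98%EF%BC%81%EF%BC%8C"
)
order = "1;11;18;23;13;17;3;0;6;8;22;9;5;19;20;15;12;7;10;4;2;21;16;14"
indices = [int(n) for n in order.split(";")]

# True when every position in the table is referenced exactly once.
is_permutation = sorted(indices) == list(range(len(table)))

# Assumption: the k-th index selects the character for class hs_kw{k}.
keyword_for = {k: table[i] for k, i in enumerate(indices)}

print(is_permutation, keyword_for[0])
```

The table has 24 characters and the list is a permutation of 0 to 23, which is consistent with the numbers being indices into the decoded text.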
