Comments (17)
On today's wenshu site, even genuine human visits get caught by the anti-crawler mechanisms (the human experience is really bad). Scraping has become quite hard to pull off now...
It really is hard to scrape now. The OP's cookie-acquisition approach does work, though sometimes no cookie comes back.
result : {"code":9,"description":null,"secretKey":null,"result":null,"success":false} ---- this response shows up because the queryCondition request parameter was not run through json.dumps
from wenshu.
I also tried fetching the cookie with selenium. It works, but it is fairly slow; for reference:
url = f'https://wenshu.court.gov.cn/website/wenshu/181217BMTKHNT2W0/index.html?pageId={PageID()}&s8=02'
Set navigator.webdriver = undefined so the site does not flag the session as an automated webdriver:
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        })
    """
})
Set the userAgent:
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": RandomUa()})
These are the key parameters that need to be set for selenium.
from wenshu.
The problem now is that fetching the document list with this cookie returns:
status_code : 200
result : {"code":9,"description":null,"secretKey":null,"result":null,"success":false}
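An HTTP 200 here only means the request got through; this API signals failure inside the JSON body, so the success flag and code have to be checked separately. A minimal sketch, using the exact response body quoted above:

```python
import json

# Sample body quoted above: the HTTP status was 200, but the API call failed.
body = '{"code":9,"description":null,"secretKey":null,"result":null,"success":false}'

def api_ok(text):
    """Return (ok, payload): ok is True only when the JSON body reports success."""
    data = json.loads(text)
    return data.get("success") is True, data

ok, payload = api_ok(body)
print(ok, payload["code"])  # False 9
```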
from wenshu.
Is it really because json.dumps was missing? I tried it: with data=json.dumps(data) I got an error instead:
{'code': 9, 'description': '请求接口未定义或格式错误,cfg=null', 'secretKey': None, 'result': None, 'success': False}
Unlike @zwzhengke, my original success status was True:
{'code': 1, 'description': None, 'secretKey': 'znw3eIZGg3ZZNZMLCI3ajj7O', 'result': '1ok02QhmXaA=', 'success': True}
Right now I am only doing a single-page test with Requests; the rough approach is as follows:
url = "http://wenshu.court.gov.cn/website/parse/rest.q4w"
data = {
    'docID': '97e53a7245264aaeacd4abde01272f72',
    'ciphertext': make_ciphertext(),
    'cfg': 'com.lawyee.judge.dc.parse.dto.SearchDataDsoDTO@docInfoSearch',
    '__RequestVerificationToken': verification_token(),
}
print(data)
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.4.2; en-us; SCH-I535 Build/KOT49H) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
    'Cookie': cookie_string
}
print(headers)
r = requests.post(url, headers=headers, data=data)
if r.status_code == 200:
    response = r.json()
    print(response)
else:
    print(r.status_code)
from wenshu.
Yes, your code errors out because the snippet you posted fetches the detail by ID, which does not need queryCondition at all.
When you use queryCondition to get docIds, though, queryCondition does need to be built this way:
json.dumps([{"s8": "02"}, {"court": "枣阳市人民法院", "start": "1998-02-01", "end": "1998-02-28"}])
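Put together, a list request built that way might look like the following sketch. queryCondition is the only field the comment above specifies; the surrounding form fields (pageNum, pageSize) are assumptions for illustration:

```python
import json

# queryCondition must be a JSON *string*, i.e. passed through json.dumps,
# not a raw Python list -- omitting this is what triggers the code-9 response.
query_condition = json.dumps([
    {"s8": "02"},
    {"court": "枣阳市人民法院", "start": "1998-02-01", "end": "1998-02-28"},
], ensure_ascii=False)

# Hypothetical form payload: field names other than queryCondition are assumed.
data = {
    "pageNum": 1,
    "pageSize": 5,
    "queryCondition": query_condition,
}

# Round-trip check: the string decodes back to the original structure.
decoded = json.loads(data["queryCondition"])
print(decoded[0]["s8"])  # 02
```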
from wenshu.
Thanks for the reply. I have cleared the docID hurdle now; it is fetching the document detail that has stalled. The problem is that the response comes back with success=True but no full text.
How did your successful run go? Did you append the HifJzoc9 parameter? My earlier question: #12
Thanks again!
from wenshu.
The HifJzoc9 parameter is actually unused? The server does not validate it; just send the normal parameters:
def __generate_params(self, docId):
    params = {
        "docId": docId,
        "ciphertext": Cipher.binary(),
        "cfg": "com.lawyee.judge.dc.parse.dto.SearchDataDsoDTO@docInfoSearch",
        "__RequestVerificationToken": Cipher.random24key(24)
    }
    return params
The key to getting data, I found, is the cookie: with it attached I do get the detail, but the cookie is only valid for about one minute, and since I fetch it via selenium the throughput is quite low. Honestly, this site is genuinely hard to deal with. I get the docId from the APP-side API, but the detail the APP API returns is incomplete, so the web-side endpoint is still needed.
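Since the cookie reportedly lives only about a minute and the selenium fetch is slow, one way to amortize the cost is a small cache that refetches on expiry. This is a sketch; fetch_cookie is a stand-in for whatever selenium/PyQt5 routine actually obtains the cookie:

```python
import time

class CookieCache:
    """Reuse a cookie until it approaches its (approximate) expiry."""

    def __init__(self, fetch_cookie, ttl=50, clock=time.time):
        self.fetch_cookie = fetch_cookie  # the slow fetcher, e.g. selenium-based
        self.ttl = ttl                    # refresh a bit before the ~60 s expiry
        self.clock = clock                # injectable clock, handy for testing
        self._cookie = None
        self._fetched_at = 0.0

    def get(self):
        if self._cookie is None or self.clock() - self._fetched_at >= self.ttl:
            self._cookie = self.fetch_cookie()
            self._fetched_at = self.clock()
        return self._cookie

# Usage with a stub fetcher and a fake clock:
now = [0.0]
calls = []
cache = CookieCache(lambda: calls.append(1) or f"cookie-{len(calls)}",
                    ttl=50, clock=lambda: now[0])
print(cache.get())   # cookie-1 (first fetch)
now[0] = 10.0
print(cache.get())   # cookie-1 (still cached)
now[0] = 70.0
print(cache.get())   # cookie-2 (refetched after ttl)
```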
from wenshu.
If you do not need many fields in the detail, the detail data returned by the APP-side API is good enough:
params = {
    "ciphertext": Cipher.binary(),
    "devid": "XXX",
    "devtype": "1",
    "docId": doc_id
}
q = {
    "id": Cipher.stamp,
    "command": action,  # docInfoSearch
    "params": params,
    "cfg": "com.lawyee.judge.dc.parse.dto.SearchDataDsoDTO@docInfoSearch"
}
data = "request={}".format(base64.b64encode(json.dumps(q).encode()).decode())
headers = {
    "Content-Type": "application/x-www-form-urlencoded"
}
The APP side has little anti-crawling; ordinary request headers are enough and no cookie is needed. But the data from the docID list and the detail must be merged to get reasonably complete records.
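The request= body above is just the JSON command base64-encoded, so it can be round-tripped as a sanity check. In this self-contained sketch the values for Cipher.stamp and Cipher.binary() are dummies, since those come from the poster's own helpers:

```python
import base64
import json

# Placeholders for the poster's Cipher helpers -- values here are dummies.
q = {
    "id": "1577000000000",                 # Cipher.stamp placeholder
    "command": "docInfoSearch",
    "params": {
        "ciphertext": "...",               # Cipher.binary() placeholder
        "devid": "XXX",
        "devtype": "1",
        "docId": "97e53a7245264aaeacd4abde01272f72",
    },
    "cfg": "com.lawyee.judge.dc.parse.dto.SearchDataDsoDTO@docInfoSearch",
}

encoded = base64.b64encode(json.dumps(q).encode()).decode()
data = "request={}".format(encoded)

# Decoding the payload recovers the original command verbatim.
recovered = json.loads(base64.b64decode(encoded))
print(recovered["command"])  # docInfoSearch
```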
from wenshu.
Thank you very much. I also used the APP version to get docIDs and then requested the full text, but hit the same problem as on the web version: the request returns success, yet there is no body. Since the APP version lacks some related data (party information, cited statutes, and so on), I switched over to the web version.
I had assumed the problem was the cookies: I turned them into a string and put it in the headers. But even following your method, putting the cookies directly into the request, I still cannot get the body? If you do not mind, my code is below; any pointers appreciated.
import sys
from typing import Dict
from PyQt5.QtCore import QEventLoop, QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEngineView, QWebEngineProfile
import json
import datetime
import time
import random
import base64
from Crypto.Cipher import DES3
from Crypto.Util.Padding import pad, unpad
import math
import dataset
import requests

def get_cookie(url: str) -> Dict[str, str]:
    class Render(QWebEngineView):
        cookies = {}
        html = None

        def __init__(self, url):
            self.app = QApplication(sys.argv)
            super(Render, self).__init__()
            self.page().profile().setHttpUserAgent(
                "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
            )
            self.resize(1920, 1080)
            self.loadFinished.connect(self._loadFinished)
            self.load(QUrl(url))
            QWebEngineProfile.defaultProfile().cookieStore().cookieAdded.connect(self._onCookieAdd)
            while self.html is None:
                self.app.processEvents(QEventLoop.ExcludeUserInputEvents | QEventLoop.ExcludeSocketNotifiers | QEventLoop.WaitForMoreEvents)

        def _onCookieAdd(self, cookie):
            if cookie.domain() != 'wenshu.court.gov.cn':
                return
            name = cookie.name().data().decode('utf-8')
            value = cookie.value().data().decode('utf-8')
            self.cookies[name] = value

        def _callable(self, data):
            self.html = data

        def _loadFinished(self):
            self.page().toHtml(self._callable)

        def __del__(self):
            self.app.quit()

    return Render(url).cookies

class Des(object):
    @staticmethod
    def encrypt(text, key):
        text = pad(text.encode(), DES3.block_size)
        iv = datetime.datetime.now().strftime('%Y%m%d').encode()
        cryptor = DES3.new(key, DES3.MODE_CBC, iv)
        ciphertext = cryptor.encrypt(text)
        return base64.b64encode(ciphertext).decode("utf-8")

    @staticmethod
    def decrypt(text, key):
        iv = datetime.datetime.now().strftime('%Y%m%d').encode()
        cryptor = DES3.new(key, DES3.MODE_CBC, iv)
        de_text = base64.b64decode(text)
        plain_text = cryptor.decrypt(de_text)
        out = unpad(plain_text, DES3.block_size)
        return out.decode()

def make_ciphertext():
    timestamp = str(int(time.time() * 1000))
    salt = ''.join(
        [random.choice('0123456789qwertyuiopasdfghjklzxcvbnm') for _ in range(24)])
    iv = datetime.datetime.now().strftime('%Y%m%d')
    des = Des()
    enc = des.encrypt(timestamp, salt)
    strs = salt + iv + enc
    result = []
    for i in strs:
        result.append(bin(ord(i))[2:])
        result.append(' ')
    return ''.join(result[:-1])

def verification_token():
    token = ''.join([random.choice('0123456789qwertyuiopasdfghjklzxcvbnm')
                     for _ in range(24)])
    return token

url = "http://wenshu.court.gov.cn/website/parse/rest.q4w"
cookies = get_cookie('http://wenshu.court.gov.cn')
print(cookies)
data = {
    'docID': '97e53a7245264aaeacd4abde01272f72',
    'ciphertext': make_ciphertext(),
    'cfg': 'com.lawyee.judge.dc.parse.dto.SearchDataDsoDTO@docInfoSearch',
    '__RequestVerificationToken': verification_token(),
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.4.2; en-us; SCH-I535 Build/KOT49H) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
}
r = requests.post(url, headers=headers, data=data, cookies=cookies)
if r.status_code == 200:
    print(r.content)
else:
    print(r.status_code)
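The last step of make_ciphertext above turns each character into its binary code point, space-separated. That encoding is easy to invert with the standard library, which makes for a handy sanity check of the output format:

```python
def to_binary_string(s):
    """Encode a string as space-separated binary code points, as make_ciphertext does."""
    return ' '.join(bin(ord(c))[2:] for c in s)

def from_binary_string(b):
    """Inverse of to_binary_string: decode the space-separated binary chunks."""
    return ''.join(chr(int(chunk, 2)) for chunk in b.split(' '))

sample = "abc20191231"
encoded = to_binary_string(sample)
print(encoded.split(' ')[0])                  # 1100001  (binary for 'a')
print(from_binary_string(encoded) == sample)  # True
```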
from wenshu.
The get_cookie method returns a dict; you need to convert the cookies into a string:
cookie_dict = Render(url).cookies
cookie_string = "; ".join([str(x) + "=" + str(y) for x, y in cookie_dict.items()])
I add the cookie into the headers myself; you could try that.
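The dict-to-string conversion above can be verified in isolation; the cookie names below mirror the ones seen on the site, with dummy values:

```python
# Example cookie dict of the shape get_cookie returns (values are dummies).
cookie_dict = {"SESSION": "abc123", "HM4hUBT0dDOn443S": "xyz789"}

# Serialize into a Cookie header value, as in the comment above.
cookie_string = "; ".join(str(k) + "=" + str(v) for k, v in cookie_dict.items())
print(cookie_string)  # SESSION=abc123; HM4hUBT0dDOn443S=xyz789
```

Note that requests also accepts the dict directly via its cookies= argument, so either form should carry the same information.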
from wenshu.
My code is about the same as yours and I can get the detail, though it is sometimes slow. This site really is not easy to scrape.
def get_docid(self, docId):
    """Document list page"""
    url = "https://wenshu.court.gov.cn/website/parse/rest.q4w"
    cookie_string = cookieByPyQt5("https://wenshu.court.gov.cn/")
    # cookie_string = cookieByWebdriver()
    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Host": "wenshu.court.gov.cn",
        "Origin": "https://wenshu.court.gov.cn",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
        "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
        "X-Requested-With": "XMLHttpRequest",
        "cookie": cookie_string
    }
    print(headers)
    params = {
        "docId": docId,
        "ciphertext": Cipher.binary(),
        "cfg": "com.lawyee.judge.dc.parse.dto.SearchDataDsoDTO@docInfoSearch",
        "__RequestVerificationToken": Cipher.random24key(24)
    }
    response = requests.post(url, data=params, headers=headers, proxies=self.proxies, verify=False)
    if not response:
        print('empty response')
    print(response.text, response.status_code, type(response))
    json_data = response.json()
    decrypt_result = json.loads(EnCrypt.des_decrypt(json_data['result'], json_data['secretKey'],
                                                    datetime.now().strftime('%Y%m%d')).decode())
    print(decrypt_result)
from wenshu.
Thank you very much for your reply. Comparing against your code, I found two problems in mine:
1) I never converted the cookies dict into a string;
2) neither the cookie request nor the rest.q4w request used https.
Sadly, after fixing both points my code still returns
{"code":1,"description":null,"secretKey":"RefB5wzNCAeb0znwTRcPL6P7","result":"ChqZfCT8MQY=","success":true}
The request succeeds, yet there is still no full text, which baffles me. Could it be that the cookies we obtain differ?
Requesting over https, the cookies I get are:
HM4hUBT0dDOn443S
HM4hUBT0dDOn443T
HM4hUBT0dDOn80S
HM4hUBT0dDOn80T
HM4hUBT0dDOnenable
SESSION
But watching the browser, a working full-text request carries these:
HM4hUBT0dDOn443S
_gscu_125736681
SESSION
HM4hUBT0dDOn443T
Is the problem that I am missing _gscu_125736681? I requested several pages and that cookie's value does not seem to change.
My revised code is below. If you do not mind, please point me in the right direction.
import sys
from typing import Dict
from PyQt5.QtCore import QEventLoop, QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEngineView, QWebEngineProfile
import json
import datetime
import time
import random
import base64
from Crypto.Cipher import DES3
from Crypto.Util.Padding import pad, unpad
import math
import dataset
import requests

def get_cookie(url: str) -> Dict[str, str]:
    class Render(QWebEngineView):
        cookies = {}
        html = None

        def __init__(self, url):
            self.app = QApplication(sys.argv)
            super(Render, self).__init__()
            self.page().profile().setHttpUserAgent(
                "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
            )
            self.resize(1920, 1080)
            self.loadFinished.connect(self._loadFinished)
            self.load(QUrl(url))
            QWebEngineProfile.defaultProfile().cookieStore().cookieAdded.connect(self._onCookieAdd)
            while self.html is None:
                self.app.processEvents(QEventLoop.ExcludeUserInputEvents | QEventLoop.ExcludeSocketNotifiers | QEventLoop.WaitForMoreEvents)

        def _onCookieAdd(self, cookie):
            print(cookie.domain())
            if cookie.domain() != 'wenshu.court.gov.cn':
                return
            name = cookie.name().data().decode('utf-8')
            value = cookie.value().data().decode('utf-8')
            self.cookies[name] = value

        def _callable(self, data):
            self.html = data

        def _loadFinished(self):
            self.page().toHtml(self._callable)

        def __del__(self):
            self.app.quit()

    return Render(url).cookies

class Des(object):
    @staticmethod
    def encrypt(text, key):
        text = pad(text.encode(), DES3.block_size)
        iv = datetime.datetime.now().strftime('%Y%m%d').encode()
        cryptor = DES3.new(key, DES3.MODE_CBC, iv)
        ciphertext = cryptor.encrypt(text)
        return base64.b64encode(ciphertext).decode("utf-8")

    @staticmethod
    def decrypt(text, key):
        iv = datetime.datetime.now().strftime('%Y%m%d').encode()
        cryptor = DES3.new(key, DES3.MODE_CBC, iv)
        de_text = base64.b64decode(text)
        plain_text = cryptor.decrypt(de_text)
        out = unpad(plain_text, DES3.block_size)
        return out.decode()

def make_ciphertext():
    timestamp = str(int(time.time() * 1000))
    salt = ''.join(
        [random.choice('0123456789qwertyuiopasdfghjklzxcvbnm') for _ in range(24)])
    iv = datetime.datetime.now().strftime('%Y%m%d')
    des = Des()
    enc = des.encrypt(timestamp, salt)
    strs = salt + iv + enc
    result = []
    for i in strs:
        result.append(bin(ord(i))[2:])
        result.append(' ')
    return ''.join(result[:-1])

def verification_token():
    token = ''.join([random.choice('0123456789qwertyuiopasdfghjklzxcvbnm')
                     for _ in range(24)])
    return token

url = "https://wenshu.court.gov.cn/website/parse/rest.q4w"
cookies = get_cookie('https://wenshu.court.gov.cn')
cookie_string = "; ".join([str(x) + "=" + str(y) for x, y in cookies.items()])
data = {
    'docID': '97e53a7245264aaeacd4abde01272f72',
    'ciphertext': make_ciphertext(),
    'cfg': 'com.lawyee.judge.dc.parse.dto.SearchDataDsoDTO@docInfoSearch',
    '__RequestVerificationToken': verification_token(),
}
headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Host": "wenshu.court.gov.cn",
    "Origin": "https://wenshu.court.gov.cn",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "X-Requested-With": "XMLHttpRequest",
    "cookie": cookie_string
}
print(headers)
r = requests.post(url, headers=headers, data=data)
if r.status_code == 200:
    print(r.text)
else:
    print(r.status_code)
from wenshu.
Thanks a lot. The encrypt/decrypt code should be fine; my APP version uses the same routines.
If your request method is also the same as mine, then the problem must lie in the cookies.
In your cookie_string = cookieByPyQt5("https://wenshu.court.gov.cn/"), is that the PyQt5.QtCore approach? If we are both using @nciefeiniu's code, why do our results differ?
from wenshu.
Most likely the cookie obtained that way has already expired. I have dropped the cookie generated that way and switched to my own ways of obtaining cookies.
There are three main ones:
cookieByPyQt5 --- the approach the OP provided
cookieByWebdriver --- cookie obtained via webdriver; lower efficiency
cookieByPyppeteer --- cookie obtained via pyppeteer + asyncio; higher efficiency
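One way to combine the three methods is a simple fallback chain that tries the fast one first. In this sketch the fetcher bodies are stubs, since the real implementations (pyppeteer, webdriver, PyQt5) live in the poster's own code:

```python
def get_cookie_with_fallback(fetchers):
    """Try each (name, fetcher) in order; return the first non-empty result."""
    for name, fetch in fetchers:
        try:
            cookie = fetch()
        except Exception:
            continue  # e.g. browser crashed; fall through to the next strategy
        if cookie:
            return name, cookie
    raise RuntimeError("all cookie fetchers failed")

# Stubs standing in for cookieByPyppeteer / cookieByWebdriver / cookieByPyQt5:
def cookieByPyppeteer():
    raise RuntimeError("pyppeteer unavailable in this sketch")

def cookieByWebdriver():
    return "SESSION=abc; HM4hUBT0dDOn443S=xyz"

def cookieByPyQt5():
    return "SESSION=def"

name, cookie = get_cookie_with_fallback([
    ("pyppeteer", cookieByPyppeteer),   # fastest first
    ("webdriver", cookieByWebdriver),
    ("pyqt5", cookieByPyQt5),
])
print(name)  # webdriver
```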
from wenshu.
Could you share how you obtain the cookie? With pyqt5 the program exits abnormally the second time I fetch in a row, so right now I cannot fetch cookies continuously.
from wenshu.
pyqt5 throwing an exception on consecutive fetches is expected: only one main application may own the opened web page. That does not fit the distributed setup I need, and it cannot be combined with multithreading and the like.
The cookie method is the one mentioned above, via selenium --headless, without opening a visible browser. Request
https://wenshu.court.gov.cn/website/wenshu/181217BMTKHNT2W0/index.html?pageId=6fbac8fa74baec81662cd72c8e726e53&s8=02
(best to randomly regenerate pageId each time), then extract the cookies. Remember to set webdriver to undefined.
Set navigator.webdriver = undefined so the site does not flag the session as an automated webdriver:
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        })
    """
})
Set the userAgent:
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": RandomUa()})
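Randomizing pageId, as suggested, can be done with a fresh 32-character hex string per request; uuid4().hex happens to have exactly the shape of the pageId in the example URL. A sketch:

```python
import uuid

def random_page_id():
    """32 lowercase hex characters, the same shape as the example pageId."""
    return uuid.uuid4().hex

page_id = random_page_id()
url = ("https://wenshu.court.gov.cn/website/wenshu/181217BMTKHNT2W0/index.html"
       "?pageId={}&s8=02".format(page_id))
print(len(page_id))  # 32
```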
from wenshu.
Related Issues (14)
- Even carrying the cookie with your code, it still returns 202
- Crawler
- splash sometimes returns a 400 error, and the document list page sometimes returns an empty response.text
- How is the HifJzoc9 parameter appended to the URL generated
- Yesterday afternoon my browser request hit a CAPTCHA; do you see one when scraping?
- The site changed on 12-20 and added the rui parameter again; it feels aimed exactly at this loophole
- Is splash the only way to get the cookie?
- Is there a complete demo? How do I obtain the HifJzoc9 parameter after the URL?
- Does it still work now?
- After getting a cookie and sending the request, 202 keeps appearing; swapping cookies does not help either
- "If it is an HTML file, parse it, extract the new URL, retry, and send the POST request": I do not quite follow this; does a 202 also mean a new URL can be parsed out?
- Returns 200 but no full text
- It now seems a logged-in cookie is required to get a complete response