
Comments (17)

NotEnding commented on July 30, 2024

These days even ordinary human visitors to the Wenshu site get caught by its anti-crawler measures (the user experience is very poor), and scraping it has become quite difficult...

It really is hard to scrape now. Your cookie-fetching approach can get a cookie, though sometimes it still fails.
result : {"code":9,"description":null,"secretKey":null,"result":null,"success":false} ---- this response is returned when the querycondition request parameter has not been passed through json.dumps.

from wenshu.

NotEnding commented on July 30, 2024

I also tried fetching the cookie with selenium. It works, but it is fairly slow; for reference:
url = f'https://wenshu.court.gov.cn/website/wenshu/181217BMTKHNT2W0/index.html?pageId={PageID()}&s8=02'

Set navigator.webdriver = undefined so the anti-crawler check does not identify the browser as an automated webdriver:

    driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'webdriver', {
              get: () => undefined
            })
          """
    })

Set the userAgent:

driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": RandomUa()})

These are the key parameters that need to be set for selenium.

NotEnding commented on July 30, 2024

The problem now is that requesting the document list with this cookie returns:
status_code : 200
result : {"code":9,"description":null,"secretKey":null,"result":null,"success":false}


nciefeiniu commented on July 30, 2024

These days even ordinary human visitors to the Wenshu site get caught by its anti-crawler measures (the user experience is very poor), and scraping it has become quite difficult...


yilu1015 commented on July 30, 2024

Is the missing json.dumps really the cause? I tried it, and with data=json.dumps(data) the request errored instead:
{'code': 9, 'description': '请求接口未定义或格式错误,cfg=null', 'secretKey': None, 'result': None, 'success': False}
(the description says the requested API is undefined or the format is wrong, cfg=null)

Unlike @zwzhengke, my earlier responses came back with success=True:
{'code': 1, 'description': None, 'secretKey': 'znw3eIZGg3ZZNZMLCI3ajj7O', 'result': '1ok02QhmXaA=', 'success': True}

For now I am just doing a single-page test with Requests; the rough approach is as follows:

url = "http://wenshu.court.gov.cn/website/parse/rest.q4w"

data = {
    'docID': '97e53a7245264aaeacd4abde01272f72',
    'ciphertext': make_ciphertext(),
    'cfg': 'com.lawyee.judge.dc.parse.dto.SearchDataDsoDTO@docInfoSearch',
    '__RequestVerificationToken': verification_token(),
}

print(data)

headers = {
    'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.4.2; en-us; SCH-I535 Build/KOT49H) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
    'Cookie': cookie_string,
}

print(headers)

r = requests.post(url, headers=headers, data=data)

if r.status_code == 200:
    response = r.json()
    print(response)
else:
    print(r.status_code)


NotEnding commented on July 30, 2024

Yes, your request errors out because the snippet you posted fetches the document detail by ID, which does not take a queryCondition at all. When you use queryCondition to fetch docIds, it needs to be serialized along these lines:
json.dumps( [ {"s8":"02"}, {"court": "枣阳市人民法院", "start": "1998-02-01", "end": "1998-02-28"} ] )
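To make the distinction concrete, here is a minimal sketch of the two payload shapes (field names follow the thread; anything beyond them is an assumption):

```python
import json

# List/search endpoint: queryCondition must be a JSON *string*,
# so the Python list is serialized with json.dumps first.
query_condition = json.dumps(
    [
        {"s8": "02"},
        {"court": "枣阳市人民法院", "start": "1998-02-01", "end": "1998-02-28"},
    ],
    ensure_ascii=False,
)
search_payload = {"queryCondition": query_condition}

# Detail-by-ID endpoint: no queryCondition at all.
detail_payload = {"docId": "97e53a7245264aaeacd4abde01272f72"}

# Round trip: what the server would parse out of the serialized condition.
parsed = json.loads(search_payload["queryCondition"])
```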


yilu1015 commented on July 30, 2024

Thanks for the reply. I have now got the docID step working; it is fetching the document detail that is stuck. The response returns success=True, but I cannot get the full text.

How did your successful run go? Did you append the HifJzoc9 parameter? My original issue: #12

Thanks again!


NotEnding commented on July 30, 2024

As far as I can tell, the HifJzoc9 parameter is not actually used; the server does not validate it. Sending the normal parameters is enough:

def __generate_params(self, docId):
    params = {
        "docId": docId,
        "ciphertext": Cipher.binary(),
        "cfg": "com.lawyee.judge.dc.parse.dto.SearchDataDsoDTO@docInfoSearch",
        "__RequestVerificationToken": Cipher.random24key(24)
    }
    return params

The key to getting the data, in my experience, is the cookie. With a valid cookie I can fetch the detail, but the cookie is only valid for about one minute, and I obtain it through selenium, which is slow. Honestly, this site is genuinely hard to deal with. I get the docId from the APP-side API, but the detail the APP API returns is incomplete, so the web-side endpoint is still needed.
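Since the cookie reportedly expires after about a minute and fetching it through a browser is slow, one way to amortize the cost is a small cache that refetches only when stale. A stdlib-only sketch (the ~60-second TTL is taken from the observation above; fetch_cookie is a stand-in for whatever selenium/pyppeteer routine produces the cookie string):

```python
import time

class CookieCache:
    """Reuse a short-lived cookie string, refetching only when it goes stale."""

    def __init__(self, fetch, ttl=55.0):
        self._fetch = fetch   # callable returning a fresh cookie string
        self._ttl = ttl       # refresh slightly before the ~60 s expiry
        self._value = None
        self._born = 0.0

    def get(self):
        if self._value is None or time.time() - self._born > self._ttl:
            self._value = self._fetch()
            self._born = time.time()
        return self._value

# Demo with a counting stand-in for the real browser-based fetch.
calls = []
def fetch_cookie():
    calls.append(1)
    return "SESSION=demo%d" % len(calls)

cache = CookieCache(fetch_cookie)
first = cache.get()
second = cache.get()   # within the TTL: no refetch
```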


NotEnding commented on July 30, 2024

If you do not need many fields from the detail, the detail data returned by the APP-side API is enough:

params = {
    "ciphertext": Cipher.binary(),
    "devid": "XXX",
    "devtype": "1",
    "docId": doc_id
}
q = {
    "id": Cipher.stamp,
    "command": action,  # docInfoSearch
    "params": params,
    "cfg": "com.lawyee.judge.dc.parse.dto.SearchDataDsoDTO@docInfoSearch"
}
data = "request={}".format(base64.b64encode(json.dumps(q).encode()).decode())
headers = {
    "Content-Type": "application/x-www-form-urlencoded"
}

The APP side has little anti-crawling; ordinary request headers are enough and no cookie is needed. But the data from the docID step and the detail still have to be merged to get reasonably complete records.
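The assembly above can be sketched end-to-end with the stdlib only. The Cipher helpers from the thread are stubbed with placeholders here, so this shows the envelope format, not real encryption:

```python
import base64
import json
import time

# Placeholders standing in for Cipher.binary() and Cipher.stamp.
ciphertext = "<ciphertext-placeholder>"
stamp = str(int(time.time() * 1000))

params = {
    "ciphertext": ciphertext,
    "devid": "XXX",
    "devtype": "1",
    "docId": "97e53a7245264aaeacd4abde01272f72",
}
q = {
    "id": stamp,
    "command": "docInfoSearch",
    "params": params,
    "cfg": "com.lawyee.judge.dc.parse.dto.SearchDataDsoDTO@docInfoSearch",
}
# The form body is "request=<base64(JSON)>".
body = "request={}".format(base64.b64encode(json.dumps(q).encode()).decode())

# What the server would decode from the form body:
decoded = json.loads(base64.b64decode(body[len("request="):]))
```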


yilu1015 commented on July 30, 2024

Thank you very much. I also use the APP version to get docIDs and then request the full text, but I hit the same problem as on the web version: the request succeeds, yet there is no body text. Since the APP version lacks some fields (party information, cited statutes, and so on), I switched to the web version.

I originally assumed the problem was the cookies: I had converted them to a string and put them in the headers. But following your approach and passing the cookies directly with the request, I still cannot get the body text. If you don't mind, my code is below; any pointers are appreciated.

import sys

from typing import Dict

from PyQt5.QtCore import QEventLoop, QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEngineView, QWebEngineProfile

import json
import datetime
import time
import random
import base64
from Crypto.Cipher import DES3
from Crypto.Util.Padding import pad, unpad
import requests

def get_cookie(url: str) -> Dict[str, str]:

    class Render(QWebEngineView):
        cookies = {}
        html = None

        def __init__(self, url):
            self.app = QApplication(sys.argv)
            super(Render, self).__init__()
            self.page().profile().setHttpUserAgent(
                "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
            )
            self.resize(1920, 1080)
            self.loadFinished.connect(self._loadFinished)
            self.load(QUrl(url))

            QWebEngineProfile.defaultProfile().cookieStore().cookieAdded.connect(self._onCookieAdd)

            while self.html is None:
                self.app.processEvents(QEventLoop.ExcludeUserInputEvents | QEventLoop.ExcludeSocketNotifiers | QEventLoop.WaitForMoreEvents)

        def _onCookieAdd(self, cookie):
            if cookie.domain() != 'wenshu.court.gov.cn':
                return
            name = cookie.name().data().decode('utf-8')
            value = cookie.value().data().decode('utf-8')
            self.cookies[name] = value

        def _callable(self, data):
            self.html = data

        def _loadFinished(self):
            self.page().toHtml(self._callable)

        def __del__(self):
            self.app.quit()

    return Render(url).cookies

class Des(object):

    @staticmethod
    def encrypt(text, key):
        text = pad(text.encode(), DES3.block_size)
        iv = datetime.datetime.now().strftime('%Y%m%d').encode()
        cryptor = DES3.new(key, DES3.MODE_CBC, iv)
        ciphertext = cryptor.encrypt(text)
        return base64.b64encode(ciphertext).decode("utf-8")

    @staticmethod
    def decrypt(text, key):
        iv = datetime.datetime.now().strftime('%Y%m%d').encode()
        cryptor = DES3.new(key, DES3.MODE_CBC, iv)
        de_text = base64.b64decode(text)
        plain_text = cryptor.decrypt(de_text)
        out = unpad(plain_text, DES3.block_size)
        return out.decode()

def make_ciphertext():
    timestamp = str(int(time.time() * 1000))
    salt = ''.join(
        [random.choice('0123456789qwertyuiopasdfghjklzxcvbnm') for _ in range(24)])
    iv = datetime.datetime.now().strftime('%Y%m%d')
    des = Des()
    enc = des.encrypt(timestamp, salt)
    strs = salt + iv + enc
    result = []
    for i in strs:
        result.append(bin(ord(i))[2:])
        result.append(' ')
    return ''.join(result[:-1])

def verification_token():
    token = ''.join([random.choice('0123456789qwertyuiopasdfghjklzxcvbnm')
                 for _ in range(24)])
    return token

url = "http://wenshu.court.gov.cn/website/parse/rest.q4w"

cookies = get_cookie('http://wenshu.court.gov.cn')

print(cookies)

data = {
    'docID': '97e53a7245264aaeacd4abde01272f72',
    'ciphertext': make_ciphertext(),
    'cfg': 'com.lawyee.judge.dc.parse.dto.SearchDataDsoDTO@docInfoSearch',
    '__RequestVerificationToken': verification_token(),
}


headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.4.2; en-us; SCH-I535 Build/KOT49H) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
}


r = requests.post(url, headers=headers, data=data, cookies=cookies)

if r.status_code == 200:
    print(r.content)
else:
    print(r.status_code)


NotEnding commented on July 30, 2024

get_cookie returns a dict; the cookie needs to be converted to a string:
cookie_dict = Render(url).cookies
cookie_string = "; ".join([str(x) + "=" + str(y) for x, y in cookie_dict.items()])
I put the cookie into the headers on my side; you could try that.
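For example, with a toy cookie dict (names borrowed from the thread, values invented):

```python
# Convert the {name: value} dict from get_cookie() into a Cookie header string.
cookie_dict = {"SESSION": "abc123", "HM4hUBT0dDOn443S": "xyz789"}
cookie_string = "; ".join("%s=%s" % (k, v) for k, v in cookie_dict.items())
# Python 3.7+ dicts preserve insertion order, so the result is deterministic.
```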


NotEnding commented on July 30, 2024

My code is about the same as yours, and I can get the detail; it is just slow at times. This site really is hard to crawl.
def get_docid(self, docId):
    """Document list page."""
    url = "https://wenshu.court.gov.cn/website/parse/rest.q4w"

    cookie_string = cookieByPyQt5("https://wenshu.court.gov.cn/")
    # cookie_string = cookieByWebdriver()

    headers = {
        "Accept": "application/json, text/javascript, */*; q=0.01",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Host": "wenshu.court.gov.cn",
        "Origin": "https://wenshu.court.gov.cn",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
        "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
        "X-Requested-With": "XMLHttpRequest",
        "cookie": cookie_string
    }

    print(headers)
    params = {
        "docId": docId,
        "ciphertext": Cipher.binary(),
        "cfg": "com.lawyee.judge.dc.parse.dto.SearchDataDsoDTO@docInfoSearch",
        "__RequestVerificationToken": Cipher.random24key(24)
    }
    response = requests.post(url, data=params, headers=headers, proxies=self.proxies, verify=False)
    if not response:
        print('empty response')
    print(response.text, response.status_code, type(response))
    json_data = response.json()
    decrypt_result = json.loads(EnCrypt.des_decrypt(json_data['result'], json_data['secretKey'],
                                                    datetime.now().strftime('%Y%m%d')).decode())
    print(decrypt_result)


yilu1015 commented on July 30, 2024

Thank you very much for the reply. Comparing against your code, I found two problems of mine:
1) the cookies dict was not converted to a string;
2) the cookie request and the rest.q4w request were not using https.

Unfortunately, after fixing both, my code still returns
{"code":1,"description":null,"secretKey":"RefB5wzNCAeb0znwTRcPL6P7","result":"ChqZfCT8MQY=","success":true}

The request succeeds, but there is still no body text, which baffles me. Could it be that the cookies we get are different?

Requesting over https, I receive these cookies:
HM4hUBT0dDOn443S
HM4hUBT0dDOn443T
HM4hUBT0dDOn80S
HM4hUBT0dDOn80T
HM4hUBT0dDOnenable
SESSION

But watching the browser, a successful full-text request carries these:

HM4hUBT0dDOn443S
_gscu_125736681
SESSION
HM4hUBT0dDOn443T

Is the problem that I am missing _gscu_125736681? I requested several pages, and that cookie's value does not seem to change.
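A quick way to spot the gap is a set difference over the cookie names (the two lists above):

```python
# Cookie names obtained over https vs. names seen on a working browser request.
fetched = {"HM4hUBT0dDOn443S", "HM4hUBT0dDOn443T", "HM4hUBT0dDOn80S",
           "HM4hUBT0dDOn80T", "HM4hUBT0dDOnenable", "SESSION"}
browser = {"HM4hUBT0dDOn443S", "_gscu_125736681", "SESSION", "HM4hUBT0dDOn443T"}

missing = browser - fetched   # present in the working request, absent here
```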

My revised code is below. If you don't mind, I would appreciate any pointers.

import sys

from typing import Dict

from PyQt5.QtCore import QEventLoop, QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEngineView, QWebEngineProfile

import json
import datetime
import time
import random
import base64
from Crypto.Cipher import DES3
from Crypto.Util.Padding import pad, unpad
import requests

def get_cookie(url: str) -> Dict[str, str]:

    class Render(QWebEngineView):
        cookies = {}
        html = None

        def __init__(self, url):
            self.app = QApplication(sys.argv)
            super(Render, self).__init__()
            self.page().profile().setHttpUserAgent(
                "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
            )
            self.resize(1920, 1080)
            self.loadFinished.connect(self._loadFinished)
            self.load(QUrl(url))

            QWebEngineProfile.defaultProfile().cookieStore().cookieAdded.connect(self._onCookieAdd)

            while self.html is None:
                self.app.processEvents(QEventLoop.ExcludeUserInputEvents | QEventLoop.ExcludeSocketNotifiers | QEventLoop.WaitForMoreEvents)

        def _onCookieAdd(self, cookie):
            print(cookie.domain())
            if cookie.domain() != 'wenshu.court.gov.cn':
                return
            name = cookie.name().data().decode('utf-8')
            value = cookie.value().data().decode('utf-8')
            self.cookies[name] = value

        def _callable(self, data):
            self.html = data

        def _loadFinished(self):
            self.page().toHtml(self._callable)

        def __del__(self):
            self.app.quit()

    return Render(url).cookies

class Des(object):

    @staticmethod
    def encrypt(text, key):
        text = pad(text.encode(), DES3.block_size)
        iv = datetime.datetime.now().strftime('%Y%m%d').encode()
        cryptor = DES3.new(key, DES3.MODE_CBC, iv)
        ciphertext = cryptor.encrypt(text)
        return base64.b64encode(ciphertext).decode("utf-8")

    @staticmethod
    def decrypt(text, key):
        iv = datetime.datetime.now().strftime('%Y%m%d').encode()
        cryptor = DES3.new(key, DES3.MODE_CBC, iv)
        de_text = base64.b64decode(text)
        plain_text = cryptor.decrypt(de_text)
        out = unpad(plain_text, DES3.block_size)
        return out.decode()

def make_ciphertext():
    timestamp = str(int(time.time() * 1000))
    salt = ''.join(
        [random.choice('0123456789qwertyuiopasdfghjklzxcvbnm') for _ in range(24)])
    iv = datetime.datetime.now().strftime('%Y%m%d')
    des = Des()
    enc = des.encrypt(timestamp, salt)
    strs = salt + iv + enc
    result = []
    for i in strs:
        result.append(bin(ord(i))[2:])
        result.append(' ')
    return ''.join(result[:-1])

def verification_token():
    token = ''.join([random.choice('0123456789qwertyuiopasdfghjklzxcvbnm')
                 for _ in range(24)])
    return token

url = "https://wenshu.court.gov.cn/website/parse/rest.q4w"

cookies = get_cookie('https://wenshu.court.gov.cn')

cookie_string = "; ".join([str(x) + "=" + str(y) for x, y in cookies.items()])
            
data = {
    'docID': '97e53a7245264aaeacd4abde01272f72',
    'ciphertext': make_ciphertext(),
    'cfg': 'com.lawyee.judge.dc.parse.dto.SearchDataDsoDTO@docInfoSearch',
    '__RequestVerificationToken': verification_token(),
}


headers = {
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Host": "wenshu.court.gov.cn",
    "Origin": "https://wenshu.court.gov.cn",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36",
    "content-type": "application/x-www-form-urlencoded; charset=UTF-8",
    "X-Requested-With": "XMLHttpRequest",
    "cookie": cookie_string
}

print(headers)

r = requests.post(url, headers=headers, data=data)

if r.status_code == 200:
    print(r.text)
else:
    print(r.status_code)


yilu1015 commented on July 30, 2024

Thanks a lot. The encryption/decryption code should be fine; my APP version uses the same routines. And if your request method matches mine as well, then the problem must be in the cookies.

In your cookie_string = cookieByPyQt5("https://wenshu.court.gov.cn/"), is that the PyQt5.QtCore approach? If we are both using @nciefeiniu's code, why do we get different results?


NotEnding commented on July 30, 2024

The cookie obtained that way has probably expired. I have stopped using cookies generated that way and switched to my own methods. There are three main ones:

cookieByPyQt5 --- the approach the author provided
cookieByWebdriver --- fetches the cookie via webdriver; relatively slow
cookieByPyppeteer --- fetches the cookie via pyppeteer + asyncio; relatively fast


huangsiyuan924 commented on July 30, 2024

Could you share how you fetch the cookie? With pyqt5, my program crashes the second time I fetch in a row, so I cannot fetch cookies continuously.


NotEnding commented on July 30, 2024

It is normal for pyqt5 to throw an exception on consecutive fetches: only one main QApplication instance is allowed per process. That did not fit the distributed setup I need, and it cannot be combined with multithreading.

The way I fetch cookies is the one described above: selenium with --headless (no visible browser window) requesting
https://wenshu.court.gov.cn/website/wenshu/181217BMTKHNT2W0/index.html?pageId=6fbac8fa74baec81662cd72c8e726e53&s8=02

It is best to randomly generate pageId on every request, then extract the cookies.
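A simple way to randomize pageId on each request; the 32-hex-character shape is inferred from the example URL above, so treat the exact format as an assumption:

```python
import random
import string

def random_page_id(length=32):
    # Hypothetical stand-in for the thread's PageID(): random lowercase hex.
    return "".join(random.choice(string.hexdigits.lower()) for _ in range(length))

url = ("https://wenshu.court.gov.cn/website/wenshu/181217BMTKHNT2W0/"
       "index.html?pageId={}&s8=02".format(random_page_id()))
```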

Also set navigator.webdriver = undefined so the anti-crawler check does not identify an automated webdriver:

driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
                Object.defineProperty(navigator, 'webdriver', {
                  get: () => undefined
                })
              """
})

Set the userAgent:

driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": RandomUa()})

