
got-scraping's Introduction

Got Scraping

Got Scraping is a small but powerful got extension designed to send browser-like requests out of the box. This is essential in the web scraping industry, where blending in with regular website traffic helps avoid blocking.

Installation

$ npm install got-scraping

The module is now ESM only

This means you have to load it with an import statement or the import() function. You can do so either by migrating your project to ESM, or by importing got-scraping in an async context:

-const { gotScraping } = require('got-scraping');
+import { gotScraping } from 'got-scraping';

If you cannot migrate to ESM, here's an example of how to import it in an async context:

let gotScraping;

async function fetchWithGotScraping(url) {
    gotScraping ??= (await import('got-scraping')).gotScraping;

    return gotScraping.get(url);
}

Note:

  • Node.js >=16 is required due to instability of HTTP/2 support in lower versions.

API

The Got Scraping package is built using the got.extend(...) functionality, therefore it supports all the features Got has.

Interested in what's under the hood?

import { gotScraping } from 'got-scraping';

gotScraping
    .get('https://apify.com')
    .then(({ body }) => console.log(body));

options

proxyUrl

Type: string

URL of the HTTP or HTTPS based proxy. HTTP/2 proxies are supported as well.

import { gotScraping } from 'got-scraping';

gotScraping
    .get({
        url: 'https://apify.com',
        proxyUrl: 'http://username:[email protected]:1234',
    })
    .then(({ body }) => console.log(body));

useHeaderGenerator

Type: boolean
Default: true

Whether to generate browser-like headers.

headerGeneratorOptions

See the HeaderGeneratorOptions docs.

const response = await gotScraping({
    url: 'https://api.apify.com/v2/browser-info',
    headerGeneratorOptions: {
        browsers: [
            {
                name: 'chrome',
                minVersion: 87,
                maxVersion: 89
            }
        ],
        devices: ['desktop'],
        locales: ['de-DE', 'en-US'],
        operatingSystems: ['windows', 'linux'],
    }
});

sessionToken

Type: object

A non-primitive unique object which describes the current session. By default it is undefined, so new headers are generated every time. Headers generated with the same sessionToken never change.
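The behavior can be pictured as a cache keyed by the token object. This is a hypothetical sketch of the semantics, not the library's implementation; headersForSession and generateHeaders are made-up names:

```javascript
// Hypothetical sketch of sessionToken semantics: headers are generated once
// per token object and reused; generateHeaders stands in for the real generator.
const headerCache = new WeakMap();

function headersForSession(sessionToken, generateHeaders) {
    if (sessionToken === undefined) return generateHeaders(); // new headers every time
    if (!headerCache.has(sessionToken)) {
        headerCache.set(sessionToken, generateHeaders());
    }
    return headerCache.get(sessionToken); // stable for the lifetime of the token
}
```

A WeakMap fits here because the token must be a non-primitive object, and cached headers can be garbage-collected once the session token is dropped.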

Under the hood

Thanks to the included header-generator package, you can choose various browsers from different operating systems and devices. It generates all the headers automatically so you can focus on the important stuff instead.

Yet another goal is to simplify the usage of proxies. Just pass the proxyUrl option and you are set. Got Scraping automatically detects the HTTP protocol that the proxy server supports. After the connection is established, it does another ALPN negotiation for the end server. Once that is complete, Got Scraping can proceed with HTTP requests.

Using the same HTTP version that browsers do is important as well. Most modern browsers use HTTP/2, so Got Scraping makes use of it too. Fortunately, this is already supported by Got - it automatically handles ALPN protocol negotiation to select the best available protocol.

HTTP/1.1 headers are always automatically formatted in Pascal-Case. However, there is an exception: x- headers are not modified in any way.
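The formatting rule can be illustrated with a small sketch (illustrative only, not the library's code):

```javascript
// Illustrative sketch of the rule described above: HTTP/1.1 header names are
// formatted in Pascal-Case, except names starting with "x-", which are left alone.
function formatHeaderName(name) {
    if (name.toLowerCase().startsWith('x-')) return name; // x- headers untouched
    return name
        .split('-')
        .map((part) => part.charAt(0).toUpperCase() + part.slice(1))
        .join('-');
}
```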

By default, Got Scraping uses an insecure HTTP parser, which allows access to websites with non-spec-compliant web servers.

Last but not least, Got Scraping comes with an updated TLS configuration. Some websites fingerprint it and compare the result with real browsers. While Node.js doesn't support OpenSSL 3 yet, the current configuration should still work flawlessly.

To get more detailed information about the implementation, please refer to the source code.

Tips

This package can only generate all the standard attributes. You might want to add the referer header if necessary. Please bear in mind that these headers are made for GET requests for HTML documents. If you want to make POST requests or GET requests for any other content type, you should alter these headers according to your needs. You can do so by passing a headers option or writing a custom Got handler.

This package should provide a solid start for your browser request emulation process. All websites are built differently, and some of them might require some additional special care.

Overriding request headers

const response = await gotScraping({
    url: 'https://apify.com/',
    headers: {
        'user-agent': 'test',
    },
});

For more advanced usage please refer to the Got documentation.

JSON mode

You can parse JSON with this package too, but please bear in mind that the request header generation is done specifically for HTML content type. You might want to alter the generated headers to match the browser ones.

const response = await gotScraping({
    responseType: 'json',
    url: 'https://api.apify.com/v2/browser-info',
});

Error recovery

This section covers possible errors that might happen due to different site implementations.

RequestError: Client network socket disconnected before secure TLS connection was established

The error above can be a result of the server not supporting the provided TLS settings. Try changing the ciphers parameter to either undefined or a custom value.
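A minimal sketch of that workaround: clear the ciphers option so Node's defaults are used on a retry. buildFallbackOptions is a hypothetical helper, not part of got-scraping:

```javascript
// Hedged sketch: build a copy of the request options with the ciphers option
// cleared, so the TLS handshake falls back to Node's default cipher list.
function buildFallbackOptions(options) {
    return {
        ...options,
        https: { ...options.https, ciphers: undefined },
    };
}
```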

got-scraping's People

Contributors

b4nan, barjin, fnesveda, foxt451, kikobeats, metalwarrior665, mnmkng, petrpatek, szmarczak, vladfrangu, xboston


got-scraping's Issues

Release v1

Release v1, would be nice to close #2 before that.

Error: Failed to set ECDH curve

Hi all 😊

I am not sure whether or not this is an issue with got-scraping or rather one with the server I am scraping from (Apache/2.4.38; Debian) ...
Every ~15th request to the server fails with the error Failed to set ECDH curve with code ERR_CRYPTO_OPERATION_FAILED.
The SSL report states that the Java 6u45 handshake simulation is failing. Could this be related?
I looked up what an ECDH curve is, but that didn't give me any hints, and I don't have much knowledge when it comes to topics like HTTP request forging, cryptography, keys, etc. Any help, pointers, or resources to read are highly appreciated.

Since I need at least like 200k requests to the server, I would really love to fix this any way I can.
Thanks in advance πŸ™ŒπŸ»

Request:

import { gotScraping } from 'got-scraping'

export function scrape(url: string): Promise<string> {
  return new Promise((resolve) => {
    gotScraping
      .get(url)
      .then(({ body }) => resolve(body as string))
      .catch((err) => {
        console.error(err)
      })
  })
}

Error:

RequestError: Failed to set ECDH curve
    at Request._beforeError (<mypath>/node_modules/got-cjs/dist/source/core/index.js:324:21)
    at Request.flush (<mypath>/node_modules/got-cjs/dist/source/core/index.js:313:18)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at configSecureContext (node:internal/tls/secure-context:239:11)
    at Object.createSecureContext (node:_tls_common:117:3)
    at Object.connect (node:_tls_wrap:1629:48)
    at Agent.createConnection (node:https:150:22)
    at Agent.createSocket (node:_http_agent:350:26)
    at Agent.addRequest (node:_http_agent:297:10)
    at TransformHeadersAgent.addRequest (<mypath>/node_modules/got-scraping/src/agent/wrapped-agent.ts:19:20)
    at TransformHeadersAgent.addRequest (<mypath>/node_modules/got-scraping/src/agent/transform-headers-agent.ts:95:22)
    at new ClientRequest (node:_http_client:335:16)
    at Object.request (node:http:97:10) {
  input: undefined,
  code: 'ERR_CRYPTO_OPERATION_FAILED',
  timings: undefined,
  options: {
    request: [Function (anonymous)],
    agent: {
      http: TransformHeadersAgent { agent: [Agent] },
      https: TransformHeadersAgent { agent: [Agent] },
      http2: undefined
    },
    h2session: undefined,
    decompress: true,
    timeout: {
      connect: undefined,
      lookup: undefined,
      read: undefined,
      request: 60000,
      response: undefined,
      secureConnect: undefined,
      send: undefined,
      socket: undefined
    },
    prefixUrl: '',
    body: undefined,
    form: undefined,
    json: undefined,
    cookieJar: undefined,
    ignoreInvalidCookies: false,
    searchParams: undefined,
    dnsLookup: undefined,
    dnsCache: undefined,
    context: {
      headerGenerator: HeaderGenerator {
        globalOptions: [Object],
        browserListQuery: undefined,
        inputGeneratorNetwork: [BayesianNetwork],
        headerGeneratorNetwork: [BayesianNetwork],
        uniqueBrowsers: [Array],
        headersOrder: [Object]
      },
      useHeaderGenerator: true,
      insecureHTTPParser: true,
      sessionData: undefined
    },
    hooks: {
      init: [
        [Function: optionsValidationHandler],
        [Function: customOptionsHook]
      ],
      beforeRequest: [
        [Function: insecureParserHook],
        [Function: sessionDataHook],
        [Function: http2Hook],
        [AsyncFunction: proxyHook],
        [AsyncFunction: browserHeadersHook],
        [Function: tlsHook]
      ],
      beforeError: [],
      beforeRedirect: [ [Function: refererHook] ],
      beforeRetry: [],
      afterResponse: []
    },
    followRedirect: true,
    maxRedirects: 10,
    cache: undefined,
    throwHttpErrors: false,
    username: '',
    password: '',
    http2: true,
    allowGetBody: false,
    headers: {
      'user-agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0',
      accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,*/*;q=0.8',
      'accept-language': 'en-US',
      'accept-encoding': 'gzip, deflate, br',
      dnt: '1',
      connection: 'keep-alive',
      'upgrade-insecure-requests': '1',
      'sec-fetch-mode': 'navigate',
      'sec-fetch-dest': 'document',
      'sec-fetch-site': 'same-site',
      'sec-fetch-user': '?1'
    },
    methodRewriting: false,
    dnsLookupIpVersion: undefined,
    parseJson: [Function: parse],
    stringifyJson: [Function: stringify],
    retry: {
      limit: 0,
      methods: [ 'GET', 'PUT', 'HEAD', 'DELETE', 'OPTIONS', 'TRACE' ],
      statusCodes: [
        408, 413, 429, 500,
        502, 503, 504, 521,
        522, 524
      ],
      errorCodes: [
        'ETIMEDOUT',
        'ECONNRESET',
        'EADDRINUSE',
        'ECONNREFUSED',
        'EPIPE',
        'ENOTFOUND',
        'ENETUNREACH',
        'EAI_AGAIN'
      ],
      maxRetryAfter: undefined,
      calculateDelay: [Function: calculateDelay],
      backoffLimit: Infinity,
      noise: 100
    },
    localAddress: undefined,
    method: 'GET',
    createConnection: undefined,
    cacheOptions: {
      shared: undefined,
      cacheHeuristic: undefined,
      immutableMinTimeToLive: undefined,
      ignoreCargoCult: undefined
    },
    https: {
      alpnProtocols: undefined,
      rejectUnauthorized: false,
      checkServerIdentity: undefined,
      certificateAuthority: undefined,
      key: undefined,
      certificate: undefined,
      passphrase: undefined,
      pfx: undefined,
      ciphers: 'TLS_AES_128_GCM_SHA256:TLS_CHACHA20_POLY1305_SHA256:TLS_AES_256_GCM_SHA384:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES128-SHA:ECDHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA:AES256-SHA:DES-CBC3-SHA',
      honorCipherOrder: undefined,
      minVersion: 'TLSv1.2',
      maxVersion: 'TLSv1.3',
      signatureAlgorithms: 'ecdsa_secp256r1_sha256:ecdsa_secp384r1_sha384:ecdsa_secp521r1_sha512:rsa_pss_rsae_sha256:rsa_pss_rsae_sha384:rsa_pss_rsae_sha512:rsa_pkcs1_sha256:rsa_pkcs1_sha384:rsa_pkcs1_sha512:ECDSA+SHA1:rsa_pkcs1_sha1',
      tlsSessionLifetime: undefined,
      dhparam: undefined,
      ecdhCurve: 'X25519:prime256v1:secp384r1:secp521r1:ffdhe2048:ffdhe3072',
      certificateRevocationLists: undefined
    },
    encoding: undefined,
    resolveBodyOnly: false,
    isStream: false,
    responseType: 'text',
    url: URL {
      href: "theURLiAmScraping"
    },
    pagination: {
      transform: [Function: transform],
      paginate: [Function: paginate],
      filter: [Function: filter],
      shouldContinue: [Function: shouldContinue],
      countLimit: Infinity,
      backoff: 0,
      requestLimit: 10000,
      stackAllItems: false
    },
    setHost: true,
    maxHeaderSize: undefined,
    signal: undefined,
    enableUnixSockets: true
  }
}
<mypath>/node_modules/got-cjs/dist/source/core/index.js:324
            error = new errors_js_1.RequestError(error.message, error, this);
                    ^
RequestError: Failed to set ECDH curve
    at Request._beforeError (<mypath>/node_modules/got-cjs/dist/source/core/index.js:324:21)
    at Request.flush (<mypath>/node_modules/got-cjs/dist/source/core/index.js:313:18)
    at processTicksAndRejections (node:internal/process/task_queues:95:5)
    at configSecureContext (node:internal/tls/secure-context:239:11)
    at Object.createSecureContext (node:_tls_common:117:3)
    at Object.connect (node:_tls_wrap:1629:48)
    at Agent.createConnection (node:https:150:22)
    at Agent.createSocket (node:_http_agent:350:26)
    at Agent.addRequest (node:_http_agent:297:10)
    at TransformHeadersAgent.addRequest (<mypath>/node_modules/got-scraping/src/agent/wrapped-agent.ts:19:20)
    at TransformHeadersAgent.addRequest (<mypath>/node_modules/got-scraping/src/agent/transform-headers-agent.ts:95:22)
    at new ClientRequest (node:_http_client:335:16)
    at Object.request (node:http:97:10) {
  input: undefined,
  code: 'ERR_CRYPTO_OPERATION_FAILED',
  timings: undefined,
  options: Options {
    _unixOptions: {
      insecureHTTPParser: true,
      secureOptions: 524304,
      requestOCSP: true
    },
    _internals: {
      request: [Function (anonymous)],
      agent: [Object],
      h2session: undefined,
      decompress: true,
      timeout: [Object],
      prefixUrl: '',
      body: undefined,
      form: undefined,
      json: undefined,
      cookieJar: undefined,
      ignoreInvalidCookies: false,
      searchParams: undefined,
      dnsLookup: undefined,
      dnsCache: undefined,
      context: [Object],
      hooks: [Object],
      followRedirect: true,
      maxRedirects: 10,
      cache: undefined,
      throwHttpErrors: false,
      username: '',
      password: '',
      http2: true,
      allowGetBody: false,
      headers: [Object],
      methodRewriting: false,
      dnsLookupIpVersion: undefined,
      parseJson: [Function: parse],
      stringifyJson: [Function: stringify],
      retry: [Object],
      localAddress: undefined,
      method: 'GET',
      createConnection: undefined,
      cacheOptions: [Object],
      https: [Object],
      encoding: undefined,
      resolveBodyOnly: false,
      isStream: false,
      responseType: 'text',
      url: [URL],
      pagination: [Object],
      setHost: true,
      maxHeaderSize: undefined,
      signal: undefined,
      enableUnixSockets: true
    },
    _merging: false,
    _init: [ [Object], [Object] ]
  }
}

Using gotScraping.extend results in incorrect type information

import { gotScraping } from 'got-scraping';

const instance = gotScraping.extend();

const response = await instance.get('https://api.apify.com/v2/browser-info', {
    responseType: 'json',
    proxyUrl: 'proxy'
});
console.log(response.body);

I had no doubt that I could write something like this, but when I actually run this code in TypeScript, it reports a type error.

I ran the same code in JavaScript and the proxy worked properly, so the only thing that is wrong is the type information.

An error occurs after Electron builds

Issue 1: The header-generator dependency uses __dirname to get the built-in zip configuration file at initialization time, but asar doesn't seem to support reading it.

Is it possible to handle this via configuration rather than having to modify the source code myself?

Issue 2: cloneResponse instantiates PassThrough without autoDestroy: false configured; a similar configuration exists in fixDecompress.

ESM rewrite

Why

got is ESM only, and that itself imposes issues on using the latest versions for us, especially on its dependencies.

What

  • rewrite the package to be ESM only
  • update to latest got
  • use the version in crawlee (and potentially other places) via dynamic imports

Invalid charset error in url

URLs with an invalid charset fail. For example, https://datadeliver.net/ads.txt declares uft-8 instead of utf-8, which results in this error:

ERROR HttpCrawler: Request failed and reached maximum retries. Error: Resource https://datadeliver.net/ads.txt served with unsupported charset/encoding: uft-8
    at HttpCrawler._encodeResponse (/node_modules/@crawlee/http/internals/http-crawler.js:544:15)
    at HttpCrawler._parseResponse (/node_modules/@crawlee/http/internals/http-crawler.js:442:45)
    at HttpCrawler._runRequestHandler (/node_modules/@crawlee/http/internals/http-crawler.js:308:39)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async wrap (/node_modules/@apify/timeout/index.js:52:21) {"id":"uIK9UTUGCjKrRhO","url":"https://datadeliver.net/ads.txt","method":"GET","uniqueKey":"https://datadeliver.net/ads.txt"}

It works fine in curl and in the browser.

Is this error expected, or can such checks be skipped?
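One way to tolerate misspelled charsets like this is to normalize known typos before decoding. A hedged sketch; normalizeCharset and the typo map are hypothetical, not part of got-scraping or Crawlee:

```javascript
// Hedged sketch: map known charset typos to their canonical names before
// handing the value to a decoder. The typo list is an illustrative example.
const CHARSET_TYPOS = { 'uft-8': 'utf-8', 'utf8': 'utf-8' };

function normalizeCharset(charset) {
    const lower = charset.trim().toLowerCase();
    return CHARSET_TYPOS[lower] ?? lower;
}
```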

Is it possible to change the proxyUrl with the beforeRequest hook?

Hey!

First of all, thanks for this amazing package!

As the title suggests, is there any way to set the proxyUrl through the beforeRequest hook?
Ideally I'd like to do this on the beforeRetry hook but we cannot access the options object from there.

It should sort of act like a proxy rotation system - if the request fails, call a function which returns a new proxy and set it as the new proxyUrl.

Thanks!

Originally posted by @PhilBookst in #76
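Pending hook support, proxy rotation can be done outside Got by wrapping the call. A hedged sketch under stated assumptions: request and getNextProxyUrl are injected placeholders (e.g. gotScraping and your own proxy supplier), not got-scraping APIs:

```javascript
// Hedged sketch of proxy rotation outside the hook system: on failure, retry
// the request with a fresh proxyUrl supplied by getNextProxyUrl.
async function requestWithProxyRotation(request, url, getNextProxyUrl, maxAttempts = 3) {
    let lastError;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            return await request({ url, proxyUrl: getNextProxyUrl() });
        } catch (error) {
            lastError = error; // remember the failure and rotate to the next proxy
        }
    }
    throw lastError;
}
```

This trades the built-in retry logic for control over the proxy, which is the behavior the question asks for.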

update docs/readme to match v3

the default export is no longer there, not sure if there are other things that need to be updated too, you will know better than me :]

A question about `http2-wrapper`

* The confusing thing is that internally, it destructures the agents out of
* the object for HTTP and HTTPS, but keeps the structure for HTTP2,
* because it passes all the agents down to http2.auto (from http2-wrapper).
* We're not using http2.auto, but http2.request, which expects a single agent.
* So for HTTP2, we need a single agent and for HTTP and HTTPS we need the object
* to allow destructuring of correct agents.

IIRC you decided to go with http2wrapper.request instead of http2wrapper.auto because of the ALPN cache issues inside the auto function, is that right?

http2-wrapper passes the raw options to resolve-alpn:
https://github.com/szmarczak/http2-wrapper/blob/592daaaf328b297fb0a9e25c7e065857434bf423/source/auto.js#L136

and the options look like this:
https://github.com/szmarczak/http2-wrapper/blob/592daaaf328b297fb0a9e25c7e065857434bf423/source/auto.js#L111-L123

and this looks very similar:

host: hostname,
servername: hostname,
port: port || 443,
ALPNProtocols: ['h2', 'http/1.1'],
rejectUnauthorized,

So maybe Got 11 doesn't forward all the options... I'll check.

How would one know if got-scraping has been blocked?

Somewhat of a vague and broad question, but still would be interested in knowing even where to begin looking for this sort of thing (#34 inspired this question).

Suppose we're using got-scraping not even for scraping, but to reliably connect to arbitrary websites (my specific use case is getting the canonical URL).

From my testing, got-scraping does pretty well against real-world websites (even with well-known "cloud IP"). However, I have to assume that at certain point, it will get blocked by some anti-bot software (e.g. cloudflare), and I'd like to know whether I got blocked vs. whether the resource simply failed to load.

How would one go about detecting that you've been blocked using got-scraping (or just got in general)?

Thanks and a happy new year
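One starting point is a heuristic classifier over the response: certain status codes and body markers are common anti-bot signals. A hedged sketch, not an official got-scraping feature; the code lists and markers are illustrative, not exhaustive:

```javascript
// Hedged sketch: heuristics for "was I blocked?" based on status codes and
// body markers commonly produced by anti-bot software. Tune per target site.
const BLOCK_STATUS_CODES = new Set([403, 429, 503]);
const BLOCK_BODY_MARKERS = ['captcha', 'access denied', 'attention required'];

function looksBlocked(statusCode, body = '') {
    if (BLOCK_STATUS_CODES.has(statusCode)) return true;
    const text = body.toLowerCase();
    return BLOCK_BODY_MARKERS.some((marker) => text.includes(marker));
}
```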

ESM got or CJS got in the future?

I'm trying to use ESM everywhere in my project, so I want my dependencies to be ESM. I happened to use both got and got-scraping in a project, but I noticed got-scraping depends on got-cjs, while got is now ESM, which means there are 2 different got versions installed. Will got-scraping eventually switch to ESM got, when v12 is out of beta?

Passing SSL_OP_LEGACY_SERVER_CONNECT down to Node

Node v18 removed the OpenSSL option to accept legacy servers. This causes Got to throw the following error when scraping servers that don't support RFC 5746 secure renegotiation:

RequestError: C0B70F932E7F0000:error:0A000152:SSL routines:final_renegotiate:unsafe legacy renegotiation disabled:../deps/openssl/openssl/ssl/statem/extensions.c:908:

Looking at the got-scraping source and also this issue #75, it seems it should be possible to pass SSL_OP_LEGACY_SERVER_CONNECT down to Got via _unixOptions.secureOptions. But I can't seem to get it working.

This is what I tried (in my case via a Cheerio preNavigationHook(), since I'm trying to scrape this server using the Crawlee SDK, not got-scraping directly):

  import crypto from 'crypto';
  gotOptions._unixOptions = {
    secureOptions: crypto.constants.SSL_OP_LEGACY_SERVER_CONNECT,
  };
  gotOptions.https = {
    ciphers: [
      // Chrome v92
      'TLS_AES_128_GCM_SHA256',
      'TLS_AES_256_GCM_SHA384',
      'TLS_CHACHA20_POLY1305_SHA256',
      'ECDHE-ECDSA-AES128-GCM-SHA256',
      'ECDHE-RSA-AES128-GCM-SHA256',
      'ECDHE-ECDSA-AES256-GCM-SHA384',
      'ECDHE-RSA-AES256-GCM-SHA384',
      'ECDHE-ECDSA-CHACHA20-POLY1305',
      'ECDHE-RSA-CHACHA20-POLY1305',
      // Legacy:
      'ECDHE-RSA-AES128-SHA',
      'ECDHE-RSA-AES256-SHA',
      'AES128-GCM-SHA256',
      'AES256-GCM-SHA384',
      'AES128-SHA',
      'AES256-SHA',
    ].join(':'),
  };

Here are some SO links related to this OpenSSL issue for additional context:
https://stackoverflow.com/questions/71603314/ssl-error-unsafe-legacy-renegotiation-disabled
https://stackoverflow.com/questions/74324019/allow-legacy-renegotiation-for-nodejs
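One possible refinement (hedged, untested against the library): OR the legacy flag into whatever secureOptions bits are already set rather than replacing them, in case got-scraping's own hooks also write to _unixOptions. withLegacyServerConnect is a hypothetical helper; in real code legacyFlag would be crypto.constants.SSL_OP_LEGACY_SERVER_CONNECT:

```javascript
// Hedged sketch: merge an extra OpenSSL flag into existing secureOptions bits
// instead of overwriting them. legacyFlag is a parameter here; pass
// crypto.constants.SSL_OP_LEGACY_SERVER_CONNECT in real code.
function withLegacyServerConnect(options, legacyFlag) {
    const current = options._unixOptions?.secureOptions ?? 0;
    return {
        ...options,
        _unixOptions: { ...options._unixOptions, secureOptions: current | legacyFlag },
    };
}
```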

What's up with _unixOptions?

I'm looking at ways to improve my request fingerprints to avoid blocks, and I ran into this project. Lots of interesting tricks here! Excellent work.

One that I found interesting was setting _unixOptions with additional TLS settings:

options._unixOptions = {
    // @ts-expect-error Private use
    ...options._unixOptions,
    secureOptions: secureOptions[browser],
    requestOCSP: requestOCSP[browser],
};

I can't find any info about this anywhere, even in Node's source, and it's not mentioned in your docs. Can you explain what this does?

Typescript options error

This starter example from the docs:

gotScraping
    .get({
        url: 'https://apify.com',
        proxyUrl: 'http://usernamed:[email protected]:1234',
    })

leads to the following TS error on TS v4.6.2

TS2769: No overload matches this call.
  The last overload gave the following error.
    Argument of type '{ url: string; proxyUrl: string; }' is not assignable to parameter of type 'OptionsInit'.
      Object literal may only specify known properties, and 'proxyUrl' does not exist in type 'OptionsInit'.

The same happens for any other got-scraping-provided context field, like useHeaderGenerator.

Using version 3.2.8 of this package (for some reason, the latest git tag is v3.7.8, but I assume that's the same?)

Unhandled rejection when connection to proxy fails

Connected to szmarczak/http2-wrapper#66.

(node:12200) UnhandledPromiseRejectionWarning: Error: The proxy server rejected the request with status code 502
    at Http2OverHttp.createConnection (G:\www\amazon\node_modules\got-scraping\node_modules\http2-wrapper\source\proxies\h2-over-hx.js:19:10)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async entry (G:\www\amazon\node_modules\got-scraping\node_modules\http2-wrapper\source\agent.js:395:15)
(node:12200) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 7) 

The `body` option from `got` does not work at all

This code does not send the body at all. It also does not work with the json option. I haven't tested with form, but I assume it will be the same.

const got = require('got-scraping');

got({
    url: 'https://en2ydmzx2ein.x.pipedream.net/',
    body: '{ foo: "bar" }',
    method: 'POST',
    ciphers: undefined,
}).then((res) => {
    console.dir(res);
});

If you replace the require with got itself, it works fine. The above code also does not fail when method: 'POST' is removed, whereas got throws:

RequestError: The `GET` method cannot be used with a body

Slower compared to got and fetch

Hi, I'm trying got-scraping for a project, and noticed that it's almost 2x slower compared to got and native fetch.

Here's some basic test code:

import { got, gotScraping } from 'got-scraping';

const url = 'https://httpbin.org/html';
let t, r;

t = performance.now();
r = await fetch(url);
console.log(`fetch took ${performance.now() - t}ms`);

t = performance.now();
r = await got(url);
console.log(`got took ${performance.now() - t}ms`, JSON.stringify(r.timings.phases));

t = performance.now();
r = await gotScraping(url);
console.log(`gotScraping took ${performance.now() - t}ms`, JSON.stringify(r.timings.phases));

Results (similar timing difference on multiple runs, even with other URLs):

fetch took 987.2981999991462ms

got took 960.8908999999985ms {"wait":2,"dns":2,"tcp":234,"tls":477,"request":1,"firstByte":235,"download":3,"total":954}

gotScraping took 1676.9162999996915ms {"wait":719,"request":null,"firstByte":239,"download":2,"total":960}

My test machine is running Windows 11 and Node.js 18.12.1.

Why is got-scraping slower, and can it be improved?

Error with http2-wrapper

This error was a heisenbug / flaky error in my case, so I cannot guarantee it's reproducible, but I can hint to code causing it.

This was called:

 gotScraping.get({url, proxyUrl, timeout: {request: 30_000}});

This error occurred:

RequestError: Cannot read properties of undefined (reading 'end')
    at Request._beforeError (/app/node_modules/got-cjs/dist/source/core/index.js:333:21)
    at Request.flush (/app/node_modules/got-cjs/dist/source/core/index.js:322:18)
    at /app/node_modules/got-scraping/dist/resolve-protocol.js:33:21
    at processTicksAndRejections (node:internal/process/task_queues:96:5)

This might be the reason:
got-scraping is calling the auto function of http2-wrapper here.

Since http2-wrapper version 2.1.5 the auto function is using delayAsyncDestroy (see here), which in turn can return undefined (see here).

got-scraping does not handle this case and the subsequent call to end() here fails with the described error.

Workaround
Pinning the http2-wrapper dependency to version 2.1.4 (using yarn resolutions in my case), fixed the issue.

Reproduction
To reproduce this, http2-wrapper must resolve to version 2.1.5 or later. Since the dependency is declared with a caret range ("http2-wrapper": "^2.1.4"), later minor and patch versions can be pulled in automatically.
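The pinning workaround described above could look like this in package.json (with yarn resolutions; version per the report):

```json
{
  "resolutions": {
    "http2-wrapper": "2.1.4"
  }
}
```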

Using proxyUrl, always goes with timeout

I tried using a free proxy from https://www.free-proxy-list.com/ to test the package.

But each time I set a proxyUrl, I get an ETIMEDOUT request error.
I have tried many different proxies and always get the same error.

I'm doing my test locally, and the format I use for the proxy is (for example)

proxyUrl: 'https://177.36.200.52:8080'

Has anyone run into this kind of problem before?

Export all the hooks from the package

I had a use-case where I wanted to extend the exported Got instance further and override just the tls hook (the last one). But doing it like below overrides the beforeRequest array as a whole, removing all the other useful hooks (insecureParserHook, ...) that I don't want to override.

export const extendedGotScraping = gotScraping.extend({
    hooks: {
        beforeRequest: [
            myNewTlsHook,
        ],
    },
});

So I had to manually copy-paste the source code of all the other hooks (basically, the whole lib), and do this:

import { gotScraping } from 'got-scraping';
import { insecureParserHook } from './insecure-parser.js';
import { sessionDataHook } from './storage.js';
import { http2Hook } from './http2.js';
import { proxyHook } from './proxy.js';
import { browserHeadersHook } from './browser-headers.js';
import { tlsHook } from './clever-tls.js';

export const extendedGotScraping = gotScraping.extend({
    hooks: {
        beforeRequest: [
            insecureParserHook,
            sessionDataHook,
            http2Hook,
            proxyHook,
            browserHeadersHook,
            myNewTlsHook,
        ],
    },
});

It would be very convenient if one could just do this:

import { gotScraping, hooks } from 'got-scraping';
export const extendedGotScraping = gotScraping.extend({
    hooks: {
        beforeRequest: [
            insecureParserHook: hooks.insecureParserHook,
            sessionDataHook: hooks.sessionDataHook,
            ...
            myNewTlsHook,
        ],
    },
});

Or alternatively:

import { gotScraping, hooks } from 'got-scraping';
export const extendedGotScraping = gotScraping.extend({
    hooks: {
        beforeRequest: [
            ...hooks.beforeRequest.slice(0, -1)
            myNewTlsHook,
        ],
    },
});

Maybe even combine the two approaches.
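The second proposal can be sketched without copying the library: reuse the instance's existing hooks and swap only the last one. This assumes the hooks can be read from gotScraping.defaults.options.hooks (Got instances expose their defaults, but treat that as an assumption here):

```javascript
// Hedged sketch: keep all existing beforeRequest hooks and replace only the
// last one (the TLS hook in got-scraping's current ordering).
function replaceLastHook(hooks, newHook) {
    return [...hooks.slice(0, -1), newHook];
}

// Intended usage (assumes gotScraping.defaults.options.hooks is readable):
// const extendedGotScraping = gotScraping.extend({
//     hooks: {
//         beforeRequest: replaceLastHook(
//             gotScraping.defaults.options.hooks.beforeRequest,
//             myNewTlsHook,
//         ),
//     },
// });
```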

A request to bump up got version

Hi, just wanted to check if you could cut a release w/ got v13 (I understand you guys create a got-cjs cut of the original library, and that, too, doesn't have a version w/ got v13 yet).

Basically, I have a couple of got libraries based off of got v13 and there's a bit of type error when you try to "mix" got v13 with got-scraping (based on got v12):

src/index.ts(7,23): error TS2345: Argument of type 'Got' is not assignable to parameter of type 'ExtendOptions | Got'.
  Type 'import("/Users/janejeon/Projects/@janejeon/dev/node_modules/got/dist/source/types", { assert: { "resolution-mode": "import" } }).Got' is not assignable to type 'import("/Users/janejeon/Projects/@janejeon/dev/node_modules/got-cjs/dist/source/types").Got'.
    Type 'Got' is not assignable to type '{ stream: GotStream; paginate: GotPaginate; defaults: InstanceDefaults; extend: (...instancesOrOptions: (ExtendOptions | Got)[]) => Got; }'.
      Types of property 'stream' are incompatible.
        Type 'import("/Users/janejeon/Projects/@janejeon/dev/node_modules/got/dist/source/types", { assert: { "resolution-mode": "import" } }).GotStream' is not assignable to type 'import("/Users/janejeon/Projects/@janejeon/dev/node_modules/got-cjs/dist/source/types").GotStream'.
          Type 'GotStream' is not assignable to type '(url?: string | URL | undefined, options?: Merge<OptionsInit, { isStream?: true; }> | undefined) => Request'.
            Types of parameters 'options' and 'options' are incompatible.
              Type 'Merge<import("/Users/janejeon/Projects/@janejeon/dev/node_modules/got-cjs/dist/source/core/options").OptionsInit, { isStream?: true; }> | undefined' is not assignable to type 'Merge<import("/Users/janejeon/Projects/@janejeon/dev/node_modules/got/dist/source/core/options", { assert: { "resolution-mode": "import" } }).OptionsInit, { isStream?: true; }> | undefined'.
                Type 'Merge<OptionsInit, { isStream?: true; }>' is not assignable to type 'Merge<OptionsInit, { isStream?: true; }> | undefined'.
                  Type 'Merge<import("/Users/janejeon/Projects/@janejeon/dev/node_modules/got-cjs/dist/source/core/options").OptionsInit, { isStream?: true; }>' is not assignable to type 'Merge<import("/Users/janejeon/Projects/@janejeon/dev/node_modules/got/dist/source/core/options", { assert: { "resolution-mode": "import" } }).OptionsInit, { isStream?: true; }>' with 'exactOptionalPropertyTypes: true'. Consider adding 'undefined' to the types of the target's properties.
                    Type 'Merge<OptionsInit, { isStream?: true; }>' is not assignable to type 'Except<OptionsInit, "isStream">' with 'exactOptionalPropertyTypes: true'. Consider adding 'undefined' to the types of the target's properties.
                      The types of 'retry.calculateDelay' are incompatible between these types.
                        Type 'import("/Users/janejeon/Projects/@janejeon/dev/node_modules/got-cjs/dist/source/core/options").RetryFunction' is not assignable to type 'import("/Users/janejeon/Projects/@janejeon/dev/node_modules/got/dist/source/core/options", { assert: { "resolution-mode": "import" } }).RetryFunction'.

I'm not sure how "serious" it is; since it's just the retry options being different, I don't reckon it will be a huge deal. Still, it would be nice to have a v13-based release to build on, so this can be resolved without @ts-expect-error.

Thank you :)

SSRF protection

Hi, first of all, thanks for the library.

I'm trying to implement SSRF protection in-place (without relying on proxies), which requires modifying the http/https agents and passing the modified instances into got.

However, that's not really possible due to two things:

  1. got's extend functionality doesn't seem to allow you to get or override existing agent instances: https://github.com/sindresorhus/got/blob/9abdd06d72d90d034299dabd22544d14c897643b/test/merge-instances.ts
  2. The TransformHeadersAgent is never exported from this library, which makes it impossible for me to instantiate agents myself and pass it in.

This unfortunately prevents me from being able to use the library. Is there any way around this?

Thanks

Got-scraping throws error when trying to send a GET request to thetimes.co.uk

Minimum reproduction:

import {gotScraping} from 'got-scraping'
await gotScraping('https://www.thetimes.co.uk/article/budget-2021-keir-starmer-faces-rebellion-for-stance-on-recovery-80tc5v2w3?shareToken=36c28cdcb70fc533d671f8ea20fb488b')

The error thrown is as follows:

Uncaught:
RequestError: The `options.agent` can be only an object `http`, `https` or `http2` properties
    at Request._beforeError (/Users/janejeon/Projects/JS/normalize-url-plus/node_modules/got-cjs/dist/source/core/index.js:325:21)
    at Request._onResponseBase (/Users/janejeon/Projects/JS/normalize-url-plus/node_modules/got-cjs/dist/source/core/index.js:707:22)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at Object.module.exports [as auto] (/Users/janejeon/Projects/JS/normalize-url-plus/node_modules/http2-wrapper/source/auto.js:136:9)
    at options.request (/Users/janejeon/Projects/JS/normalize-url-plus/node_modules/got-scraping/dist/hooks/http2.js:14:36)
    at Request._makeRequest (/Users/janejeon/Projects/JS/normalize-url-plus/node_modules/got-cjs/dist/source/core/index.js:974:37)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at async Request._onResponseBase (/Users/janejeon/Projects/JS/normalize-url-plus/node_modules/got-cjs/dist/source/core/index.js:704:17)
    at async Request._onResponse (/Users/janejeon/Projects/JS/normalize-url-plus/node_modules/got-cjs/dist/source/core/index.js:772:13) {
  input: undefined,
  code: 'ERR_GOT_REQUEST_ERROR',
  timings: undefined,
  options: {
    request: [Function (anonymous)],
    agent: {
      http: TransformHeadersAgent { agent: [Agent] },
      https: TransformHeadersAgent { agent: [Agent] },
      http2: undefined
    },
    h2session: undefined,
    decompress: true,
    timeout: {
      connect: undefined,
      lookup: undefined,
      read: undefined,
      request: 60000,
      response: undefined,
      secureConnect: undefined,
      send: undefined,
      socket: undefined
    },
    prefixUrl: '',
    body: undefined,
    form: undefined,
    json: undefined,
    cookieJar: undefined,
    ignoreInvalidCookies: false,
    searchParams: undefined,
    dnsLookup: undefined,
    dnsCache: undefined,
    context: {
      headerGenerator: HeaderGenerator {
        defaultOptions: [Object],
        inputGeneratorNetwork: [BayesianNetwork],
        headerGeneratorNetwork: [BayesianNetwork]
      },
      useHeaderGenerator: true,
      insecureHTTPParser: true
    },
    hooks: {
      init: [
        [Function: optionsValidationHandler],
        [Function: customOptionsHook]
      ],
      beforeRequest: [
        [Function: http2Hook],
        [AsyncFunction: proxyHook],
        [AsyncFunction: browserHeadersHook],
        [Function: insecureParserHook]
      ],
      beforeError: [],
      beforeRedirect: [],
      beforeRetry: [],
      afterResponse: []
    },
    followRedirect: true,
    maxRedirects: 10,
    cache: undefined,
    throwHttpErrors: false,
    username: '',
    password: '',
    http2: true,
    allowGetBody: false,
    headers: {
      connection: 'keep-alive',
      'sec-ch-ua': '"Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"',
      'sec-ch-ua-mobile': '?0',
      'upgrade-insecure-requests': '1',
      dnt: '1',
      'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:90.0) Gecko/20100101 Firefox/90.0',
      accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
      'sec-fetch-mode': 'navigate',
      'sec-fetch-dest': 'document',
      'sec-fetch-site': 'same-site',
      'sec-fetch-user': '?1',
      'accept-encoding': 'gzip, deflate, br',
      'accept-language': 'en-US',
      te: 'trailers'
    },
    methodRewriting: false,
    dnsLookupIpVersion: undefined,
    parseJson: [Function: parse],
    stringifyJson: [Function: stringify],
    retry: {
      limit: 0,
      methods: [ 'GET', 'PUT', 'HEAD', 'DELETE', 'OPTIONS', 'TRACE' ],
      statusCodes: [
        408, 413, 429, 500,
        502, 503, 504, 521,
        522, 524
      ],
      errorCodes: [
        'ETIMEDOUT',
        'ECONNRESET',
        'EADDRINUSE',
        'ECONNREFUSED',
        'EPIPE',
        'ENOTFOUND',
        'ENETUNREACH',
        'EAI_AGAIN'
      ],
      maxRetryAfter: undefined,
      calculateDelay: [Function: calculateDelay],
      backoffLimit: Infinity,
      noise: 100
    },
    localAddress: undefined,
    method: 'GET',
    createConnection: undefined,
    cacheOptions: {
      shared: undefined,
      cacheHeuristic: undefined,
      immutableMinTimeToLive: undefined,
      ignoreCargoCult: undefined
    },
    https: {
      alpnProtocols: undefined,
      rejectUnauthorized: false,
      checkServerIdentity: undefined,
      certificateAuthority: undefined,
      key: undefined,
      certificate: undefined,
      passphrase: undefined,
      pfx: undefined,
      ciphers: 'TLS_AES_256_GCM_SHA384:TLS_AES_128_GCM_SHA256:TLS_CHACHA20_POLY1305_SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA256:ECDHE-RSA-AES256-SHA384:DHE-RSA-AES256-SHA384:ECDHE-RSA-AES256-SHA256:DHE-RSA-AES256-SHA256:HIGH:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!MD5:!PSK:!SRP:!CAMELLIA',
      honorCipherOrder: undefined,
      minVersion: undefined,
      maxVersion: undefined,
      signatureAlgorithms: undefined,
      tlsSessionLifetime: undefined,
      dhparam: undefined,
      ecdhCurve: undefined,
      certificateRevocationLists: undefined
    },
    encoding: undefined,
    resolveBodyOnly: false,
    isStream: false,
    responseType: 'text',
    url: URL {
      href: 'http://www.thetimes.co.uk/tto/public/acs',
      origin: 'http://www.thetimes.co.uk',
      protocol: 'http:',
      username: '',
      password: '',
      host: 'www.thetimes.co.uk',
      hostname: 'www.thetimes.co.uk',
      port: '',
      pathname: '/tto/public/acs',
      search: '',
      searchParams: URLSearchParams {},
      hash: ''
    },
    pagination: {
      transform: [Function: transform],
      paginate: [Function: paginate],
      filter: [Function: filter],
      shouldContinue: [Function: shouldContinue],
      countLimit: Infinity,
      backoff: 0,
      requestLimit: 10000,
      stackAllItems: false
    },
    setHost: true,
    maxHeaderSize: undefined
  }
}

I would appreciate it if y'all could release a fix for this. I'm not sure why it errors out for this URL and not others, but it's clear it's got-scraping that has a problem with this link, not some other component.

Better handling of wrong ciphers

It can happen quite often that a request gets rejected due to the default ciphers. This is problematic because the error message is unhelpful. I know there's a mention of this at the end of the readme, but I don't have high hopes that someone would read that AND remember it. Also, users of the SDK don't even know they need to read the got-scraping readme.

We either need to add information on how to fix the TLS connection error directly into the error thrown from got-scraping, which means adding some sort of error-handling middleware (not sure how that would work for streams).

Or we need to catch that error and automatically retry with ciphers: undefined or something similar. That would probably be the better option. It should also log a message that the request was retried due to invalid ciphers.
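The second option can be sketched as a small wrapper. Everything here (the wrapper name, the exact error codes checked) is illustrative, not an existing got-scraping API; a real fix would presumably live in a beforeError or retry hook:

```javascript
// Retries a request once with the cipher override removed when the
// failure looks like a TLS handshake rejection. `request` is any
// function taking an options object and returning a promise.
async function requestWithCipherFallback(request, options) {
    try {
        return await request(options);
    } catch (error) {
        // Codes are a guess at what a cipher mismatch surfaces as.
        const tlsCodes = ['ERR_SSL_SSLV3_ALERT_HANDSHAKE_FAILURE', 'EPROTO'];
        if (!tlsCodes.includes(error.code)) throw error;
        console.warn('Retrying with default ciphers after TLS failure');
        return request({ ...options, https: { ...options.https, ciphers: undefined } });
    }
}
```

This also covers the logging requirement: the retry is announced, so users learn why their ciphers were dropped instead of silently succeeding.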

got-scraping inefficient against Cloudflare

Recently I have encountered some changes in Cloudflare's antibot protection. While using got-scraping, I am now unable to send requests to websites protected by Cloudflare. I have to use Puppeteer to get through.

It is mentioned in this comment as well.

Any idea how Cloudflare can be that good at detecting the TLS configuration generated by got-scraping?

Document how cookies are handled

It would be good to have a paragraph in the README on whether cookies are saved across requests and whether it's possible to persist them.
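For what it's worth, got (and therefore got-scraping) accepts a cookieJar option, normally a tough-cookie CookieJar. The contract it expects can be sketched with a toy in-memory jar — this is a simplified illustration of the setCookie/getCookieString shape, not a replacement for tough-cookie (it ignores paths, expiry, and attributes):

```javascript
// A toy cookie jar keyed by origin. got calls setCookie() for every
// Set-Cookie response header and getCookieString() before each request.
function createMemoryJar() {
    const store = new Map(); // origin -> Map(name -> value)
    return {
        async setCookie(rawCookie, url) {
            const origin = new URL(url).origin;
            const [pair] = rawCookie.split(';'); // drop attributes like Path=/
            const [name, value] = pair.split('=');
            if (!store.has(origin)) store.set(origin, new Map());
            store.get(origin).set(name.trim(), value);
        },
        async getCookieString(url) {
            const cookies = store.get(new URL(url).origin);
            if (!cookies) return '';
            return [...cookies].map(([n, v]) => `${n}=${v}`).join('; ');
        },
    };
}
```

Passing such a jar (or, in real code, a tough-cookie CookieJar) via the cookieJar option makes cookies persist across requests made through the same instance — which is exactly the behavior the README should spell out.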

Preserve header names

I am sending requests to a server which does not respect the HTTP specification's rule that header names are case-insensitive.
The problem is that with got-scraping, a header named "MyHeader" will be renamed to "Myheader"...

How can I send a header whose name will be kept untouched?
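For context, the renaming comes from Pascal-casing of HTTP/1 header names in the TransformHeadersAgent. The transformation the issue describes can be reproduced with a function like this — a sketch of the observed behavior, not the library's actual code (the library's own tests also mention leaving x- headers and unrecognized header casing alone, so behavior may differ by version):

```javascript
// Pascal-cases a header name the way a generic HTTP/1 normalizer would:
// each dash-separated token gets an uppercase first letter, the rest
// lowercased. This is why "MyHeader" (a single token) becomes "Myheader".
function pascalCaseHeader(name) {
    return name
        .split('-')
        .map((part) => part.charAt(0).toUpperCase() + part.slice(1).toLowerCase())
        .join('-');
}

console.log(pascalCaseHeader('MyHeader'));     // β†’ 'Myheader'
console.log(pascalCaseHeader('x-request-id')); // β†’ 'X-Request-Id'
```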

All proxies returning RequestError: Proxy responded with 400: 155 bytes

RequestError: Proxy responded with 400: 155 bytes
    at Request._beforeError (/Users/daniel/Documents/test/node_modules/got-cjs/dist/source/core/index.js:325:21)
    at Request.flush (/Users/daniel/Documents/test/node_modules/got-cjs/dist/source/core/index.js:314:18)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at ClientRequest.<anonymous> (/Users/daniel/Documents/test/node_modules/got-scraping/dist/resolve-protocol.js:34:28)
    at Object.onceWrapper (node:events:514:26)
    at ClientRequest.emit (node:events:394:28)
    at Socket.socketOnData (node:_http_client:525:11)
    at Socket.emit (node:events:394:28)
    at addChunk (node:internal/streams/readable:315:12)
    at readableAddChunk (node:internal/streams/readable:289:9)
    at Socket.Readable.push (node:internal/streams/readable:228:10)
    at TCP.onStreamRead (node:internal/stream_base_commons:199:23) {
  input: undefined,
  code: 'ERR_GOT_REQUEST_ERROR',
  timings: undefined,

The proxy is in the format proxyUrl: 'http://172.64.75.188:80' and has been tested to work. I am not using any extra options besides proxyUrl and the URL.

Invalid filename after Electron build

Describe the bug
When initializing BayesianNetwork in the constructor of the HeaderGenerator class, __dirname is used, but after Electron is built, __dirname does not point to the same thing.

System information:
OS: win
Node.js version 18.16.0
Electron: 25.4.0

Additional context
(screenshot attached)

Some proxies return 502 after moving to Node 20

We use different kinds of proxies and wanted to upgrade to the latest Node version (moving from 18). We observed that certain proxy providers fail (502 status code, ERR_GOT_REQUEST_ERROR).
Here is the stack trace:

RequestError: Proxy responded with 502 Bad Gateway: 141 bytes
    at Request._beforeError (C:\Users\ibrahim.koymen\basic projects\got20\node_modules\got-cjs\dist\source\core\index.js:324:21)
    at Request.flush (C:\Users\ibrahim.koymen\basic projects\got20\node_modules\got-cjs\dist\source\core\index.js:313:18)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at ClientRequest.<anonymous> (C:\Users\ibrahim.koymen\basic projects\got20\node_modules\got-scraping\dist\resolve-protocol.js:37:28)
    at Object.onceWrapper (node:events:626:26)
    at ClientRequest.emit (node:events:511:28)
    at Socket.socketOnData (node:_http_client:575:11)
    at Socket.emit (node:events:511:28)
    at addChunk (node:internal/streams/readable:332:12)
    at readableAddChunk (node:internal/streams/readable:305:9)
    at Readable.push (node:internal/streams/readable:242:10)
    at TCP.onStreamRead (node:internal/stream_base_commons:190:23)
    at TCP.callbackTrampoline (node:internal/async_hooks:130:17)

Here is the code that reproduces the issue :

const {gotScraping} = require("got-scraping");

(async () => {
    try {
        const response = await gotScraping({
            url: 'https://api.apify.com/v2/browser-info',
            proxyUrl: 'http://108.59.14.200:13402'
        });
        console.log(response.body);
    } catch (e) {
        console.log(e);
    }
})();

The proxy URL 'http://108.59.14.200:13402' is provided by Storm Proxies. While this fails, the same request succeeds using got and hpagent, like below:

import {HttpsProxyAgent} from 'hpagent';
import {got} from "got";

const response = await got('https://api.apify.com/v2/browser-info', {
    agent: {
        https: new HttpsProxyAgent({
            keepAlive: true,
            keepAliveMsecs: 1000,
            maxSockets: 256,
            maxFreeSockets: 256,
            scheduling: 'lifo',
            proxy: 'http://108.59.14.200:13402'
        })
    }
});

console.log(response.body);

As I said, if I downgrade the Node version to 18, the same code just works. And I have other proxy providers that work fine with this code too. What might be the reason?

Invalid `connection: close` header

Occurred randomly. Needs to be fixed in header-generator.

require('.').gotScraping.get('http://google.com').on('redirect', o => console.log(o.url.href)).then(res => console.log(res.request.options.headers, res.headers, res.body));
$ node demo.js 
http://www.google.com/
https://www.google.com/?gws_rd=ssl
node:internal/process/promises:246
          triggerUncaughtException(err, true /* fromPromise */);
          ^

RequestError: Invalid 'connection' header: close
    at Request._beforeError (/home/szm/Desktop/got-scraping/node_modules/got-cjs/dist/source/core/index.js:333:21)
    at Request._onResponseBase (/home/szm/Desktop/got-scraping/node_modules/got-cjs/dist/source/core/index.js:731:22)
    at ClientRequest.setHeader (/home/szm/Desktop/got-scraping/node_modules/http2-wrapper/source/client-request.js:520:10)
    at new ClientRequest (/home/szm/Desktop/got-scraping/node_modules/http2-wrapper/source/client-request.js:121:10)
    at module.exports (/home/szm/Desktop/got-scraping/node_modules/http2-wrapper/source/auto.js:195:29)
    at async Request._makeRequest (/home/szm/Desktop/got-scraping/node_modules/got-cjs/dist/source/core/index.js:1000:37)
    at async Request._onResponseBase (/home/szm/Desktop/got-scraping/node_modules/got-cjs/dist/source/core/index.js:728:17)
    at async Request._onResponse (/home/szm/Desktop/got-scraping/node_modules/got-cjs/dist/source/core/index.js:796:13) {
  input: undefined,
  code: 'ERR_GOT_REQUEST_ERROR',
  timings: undefined,
  options: Options {
    _unixOptions: { insecureHTTPParser: true },
    _internals: {
      request: [Function (anonymous)],
      agent: {
        http: TransformHeadersAgent {
          agent: Agent {
            _events: [Object: null prototype],
            _eventsCount: 2,
            _maxListeners: undefined,
            defaultPort: 80,
            protocol: 'http:',
            options: [Object: null prototype],
            requests: [Object: null prototype] {},
            sockets: [Object: null prototype] {},
            freeSockets: [Object: null prototype] {},
            keepAliveMsecs: 1000,
            keepAlive: false,
            maxSockets: Infinity,
            maxFreeSockets: 256,
            scheduling: 'lifo',
            maxTotalSockets: Infinity,
            totalSocketCount: 0,
            [Symbol(kCapture)]: false
          }
        },
        https: TransformHeadersAgent {
          agent: Agent {
            _events: [Object: null prototype],
            _eventsCount: 2,
            _maxListeners: undefined,
            defaultPort: 443,
            protocol: 'https:',
            options: [Object: null prototype],
            requests: [Object: null prototype] {},
            sockets: [Object: null prototype] {},
            freeSockets: [Object: null prototype] {},
            keepAliveMsecs: 1000,
            keepAlive: false,
            maxSockets: Infinity,
            maxFreeSockets: 256,
            scheduling: 'lifo',
            maxTotalSockets: Infinity,
            totalSocketCount: 0,
            maxCachedSessions: 100,
            _sessionCache: [Object],
            [Symbol(kCapture)]: false
          }
        },
        http2: undefined
      },
      h2session: undefined,
      decompress: true,
      timeout: {
        connect: undefined,
        lookup: undefined,
        read: undefined,
        request: 60000,
        response: undefined,
        secureConnect: undefined,
        send: undefined,
        socket: undefined
      },
      prefixUrl: '',
      body: undefined,
      form: undefined,
      json: undefined,
      cookieJar: undefined,
      ignoreInvalidCookies: false,
      searchParams: undefined,
      dnsLookup: undefined,
      dnsCache: undefined,
      context: {
        headerGenerator: HeaderGenerator {
          globalOptions: {
            browsers: [Array],
            operatingSystems: [Array],
            devices: [Array],
            locales: [Array],
            httpVersion: '2',
            browserListQuery: ''
          },
          browserListQuery: undefined,
          inputGeneratorNetwork: BayesianNetwork {
            nodesInSamplingOrder: [Array],
            nodesByName: [Object]
          },
          headerGeneratorNetwork: BayesianNetwork {
            nodesInSamplingOrder: [Array],
            nodesByName: [Object]
          },
          uniqueBrowsers: [
            [Object], [Object], [Object], [Object], [Object], [Object],
            [Object], [Object], [Object], [Object], [Object], [Object],
            [Object], [Object], [Object], [Object], [Object], [Object],
            [Object], [Object], [Object], [Object], [Object], [Object],
            [Object], [Object], [Object], [Object], [Object], [Object],
            [Object], [Object], [Object], [Object], [Object], [Object],
            [Object], [Object], [Object], [Object], [Object], [Object],
            [Object], [Object], [Object], [Object], [Object], [Object],
            [Object], [Object], [Object], [Object], [Object], [Object],
            [Object], [Object], [Object], [Object], [Object], [Object],
            [Object], [Object], [Object], [Object], [Object], [Object],
            [Object], [Object], [Object], [Object], [Object], [Object],
            [Object], [Object], [Object], [Object], [Object], [Object],
            [Object], [Object], [Object], [Object], [Object], [Object],
            [Object], [Object], [Object], [Object], [Object], [Object],
            [Object], [Object], [Object], [Object], [Object], [Object],
            [Object], [Object], [Object], [Object],
            ... 112 more items
          ]
        },
        useHeaderGenerator: true,
        insecureHTTPParser: true,
        sessionData: undefined
      },
      hooks: {
        init: [
          [Function: optionsValidationHandler],
          [Function: customOptionsHook]
        ],
        beforeRequest: [
          [Function: insecureParserHook],
          [Function: sessionDataHook],
          [Function: http2Hook],
          [AsyncFunction: proxyHook],
          [AsyncFunction: browserHeadersHook],
          [Function: tlsHook]
        ],
        beforeError: [],
        beforeRedirect: [ [Function: refererHook] ],
        beforeRetry: [],
        afterResponse: []
      },
      followRedirect: true,
      maxRedirects: 10,
      cache: undefined,
      throwHttpErrors: false,
      username: '',
      password: '',
      http2: true,
      allowGetBody: false,
      headers: {
        'sec-ch-ua': '"(Not(A:Brand";v="8", "Chromium";v="99", "Google Chrome";v="99"',
        'sec-ch-ua-mobile': '?0',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36',
        accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'sec-fetch-site': 'same-site',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-user': '?1',
        'sec-fetch-dest': 'document',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US',
        connection: 'close',
        referer: 'http://www.google.com'
      },
      methodRewriting: false,
      dnsLookupIpVersion: undefined,
      parseJson: [Function: parse],
      stringifyJson: [Function: stringify],
      retry: {
        limit: 0,
        methods: [ 'GET', 'PUT', 'HEAD', 'DELETE', 'OPTIONS', 'TRACE' ],
        statusCodes: [
          408, 413, 429, 500,
          502, 503, 504, 521,
          522, 524
        ],
        errorCodes: [
          'ETIMEDOUT',
          'ECONNRESET',
          'EADDRINUSE',
          'ECONNREFUSED',
          'EPIPE',
          'ENOTFOUND',
          'ENETUNREACH',
          'EAI_AGAIN'
        ],
        maxRetryAfter: undefined,
        calculateDelay: [Function: calculateDelay],
        backoffLimit: Infinity,
        noise: 100
      },
      localAddress: undefined,
      method: 'GET',
      createConnection: undefined,
      cacheOptions: {
        shared: undefined,
        cacheHeuristic: undefined,
        immutableMinTimeToLive: undefined,
        ignoreCargoCult: undefined
      },
      https: {
        alpnProtocols: undefined,
        rejectUnauthorized: false,
        checkServerIdentity: undefined,
        certificateAuthority: undefined,
        key: undefined,
        certificate: undefined,
        passphrase: undefined,
        pfx: undefined,
        ciphers: 'TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:ECDHE-RSA-AES128-SHA:ECDHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA:AES256-SHA',
        honorCipherOrder: undefined,
        minVersion: 'TLSv1',
        maxVersion: 'TLSv1.3',
        signatureAlgorithms: 'ecdsa_secp256r1_sha256:rsa_pss_rsae_sha256:rsa_pkcs1_sha256:ecdsa_secp384r1_sha384:rsa_pss_rsae_sha384:rsa_pkcs1_sha384:rsa_pss_rsae_sha512:rsa_pkcs1_sha512',
        tlsSessionLifetime: undefined,
        dhparam: undefined,
        ecdhCurve: 'X25519:prime256v1:secp384r1',
        certificateRevocationLists: undefined
      },
      encoding: undefined,
      resolveBodyOnly: false,
      isStream: false,
      responseType: 'text',
      url: <ref *1> URL {
        [Symbol(context)]: URLContext {
          flags: 912,
          scheme: 'https:',
          username: '',
          password: '',
          host: 'www.google.com',
          port: null,
          path: [ '' ],
          query: 'gws_rd=ssl',
          fragment: null
        },
        [Symbol(query)]: URLSearchParams {
          [Symbol(query)]: [ 'gws_rd', 'ssl' ],
          [Symbol(context)]: [Circular *1]
        }
      },
      pagination: {
        transform: [Function: transform],
        paginate: [Function: paginate],
        filter: [Function: filter],
        shouldContinue: [Function: shouldContinue],
        countLimit: Infinity,
        backoff: 0,
        requestLimit: 10000,
        stackAllItems: false
      },
      setHost: true,
      maxHeaderSize: undefined
    },
    _merging: false,
    _init: [
      {
        handlers: [ [Function: fixDecompress] ],
        mutableDefaults: true,
        http2: true,
        https: { rejectUnauthorized: false },
        throwHttpErrors: false,
        timeout: { request: 60000 },
        retry: { limit: 0 },
        headers: { 'user-agent': undefined },
        context: {
          headerGenerator: HeaderGenerator {
            globalOptions: [Object],
            browserListQuery: undefined,
            inputGeneratorNetwork: [BayesianNetwork],
            headerGeneratorNetwork: [BayesianNetwork],
            uniqueBrowsers: [Array]
          },
          useHeaderGenerator: true,
          insecureHTTPParser: true
        },
        agent: {
          http: TransformHeadersAgent { agent: [Agent] },
          https: TransformHeadersAgent { agent: [Agent] }
        },
        hooks: {
          init: [
            [Function: optionsValidationHandler],
            [Function: customOptionsHook]
          ],
          beforeRequest: [
            [Function: insecureParserHook],
            [Function: sessionDataHook],
            [Function: http2Hook],
            [AsyncFunction: proxyHook],
            [AsyncFunction: browserHeadersHook],
            [Function: tlsHook]
          ],
          beforeRedirect: [ [Function: refererHook] ]
        }
      },
      { method: 'get' }
    ]
  }
}

Got-scraping is broken with NodeJS 18.17

When using a proxy, got-scraping does not work with the latest version of the Node.js LTS.
To reproduce the problem, install Node.js 18.17 and run the following snippet:

const { gotScraping } = require('got-scraping');

(async () => {
	try {
		const res = await gotScraping.get({
			url: 'https://ipinfo.io/',
			proxyUrl: 'YOUR_PROXY_URL'
		});
		console.log(res.body);
	} catch(e) {
		console.error(e.message);
	}
})();

You should get the error "Proxy responded with 503 Service Unavailable: 3702 bytes".
The error does not occur with NodeJS 18.16.

Jest workers cannot handle circular structures

https://github.com/apify/got-scraping/compare/ts...e559bc263272236c2bce32d6515cfc47cb40788b#files_bucket (GitHub plz fix links)

szm@solus ~/Desktop/got-scraping $ npm test

> [email protected] test
> jest --maxWorkers=3 --collect-coverage

 PASS  test/options-validator.test.ts
  Options validation
    βœ“ should validate proxyUrl (7 ms)
    βœ“ should validate useHeaderGenerator (1 ms)
    βœ“ should validate headerGeneratorOptions (1 ms)

 PASS  test/proxy.test.ts
  Proxy
    βœ“ should not add an agent if proxyUrl is not provided (2 ms)
    βœ“ should add an agent if proxyUrl is provided (3 ms)
    βœ“ should throw on invalid proxy protocol (14 ms)
    agents
      βœ“ should support http request over http proxy (1 ms)
      βœ“ should support https request over http proxy
      βœ“ should support http2 request over http proxy (1 ms)
      βœ“ should support http request over https proxy (1 ms)
      βœ“ should support https request over https proxy (1 ms)
      βœ“ should support http2 request over https proxy
      βœ“ should support http request over http2 proxy (1 ms)
      βœ“ should support https request over http2 proxy (1 ms)
      βœ“ should support http2 request over http2 proxy

 PASS  test/agent.test.ts
  TransformHeadersAgent
    βœ“ Pascal-Case (2 ms)
    βœ“ transformRequest (19 ms)
    βœ“ leaves x-header as it is (4 ms)
    βœ“ http.request with agent (3 ms)
    βœ“ first header in sortedHeaders is always first (2 ms)
    respects native behavior
      βœ“ content-length removal (2 ms)
      βœ“ transfer-encoding removal (3 ms)
      βœ“ explicit content-length (2 ms)
      βœ“ explicit connection (2 ms)

 PASS  test/custom-options.test.ts
  Custom options
    βœ“ should move custom options to context (10 ms)

 PASS  test/scraping-defaults.test.ts
  Scraping defaults
    βœ“ should set correct defaults (26 ms)
    βœ“ should allow user to override the defaults (5 ms)
    βœ“ should have compatible defaults with node 10 (1 ms)

 PASS  test/browser-headers.test.ts
  Browser headers
    βœ“ should not generate headers without useHeaderGenerator (1 ms)
    βœ“ should generate headers with useHeaderGenerator (1 ms)
    βœ“ should add headers when http2 is used
    βœ“ should add headers when http1 is used (1 ms)
    βœ“ should pass option to header generator
    βœ“ should override default ua header (1 ms)
    βœ“ should have working generator (7 ms)
    βœ“ should have capitalized headers with http1 (3 ms)
    βœ“ should respect casing of unrecognized headers (3 ms)
    mergeHeaders
      βœ“ should merge headers
      βœ“ should allow deleting header
      βœ“ should allow adding header (1 ms)

node:internal/child_process/serialization:127
    const string = JSONStringify(message) + '\n';
                   ^

TypeError: Converting circular structure to JSON
    --> starting at object with constructor 'Object'
    |     property '_httpMessage' -> object with constructor 'Object'
    --- property 'socket' closes the circle
    at stringify (<anonymous>)
    at writeChannelMessage (node:internal/child_process/serialization:127:20)
    at process.target._send (node:internal/child_process:822:17)
    at process.target.send (node:internal/child_process:722:19)
    at reportSuccess (/home/szm/Desktop/got-scraping/node_modules/jest-worker/build/workers/processChild.js:67:11)
 FAIL  test/main.test.ts
  ● Test suite failed to run

    Jest worker encountered 4 child process exceptions, exceeding retry limit

      at ChildProcessWorker.initialize (node_modules/jest-worker/build/workers/ChildProcessWorker.js:193:21)

Add TypeScript Types?

It would be nice to add some TypeScript type definitions. I may make a PR, but no guarantees.
