CrawlStrike

CrawlStrike is a high-performance, multiprocessed recursive web crawler built for reconnaissance and surface mapping. It uses Manager-based shared state for fast, cross-process link discovery while ensuring no URL is processed twice.
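The deduplication idea can be sketched as follows (a minimal illustration using a multiprocessing.Manager dict; claim_url and the overall structure are hypothetical, not CrawlStrike's actual internals):

```python
from multiprocessing import Manager

def claim_url(visited, url):
    """Mark a URL as seen in the shared dict; return True only on first sight."""
    if url in visited:
        return False
    visited[url] = True
    return True

def demo():
    # The Manager dict lives in a server process, so every worker
    # process sees the same shared "visited" set of URLs.
    with Manager() as manager:
        visited = manager.dict()
        first = claim_url(visited, "https://example.com/")
        second = claim_url(visited, "https://example.com/")
        return first, second

if __name__ == "__main__":
    print(demo())  # (True, False): the second attempt is skipped
```

Because the dict is shared, any worker that claims a URL prevents every other worker from fetching it again.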

Key Features

  • Multiprocessed Engine: Leverages all CPU cores for simultaneous request handling.

  • Deep Extraction:

    • Parses HTML tags (a, script, img, iframe, form).

    • Regex-based discovery of absolute and relative paths in JavaScript, JSON, XML, and TXT content.

  • Resumable Scans: Automated state saving to .pkl files. If a scan is interrupted with Ctrl+C, you can resume exactly where you left off by running the script again with the same parameters.

  • Flexible Proxying: Native support for both HTTP and SOCKS5 (e.g., Tor integration).

  • Categorized Logging: Automatically sorts findings into 2xx.txt, 3xx.txt, 4xx.txt, 5xx.txt, and error.txt.
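As a rough illustration of the regex-based discovery listed above (the patterns here are assumptions for the sketch, not CrawlStrike's actual expressions), absolute URLs and root-relative paths can be pulled out of JavaScript or JSON bodies like this:

```python
import re

# Assumed patterns: bare absolute URLs and quoted root-relative paths.
ABSOLUTE = re.compile(r'https?://[^\s"\'<>]+')
RELATIVE = re.compile(r'["\'](/[^\s"\'<>]*)["\']')

def discover_paths(body: str) -> set:
    """Extract absolute URLs and root-relative paths from a text body."""
    found = set(ABSOLUTE.findall(body))
    found.update(RELATIVE.findall(body))
    return found

sample = 'fetch("/api/v1/users"); var cdn = "https://cdn.example.com/app.js";'
print(sorted(discover_paths(sample)))
```

Relative hits would then be joined against the page's base URL before being queued for crawling.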

Installation

pip3 install -r requirements.txt

Usage

python3 crawlstrike.py [URL] [OPTIONS]

Arguments

  • -w, --workers: Number of parallel processes (default: CPU count).

  • --proxy: HTTP/HTTPS proxy (e.g., http://127.0.0.1:8080). If a SOCKS5 proxy is also defined, the HTTP proxy is disabled.

  • --socks: SOCKS proxy (e.g., socks5://127.0.0.1:9050).

  • --no-subdomains: Restrict the crawl strictly to the main domain.

  • --header: Add custom HTTP headers (format: "Key: Value").

  • --follow-redirect: Follow HTTP redirects (3xx).

  • --output: Specify the output folder (defaults to the target domain name).

Examples
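A few invocations of the shape the arguments table implies (the target URL and proxy addresses below are placeholders, not real endpoints):

```
# Crawl with 8 workers, main domain only
python3 crawlstrike.py https://example.com -w 8 --no-subdomains

# Route through Tor and follow redirects
python3 crawlstrike.py https://example.com --socks socks5://127.0.0.1:9050 --follow-redirect

# Custom header and explicit output folder
python3 crawlstrike.py https://example.com --header "User-Agent: Mozilla/5.0" --output results
```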

Output

By default, the script creates an output folder named after the target URL's domain.

If an output folder is specified with --output, the script writes its results there instead.
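Deriving the default folder name from the target URL can be sketched as follows (an assumption about the naming scheme, taking only the hostname part):

```python
from urllib.parse import urlparse

def default_output_folder(url: str) -> str:
    """Default output folder: the hostname of the target URL."""
    return urlparse(url).netloc

print(default_output_folder("https://example.com/login"))  # example.com
```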

Error Handling (error.txt)

The error.txt file captures requests that failed before an HTTP status could be returned:

  • Network: ConnectError, ConnectTimeout (DNS or firewall issues).

  • Protocol: SSLError, ProtocolError (Encryption/Handshake failures).

  • Streaming: ReadTimeout, ReadError (Interrupted data transfer).

  • Logic: InvalidURL, RemoteProtocolError.
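The categorized-logging scheme can be sketched as a simple mapping from request outcome to log file (the file names come from the feature list above; the routing function itself is an assumption):

```python
from typing import Optional

def log_file_for(status: Optional[int], error: Optional[Exception] = None) -> str:
    """Pick the log file: error.txt for network failures, else the status bucket."""
    if error is not None or status is None:
        return "error.txt"
    # 200 -> 2xx.txt, 301 -> 3xx.txt, 404 -> 4xx.txt, 503 -> 5xx.txt
    return f"{status // 100}xx.txt"

print(log_file_for(200))                      # 2xx.txt
print(log_file_for(301))                      # 3xx.txt
print(log_file_for(None, ConnectionError()))  # error.txt
```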
