CrawlStrike

CrawlStrike is a high-performance, multiprocessed recursive web crawler built for reconnaissance and surface mapping. It uses Manager-based shared state for fast, cross-process link discovery while ensuring no URL is processed twice.
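The deduplication idea can be sketched as follows (a minimal illustration using a multiprocessing.Manager dict; claim_url and the overall structure are hypothetical, not CrawlStrike's actual internals):

```python
from multiprocessing import Manager

def claim_url(visited, url):
    """Mark a URL as seen in the shared dict; return True only on first sight."""
    if url in visited:
        return False
    visited[url] = True
    return True

def demo():
    # The Manager dict lives in a server process, so every worker
    # process sees the same shared "visited" set of URLs.
    with Manager() as manager:
        visited = manager.dict()
        first = claim_url(visited, "https://example.com/")
        second = claim_url(visited, "https://example.com/")
        return first, second

if __name__ == "__main__":
    print(demo())  # (True, False): the second attempt is skipped
```

Because the dict is shared, any worker that claims a URL prevents every other worker from fetching it again.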

Key Features

  • Multiprocessed Engine: Leverages all CPU cores for simultaneous request handling.

  • Deep Extraction:

    • Parses HTML tags (a, script, img, iframe, form).

    • Regex-based discovery of absolute and relative paths in JavaScript, JSON, XML, and TXT content.

  • Resumable Scans: Automated state saving to .pkl files. If a scan is interrupted with Ctrl+C, you can resume exactly where you left off by running the script again with the same parameters.

  • Flexible Proxying: Native support for both HTTP and SOCKS5 (e.g., Tor integration).

  • Categorized Logging: Automatically sorts findings into 2xx.txt, 3xx.txt, 4xx.txt, 5xx.txt, and error.txt.
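As a rough illustration of the regex-based discovery listed above (the patterns here are assumptions for the sketch, not CrawlStrike's actual expressions), absolute URLs and root-relative paths can be pulled out of JavaScript or JSON bodies like this:

```python
import re

# Assumed patterns: bare absolute URLs and quoted root-relative paths.
ABSOLUTE = re.compile(r'https?://[^\s"\'<>]+')
RELATIVE = re.compile(r'["\'](/[^\s"\'<>]*)["\']')

def discover_paths(body: str) -> set:
    """Extract absolute URLs and root-relative paths from a text body."""
    found = set(ABSOLUTE.findall(body))
    found.update(RELATIVE.findall(body))
    return found

sample = 'fetch("/api/v1/users"); var cdn = "https://cdn.example.com/app.js";'
print(sorted(discover_paths(sample)))
```

Relative hits would then be joined against the page's base URL before being queued for crawling.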

Installation

pip3 install -r requirements.txt

Usage

python3 crawlstrike.py [URL] [OPTIONS]

Arguments

  • -w, --workers: Number of parallel processes (default: CPU count).

  • --proxy: HTTP/HTTPS proxy (e.g., http://127.0.0.1:8080). If a SOCKS5 proxy is also defined, the HTTP proxy is disabled.

  • --socks: SOCKS proxy (e.g., socks5://127.0.0.1:9050).

  • --no-subdomains: Restrict the crawl strictly to the main domain.

  • --header: Add custom HTTP headers (format: "Key: Value").

  • --follow-redirect: Follow HTTP redirects (3xx).

  • --output: Specify the output folder (defaults to the target domain name).

Examples
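A few invocations of the shape the arguments table implies (the target URL and proxy addresses below are placeholders, not real endpoints):

```
# Crawl with 8 workers, main domain only
python3 crawlstrike.py https://example.com -w 8 --no-subdomains

# Route through Tor and follow redirects
python3 crawlstrike.py https://example.com --socks socks5://127.0.0.1:9050 --follow-redirect

# Custom header and explicit output folder
python3 crawlstrike.py https://example.com --header "User-Agent: Mozilla/5.0" --output results
```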

Output

By default, the script creates an output folder named after the target URL's domain.

If an output folder is specified with --output, the script writes its results there instead.
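Deriving the default folder name from the target URL can be sketched as follows (an assumption about the naming scheme, taking only the hostname part):

```python
from urllib.parse import urlparse

def default_output_folder(url: str) -> str:
    """Default output folder: the hostname of the target URL."""
    return urlparse(url).netloc

print(default_output_folder("https://example.com/login"))  # example.com
```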

Error Handling (error.txt)

The error.txt file captures requests that failed before an HTTP status could be returned:

  • Network: ConnectError, ConnectTimeout (DNS or firewall issues).

  • Protocol: SSLError, ProtocolError (Encryption/Handshake failures).

  • Streaming: ReadTimeout, ReadError (Interrupted data transfer).

  • Logic: InvalidURL, RemoteProtocolError.
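The categorized-logging scheme can be sketched as a simple mapping from request outcome to log file (the file names come from the feature list above; the routing function itself is an assumption):

```python
from typing import Optional

def log_file_for(status: Optional[int], error: Optional[Exception] = None) -> str:
    """Pick the log file: error.txt for network failures, else the status bucket."""
    if error is not None or status is None:
        return "error.txt"
    # 200 -> 2xx.txt, 301 -> 3xx.txt, 404 -> 4xx.txt, 503 -> 5xx.txt
    return f"{status // 100}xx.txt"

print(log_file_for(200))                      # 2xx.txt
print(log_file_for(301))                      # 3xx.txt
print(log_file_for(None, ConnectionError()))  # error.txt
```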
