Bot Filter¶

The Bot Filter middleware in idum-proxy provides protection against unauthorized and malicious bot traffic. It allows you to create blacklists and whitelists to control which bots can access your application.

Configuration¶

Bot filtering is configured in the JSON configuration file as follows:

{
  "middlewares": {
    "security": {
      "bot_filter": {
        "enabled": true,
        "blacklist": [
          {
            "name": "googlebot",
            "user-agent": "crawl-***-***-***-***.googlebot.com"
          }
        ],
        "whitelist": []
      }
    }
  }
}

Configuration Options¶

| Option | Type | Default | Description |
|---|---|---|---|
| `enabled` | boolean | `true` | Enables or disables the bot filter middleware |
| `blacklist` | array | `[]` | List of bot definitions to block |
| `whitelist` | array | `[]` | List of bot definitions to explicitly allow |

Bot Definition Format¶

Each bot in the blacklist or whitelist is defined with the following properties:

| Property | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Identifier for the bot (for logging and reference) |
| `user-agent` | string | No | Pattern to match against the User-Agent header |
| `ip` | string | No | IP address or CIDR range to match against the client IP |
| `referer` | string | No | Pattern to match against the Referer header |
| `path` | string | No | URL path pattern that the rule applies to |

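At least one matching criterion (`user-agent`, `ip`, or `referer`) must be provided. The `path` property only restricts where a rule applies, so it does not count as a matching criterion on its own.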
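The requirements above can be checked before a configuration is deployed. The following is a minimal sketch, not part of idum-proxy: `validate_bot_definition` and `MATCH_KEYS` are hypothetical names, and the use of Python's `ipaddress` module to check the `ip` value is an assumption about what counts as a valid address or CIDR range.

```python
import ipaddress

# Matching criteria named in the table above; "path" is deliberately excluded.
MATCH_KEYS = ("user-agent", "ip", "referer")


def validate_bot_definition(bot: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry looks valid."""
    problems = []

    name = bot.get("name")
    if not isinstance(name, str) or not name:
        problems.append("'name' is required and must be a non-empty string")

    if not any(bot.get(key) for key in MATCH_KEYS):
        problems.append("at least one of user-agent, ip, or referer is required")

    if "ip" in bot:
        try:
            # strict=False accepts a single address as well as CIDR notation.
            ipaddress.ip_network(str(bot["ip"]), strict=False)
        except ValueError:
            problems.append(f"'ip' is not a valid address or CIDR range: {bot['ip']!r}")

    return problems
```

An entry that has only `name` and `path` is reported as invalid by this sketch, because it carries no matching criterion.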

Filtering Behavior¶

The Bot Filter middleware processes requests in the following order:

1. If the feature is disabled (`"enabled": false`), all requests pass through unaffected.
2. If the request matches any entry in the whitelist, it is allowed to proceed.
3. If the request matches any entry in the blacklist, it is blocked (returns 403 Forbidden).
4. If the request doesn't match any rule, it is allowed to proceed.
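The evaluation order above can be sketched in Python as follows. This is an illustration under stated assumptions, not idum-proxy's actual code: `decide` and `matches` are hypothetical names, and the matcher here compares only the `user-agent` pattern, case-insensitively, which the documentation does not specify.

```python
from fnmatch import fnmatchcase


def matches(rule: dict, request: dict) -> bool:
    """Simplified matcher for this sketch: checks only the User-Agent pattern."""
    pattern = rule.get("user-agent")
    if pattern is None:
        return False
    # Case-insensitive comparison is an assumption made for this example.
    return fnmatchcase(request.get("user-agent", "").lower(), pattern.lower())


def decide(config: dict, request: dict) -> str:
    """Return "allow" or "block" following the documented order."""
    if not config.get("enabled", True):
        return "allow"  # filter disabled: everything passes
    if any(matches(rule, request) for rule in config.get("whitelist", [])):
        return "allow"  # whitelist is checked before the blacklist
    if any(matches(rule, request) for rule in config.get("blacklist", [])):
        return "block"  # the proxy would answer 403 Forbidden here
    return "allow"      # no rule matched


config = {
    "enabled": True,
    "blacklist": [{"name": "all-bots", "user-agent": "*bot*"}],
    "whitelist": [{"name": "bing", "user-agent": "*bingbot*"}],
}
print(decide(config, {"user-agent": "Mozilla/5.0 (compatible; bingbot/2.0)"}))
print(decide(config, {"user-agent": "EvilBot/1.0"}))
```

Because the whitelist is consulted first, the bingbot request is allowed even though it also matches the broader `*bot*` blacklist entry.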

Pattern Matching¶

idum-proxy supports different pattern matching techniques for different fields:

user-agent: Supports wildcard patterns with * (e.g., "*googlebot*")
ip: Supports exact IP addresses or CIDR notation (e.g., "192.168.1.1" or "192.168.1.0/24")
referer: Supports wildcard patterns with * (e.g., "*.example.com/*")
path: Supports wildcard patterns with * (e.g., "/api/*")
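To make the two matching styles concrete, here is a small Python sketch. It is an approximation rather than idum-proxy's implementation: `wildcard_match` and `ip_match` are hypothetical helpers, and using the standard library's `fnmatch` and `ipaddress` modules is an assumption. Note that `fnmatchcase` also treats `?` and `[...]` as special characters, which the documentation above does not mention.

```python
import ipaddress
from fnmatch import fnmatchcase


def wildcard_match(pattern: str, value: str) -> bool:
    """Match value against a pattern in which * stands for any run of characters."""
    return fnmatchcase(value, pattern)


def ip_match(rule_ip: str, client_ip: str) -> bool:
    """Match a client IP against an exact address or a CIDR range."""
    # strict=False lets a bare address such as "192.168.1.1" act as a /32 network.
    network = ipaddress.ip_network(rule_ip, strict=False)
    return ipaddress.ip_address(client_ip) in network


print(wildcard_match("/api/*", "/api/users"))
print(wildcard_match("*.example.com/*", "https://www.example.com/page"))
print(ip_match("192.168.1.0/24", "192.168.1.42"))
```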

Examples¶

Blocking Known Bad Bots¶

"bot_filter": {
  "enabled": true,
  "blacklist": [
    {
      "name": "fake-googlebot",
      "user-agent": "*googlebot*",
      "ip": "192.168.1.100"
    },
    {
      "name": "scraper-bot",
      "user-agent": "*scraper*"
    },
    {
      "name": "malicious-crawler",
      "user-agent": "*crawler*",
      "path": "/admin/*"
    }
  ],
  "whitelist": []
}

Allowing Only Specific Bots¶

"bot_filter": {
  "enabled": true,
  "blacklist": [
    {
      "name": "all-bots",
      "user-agent": "*bot*"
    }
  ],
  "whitelist": [
    {
      "name": "google",
      "user-agent": "*googlebot*",
      "ip": "66.249.66.0/24"
    },
    {
      "name": "bing",
      "user-agent": "*bingbot*"
    }
  ]
}

Protecting Specific Paths¶

"bot_filter": {
  "enabled": true,
  "blacklist": [
    {
      "name": "any-bot-on-private-paths",
      "user-agent": "*bot*",
      "path": "/private/*"
    },
    {
      "name": "any-bot-on-admin",
      "user-agent": "*bot*",
      "path": "/admin/*"
    }
  ],
  "whitelist": []
}

Logs and Monitoring¶

When a bot is blocked by the filter, idum-proxy logs the event with the following information:

Bot name from the matching rule
Client IP address
User-Agent header
Request path
Matched rule details

These logs can be used to monitor bot activity and refine your filtering rules.

Best Practices¶

Start with monitoring: Initially deploy with minimal blocking to observe patterns
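Verify legitimate bots: For search engines, confirm authenticity with a reverse DNS lookup on the client IP followed by a forward lookup of the returned hostname, since the User-Agent header alone is easy to fake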
Be specific: Target specific bot behaviors rather than broad patterns
Whitelist good bots: Explicitly whitelist legitimate bots you want to allow
Regularly update rules: Bot patterns change regularly, so keep your rules current
Monitor false positives: Watch for legitimate traffic being blocked
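The reverse DNS check mentioned in the list above can be done outside the proxy, for example when reviewing logs. The sketch below is a hedged illustration using only Python's `socket` module: `is_verified_crawler`, `reverse_lookup`, and `forward_lookup` are hypothetical names, and the hostname suffixes are an assumption that should be checked against the search engine's own documentation.

```python
import socket

# Assumed hostname suffixes for Google's crawlers; verify against Google's docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """Return the PTR hostname for an IP address."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(hostname: str) -> set[str]:
    """Return every address the hostname resolves to."""
    return {info[4][0] for info in socket.getaddrinfo(hostname, None)}


def is_verified_crawler(ip, suffixes=GOOGLE_SUFFIXES,
                        reverse=reverse_lookup, forward=forward_lookup) -> bool:
    """Reverse-resolve the IP, check the domain, then confirm with a forward lookup."""
    try:
        hostname = reverse(ip)
        if not hostname.lower().rstrip(".").endswith(suffixes):
            return False
        # The forward lookup must lead back to the original IP.
        return ip in forward(hostname)
    except OSError:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        return False
```

The `reverse` and `forward` parameters exist so the check can be exercised with stub resolvers in tests, without network access.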

Security Considerations¶

Bot filtering should be used as part of a larger security strategy
Sophisticated bots can spoof User-Agent and Referer headers and rotate through proxy IP addresses, so header and IP rules alone will not stop them
Consider combining with rate limiting and behavioral analysis for better protection
For critical applications, consider using CAPTCHA or JavaScript challenges in addition to bot filtering