# Bot Filter
The Bot Filter middleware in idum-proxy provides protection against unauthorized and malicious bot traffic. It allows you to create blacklists and whitelists to control which bots can access your application.
## Configuration
Bot filtering is configured in the JSON configuration file as follows:
```json
{
  "middlewares": {
    "security": {
      "bot_filter": {
        "enabled": true,
        "blacklist": [
          {
            "name": "googlebot",
            "user-agent": "crawl-***-***-***-***.googlebot.com"
          }
        ],
        "whitelist": []
      }
    }
  }
}
```
## Configuration Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `enabled` | boolean | `true` | Enables or disables the bot filter middleware |
| `blacklist` | array | `[]` | List of bot definitions to block |
| `whitelist` | array | `[]` | List of bot definitions to explicitly allow |
## Bot Definition Format
Each bot in the blacklist or whitelist is defined with the following properties:
| Property | Type | Required | Description |
| --- | --- | --- | --- |
| `name` | string | Yes | Identifier for the bot (for logging and reference) |
| `user-agent` | string | No | Pattern to match against the User-Agent header |
| `ip` | string | No | IP address or CIDR range to match against the client IP |
| `referer` | string | No | Pattern to match against the Referer header |
| `path` | string | No | URL path pattern to apply this rule to |

At least one matching criterion (`user-agent`, `ip`, or `referer`) must be provided.
## Filtering Behavior
The Bot Filter middleware processes requests in the following order:
1. If the middleware is disabled (`enabled: false`), all requests pass through unaffected.
2. If the request matches any entry in the `whitelist`, it is allowed to proceed.
3. If the request matches any entry in the `blacklist`, it is blocked with a `403 Forbidden` response.
4. If the request matches no rule, it is allowed to proceed.
## Pattern Matching
idum-proxy supports different pattern matching techniques for different fields:
- `user-agent`: supports wildcard patterns with `*` (e.g., `"crawl-*-*.googlebot.com"`)
- `ip`: supports exact IP addresses or CIDR notation (e.g., `"192.168.1.1"` or `"192.168.1.0/24"`)
- `referer`: supports wildcard patterns with `*` (e.g., `"*.example.com/*"`)
- `path`: supports wildcard patterns with `*` (e.g., `"/api/*"`)
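Both kinds of matching can be approximated with Python's standard library; this is a minimal sketch of the semantics, and idum-proxy's internal matcher may differ in details such as case sensitivity.

```python
import fnmatch
import ipaddress

def wildcard_match(value, pattern):
    """Match a header or path value against a '*' wildcard pattern."""
    return fnmatch.fnmatchcase(value, pattern)

def ip_match(client_ip, pattern):
    """Match a client IP against an exact address or a CIDR range.
    An exact address like "192.168.1.1" is treated as a /32 network."""
    network = ipaddress.ip_network(pattern, strict=False)
    return ipaddress.ip_address(client_ip) in network
```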
## Examples

### Blocking Known Bad Bots
```json
"bot_filter": {
  "enabled": true,
  "blacklist": [
    {
      "name": "fake-googlebot",
      "user-agent": "*googlebot*",
      "ip": "192.168.1.100"
    },
    {
      "name": "scraper-bot",
      "user-agent": "*scraper*"
    },
    {
      "name": "malicious-crawler",
      "user-agent": "*crawler*",
      "path": "/admin/*"
    }
  ],
  "whitelist": []
}
```
### Allowing Only Specific Bots
```json
"bot_filter": {
  "enabled": true,
  "blacklist": [
    {
      "name": "all-bots",
      "user-agent": "*bot*"
    }
  ],
  "whitelist": [
    {
      "name": "google",
      "user-agent": "*googlebot*",
      "ip": "66.249.66.0/24"
    },
    {
      "name": "bing",
      "user-agent": "*bingbot*"
    }
  ]
}
```
### Protecting Specific Paths
```json
"bot_filter": {
  "enabled": true,
  "blacklist": [
    {
      "name": "any-bot-on-private-paths",
      "user-agent": "*bot*",
      "path": "/private/*"
    },
    {
      "name": "any-bot-on-admin",
      "user-agent": "*bot*",
      "path": "/admin/*"
    }
  ],
  "whitelist": []
}
```
## Logs and Monitoring
When a bot is blocked by the filter, idum-proxy logs the event with the following information:
- Bot name from the matching rule
- Client IP address
- User-Agent header
- Request path
- Matched rule details
These logs can be used to monitor bot activity and refine your filtering rules.
## Best Practices
- **Start with monitoring**: Initially deploy with minimal blocking to observe traffic patterns
- **Verify legitimate bots**: For search engines, verify their authenticity with a reverse DNS lookup
- **Be specific**: Target specific bot behaviors rather than broad patterns
- **Whitelist good bots**: Explicitly whitelist legitimate bots you want to allow
- **Update rules regularly**: Bot patterns change over time; keep your rules current
- **Monitor false positives**: Watch for legitimate traffic being blocked
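The reverse DNS verification mentioned above is usually done as forward-confirmed reverse DNS: resolve the IP to a hostname, check the domain, then resolve the hostname back and confirm it maps to the same IP. A minimal sketch using Python's `socket` module; the `reverse`/`forward` parameters are only there so the logic can be exercised without live DNS, and the suffix list is an example, not an exhaustive set.

```python
import socket

def verify_search_bot(client_ip,
                      allowed_suffixes=(".googlebot.com", ".google.com"),
                      reverse=socket.gethostbyaddr,
                      forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS: the IP must reverse-resolve to an
    allowed domain, and that hostname must resolve back to the same IP."""
    try:
        host = reverse(client_ip)[0]
    except OSError:
        return False
    if not host.endswith(allowed_suffixes):
        return False
    try:
        forward_ips = forward(host)[2]
    except OSError:
        return False
    return client_ip in forward_ips
```

A bot that merely spoofs a Googlebot User-Agent fails this check, because its IP will not reverse-resolve into the allowed domains.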
## Security Considerations
- Bot filtering should be used as part of a larger security strategy
- Sophisticated bots can spoof User-Agent headers and IP addresses
- Consider combining with rate limiting and behavioral analysis for better protection
- For critical applications, consider using CAPTCHA or JavaScript challenges in addition to bot filtering