[dnsdist] dnsdist tuning for high qps on nxdomain ddos
Jasper Aikema
jasper.aikema+dnsdist at gmail.com
Mon May 6 14:02:15 UTC 2024
> 200k QPS is fairly low based on what you describe. Would you mind
> sharing the whole configuration (redacting passwords and keys, of
> course), and telling us a bit more about the hardware dnsdist is running
> on?
The server is a virtual machine (Ubuntu 22.04) on our VMware platform with
16 GB of memory and 8 cores (Intel Xeon 4214R @ 2.4 GHz). I have pasted the
new config at the bottom of this message.
> 6 times the amount of cores is probably not a good idea. I usually
> advise to make it so that the number of threads is roughly equivalent to
> the number of cores that are dedicated to dnsdist, so in your case the
> number of addLocal + the number of newServer + the number of TCP workers
> should ideally match the number of cores you have. If you need to
> overcommit the cores a bit that's fine, but keep it to something like
> twice the number of cores you have, not 10 times.
> I'm pretty sure this does not make sense, I would first go with the
> default until you see TCP/DoT connections are not processed correctly.
I did overcommit / try to tune, because I was getting a high number of
udp-in-errors and also a high number of Drops in showServers().
If those issues are gone, I agree there should be no reason to overcommit.
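(As an aside: udp-in-errors generally means the kernel is dropping incoming UDP
packets, often because the socket receive buffers overflow. Raising the kernel
buffer limits is worth trying before overcommitting threads; the values below
are only an illustration, not a recommendation:

    # illustrative values only, to be tuned for the workload
    sysctl -w net.core.rmem_max=8388608
    sysctl -w net.core.rmem_default=8388608
    sysctl -w net.core.netdev_max_backlog=4096
)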
> When you say it doesn't work for NXDomain, I'm assuming you mean it
> doesn't solve the problem of random sub-domains attacks, not that a
> NXDomain is not properly cached/accounted?
Yes, that is indeed what I meant: the responses are getting cached, but
that is exactly why NXDOMAIN attacks work. They request a lot of
random sub-domains, so caching doesn't help with responsiveness.
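(One thing we have not tried yet: dnsdist's dynamic block rules can block or
rate-limit client IPs that generate many NXDOMAIN responses. That only helps
when the offending queries come from a manageable set of source addresses, so
it probably does not apply when the attack arrives via public resolvers, but
for reference a minimal sketch with made-up thresholds would look like:

    local dbr = dynBlockRulesGroup()
    -- block for 60s any /32 that received more than 20 NXDOMAIN answers/s,
    -- averaged over the last 10 seconds (thresholds are illustrative)
    dbr:setRCodeRate(DNSRCode.NXDOMAIN, 20, 10, "Exceeded NXDOMAIN rate", 60)

    -- dnsdist calls maintenance() roughly once per second
    function maintenance()
      dbr:apply()
    end
)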
> I expect lowering the number of threads will reduce the context switches
> a lot. If you are still not getting good QPS numbers, I would suggest
> checking if disabling the rules help, to figure out the bottleneck. You
> might also want to take a look with "perf top -p <pid of dnsdist>"
> during the high load to see where the CPU time is spent.
I have updated the config and lowered the number of threads, but now I get a
high number of udp-in-errors. The perf top command gives:
Samples: 80K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 15028605853 lost: 0/0 drop: 0/0

Overhead  Shared Object    Symbol
   4.78%  [kernel]         [k] __lock_text_start
   2.29%  [kernel]         [k] copy_user_generic_unrolled
   2.29%  [kernel]         [k] copy_from_kernel_nofault
   1.86%  [nf_conntrack]   [k] __nf_conntrack_find_get
   1.81%  [kernel]         [k] __fget_files
   1.42%  [kernel]         [k] _raw_spin_lock
   1.39%  [vmxnet3]        [k] vmxnet3_poll_rx_only
   1.34%  [kernel]         [k] finish_task_switch.isra.0
   1.32%  [nf_tables]      [k] nft_do_chain
   1.23%  libc.so.6        [.] cfree
   1.08%  [kernel]         [k] __siphash_unaligned
   1.07%  [kernel]         [k] syscall_enter_from_user_mode
   1.05%  [kernel]         [k] memcg_slab_free_hook
   1.00%  [kernel]         [k] memset_orig
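(Side note on that profile: __nf_conntrack_find_get and nft_do_chain suggest
that connection tracking is applied to the DNS traffic. If conntrack is not
needed for port 53 on this host, exempting DNS from it can remove some of that
kernel overhead. A rough nftables sketch, which would need to be adapted to
the existing ruleset:

    table inet raw {
      chain prerouting {
        type filter hook prerouting priority raw; policy accept;
        udp dport 53 notrack
        tcp dport 53 notrack
      }
      chain output {
        type filter hook output priority raw; policy accept;
        udp sport 53 notrack
        tcp sport 53 notrack
      }
    }
)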
We have the following configuration:
setACL({'0.0.0.0/0', '::/0'})
controlSocket("127.0.0.1:5900")
setKey("<pwd>")
webserver("127.0.0.1:8083")
setWebserverConfig({password=hashPassword("<pwd>")})
addLocal("<own IPv4>:53",{reusePort=true,tcpFastOpenQueueSize=100})
addLocal("<own IPv4>:53",{reusePort=true,tcpFastOpenQueueSize=100})
newServer({address="127.0.0.1:54", pool="all"})
newServer({address="127.0.0.1:54", pool="all"})
newServer({address="<bind server 1>:53", pool="abuse", tcpFastOpen=true,
maxCheckFailures=5, sockets=16})
newServer({address="<bind server 2>:53", pool="abuse", tcpFastOpen=true,
maxCheckFailures=5, sockets=16})
addAction(OrRule({OpcodeRule(DNSOpcode.Notify),
OpcodeRule(DNSOpcode.Update), QTypeRule(DNSQType.AXFR),
QTypeRule(DNSQType.IXFR)}), RCodeAction(DNSRCode.REFUSED))
addAction(AllRule(), PoolAction("all"))
We have removed the caching and the per-IP QPS limiter, because we are
attacking it from only 4 servers.
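(For completeness: putting the packet cache back later would look roughly like
the snippet below; the size and TTLs are illustrative, not what we had
configured before.

    pc = newPacketCache(1000000, {maxTTL=86400, minTTL=0, temporaryFailureTTL=60, staleTTL=60})
    getPool("all"):setCache(pc)
)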
Thanks in advance for all the help you can give me.
On Mon, 6 May 2024 at 10:41, Remi Gacogne via dnsdist <
dnsdist at mailman.powerdns.com> wrote:
> Hi!
>
> On 03/05/2024 22:20, Jasper Aikema via dnsdist wrote:
> > Currently we are stuck at a max of +/- 200k qps for nxdomain requests
> > and want to be able to serve +/- 300k qps per server.
>
> 200k QPS is fairly low based on what you describe. Would you mind
> sharing the whole configuration (redacting passwords and keys, of
> course), and telling us a bit more about the hardware dnsdist is running
> on?
>
> > We have done the following:
> > - added multiple (6x the amount of cores) addLocal listeners for IPv4
> > and IPv6, with the options reusePort=true and tcpFastOpenQueueSize=100
> > - add multiple (2x the amount of cores) newServer to the backend, with
> > the options tcpFastOpen=true and sockets=(2x the amount of cores)
>
> 6 times the amount of cores is probably not a good idea. I usually
> advise to make it so that the number of threads is roughly equivalent to
> the number of cores that are dedicated to dnsdist, so in your case the
> number of addLocal + the number of newServer + the number of TCP workers
> should ideally match the number of cores you have. If you need to
> overcommit the cores a bit that's fine, but keep it to something like
> twice the number of cores you have, not 10 times.
>
> > - setMaxTCPClientThreads(1000)
> I'm pretty sure this does not make sense, I would first go with the
> default until you see TCP/DoT connections are not processed correctly.
>
> > And the defaults like caching requests (which doesn't work for nxdomain)
> > and limit the amount of qps per ip (which also doesn't work for nxdomain
> > attack because they use public resolvers).
>
> When you say it doesn't work for NXDomain, I'm assuming you mean it
> doesn't solve the problem of random sub-domains attacks, not that a
> NXDomain is not properly cached/accounted?
>
> > When we simulate a nxdomain attack (with 200k qps and 500MBit of
> > traffic), we get a high load on the dnsdist server (50% CPU for dnsdist
> > and a lot of interrupts and context switches).
>
> I expect lowering the number of threads will reduce the context switches
> a lot. If you are still not getting good QPS numbers, I would suggest
> checking if disabling the rules help, to figure out the bottleneck. You
> might also want to take a look with "perf top -p <pid of dnsdist>"
> during the high load to see where the CPU time is spent.
>
> > So the question from me to you are:
> > - how much qps are you able to push through dnsdist using a powerdns or
> > bind backend
>
> It really depends on the hardware you have and the rules you are
> enabling, but it's quite common to see people pushing 400k+ QPS on a
> single DNSdist without a lot of fine tuning, and a fair amount of
> remaining head-room.
>
> > - have I overlooked some tuning parameters, e.g. more kernel parameters
> > or some dnsdist parameters
>
> I shared a few parameters a while ago: [1].
>
> > - what is the best method of sending packets for a domain to a separate
> > backend, right now we use 'addAction("<domain>",
> > PoolAction("abuse"))', but is this the least CPU intensive one? Are there
> > better methods?
>
> It's the best method and should be really cheap.
>
> > I have seen eBPF socket filtering, but as far as I have seen that is
> > for dropping unwanted packets.
>
> Correct. You could look into enabling AF_XDP / XSK [2] but I would
> recommend checking that you really cannot get the performance you want
> with normal processing first, as AF_XDP has some rough edges.
>
> [1]:
> https://mailman.powerdns.com/pipermail/dnsdist/2023-January/001271.html
> [2]: https://dnsdist.org/advanced/xsk.html
>
> Best regards,
> --
> Remi Gacogne
> PowerDNS B.V
> _______________________________________________
> dnsdist mailing list
> dnsdist at mailman.powerdns.com
> https://mailman.powerdns.com/mailman/listinfo/dnsdist
>