PowerDNS’s DNSdist is a versatile, performant and generally awesome product for anything related to DNS proxying. Today I will describe how to use it as a forwarding, caching proxy to improve performance for use cases like VPN servers, with queries to the upstream resolvers carried over DoT.

One of the simplest ways to set up DNS resolution on a system is something like the following:

$ cat /etc/resolv.conf
nameserver 8.8.8.8
nameserver 8.8.4.4

This approach has several downsides, however:

  • If one of the nameservers dies, queries have to time out against it before the next one is tried, so failures take quite a long time
  • Queries will not be balanced across the specified nameservers (unless options rotate is specified to round-robin between them; see the example after this list)
  • Performance/latency is not taken into account during selection
  • Limited redundancy possibilities, as glibc’s resolver only honours the first three nameserver entries in resolv.conf
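
For reference, glibc does offer a few resolver options that partially mitigate the first two points; the values below are illustrative:

$ cat /etc/resolv.conf
nameserver 8.8.8.8
nameserver 8.8.4.4
# Round-robin across nameservers, time out after 1 second, retry each server twice
options rotate timeout:1 attempts:2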

There is always the possibility of running a full standalone DNS recursor on the same system, but that also has some drawbacks:

  • Depending on the query load, average query latency might end up several times worse than with public recursors, since a lightly used local cache has a much lower hit rate than the large shared caches of public resolvers
  • Filtering by ISPs along the path could cause queries for some names to be dropped, or answered with injected, incorrect responses

A hybrid approach in the form of a caching forwarder seems to be ideal for most applications – so why not use a smart piece of software to do that?

Configuration

An example DNSdist configuration file follows:

$ cat /etc/dnsdist/dnsdist.conf
-- Set up console for local access via the CLI
controlSocket("127.0.0.1:5199")
-- Key that is used internally to guard access to the controlSocket.
-- Generate by running the following command:
-- $ dnsdist -e 'makeKey()'
setKey("<READ ARTICLE>")

-- Set up listening addresses
addLocal("127.0.0.1:53")
addLocal("[::1]:53")
addLocal("10.254.1.1:53")
-- Allow queries from any source IP
addACL("0.0.0.0/0")
addACL("::/0")

-- Set up packet cache with a max of 100k entries and override TTLs to not cache things too long
pc = newPacketCache(100000, {maxTTL=300, minTTL=0, maxNegativeTTL=60})

-- Configure upstream IPv4 recursors using DoT
newServer({address="8.8.8.8:853", tls="openssl"})
newServer({address="1.0.0.1:853", tls="openssl"})

-- Configure upstream IPv6 recursors using DoT
newServer({address="[2001:4860:4860::8888]:853", tls="openssl"})
newServer({address="[2001:4860:4860::8844]:853", tls="openssl"})
newServer({address="[2606:4700:4700::1001]:853", tls="openssl"})
newServer({address="[2606:4700:4700::1111]:853", tls="openssl"})

-- Optionally forward queries to a local recursor
newServer({address="127.0.0.1:2053"})

-- Assign the packet cache to the default pool
getPool(""):setCache(pc)

I have left comments in the file above, but in short, the example will set up a caching forwarding load balancer, which will:

  • listen on localhost and 10.254.1.1 on standard DNS ports, allowing queries from any source IP
  • send queries to upstream public recursors over DoT, performing health checks to ensure queries are not being sent into a blackhole
    • The default QNAME used for health checks is a.root-servers.net.; this can be changed with the checkName parameter of newServer(...), as sketched after this list.
  • send queries to 127.0.0.1:2053, which could be an instance of a local recursor
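
As an example of tuning the health checks, a hypothetical extra upstream could look like the following; the address and parameter values are illustrative and not part of the configuration above:

-- Probe a custom QNAME and declare the backend down after 3 failed checks
newServer({address="9.9.9.9:853", tls="openssl", checkName="www.example.com.", maxCheckFailures=3})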

The load balancing policy we’re using here is called leastOutstanding:

The default load balancing policy is called leastOutstanding, which means the server with the least queries ‘in the air’ is picked. The exact selection algorithm is:

  1. pick the server with the least queries ‘in the air’ ;
  2. in case of a tie, pick the one with the lowest configured ‘order’ ;
  3. in case of a tie, pick the one with the lowest measured latency (over an average on the last 128 queries answered by that server).

Since all the upstream servers have the same configured order, step 2 never breaks a tie here; depending on the traffic load, either step 1 or step 3 decides which server gets picked.
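
Once dnsdist is running, pointing the system at it and verifying resolution is straightforward (assuming dig is installed):

$ cat /etc/resolv.conf
nameserver 127.0.0.1
$ dig @127.0.0.1 +short example.org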

Management

We can see what DNSdist is doing by invoking the handy CLI it comes with, reached by running dnsdist -c.

showServers()

The first useful command is showServers(), which displays a table with all the configured upstream servers, including basic metrics about them:

> showServers()
#    Name  Address                     State  Qps  Qlim  Ord  Wt  Queries  Drops  Drate  Lat   TCP   Outstanding  Pools
0          8.8.8.8:853                 up     0.0  0     1    1   12765    0      0.0    -     31.4  0
1          1.0.0.1:853                 up     0.0  0     1    1   32578    0      0.0    -     6.7   0
2          [2001:4860:4860::8888]:853  up     0.0  0     1    1   504      0      0.0    -     79.8  0
3          [2001:4860:4860::8844]:853  up     0.0  0     1    1   7648     0      0.0    -     29.5  0
4          [2606:4700:4700::1001]:853  up     0.0  0     1    1   507      0      0.0    -     99.2  0
5          [2606:4700:4700::1111]:853  up     0.0  0     1    1   574      0      0.0    -     67.4  0
6          127.0.0.1:2053              up     0.0  0     1    1   65       0      0.0    70.1  -     0
All                                           0.0               54641    0

Judging by the highest “Queries” count, 1.0.0.1 is the server DNSdist likes the most in this instance, which matches it also showing the lowest measured latency in the table.

On the other hand, 127.0.0.1:2053, an instance of pdns-recursor running on the same machine, is receiving the least traffic: having to perform full recursion is slower than asking the public recursors’ often hot caches.

I left the order of the backup recursor the same as for the other servers to demonstrate the impact of latency-based routing; however, if it is truly meant as a backup option, it might make sense to depref that server.
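
A minimal sketch of such a tweak (the order value is arbitrary): under leastOutstanding, a higher order makes a server lose tie-breaks, so it would only be picked when the public resolvers have more queries in flight or are down.

-- Depref the local recursor by giving it a higher (worse) order
newServer({address="127.0.0.1:2053", order=10})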

dumpStats()

Another handy CLI command is dumpStats(), displaying statistics about DNSdist as a whole:

> dumpStats()
acl-drops                       0   latency1-10                                 34796
cache-hits                   3239   latency10-50                                18687
cache-misses                54646   latency100-1000                               324
cpu-iowait                1467287   latency50-100                                 581
cpu-steal                 3703609   no-policy                                       0
cpu-sys-msec               722605   noncompliant-queries                            0
cpu-user-msec             1641516   noncompliant-responses                          0
doh-query-pipe-full             0   outgoing-doh-query-pipe-full                    0
doh-response-pipe-full          0   proxy-protocol-invalid                          0
downstream-send-errors          0   queries                                     57885
downstream-timeouts             0   rdqueries                                   57885
dyn-block-nmg-size              0   real-memory-usage                        53055488
dyn-blocked                     0   responses                                   57885
empty-queries                   0   rule-drop                                       0
fd-usage                       92   rule-nxdomain                                   0
frontend-noerror            57105   rule-refused                                    0
frontend-nxdomain             769   rule-servfail                                   0
frontend-servfail              11   rule-truncated                                  0
latency-avg100             8574.5   security-status                                 1
latency-avg1000            9209.9   self-answered                                   0
latency-avg10000          12343.2   servfail-responses                             11
latency-avg1000000          725.9   special-memory-usage                     38002688
latency-count               57879   tcp-cross-protocol-query-pipe-full              0
latency-doh-avg100            0.0   tcp-cross-protocol-response-pipe-full           0
latency-doh-avg1000           0.0   tcp-listen-overflows                          304
latency-doh-avg10000          0.0   tcp-query-pipe-full                             0
latency-doh-avg1000000        0.0   trunc-failures                                  0
latency-dot-avg100            0.0   udp-in-csum-errors                             42
latency-dot-avg1000           0.0   udp-in-errors                                  63
latency-dot-avg10000          0.0   udp-noport-errors                              57
latency-dot-avg1000000        0.0   udp-recvbuf-errors                              0
latency-slow                   23   udp-sndbuf-errors                               0
latency-sum                721334   udp6-in-csum-errors                             0
latency-tcp-avg100         8711.1   udp6-in-errors                                  0
latency-tcp-avg1000         887.4   udp6-noport-errors                             13
latency-tcp-avg10000         88.9   udp6-recvbuf-errors                             0
latency-tcp-avg1000000        0.9   udp6-sndbuf-errors                              0
latency0-1                   3468   uptime                                     197951

Metrics and latency monitoring

DNSdist also exposes a Prometheus-compatible metrics endpoint through its built-in webserver. All of the statistics above and more can be scraped, stored and graphed.
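
Enabling it takes a couple of extra lines in dnsdist.conf; the listen address, password and ACL below are illustrative:

-- Expose the built-in webserver; Prometheus metrics are served under /metrics
webserver("127.0.0.1:8083")
setWebserverConfig({password="changeme", acl="127.0.0.1/32"})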

A ready-made Grafana dashboard for DNSdist is available at https://grafana.com/grafana/dashboards/13692-dnsdist-dashboard/.

Drawbacks

I don’t see any drawbacks in using DNSdist itself; however, advocating the use of public DNS recursors is more debatable.

We can mitigate resiliency concerns by running a DNS recursor on the same server as a fallback, as done in the configuration above.
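
For pdns-recursor, making it listen where dnsdist expects it (127.0.0.1:2053 in the configuration above) could look like this; the configuration file path may differ between distributions:

$ cat /etc/powerdns/recursor.conf
# Listen only on loopback, on the port dnsdist forwards to
local-address=127.0.0.1
local-port=2053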

That does not alleviate concerns about handing DNS queries to cloud giants, though one could debate how useful that query information really is to them.