On the morning of March 8, a post to Hacker News stated that “All .fj domains have gone offline”, listing several hostnames in domains within the Fiji top level domain (known as a ccTLD) that had become unreachable. Commenters in the associated discussion thread had mixed results in being able to reach .fj hostnames—some were successful, while others saw failures. The fijivillage news site also highlighted the problem, noting that the issue also impacted Vodafone’s M-PAiSA app/service, preventing users from completing financial transactions.
The impact of this issue can be seen in traffic to Cloudflare customer zones in the .com.fj second-level domain. The graph below shows that HTTP traffic to these zones dropped by approximately 40% almost immediately starting around midnight UTC on March 8. Traffic volumes continued to decline throughout the rest of the morning.
Looking at Cloudflare’s 22.214.171.124 resolver data for queries for .com.fj hostnames, we can also see that error volume associated with those queries climbs significantly starting just after midnight as well. This means that our resolvers encountered issues with the answers from .fj servers.
This observation suggests that the problem was strictly DNS related, rather than connectivity related—Cloudflare Radar does not show any indication of an Internet disruption in Fiji coincident with the start of this problem.
It was suggested within the Hacker News comments that the problem could be DNSSEC related. Upon further investigation, it appears that may be the cause. In verifying the DNSSEC record for the .fj ccTLD, shown in the
dig output below, we see that it states
EDE: 9 (DNSKEY Missing): 'no SEP matching the DS found for fj.'
kdig fj. soa +dnssec @126.96.36.199 ;; ->>HEADER<<- opcode: QUERY; status: SERVFAIL; id: 12710 ;; Flags: qr rd ra; QUERY: 1; ANSWER: 0; AUTHORITY: 0; ADDITIONAL: 1 ;; EDNS PSEUDOSECTION: ;; Version: 0; flags: do; UDP size: 1232 B; ext-rcode: NOERROR ;; EDE: 9 (DNSKEY Missing): 'no SEP matching the DS found for fj.' ;; QUESTION SECTION: ;; fj. IN SOA ;; Received 73 B ;; Time 2022-03-08 08:57:41 EST ;; From 188.8.131.52@53(UDP) in 17.2 ms
Extended DNS Error 9 (EDE: 9) is defined as “A DS record existed at a parent, but no supported matching DNSKEY record could be found for the child.” The Cloudflare Learning Center article on DNSKEY and DS records explains this relationship:
The DS record is used to verify the authenticity of child zones of DNSSEC zones. The DS key record on a parent zone contains a hash of the KSK in a child zone. A DNSSEC resolver can therefore verify the authenticity of the child zone by hashing its KSK record, and comparing that to what is in the parent zone’s DS record.
Ultimately, it appears that around midnight UTC, the .fj zone started to be signed with a key that was not in the root zone DS, possibly as the result of a scheduled rollover that happened without checking that the root zone was updated first by IANA, which updates the root zone. (IANA owns contact with the TLD operators, and instructs the Root Zone Publisher on the changes to make in the next version of the root zone.)
DNSSEC problems as the root cause of the observed issue align with the observation in the Hacker News comments that some were able to access .fj websites, while others were not. Users behind resolvers doing strict DNSSEC validation would have seen an error in their browser, while users behind less strict resolvers would have been able to access the sites without a problem.
Further analysis of Cloudflare resolver metrics indicates that the problem was resolved around 1400 UTC, when the DS was updated. When DNSSEC is improperly configured for a single domain name, it can cause problems accessing websites or applications in that zone. However, when the misconfiguration occurs at a ccTLD level, the impact is much more significant. Unfortunately, this seems to occur all too often.
(Thank you to Ólafur Guðmundsson for his DNSSEC expertise.)