Scraping the complete IPv4 WHOIS Address Space
WHOIS data is the most important data source for ipapi.is. WHOIS records often contain the following information:
- The organization that is responsible for an IP address
- The country and postal address of said organization
- Contact information such as email addresses, phone numbers, fax numbers
- Potentially the organization's website by parsing email addresses and taking the domain name (Many organization use the same domain for their email addresses and websites)
Based on primary WHOIS records, interesting secondary information can be derived:
- The organization's website - The domain name from email addresses is often the same domain name that is used in the organization's website.
- Geolocation Intelligence - Based on postal addresses and other attributes in WHOIS records, geolocation intelligence can be derived.
Since WHOIS data is such a fundamental data source for IP address data, ipapi.is needs to have a synced copy of all existing WHOIS records.
WHOIS databases are hosted by the five Regional Internet Registries:
- American Registry for Internet Numbers (ARIN)
- Réseaux IP Européens Network Coordination Centre (RIPE NCC)
- Asia-Pacific Network Information Centre (APNIC)
- African Network Information Centre (AFRINIC)
- Latin America and Caribbean Network Information Centre (LACNIC)
Most WHOIS databases can be downloaded from the respective Regional Internet Registy's website.
However, the publicly available WHOIS databases often don't include the full WHOIS record for privacy reasons. The following WHOIS information is potentially critical:
- Registrant Name: The name of the domain registrant is critical as it can reveal the identity of the individual or organization behind the domain, potentially impacting privacy.
- Email Addresses: Email addresses associated with the domain, including the registrant's and administrative contact's email addresses, are critical because they can be used for communication but may also expose personal or sensitive information.
- Postal Addresses: Physical addresses in WHOIS records are considered critical as they reveal the location of the domain registrant, which can be sensitive information.
For that reason, there is a need to manually query a substantial part of the IP address space with a
whois
client.
This blog article explains how the full IPv4 address space can be efficiently queried.
Useful Observations
The entire IPv4 address space contains around 4 billion IPv4 addresses. It would be very painful to
periodically scrape 4 billion IP addresses, since whois
servers would block us
constantly.
However, the IP address space is divided into networks. When querying a random IP address with
whois 70.81.10.212
, the following record is obtained:
The WHOIS record from above indicates that the IP 70.81.10.212
is included in the network
70.80.0.0 - 70.83.255.255
This means that every whois
request to any other IP in the network
70.80.0.0 - 70.83.255.255
is futile and not necessary. Put differently, since most IP
addresses are structured into networks, there only need to be as many requests to WHOIS servers as there
are networks.
Having said that, the IPv4 address space is still divided into a huge number of networks and it is not a trivial task to scrape the entire WHOIS address space.
The next question would be: Is it possible to find a list of all allocated networks? Such a list would
make it easier to know how many whois queries have to be conducted. A first good starting point would be
to find
out how many inetnum
and inet6num
objects are in the downloadable database of
all RIRs?
As of 8th October 2023, the following JSON file contains the number of networks that are in the downloadable databases of all RIR's except ARIN (ARIN doesn't provide a database - not even a censored one):
What can be learned from those numbers? A very rough estimate of the number of total networks in the entire IP space (IPv4 and IPv6) might be around 7 to 12 million unique networks.
An Algorithm to Scrape the entire WHOIS Address Space
Based on the above observations, it is possible to formulate an algorithm. The following steps need to be accomplished by the proposed algorithm:
- Load all existing WHOIS records from disk that are younger than 4 weeks (not stale)
- Extract the networks from the loaded WHOIS records and add it to a lookup table
- Find the remaining IP space that is not covered by the loaded networks (Excluding reserved IP addresses). This IP space is a array of IP ranges not covered by the existing WHOIS records. Iterate over each IP range of this array:
- Take a random IP address of the current IP range and conduct a WHOIS lookup
- If the WHOIS lookup was successful, store the WHOIS lookup to disk
- If the WHOIS lookup failed, add the failed IP to a blacklist and don't use it anymore
- Repeat the above process until there is no remaining IP space left.
The above algorithm has a drawback: If a WHOIS lookup fails for a certain IP address, it is very likely that there are thousands of other IP addresses that will fail for that range. However, this bad IP range is unknown. Therefore, the above algorithm needs to fail on every single such bad IP address until it can ends.
Conclusion
Crawling the entire WHOIS address space is a challenging task. The main reason for the difficulty are unknown bad IP ranges. Furthermore, the huge number of allocated/assigned networks makes the periodic scraping challenging because WHOIS servers block hosts that make too many requests.