Scraping WHOIS data
Published: October 3, 2023
Last Modified: October 8, 2023
whois scraping ipv4 address space

Scraping the complete IPv4 WHOIS Address Space

WHOIS data is the most important data source for ipapi.is. WHOIS records often contain the following information:

  • The organization that is responsible for an IP address
  • The country and postal address of said organization
  • Contact information such as email addresses, phone numbers, fax numbers
  • Potentially the organization's website by parsing email addresses and taking the domain name (Many organization use the same domain for their email addresses and websites)

Based on primary WHOIS records, interesting secondary information can be derived:

  • The organization's website - The domain name from email addresses is often the same domain name that is used in the organization's website.
  • Geolocation Intelligence - Based on postal addresses and other attributes in WHOIS records, geolocation intelligence can be derived.

Since WHOIS data is such a fundamental data source for IP address data, ipapi.is needs to have a synced copy of all existing WHOIS records.

WHOIS databases are hosted by the five Regional Internet Registries:

Most WHOIS databases can be downloaded from the respective Regional Internet Registy's website.

However, the publicly available WHOIS databases often don't include the full WHOIS record for privacy reasons. The following WHOIS information is potentially critical:

  • Registrant Name: The name of the domain registrant is critical as it can reveal the identity of the individual or organization behind the domain, potentially impacting privacy.
  • Email Addresses: Email addresses associated with the domain, including the registrant's and administrative contact's email addresses, are critical because they can be used for communication but may also expose personal or sensitive information.
  • Postal Addresses: Physical addresses in WHOIS records are considered critical as they reveal the location of the domain registrant, which can be sensitive information.

For that reason, there is a need to manually query a substantial part of the IP address space with a whois client.

This blog article explains how the full IPv4 address space can be efficiently queried.

Useful Observations

The entire IPv4 address space contains around 4 billion IPv4 addresses. It would be very painful to periodically scrape 4 billion IP addresses, since whois servers would block us constantly.

However, the IP address space is divided into networks. When querying a random IP address with whois 70.81.10.212, the following record is obtained:

% IANA WHOIS server
% for more information on IANA, visit http://www.iana.org
% This query returned 1 object

refer:        whois.arin.net

inetnum:      70.0.0.0 - 70.255.255.255
organisation: ARIN
status:       ALLOCATED

whois:        whois.arin.net

changed:      2004-01
source:       IANA

# whois.arin.net

NetRange:       70.80.0.0 - 70.83.255.255
CIDR:           70.80.0.0/14
NetName:        VL-17BL
NetHandle:      NET-70-80-0-0-1
Parent:         NET70 (NET-70-0-0-0-0)
NetType:        Direct Allocation
OriginAS:       
Organization:   Videotron Ltee (VL-421)
RegDate:        2004-07-14
Updated:        2020-10-08
Ref:            https://rdap.arin.net/registry/ip/70.80.0.0



OrgName:        Videotron Ltee
OrgId:          VL-421
Address:        150 Beaubien West
City:           Montreal
StateProv:      QC
PostalCode:     H2V 1C4
Country:        CA
RegDate:        2020-05-15
Updated:        2020-11-02
Comment:        https://www.videotron.com
Ref:            https://rdap.arin.net/registry/entity/VL-421


OrgTechHandle: NOC1226-ARIN
OrgTechName:   Network Operations Center
OrgTechPhone:  +1-514-380-7100 
OrgTechEmail:  SRIP@videotron.com
OrgTechRef:    https://rdap.arin.net/registry/entity/NOC1226-ARIN

OrgAbuseHandle: ABUSE263-ARIN
OrgAbuseName:   Network Operations Center
OrgAbusePhone:  +1-514-281-8498 
OrgAbuseEmail:  abuse@videotron.ca
OrgAbuseRef:    https://rdap.arin.net/registry/entity/ABUSE263-ARIN

The WHOIS record from above indicates that the IP 70.81.10.212 is included in the network 70.80.0.0 - 70.83.255.255

This means that every whois request to any other IP in the network 70.80.0.0 - 70.83.255.255 is futile and not necessary. Put differently, since most IP addresses are structured into networks, there only need to be as many requests to WHOIS servers as there are networks.

Having said that, the IPv4 address space is still divided into a huge number of networks and it is not a trivial task to scrape the entire WHOIS address space.

The next question would be: Is it possible to find a list of all allocated networks? Such a list would make it easier to know how many whois queries have to be conducted. A first good starting point would be to find out how many inetnum and inet6num objects are in the downloadable database of all RIRs?

As of 8th October 2023, the following JSON file contains the number of networks that are in the downloadable databases of all RIR's except ARIN (ARIN doesn't provide a database - not even a censored one):

{
  "ripe": {
    "numNetsIPv4": 4177981,
    "numNetsIPv6": 931893,
    "numIPv4": 1383978430
  },
  "afrinic": {
    "numNetsIPv4": 149612,
    "numNetsIPv6": 32040,
    "numIPv4": 205617412
  },
  "twnic": {
    "numNetsIPv4": 90895,
    "numNetsIPv6": 1334,
    "numIPv4": 27476174
  },
  "krnic": {
    "numNetsIPv4": 4153,
    "numNetsIPv6": 110,
    "numIPv4": 111065120
  },
  "apnic": {
    "numNetsIPv4": 1185174,
    "numNetsIPv6": 93315,
    "numIPv4": 1035455286
  },
  "jpnic": {
    "numNetsIPv4": 537512,
    "numNetsIPv6": 7141,
    "numIPv4": 179191974
  }
}

What can be learned from those numbers? A very rough estimate of the number of total networks in the entire IP space (IPv4 and IPv6) might be around 7 to 12 million unique networks.

An Algorithm to Scrape the entire WHOIS Address Space

Based on the above observations, it is possible to formulate an algorithm. The following steps need to be accomplished by the proposed algorithm:

  1. Load all existing WHOIS records from disk that are younger than 4 weeks (not stale)
  2. Extract the networks from the loaded WHOIS records and add it to a lookup table
  3. Find the remaining IP space that is not covered by the loaded networks (Excluding reserved IP addresses). This IP space is a array of IP ranges not covered by the existing WHOIS records. Iterate over each IP range of this array:
    • Take a random IP address of the current IP range and conduct a WHOIS lookup
    • If the WHOIS lookup was successful, store the WHOIS lookup to disk
    • If the WHOIS lookup failed, add the failed IP to a blacklist and don't use it anymore
  4. Repeat the above process until there is no remaining IP space left.

The above algorithm has a drawback: If a WHOIS lookup fails for a certain IP address, it is very likely that there are thousands of other IP addresses that will fail for that range. However, this bad IP range is unknown. Therefore, the above algorithm needs to fail on every single such bad IP address until it can ends.

Conclusion

Crawling the entire WHOIS address space is a challenging task. The main reason for the difficulty are unknown bad IP ranges. Furthermore, the huge number of allocated/assigned networks makes the periodic scraping challenging because WHOIS servers block hosts that make too many requests.