Scraping the complete IPv4 WHOIS Address Space
WHOIS data is the most important data source for ipapi.is. WHOIS records often contain the following information:
- The organization that is responsible for an IP address
- The country and postal address of said organization
- Contact information such as email addresses, phone numbers, fax numbers
- Potentially the organization's website, by parsing email addresses and taking the domain name (many
organizations use the same domain for their email addresses and websites)
Based on primary WHOIS records, interesting secondary information can be derived:
- The organization's website - The domain name from email addresses is often the same
domain name that is used in the organization's website.
- Geolocation Intelligence - Based on postal addresses and other attributes in WHOIS
records, geolocation intelligence can be derived.
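To illustrate the first derivation, here is a minimal sketch that guesses an organization's website domain from the email addresses in a WHOIS record. The regular expression, the short list of mailbox providers and the function name `candidate_website_domain` are assumptions made for this example, and the result is only a candidate that still has to be verified.

```python
import re
from collections import Counter

# Simplified pattern for illustration; it does not cover every valid email address.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@([A-Za-z0-9.-]+\.[A-Za-z]{2,})")

# Assumed, incomplete list of mailbox providers whose domains reveal nothing
# about the organization itself.
FREE_MAIL = {"gmail.com", "yahoo.com", "hotmail.com", "outlook.com"}


def candidate_website_domain(whois_text):
    """Return the most frequent non-freemail email domain in a record, or None."""
    domains = [d.lower() for d in EMAIL_RE.findall(whois_text)]
    counts = Counter(d for d in domains if d not in FREE_MAIL)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Made-up record fragment, not real WHOIS data.
record = (
    "OrgAbuseEmail: abuse@example.net\n"
    "OrgTechEmail: noc@example.net\n"
    "OrgTechEmail: someone@gmail.com\n"
)
print(candidate_website_domain(record))
```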
Since WHOIS data is such a fundamental data source for IP address data, ipapi.is needs to
have a synced copy of all existing WHOIS records.
WHOIS databases are hosted by the five Regional Internet Registries (RIRs): AFRINIC, APNIC, ARIN, LACNIC
and RIPE NCC. Most WHOIS databases can be
downloaded from the respective Regional Internet Registry's website.
However, the publicly available WHOIS databases often don't include the full WHOIS record for privacy
reasons. The following WHOIS information is potentially critical:
- Registrant Name: The name of the domain registrant is critical as it can reveal the
identity of the individual or organization behind the domain, potentially impacting privacy.
- Email Addresses: Email addresses associated with the domain, including the
registrant's and administrative contact's email addresses, are critical because they can be used for
communication but may also expose personal or sensitive information.
- Postal Addresses: Physical addresses in WHOIS records are considered critical as they
reveal the location of the domain registrant, which can be sensitive information.
For that reason, a substantial part of the IP address space has to be queried manually.
This blog article explains how the full IPv4 address space can be queried efficiently.
The entire IPv4 address space contains around 4.3 billion (2^32) IPv4 addresses. It would be very painful to
periodically scrape every single one of them, since WHOIS servers would block us.
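A quick back-of-the-envelope calculation shows the scale of the problem. The rate of one query per second is an assumption chosen for illustration, not a documented limit of any WHOIS server.

```python
TOTAL_IPV4 = 2 ** 32  # 4,294,967,296 addresses
SECONDS_PER_YEAR = 60 * 60 * 24 * 365


def years_to_query_all(queries_per_second):
    """Years needed to send one WHOIS query per IPv4 address at a fixed rate."""
    return TOTAL_IPV4 / queries_per_second / SECONDS_PER_YEAR


# Assumed rate of 1 query per second.
print(f"about {years_to_query_all(1):.0f} years")  # about 136 years
```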
However, the IP address space is divided into networks. When querying a random IP address with
whois 22.214.171.124, the following record is obtained:
% IANA WHOIS server
% for more information on IANA, visit http://www.iana.org
% This query returned 1 object
inetnum: 126.96.36.199 - 188.8.131.52
NetRange: 184.108.40.206 - 220.127.116.11
Parent: NET70 (NET-70-0-0-0-0)
NetType: Direct Allocation
Organization: Videotron Ltee (VL-421)
OrgName: Videotron Ltee
Address: 150 Beaubien West
PostalCode: H2V 1C4
OrgTechName: Network Operations Center
OrgAbuseName: Network Operations Center
The WHOIS record from above indicates that the IP 18.104.22.168 is included in the network
22.214.171.124 - 126.96.36.199. This means that every whois request to any other IP in the network
188.8.131.52 - 184.108.40.206 is unnecessary. Put differently, since most IP
addresses are structured into networks, there only need to be as many requests to WHOIS servers as there
are networks.
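The network covered by a record can be extracted programmatically. The following is a minimal sketch that assumes the range appears on a line starting with `inetnum:` or `NetRange:`, as in the record above. The function name `most_specific_networks` is made up for this example. When several ranges are present, it picks the smallest one and converts it to CIDR blocks with Python's `ipaddress` module.

```python
import ipaddress
import re

# Matches lines such as "NetRange: 198.51.100.0 - 198.51.100.255".
RANGE_RE = re.compile(
    r"^(?:inetnum|NetRange):[ \t]*(\S+)[ \t]*-[ \t]*(\S+)[ \t]*$",
    re.MULTILINE | re.IGNORECASE,
)


def most_specific_networks(whois_text):
    """Return the CIDR blocks of the smallest valid IPv4 range in a record."""
    best = None
    for first, last in RANGE_RE.findall(whois_text):
        try:
            start = ipaddress.IPv4Address(first)
            end = ipaddress.IPv4Address(last)
        except ValueError:
            continue  # not a valid IPv4 address, skip this line
        if start > end:
            continue  # reversed range, skip this line
        if best is None or int(end) - int(start) < int(best[1]) - int(best[0]):
            best = (start, end)
    if best is None:
        return []
    return list(ipaddress.summarize_address_range(*best))


# Made-up record using documentation address space.
sample = (
    "inetnum: 198.0.0.0 - 198.255.255.255\n"
    "NetRange: 198.51.100.0 - 198.51.100.255\n"
)
networks = most_specific_networks(sample)
print(networks)
print(ipaddress.ip_address("198.51.100.77") in networks[0])
```

Any IP address that falls into one of the returned networks does not need its own WHOIS query.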
Having said that, the IPv4 address space is still divided into a huge number of networks and it is not a
trivial task to scrape the entire WHOIS address space.
The next question would be: Is it possible to find a list of all allocated networks? Such a list would
make it easier to know how many whois queries have to be conducted. A good starting point would be
to find out how many inetnum and inet6num objects are in the downloadable databases of
the Regional Internet Registries.
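Counting such objects can be sketched as follows. The sketch assumes the downloaded database is a plain-text dump in which every network object begins with a line starting with `inetnum:` or `inet6num:`. The function name and the sample lines are made up for this example.

```python
from collections import Counter

NETWORK_KEYS = ("inetnum", "inet6num")


def count_network_objects(lines):
    """Count lines that open an inetnum or inet6num object."""
    counts = Counter()
    for line in lines:
        for key in NETWORK_KEYS:
            if line.startswith(key + ":"):
                counts[key] += 1
    return counts


# Made-up dump fragment; a real dump would be read line by line from a file.
dump = [
    "inetnum:        198.51.100.0 - 198.51.100.255\n",
    "netname:        EXAMPLE-NET\n",
    "\n",
    "inet6num:       2001:db8::/32\n",
    "netname:        EXAMPLE-NET6\n",
]
print(count_network_objects(dump))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, so a large dump never has to be loaded into memory at once.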
As of 8th October 2023, the following JSON file contains the number of networks that are in the
downloadable databases of all RIRs except ARIN (ARIN doesn't provide a database, not even a censored one):
What can be learned from those numbers?
A very rough estimate of the total number of networks in the entire IP space (IPv4 and IPv6) might be
around 7 to 12 million unique networks.
An Algorithm to Scrape the entire WHOIS Address Space
Based on the above observations, it is possible to formulate an algorithm. The following steps need to be
accomplished by the proposed algorithm:
- Load all existing WHOIS records from disk that are younger than 4 weeks (not stale)
- Extract the networks from the loaded WHOIS records and add them to a lookup table
- Find the remaining IP space that is not covered by the loaded networks (excluding reserved IP
addresses). This IP space is an array of IP ranges not covered by the existing WHOIS records.
Iterate over each IP range of this array:
- Take a random IP address of the current IP range and conduct a WHOIS lookup
- If the WHOIS lookup was successful, store the WHOIS record to disk
- If the WHOIS lookup failed, add the failed IP to a blacklist and don't use it anymore
Repeat the above process until there is no remaining IP space left.
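The steps above can be sketched in a few lines of Python. This is a simplified illustration under several assumptions, not the production implementation. IP addresses are plain integers, `known` stands for the networks loaded from disk (reserved ranges can be passed in the same way), and `lookup` is a hypothetical callable that returns the `(start, end)` range of the network containing an IP, or `None` if the WHOIS lookup failed. Blacklisted IPs are treated as covered single-address ranges so that the loop can terminate.

```python
import random


def remaining_ranges(space, covered):
    """Return the parts of the (start, end) range `space` not in `covered`."""
    gaps, cursor = [], space[0]
    for start, end in sorted(covered):
        if end < cursor:
            continue  # range lies entirely before the cursor
        if start > space[1]:
            break  # range lies entirely after the space
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if cursor <= space[1]:
        gaps.append((cursor, space[1]))
    return gaps


def scrape(space, known, lookup, max_rounds=1000):
    """Query one random IP per uncovered range until no IP space is left."""
    known = list(known)
    blacklist = set()
    for _ in range(max_rounds):  # safety bound, an addition for this sketch
        covered = known + [(ip, ip) for ip in blacklist]
        gaps = remaining_ranges(space, covered)
        if not gaps:
            break
        for start, end in gaps:
            ip = random.randint(start, end)
            result = lookup(ip)
            if result is None:
                blacklist.add(ip)  # failed lookup, never query this IP again
            else:
                known.append(result)  # a real scraper would store the record
    return known, blacklist


# Toy address space 0..99 with three networks and one bad range (50..59).
toy_networks = [(0, 9), (10, 49), (60, 99)]


def fake_lookup(ip):
    for start, end in toy_networks:
        if start <= ip <= end:
            return (start, end)
    return None


found, failed = scrape((0, 99), [(0, 9)], fake_lookup)
print(sorted(found), sorted(failed))
```

In the toy run, each good network is discovered with a single lookup, while every address of the bad range 50..59 ends up on the blacklist one by one.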
The above algorithm has a drawback: If a WHOIS lookup fails for a certain IP address, it is
very likely that thousands of other IP addresses in the same range will fail as well. However, the
extent of this bad IP range is unknown. Therefore, the above
algorithm needs to fail on every single such bad IP address before it can terminate.
Crawling the entire WHOIS address space is a challenging task. The main reason for the difficulty is
unknown bad IP ranges. Furthermore, the huge number of allocated/assigned networks makes
periodic scraping challenging because WHOIS servers block hosts that make too many requests.