datacenter
Published: September 13, 2023
Last Modified: September 23, 2023
hosting detection cloud provider detection it security

An Algorithm to Detect Hosting Providers and their IP Ranges

This blog article introduces an efficient way to detect previously unknown hosting providers and their IP ranges. The algorithm aims to find hosting providers across the entire Internet.

Too long; Didn't read

This blog article demonstrates how to systematically detect new, previously unknown hosting providers on the Internet. Knowing the IP ranges of hosting providers is important for many IT security applications and yields better coverage for the is_datacenter field of the ipapi.is API.

The algorithm developed in this blog post found many new hosting providers. You can download the file classifiedHostingProviders.tsv that contains around 96,000 IP ranges of previously unknown hosting providers. Those 96,000 hosting IP ranges belong to more than 2,000 newly detected hosting providers. At the time of writing, the algorithm is still running.

Download classifiedHostingProviders.tsv

What are hosting providers?

Hosting providers are organizations that allow third parties to purchase computing resources such as Virtual Private Servers (VPS) or bare metal servers. Such servers are often assigned a public IP address and are reachable from anywhere on the Internet. Hosting instances are used to run web servers, mail servers, file servers, SSH servers or other kinds of software that require steady uptime and a publicly reachable IP address.

The definition of hosting providers is quite lenient for the purpose of ipapi.is and includes all of the following:

Put differently: Every organization that allows anybody to quickly and anonymously obtain hosting resources or IP addresses is considered to be a hosting provider.

Why is hosting detection important?

Hosting resources can be used to run software that was developed with malicious intent in mind. Threat actors often abuse hosting infrastructure to run proxy servers or VPN servers in order to anonymously commit cyber crime. Furthermore, hosting resources are also frequently abused to run advanced bots or crawlers. For instance, many bot programmers run Headless Chrome from cloud instances.

Therefore, an IP address that is known to belong to a hosting provider can be treated as a potential threat. This knowledge helps to mitigate malicious traffic.

Furthermore, it is not sufficient to know only the IP ranges of the major hosting providers. Many threat actors move their operations to smaller, lesser-known hosting providers.

The avid reader might object at this point: "Such illegal software can also be run on personal laptops or workstations. Why are hosting providers more prone to malicious activity?"

It is correct that threat actors don't exclusively use hosting resources to commit their crimes.

However, from the perspective of a website or app operator, there is almost no good reason why organic user traffic should originate from hosting IP ranges. Or to put it differently: No legitimate user browses the web from a server hosted somewhere on the Internet.

The only plausible reason why humans might have hosting IP addresses is that they are using VPN or proxy servers that are hosted in the cloud. In all other cases, it can be assumed that traffic originating from hosting ranges comes from bots and other malicious programs hosted in the cloud or in datacenters.

There are likely some edge cases where legitimate users have cloud IP ranges, but usually, the average human Internet user surfs with either a residential or mobile IP address (ISP IP addresses).

Goals of this Research

The goal of this article is to present a scalable algorithm that finds new hosting providers and the IP ranges that belong to them. The false positive rate (classifying organizations as hosting providers even though they are not) should be as low as possible. The newly detected hosting IP ranges will be used by the ipapi.is API to populate the is_datacenter field.

Some examples for IP addresses that belong to hosting providers:

It is not possible to find every single hosting provider that exists on the Internet. But it certainly is possible to find a substantial share of them.

The Hosting Detection Algorithm

The hosting detection algorithm is explained in the following sections. The algorithm is fully automatic and thus does not require manual human interaction. It consists of six processing steps. Some processing steps are inherently risky and can lead to false positives (classifying IP addresses as hosting IPs even though they are not).

In general, the false positive rate should be minimized as much as possible, even if the false negative rate increases as a consequence. Put differently: The algorithm should rather miss some hosting providers than risk wrongly classifying an organization as a hosting provider.

Step 1: Download a List of Top 1 Million Domain Names

In a first step, a list of the top 1 million domain names is downloaded from Cloudflare Radar. This list includes the Top 1,000,000 domains of the entire Internet. If you are interested in how exactly the Domain Ranking is computed, you can read about Cloudflare's Domain Ranking method on their blog.

Step 2: Lookup the IP address for every Domain Name

In the next step, each of the 1 million domain names must be resolved, meaning that the DNS name is translated into an IP address. This is a rather time-consuming process, since it involves 1 million lookups. The following python3 script does exactly that:

import socket
import random
import json
import os


def flush(results, fn='res.json'):
    print(f'Flushing {len(results)} IPs to disk')
    parsed = dict()

    if os.path.exists(fn):
        with open(fn) as fd:
            parsed = json.load(fd)

    for key, value in results.items():
        parsed[key] = value

    with open(fn, 'w') as fd:
        json.dump(parsed, fd, indent=2)


def lookup_addresses(domain_list, flush_after=200):
    ip_addresses = dict()
    for domain in domain_list:
        try:
            ip = socket.gethostbyname(domain)
            ip_addresses[domain] = ip
        # UnicodeError is raised for names that cannot be IDNA-encoded
        except (socket.gaierror, UnicodeError) as err:
            ip_addresses[domain] = str(err)

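        # Write a full batch to disk and start a new one
        if len(ip_addresses) >= flush_after:
            flush(ip_addresses)
            ip_addresses = dict()

    # Flush the final partial batch, which would otherwise be lost
    if ip_addresses:
        flush(ip_addresses)

    return ip_addresses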


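if __name__ == "__main__":
    with open('top1M.csv') as fd:
        # Skip empty lines such as the one after the trailing newline
        domain_list = [line.strip() for line in fd if line.strip()]
    random.shuffle(domain_list)
    print(f'Looking up {len(domain_list)} domains')
    # socket.gethostbyname() always uses the resolver configured at the
    # system level (for example in /etc/resolv.conf). To use 1.1.1.1,
    # configure it there; the socket module has no DNS server setting.
    lookup_addresses(domain_list)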

The lookup process took around 2 days; it can of course be parallelized. After looking up all 1 million domain names, a JSON file is obtained that has the following structure (only a small excerpt of the full file is shown):

{
  "bonanza-play.com": "104.21.72.206",
  "rhythm.cloud": "76.223.17.25",
  "poryadok.ru": "104.22.65.119",
  "casino-x-noq.buzz": "104.21.19.128",
  "pin-up-casino64.ru": "172.67.173.230",
  "nostroy.ru": "89.253.229.54",
  "mostbet-ru.life": "172.67.204.37",
  "artmotion.net": "104.26.6.15",
  "latnoticias365.com": "51.77.14.1",
  "rtkba.com": "104.21.89.225",
  "transitcard.ru": "89.104.86.143",
  "alsatiapolynia.com": "[Errno -5] No address associated with hostname",
  "xhfu.cn": "[Errno 8] nodename nor servname provided, or not known",
  "joycasino-a16.top": "172.67.179.211",
  "redstarslots.ru": "176.10.250.233",
  "tdspsden.com": "172.64.149.13",
  "vireq.com": "13.224.103.119",
  "rickhendrickdodge.com": "54.243.57.127",
  "pulsure.dk": "185.31.79.5",
  "loomis.com": "52.17.152.5",
  "allianceservices.im": "[Errno 8] nodename nor servname provided, or not known",
  "loups-garous-en-ligne.com": "172.67.72.203"
}
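Because the lookups spend almost all of their time waiting on the network, they parallelize well with threads. The following is a minimal sketch, not the code used for this article: the helper names resolve_one and resolve_all and the pool size of 64 workers are assumptions, and the resolver is passed in as a parameter so that the logic can be tried without network access.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve_one(domain, resolver=socket.gethostbyname):
    """Return (domain, ip), or (domain, error message) if the lookup fails."""
    try:
        return domain, resolver(domain)
    except (OSError, UnicodeError) as err:  # socket.gaierror is an OSError
        return domain, str(err)


def resolve_all(domains, resolver=socket.gethostbyname, workers=64):
    """Resolve many domains concurrently (hypothetical helper)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda d: resolve_one(d, resolver), domains))
```

The result has the same domain-to-IP (or domain-to-error) shape as the JSON excerpt above, so it could be written to disk with the existing flush function.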

Step 3: Clean the obtained IP Addresses

The next step removes from the JSON file those IP addresses that the ipapi.is API already detects as hosting provider IP addresses. The reason is simple: IP addresses that are known to belong to hosting providers don't need to be detected again. Furthermore, duplicates are removed from the resulting list of IP addresses.

After those two steps, a list of IP addresses that ipapi.is API currently does not classify as hosting IP addresses is obtained.
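The cleaning step can be sketched as follows. This is an illustration under assumptions: build_candidates is a hypothetical helper, and is_known_hosting stands in for whatever check is used against the existing ipapi.is data (the real lookup is not shown in this article). Failed lookups are dropped because their value is an error string rather than an IPv4 address.

```python
import ipaddress


def is_ipv4(value):
    """True if value is a valid IPv4 address (failed lookups hold error text)."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        return False


def build_candidates(resolved, is_known_hosting):
    """resolved maps domain -> IP or error text; returns unique unknown IPs."""
    candidates = set()  # a set removes duplicates
    for value in resolved.values():
        if is_ipv4(value) and not is_known_hosting(value):
            candidates.add(value)
    # Sort numerically so the output is stable between runs
    return sorted(candidates, key=ipaddress.IPv4Address)
```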

From 1,001,400 domain names in total, 906,519 IPv4 addresses were obtained (the rest failed to resolve). Of those 906,519 IPv4 addresses, 721,823 were already classified as datacenter IPs by the ipapi.is API. The remaining 184,696 IPs were de-duplicated, which yielded roughly 100,000 unique IP addresses that are candidates for the algorithm.

The list is called the candidate IP address list, since there is a good chance that those IP addresses belong to a hosting provider that is previously unknown to the ipapi.is API.

Why is that the case?

While some organizations that are not hosting providers might choose to host their own domain (such as universities, large organizations or government entities), most organizations do not run their own datacenters and instead rent hosting resources from a professional hosting provider. And since the list contains 1 million domains, it is very likely that a significant share of the existing hosting providers on the Internet is represented in this list.

This is an excerpt of the candidate IP address list:

[
  "66.51.127.80",
  "119.110.249.22",
  "185.145.195.71",
  "209.203.26.244",
  "176.102.65.18",
  "85.92.117.211",
  "45.135.121.27",
  "178.248.235.42",
  "178.35.253.211",
  "89.30.219.98",
  "194.9.149.53",
  "210.31.101.1",
  "61.31.224.233",
  "185.165.31.203",
  "193.148.244.24"
]

There is no trivial way to infer whether the IP address belongs to a hosting provider or not without having more information about each particular IP address.

A straightforward idea is to find the organization that is the owner of this IP address and to detect whether the owning organization is a hosting provider or not. Based on the organization name alone it is (usually) not possible to answer this. Therefore, the organization's website must be crawled.

But first, the organizations responsible for the IPs in the candidate IP address list must be found.

Step 4: Obtain WHOIS Records for every IP Address

First, the organization that owns the IP address needs to be obtained. This is possible by conducting a WHOIS lookup for each IP address in the candidate IP address list. For example, the command whois 66.51.127.80 yields the following WHOIS record:

NetRange:       66.51.120.0 - 66.51.127.255
CIDR:           66.51.120.0/21
NetName:        FLYIO
NetHandle:      NET-66-51-120-0-1
Parent:         NET66 (NET-66-0-0-0-0)
NetType:        Direct Allocation
OriginAS:       
Organization:   Fly.io, Inc. (FLYIO)
RegDate:        2021-12-06
Updated:        2021-12-06
Ref:            https://rdap.arin.net/registry/ip/66.51.120.0


OrgName:        Fly.io, Inc.
OrgId:          FLYIO
Address:        PO Box 803338 #19104
City:           Chicago
StateProv:      IL
PostalCode:     60680-3338
Country:        US
RegDate:        2017-01-18
Updated:        2023-07-07
Ref:            https://rdap.arin.net/registry/entity/FLYIO


OrgTechHandle: SANDE663-ARIN
OrgTechName:   Sanders, Scott 
OrgTechPhone:  +1-803-767-0060 
OrgTechEmail:  scott@jssjr.com
OrgTechRef:    https://rdap.arin.net/registry/entity/SANDE663-ARIN

OrgAbuseHandle: ABUSE8489-ARIN
OrgAbuseName:   Abuse
OrgAbusePhone:  +1-312-626-4490 
OrgAbuseEmail:  abuse@fly.io
OrgAbuseRef:    https://rdap.arin.net/registry/entity/ABUSE8489-ARIN

OrgNOCHandle: FLYOP-ARIN
OrgNOCName:   Fly Ops
OrgNOCPhone:  +1-312-283-4377 
OrgNOCEmail:  ops@fly.io
OrgNOCRef:    https://rdap.arin.net/registry/entity/FLYOP-ARIN

OrgTechHandle: BERRY359-ARIN
OrgTechName:   Berryman, Steve 
OrgTechPhone:  +447886749129 
OrgTechEmail:  steve@fly.io
OrgTechRef:    https://rdap.arin.net/registry/entity/BERRY359-ARIN

OrgTechHandle: FLYOP-ARIN
OrgTechName:   Fly Ops
OrgTechPhone:  +1-312-283-4377 
OrgTechEmail:  ops@fly.io
OrgTechRef:    https://rdap.arin.net/registry/entity/FLYOP-ARIN
              

Limitations of conducting WHOIS lookups

Because the candidate IP list contains roughly 100,000 IP addresses, WHOIS servers may rate limit queries that arrive too fast. A realistic speed that stays under the radar is perhaps 20,000 WHOIS lookups per day, so the whole process takes at least 5 days (which is fine).
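The arithmetic behind this estimate, together with a simple way to pace the lookups, can be sketched as follows. The function names are hypothetical, and the sleep and clock parameters exist only so that the pacing can be tried without actually waiting. A budget of 20,000 lookups per day corresponds to one lookup every 86,400 / 20,000 = 4.32 seconds, and about 100,000 candidates at that pace take 5 days.

```python
import time

SECONDS_PER_DAY = 24 * 60 * 60


def min_interval(lookups_per_day):
    """Seconds to wait between lookups to stay within a daily budget."""
    return SECONDS_PER_DAY / lookups_per_day


def days_needed(num_candidates, lookups_per_day):
    """Days required to work through all candidates at the given budget."""
    return num_candidates / lookups_per_day


def throttled(items, lookups_per_day, sleep=time.sleep, clock=time.monotonic):
    """Yield items no faster than the daily budget allows."""
    interval = min_interval(lookups_per_day)
    next_slot = clock()
    for item in items:
        delay = next_slot - clock()
        if delay > 0:
            sleep(delay)
        next_slot = max(next_slot, clock()) + interval
        yield item
```

A loop such as `for ip in throttled(candidate_ips, 20_000): ...` would then issue at most one WHOIS lookup per 4.32 seconds.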

Step 5: Parse the WHOIS record and extract the Company Name and Domain

Based on the WHOIS record above, it is still not possible to say whether the organization Fly.io, Inc. is a hosting provider or not. The next goal is to find the organization's website URL. Two attributes from the WHOIS record can be of help:

  • The organization name can be parsed from the OrgName: Fly.io, Inc. attribute. The organization name can be Googled and the first search result could be the organization's website.
  • The domain can be parsed from the OrgAbuseEmail: abuse@fly.io attribute and the domain might be the same as the domain in the organization's website (Which is correct with fly.io).
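Extracting both attributes from a record like the one above can be sketched as follows. The helper parse_whois and its regular expression are assumptions made for illustration. It only handles the ARIN layout shown above; records from other registries use different field names and would need their own handling.

```python
import re
from collections import Counter

# Captures the domain part of an e-mail address such as abuse@fly.io
EMAIL_DOMAIN = re.compile(r'[\w.+-]+@([\w-]+(?:\.[\w-]+)+)')


def parse_whois(record):
    """Return (organization name or None, Counter of e-mail domains)."""
    org_name = None
    domains = Counter()
    for line in record.splitlines():
        key, _, value = line.partition(':')
        value = value.strip()
        if key.strip() == 'OrgName' and value:
            org_name = value
        for domain in EMAIL_DOMAIN.findall(value):
            domains[domain.lower()] += 1
    return org_name, domains
```

Counting the domains makes it possible to prefer the one that appears most often in the record, which for the example above is fly.io rather than jssjr.com.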

Possible Limitations:

  • The WHOIS record does not include a domain. Action: Proceed with the organization name.
  • The WHOIS record does not include an organization name. Action: Proceed with the domain.
  • If neither a domain nor an organization name is available, the algorithm aborts.

If a domain is available in the WHOIS record, the following limitations apply:

  • The WHOIS record includes a misleading domain that is not the domain of the organization's website. Action: Check against a blacklist of known bad domains.
  • The organization domain is only a technical domain and not the primary website of the organization. Action: This cannot be detected; a false positive is obtained.

If only the organization name is available in the WHOIS record, the following limitations apply:

  • The name of the organization cannot be found on Google. Action: Abort the algorithm.
  • The name of the organization leads to wrong search results and a wrong organization URL is obtained. Problem: It is not easy to tell whether the search result is really the organization's website. Action: Use a text similarity metric between the organization name and the URL.
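The two mitigations named above, a blacklist of known bad domains and a text similarity metric, can be sketched as follows. Everything here is an illustrative assumption: the blacklist entries, the threshold of 0.5 and the helper names are made up for this example, and difflib's ratio is just one of many possible similarity measures.

```python
import re
from difflib import SequenceMatcher

# Hypothetical examples of domains that never identify the organization itself
BAD_DOMAINS = {'gmail.com', 'arin.net', 'ripe.net'}


def normalize(text):
    """Lowercase and keep only letters and digits."""
    return re.sub(r'[^a-z0-9]', '', text.lower())


def similarity(org_name, domain):
    """Similarity in [0, 1] between an organization name and a domain."""
    label = domain.lower().rsplit('.', 1)[0]  # drop the top-level domain
    return SequenceMatcher(None, normalize(org_name), normalize(label)).ratio()


def plausible_domains(org_name, domains, threshold=0.5):
    """Drop blacklisted domains and keep those resembling the organization name."""
    kept = [d for d in domains if d.lower() not in BAD_DOMAINS]
    return [d for d in kept if similarity(org_name, d) >= threshold]
```

For the Fly.io record, this keeps fly.io and rejects jssjr.com, which shares no characters with the organization name.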

Step 6: Crawl the Website and Classify the Website Text

After visiting the website fly.io, it becomes clear that Fly.io, Inc. is an infrastructure as a service organization that sells specialized hosting resources. This is close enough, so the organization Fly.io, Inc. and all their known IP ranges can be classified as hosting IP ranges.

The only way to decide automatically whether an organization's website belongs to a hosting provider is some kind of text classification approach. If the website includes a certain quantity of hosting keywords, it is classified as a hosting provider. Machine learning could also be used, but this is out of scope for this quick article.
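A keyword-based classifier of this kind can be sketched in a few lines. The keyword list and the minimum of three hits are assumptions chosen for illustration, not the values used to produce the results in this article.

```python
import re

# Illustrative keyword list; a real list would need tuning
HOSTING_KEYWORDS = ('vps', 'dedicated server', 'bare metal', 'cloud hosting',
                    'web hosting', 'virtual private server', 'colocation')


def hosting_score(text):
    """Count occurrences of hosting keywords in the website text."""
    normalized = re.sub(r'\s+', ' ', text.lower())
    return sum(len(re.findall(r'\b' + re.escape(kw) + r'\b', normalized))
               for kw in HOSTING_KEYWORDS)


def is_hosting_provider(text, min_hits=3):
    """Classify as hosting provider once enough keywords are found."""
    return hosting_score(text) >= min_hits
```

The function expects the visible text of the page, so HTML tags, scripts and styles should be stripped before scoring.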

Limitations of Crawling a Website

  • Crawling the URL results in a ban, since the website has crawling protection. Action: Abort the algorithm or try again later.
  • The website rate limits requests and thus the request is blocked. Action: Abort the algorithm or try again later.
  • The crawling results in some kind of error (Certificate error, 404 Not Found, 503 Server Error, or similar). Action: Abort the algorithm.
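The actions listed above can be turned into a small decision function. This sketch uses only the Python standard library. Treating the status codes 403 and 429 as signs of crawl protection or rate limiting is an assumption, as are the helper names and the limit of three attempts.

```python
import urllib.error
import urllib.request

RETRYABLE = {403, 429}  # assumption: typical codes for bot protection and rate limits


def fetch(url, timeout=10):
    """Return (status, html); status is None for network or certificate errors."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode('utf-8', errors='replace')
    except urllib.error.HTTPError as err:  # must precede URLError, its parent class
        return err.code, ''
    except (urllib.error.URLError, OSError):
        return None, ''


def next_action(status, attempts, max_attempts=3):
    """Decide what to do with a crawl result: 'classify', 'retry_later' or 'abort'."""
    if status == 200:
        return 'classify'
    if status in RETRYABLE and attempts < max_attempts:
        return 'retry_later'
    return 'abort'
```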

Limitations of Classifying Text

  • The classification result is a false negative (No hosting provider even though the website is one). Action: This happens and does not have a negative impact.
  • The classification result is a false positive (Classified as hosting provider, even though the website is not one). Problem: This has a large negative impact. Action: If the scoring is weak or indecisive, put the score result on a list for human verification.
  • The website might be written in any language. Therefore the text classification algorithm must support the most commonly used languages on the Internet, which is hard to implement. It is better to translate the website into English. Action: Use Google Translate automatically (Do they rate limit translation?).
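Sending weak or indecisive scores to human verification amounts to a three-way decision instead of a yes/no answer. A minimal sketch, assuming a numeric score where higher means more likely a hosting provider; the two thresholds and the function name are illustrative only.

```python
def triage(score, accept_at=6, review_at=3):
    """Map a classifier score to 'hosting', 'review' or 'not_hosting'."""
    if score >= accept_at:
        return 'hosting'
    if score >= review_at:
        return 'review'  # weak or indecisive: queue for human verification
    return 'not_hosting'
```

Raising accept_at trades more manual review for fewer false positives, which matches the priority stated earlier in this article.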

Algorithm Pseudo Code

What follows is the pseudo code of the hosting detection algorithm described above. The code implements steps 4 to 6. The first three steps are left out, since they are straightforward to implement.

const hostingDetectionAlgorithm = async () => {
  const candidateIps = [
    "66.51.127.80",
    "119.110.249.22",
    "61.31.224.233",
    "193.148.244.24",
    // ... (huge list of 100k candidate IPs)
  ];


  let i = 0;
  let whoisLookupServer = 'standard';
  let failures = {};
  let classified = {};

  while (i < candidateIps.length) {
    const ip = candidateIps[i];
    let whoisRecord = whoisLookup(ip, whoisLookupServer);

    if (whoisServerDeniedRequest(whoisRecord)) {
      if (haveUnblockedWhoisLookupServer()) {
        whoisLookupServer = getNextWhoisLookupServer();
        continue;
      } else {
        waitForFiveMinutes();
        continue;
      }
    }

    // we have a good whois lookup result
    const orgName = getOrgFromWhois(whoisRecord);
    const allDomains = getDomainsFromWhois(whoisRecord);

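    // an empty array is truthy in JavaScript, so check its length as well
    if (!orgName && (!allDomains || allDomains.length === 0)) {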
      failures[ip] = {
        ip: ip,
        status: 'failed',
        error: 'Cannot parse org name and domain name from whois record'
      };
      i++;
      continue;
    }

    let orgUrlCandidates = [];

    if (allDomains) {
      // remove all domains that cannot be the organization's website domain
      let filteredDomains = filterBadDomains(allDomains);
      // order the filtered domain by frequency in the WHOIS record
      let sortedByFrequency = sortByDomainFrequency(filteredDomains);
      // insert trial organization url candidates according to domain frequency
      for (const domain of sortedByFrequency) {
        // without an organization name, proceed with the domain alone
        if (!orgName || isSimilar(domain, orgName)) {
          orgUrlCandidates.push(`https://${domain}`);
          orgUrlCandidates.push(`https://www.${domain}`);
        }
      }
    }

    // try to get the organization's website
    for (const url of orgUrlCandidates) {
      let crawlResponse = crawlHtml(url);
      if (isCrawlSuccessful(crawlResponse)) {
        const isHostingProvider = classifyText(crawlResponse);
        classified[ip] = {
          isHostingProvider: isHostingProvider,
          orgName: orgName,
          url: url,
        };
        break;
      }
    }
    i++;
  }
};

Conclusion

The algorithm above is rather complex. Unfortunately, there are many steps that can go wrong or where false assumptions can be made. Furthermore, performing WHOIS lookups and crawling websites for thousands of IP addresses and organization names is rather slow, and getting blocked is an issue.

Nevertheless, the discovery speed of the hosting detection algorithm doesn't need to be high. Furthermore, with every discovered hosting provider the algorithm becomes more efficient, since fewer candidate IP addresses remain.
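The shrinking candidate list can be illustrated with Python's ipaddress module. Once a WHOIS record has been classified, every remaining candidate inside its range (for example the CIDR 66.51.120.0/21 from the Fly.io record) no longer needs a lookup of its own. The helper name remove_covered is hypothetical.

```python
import ipaddress


def remove_covered(candidates, cidr):
    """Drop candidate IPs inside a range that has just been classified."""
    network = ipaddress.ip_network(cidr)
    return [ip for ip in candidates if ipaddress.ip_address(ip) not in network]
```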

To conclude, thousands of new hosting providers could be detected by applying the hosting detection algorithm to the candidate IP address list. This gives ipapi.is one of the best hosting detection APIs that exist!