Published:
September 13, 2023
Last Modified:
September 23, 2023
hosting detection
cloud provider detection
it security
An Algorithm to Detect Hosting Providers
and their IP Ranges
This blog article introduces an efficient way to detect previously unknown hosting providers and their IP
ranges. The detection algorithm aims to detect hosting providers from the entire Internet.
Too long; Didn't read
This blog article demonstrates how it is possible to systematically detect new, previously unknown hosting
providers from the Internet. Knowing the IP ranges of hosting providers is important for many IT security
applications and yields better coverage for the is_datacenter
field of the ipapi.is API.
The algorithm developed in this blog post found many new hosting providers. You can download the
file classifiedHostingProviders.tsv, which contains around 96,000 IP ranges of previously unknown
hosting providers. Those 96,000 hosting IP ranges belong to more than 2,000 newly detected hosting
providers. At the time of writing, the algorithm is still running.
Download
classifiedHostingProviders.tsv
What are hosting providers?
Hosting providers are organizations that allow third parties to purchase computing resources such as
Virtual Private Servers (VPS) or bare metal servers. Such servers are often assigned a public IP address
and are reachable from anywhere on the Internet. Hosting instances are used to run web servers, mail
servers, file servers, SSH servers, or other kinds of software that require steady uptime and a publicly
reachable IP address.
The definition of
hosting providers is quite lenient for the purpose of ipapi.is and includes all of the
following:
Put differently: Every organization that allows anybody to quickly and anonymously obtain hosting
resources or IP addresses is considered to be a hosting provider.
Why is hosting detection important?
Hosting resources can be used to run software that was developed with malicious
intent in mind.
Threat actors often abuse hosting infrastructure to run proxy
servers or VPN servers in order to anonymously commit cyber crime. Furthermore, hosting resources are also
frequently abused to
run advanced bots or crawlers. For instance, many bot programmers run Headless Chrome from cloud
instances.
Therefore, an IP address that is known to belong to a hosting provider can be treated as a potential
threat. This knowledge helps to mitigate malicious traffic.
Furthermore, it is not sufficient to only know the IP ranges of the major hosting providers. Many threat
actors move their operations to smaller, lesser-known hosting providers.
The avid reader might object at this point: "Such illegal software can also be run on
personal laptops or workstations. Why are hosting providers more prone to malicious activity?"
It is correct that threat actors don't exclusively use hosting resources to commit their crimes.
However, from the perspective of a website or app operator, there is almost no good reason why organic
user traffic should originate from hosting IP ranges. Or to put it differently: No legitimate user
browses the web from a server hosted somewhere on the Internet.
The only plausible reason why humans might have hosting IP addresses is that they are using VPN or proxy
servers that are hosted in the cloud. In all other cases, it can be assumed that traffic originating from
hosting ranges comes from bots and other malicious programs hosted in the cloud or in datacenters.
There are likely some edge cases where legitimate users have cloud IP addresses, but usually, the average
human Internet user surfs with either a residential or a mobile IP address (ISP IP addresses).
Goals of this Research
The goal of this article is to present a scalable algorithm that finds new hosting providers and the
IP ranges that belong to them. The false positive rate - classifying organizations as hosting providers
even
though they are not - should be as small as possible. The newly detected hosting
IP ranges will be used by the ipapi.is API to populate the
is_datacenter
field.
Some examples for IP addresses that belong to hosting providers:
It is not possible to find every single hosting provider that exists on the Internet. But it certainly
is possible to find a substantial share of the hosting providers that are out there.
The Hosting Detection Algorithm
The hosting detection algorithm is explained in the following sections. The algorithm is fully
automatic and thus does not require manual human interaction. It consists of six processing steps.
Some processing steps are inherently risky and can lead to false positives (classifying IP addresses
as hosting IPs, even though they are not).
In general, the false positive rate should be minimized as much as possible, even if the false negative
rate increases as a consequence. Put differently: The algorithm should rather miss some hosting
providers than risk wrongly classifying an organization as a hosting provider.
Step 1: Download a List of Top 1 Million Domain Names
In a first step, a list of the top 1 million domain names is downloaded from Cloudflare Radar. This list includes the Top 1,000,000
domains of the entire Internet. If you are interested in how exactly the Domain Ranking is computed, you
can read about Cloudflare's Domain Ranking
method on their blog.
Step 2: Lookup the IP address for every Domain Name
In the next step, each of the 1 million domain names must be resolved. Resolving means that the DNS name
is translated into an IP address. This is a rather time-consuming process, since it involves looking up
1 million domain names. The following Python 3 script does exactly that:
import json
import os
import random
import socket


def flush(results, fn='res.json'):
    """Merge the given results into the JSON file on disk."""
    print(f'Flushing {len(results)} IPs to disk')
    parsed = dict()
    if os.path.exists(fn):
        with open(fn) as fd:
            parsed = json.load(fd)
    parsed.update(results)
    with open(fn, 'w') as fd:
        json.dump(parsed, fd, indent=2)


def lookup_addresses(domain_list, flush_after=200):
    """Resolve each domain to an IPv4 address and flush to disk in batches."""
    ip_addresses = dict()
    for domain in domain_list:
        try:
            ip_addresses[domain] = socket.gethostbyname(domain)
        except (socket.gaierror, UnicodeError) as err:
            # store the error message so that failed lookups stay visible
            ip_addresses[domain] = str(err)
        if len(ip_addresses) >= flush_after:
            flush(ip_addresses)
            ip_addresses = dict()
    # flush the last, partially filled batch
    if ip_addresses:
        flush(ip_addresses)


if __name__ == "__main__":
    with open('top1M.csv') as fd:
        # skip empty lines (for example the trailing newline)
        domain_list = [line.strip() for line in fd if line.strip()]
    random.shuffle(domain_list)
    print(f'Looking up {len(domain_list)} domains')
    # socket.gethostbyname() uses the resolver configured by the operating
    # system. To query a specific DNS server such as 1.1.1.1, configure it
    # at the system level (e.g. in /etc/resolv.conf).
    lookup_addresses(domain_list)
The lookup process took around 2 days and can of course be parallelized. After looking up all 1 million
domain names, a JSON file with the following structure is obtained (only a small excerpt of the full
file is shown):
{
  "bonanza-play.com": "104.21.72.206",
  "rhythm.cloud": "76.223.17.25",
  "poryadok.ru": "104.22.65.119",
  "casino-x-noq.buzz": "104.21.19.128",
  "pin-up-casino64.ru": "172.67.173.230",
  "nostroy.ru": "89.253.229.54",
  "mostbet-ru.life": "172.67.204.37",
  "artmotion.net": "104.26.6.15",
  "latnoticias365.com": "51.77.14.1",
  "rtkba.com": "104.21.89.225",
  "transitcard.ru": "89.104.86.143",
  "alsatiapolynia.com": "[Errno -5] No address associated with hostname",
  "xhfu.cn": "[Errno 8] nodename nor servname provided, or not known",
  "joycasino-a16.top": "172.67.179.211",
  "redstarslots.ru": "176.10.250.233",
  "tdspsden.com": "172.64.149.13",
  "vireq.com": "13.224.103.119",
  "rickhendrickdodge.com": "54.243.57.127",
  "pulsure.dk": "185.31.79.5",
  "loomis.com": "52.17.152.5",
  "allianceservices.im": "[Errno 8] nodename nor servname provided, or not known",
  "loups-garous-en-ligne.com": "172.67.72.203"
}
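Since each lookup mostly waits on the network, the parallelization mentioned above could be sketched with a thread pool. This is a minimal sketch and not the script used for this article: the names resolve_one and resolve_all, the injectable resolver parameter, and the default of 50 workers are all assumptions made for illustration.

```python
import socket
from concurrent.futures import ThreadPoolExecutor


def resolve_one(domain, resolver=socket.gethostbyname):
    """Return (domain, ip), or (domain, error message) if the lookup fails."""
    try:
        return domain, resolver(domain)
    except (OSError, UnicodeError) as err:
        # socket.gaierror is a subclass of OSError
        return domain, str(err)


def resolve_all(domains, workers=50, resolver=socket.gethostbyname):
    """Resolve many domains concurrently; DNS lookups are I/O bound."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(lambda d: resolve_one(d, resolver), domains))
```

The result has the same shape as the JSON file above: a mapping from domain name to either an IP address or an error message. A very high worker count may trigger rate limiting at the DNS resolver, so the value should be tuned carefully.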
Step 3: Clean the obtained IP Addresses
The next step involves removing from the JSON file those IP addresses that the ipapi.is API
already detects as hosting provider IP addresses. The reason is obvious: IP addresses that are known
to belong to hosting providers don't need to be detected again. Furthermore, duplicates are removed from
the resulting list of IP addresses.
After those two steps, a list of IP addresses that the ipapi.is API currently
does not classify as hosting IP addresses is obtained.
Of the 1,001,400 domain names in total, 906,519 resolved to an IPv4 address (the rest failed to resolve
correctly). Of those 906,519 IPv4 addresses, 721,823 were already classified as datacenter IPs by the
ipapi.is API. The rest (184,696 IPs) were de-duplicated and
yielded roughly 100,000
unique IP addresses that are candidates for the algorithm.
The list is called the candidate IP address list, since there is a good chance that those IP
addresses belong to a hosting provider that is still unknown to the ipapi.is API.
Why is that the case?
While some organizations that are not hosting providers might choose to host their own domains (such as
universities, large organizations, or government entities), most organizations do not run their own
datacenters and instead rent hosting resources from a professional hosting provider. And since the list
contains 1 million domains, it is very likely that a significant share of the Internet's hosting
providers is represented in this list.
This is an excerpt of the candidate IP address list:
[
  "66.51.127.80",
  "119.110.249.22",
  "185.145.195.71",
  "209.203.26.244",
  "176.102.65.18",
  "85.92.117.211",
  "45.135.121.27",
  "178.248.235.42",
  "178.35.253.211",
  "89.30.219.98",
  "194.9.149.53",
  "210.31.101.1",
  "61.31.224.233",
  "185.165.31.203",
  "193.148.244.24"
]
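The cleaning in Step 3 could be sketched as follows. This is an illustrative sketch only: build_candidates and is_ipv4 are hypothetical names, and is_known_hosting stands for any check against the already known hosting ranges (for example the is_datacenter field), passed in as a function so that the sketch makes no assumption about how that check is implemented.

```python
import ipaddress


def is_ipv4(value):
    """True if value is a syntactically valid IPv4 address."""
    try:
        ipaddress.IPv4Address(value)
        return True
    except ValueError:
        # failed lookups are stored as error messages, not as addresses
        return False


def build_candidates(resolved, is_known_hosting):
    """Drop failed lookups and known hosting IPs, then de-duplicate."""
    candidates = set()
    for ip in resolved.values():
        if is_ipv4(ip) and not is_known_hosting(ip):
            candidates.add(ip)
    # sort numerically to get a stable, readable output
    return sorted(candidates, key=ipaddress.IPv4Address)
```

Using a set takes care of the de-duplication, since many domains resolve to the same IP address.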
There is no trivial way to infer whether the IP address belongs to a hosting provider or
not without having more information about each particular IP address.
A straightforward idea is to find the organization that is the owner of this IP address
and to detect whether the owning organization is a hosting provider or not. Based on the organization name
alone it is (usually) not possible to answer this. Therefore, the organization's website must be crawled.
But first, the organizations responsible for the IPs in the candidate IP address list must be
found.
Step 4: Obtain WHOIS Records for every IP Address
First, the organization that owns the IP address needs to be obtained. This is possible by
conducting a WHOIS lookup for each IP address of the candidate IP address list. For example, the
WHOIS lookup
whois 66.51.127.80
yields the following WHOIS record:
NetRange: 66.51.120.0 - 66.51.127.255
CIDR: 66.51.120.0/21
NetName: FLYIO
NetHandle: NET-66-51-120-0-1
Parent: NET66 (NET-66-0-0-0-0)
NetType: Direct Allocation
OriginAS:
Organization: Fly.io, Inc. (FLYIO)
RegDate: 2021-12-06
Updated: 2021-12-06
Ref: https://rdap.arin.net/registry/ip/66.51.120.0
OrgName: Fly.io, Inc.
OrgId: FLYIO
Address: PO Box 803338 #19104
City: Chicago
StateProv: IL
PostalCode: 60680-3338
Country: US
RegDate: 2017-01-18
Updated: 2023-07-07
Ref: https://rdap.arin.net/registry/entity/FLYIO
OrgTechHandle: SANDE663-ARIN
OrgTechName: Sanders, Scott
OrgTechPhone: +1-803-767-0060
OrgTechEmail: scott@jssjr.com
OrgTechRef: https://rdap.arin.net/registry/entity/SANDE663-ARIN
OrgAbuseHandle: ABUSE8489-ARIN
OrgAbuseName: Abuse
OrgAbusePhone: +1-312-626-4490
OrgAbuseEmail: abuse@fly.io
OrgAbuseRef: https://rdap.arin.net/registry/entity/ABUSE8489-ARIN
OrgNOCHandle: FLYOP-ARIN
OrgNOCName: Fly Ops
OrgNOCPhone: +1-312-283-4377
OrgNOCEmail: ops@fly.io
OrgNOCRef: https://rdap.arin.net/registry/entity/FLYOP-ARIN
OrgTechHandle: BERRY359-ARIN
OrgTechName: Berryman, Steve
OrgTechPhone: +447886749129
OrgTechEmail: steve@fly.io
OrgTechRef: https://rdap.arin.net/registry/entity/BERRY359-ARIN
OrgTechHandle: FLYOP-ARIN
OrgTechName: Fly Ops
OrgTechPhone: +1-312-283-4377
OrgTechEmail: ops@fly.io
OrgTechRef: https://rdap.arin.net/registry/entity/FLYOP-ARIN
Limitations of conducting WHOIS lookups
Because the candidate IP address list contains roughly 100,000 IP addresses, WHOIS servers may rate
limit us if we query too fast. A realistic speed that stays under the radar is perhaps 20,000 WHOIS
lookups per day, so the whole process takes at least 5 days (which is fine).
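The arithmetic behind this estimate can be made explicit with a small helper. The sketch assumes roughly 100,000 candidate IP addresses and the budget of 20,000 lookups per day mentioned above; the function names are made up for illustration.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def lookup_interval(lookups_per_day):
    """Seconds to wait between two lookups to stay within a daily budget."""
    return SECONDS_PER_DAY / lookups_per_day


def days_needed(total_lookups, lookups_per_day):
    """Whole days needed to work through all lookups at the given rate."""
    return math.ceil(total_lookups / lookups_per_day)


print(lookup_interval(20_000))       # seconds between two lookups
print(days_needed(100_000, 20_000))  # days for roughly 100,000 IPs
```

At 20,000 lookups per day, this works out to one lookup about every 4.3 seconds and 5 days in total. Calling time.sleep() with that interval between lookups would be the simplest way to enforce the budget.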
Step 5: Parse the WHOIS record and extract the Company Name and Domain
Based on the WHOIS example above, it is still not possible to say whether the organization
Fly.io, Inc. is a hosting provider or not. The next goal is to find the organization's website URL.
Two attributes from the WHOIS record can help:
- The organization name can be parsed from the OrgName: Fly.io, Inc. attribute. The organization name
can be searched for on Google, and the first search result could be the organization's website.
- The domain can be parsed from the OrgAbuseEmail: abuse@fly.io attribute. This domain might be the
same as the domain of the organization's website (which is the case for fly.io).
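Extracting these two attributes could look like the sketch below. It only handles ARIN-style records such as the one shown above (other registries use different field names), and the function names and the e-mail pattern are assumptions for illustration, not the parser used for this article.

```python
import re
from collections import Counter

# Captures the domain part of an e-mail address
EMAIL_DOMAIN = re.compile(r'[\w.+-]+@([\w-]+(?:\.[\w-]+)+)')


def get_org_from_whois(record):
    """Return the value of the first OrgName line, or None."""
    match = re.search(r'^OrgName:[ \t]*(.+)$', record, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


def get_domains_from_whois(record):
    """Count the domains of all e-mail addresses found in the record."""
    return Counter(d.lower() for d in EMAIL_DOMAIN.findall(record))
```

For the record above, the organization name would be Fly.io, Inc., and fly.io would be the most frequent e-mail domain, ahead of jssjr.com.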
Possible Limitations:
- The WHOIS record does not include a domain. Action: Proceed with the organization name.
- The WHOIS record does not include an organization name. Action: Proceed with the domain.
- If neither a domain nor an organization name is available, the algorithm aborts.
If a domain is available in the WHOIS record, the following limitations apply:
- The WHOIS record includes a misleading domain that is not the domain of the organization's website.
Action: Check against a blacklist of known bad domains.
- The organization domain is only a technical domain and not the primary website of the organization.
Action: This cannot be detected; a false positive may be obtained.
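The blacklist check can be combined with a ranking by frequency, so that the domain that appears most often in the WHOIS record is tried first. The following is a minimal sketch: the entries in BAD_DOMAINS are illustrative examples (free mail providers and registries), not the actual blacklist, and rank_domains is a hypothetical name.

```python
from collections import Counter

# Illustrative entries only; a real blacklist would be much longer
BAD_DOMAINS = {'gmail.com', 'outlook.com', 'arin.net', 'ripe.net'}


def rank_domains(domains, bad_domains=BAD_DOMAINS):
    """Remove blacklisted domains and order the rest by frequency."""
    counts = Counter(
        d.lower() for d in domains if d.lower() not in bad_domains
    )
    return [domain for domain, _ in counts.most_common()]
```

For the e-mail domains of the Fly.io record above, this would rank fly.io before jssjr.com.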
If only the organization name is available in the WHOIS record, the following limitations apply:
- The name of the organization cannot be found on Google. Action: Abort the algorithm.
- The name of the organization leads to wrong search results and a wrong organization URL is obtained.
Problem: It is not easy to tell whether the search result is really the organization's website.
Action: Use a text similarity metric between the organization name and the URL.
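One simple candidate for such a similarity metric is the ratio computed by difflib.SequenceMatcher from the Python standard library. The sketch below is an assumption about how the check could be done, not the metric actually used; in particular, the threshold of 0.5 is an arbitrary starting point that would need tuning.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase the text and keep only letters and digits."""
    return re.sub(r'[^a-z0-9]', '', text.lower())


def name_similarity(org_name, domain):
    """Similarity in [0, 1] between an organization name and a domain."""
    host = domain.lower()
    if host.startswith('www.'):
        host = host[4:]
    # drop the top-level domain: "fly.io" -> "fly"
    label = host.rsplit('.', 1)[0]
    matcher = SequenceMatcher(None, normalize(org_name), normalize(label))
    return matcher.ratio()


def is_similar(domain, org_name, threshold=0.5):
    """True if the domain plausibly belongs to the organization."""
    return name_similarity(org_name, domain) >= threshold
```

With these settings, fly.io is accepted for Fly.io, Inc., while jssjr.com is rejected because it shares no characters with the organization name.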
Step 6: Crawl the Website and Classify the Website Text
After visiting the website fly.io, it becomes clear that Fly.io, Inc. is an
infrastructure-as-a-service organization that sells specialized hosting resources. This is close
enough, and the organization Fly.io, Inc. and all its known IP ranges can be classified as hosting
IP ranges.
The only way to decide whether an organization's website belongs to a hosting provider is some kind
of text classification approach. If the website includes a certain number of hosting keywords, it is
classified as a hosting provider. Machine learning can also be used, but this is out of scope
for this quick article.
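A keyword-based classifier of this kind could be sketched as follows. Both the keyword list and the threshold of three distinct hits are assumptions chosen for illustration; the real list and scoring are not published in this article.

```python
import re

# Illustrative keyword list, not the one used in production
HOSTING_KEYWORDS = (
    'vps', 'virtual private server', 'dedicated server', 'bare metal',
    'web hosting', 'cloud hosting', 'colocation', 'data center',
    'datacenter',
)


def hosting_score(text):
    """Number of distinct hosting keywords that occur in the text."""
    lowered = text.lower()
    return sum(
        1 for keyword in HOSTING_KEYWORDS
        if re.search(r'\b' + re.escape(keyword) + r'\b', lowered)
    )


def classify_text(text, min_hits=3):
    """True if the text mentions at least min_hits distinct keywords."""
    return hosting_score(text) >= min_hits
```

Counting distinct keywords instead of total occurrences keeps a single repeated word from dominating the score. A higher min_hits lowers the false positive rate at the cost of more false negatives, which matches the priorities stated earlier.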
Limitations of Crawling a Website
- Crawling the URL results in a ban, since the website has crawling protection. Action: Abort the
algorithm or try again later.
- The website rate limits requests and thus the request is blocked. Action: Abort the
algorithm or try again later.
- The crawling results in some kind of error (Certificate error, 404 Not Found, 503 Server Error, or
similar). Action: Abort the algorithm.
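The failure cases listed above could be handled roughly as in the following sketch, which uses urllib from the Python standard library. The function name crawl_html, the User-Agent header, the timeout, and the injectable opener parameter (which makes the function testable without network access) are assumptions for illustration.

```python
import urllib.error
import urllib.request


def crawl_html(url, timeout=10, opener=urllib.request.urlopen):
    """Return the page body as text, or None if the crawl failed."""
    request = urllib.request.Request(
        url, headers={'User-Agent': 'Mozilla/5.0'}
    )
    try:
        with opener(request, timeout=timeout) as response:
            return response.read().decode('utf-8', errors='replace')
    except urllib.error.HTTPError as err:
        # 404, 429, 503 and similar: the server answered with an error
        print(f'{url}: HTTP {err.code}')
    except OSError as err:
        # URLError (DNS failure, refused connection, certificate error)
        # and timeouts are all subclasses of OSError
        print(f'{url}: {err}')
    return None
```

Returning None for every failure keeps the calling code simple: the caller either moves on to the next URL candidate or schedules a retry for later.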
Limitations of Classifying Text
- The classification result is a false negative (not classified as a hosting provider, even though the
website belongs to one). Action: This happens and does not have a negative impact.
- The classification result is a false positive (classified as a hosting provider, even though the
website does not belong to one). Problem: This has a large negative impact. Action: If the score is
weak or indecisive, put the result on a list for human verification.
- The website might be written in any language. Therefore, the text classification algorithm would have
to support the most commonly used languages on the Internet, which is hard to implement. It is better
to translate the website into English. Action: Use Google Translate automatically (Do they rate limit
translation?).
Algorithm Pseudo Code
What follows is the pseudo code of the hosting detection algorithm described above. The code below
implements steps 4 to 6. The first three steps were left out, since they are trivial to implement.
// Pseudo code: helpers such as whoisLookup(), crawlHtml() and classifyText()
// are assumed to exist and are not shown.
const hostingDetectionAlgorithm = async () => {
  const candidateIps = [
    "66.51.127.80",
    "119.110.249.22",
    "61.31.224.233",
    "193.148.244.24",
    // ... (huge list of 100k candidate IPs)
  ];
  let i = 0;
  let whoisLookupServer = 'standard';
  const failures = {};
  const classified = {};

  while (i < candidateIps.length) {
    const ip = candidateIps[i];
    const whoisRecord = await whoisLookup(ip, whoisLookupServer);

    if (whoisServerDeniedRequest(whoisRecord)) {
      if (haveUnblockedWhoisLookupServer()) {
        whoisLookupServer = getNextWhoisLookupServer();
      } else {
        await waitForFiveMinutes();
      }
      // retry the same IP address
      continue;
    }

    // we have a good whois lookup result
    const orgName = getOrgFromWhois(whoisRecord);
    const allDomains = getDomainsFromWhois(whoisRecord) || [];

    if (!orgName && allDomains.length === 0) {
      failures[ip] = {
        ip: ip,
        status: 'failed',
        error: 'Cannot parse org name and domain name from whois record'
      };
      i++;
      continue;
    }

    const orgUrlCandidates = [];

    // remove all domains that cannot be the organization's website domain
    const filteredDomains = filterBadDomains(allDomains);
    // order the filtered domains by frequency in the WHOIS record
    const sortedByFrequency = sortByDomainFrequency(filteredDomains);
    // insert trial organization url candidates according to domain frequency
    for (const domain of sortedByFrequency) {
      // without an organization name, proceed with the domain alone
      if (!orgName || isSimilar(domain, orgName)) {
        orgUrlCandidates.push(`https://${domain}`);
        orgUrlCandidates.push(`https://www.${domain}`);
      }
    }

    // try to get the organization's website
    for (const url of orgUrlCandidates) {
      const crawlResponse = await crawlHtml(url);
      if (isCrawlSuccessful(crawlResponse)) {
        classified[ip] = {
          isHostingProvider: classifyText(crawlResponse),
          orgName: orgName,
          url: url,
        };
        break;
      }
    }

    // record IPs for which no candidate URL could be crawled
    if (!classified[ip]) {
      failures[ip] = {
        ip: ip,
        status: 'failed',
        error: 'Could not crawl a website for the organization'
      };
    }
    i++;
  }

  return { classified, failures };
};
Conclusion
The algorithm above is rather complex. Unfortunately, there are many steps in the algorithm that can go
wrong or where false assumptions can be made. Furthermore, making WHOIS lookups and crawling websites for
thousands of IP addresses and organization names is rather slow, and getting blocked is an issue.
Nevertheless, the discovery speed of the hosting detection algorithm doesn't need to be high.
Furthermore, with every discovered hosting provider, the algorithm becomes more efficient, since there
are fewer candidate IP addresses left.
To conclude, thousands of new hosting providers could be detected by applying the hosting detection
algorithm to the candidate IP address list. This gives ipapi.is one of the best hosting detection APIs that exist!